
axis vs. index, sparse vs. dense array semantics and syntax? #4

Open
sneakers-the-rat opened this issue Feb 3, 2024 · 2 comments

@sneakers-the-rat

catching up with what y'all have been doing over here - is this the right place to talk about arrays? i also see it's already in the metamodel here: https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/array.yaml so move this if this is the wrong spot!

Following the example here: https://github.com/linkml/linkml-arrays/blob/main/tests/input/temperature_dataset.yaml

it looks like an array consists of:

  • A top-level DataArray model that contains...
  • A set of linkml:axis attributes that specify class ranges for an axis index
    • The class ranges of linkml:axis attributes implement linkml:NDArray, and have a values attribute that...
      • implement linkml:elements and declare the range and unit for the axis
  • A linkml:array attribute that specifies the actual data of the array as a class range
    • The linkml:array class is both a linkml:NDArray and a linkml:RowOrderedArray that has its dimensionality specified as an annotation and...
      • has a values attribute that declares the range and unit for the array

I have a few questions about the semantics of the axis specification:

  • Are axis indices required for all arrays? It seems like the DataArray and NDArray classes are somewhat distinct, but I can't tell if an NDArray is intended to always be a part of a DataArray. If it wasn't, then one would presumably specify NWB-style array constraints on size and number of dimensions on an NDArray, but then the division of labor between axis and NDArray becomes unclear - some parts of the dimensions are specified by the axis classes, and others are specified by the NDArray
  • Is it possible to have open-ended dimension specification? Or, another way of putting that, is it possible to annotate only a subset of the axes? I am thinking of the NWB TimeSeries class, which ideally would specify "a first dimension that is always time, but then n other dimensions that are values over time", but the limitations in the schema language make it only possible to express 4-D timeseries arrays. The NDArray dimension specification seems like it would be able to support that by accepting something like 2.. to say "at least two dimensions" or ..4 for "up to 4 dimensions" and so on, but that sacrifices the ability to annotate some of the dimensions (ie. in the TimeSeries example, we want to say "axis 1 is a time in seconds")
  • What is the desired API for interacting with array and axis data? currently the generated model looks like:
class TemperatureDataset(ConfiguredBaseModel):
    
    name: str = Field(...)
    latitude_in_deg: LatitudeSeries = Field(...)
    longitude_in_deg: LongitudeSeries = Field(...)
    time_in_d: DaySeries = Field(...)
    temperatures_in_K: TemperatureMatrix = Field(...)

and without the adjoining schema one would have a hard time knowing that 3/5 of those fields are an index onto temperatures_in_K. It would also make metaprogramming hard - say, being able to do data[x,y,z] to make use of the indices to select elements in the array.

It seems like the axis attributes are behaving like axis indices, and that might help clarify the meaning of generated model attributes and simplify the syntax a bit. It also seems like this specification is mostly centered on sparse arrays, so DataArray might also benefit from being clarified as SparseArray that requires indices, and NDArray is another top-level class alongside it for compact arrays, but that might be another issue?

If they are indices, we can make their definition a little more concise by taking advantage of knowing they will be 1D series, so maybe one example that tries to stick close to the existing structure looks like this:

classes:
  TemperatureDataset:
    implements:
      - linkml:DataArray
    attributes:
      index:
        range: TemperatureDatasetIndex
      value:
        implements:
          - linkml:NDArray
        range: float
        unit:
          ucum_code: K

  TemperatureDatasetIndex:
    implements:
      - linkml:indices
    slots:
      - latitude
      - longitude
      - time
    annotations:
      strict_index: true


slots:
  latitude:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: deg

  longitude:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: deg

  time:
    implements:
      - linkml:index
    range: float
    unit:
      ucum_code: d

where strict_index indicates that only those indices are allowed (rather than allowing additional dimensions to be present) and all are required, linkml:indices indicates a collection of indices for an array, and the rest is standard.

So with that we might generate models that look like this (using nptyping syntax for array constraints):

class TemperatureDataset(ArrayModel):
    index: TemperatureDatasetIndex = Field(...)
    value: NDArray[
        Shape["* latitude, * longitude, * time"],
        Float
    ] = Field(..., linkml_meta={'unit': {'ucum_code': 'K'}})

class TemperatureDatasetIndex(IndicesModel):
    # also would put the units in the Fields here but abbreviated for example
    latitude: list[float] = Field(...) 
    longitude: list[float] = Field(...)
    time: list[float] = Field(...)

which gives us clear convention for being able to build a metamodel ArrayModel that could declare a __getitem__ method for accessing items in the array using items in the indices.

One could also omit the indices construction and infer that all the implements: index attributes on a class are that array's index, since they're unlikely to be reused in a meaningful way as far as I can tell.

anyway just some ideas!

@rly
Collaborator

rly commented Feb 3, 2024

is this the right place to talk about arrays?

For now, I'd say let's keep the discussions here for organization

This hidden NWB-like example may also be useful: https://github.com/linkml/linkml-model/blob/main/tests/input/examples/schema_definition-array-2.yaml

In particular, ElectricalDataArray corresponds loosely to the NWB ElectricalSeries.

The overall model is based off of the xarray.DataArray class. We should be more clear about that. linkml axis = xarray coordinate. (we can change that to use the same terms...)

  • Are axis indices required for all arrays? It seems like the DataArray and NDArray classes are somewhat distinct, but I can't tell if an NDArray is intended to always be a part of a DataArray.

The DataArray class is designed to store an NDArray (slot "array") and the labels of at least one of the axes of that array (slot "axis"). At least one axis is required. Otherwise you should not use this class and just use an NDArray.

An NDArray can exist independently of a DataArray, but in general, if you have the labels for the axes of the NDArray, it would be good to associate them with the NDArray.

If it wasn't, then one would presumably specify NWB-style array constraints on size and number of dimensions on an NDArray, but then the division of labor between axis and NDArray becomes unclear - some parts of the dimensions are specified by the axis classes, and others are specified by the NDArray

The division of labor isn't totally clean between these interlinked pieces. One way to think about it is that you start with a bunch of arrays that represent your data (e.g., voltage recordings) and your axis labels (aka xarray coordinates, e.g., timestamps). The arrays have properties like number of dimensions. The DataArray bundles the NDArray with its axis labels and allows you to specify which axis label corresponds to which axis/dimension of the NDArray. It is possible to have inconsistencies between these and we would need to check for those.
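The bundling and consistency check described here can be sketched in plain numpy - a minimal illustration, not the generated-model API; `check_axes` and the example arrays are hypothetical:

```python
import numpy as np

# Hypothetical sketch: a DataArray-style bundle is an n-d array plus a
# mapping from axis position to its label series. Checking consistency
# means each label series must match the length of the axis it labels.
def check_axes(array, axes):
    for dim, labels in axes.items():
        if len(labels) != array.shape[dim]:
            raise ValueError(
                f"axis {dim} has {len(labels)} labels "
                f"but the array has size {array.shape[dim]}"
            )

voltages = np.zeros((3, 2))                  # e.g. time x channel
axes = {0: np.array([0.0, 0.1, 0.2]),        # timestamps label axis 0
        1: np.array([10, 11])}               # channel ids label axis 1
check_axes(voltages, axes)                   # consistent: passes silently
```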

  • Is it possible to have open-ended dimension specification? Or, another way of putting that, is it possible to annotate only a subset of the axes? I am thinking of the NWB TimeSeries class, which ideally would specify "a first dimension that is always time, but then n other dimensions that are values over time", but the limitations in the schema language make it only possible to express 4-D timeseries arrays. The NDArray dimension specification seems like it would be able to support that by accepting something like 2.. to say "at least two dimensions" or ..4 for "up to 4 dimensions" and so on, but that sacrifices the ability to annotate some of the dimensions (ie. in the TimeSeries example, we want to say "axis 1 is a time in seconds")

We haven't worked out this example yet, so let's give it a try, building off of the NWB-like schema above. It could look like:

  GenericTimeDataArray:
    implements:
      - linkml:DataArray
    attributes:
      time:
        range: TimestampSeries
        required: true
        implements:
          - linkml:axis
        inlined: true
        annotations:
          axis_index: 0
      values:
        range: NDArray  # can be any number of dimensions
        required: true
        inlined: true
        implements:
          - linkml:array

I think that would work. The user cannot add a non-time axis label without creating a new class though. This is also not supported in NWB aside from creating a new neurodata type.

  • What is the desired API for interacting with array and axis data? currently the generated model looks like:
class TemperatureDataset(ConfiguredBaseModel):
    
    name: str = Field(...)
    latitude_in_deg: LatitudeSeries = Field(...)
    longitude_in_deg: LongitudeSeries = Field(...)
    time_in_d: DaySeries = Field(...)
    temperatures_in_K: TemperatureMatrix = Field(...)

and without the adjoining schema one would have a hard time knowing that 3/5 of those fields are an index onto temperatures_in_K. It would also make metaprogramming hard - say, being able to do data[x,y,z] to make use of the indices to select elements in the array.

temperatures_in_K can be indexed normally. If you want to get all the temperatures at the latitude closest to 38 and all longitudes and all times, then you would have to do

latitude_diff = np.absolute(latitude_in_deg-38)
latitude_index = latitude_diff.argmin()  # or use np.searchsorted if latitude_in_deg is sorted
print(temperatures_in_K[latitude_index,:,:])
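For concreteness, the same lookup with made-up data (the arrays below are assumptions for the example):

```python
import numpy as np

# Made-up example data: 3 latitudes, and a lat x lon x time cube.
latitude_in_deg = np.array([37.1, 37.6, 38.1])
temperatures_in_K = np.arange(27.0).reshape(3, 3, 3)

latitude_diff = np.absolute(latitude_in_deg - 38)
latitude_index = latitude_diff.argmin()               # 2: 38.1 is nearest to 38
print(temperatures_in_K[latitude_index, :, :].shape)  # (3, 3)

# If latitude_in_deg is sorted, np.searchsorted finds an insertion point
# in O(log n), though it needs a neighbor comparison to pick the nearest.
pos = np.searchsorted(latitude_in_deg, 38)
```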

Critically, you would have to remember that the 0th axis of temperatures_in_K corresponds to latitude_in_deg. This is not handled in the generated pydantic model for TemperatureDataset, but the plan is to use the axis indices in the schema to link latitude_in_deg to the 0th dimension. We could make a nice __getitem__ and we can also use these to create an xarray.DataArray object. Then we can do something like

temps = TemperatureDataset(...)
temps_xr = temps.as_xarray()
temps_xr.coords  # returns:
# Coordinates:
#  * latitude_in_deg     (latitude_in_deg) float64 37.1 37.6 38.1
#  * longitude_in_deg    (longitude_in_deg) float64 37.1 37.8 38.5
#  * time_in_d    (time_in_d) float64 1.0 2.0 3.0
temps_xr.sel(latitude_in_deg=38, method="nearest")  # returns xarray.DataArray
temps_xr.where((temps_xr.time_in_d >= start_time) & (temps_xr.time_in_d < end_time))  # returns temps_xr, but values not between start time and end time are nan

Hopefully this will provide for a nicer API experience.

seems like this specification is mostly centered on sparse arrays, so DataArray might also benefit from being clarified as SparseArray that requires indices, and NDArray is another top-level class alongside it for compact arrays, but that might be another issue?

These are actually only dense arrays. We don't have support for sparse arrays yet. Maybe we are not on the same page about what the indices / axis labels / coordinates represent?

 TemperatureDatasetIndex:
   implements:
     - linkml:indices
   slots:
     - latitude
     - longitude
     - time
   annotations:
     strict_index: true
class TemperatureDataset(ArrayModel):
    index: TemperatureDatasetIndex = Field(...)
    value: NDArray[
        Shape["* latitude, * longitude, * time"],
        Float
    ] = Field(..., linkml_meta={'unit': {'ucum_code': 'K'}})

class TemperatureDatasetIndex(IndicesModel):
    # also would put the units in the Fields here but abbreviated for example
    latitude: list[float] = Field(...) 
    longitude: list[float] = Field(...)
    time: list[float] = Field(...)

which gives us clear convention for being able to build a metamodel ArrayModel that could declare a __getitem__ method for accessing items in the array using items in the indices.

This model looks good, I think. However, what is nice about separating the axes is that you can reuse the axes for other data arrays. In NWB, we can use the same timestamps dataset for multiple TimeSeries to 1) show their alignment and 2) use less disk space. I think that would be hard using this Index structure.
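The reuse concern can be sketched at the instance level - assuming plain Python objects (the dataclass names here are illustrative stand-ins, not generated classes), sharing a single axis object between two arrays is just aliasing:

```python
from dataclasses import dataclass

# Illustrative stand-ins, not the real linkml-generated output.
@dataclass
class TimestampSeries:
    values: list

@dataclass
class SeriesWithTime:
    time: TimestampSeries   # axis labels, possibly shared
    values: list

timestamps = TimestampSeries(values=[0.0, 0.1, 0.2])
a = SeriesWithTime(time=timestamps, values=[1.0, 2.0, 3.0])
b = SeriesWithTime(time=timestamps, values=[4.0, 5.0, 6.0])

# Both series alias one axis object: alignment is explicit, and at the
# format level the timestamps could be stored once.
print(a.time is b.time)  # True
```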

related work:

@sneakers-the-rat
Author

This hidden NWB-like example may also be useful: https://github.com/linkml/linkml-model/blob/main/tests/input/examples/schema_definition-array-2.yaml

Yes this is super useful! i missed that the first time through.

The overall model is based off of the xarray.DataArray class. We should be more clear about that. linkml axis = xarray coordinate. (we can change that to use the same terms...)

Also very useful to have a reference, I was trying to look around for what this might be quoting but i'll take a look at that before commenting further on DataArray :)

At least one axis is required. Otherwise you should not use this class and just use an NDArray.

OK good, this answers my question!

The DataArray bundles the NDArray with its axis labels and allows you to specify which axis label corresponds to which axis/dimension of the NDArray. It is possible to have inconsistencies between these and we would need to check for those.

I think this conversation continued over on another issue, which has some other examples of groupings for axis and values #7 - i still need to list out all the constraints here for myself and take a look at the xarray docs, but imo it would be nice to consolidate that :)

We don't have support for sparse arrays yet.

Yes i think i am still not on the same page, because to me having all the indices and the array data on the same level and the array data being a flat list (though i was wrong in my reading of that comment) seems like the coo_array format, but i think where i was getting things wrong was thinking that those lists would be the same length. I guess it's worth suggesting that you already do have sparse arrays implemented if the indices are the same length as the array, then :).
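The distinction is really about array lengths - a numpy-only sketch with made-up values: in the dense (xarray-style) reading each axis label array matches its axis's size, while in COO sparse form every index array matches the number of stored values:

```python
import numpy as np

# Dense / xarray-style: label arrays match the axis sizes.
temps = np.arange(12.0).reshape(3, 4)
latitude = np.array([37.1, 37.6, 38.1])        # len 3 == temps.shape[0]
time = np.array([1.0, 2.0, 3.0, 4.0])          # len 4 == temps.shape[1]
assert (len(latitude), len(time)) == temps.shape

# COO sparse: rows, cols, and vals all share the same length - one
# (row, col, value) triple per stored element.
rows = np.array([0, 1, 2])
cols = np.array([0, 2, 3])
vals = np.array([10.0, 20.0, 30.0])
dense = np.zeros((3, 4))
dense[rows, cols] = vals
print(dense[1, 2])  # 20.0
```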

the plan is to use the axis indices in the schema to link latitude_in_deg to the 0th dimension. We could make a nice getitem

This feels like a pandas index vs. iloc thing to me, and that's where my intuition is coming from here, so it could be wrong! So that might look something like (sorry for the pseudocode, but hopefully it illustrates the gist)

class ArrayMixin:

    def __getitem__(self, val):
        # pseudocode!!!! - find which field holds the array data
        for field in self.fields:
            if field.array:
                array = field

        # handle the single-value case
        if isinstance(val, int):
            # find the field marked as the first axis
            first_ax = ...  # do something to locate axis=0
            idx = np.where(first_ax == val)
            return array[idx]

        elif isinstance(val, slice):
            # map axis values at the slice endpoints to positions
            low_idx = np.where(first_ax == val.start)
            high_idx = np.where(first_ax == val.stop)
            # and so on..
            return array[low_idx:high_idx]

        elif isinstance(val, tuple):
            # multiple indices passed, one per axis....
            ...

class TemperatureDataset(ConfiguredBaseModel, ArrayMixin):
    
    name: str = Field(...)
    latitude_in_deg: LatitudeSeries = Field(..., axis=0)
    longitude_in_deg: LongitudeSeries = Field(..., axis=1)
    time_in_d: DaySeries = Field(..., axis=2)
    temperatures_in_K: TemperatureMatrix = Field(..., array=True)

So then one would do

>>> data = TemperatureDataset(...)
>>> data[15:20] # to select between 15 and 20 degrees latitude
>>> data.temperatures_in_K[15:20] # to select between the 15th and 20th element

which just seems a bit messy to me - one would need to add a fair bit of logic in the __getitem__ method to determine which index is which, validate the slice, etc. You probably wouldn't want to do that on init because it can get messy overriding init and postinit when you're potentially making diamond or tree MROs - it could be a source of some befuddlement if someone were to further inherit from these models,

unless...

we can also use these to create an xarray.DataArray object

which i suppose is up to the linkml ppl if they want to add dependencies to the generated pydantic models - it's sort of nice having the only dependency be pydantic, but i think one way or another we'll lose that with array models. it's another point to be wary of letting the encoding/serialization leak into the schema description, but it seems fine in this case.

However, what is nice about separating the axes is that you can reuse the axes for other data arrays. In NWB, we can use the same timestamps dataset for multiple TimeSeries to 1) show their alignment and 2) use less disk space. I think that would be hard using this Index structure.

This is fair, and that is a nice quality about NWB. to me that seems like more of a format-specific question rather than something at the schema level - for example that would be pretty trivial to do in a linked data context:

@prefix : <localhost#> .
@prefix linkml: <linkml.com/> .

<#latitude> a :latitudeType ;
  :values (0 1 2 3).

<#myDataset>
  a linkml:DataArray ;
  :latitudeType <#latitude> .

where you would then just SPARQL for #myDataset's :latitudeType object, but that's comparatively harder in object-oriented python mode where you would literally be sharing the latitude index object in order to have a shared index across different instances, which would be a real pain to keep track of on instantiation/load. It seems like one probably wouldn't want to indicate shared indices at the schema level, because that sort of implies that we create a new class for every object?

so it's worth thinking about where we would want that to live, but i think by the time we are dealing with instantiated python models (that are generic, ie. don't have any format-specific logic that NWB might want to do to link indices between instances) we are probably working on copies of the index. Then when repacking the data from the instantiated object is when you would do the deduplicating. I think this deserves its own issue ;).
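The "deduplicate on repacking" idea might look like this - a hypothetical sketch keyed on object identity; `dedup_axes` and the dict layout are assumptions, not any existing API:

```python
# Hypothetical: walk the instantiated objects at serialization time, store
# each distinct axis object once, and replace repeats with a reference key.
def dedup_axes(datasets):
    store = {}    # id(axis object) -> storage key
    packed = []
    for ds in datasets:
        key = store.setdefault(id(ds["time"]), f"axes/{len(store)}")
        packed.append({"values": ds["values"], "time_ref": key})
    return store, packed

t = [0.0, 0.1, 0.2]
a = {"time": t, "values": [1, 2, 3]}
b = {"time": t, "values": [4, 5, 6]}
store, packed = dedup_axes([a, b])
print(len(store))  # 1 - the shared timestamps are stored once
```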

But in any case, it would still be totally possible to reuse indices in the same way, because longitude, latitude, etc. are still independent slots, they're just bundled together within that class, and putting them behind an index object is more a "keeping alike things alike" detail than anything, so

obj1 = MyDatatype()
obj2 = MyDatatype()

obj1.longitude = obj2.longitude

seems the same as

obj1.index.longitude = obj2.index.longitude

to me, but i could be missing what you're thinking would be hard here (very likely!)

This issue is getting a bit sprawly and overlapping with other ones, so we might want to split it apart and narrow this one down just to some subset of the semantics of indices - probably the difference between specifying axes/indices in NDArray vs. specifying them in DataArray
