axis vs. index, sparse vs. dense array semantics and syntax? #4
For now, I'd say let's keep the discussions here for organization. This hidden NWB-like example may also be useful: https://github.com/linkml/linkml-model/blob/main/tests/input/examples/schema_definition-array-2.yaml

The division of labor isn't totally clean between these interlinked pieces. One way to think about it is that you start with a bunch of arrays that represent your data (e.g., voltage recordings) and your axis labels (aka xarray coordinates, e.g., timestamps). The arrays have properties like number of dimensions.
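One way to picture that split, as a loose sketch in xarray terms (the names here are just illustrative):

```python
import numpy as np
import xarray as xr

# the data: a 2-D array of voltage recordings (time x channel)
voltages = np.random.rand(4, 3)
# the axis labels for axis 0, i.e. an xarray coordinate
timestamps = np.array([0.0, 0.1, 0.2, 0.3])

# the labels attach to the data array as coordinates on named dimensions
da = xr.DataArray(voltages, dims=("time", "channel"),
                  coords={"time": timestamps})
```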
We haven't worked out this example yet, so let's give it a try, building off of the NWB-like schema above. It could look like:

```yaml
GenericTimeDataArray:
  implements:
    - linkml:DataArray
  attributes:
    time:
      range: TimestampSeries
      required: true
      implements:
        - linkml:axis
      inlined: true
      annotations:
        axis_index: 0
    values:
      range: NDArray  # can be any number of dimensions
      required: true
      inlined: true
      implements:
        - linkml:array
```

I think that would work. The user cannot add a non-time axis label without creating a new class, though. This is also not supported in NWB aside from creating a new neurodata type.
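For concreteness, a generated model for this class might look something like the following - purely a sketch, since what the pydantic generator would emit for arrays isn't settled, and the types here are illustrative:

```python
import numpy as np
from pydantic import BaseModel, ConfigDict

class GenericTimeDataArray(BaseModel):
    # allow raw numpy arrays as field values
    model_config = ConfigDict(arbitrary_types_allowed=True)

    time: np.ndarray    # axis 0 labels (the TimestampSeries values)
    values: np.ndarray  # the data; any number of dimensions
```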
```python
latitude_diff = np.absolute(latitude_in_deg - 38)
latitude_index = latitude_diff.argmin()  # or use np.searchsorted if latitude_in_deg is sorted
print(temperatures_in_K[latitude_index, :, :])
```

Critically, you would have to remember that the 0th axis of `temperatures_in_K` is latitude.

```python
temps = TemperatureDataset(...)
temps_xr = temps.as_xarray()
temps_xr.coords  # returns:
# Coordinates:
#   * latitude_in_deg   (latitude_in_deg) float64 37.1 37.6 38.1
#   * longitude_in_deg  (longitude_in_deg) float64 37.1 37.8 38.5
#   * time_in_d         (time_in_d) float64 1.0 2.0 3.0
temps_xr.sel(latitude_in_deg=38, method="nearest")  # returns xarray.DataArray
temps.where((temps.time >= start_time) & (temps.time < end_time))  # returns temps, but values not between start_time and end_time are nan
```

Hopefully this will provide for a nicer API experience.
These are actually only dense arrays. We don't have support for sparse arrays yet. Maybe we are not on the same page about what the indices / axis labels / coordinates represent?
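To make the distinction concrete, a sketch of the two meanings "indices" can take (variable names echo the temperature example):

```python
import numpy as np

# dense: every cell holds a value; the axis labels are separate 1-D
# arrays, one per dimension, mapping positions to coordinate values
temperatures_in_K = np.full((3, 3, 3), 270.0)   # lat x lon x time
latitude_in_deg = np.array([37.1, 37.6, 38.1])  # labels for axis 0

# sparse (COO-style): only occupied cells are stored, so "indices"
# instead mean per-cell positions stored alongside per-cell values
sparse_coords = np.array([[0, 2],    # latitude positions
                          [1, 0],    # longitude positions
                          [2, 1]])   # time positions; shape (ndim, nnz)
sparse_values = np.array([271.5, 268.2])  # one value per stored cell
```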
This model looks good, I think. However, what is nice about separating the axes is that you can reuse the axes for other data arrays. In NWB, we can use the same timestamps dataset for multiple TimeSeries to 1) show their alignment and 2) use less disk space. I think that would be hard using this Index structure.
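At the in-memory level, that reuse could look like this (a sketch, with plain dicts standing in for generated classes):

```python
import numpy as np

# one timestamps array shared by two series, showing their alignment;
# a format like NWB/HDF5 can then store it once and link to it
timestamps = np.arange(0.0, 1.0, 0.1)
series_a = {"timestamps": timestamps, "data": np.random.rand(10)}
series_b = {"timestamps": timestamps, "data": np.random.rand(10)}
assert series_a["timestamps"] is series_b["timestamps"]  # same object
```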
Yes, this is super useful! i missed that the first time through.
Also very useful to have a reference - I was trying to look around for what this might be quoting, but i'll take a look at that before commenting further.
OK good, this answers my question!
I think this conversation continued over on another issue (#7), which has some other examples of groupings for axis and values - i still need to list out all the constraints here for myself and take a look at the xarray docs, but imo it would be nice to consolidate that :)
Yes, i think i am still not on the same page, because to me having all the indices and the array data on the same level, with the array data being a flat list (though i was wrong in my reading of that comment), seems like the structure of a sparse array rather than a dense one.
This feels like a job for a mixin, something like:

```python
import numpy as np

class ArrayMixin:
    """pseudocode!!!! - a sketch of label-based indexing for generated models"""

    def __getitem__(self, val):
        # find which of our fields holds the array data, and which are axes
        array = None
        axes = []
        for field in self.fields:  # assumes field metadata like axis/array
            if field.array:
                array = field
            elif field.axis is not None:
                axes.append(field)
        first_ax = axes[0]  # the field annotated with axis == 0
        # single label: map a value on the first axis to its position
        if isinstance(val, int):
            idx = np.flatnonzero(first_ax == val)[0]
            return array[idx]
        # slice of labels: map start/stop values to positional bounds
        elif isinstance(val, slice):
            low_idx = np.flatnonzero(first_ax == val.start)[0]
            high_idx = np.flatnonzero(first_ax == val.stop)[0]
            return array[low_idx:high_idx]
        # tuple: multiple labels passed, one per axis...
        elif isinstance(val, tuple):
            ...
```
```python
class TemperatureDataset(ConfiguredBaseModel, ArrayMixin):
    name: str = Field(...)
    latitude_in_deg: LatitudeSeries = Field(..., axis=0)
    longitude_in_deg: LongitudeSeries = Field(..., axis=1)
    time_in_d: DaySeries = Field(..., axis=2)
    temperatures_in_K: TemperatureMatrix = Field(..., array=True)
```

So then one would do:

```python
>>> data = TemperatureDataset(...)
>>> data[15:20]  # to select between 15 and 20 degrees latitude
>>> data.temperatures_in_K[15:20]  # to select between the 15th and 20th element
```

which just seems a bit messy to me - one would need to add a fair bit of logic in the generated models... unless...
which i suppose is up to the linkml ppl if they want to add dependencies to the generated pydantic models - it's sort of nice having the only dependency be pydantic, but i think one way or another we'll lose that with array models. it's another point to be wary of letting the encoding/serialization leak into the schema description, but seems fine in this case.
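For instance, the indexing logic could be delegated to xarray rather than hand-rolled - a sketch only, assuming each generated series class keeps its data in a `values` attribute:

```python
import xarray as xr

def as_xarray(self) -> xr.DataArray:
    # hypothetical method attached to the generated TemperatureDataset:
    # the axis fields become coordinates, the array field becomes data
    return xr.DataArray(
        self.temperatures_in_K.values,
        dims=("latitude_in_deg", "longitude_in_deg", "time_in_d"),
        coords={
            "latitude_in_deg": self.latitude_in_deg.values,
            "longitude_in_deg": self.longitude_in_deg.values,
            "time_in_d": self.time_in_d.values,
        },
    )
```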
This is fair, and that is a nice quality about NWB. to me that seems like more of a format-specific question rather than something at the schema level - for example, that would be pretty trivial to do in a linked data context:

```turtle
@prefix : <localhost#> .
@prefix linkml: <linkml.com/> .

<#latitude> a :latitudeType ;
    :values (0 1 2 3) .

<#myDataset>
    a linkml:DataArray ;
    :latitudeType <#latitude> .
```

where you would then just SPARQL for datasets that point to the same axis node. so it's worth thinking about where we would want that to live, but i think by the time we are dealing with instantiated python models (that are generic, ie. don't have any format-specific logic that NWB might want to do to link indices between instances) we are probably working on copies of the index. Then, repacking the data from the instantiated object is when you would do the deduplicating. I think this deserves its own issue ;)
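A sketch of what that repack-time deduplication could look like (function name and object layout are assumed, not an existing API):

```python
import numpy as np

def collect_shared_indices(objs):
    """When repacking instantiated models to disk, find index arrays
    that are the same object so the writer can store them only once."""
    seen = {}
    for obj in objs:
        for name, value in vars(obj).items():
            if isinstance(value, np.ndarray):
                seen.setdefault(id(value), []).append((obj, name))
    # any id with more than one (obj, field) entry is a shared index
    return {k: v for k, v in seen.items() if len(v) > 1}
```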
But in any case, it would still be totally possible to reuse indices in the same way, because

```python
obj1 = MyDatatype()
obj2 = MyDatatype()
obj1.longitude = obj2.longitude
```

seems the same as `obj1.index.longitude = obj2.index.longitude` to me, but i could be missing what you're thinking would be hard here (very likely!)

This issue is getting a bit sprawly and overlapping with other ones, so we might want to split it apart and narrow this one down just to some subset of the semantics of indices - probably the difference between specifying axes/indices in NDArray vs. specifying them in DataArray
catching up with what y'all have been doing over here - is this the right place to talk about arrays? i also see it's already in the metamodel here: https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/array.yaml so move this if this is the wrong spot!
Following the example here: https://github.com/linkml/linkml-arrays/blob/main/tests/input/temperature_dataset.yaml

it looks like an array consists of:

- a `DataArray` model that contains...
  - `linkml:axis` attributes that specify class ranges for an axis index
    - the `linkml:axis` attributes implement `linkml:NDArray`, and have a `values` attribute that...
      - implements `linkml:elements` and declares the range and unit for the axis
  - a `linkml:array` attribute that specifies the actual data of the array as a class range
    - the `linkml:array` class is both a `linkml:NDArray` and a `linkml:RowOrderedArray` that has its dimensionality specified as an annotation and...
      - a `values` attribute that declares the range and unit for the array

I have a few questions about the semantics of the axis specification:
- The `DataArray` and `NDArray` classes are somewhat distinct, but I can't tell if an `NDArray` is intended to always be a part of a `DataArray`. If it wasn't, then one would presumably specify NWB-style array constraints on size and number of dimensions on an `NDArray`, but then the division of labor between `axis` and `NDArray` becomes unclear - some parts of the dimensions are specified by the `axis` classes, and others are specified by the `NDArray`.
- Consider NWB's `TimeSeries` class, which ideally would specify "a first dimension that is always time, but then n other dimensions that are values over time", but the limitations in the schema language make it only possible to express 4-D timeseries arrays. The `NDArray` dimension specification seems like it would be able to support that by accepting something like `2..` to say "at least two dimensions" or `..4` for "up to 4 dimensions" and so on, but that sacrifices the ability to annotate some of the dimensions (ie. in the `TimeSeries` example, we want to say "axis 1 is a time in seconds") - see the runtime sketch just after this list.
example, we want to say "axis 1 is a time in seconds")and without the adjoining schema one would have a hard time knowing that 3/5 of those fields are an index onto
temperatures_in_K
. It would also make the metaprogramming hard to be able to, say be able to dodata[x,y,z]
to make use of the indices to select elements in the array.It seems like the
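Pending schema-level support for dimension ranges, a generated model could at least check the "`2..`, with axis 0 always time" contract at runtime. A sketch with assumed names, not an existing generator feature:

```python
import numpy as np

def check_timeseries_shape(data: np.ndarray, timestamps: np.ndarray,
                           min_dims: int = 2, max_dims: int | None = None) -> None:
    # enforce "at least min_dims dimensions" (and optionally an upper bound)
    if data.ndim < min_dims:
        raise ValueError(f"expected at least {min_dims} dimensions, got {data.ndim}")
    if max_dims is not None and data.ndim > max_dims:
        raise ValueError(f"expected at most {max_dims} dimensions, got {data.ndim}")
    # axis 0 is always time: it must align with the timestamp labels
    if data.shape[0] != len(timestamps):
        raise ValueError("axis 0 must have one entry per timestamp")
```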
It seems like the `axis` attributes are behaving like axis indices, and that might help clarify the meaning of generated model attributes and simplify the syntax a bit. It also seems like this specification is mostly centered on sparse arrays, so `DataArray` might also benefit from being clarified as a `SparseArray` that requires indices, with `NDArray` as another top-level class alongside it for compact arrays - but that might be another issue?

If they are indices, we can make their definition a little more concise by taking advantage of knowing they will be 1D series. One example that sticks close to the existing structure would add a `strict_index` annotation indicating that only those indices are allowed (rather than allowing additional dimensions to be present) and that all are required, plus a `linkml:indices` slot indicating a collection of indices for an array, with the rest standard.

So with that we might generate models along these lines (using `nptyping` syntax for array constraints):
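A rough sketch - the class layout is assumed from the temperature example above, the exact `Shape` expressions are illustrative, and nothing here is actual generator output:

```python
from dataclasses import dataclass
from nptyping import NDArray, Shape, Float

@dataclass
class TemperatureDataset:
    name: str
    # 1-D index series, one per axis
    latitude_in_deg: NDArray[Shape["* lat"], Float]
    longitude_in_deg: NDArray[Shape["* lon"], Float]
    time_in_d: NDArray[Shape["* time"], Float]
    # the data array, whose shape is tied to the three indices
    temperatures_in_K: NDArray[Shape["* lat, * lon, * time"], Float]
```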
This gives us a clear convention for being able to build a metamodel `ArrayModel` that could declare a `__getitem__` method for accessing items in the array using items in the indices.

One could also omit the `indices` construction and infer that all the `implements: index` attributes on a class are that array's index, since they're unlikely to be reused in a meaningful way as far as I can tell.

anyway just some ideas!