Use lists of lists for elements
#7
Comments
Yep, we were thinking of a list of lists for the yaml/json representation of an array. We wrote out an example, but it got lost in another repo... We should add that back. The list of lists representation should be the default, since it is the most language/library/backend-agnostic representation even if terribly inefficient.
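To illustrate why the list of lists form is backend-agnostic, here is a minimal Python sketch using only the standard library. The helper name `infer_shape` and the sample values are hypothetical and not part of linkml-arrays. It shows that a nested list survives a JSON round trip and that the array shape can be recovered from the nesting alone.

```python
import json


def infer_shape(nested):
    """Return the shape of a rectangular nested list; raise ValueError if ragged."""
    if not isinstance(nested, list):
        return ()  # a scalar leaf has an empty shape
    if not nested:
        return (0,)
    child_shapes = {infer_shape(item) for item in nested}
    if len(child_shapes) != 1:
        raise ValueError("ragged nested list: children have different shapes")
    return (len(nested),) + child_shapes.pop()


# A 2 x 3 array written as a list of lists (hypothetical values).
temperatures = [[280.5, 281.0, 279.9], [282.1, 283.4, 284.0]]

# The nested form needs no extra shape or ordering metadata to serialize.
round_tripped = json.loads(json.dumps(temperatures))
assert round_tripped == temperatures
print(infer_shape(round_tripped))  # (2, 3)
```

The cost is the inefficiency mentioned above: every element becomes a separate parsed value, so this form suits interchange rather than large in-memory arrays.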
I think the time has come to have better first-class support for LoLs (and LoLoLs...). I was initially very hesitant to make major changes to the core LinkML instance datamodel here: we have a tradition of isomorphism with RDF, SQL, ..., and LoLs or NDArrays aren't native to these formalisms.

Some history here: The current version of linkml-arrays is layered on top of the LinkML instance model, hence the need for mapping between native instances (flat lists) and the NDArray implementation class. That mapping necessitates "implementation details" like specifying array ordering, which both you and Oliver have rightly criticised as mixing implementation and modeling concerns (#6).

But this has always been a short-term measure! We have succeeded in the first phase of bootstrapping ourselves using implementation classes; once we have first-class support we can jettison some of the machinery we originally needed. If we add a metaslot (array_shape in the example below):

    classes:
      TemperatureDataset:
        attributes:
          temperatures_in_K:
            range: float
            multivalued: true  ## now implicit due to "array_shape" below
            required: true
            unit:
              ucum_code: K
            array_shape:
              exact_dimensions: 3
              # optional additional metadata here

This has a direct jsonschema mapping that is natural, with a natural LoLoL json/yaml serialization. Things get a bit more awkward when serializing to RDF; we can discuss this at the hackathon. And of course we will keep the existing pydanticgen hooks so you can serialize the array "offline". The RGB example in #5 can be compacted to a single attribute:
Which when mapped to numpy and nptyping:

    rgb_matrix: Optional[NDArray[Shape["* x, * y, 3 rgb"], Float]] = Field(None)

We of course still want to support xarray-style DataArray containers as per #4. But this now becomes more optional: if you want a direct array representation without metadata about the axes you can do that more concisely. So the TemperatureDataset DataArray container might look like:

    TemperatureDataset:
      tree_root: true
      attributes:
        name:
          identifier: true
          range: string
        latitude_in_deg:
          required: true
          range: float
          multivalued: true
          unit:
            ucum_code: deg
          array_shape:
            exact_dimensions: 1
        longitude_in_deg:
          required: true
          range: float
          multivalued: true
          unit:
            ucum_code: deg
          array_shape:
            exact_dimensions: 1
        time_in_d:
          range: float
          multivalued: true
          implements:
            - linkml:elements
          required: true
          unit:
            ucum_code: d
          array_shape:
            exact_dimensions: 1
        temperatures_in_K:
          range: float
          multivalued: true
          required: true
          unit:
            ucum_code: K
          array_shape:
            exact_dimensions: 3
            array_axes:
              x:
                rank: 0
                alias: latitude_in_deg
              y:
                rank: 1
                alias: longitude_in_deg
              t:
                rank: 2
                alias: time_in_d

Here there is a bit of redundancy between specifying shapes in the NDArray vs the container DataArray, but I think the flexibility is useful, and it could easily be checked for inconsistencies at schema compile time. We can also keep supporting standalone NDArray classes for things like the LatitudeSeries, NWB-style.
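As a rough illustration of the compile-time consistency check mentioned above, here is a minimal Python sketch. It is not an existing LinkML API: the function `check_array_axes`, the helper `coordinate`, and the plain-dict schema layout are hypothetical, and the sketch assumes that `array_axes` is nested under `array_shape`. It checks that the axis ranks cover exactly 0 to exact_dimensions - 1 and that every alias points to a 1-dimensional attribute of the same class.

```python
def check_array_axes(class_name, attributes):
    """Return a list of inconsistency messages for one class (empty if none)."""
    problems = []
    for attr_name, attr in attributes.items():
        shape = attr.get("array_shape") or {}
        axes = shape.get("array_axes") or {}
        if not axes:
            continue  # plain arrays without axis metadata need no cross-check
        ndim = shape.get("exact_dimensions")
        where = f"{class_name}.{attr_name}"
        ranks = sorted(axis["rank"] for axis in axes.values())
        if ndim is None or ranks != list(range(ndim)):
            problems.append(
                f"{where}: axis ranks {ranks} do not match exact_dimensions={ndim}"
            )
        for axis_name, axis in axes.items():
            target = attributes.get(axis["alias"])
            if target is None:
                problems.append(
                    f"{where}: axis {axis_name!r} aliases unknown attribute {axis['alias']!r}"
                )
            elif (target.get("array_shape") or {}).get("exact_dimensions") != 1:
                problems.append(
                    f"{where}: axis {axis_name!r} must alias a 1-dimensional attribute"
                )
    return problems


def coordinate():
    """Hypothetical helper: a 1-dimensional float coordinate attribute."""
    return {"range": "float", "array_shape": {"exact_dimensions": 1}}


# Assumed dict form of the TemperatureDataset attributes (units omitted).
attributes = {
    "latitude_in_deg": coordinate(),
    "longitude_in_deg": coordinate(),
    "time_in_d": coordinate(),
    "temperatures_in_K": {
        "range": "float",
        "array_shape": {
            "exact_dimensions": 3,
            "array_axes": {
                "x": {"rank": 0, "alias": "latitude_in_deg"},
                "y": {"rank": 1, "alias": "longitude_in_deg"},
                "t": {"rank": 2, "alias": "time_in_d"},
            },
        },
    },
}

print(check_array_axes("TemperatureDataset", attributes))  # []
```

Returning a list of messages rather than raising on the first problem lets a schema compiler report every inconsistency in one pass.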
I love this!!!! that is close to exactly how i would specify it if i had the choice :). Understood re: temporary measures! would love to help out with the specification and code generation here! I'm packing right now to come to the hackathon, but a brief note - for the sake of giving ourselves futureproofing room, adding as few terms to the base model as possible, and also modularizing array-logic, two alternative spitballing syntaxes:
anyway here's this implemented: https://numpydantic.readthedocs.io/en/slotarray/api/linkml/slotarray.html
Similar to: #6
Currently it seems like the elements slot is intended to store a flat list of values that are given shape by the indices, and that might bleed into the other parts of the implementation. It would be really nice to be able to use the domain-specific representation of arrays in different formats though! One way of doing this would be to make the in-memory storage totally generic w.r.t. the generator format, which I like, but since the format is YAML there is some degree of "JSON-like primacy" where the "natural" representation of data has to fit in JSON, which makes sense even if it's worth being careful about not specifying too much about serialization in the schema.

JSON-Schema actually can represent arrays natively as lists of lists! So here for a dense array for an RGB image:

So i'm not sure what the current status is for in-memory JSON-like arrays, but it would be nice to use lists of lists (as has been discussed in a bunch of places over on the main linkml repo :)) instead of a flat list with indices, at least for dense arrays. Another reason why we might want to make NDArray and SparseArray separable classes too, since the flat model with adjoining indices is a perfectly good model for sparse arrays.
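To make the contrast between the two models concrete, here is a minimal Python sketch. All function names are hypothetical and none of this is linkml-arrays code. It builds a JSON Schema for an n-deep list of lists, reshapes a flat elements list into nested lists (assuming row-major ordering, which is exactly the kind of detail the flat model forces you to state), and converts a dense nested list to index/value pairs of the sort a sparse array class might hold.

```python
from math import prod


def nested_array_schema(item_type, ndim):
    """Build a JSON Schema dict for an ndim-deep list of lists of item_type."""
    schema = {"type": item_type}
    for _ in range(ndim):
        schema = {"type": "array", "items": schema}
    return schema


def unflatten(elements, shape):
    """Reshape a flat, row-major (C-order) list into nested lists."""
    if not shape:
        raise ValueError("shape must have at least one dimension")
    if prod(shape) != len(elements):
        raise ValueError("number of elements does not match shape")
    if len(shape) == 1:
        return list(elements)
    step = prod(shape[1:])  # number of elements in each sub-array
    return [
        unflatten(elements[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


def to_sparse(nested, index=()):
    """Yield (index, value) pairs for the non-zero entries of a nested list."""
    if not isinstance(nested, list):
        if nested != 0:
            yield index, nested
        return
    for i, item in enumerate(nested):
        yield from to_sparse(item, index + (i,))


print(nested_array_schema("number", 2))
# {'type': 'array', 'items': {'type': 'array', 'items': {'type': 'number'}}}

dense = unflatten([0, 0, 5, 0, 7, 0], (2, 3))
print(dense)                   # [[0, 0, 5], [0, 7, 0]]
print(list(to_sparse(dense)))  # [((0, 2), 5), ((1, 1), 7)]
```

The dense form carries its shape in the nesting itself, while the sparse form only makes sense alongside explicit indices, which is one way to read the argument for keeping the two classes separate.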