Division between array serialization and specification #6

Open
sneakers-the-rat opened this issue Feb 3, 2024 · 0 comments

I've said this a few times when we've talked on Zoom during the hackathons, so I don't mean to be a broken record, but one place where a lot of prior schema languages have gone wrong with array specification is in taking on too much of the weight of specifying the actual encoding of the arrays, rather than being a schematic description that is generic across serializations.

The generality of the current form is pretty good! One way I see us buying more complexity than we need, though, is the GroupingByArrayOrder idea:
https://github.com/linkml/linkml-model/blob/aab9842be0e230c0040688dfc6ffa26696c97827/linkml_model/model/schema/array.yaml#L67-L94

That's an implementation detail of how arrays are stored and indexed. I don't think we should touch the storage part in the schema, and the indexing part is handled by the rest of the array specification, right? I could be missing something that requires it to be specified in the schema, but in general I think it would be good to make a clear separation of concerns here. A decent test is "can this array specification be satisfied in such a way that the schema knows absolutely nothing about how the array is serialized?" The responsibility for getting the array ordering correct would then belong to the dumper/loader, just as we would expect the dumper/loader to correctly handle chunking and other serialization details.
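
To make that test concrete, here is a minimal sketch in plain Python. It is not LinkML code: `ArraySpec`, `dump`, and `load` are hypothetical names made up for illustration, and the array is just a nested list. The spec only names dimensions and a shape, while the element order is an argument to the dumper/loader pair.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical schema-level description: dimensions and shape, no storage details."""
    dims: tuple
    shape: tuple


def _indices(shape, order):
    """Enumerate every index tuple in the order a given serialization stores elements."""
    if order == "row-major":      # last axis varies fastest
        return list(product(*(range(n) for n in shape)))
    if order == "column-major":   # first axis varies fastest
        return [idx[::-1] for idx in product(*(range(n) for n in reversed(shape)))]
    raise ValueError(f"unknown order: {order}")


def _get(nested, idx):
    for i in idx:
        nested = nested[i]
    return nested


def _empty(shape):
    if len(shape) == 1:
        return [None] * shape[0]
    return [_empty(shape[1:]) for _ in range(shape[0])]


def dump(nested, spec, order):
    """Flatten a nested list. `order` is the dumper's choice and never appears in the spec."""
    return [_get(nested, idx) for idx in _indices(spec.shape, order)]


def load(flat, spec, order):
    """Rebuild the nested list from a flat sequence written with the same `order`."""
    out = _empty(spec.shape)
    for value, idx in zip(flat, _indices(spec.shape, order)):
        target = out
        for i in idx[:-1]:
            target = target[i]
        target[idx[-1]] = value
    return out


spec = ArraySpec(dims=("x", "y"), shape=(2, 3))
data = [[1, 2, 3], [4, 5, 6]]

row = dump(data, spec, "row-major")
col = dump(data, spec, "column-major")

# Two different byte-level layouts, one unchanged spec, the same logical array.
assert row != col
assert load(row, spec, "row-major") == data
assert load(col, spec, "column-major") == data
```

Both layouts satisfy the same `ArraySpec`, which is the point of the test: nothing about ordering had to be written into the schema-level object for the round trip to work.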

This is actually what I want to work on at the hackashop: a second set of specifications for declaring serializations, so that in a linked data context one would be able to say "this particular array has n linked serializations - this numpy format, that zarr format, etc." without having that be specified in the array's schema. That is, a way of saying "this particular hash of a binary stream is annotated as being a numpy ndarray with shape (x,y)", along with all the other details needed to handle serialization/deserialization, that could be consumed by generalized dumpers/loaders. So we may want to just talk about this next week :)
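
As a rough sketch of what such an annotation could look like (purely illustrative, nothing here is an existing spec): `SerializationAnnotation`, `annotate`, and `load_stream` are hypothetical names, and the `"raw-float64-le"` encoding is a stdlib stand-in for a real format like numpy or zarr, which would each need their own loader branch. The annotation hangs off the hash of the bytes rather than off the array's schema.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SerializationAnnotation:
    """Hypothetical record describing one binary stream, identified by its hash."""
    sha256: str
    encoding: str   # stand-in value "raw-float64-le"; real formats are out of scope here
    shape: tuple
    order: str      # "row-major" or "column-major"


def annotate(payload, encoding, shape, order):
    """Attach serialization details to a stream by hashing its bytes."""
    return SerializationAnnotation(hashlib.sha256(payload).hexdigest(), encoding, shape, order)


def load_stream(payload, annotation):
    """Generic loader sketch: everything it needs comes from the annotation (2-D only)."""
    if hashlib.sha256(payload).hexdigest() != annotation.sha256:
        raise ValueError("payload does not match the annotated hash")
    if annotation.encoding != "raw-float64-le":
        raise NotImplementedError(annotation.encoding)
    rows, cols = annotation.shape
    flat = struct.unpack(f"<{rows * cols}d", payload)
    if annotation.order == "row-major":
        return [[flat[r * cols + c] for c in range(cols)] for r in range(rows)]
    if annotation.order == "column-major":
        return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]
    raise ValueError(f"unknown order: {annotation.order}")


# One logical array, two linked serializations with different element orders.
array = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
row_bytes = struct.pack("<6d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
col_bytes = struct.pack("<6d", 1.0, 4.0, 2.0, 5.0, 3.0, 6.0)

serializations = [
    (row_bytes, annotate(row_bytes, "raw-float64-le", (2, 3), "row-major")),
    (col_bytes, annotate(col_bytes, "raw-float64-le", (2, 3), "column-major")),
]

assert all(load_stream(payload, ann) == array for payload, ann in serializations)
```

The array's schema never appears in this sketch. A loader that is handed bytes plus the matching annotation has enough to reconstruct the array, and a mismatched hash is rejected before any decoding happens.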
