Consider splitting schema vocabulary from grammar #867

bruno-f-cruz · 2024-03-29T14:37:40Z

bruno-f-cruz
Mar 29, 2024
Collaborator

Following up on the workshop discussion, it might be a good idea to start taking notes on this subject.

One major hurdle that prevents me from developing on top of the current schemas is the inability to freeze a dependency version. One of the reasons for this stems from the vocabulary and grammar in the schema being versioned together.
To make this more explicit: aind-data-schema defines a contract. This contract follows a specific structure (grammar) and uses a specific set of values to populate that structure (vocabulary). While it is true that changing either the vocabulary or the grammar might result in breaking changes, the frequency of (and the motivation for) each kind of change is, or should be, very different.

For instance, changing a field name (e.g. camera -> cameras) or a field type (e.g. Camera -> List[Camera]) should be considered a grammar change, whereas adding a manufacturer to the Manufacturer enumeration only adds vocabulary. When describing experiments, one would usually like to freeze the grammar (which affords stability in how the data is handled and in how it interoperates with other tools, such as automatic code generators and data mappers) while remaining free to move forward with the vocabulary, for instance when a new device is added to the setup.
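To make the distinction concrete, here is a minimal sketch using only the standard library rather than the actual aind-data-schema models. The `Camera` and `Rig` classes, the field names, and the "Vendor A"/"Vendor B" values are hypothetical stand-ins; the point is only that the enumeration could live in its own, separately versioned module.

```python
from dataclasses import dataclass, field, fields
from enum import Enum
from typing import List


# --- vocabulary: allowed values (hypothetical; could be its own package) ---
class Manufacturer(str, Enum):
    VENDOR_A = "Vendor A"
    VENDOR_B = "Vendor B"
    # Adding VENDOR_C here is a vocabulary change: no field below is touched.


# --- grammar: structure (field names and types) ---
@dataclass
class Camera:
    name: str
    manufacturer: Manufacturer


@dataclass
class Rig:
    # Renaming this field or changing its type would be a grammar change.
    cameras: List[Camera] = field(default_factory=list)


rig = Rig(cameras=[Camera("face_camera", Manufacturer("Vendor A"))])

# The grammar (the set of field names) does not depend on the vocabulary.
print([f.name for f in fields(Rig)])
```

A value that is missing from the vocabulary, such as `Manufacturer("Vendor Z")`, raises `ValueError`, which is why even a vocabulary-only release still changes what validates.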

As a side note, it is important to know that:
1 - In the previous example, a new json-schema must be generated, because the enumeration is used to validate against oneOf/anyOf logic. As a result, the json-schema version should be bumped when either of the packages is updated. However, this also means that a given grammar version may have several compatible vocabulary versions (which can be easily managed via poetry).
2 - To be fair, and for completeness, it is very easy to break the schema through the vocabulary too, simply by modifying or removing an allowed value. However, I would still argue that changes to the vocabulary are made for different reasons than changes to the grammar.

This conceptual change would require keeping these two concepts in two separate modules, but it may be worth it in the long run. It would also open the door to AIND providing a general structure for data schemas while allowing different users/institutes to come up with their own extensions to the vocabulary.

dyf · 2024-03-29T17:52:02Z

dyf
Mar 29, 2024
Maintainer

We have been talking about this for a long time (https://github.com/AllenNeuralDynamics/aind-data-schema/issues/178). HCA has some ideas (1) (2).

I agree we should break these out. One alternative I am considering is pulling vocabularies out and putting them into a database with a REST API in front.

However, validation becomes difficult and less stable this way. Every client that consumes the schema would need to implement their own validation logic. We could bake it into aind-data-schema without too much work via custom validators, but e.g. https://metadata-entry.allenneuraldynamics.org/ would need to implement this itself too (probably via run-time schema injection).

1 reply

dyf Mar 29, 2024
Maintainer

@saskiad and @jtyoung84 please weigh in.

bruno-f-cruz · 2024-03-29T17:58:16Z

bruno-f-cruz
Mar 29, 2024
Collaborator Author

> However, validation becomes difficult and less stable this way. Every client that consumes the schema would need to implement their own validation logic. We could bake it into aind-data-schema without too much work via custom validators, but e.g. https://metadata-entry.allenneuraldynamics.org/ would need to implement this itself too (probably via run-time schema injection).

This is an interesting idea. Out of curiosity, by validation, do you mean validation at the level of the json-schema or of pydantic? If the former, the solution could be to version vocabulary and grammar separately and generate a json-schema for each pair-wise combination of the two. You could then validate against that specific grammar x vocabulary version, or even brute-force it by using the anyOf logic afforded by the json-schema spec.
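A rough sketch of what I mean by the pair-wise combinations, assuming two made-up grammar versions and two made-up vocabulary versions (all version numbers, field names, and vendor names below are hypothetical). It only builds the schema documents as plain dicts; a json-schema validator would then consume them.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def camera_schema(manufacturers: List[str]) -> dict:
    """Schema for one camera; the enum is where the vocabulary plugs in."""
    return {
        "type": "object",
        "properties": {"manufacturer": {"enum": manufacturers}},
        "required": ["manufacturer"],
    }


def grammar_v1(manufacturers: List[str]) -> dict:
    # Hypothetical grammar 1.0.0: a single `camera` object.
    return {
        "type": "object",
        "properties": {"camera": camera_schema(manufacturers)},
        "required": ["camera"],
    }


def grammar_v2(manufacturers: List[str]) -> dict:
    # Hypothetical grammar 2.0.0: a `cameras` list (Camera -> List[Camera]).
    return {
        "type": "object",
        "properties": {
            "cameras": {"type": "array", "items": camera_schema(manufacturers)}
        },
        "required": ["cameras"],
    }


GRAMMARS: Dict[str, Callable[[List[str]], dict]] = {
    "1.0.0": grammar_v1,
    "2.0.0": grammar_v2,
}
VOCABULARIES: Dict[str, List[str]] = {
    "1.0.0": ["Vendor A", "Vendor B"],
    "1.1.0": ["Vendor A", "Vendor B", "Vendor C"],
}

# One schema per (grammar version, vocabulary version) pair.
schemas: Dict[Tuple[str, str], dict] = {
    (g, v): GRAMMARS[g](VOCABULARIES[v])
    for g, v in product(GRAMMARS, VOCABULARIES)
}

# Brute-force option: accept an instance if any pair accepts it.
combined = {"anyOf": list(schemas.values())}
print(len(schemas), "pair-wise schemas")
```

The number of schemas grows as grammars x vocabularies, so in practice one would probably only generate the pairs known to be compatible.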

1 reply

dyf Apr 2, 2024
Maintainer

> Out of curiosity, by validation, do you mean validation at the level of the json-schema or pydantic?

Either, or both. Agree that we would need to hack the JSONschema to make it general purpose.

jtyoung84 · 2024-03-29T18:32:29Z

jtyoung84
Mar 29, 2024
Maintainer

Pydantic v2 supports a validation context. We can probably set it up so that when a user creates a class (Subject for example) then a validator will query a database for info to validate against. It does require a user to be able to have easy access to that database. We would need to redesign the aind-metadata-entry app. Instead of static json schemas, we'd need to generate them via python. The alternative is to periodically build a new schema with the latest controlled vocabulary changes.
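For the last alternative (periodically rebuilding a schema from the latest controlled vocabulary), a minimal sketch might look like the following. It uses an in-memory SQLite database as a stand-in for the real vocabulary store; the table layout, the `species` field, and the function names are assumptions made for illustration, not anything that exists in aind-data-schema.

```python
import json
import sqlite3
from typing import List


def load_vocabulary(conn: sqlite3.Connection, name: str) -> List[str]:
    """Read the allowed values for one controlled vocabulary."""
    rows = conn.execute(
        "SELECT value FROM vocabulary WHERE name = ? ORDER BY value", (name,)
    )
    return [row[0] for row in rows]


def build_subject_schema(conn: sqlite3.Connection) -> dict:
    """Bake the current vocabulary into a static json-schema snapshot."""
    return {
        "title": "Subject",
        "type": "object",
        "properties": {"species": {"enum": load_vocabulary(conn, "species")}},
        "required": ["species"],
    }


# Stand-in database with a hypothetical layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vocabulary (name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO vocabulary VALUES (?, ?)",
    [("species", "Mus musculus"), ("species", "Rattus norvegicus")],
)

snapshot = build_subject_schema(conn)

# A vocabulary change lands in the database; the next scheduled build
# picks it up without any change to the schema-building code.
conn.execute("INSERT INTO vocabulary VALUES (?, ?)", ("species", "Homo sapiens"))
rebuilt = build_subject_schema(conn)

print(json.dumps(rebuilt, indent=2))
```

Clients such as the metadata entry app would keep consuming static json-schema files; only the build job needs database access.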

3 replies

bruno-f-cruz Mar 30, 2024
Collaborator Author

Unless I am missing something, this doesn't add much on top of simply validating against the json-schema directly (e.g. using https://python-jsonschema.readthedocs.io/en/stable/). In my mind, if the validation logic is simple enough to live in a database, it can probably also be expressed directly in the json-schema. In that case, you really just need to keep a library of json-schemas, like you already do, and for each incoming instance find the correct combination of grammar by vocabulary. Otherwise, if you need complex validation (e.g. across properties/schemas, like you do with rig and session), I don't think you will easily be able to bake this into the queryable database anyway, as it would require deserializing the validation logic itself.

Another alternative (whose consequences I am not sure I fully understand) is to cache multiple pyproject.toml environments, each pinning one entry of the grammar-by-vocabulary table, on a server somewhere and automatically grab the one that should be used for each instance. This way you could still keep using pydantic to do the heavy lifting. This would require maintaining a compatibility table of the two versions (similar to what other projects do, e.g. https://docs.nvidia.com/deeplearning/cudnn/reference/support-matrix.html), which should probably be done anyway, since some vocabulary versions might require a minimum version of the grammar and vice versa.
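As a sketch of the compatibility table, assuming simple `major.minor.patch` version strings and an inclusive range of vocabulary versions per grammar version (the matrix contents and function names are invented for illustration):

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Turn 'major.minor.patch' into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


# Hypothetical support matrix:
# grammar version -> (lowest, highest) compatible vocabulary version.
SUPPORT_MATRIX: Dict[str, Tuple[str, str]] = {
    "1.0.0": ("1.0.0", "1.2.0"),
    "2.0.0": ("1.1.0", "1.3.0"),
}

# Hypothetical list of vocabulary releases.
VOCABULARY_RELEASES: List[str] = ["1.0.0", "1.1.0", "1.2.0", "1.3.0"]


def compatible_vocabularies(grammar: str) -> List[str]:
    """List the vocabulary releases usable with a given grammar version."""
    low, high = SUPPORT_MATRIX[grammar]
    return [
        release
        for release in VOCABULARY_RELEASES
        if parse(low) <= parse(release) <= parse(high)
    ]


def is_compatible(grammar: str, vocabulary: str) -> bool:
    """Check a single grammar/vocabulary pair against the matrix."""
    return vocabulary in compatible_vocabularies(grammar)


print(compatible_vocabularies("2.0.0"))
```

Each row of such a table would correspond to one cached environment, and an incoming instance would be routed to the environment matching the versions it declares.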

jtyoung84 Mar 30, 2024
Maintainer

I may be mixing up the discussions. David pointed to this issue: #178. In that discussion, as an example, we want some way to ensure that a user selects an option from this endpoint: brain-map. It's possible we may run into a database that's too large to feasibly dump into a json document.

For small enough lists, I think it might make sense to have a lightweight aind-data-models repository.

bruno-f-cruz Mar 30, 2024
Collaborator Author

Oh, I understand now; thanks for adding the brain-map link! Sure, that solution makes sense to me too if you don't want to rely on pydantic as the source for the vocabulary. I think it is very much in the same spirit as validation via json-schema using the anyOf logic. Conceptually aligned, at least!

dyf · 2024-04-02T15:28:28Z

dyf
Apr 2, 2024
Maintainer

Let's move the controlled vocabularies to a separate importable python repo for now, since that is pretty painless. In the future we can do something more rigorous.

#874

0 replies