Consider splitting schema vocabulary from grammar #867
-
We have been talking about this for a long time (https://github.com/AllenNeuralDynamics/aind-data-schema/issues/178). HCA has some ideas (1) (2). I agree we should break these out. One alternative I am considering is pulling the vocabularies out and putting them into a database with a REST API in front. However, validation becomes more difficult and less stable this way: every client that consumes the schema would need to implement its own validation logic. We could bake it into aind-data-schema without too much work via custom validators, but e.g. https://metadata-entry.allenneuraldynamics.org/ would need to implement this itself too (probably via run-time schema injection).
-
This is an interesting idea. Out of curiosity, by validation, do you mean validation at the level of the json-schema or pydantic? If the former, the solution could be to version vocabulary and grammar separately and generate pair-wise json-schema combinations of both. You could then try to validate against that specific grammar x vocabulary version, or even brute force by use the
-
Pydantic v2 supports a validation context. We can probably set it up so that when a user creates a class (Subject, for example), a validator queries a database for the values to validate against. This does require the user to have easy access to that database. We would also need to redesign the aind-metadata-entry app: instead of static json schemas, we'd need to generate them via python. The alternative is to periodically build a new schema with the latest controlled vocabulary changes.
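A minimal sketch of the validation-context idea, assuming Pydantic v2 is installed. The `Subject` model below is a simplified, hypothetical stand-in (not the real aind-data-schema class), and the `allowed_species` context key and the in-memory set are assumptions that stand in for whatever a database or REST query would return. Note that the context is passed through `model_validate`, not through the regular constructor.

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class Subject(BaseModel):
    """Hypothetical, simplified model; only the grammar (structure) lives here."""

    species: str

    @field_validator("species")
    @classmethod
    def species_in_vocabulary(cls, value: str, info: ValidationInfo) -> str:
        # The vocabulary arrives at validation time through the context.
        # With no context supplied, the vocabulary check is skipped.
        allowed = (info.context or {}).get("allowed_species")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not in the controlled vocabulary")
        return value


# Stand-in for the result of a database lookup (assumed values).
vocabulary = {"allowed_species": {"Mus musculus", "Homo sapiens"}}

subject = Subject.model_validate({"species": "Mus musculus"}, context=vocabulary)

try:
    Subject.model_validate({"species": "Unknown species"}, context=vocabulary)
    rejected = False
except ValidationError:
    rejected = True
```

The trade-off is the one raised above: the model itself stays stable while the vocabulary moves, but any client that does not go through this Python code path (a static json-schema consumer, for instance) gets no vocabulary validation.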
-
Let's move the controlled vocabularies to a separate importable python repo for now, since that is pretty painless. In the future we can do something more rigorous.
-
Following up on the workshop discussion, it might be a good idea to start taking notes on this subject.
One major hurdle that prevents me from developing on top of the current schemas is the inability to freeze a dependency version. One of the reasons for this stems from the vocabulary and grammar in the schema being versioned together.
To make this more explicit: The aind-data-schema defines a contract. This contract follows a specific structure (grammar) and uses a specific set of values to populate that structure (vocabulary). While it is true that changing either the vocabulary or grammar might result in breaking changes, the frequency (and drive) with which these occur is (or should be) greatly different.
For instance, changing a field name (e.g. `camera` -> `cameras`) or a field type (e.g. `Camera` -> `List[Camera]`) should be considered a grammar change, while adding a manufacturer to the `Manufacturer` enumerator only adds vocabulary. From the point of view of describing experiments, most of the time one would ideally like to freeze the grammar (which affords stability in how the data is handled and interoperates with other tools, like automatic code generators, data mappers, etc.) but still be able to move forward with the vocabulary, for instance when a new device is added to the setup.
As a side note, it is important to know that:
1 - In the previous example, a new json-schema model must be regenerated, since the enumerator is used to validate against `oneOf`/`anyOf` logic. As a result, the json-schema version should be bumped when either of the packages is updated. However, this also means that for a given grammar version there may be several compatible vocabulary versions (which can be easily managed via poetry).
2 - To be fair and for completeness, it is very easy to break the schema through the vocabulary too, simply by modifying or removing an allowed value. However, I would still argue that changes to the vocabulary are made for different reasons than changes to the grammar.
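To make the first side note concrete, here is a rough, standard-library-only sketch of how one grammar version combined with two vocabulary releases yields two different json-schemas that share the same structure. All version strings, manufacturer names, the `$id` format, and the helper names (`build_camera_schema`, `structure_of`) are made up for illustration and are not taken from aind-data-schema.

```python
# Hypothetical version strings and vocabulary values, for illustration only.
GRAMMAR_VERSION = "1.0.0"
VOCABULARY_V1 = {"version": "0.1.0", "manufacturers": ["Vendor A", "Vendor B"]}
VOCABULARY_V2 = {
    "version": "0.2.0",
    "manufacturers": ["Vendor A", "Vendor B", "Vendor C"],
}


def build_camera_schema(vocabulary: dict) -> dict:
    """Combine the fixed grammar with one vocabulary release into a json-schema."""
    return {
        "$id": f"camera/grammar-{GRAMMAR_VERSION}/vocabulary-{vocabulary['version']}",
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            # The only vocabulary-dependent part of the schema.
            "manufacturer": {"enum": list(vocabulary["manufacturers"])},
        },
        "required": ["name", "manufacturer"],
    }


def structure_of(schema: dict) -> dict:
    """Drop the vocabulary-dependent parts, leaving only the grammar."""
    properties = {
        name: {key: value for key, value in spec.items() if key != "enum"}
        for name, spec in schema["properties"].items()
    }
    return {
        "type": schema["type"],
        "properties": properties,
        "required": schema["required"],
    }


schema_v1 = build_camera_schema(VOCABULARY_V1)
schema_v2 = build_camera_schema(VOCABULARY_V2)

# The generated schemas differ, so the json-schema version must be bumped...
schemas_differ = schema_v1 != schema_v2
# ...but the structure that downstream tools depend on is unchanged.
same_grammar = structure_of(schema_v1) == structure_of(schema_v2)
```

Under these assumptions, tools that only care about structure could pin the grammar version, while validation would use whichever grammar x vocabulary pair a given data asset was written against.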
This conceptual change would require keeping these two concepts in two separate modules, but it may be worth it in the long run. It would also provide a fertile ground for a case where the AIND could provide a general structure for data schemas and allow different users/institutes to come up with extensions to the vocabulary.