Allow for multiple input_data in data_description. #1042

seanmcculloch · 2024-08-27T21:13:59Z

seanmcculloch
Aug 27, 2024
Collaborator

Is your feature request related to a problem? Please describe.

When a derived data asset is the result of processing more than one data asset, there needs to be a strategy for documenting all relevant input data. An example of this is mFISH round-to-round or other multi-data-asset registration tasks.

permalink to relevant line: https://github.com/AllenNeuralDynamics/aind-data-schema/blob/687b7a3c4aa55552c89a80a3e6634a6c84f022fa/src/aind_data_schema/core/data_description.py#L153C5-L153C20

Describe the solution you'd like
The solution should take searchability into account, along with compatibility with existing records and schema versions. I can see at least 3 potential solutions; the alternate solutions are described below.

Proposed Solution 1. Add a new field, data_description.additional_input_data_name -> Optional[List[str]]. When there is more than one input data asset, the most important one could be documented in input_data_name (chosen contextually per type of data asset being described). The rest of the input data could be put into data_description.additional_input_data_name.

By using input_data_name along with an optional field for cases with more than one input, the input_data_name field could be used more consistently for searching by users. The added complexity for end-user queries would only apply to users who expect data assets with more than one input; those users should be familiar with the conventions for how input_data_name and additional_input_data_name are used. A naive search of input_data_name should still return closely related derived data with this strategy.

Example: mFISH round-to-round registration results.
A newly created derived data asset contains the displacement field transform and other info that maps data asset HCR_BL6-000_2023-06-07_00-00-00/ into the coordinates of data asset HCR_BL6-000_2023-06-01_00-00-00/.

Following Solution 1 we would notate:
input_data_name: "HCR_BL6-000_2023-06-07_00-00-00/"
additional_input_data_name: ["HCR_BL6-000_2023-06-01_00-00-00/"]

Here the convention for this type of data asset would be to put the moving-image data into input_data_name, since the transform is intended to be applied to that data ("HCR_BL6-000_2023-06-07_00-00-00/"), but the other dataset was used in its calculation and must be documented as input data in some form.
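The convention above can be sketched in Python. This is a minimal stand-in built on a standard-library dataclass, not the actual aind-data-schema model: the class name `DerivedDescriptionSketch`, the helpers `all_input_names` and `find_derived`, and the asset name `registration-result` are all hypothetical, and `additional_input_data_name` is the field proposed in Solution 1, not an existing one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DerivedDescriptionSketch:
    """Hypothetical stand-in for the relevant part of data_description."""

    name: str
    input_data_name: str  # primary input, chosen by convention
    additional_input_data_name: Optional[List[str]] = None  # proposed field

    def all_input_names(self) -> List[str]:
        """Return the primary input followed by any additional inputs."""
        return [self.input_data_name] + list(self.additional_input_data_name or [])


def find_derived(records, asset_name, include_additional=False):
    """Return records whose inputs include asset_name.

    With include_additional=False this is the naive search on
    input_data_name only.
    """
    hits = []
    for rec in records:
        names = rec.all_input_names() if include_additional else [rec.input_data_name]
        if asset_name in names:
            hits.append(rec)
    return hits


registration = DerivedDescriptionSketch(
    name="registration-result",  # hypothetical asset name
    input_data_name="HCR_BL6-000_2023-06-07_00-00-00/",
    additional_input_data_name=["HCR_BL6-000_2023-06-01_00-00-00/"],
)
records = [registration]

# Naive search: only the moving image (input_data_name) is checked.
naive_hits = find_derived(records, "HCR_BL6-000_2023-06-01_00-00-00/")
# Expanded search: the additional inputs are checked as well.
full_hits = find_derived(
    records, "HCR_BL6-000_2023-06-01_00-00-00/", include_additional=True
)
```

In this sketch, a naive search for the moving image still finds the registration result, while finding it from the other input requires the user to know about the additional field.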

Describe alternatives you've considered

Alternate Solution 2. Change the type of data_description.input_data_name from str to List[str]. This most straightforward change to the type of input_data_name would change how users interact with this field when searching, and would create typing inconsistencies across versions.
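To illustrate the typing inconsistency, anyone reading records written under both the current schema (`str`) and Solution 2 (`List[str]`) would likely need a shim along these lines. The helper name `normalize_input_data_name` and the plain-dict records are hypothetical and for illustration only.

```python
from typing import List, Union


def normalize_input_data_name(value: Union[str, List[str], None]) -> List[str]:
    """Coerce input_data_name to a list regardless of schema version."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # older records: a single name
    return list(value)  # records written under Solution 2


old_record = {"input_data_name": "HCR_BL6-000_2023-06-07_00-00-00/"}
new_record = {
    "input_data_name": [
        "HCR_BL6-000_2023-06-07_00-00-00/",
        "HCR_BL6-000_2023-06-01_00-00-00/",
    ]
}

old_inputs = normalize_input_data_name(old_record["input_data_name"])
new_inputs = normalize_input_data_name(new_record["input_data_name"])
```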

Alternate Solution 3. Use data_description.input_data_name as a comma-separated list within a single str. This would then require end-users to use regex-based queries to properly parse this field and search for data efficiently.
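As a rough sketch of what Solution 3 would push onto end users, the following uses only Python's `re` module. The separator handling and the helper names `split_input_names` and `has_input` are assumptions, not an existing convention.

```python
import re
from typing import List


def split_input_names(field: str) -> List[str]:
    """Split a comma-separated input_data_name string into names."""
    return [part.strip() for part in field.split(",") if part.strip()]


def has_input(field: str, asset_name: str) -> bool:
    """Check whether asset_name appears as a whole entry in the list."""
    # Anchor on start/end or commas so partial names do not match.
    pattern = r"(?:^|,)\s*" + re.escape(asset_name) + r"\s*(?:,|$)"
    return re.search(pattern, field) is not None


field = "HCR_BL6-000_2023-06-07_00-00-00/, HCR_BL6-000_2023-06-01_00-00-00/"
names = split_input_names(field)
found = has_input(field, "HCR_BL6-000_2023-06-01_00-00-00/")
partial = has_input(field, "HCR_BL6-000_2023-06-0")
```

Every consumer would need to agree on the separator and whitespace rules, which is part of the cost of this option.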

tmchartrand · 2024-11-26T00:19:12Z

tmchartrand
Nov 26, 2024
Collaborator

As far as I can tell, input_name as linked is not an actual schema field, but a property of the class that seems to be there just to help construct the name of derived data assets. The schema location for this would be in the input_location field of the relevant DataProcess, which we currently have a PR up to allow lists in. We've also been discussing the option of a composite input object with additional contextual metadata on why those inputs were selected: see #1167
