First pass at native NDArray support. #181

Merged
merged 16 commits into from
Feb 8, 2024

cmungall
Member

@cmungall cmungall commented Feb 5, 2024

See

This introduces first-class array support into LinkML.

A minimal example would be:

    attributes:
      temperature_matrix:
        range: float
        array_info:
          exact_dimensions: 3

The native serialization of this in JSON/YAML will be a list of lists of lists (LoLoL). Using linkml-xarrays, it will be possible to serialize to HDF5, Zarr, etc.
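To make the list-of-lists-of-lists form concrete, here is a minimal sketch in plain Python. The helper name `nested_shape` is hypothetical and not part of LinkML; it assumes the nested lists are rectangular and only inspects the first element at each level.

```python
def nested_shape(value):
    """Return the shape of a rectangular nested list as a tuple of ints.

    Assumption: the data is not ragged, so only the first element of each
    level is inspected.
    """
    shape = []
    level = value
    while isinstance(level, list):
        shape.append(len(level))
        if not level:
            break
        level = level[0]
    return tuple(shape)


# A 2 x 2 x 3 temperature_matrix as it could appear after loading JSON/YAML.
temperature_matrix = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]

shape = nested_shape(temperature_matrix)
# exact_dimensions: 3 means the nesting must be exactly three levels deep.
assert len(shape) == 3
```

A plain scalar yields an empty shape, so it would not satisfy `exact_dimensions: 3`.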

The corresponding nptyping type would be NDArray[Shape["*, *, *"], Float].

(note: modelers will want the ability to use ctypes but this is orthogonal)

Note that this does not force any metadata on the array. We are deferring the data model for the equivalent of xarray DataArrays; these will be supported via implements for now, with first-class incorporation in a future version. This will allow binding between axes and other LinkML arrays.

Minimal metadata can be introduced by naming the axes:

    attributes:
      temperature_matrix:
        range: float
        array_info:
          exact_dimensions: 3
          dimensions:
            x:
            y:
            z:

The corresponding nptyping type would be NDArray[Shape["* x, * y, * z"], Float].

The shape can be further constrained; imagine an RGB matrix with coords x, y, and a length 3 r/g/b:

    attributes:
      rgb:
        range: float
        array_info:
          exact_dimensions: 3
          dimensions:
            x:
            y:
            rgb:
              exact_cardinality: 3
              description: r, g, b values
              annotations:
                names: "[red, green, blue]"

This corresponds to NDArray[Shape["* x, * y, 3 rgb"], Float].
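The mapping from these array descriptions to the shape strings quoted above can be sketched as follows. This only builds the string; it does not import nptyping, and the function name `shape_expression` and its argument layout are assumptions made for illustration.

```python
def shape_expression(exact_dimensions, dimensions=None):
    """Build an nptyping-style shape string from an array description.

    `dimensions` maps axis names to optional settings, listed in axis order.
    Hypothetical helper, not part of LinkML or nptyping.
    """
    dimensions = dimensions or {}
    if len(dimensions) > exact_dimensions:
        raise ValueError("more named dimensions than exact_dimensions")
    parts = []
    for name, info in dimensions.items():
        # An axis without exact_cardinality is left as a wildcard.
        size = (info or {}).get("exact_cardinality", "*")
        parts.append(f"{size} {name}")
    # Axes that were not named stay as bare wildcards.
    parts.extend("*" for _ in range(exact_dimensions - len(dimensions)))
    return ", ".join(parts)


unnamed = shape_expression(3)
named = shape_expression(3, {"x": None, "y": None, "z": None})
rgb = shape_expression(3, {"x": None, "y": None, "rgb": {"exact_cardinality": 3}})
```

Here `unnamed` is `"*, *, *"`, `named` is `"* x, * y, * z"`, and `rgb` is `"* x, * y, 3 rgb"`, matching the three examples in this description.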

For now, if you do want to bind dimensions to additional metadata, this can be done via annotations:

    classes:

      TemperatureDataset:
        tree_root: true
        annotations:
          array_data_mapping:
            data: temperatures_in_K
            dims: [x, y, t]
            coords:
              latitude_in_deg: x
              longitude_in_deg: y
              time_in_d: t
        attributes:
          name:
            identifier: true
            range: string
          latitude_in_deg:
            required: true
            range: float
            multivalued: true
            unit:
              ucum_code: deg
            array_info:
              exact_dimensions: 1
          longitude_in_deg:
            required: true
            range: float
            multivalued: true
            unit:
              ucum_code: deg
            array_info:
              exact_dimensions: 1
          time_in_d:
            range: float
            multivalued: true
            implements:
              - linkml:elements
            required: true
            unit:
              ucum_code: d
            array_info:
              exact_dimensions: 1
          temperatures_in_K:
            range: float
            multivalued: true
            required: true
            unit:
              ucum_code: K
            array_info:
              exact_dimensions: 3
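One way to read the array_data_mapping annotation is as a consistency rule between the data array and its coordinate arrays. The sketch below assumes an instance loaded as a plain dict of nested lists; the function name `check_array_data_mapping` is hypothetical, and nothing here is an existing LinkML API.

```python
def check_array_data_mapping(instance, mapping):
    """Return a list of problems; an empty list means the coords fit the data."""
    # Read the data shape from the first element at each nesting level
    # (assumes a rectangular, non-ragged array).
    shape = []
    level = instance[mapping["data"]]
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0] if level else None

    dims = mapping["dims"]
    if len(shape) != len(dims):
        return [f"data has {len(shape)} axes but dims lists {len(dims)}"]

    axis_length = dict(zip(dims, shape))
    problems = []
    for coord_slot, dim in mapping["coords"].items():
        if dim not in axis_length:
            problems.append(f"{coord_slot}: unknown dimension {dim!r}")
        elif len(instance[coord_slot]) != axis_length[dim]:
            problems.append(
                f"{coord_slot}: length {len(instance[coord_slot])} does not "
                f"match axis {dim!r} of length {axis_length[dim]}"
            )
    return problems


mapping = {
    "data": "temperatures_in_K",
    "dims": ["x", "y", "t"],
    "coords": {"latitude_in_deg": "x", "longitude_in_deg": "y", "time_in_d": "t"},
}
dataset = {
    "name": "example",  # made-up instance data for illustration
    "latitude_in_deg": [1.0, 2.0],
    "longitude_in_deg": [4.0, 5.0, 6.0],
    "time_in_d": [0.0],
    "temperatures_in_K": [
        [[270.0], [271.0], [272.0]],
        [[273.0], [274.0], [275.0]],
    ],
}
problems = check_array_data_mapping(dataset, mapping)
```

For this 2 x 3 x 1 dataset `problems` is empty; lengthening `time_in_d` without changing the data would produce one reported mismatch.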

@cmungall cmungall requested a review from rly February 7, 2024 00:45
@@ -26,7 +26,7 @@ classes:
implements:
- linkml:NDArray
annotations:
dimensions: 1
dimensions_info: "1"
Contributor

Should I reformat these to use:

        array_info:
          exact_dimensions: 1

Contributor

For this TimestampSeries class, I'm actually not sure it would be allowed to put array_info at the class definition level, but maybe for the other ones?

Member Author

unfortunately that doesn't quite work because the range of dimensions_info is a collection of dimension_expressions, so these have to be at a different level

e.g. this is invalid yaml:

    dimensions_info: 3
      x:
      y:
      z:

Comment on lines 28 to 34
    x:
    y:
    rgb:
      exact_cardinality: 3
      description: r, g, b values
      annotations:
        names: "[red, green, blue]"
Contributor

Suggested change
    - x:
    - y:
    - rgb:
    -   exact_cardinality: 3
    -   description: r, g, b values
    -   annotations:
    -     names: "[red, green, blue]"
    + x:
    +   dimension_index: 0
    + y:
    +   dimension_index: 1
    + rgb:
    +   dimension_index: 2
    +   exact_cardinality: 3
    +   description: r, g, b values
    +   annotations:
    +     names: "[red, green, blue]"

It would be good to be explicit here. This also allows some dimensions to be unnamed.
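To illustrate how an explicit dimension_index lets some axes stay unnamed, here is a small sketch that places each named dimension at its index and leaves the rest as wildcards. The function name and the dict layout are assumptions for illustration, not an existing LinkML API.

```python
def shape_from_indexed_dimensions(number_of_dimensions, dimensions):
    """Place named dimensions by dimension_index; unnamed axes stay '*'."""
    parts = ["*"] * number_of_dimensions
    seen = set()
    for name, info in dimensions.items():
        index = info["dimension_index"]
        if not 0 <= index < number_of_dimensions:
            raise ValueError(f"{name}: dimension_index {index} out of range")
        if index in seen:
            raise ValueError(f"{name}: duplicate dimension_index {index}")
        seen.add(index)
        size = info.get("exact_cardinality", "*")
        parts[index] = f"{size} {name}"
    return ", ".join(parts)


# The middle axis is left unnamed, and the listing order no longer matters.
partial = shape_from_indexed_dimensions(
    3,
    {
        "rgb": {"dimension_index": 2, "exact_cardinality": 3},
        "x": {"dimension_index": 0},
    },
)
```

Here `partial` is `"* x, *, 3 rgb"`. Without an explicit index, the position of each axis would have to be inferred from the order of the keys.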

@rly
Contributor

rly commented Feb 7, 2024

Notes:

  • It is possible to have scalar arrays in many array formats (e.g., NumPy, HDF5, Zarr) but not all (e.g., Matlab). They seem primarily used to allow API functions to work seamlessly with both arrays and scalars and as a box to pass scalars to a function by reference. Matlab gets around this by not having special dtypes for scalars and not allowing scalar arrays - they are just 1x1 matrices. I do not know that we need to support scalar arrays. To be conservative, perhaps it would be best to leave out support until there is a need. Refs:
  • If not specified, minimum_number_dimensions is 1, maximum_number_dimensions is Infinity.
  • It is possible to specify minimum_number_dimensions without maximum_number_dimensions. We have a use case for a min without a max, but I cannot think of one for a max without a min.
  • I am not sure there is a use case for minimum_cardinality and maximum_cardinality. Maybe we can omit them and add them later if there is a need? I am concerned about adding features to the metamodel that are useless but need to be supported for a long time to avoid breaking backwards compatibility.

Some useful validation checks:

  1. The number of items in dimensions_info must not exceed maximum_number_dimensions
  2. Each dimension_index must not exceed maximum_number_dimensions - 1
  3. Each dimension_index must be unique across dimensions_info items
  4. If array_info exists, then multivalued must be True (if we allow scalar arrays, then this is no longer always true)
  5. minimum_number_dimensions <= maximum_number_dimensions
  6. Cannot have both exact_number_dimensions and minimum_number_dimensions or both exact_number_dimensions and maximum_number_dimensions
  7. minimum_number_dimensions, exact_number_dimensions, maximum_number_dimensions all > 0
  8. minimum_cardinality <= maximum_cardinality
  9. Cannot have both exact_cardinality and minimum_cardinality or both exact_cardinality and maximum_cardinality
  10. minimum_cardinality, exact_cardinality, maximum_cardinality all > 0
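The checks above can be sketched as a single validation function over a plain dict. This is an illustration only: the function names are hypothetical, the property names follow the spelling used in this comment, and using exact_number_dimensions as the upper bound for checks 1 and 2 when no maximum is given is an assumption.

```python
def _check_triplet(errors, label, exact, minimum, maximum):
    """Shared logic for checks 5-7 (dimensions) and 8-10 (cardinality)."""
    for name, value in (("exact", exact), ("minimum", minimum), ("maximum", maximum)):
        if value is not None and value <= 0:
            errors.append(f"{label}: {name} must be > 0")
    if exact is not None and (minimum is not None or maximum is not None):
        errors.append(f"{label}: exact cannot be combined with minimum or maximum")
    if minimum is not None and maximum is not None and minimum > maximum:
        errors.append(f"{label}: minimum exceeds maximum")


def validate_array_spec(spec, multivalued=True):
    """Return a list of error messages; an empty list means all checks pass."""
    errors = []
    exact = spec.get("exact_number_dimensions")
    minimum = spec.get("minimum_number_dimensions")
    maximum = spec.get("maximum_number_dimensions")
    dimensions = spec.get("dimensions_info") or {}

    if not multivalued:  # check 4
        errors.append("array_info requires multivalued: true")
    _check_triplet(errors, "number of dimensions", exact, minimum, maximum)

    # Assumption: an exact number of dimensions also acts as the upper bound.
    upper = exact if exact is not None else maximum
    if upper is not None and len(dimensions) > upper:  # check 1
        errors.append("more dimensions_info items than allowed dimensions")

    seen = set()
    for name, info in dimensions.items():
        info = info or {}
        index = info.get("dimension_index")
        if index is not None:
            if upper is not None and index > upper - 1:  # check 2
                errors.append(f"{name}: dimension_index {index} out of range")
            if index in seen:  # check 3
                errors.append(f"{name}: duplicate dimension_index {index}")
            seen.add(index)
        _check_triplet(
            errors,
            f"{name} cardinality",
            info.get("exact_cardinality"),
            info.get("minimum_cardinality"),
            info.get("maximum_cardinality"),
        )
    return errors


rgb_spec = {
    "exact_number_dimensions": 3,
    "dimensions_info": {
        "x": {"dimension_index": 0},
        "y": {"dimension_index": 1},
        "rgb": {"dimension_index": 2, "exact_cardinality": 3},
    },
}
rgb_errors = validate_array_spec(rgb_spec)
```

The RGB description passes with no errors. Returning a list instead of raising on the first failure lets a schema author see every violated check at once.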

Contributor

@sneakers-the-rat sneakers-the-rat left a comment

OK I am mostly settled on this, but I have a pitch that i'll make in person about extending the notion of implements and how it could allow for plugins in a minute (and then update this so that it's public later)

Comment on lines 1423 to 1430
    array_info:
      domain: slot_definition
      range: array_info_expression
      inherited: true
      description: coerces the value of the slot into an array and defines the dimensions of that array
      status: testing

    dimensions_info:
Contributor

hopefully not annoying naming question, why not just array and dimensions here?

Contributor

(we did change it to that)

inlined: true
status: testing

minimum_dimensions:
Contributor

how do we model mutual exclusivity and constraints between properties here? eg. one shouldn't use exact_dimensions alongside maximum_dimensions, maximum_dimensions shouldn't be less than minimum_dimensions, etc. Part of why i like expressing ranges/values in a single object like dimensions: 3 or dimensions: {min: 2, max: 3} is that you can model within the scope of that object, but it seems like that can also be done here i'm just not sure how

range: integer
status: testing

maximum_dimensions:
Contributor

separate but related question to above - it seems like ranges and exact values are (or will be) a common pattern, thoughts on having a range syntax so we could just have a single property that can take an integer or a 1..2 specification?
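A single-property range syntax of the kind floated here could be parsed as in the sketch below. The `1..2` form is only a proposal in this thread, so the accepted spellings (`3`, `"2..3"`, `"2.."`, `"..3"`) and the function name are assumptions; the default lower bound of 1 follows the note above that an unspecified minimum is 1 and an unspecified maximum is unbounded.

```python
def parse_dimension_range(spec):
    """Return (minimum, maximum); maximum is None when unbounded.

    Hypothetical syntax: an integer means an exact value, "a..b" a closed
    range, "a.." a minimum only, and "..b" a maximum only.
    """
    if isinstance(spec, int):
        low, high = spec, spec
    else:
        low_text, separator, high_text = str(spec).partition("..")
        if not separator:
            # A bare number written as a string is treated as an exact value.
            low = high = int(low_text)
        else:
            low = int(low_text) if low_text else 1
            high = int(high_text) if high_text else None
    if low < 1 or (high is not None and high < low):
        raise ValueError(f"invalid dimension range: {spec!r}")
    return low, high


exact = parse_dimension_range(3)
closed = parse_dimension_range("2..3")
open_ended = parse_dimension_range("2..")
```

Here `exact` is `(3, 3)`, `closed` is `(2, 3)`, and `open_ended` is `(2, None)`. Because an exact value and a range share one property, combinations such as an exact value together with a separate maximum cannot be written at all, and a reversed range like `"3..2"` is rejected in one place.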

@@ -1420,6 +1420,39 @@ slots:
- BasicSubset
- ObjectOrientedProfile

array_info:
Contributor

are there no minimum properties that need to be specified on an array? eg. having this property without any values is an Any shaped array?

@sneakers-the-rat
Contributor

sneakers-the-rat commented Feb 8, 2024

for posterity, archived version of matrix of tradeoffs to be added to consolidated docs later, along with @rly 's notes and examples :)

live: https://wiki.jon-e.net/LinkML_Arrays
archive: https://web.archive.org/web/20240207234701/https://wiki.jon-e.net/LinkML_Arrays

Comment on lines +1472 to +1477
    has_extra_dimensions:
      description: If this is set to true
      domain: array_expression
      range: boolean
      status: testing

Contributor

Suggested change
    - has_extra_dimensions:
    -   description: If this is set to true
    -   domain: array_expression
    -   range: boolean
    -   status: testing

- exact_number_dimensions
- minimum_number_dimensions
- maximum_number_dimensions
- has_extra_dimensions
Contributor

Suggested change
- has_extra_dimensions

@cmungall cmungall merged commit 2ca9c98 into main Feb 8, 2024
4 checks passed