Use lists of lists for elements #7

Open
sneakers-the-rat opened this issue Feb 3, 2024 · 4 comments

@sneakers-the-rat

Similar to: #6

Currently it seems like the elements slot is intended to store a flat list of values that are given shape by the indices, and that might bleed into the other parts of the implementation. It would be really nice to be able to use the domain-specific representation of arrays in different formats, though! One way of doing this would be to make the in-memory storage totally generic w.r.t. the generator format, which I like. But since the format is YAML, there is some degree of "JSON-like primacy" where the "natural" representation of data has to fit in JSON, which makes sense, even if it's worth being careful about not specifying too much about serialization in the schema.

JSON-Schema actually can represent arrays natively as lists of lists! Here, for example, is the generated schema for a dense RGB image array:

# sorry for the long ass import string, still working on virtualizing imports
import json
from nwb_linkml.models.pydantic.core.v2_6_0_alpha.core_nwb_image import RGBImage

print(json.dumps(RGBImage.model_json_schema(), indent=2))
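# the generated JSON Schema for RGBImage is printed below: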
{
  "additionalProperties": false,
  "description": "A color image.",
  "properties": {
    "hdf5_path": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The absolute path that this object is stored in an NWB file",
      "title": "Hdf5 Path"
    },
    "name": {
      "title": "Name",
      "type": "string"
    },
    "resolution": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Pixel resolution of the image, in pixels per centimeter.",
      "title": "Resolution"
    },
    "description": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Description of the image.",
      "title": "Description"
    },
    "array": {
      "anyOf": [
        {
          "items": {
            "items": {
              "items": {
                "type": "number"
              },
              "maxItems": 3,
              "minItems": 3,
              "type": "array"
            },
            "type": "array"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Array"
    }
  },
  "required": [
    "name"
  ],
  "title": "RGBImage",
  "type": "object"
}

So I'm not sure what the current status is for in-memory JSON-like arrays, but it would be nice to use lists of lists (as has been discussed in a bunch of places over on the main linkml repo :)) instead of a flat list with indices, at least for dense arrays. This is another reason why we might want to make NDArray and SparseArray separable classes, since the flat model with adjoining indices is a perfectly good model for sparse arrays.
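
(For concreteness, here is a rough sketch of the two representations being contrasted, with invented field names rather than anything from the current schema: a dense array as nested lists whose shape is implicit in the nesting, vs. a flat list of values plus indices, which remains a natural fit for sparse arrays.)

# Illustrative only -- field names here are made up for this sketch.

# Dense 2x3 array as a list of lists: the shape is implicit in the nesting.
dense = {
    "temperatures_in_K": [
        [300.0, 301.5, 299.8],
        [298.2, 300.1, 302.4],
    ]
}

# Flat representation: a list of values plus the indices that give them
# shape/position, roughly what the current elements slot resembles and a
# perfectly good model for sparse arrays.
sparse = {
    "values": [300.0, 302.4],
    "row_index": [0, 1],
    "col_index": [0, 2],
}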

@rly
Collaborator

rly commented Feb 3, 2024

Yep, we were thinking of a list of lists for the yaml/json representation of an array. We wrote out an example, but it got lost in another repo... We should add that back. The list of lists representation should be the default, since it is the most language/library/backend-agnostic representation even if terribly inefficient.

@cmungall
Member

cmungall commented Feb 5, 2024

I think the time has come to have better first-class support for LoLs (and LoLoLs...). I was initially very hesitant to make major changes to the core LinkML instance datamodel here: we have a tradition of isomorphism with RDF, SQL, etc., and LoLs or NDArrays aren't native to those formalisms.

Some history here:

The current version of linkml-arrays is layered on top of the LinkML instance model, hence the need for a mapping between native instances (flat lists) and the NDArray implementation class, which necessitates "implementation details" like specifying array ordering, something both you and Oliver have rightly criticised as mixing implementation and modeling concerns (#6). But this has always been a short-term measure! We have succeeded in the first phase of bootstrapping ourselves using implementation classes; once we have first-class support we can jettison some of the machinery we originally needed.

If we add a metaslot array_shape to the metamodel (directly, not as an implementation class) we can have a more direct representation of e.g. a 3D matrix:

classes:
  TemperatureDataset:
    attributes:
      temperatures_in_K:
        range: float
        multivalued: true  # now implicit due to "array_shape" below
        required: true
        unit:
          ucum_code: K
        array_shape:
          exact_dimensions: 3
          # optional additional metadata here

This has a direct JSON Schema mapping, with a natural LoLoL JSON/YAML serialization. Things get a bit more awkward when serializing to RDF -- we can discuss this at the hackathon. And of course we will keep the existing pydanticgen hooks so you can serialize the array "offline".
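
(As a small sanity check of that mapping -- a sketch using the generic jsonschema package, not any LinkML generator output -- a float attribute with exact_dimensions: 3 corresponds to a nested array-of-array-of-array-of-number schema, and a list-of-lists-of-lists instance validates against it directly.)

from jsonschema import validate

# Hand-written approximation of the JSON Schema a generator might emit for
# a float slot with array_shape.exact_dimensions: 3.
schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {
            "type": "array",
            "items": {"type": "number"},
        },
    },
}

# A 2 x 2 x 2 list-of-lists-of-lists instance validates as-is.
instance = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
validate(instance=instance, schema=schema)  # raises ValidationError on mismatch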

The RGB example in #5 can be compacted to a single attribute:

    rgb_matrix:
      range: float
      array_shape:
        exact_dimensions: 3
        array_axes:
          x:
          y:
          rgb:
            exact_cardinality: 3
            description: r, g, b values

Which when mapped to numpy and nptyping:

  rgb_matrix: Optional[NDArray[Shape["* x, * y, 3 rgb"], Float]] = Field(None)
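
(For reference, and not as generated output: nptyping supports runtime isinstance checks against this kind of shape expression, so a quick sanity check of the annotation above might look like the following sketch.)

import numpy as np
from nptyping import NDArray, Shape, Float

# An RGB image with arbitrary x/y extent and a trailing axis of length 3.
img = np.random.rand(4, 5, 3)
assert isinstance(img, NDArray[Shape["* x, * y, 3 rgb"], Float])

# A 2D array fails the check because it lacks the rgb axis.
assert not isinstance(np.random.rand(4, 5), NDArray[Shape["* x, * y, 3 rgb"], Float])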

We of course still want to support xarray-style DataArray containers as per #4. But this now becomes more optional: if you want a direct array representation without metadata about the axes, you can do that more concisely.

So the TemperatureDataset DataArray container might look like:

  TemperatureDataset:
    tree_root: true
    attributes:
      name:
        identifier: true
        range: string
      latitude_in_deg:
        required: true
        range: float
        multivalued: true
        unit:
          ucum_code: deg
        array_shape:
          exact_dimensions: 1
      longitude_in_deg:
        required: true
        range: float
        multivalued: true
        unit:
          ucum_code: deg
        array_shape:
          exact_dimensions: 1
      time_in_d:
        range: float
        multivalued: true
        implements:
          - linkml:elements
        required: true
        unit:
          ucum_code: d
        array_shape:
          exact_dimensions: 1
      temperatures_in_K:
        range: float
        multivalued: true
        required: true
        unit:
          ucum_code: K
        array_shape:
          exact_dimensions: 3
          array_axes:
            x:
              rank: 0
              alias: latitude_in_deg
            y:
              rank: 1
              alias: longitude_in_deg
            t:
              rank: 2
              alias: time_in_d

Here alias relates each axis in the array to the corresponding attribute in the container class.

There is a bit of redundancy between specifying shapes in the NDArray vs. the container DataArray, but I think the flexibility is useful, and it could easily be checked for inconsistencies at schema compile time.

We can also keep supporting standalone NDArray classes for things like the LatitudeSeries, NWB-style.

@sneakers-the-rat
Author

I love this!!!! That is close to exactly how I would specify it if I had the choice :).

Understood re: temporary measures! Would love to help out with the specification and code generation here!

I'm packing right now to come to the hackathon, but a brief note: for the sake of giving ourselves futureproofing room, adding as few terms to the base model as possible, and also modularizing array logic, here are two alternative spitballed syntaxes:

array property

First, we could make a set of models for axes, dimensions, etc. under a single array property, such that a) all the properties of an array are specified in one grouped place, but also b) the different groups of terms can be reused if needed in other parts of the spec. In this case, the quality of array-ness is indicated unambiguously by the presence/absence of the array property; otherwise it might be difficult to determine when array-like processing of a class/slot is necessary if there are several properties (like array_shape) that could indicate it. This is similar to how you have it above:

Examples

Exactly 3 unspecified axes/dimensions

classes:
  TemperatureDataset:
    attributes:
      temperatures_in_K:
        range: float
        multivalued: true
        required: true
        unit:
          ucum_code: K
        array:
          dimensions: 3

Exactly 3 named dimensions

array:
  axes:
    x:
      rank: 0
      alias: latitude_in_deg
    y:
      rank: 1
      alias: longitude_in_deg
    t:
      rank: 2
      alias: time_in_d

At least 3, at most 5, or between 3 and 5 anonymous dimensions

array:
  dimensions:
    min: 3

  # or
  dimensions:
    max: 5

  # or
  dimensions:
    min: 3
    max: 5

One specified, named dimension, and any number of other dimensions

array:
  dimensions:
    min: 1
    # optionally, to be explicit:
    max: null
  axes:
    x:
      rank: 0
      alias: latitude_in_deg
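
(Purely as a thought experiment on the semantics of the min/max dimension constraints above -- nothing here is generated by LinkML, and the helper is hypothetical -- they could be read as bounds on the nesting depth of the serialized list of lists.)

def ndim(value) -> int:
    """Count the nesting depth of a list-of-lists value (hypothetical helper)."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth

# dimensions: {min: 3, max: 5} would then accept depths 3, 4, or 5.
arr = [[[1.0, 2.0], [3.0, 4.0]]]   # depth 3
assert 3 <= ndim(arr) <= 5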

Expanding implements

Another way: I like the idea of implements as being "not a class inheritance, not a mixin, but still an indication of type." Specifying array is a different kind of property than, e.g., specifying range, in that it changes how the slot/class/etc. is to be interpreted in a pretty big way. We could do the above but as a parameterization of implements. That gives us lots of room in the future for plugins and other interesting implementations without necessarily needing to modify the core schema: a plugin could indicate a parameterizable implements, and then the generators could accept hooks that know how to render that type of implements.

classes:
  RGBMatrix:
    range: float
    implements:
      - type: linkml:ndarray
        dimensions: 3
        axes:
          x:
          y:
          rgb:
            cardinality: 3
            description: r, g, b values

and so on for the other examples.

Quick idea, looking forward to hashing this out tomorrow :)
