Dumping arrays into Base64 encoding #9

Open
bendichter opened this issue Feb 6, 2024 · 3 comments

Comments

@bendichter

Prototype JSON Schema for expressing an array in Base64:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Base64 Encoded n-Dimensional Array",
  "type": "object",
  "properties": {
    "binaryData": {
      "type": "string",
      "description": "The array data encoded as a Base64 string."
    },
    "shape": {
      "type": "array",
      "items": {
        "type": "integer",
        "minimum": 0
      },
      "description": "The shape of the array, specifying size in each dimension."
    },
    "dtype": {
      "type": "string",
      "pattern": "^[<=>|]?((int|uint|float|complex)[0-9]+|bool)$",
      "description": "The data type of the array elements, optionally prefixed with a byte-order character."
    },
    "ordering": {
      "type": "string",
      "enum": ["C", "F"],
      "description": "The memory ordering of the array. 'C' for row-major order, 'F' for column-major order."
    },
    "encoding": {
      "type": "string",
      "enum": ["Base64"],
      "default": "Base64",
      "description": "The encoding used for binaryData. Currently, only Base64 is supported."
    }
  },
  "required": ["binaryData", "shape", "dtype", "ordering"],
  "additionalProperties": false
}
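
A quick way to sanity-check the `dtype` pattern is to run it against a few dtype strings. The sketch below uses only Python's `re` module. The `GROUPED` pattern is an assumption about the intended meaning (optional byte-order prefix, then either a sized numeric type or `bool`, anchored at both ends), and `matches` is a hypothetical helper name.

```python
import re

# Ungrouped form: the first alternative has no trailing "$",
# so anything that merely starts with e.g. "float64" is accepted.
UNGROUPED = r"^[<=>|]?(int|uint|float|complex)[0-9]+|^[<=>|]?bool$"
# Assumed intent: anchor both alternatives at both ends.
GROUPED = r"^[<=>|]?((int|uint|float|complex)[0-9]+|bool)$"

def matches(pattern: str, dtype: str) -> bool:
    """Return True if the dtype string satisfies the pattern.

    JSON Schema's "pattern" keyword is not implicitly anchored,
    so re.search mirrors its behaviour.
    """
    return re.search(pattern, dtype) is not None

for candidate in ["float64", "<int32", "bool", "float64garbage"]:
    print(candidate, matches(UNGROUPED, candidate), matches(GROUPED, candidate))
```

Only the last candidate should differ between the two patterns: the ungrouped form lets trailing characters through, the grouped one rejects them.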
@bendichter

Writing:

import numpy as np
import base64
import json

def write_array_to_json(array: np.ndarray, file_path: str):
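    # Determine the array's ordering; arrays that are both C- and F-contiguous
    # (e.g. 1-D arrays) are reported as 'C'
    ordering = 'F' if array.flags['F_CONTIGUOUS'] and not array.flags['C_CONTIGUOUS'] else 'C'

    # Serialize the bytes in that same order (tobytes() defaults to C order,
    # which would not match an 'F' label), then encode them as Base64
    binary_data = array.tobytes(order=ordering)
    encoded_data = base64.b64encode(binary_data).decode('utf-8')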
    
    # Construct the JSON object
    array_metadata = {
        "binaryData": encoded_data,
        "shape": array.shape,
        "dtype": str(array.dtype),
        "ordering": ordering,
        "encoding": "Base64"
    }
    
    # Write the JSON object to a file
    with open(file_path, 'w') as json_file:
        json.dump(array_metadata, json_file)

# Example usage
array = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
write_array_to_json(array, 'array_data.json')

Reading:

def read_array_from_json(file_path: str) -> np.ndarray:
    # Read the JSON object from the file
    with open(file_path, 'r') as json_file:
        array_metadata = json.load(json_file)
    
    # Decode the Base64-encoded data
    binary_data = base64.b64decode(array_metadata["binaryData"])
    
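    # Reconstruct the array using the decoded data and metadata.
    # np.frombuffer returns a read-only view of the bytes, so copy to get a writable array.
    array = np.frombuffer(binary_data, dtype=array_metadata["dtype"])
    array = array.reshape(array_metadata["shape"], order=array_metadata["ordering"]).copy(order=array_metadata["ordering"])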
    
    return array

# Example usage
reconstructed_array = read_array_from_json('array_data.json')
print(reconstructed_array)
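
Because `ndarray.tobytes()` defaults to C order, the order passed to `tobytes` and the recorded `ordering` need to agree for Fortran-ordered arrays to round-trip. Here is a minimal in-memory check of that, with `encode_array` and `decode_array` as hypothetical helper names (not part of any library):

```python
import base64

import numpy as np

def encode_array(array: np.ndarray) -> dict:
    """Encode an array as a dict shaped like the schema above (hypothetical helper)."""
    # Report 'F' only when the array is Fortran-contiguous and not also C-contiguous.
    is_fortran = array.flags["F_CONTIGUOUS"] and not array.flags["C_CONTIGUOUS"]
    ordering = "F" if is_fortran else "C"
    # Serialize the bytes in the same order that is recorded in the metadata.
    binary_data = array.tobytes(order=ordering)
    return {
        "binaryData": base64.b64encode(binary_data).decode("ascii"),
        "shape": list(array.shape),
        "dtype": str(array.dtype),
        "ordering": ordering,
        "encoding": "Base64",
    }

def decode_array(doc: dict) -> np.ndarray:
    """Rebuild the array from the dict produced by encode_array."""
    raw = base64.b64decode(doc["binaryData"])
    flat = np.frombuffer(raw, dtype=doc["dtype"])
    # frombuffer gives a read-only view; copy to get a writable array.
    return flat.reshape(doc["shape"], order=doc["ordering"]).copy(order=doc["ordering"])

c_array = np.arange(6, dtype=np.float64).reshape(2, 3)
f_array = np.asfortranarray(c_array)

for original in (c_array, f_array):
    restored = decode_array(encode_array(original))
    assert np.array_equal(original, restored)
```

Non-contiguous inputs (slices, transposed views of larger arrays) fall into the `'C'` branch here, since `tobytes(order='C')` produces a C-ordered copy of the data.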

@sneakers-the-rat

sneakers-the-rat commented Feb 6, 2024

Got this built into the model_dump_json method here:
https://github.com/p2p-ld/numpydantic/blob/9906e6c50799a6011bd8d6e8f8d4c7fb3bc5bf8c/numpydantic/ndarray.py#L70

I figure if we're b64 encoding we might as well compress :)

This is what I wanna do today anyway: we'll want to make a meta-schema that lets us express multiple encodings for a given array schema. That'll include a reference to what format the array is in, and then, for a given format like numpy, what information is needed to pack and unpack. I like what ya got there as a start for the numpy schema. You got a linkml version? I think we'll also want to store the numpy array format version, dtype, and order in a way that matches their format spec: https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format
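
One way to get the format version, dtype, and order in exactly the shape of that spec is to let `np.save` write the `.npy` bytes and Base64 those. This is only a sketch of that idea, not what numpydantic does; `array_to_npy_b64` and `npy_b64_to_array` are hypothetical names.

```python
import base64
import io

import numpy as np
from numpy.lib import format as npy_format

def array_to_npy_b64(array: np.ndarray) -> str:
    """Serialize to the .npy format in memory, then Base64-encode the bytes."""
    buffer = io.BytesIO()
    # allow_pickle=False keeps object arrays (and pickle) out of the payload.
    np.save(buffer, array, allow_pickle=False)
    return base64.b64encode(buffer.getvalue()).decode("ascii")

def npy_b64_to_array(encoded: str) -> np.ndarray:
    """Inverse of array_to_npy_b64."""
    buffer = io.BytesIO(base64.b64decode(encoded))
    return np.load(buffer, allow_pickle=False)

original = np.asfortranarray(np.arange(6, dtype=np.float64).reshape(2, 3))
encoded = array_to_npy_b64(original)
restored = npy_b64_to_array(encoded)

# Peek at the header the .npy format stores alongside the data.
header_buffer = io.BytesIO(base64.b64decode(encoded))
version = npy_format.read_magic(header_buffer)
if version == (1, 0):
    shape, fortran_order, dtype = npy_format.read_array_header_1_0(header_buffer)
    print(version, shape, fortran_order, dtype)
```

The tradeoff is that the metadata lives inside the binary blob instead of in readable JSON fields, so a non-numpy reader has to parse the `.npy` header itself.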

@bendichter

This is great, @sneakers-the-rat! I would say I'd probably prefer gzip as it comes with Python and does not have any external dependencies. Using a bool for F vs. C sounds fine.
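
A rough sketch of what gzip plus a bool for ordering could look like, using only the standard library's `gzip` and `base64` next to numpy. The field name `fortran_order` (borrowed from the numpy format spec header), the `"gzip+Base64"` encoding label, and both function names are assumptions for illustration, not an agreed schema.

```python
import base64
import gzip

import numpy as np

def encode_compressed(array: np.ndarray) -> dict:
    """gzip-compress the raw bytes, then Base64-encode them (hypothetical helper)."""
    fortran_order = bool(array.flags["F_CONTIGUOUS"] and not array.flags["C_CONTIGUOUS"])
    # Serialize in the order that the flag will describe.
    raw = array.tobytes(order="F" if fortran_order else "C")
    return {
        "binaryData": base64.b64encode(gzip.compress(raw)).decode("ascii"),
        "shape": list(array.shape),
        "dtype": str(array.dtype),
        "fortran_order": fortran_order,   # assumed field name
        "encoding": "gzip+Base64",        # assumed label
    }

def decode_compressed(doc: dict) -> np.ndarray:
    """Inverse of encode_compressed; returns a read-only array backed by the bytes."""
    raw = gzip.decompress(base64.b64decode(doc["binaryData"]))
    order = "F" if doc["fortran_order"] else "C"
    return np.frombuffer(raw, dtype=doc["dtype"]).reshape(doc["shape"], order=order)

array = np.zeros((100, 100), dtype=np.float64)
doc = encode_compressed(array)
restored = decode_compressed(doc)
assert np.array_equal(array, restored)
```

How much compression helps depends on the data: highly regular arrays like the zeros above shrink a lot, while noisy floating-point data may barely shrink at all.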
