Upgrade PDAL to 2.6.0 - schema extraction changes #924

olsen232 · 2023-10-17T21:28:34Z

Kart extracts the schema for tile-based datasets and puts them in a meta item called schema.json.
Unlike tabular datasets, this information is not necessary for decoding the dataset contents - the dataset contents are all valid tiles and are all self-explanatory, when read with the right software - it is more for informational purposes - it highlights what the current schema is, it highlights when any changes are made to the schema, it ensures there is only one valid schema for the entire dataset, and extra tooling that needs to read the schema can be added without reading individual tiles.

For this reason, the exact format of the schema file is a bit arbitrary compared to the tabular datasets, where Kart stores info that it needs for its own purposes. The format of the tile-based dataset schema files was intended to be as similar as possible to the tabular dataset schema file, since there's no point having more than one syntax to mean the exact same thing (ie, in all cases an int16 type is described using the properties `{"dataType": "integer", "size": 16}, since that is the format that was chosen for tabular).

PDAL 2.6.0 changes how it interprets schemas: PDAL/PDAL#4186 - specifically, instead of having one "ClassFlags" field it now has multiple 1-bit flags fields. This change seems sensible so we should incorporate it going forward, to remain in sync with PDAL - even though we don't need the schema to have any particular format, agreeing with PDAL on the number and name of the dimensions seems sensible and may avoid issues later if we add more PDAL-based functionality to Kart.

An issue is how to avoid this change in interpretation showing up as a diff in existing point-cloud repos. Elsewhere in Kart we have code that recognises spurious diffs that occur for other reasons and detects them, and we can use something similar here.

Revisiting the point-cloud schema, I found some other potential issues with our representation of the schema:

we output the type that PDAL exposes, rather than the type that is stored - eg for fixed point fields, PDAL reads it as floating point, so we describe that field as floating point. This was by design and is maybe okay as-is but would be good to have some more clarity on this by looking into likely the use-cases of likely consumers of this data - do they need to know the exact storage format of a field so that they can, eg, predict what level of detail can be stored in that file, or is the main benefit just in knowing what fields are there at all.
we don't document whether a type is unsigned or not. Again, this may be more detail than is necessary, but for raster files, we do store whether or not a field is unsigned, so this is an inconsistency.
(most in need of fixing) PDAL exposes the size of each field in bytes, but we use bits. However, we give the wrong number of bits for any fields that are not a whole number of bytes, eg single bit flags (we can't use PDAL to tell us this info, we have to hard code it based on the spec, but this is not actually much work - it's just a case of making sure we change it in a backwards compatible way, as above).
Again, it's not 100% clear if exposing all of this information is necessary - at least some of it is useful, but somewhere we have to draw the line - but right now some of it the exposed information is wrong or at least inconsistent.

tl:dr;

Upgrade to PDAL 2.6.0 and see that extracted schemas change slightly, write diff-suppression code so that existing datasets don't change schemas
While making this backwards compatible change, fix existing schemas by at least fixing the reported field-size for 1-bit flags etc, and maybe adding "unsigned": True, and maybe also adding info about fixed-point fields.

The text was updated successfully, but these errors were encountered:

craigds · 2023-10-17T22:44:32Z

perhaps we shouldn't even be storing schema.json for pointclouds? since, afaik it is 100% derived from the LAS PDRF, which is stored separately and more compactly in format.json anyway

olsen232 · 2023-10-17T22:59:20Z

That's a good point - we would lose two things but both of debatable value:

the schema, although more verbose, is simpler to understand than the PDRF number. LAS authors will probably be able to take a PDRF and use the specification to derive the schema, but perhaps its nice that Kart does it for them - and cloners / consumers of a kart repo are not guaranteed to know everything about the data in it, so making things more explicit could be useful to them.
currently it appears to be 100% derived from the LAS PDRF, but currently I don't have any custom-schema LAS files - you're allowed to add extra fields to every point and document what the fields are in a VLR, but I have yet to see it done. If I find some / construct some, it'd be worth seeing what PDAL does with them

olsen232 · 2023-11-16T03:43:45Z

Fixed by
#928
#937

olsen232 self-assigned this Oct 17, 2023

olsen232 mentioned this issue Oct 19, 2023

Tracking bug for metadata-only-import from S3 or other hosted storage #905

Open

olsen232 closed this as completed Nov 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade PDAL to 2.6.0 - schema extraction changes #924

Upgrade PDAL to 2.6.0 - schema extraction changes #924

olsen232 commented Oct 17, 2023

craigds commented Oct 17, 2023

olsen232 commented Oct 17, 2023

olsen232 commented Nov 16, 2023

Upgrade PDAL to 2.6.0 - schema extraction changes #924

Upgrade PDAL to 2.6.0 - schema extraction changes #924

Comments

olsen232 commented Oct 17, 2023

craigds commented Oct 17, 2023

olsen232 commented Oct 17, 2023

olsen232 commented Nov 16, 2023