Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade PDAL to 2.6.0 - schema extraction changes #924

Closed
2 tasks
olsen232 opened this issue Oct 17, 2023 · 3 comments
Closed
2 tasks

Upgrade PDAL to 2.6.0 - schema extraction changes #924

olsen232 opened this issue Oct 17, 2023 · 3 comments
Assignees

Comments

@olsen232
Copy link
Collaborator

Kart extracts the schema for tile-based datasets and puts them in a meta item called schema.json.
Unlike tabular datasets, this information is not necessary for decoding the dataset contents - the dataset contents are all valid tiles and are all self-explanatory, when read with the right software - it is more for informational purposes - it highlights what the current schema is, it highlights when any changes are made to the schema, it ensures there is only one valid schema for the entire dataset, and extra tooling that needs to read the schema can be added without reading individual tiles.

For this reason, the exact format of the schema file is a bit arbitrary compared to the tabular datasets, where Kart stores info that it needs for its own purposes. The format of the tile-based dataset schema files was intended to be as similar as possible to the tabular dataset schema file, since there's no point having more than one syntax to mean the exact same thing (ie, in all cases an int16 type is described using the properties `{"dataType": "integer", "size": 16}, since that is the format that was chosen for tabular).

PDAL 2.6.0 changes how it interprets schemas: PDAL/PDAL#4186 - specifically, instead of having one "ClassFlags" field it now has multiple 1-bit flags fields. This change seems sensible so we should incorporate it going forward, to remain in sync with PDAL - even though we don't need the schema to have any particular format, agreeing with PDAL on the number and name of the dimensions seems sensible and may avoid issues later if we add more PDAL-based functionality to Kart.

An issue is how to avoid this change in interpretation showing up as a diff in existing point-cloud repos. Elsewhere in Kart we have code that recognises spurious diffs that occur for other reasons and detects them, and we can use something similar here.

Revisiting the point-cloud schema, I found some other potential issues with our representation of the schema:

  • we output the type that PDAL exposes, rather than the type that is stored - eg for fixed point fields, PDAL reads it as floating point, so we describe that field as floating point. This was by design and is maybe okay as-is but would be good to have some more clarity on this by looking into likely the use-cases of likely consumers of this data - do they need to know the exact storage format of a field so that they can, eg, predict what level of detail can be stored in that file, or is the main benefit just in knowing what fields are there at all.
  • we don't document whether a type is unsigned or not. Again, this may be more detail than is necessary, but for raster files, we do store whether or not a field is unsigned, so this is an inconsistency.
  • (most in need of fixing) PDAL exposes the size of each field in bytes, but we use bits. However, we give the wrong number of bits for any fields that are not a whole number of bytes, eg single bit flags (we can't use PDAL to tell us this info, we have to hard code it based on the spec, but this is not actually much work - it's just a case of making sure we change it in a backwards compatible way, as above).
    Again, it's not 100% clear if exposing all of this information is necessary - at least some of it is useful, but somewhere we have to draw the line - but right now some of it the exposed information is wrong or at least inconsistent.

tl:dr;

  • Upgrade to PDAL 2.6.0 and see that extracted schemas change slightly, write diff-suppression code so that existing datasets don't change schemas
  • While making this backwards compatible change, fix existing schemas by at least fixing the reported field-size for 1-bit flags etc, and maybe adding "unsigned": True, and maybe also adding info about fixed-point fields.
@olsen232 olsen232 self-assigned this Oct 17, 2023
@craigds
Copy link
Member

craigds commented Oct 17, 2023

perhaps we shouldn't even be storing schema.json for pointclouds? since, afaik it is 100% derived from the LAS PDRF, which is stored separately and more compactly in format.json anyway

@olsen232
Copy link
Collaborator Author

That's a good point - we would lose two things but both of debatable value:

  • the schema, although more verbose, is simpler to understand than the PDRF number. LAS authors will probably be able to take a PDRF and use the specification to derive the schema, but perhaps its nice that Kart does it for them - and cloners / consumers of a kart repo are not guaranteed to know everything about the data in it, so making things more explicit could be useful to them.
  • currently it appears to be 100% derived from the LAS PDRF, but currently I don't have any custom-schema LAS files - you're allowed to add extra fields to every point and document what the fields are in a VLR, but I have yet to see it done. If I find some / construct some, it'd be worth seeing what PDAL does with them

@olsen232
Copy link
Collaborator Author

Fixed by
#928
#937

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants