Validate character encoding #18

rotu · 2024-04-03T16:12:39Z

Where the schema expects a declared character encoding, validate that it is one of the character encodings recognized by the compiler. When using a smart editor, this makes it easier to author a .ksy file.

I took this list from the list of recognized encodings in the Kaitai Struct compiler.

GreyCat · 2024-04-04T11:50:34Z

ksy_schema.json

+        "windows-1258",
+        "IBM437",
+        "IBM866"
+      ]


We'll have to keep this list in sync with https://github.com/kaitai-io/kaitai_struct_compiler/blob/master/shared/src/main/scala/io/kaitai/struct/EncodingList.scala

I wonder if there is any easy way out here? I can imagine:

some tool which will copy ksc's EncodingList.scala list to here automatically

some tool which will copy this file to EncodingList.scala

getting ksc to read ksy_schema.json as source of truth (but then we'll need to keep a list of aliases somewhere?)

> some tool which will copy ksc's EncodingList.scala list to here automatically

I think this is the way to go (but I don't think it has to be "automatically"). Practically speaking, I'm thinking of adding a CI job to kaitai_struct_compiler that compares EncodingList.scala with the latest ksy_schema.json. If it finds any inconsistencies, it can do one of two things:

1. just fail the CI job and thus prompt a human operator to look into the CI log, which would contain the details about the failed comparison. The human would then go here to https://github.com/kaitai-io/ksy_schema and update the list manually.

This should be simple to implement and would probably do the job well enough. Yes, the updating would still need to be done manually, but I don't think the encoding list in KSC will be changing so rapidly that this becomes annoying.

2. have @kaitai-bot open a PR with the update in https://github.com/kaitai-io/ksy_schema/pulls and wait for the human operator to merge it. This would perhaps be more convenient if we were changing the encoding list in KSC often, but as I already mentioned, I don't think we will, so this is probably overkill. It adds a bunch of implementation challenges on top of variant 1: for example, editing the JSON file so that the overall formatting stays intact, opening a PR via some API, etc.

To be honest, I don't think we need this; it seems overengineered. I expect fewer than 10 updates of the encoding list in the next 5 years, and we can certainly handle that many manually.
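For illustration, the "compare and fail" CI check described above could be sketched roughly as follows. This is only a sketch under stated assumptions: it assumes that canonical encoding names appear in EncodingList.scala as map keys of the form "NAME" -> ..., which may not match the real file layout, and it takes the schema's encoding list as an already-extracted set, since where that list lives inside ksy_schema.json is not shown in this thread. The function names and the sample aliases are hypothetical.

```python
import re

# Assumption: canonical names are written as map keys, e.g.  "IBM437" -> Set(...)
# The real EncodingList.scala may be laid out differently.
CANONICAL_KEY = re.compile(r'"([^"]+)"\s*->')


def scala_encodings(scala_source: str) -> set[str]:
    """Collect every quoted string that is directly followed by '->'."""
    return set(CANONICAL_KEY.findall(scala_source))


def compare(compiler: set[str], schema: set[str]) -> list[str]:
    """Return one human-readable line per inconsistency (empty list if in sync)."""
    problems = [f"missing from schema: {name}" for name in sorted(compiler - schema)]
    problems += [f"not recognized by compiler: {name}" for name in sorted(schema - compiler)]
    return problems


# Made-up sample input; the alias names are illustrative only.
sample_scala = '''
  "IBM437" -> Set("cp437"),
  "IBM866" -> Set("cp866"),
  "windows-1258" -> Set("cp1258"),
'''
# Hypothetical schema list that has fallen behind the compiler.
sample_schema_enum = {"windows-1258", "IBM437"}

for line in compare(scala_encodings(sample_scala), sample_schema_enum):
    print(line)
```

A CI job would exit with a non-zero status whenever the returned list is non-empty, so that the log shows exactly which names need to be added or removed by hand.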

> We'll have to keep this list in sync

The same could be said for the list of attributes (e.g. the valid attribute, which is implemented by the compiler but missing from the spec; see kaitai-io/kaitai_struct#944).

> Practically speaking, I'm thinking of adding a CI job to kaitai_struct_compiler that compares EncodingList.scala with the latest ksy_schema.json

I think it would be ideal to add a CI job in https://github.com/kaitai-io/kaitai_struct_formats which checks that *.ksy files validate against the schema.
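As a rough idea of what the encoding part of such a check might look like: the sketch below walks an already-parsed .ksy document (plain dicts and lists) and reports every encoding value that is not in the allowed list. It is not a substitute for real schema validation; a real job would parse the YAML files and run a JSON Schema validator over them. Treating any key named encoding as an encoding declaration is a simplification, the allowed set contains only the three entries visible in the diff above, and the function names and sample document are hypothetical.

```python
def find_encodings(node, path="$"):
    """Yield (path, value) for every string-valued 'encoding' key in the tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "encoding" and isinstance(value, str):
                yield child, value
            else:
                yield from find_encodings(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_encodings(item, f"{path}[{index}]")


def unknown_encodings(doc, allowed):
    """Return the (path, value) pairs whose value is not an allowed encoding."""
    return [(path, value) for path, value in find_encodings(doc) if value not in allowed]


# Only the entries visible in the diff; the real list is longer.
ALLOWED = {"windows-1258", "IBM437", "IBM866"}

# Hypothetical parsed .ksy content, written as Python data for the example.
doc = {
    "meta": {"id": "example", "encoding": "IBM437"},
    "seq": [
        {"id": "name", "type": "str", "size": 8},
        {"id": "title", "type": "str", "size": 8, "encoding": "cp866"},
    ],
}

for path, value in unknown_encodings(doc, ALLOWED):
    print(f"{path}: unrecognized encoding {value!r}")
```

Because the comparison is an exact string match, as with a JSON Schema enum, an alias such as cp866 is reported even though the canonical name IBM866 is allowed.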

GreyCat

Generally looks good to me. If anything, that's a great first step towards user-friendly canonicalization.

generalmimon · 2024-04-04T13:08:49Z

@rotu On the one hand, naming Git branches after animals is kind of funny (I wonder what tool does that?), but on the other hand, sensible branch names that at least slightly indicate the content are better. It's not that different from the default naming scheme that GitHub uses for edits created via the Web GUI (patch-1, patch-2 etc.), but we can probably agree that this is not exactly best practice either.

Speaking of which, please try to write at least a brief explanation in the pull request description. In general, every change deserves it.

rotu · 2024-04-04T14:05:20Z

> @rotu On the one hand, naming Git branches after animals is kind of funny (I wonder what tool does that?), but on the other hand, sensible branch names that at least slightly indicate the content are better. It's not that different from the default naming scheme that GitHub uses for edits created via the Web GUI (patch-1, patch-2 etc.), but we can probably agree that this is not exactly best practice either.

This is the default behavior of Visual Studio Code. The branch names should not be conveying meaning - please see the PR for canonical description.

> Speaking of which, please try to write at least a brief explanation in the pull request description. In general, every change deserves it.

I thought the title of the PR was sufficient but I'll elaborate.

generalmimon · 2024-04-04T14:17:48Z

> The branch names should not be conveying meaning - please see the PR for canonical description.

Why shouldn't they?

> I thought the title of the PR was sufficient but I'll elaborate.

It's not just about what you did (commit / PR titles and code are probably indeed sufficient for that), but why and how you did it are often important questions. For example, here we had to guess that you took the list from EncodingList.scala, which you didn't mention. In kaitai-io/kaitai_struct#1101, @GreyCat had to ask you why you're making that change.

And I'm not talking just about this PR, but in general - most of the time, leaving PR descriptions empty is a bad habit.

rotu · 2024-04-04T14:39:55Z

> > The branch names should not be conveying meaning - please see the PR for canonical description.
>
> Why shouldn't they?

Here's why I don't like naming my branches: (1) branch names are immutable, even if the scope of a PR changes or you named the branch poorly to start with; (2) a meaningful branch name creates a conflicting source of truth for the intent of a branch.

> > I thought the title of the PR was sufficient but I'll elaborate.
>
> It's not just about what you did (commit / PR titles and code are probably indeed sufficient for that), but why and how you did it are often important questions. For example, here we had to guess that you took the list from EncodingList.scala, which you didn't mention. In kaitai-io/kaitai_struct#1101, @GreyCat had to ask you why you're making that change.
>
> And I'm not talking just about this PR, but in general - most of the time, leaving PR descriptions empty is a bad habit.

Good points!

Validate character encoding

7830488

GreyCat reviewed Apr 4, 2024

GreyCat self-requested a review April 4, 2024 11:50

GreyCat approved these changes Apr 4, 2024

generalmimon approved these changes Apr 4, 2024
