
TOML support #248

Merged
merged 27 commits, Apr 7, 2021

Conversation

yawkat
Member

@yawkat yawkat commented Mar 25, 2021

This is WIP, creating a PR for discussion on the approach.

As discussed in #219, the TOML parser landscape in Java is mediocre. For this reason, I decided it was worth rolling our own, though I'm open to feedback on that decision.

TOML has an ABNF grammar available that I assume to be authoritative. The ABNF parser generator landscape is also not great in Java, so I decided to transform it by hand (potentially with errors?) into a jflex lexer file (toml.jflex), which generates a fairly efficient lexer using the jflex maven plugin. From that, I implemented a basic recursive descent parser (Parser.java). This approach also has the benefit of not requiring an external runtime dependency like antlr, unlike some existing parsers.
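To illustrate the lexer/parser split, here is a heavily simplified, hypothetical sketch — not the actual toml.jflex/Parser.java, which cover the full ABNF grammar. Tokens from a lexing stage feed a hand-written recursive descent routine, one method per grammar production:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a recursive descent parser over a token stream.
// Hypothetical: the real implementation uses a jflex-generated lexer
// and handles the full TOML grammar; this toy handles only bare
// "key = integer" pairs.
public class TinyTomlParser {
    private final String[] tokens; // pretend these came from the lexer
    private int pos;

    public TinyTomlParser(String input) {
        // "Lexer": split on whitespace; a real lexer tracks states and positions.
        this.tokens = input.trim().split("\\s+");
    }

    private String peek() { return pos < tokens.length ? tokens[pos] : null; }
    private String next() { return tokens[pos++]; }

    // document := keyValue*
    public Map<String, Long> parseDocument() {
        Map<String, Long> doc = new LinkedHashMap<>();
        while (peek() != null) {
            parseKeyValue(doc);
        }
        return doc;
    }

    // keyValue := KEY '=' INTEGER
    private void parseKeyValue(Map<String, Long> doc) {
        String key = next();
        if (!"=".equals(next())) {
            throw new IllegalStateException("expected '=' after key " + key);
        }
        doc.put(key, Long.parseLong(next()));
    }
}
```

The benefit of this structure is that each grammar production maps onto one parser method, which makes hand-checking against the ABNF feasible.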

I have created a test class that includes all examples from the TOML docs. Most already work, but there are some samples that don't yet conform, mostly parsing successfully when parsing should fail.

I do not expect serialization to be a big problem, so I am keeping that for the end.

A few things are still outstanding:

  • Should the toml parsers be encapsulated? Right now they are package-private. Most of Jackson offers some extensibility, but in my opinion, here it is beneficial to have the freedom to swap the parser implementation in the future.
  • Is ObjectNode really the best model to parse to at the start?
    • A full streaming solution is out because of the out-of-order nature of TOML
    • We need to somehow remember which objects are "explicitly" and "implicitly" declared to fully conform to the spec.
  • How to handle datetime? The TOML types map well to java.time, but not sure if this can be stored properly in an ObjectNode or how it's emitted by the parser I'll add eventually.
  • How to handle integer types? There are some flags on ObjectMapper for this. TOML has some requirements here ("Arbitrary 64-bit signed integers (from −2^63 to 2^63−1) should be accepted and handled losslessly", "Floats should be implemented as IEEE 754 binary64 values."), so it might be feasible to just use long and double, but we could also use BigDecimal/BigInteger like the code does right now.
  • What's the license for the toml spec? Using test cases from it may require a note in the project dir, though it should not affect the license of the final project, since we only use it for tests and generating code.
  • TOML has a compliance test suite, but I can't find the actual test files :) may be worth including in our test suite.
  • 2.13 backport? Maybe once this is done...

@yawkat yawkat changed the title from "basic toml parser" to "TOML support" Mar 25, 2021
@aalmiray

FWIW the Toml GitHub repository shows MIT as license. You can also see MIT at the bottom of https://toml.io/en/

@cowtowncoder
Member

Excellent. I haven't yet looked at code (will get to that later), but I can give some suggestions in the meantime.

  • Encapsulation: I think I'd hide them right away, sort of like how YAML backend tries to avoid exposing SnakeYAML, despite there not being much plan to swap it out on short term. I don't think that makes sense as an extension point without some planning (as in: if someone can show a workable way to expose, fine, but not just leaving it open by default)
  • ObjectNode: that works, I think, although does not have facility to store TOML-specific stuff. Not super difficult to replicate if need be; Properties backend uses something different I think. But whatever is used needs to expose JsonParser api, and that exists for ObjectNode (or JsonNode in general). TokenBuffer is an alternative but probably would not work as well wrt random access, re-sorting.
  • Datetime: there is the "Embedded value" concept both at streaming (JsonToken.VALUE_EMBEDDED_OBJECT) and JsonNode -- in latter case, JsonNodeFactory.rawValueNode() (note: there is some overlap with idea of "POJO node"), which produces POJONode. Simply means an opaque atomic value that may be serialized, deserialized, but not traversable on its own. As long as date/time deserializers take embedded objects, it should work through fine; if there are issues, I can help
  • Integer etc types: int/long/BigInteger should work fine with usual Jackson semantics (use the smallest that fits). For floating-point there are settings that default to double but can be changed on TokenStreamFactory -- we could make Toml variant default to "always use BigDecimal" and just let users opt-out if they want?
  • 2.13 backport: yes, I can help with that, should not be a big problem (some effort, but not hard conceptually)
  • Test suite, files: would definitely be great to have.

(Resolved review thread on toml/pom.xml, outdated)
@brunoborges

I was able to build this after adding a copy of PackageVersion.java.in from the yaml module and fixing the artifactId in the pom.xml.

Now to some play :-)

@yawkat
Member Author

yawkat commented Mar 25, 2021

I've changed the code to do state transitions inside the parser instead of the lexer now, since the parser has contextual information.

@cowtowncoder
Member

cowtowncoder commented Mar 25, 2021

One more thought on Date/Time: while embedding the value is easy enough, there are potential challenges in doing that at the streaming-parser level, as most handling usually lives in jackson-databind and not below.
This has two problems:

  1. Duplicating decoding, without access to configuration
  2. Coupling values to Java 8 date/time over f.ex Joda (not sure this is problematic any more)

And so I guess the alternative would be to simply expose these values as JsonToken.VALUE_STRINGs.
Could have a TomlParser.Feature.USE_JAVA8_DATE_TIME or something to toggle, and allow non-configurable Java 8 types at low-level OR "Jackson default" handling, should be easy enough.

Also note on Jackson 3.0.0-SNAPSHOT -- this may be somewhat volatile due to my still making wide-ranging changes in non-atomic manner. If so, just ping me. I will try to keep snapshots working of course but may sometimes miss things for a bit until finding out problems.

@yawkat
Member Author

yawkat commented Mar 25, 2021

@cowtowncoder I'm not sure about the config parts. The TOML datetime types are fixed in the grammar (see the ABNF), so allowing customization of the parsing does not make sense. The format is also limited enough so that we can just do LocalDateTime.parse and friends, without having to implement much decoding ourselves (except for handling a space instead of the T separator).
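As a sketch of the space-separator handling mentioned above (hypothetical helper name; the real parser works from the lexer tokens):

```java
import java.time.LocalDateTime;

// TOML permits a space instead of 'T' between the date and time parts
// of a local date-time; java.time's ISO parser expects 'T', so we
// normalize before delegating. Hypothetical helper for illustration.
public class TomlDateTimes {
    static LocalDateTime parseLocalDateTime(String text) {
        if (text.length() > 10 && text.charAt(10) == ' ') {
            text = text.substring(0, 10) + 'T' + text.substring(11);
        }
        return LocalDateTime.parse(text);
    }
}
```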

You're right about interaction with the deserializer though. What would happen if a user has an Instant in their data model, but the parser returns an OffsetDateTime with UTC+0 for example? InstantDeserializer handles this fine if it's a string, but I doubt it can do the conversion if it's a raw value instead.

Another thing is symmetry with the serializer... Not sure about that.

@cowtowncoder
Member

@yawkat Right, I don't mean changing the definition of the low-level data semantics at all, just one technical aspect relevant to piping within the Jackson pipeline. If the parser exposes something as LocalDate, databind will not have facilities to do the usual coercions (if any apply here); it is a hard-coded opaque value.

I think the one switch that might prove useful is really just "expose as String vs Java 8 Date/Time" and nothing more (and no extension points for more complex handling).
Then again, if this became an issue, adding other extension (low-level decoder) would be possible as well :)

But for now that's just an idea, no need to implement. Probably best to do parsing to the strict type first and see how well that works. Should be easy to change if need be.

@yawkat
Member Author

yawkat commented Mar 25, 2021

Re using something other than ObjectNode as the intermediate: I got away with creating subclasses of ArrayNode and ObjectNode with additional boolean flags. That sounds like a reasonable solution and doesn't require creating a whole class structure. It does remove the possibility of a custom JsonNodeFactory, but that's not so bad.
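The subclass-with-flags idea can be sketched with a Jackson-free analog (hypothetical names; the real code extends Jackson's ObjectNode/ArrayNode):

```java
import java.util.LinkedHashMap;

// Jackson-free analog of the approach described above: extend the
// regular object-node type with a flag recording whether the table was
// declared explicitly ([table] header) or only implicitly (via a
// dotted key), which the TOML spec needs tracked for conformance.
// Hypothetical sketch, not the actual implementation.
public class TomlTable extends LinkedHashMap<String, Object> {
    boolean explicitlyDeclared;
}
```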

@cowtowncoder re Temporal: Adding an option sounds reasonable and is no difficulty, I've gone ahead and done that. Has the added benefit of making the related tests work :)

There's also an option for Long/Double-based number parsing now.

@cowtowncoder
Member

@yawkat Ok: if using a custom sub-class, maybe we'll make sure this model is not directly exposed to users and is just used for internal processing. The node system is designed as extensible to allow adding arbitrary data, although that idea was never fully developed, as it turns out to be rather tricky to make actually work (there is the TreeNode interface, so JsonNode is theoretically but one model).

I probably should look at code first to make sure I understand the implementation, before more comments. :)

But progress sounds good so far!

@yawkat
Member Author

yawkat commented Mar 26, 2021

@cowtowncoder I've created a read feature flag for java.time parsing. This works fine for parsing a field of type Object, which is then set to LocalDate as expected, but for a field of type LocalDate, it errors:

com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Java 8 date/time type `java.time.LocalDate` not supported by default: add Module "com.fasterxml.jackson.datatype:jackson-datatype-jsr310" to enable handling
 at [No location information] (through reference chain: com.fasterxml.jackson.dataformat.toml.TomlMapperTest$ObjectField["foo"])

	at com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:65)
	at com.fasterxml.jackson.databind.DeserializationContext.reportBadDefinition(DeserializationContext.java:1807)
	at com.fasterxml.jackson.databind.deser.impl.UnsupportedTypeDeserializer.deserialize(UnsupportedTypeDeserializer.java:31)
	at com.fasterxml.jackson.databind.deser.impl.FieldProperty.deserializeAndSet(FieldProperty.java:138)
	at com.fasterxml.jackson.databind.deser.bean.BeanDeserializer._vanillaDeserialize(BeanDeserializer.java:326)
	at com.fasterxml.jackson.databind.deser.bean.BeanDeserializer.deserialize(BeanDeserializer.java:193)
	at com.fasterxml.jackson.databind.deser.DeserializationContextExt.readRootValue(DeserializationContextExt.java:268)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2536)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:1550)

This is obviously because there's no deserializer for LocalDate, and it looks for one even before seeing that the next token is a VALUE_EMBEDDED_OBJECT. Is there a way to work around this? It would be nice if the parser could support java.time natively without the java.time jackson module, since we don't need the special config options.

@cowtowncoder
Member

@yawkat Interesting... the "safety" feature of blocking java.time is tricky. I agree, it should not be blocked.
There is some recent work sort of related:

FasterXML/jackson-modules-java8#207

I can probably figure something out, will file an issue.

My first instinct is to change handling so that instead of failing on lookup, the code should probably construct a "potentially failing" deserializer that can handle java.time types as long as they come as VALUE_EMBEDDED_OBJECTs (nulls are not handled by deserializers, so that's fine).

Luckily 2.13 requires Java 8, so testing should be easy.

I'll first file an issue, and then hopefully have time to tackle it.

@cowtowncoder
Member

Also: sort of vaguely related: a new toy project I wrote (but not yet published):

https://github.com/cowtowncoder/Jackformer

allows "blind" NxN transformations and if we get this going, I'd really like to check out "YAML to TOML" case :)
(and vice versa).
Doing it strictly via JsonNode obviously cannot do anything super smart, but it'd be kind of interesting nonetheless.
I was especially proud to see that "Transform pom.xml into pom.properties" worked surprisingly well :)

public enum TomlReadFeature
        implements FormatFeature {
    USE_BIG_INTEGER_FOR_INTS(false),
Member

Maybe brief javadoc explanation?

Member Author

Still not sure if there's a better way to do this, e.g. the heuristic for BigInteger vs. smaller int types that JsonParser uses.

long v;
try {
    if (text.startsWith("0x") || text.startsWith("0X")) {
        v = Long.parseLong(text.substring(2), 16);
Member

May want to extract this, as well as the cases above, into a helper method that takes a String and a base, but then also catches the NumberFormatException and range errors that the JDK throws. This is because JDK exceptions are pretty unhelpful if I remember correctly, and we want to change these into something more meaningful.

It is unfortunate that helper methods from ParserMinimalBase are not available (as we can't quite use that without implementing token stream, something that JsonNode provides us), like:

  • _reportOverflowLong()
  • _reportInvalidNumber()

etc.

These are probably easier to add when there are tests for overflow, misformatted numbers etc.
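A minimal sketch of the helper suggested above (hypothetical names), wrapping the JDK parse so overflow and malformed input produce a message that names the offending token:

```java
// Sketch of the suggested extraction: parse a digit string in a given
// base, converting the JDK's terse NumberFormatException (thrown for
// both malformed input and out-of-range values) into an error message
// that includes the offending token. Names are hypothetical.
public class NumberParsing {
    static long parseLong(String digits, int base) {
        try {
            return Long.parseLong(digits, base);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Invalid base-" + base + " integer '" + digits
                + "' (malformed or out of range)", e);
        }
    }
}
```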

@cowtowncoder
Member

Looks good so far. Some suggestions:

  • ParserTest should probably be split into a few test files to make it easier to browse
  • Would be good to verify contents of exceptions (to some degree); I use verifyException() helper to match parts of the exception but I assume junit probably has similar helpers
  • Basic error handling tests via mapper (just to verify that IOException gets wrapped into JacksonException)
  • Replace asserts with a helper that throws some suitable "internal problem" exception (but not Error, IMHO)

I'd be happy to merge this whenever; probably makes sense to have a generator impl first?

@yawkat
Member Author

yawkat commented Mar 27, 2021

I just added a test case for the format from https://github.com/toml-lang/compliance. However, I did not add any actual test files. There are some in forks of that repo, but they still have outstanding issues: for example, sometimes temporal types have their precision truncated, and sometimes strings are double-escaped.

@yawkat
Member Author

yawkat commented Mar 27, 2021

@cowtowncoder Re BigI/D: I'm still not sure how to proceed here. Precedent in JsonParser is to determine whether a number is in range for int/long and use those types where possible. Maybe copying that behavior is a better solution than adding parser options for it. I'd have to adapt my tests to work with different number types, but that's a small issue.

Re splitting up ParserTest: Right now, 90% of those tests are copied examples from the manual, not sure by which criterion those could be split.

Re generator: Easy to implement in principle. Some decisions that will come up:

  • How to serialize temporals? Need some signal that we can serialize those directly. Implementing writePOJO is easy, but is there a way to make the ObjectMapper actually call writePOJO for the java.time types toml supports?
  • How to serialize strings? Could use only basic strings with the proper escapes for now, but maybe add a heuristic in the future. Maybe signal using annotations? Not sure if there's any way to pass those through, not worth the effort probably.
  • How to serialize arrays / objects? Serializing them inline is easy, but probably not what most users want. Would need some heuristic on when to use tables, when to use dotted key form, when to use inline objects/arrays. Maybe go full inline for now, can always change later. Maybe annotations here too if possible?

@yawkat
Member Author

yawkat commented Mar 27, 2021

I've added some exception message testing. I used the junit ExpectedExceptions mechanism, since it also allows checking the message.

I may also try to add some more meaningful parse errors for some of these.

@yawkat
Member Author

yawkat commented Mar 27, 2021

Okay, I've built a generator, mostly copied from the properties module. Not many tests yet, but it seems to work. Right now, it emits arrays inline, and objects as dotted keys (except when inside arrays), because that was straightforward.

Newlines are added where necessary (namely after every dotted key value pair). There is no indentation. Other whitespace is "normal", i.e. spaces around key value separators and after commas.

Tables are problematic, and may not be possible using a streaming generator. This is because when you use a table, you lose any ability to add scalar values to the root object.
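To make the trade-off concrete, here are three alternative TOML spellings of the same nested object (alternatives, not valid together in one document). The generator currently emits the dotted-key form; the table form is the one that conflicts with streaming, because once a table header is written, scalars can no longer be added to the root object:

```toml
# table form (problematic for a streaming generator)
[point]
x = 1
y = 2

# dotted-key form (what the generator emits)
point.x = 1
point.y = 2

# inline-table form
point = { x = 1, y = 2 }
```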

@cowtowncoder
Member

Ah. Yes, makes sense that tests were copied with minimal changes, good way to start.
It's not always clear how to split tests up; often by general type of input, or by focus, but there are obviously multiple ways.

I can probably help with number handling: agreed that for now not critical. I'd probably suggest something like:

  1. Floating-point: always use BigDecimal internally as the "native" format, then coerce as needed.
  2. Integral: copy from GeneratorBase: start with smallest applicable -- however, may be tricky with non-decimal values. Could consider "always BigInteger first, then compare" for non-decimal at least. Similarly, coerce as requested.
  3. ... and now that I said above, given that we use JsonNode backing, coercion should already occur (plus can fix model if there are issues)
  4. Even with node model, could do "smallest practical", eventually, for integers. Easy enough with incremental work.
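The "smallest applicable type" idea from point 2 can be sketched as follows (hypothetical helper, not the actual GeneratorBase logic): parse to BigInteger first, which is safe for any base and length, then narrow when the value fits.

```java
import java.math.BigInteger;

// Sketch of "start with the smallest applicable integral type":
// BigInteger.bitLength() gives the minimal two's-complement size
// excluding the sign bit, so < 32 fits in int and < 64 fits in long
// (including Integer.MIN_VALUE and Long.MIN_VALUE). Hypothetical
// helper, not the actual Jackson implementation.
public class IntegralNarrowing {
    static Object narrow(BigInteger value) {
        if (value.bitLength() < 32) {
            return value.intValue();   // fits in int
        }
        if (value.bitLength() < 64) {
            return value.longValue();  // fits in long
        }
        return value;                  // keep BigInteger
    }
}
```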

On the generator, good questions. With 3.0, in particular, it should be possible to simply use JsonGenerator.writePOJO(dateTime) and it should work, assuming the Java 8 date/time module is registered (or you get an exception if it is missing). We should just document that when using this format backend, one should always register that module.

Then again... when serializing with ObjectMapper, I think underlying generator will NOT see anything but Strings etc.
This could prove problematic when date/time module settings differ from those of TOML requirements.
I think this is something that would probably need some thought on how to improve: I think the Avro backend, and probably CBOR too, have something to say here. I have thought about the challenge of "Logical Types", which is what this is, I think (that is: CBOR, Protobuf, and Avro all have a physical shape (String, binary, LongInt) for data, but also a higher-level abstract type, like DateTime). The problem being that Jackson streaming components deal exclusively with the physical value, and databind currently handles all translations beyond that.

On Arrays... yeah, starting with inline sounds reasonable as the starting point.
On extensions: @JsonFormat does have shape, with options including Shape.OBJECT and Shape.ARRAY, but also Shape.NATURAL -- may or may not help here, but it is something that does exist. Although... not sure if it's applicable at the appropriate layer.

@cowtowncoder
Member

Ok good progress wrt generators! I think that getting reading/parser right is probably the most valuable thing anyway; and getting some valid output (even if missing configurability and ability to produce constructs like tables) is better than not having anything.

Question then would be when to merge. I am happy to give you push access to this repo to simplify changes.

I think I would like a 2.13 backport too, and can help getting that to work. Just need to think of how to make things then forward-mergeable (2.13 -> master) when initial merge is the "Wrong way around".
The reason wrt 2.13 is simply that I still have no confidence on timing of 3.0.0 release candidates; so much to achieve beyond still some basic structural changes (changing of coordinates is needed; figuring out exact versioning scheme wrt annotations).

@yawkat
Member Author

yawkat commented Mar 27, 2021

I can probably help with number handling: agreed that for now not critical. I'd probably suggest something like:

Implemented in bfc3ff3.

It sounds like the java.time serialization will take more work, so I'll leave that be for now. Not like it's all that important anyway, especially for the serialization part.


One todo I still have in the source is newline handling. The TOML spec says for multi-line string parsing:

TOML parsers should feel free to normalize newline to whatever makes sense for their platform.

I think it's fine to just not do that for now. Maybe an optional feature in the future.
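If that optional normalization were added later, it would amount to something like the following (hypothetical sketch, not current behavior):

```java
// Sketch of the optional newline normalization the TOML spec permits
// for multi-line strings: fold CRLF (and any bare CR) to LF.
// Hypothetical; the parser currently leaves newlines as-is.
public class Newlines {
    static String normalize(String multiLine) {
        return multiLine.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```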


Beyond that, the only thing I can think of before merging is the license question. Since TOML is MIT, and the samples for the unit tests are from their website, I guess a LICENSE file in the project folder (not in the pom / final artifact) should be sufficient.

For 2.13, I can do the backport, may take a few days before I have time though. I think the branch merge should be fairly painless, there's no conflict with other modules after all, so it should be possible to just commit either the 2.13 or the 3.0 changeset and then add the other version as the merge commit for the branches.

@yawkat
Member Author

yawkat commented Mar 30, 2021

Rebased onto master for ContentReference change

@yawkat
Member Author

yawkat commented Mar 30, 2021

@cowtowncoder I've created https://github.com/yawkat/jackson-dataformats-text/tree/toml-2.13 , which is a squashed version of this PR cherry-picked onto 2.13, with an additional commit 4df60a4 that addresses the compat issues. The tests pass, but I did a fairly stupid "fix errors as they come up" approach, so there may still be stuff that's broken.

@cowtowncoder
Member

@yawkat Ok; so how should I merge these? Does ordering matter? I think I'm happy as long as nothing there blocks work on other format modules (that is, passing of the test suite by whatever means is important :) )

@yawkat
Member Author

yawkat commented Apr 1, 2021

Sorry, a bit busy with moving atm :)

My suggestion would be merging this branch into master manually (should be safe), then merging master into 2.13 (should be no conflicts), and finally cherry picking 4df60a4 to move the toml module to 2.13. I think this shouldn't impact the ability to merge future changes in this repo between 2.13 and master.

@yawkat yawkat marked this pull request as ready for review April 1, 2021 18:48
@cowtowncoder
Member

@yawkat no prob wrt moving. I am pretty overloaded with work right now too.

Question on merging: not sure on the meaning of manual here -- do you mean just copying new files from here? Merging from master to 2.13 in general is a no-go, except for cherry-picking, since that'd bring in drastic 3.0 changes for other modules.

@yawkat
Member Author

yawkat commented Apr 3, 2021

By 'manual' I just mean doing it locally instead of through github web, in case it breaks somehow :P

I'm not sure about the mechanics of merging. Will merging master into 2.13 include the changes to other modules as well, even if you exclude them from the merge commit? Not sure how that works.

Anyway, merging in the other direction could work too. I've done it over in my repo. The layout is like this:
[screenshot of the commit graph]

The "TOML data format" commit 91625f5 is all the squashed changes in this PR. The "2.13 compatibility" commit fefd694 changes the code to 2.13, and that is the HEAD of the toml-merge-2.13 branch. Then, that branch is merged into toml-merge-master in commit 58239b4, and the 2.13 compat commit is reverted in "3.0 compatibility for toml module" (5d479e8).

Squashing the dev history of this PR makes handling the merge easier, and it's not a big issue. The duplicate compat commits are a bit ugly, but it works.

Both branches pass mvn clean verify fine.

@cowtowncoder
Member

Right, local in that sense works: I pretty much never use the GitHub GUI for anything except merging PRs.

As to merging from master to 2.13: yes, the whole repo and every change would be "brought back" so that does not work: I need to use cherry-picking. But that should be fine.
Changes for 2.13 are fine for that branch, but the usual flow is to merge all changes up to master, so we need to prevent that from happening. It is doable; I'll see what is the easiest way (possibly just tar'ing up sources from master, merging, untar'ing, and committing that... my git-fu is limited and brute-forcey :) ).

So I'll start with this PR.

@@ -0,0 +1,25 @@
Jackson is licensed under [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt).

The TOML grammar and some parts of the test suite are adapted from the TOML project, which is licensed under the MIT license:
Member

This is fine, although I wonder if some processing systems will barf on it.

One thing we could do, if we want to, would be to make this module MIT-licensed; that should be fine.
But I don't feel strongly about it; up to you.

Member Author

There's some precedent for "weird" LICENSE files in other projects: https://github.com/rzwitserloot/lombok/blob/master/LICENSE

One solution would be dual-licensing this module as ASL2.0 and MIT. That way, if tools parse this incorrectly as MIT they'll still be fine, and at the same time there's no license inconsistency under the jackson umbrella.

@cowtowncoder cowtowncoder merged commit 11bbf77 into FasterXML:master Apr 7, 2021
@cowtowncoder
Member

Ok, merged this PR, squashed, into master. So far so good. I can see the 2.13 commit you mention but am not sure how to merge it (it's on your forked repo?). Is it easy to create another PR targeting 2.13 with it, or create a diff I can just patch?
I think I understand the merge 2.13->master with revert on master.

@yawkat
Member Author

yawkat commented Apr 7, 2021

I've created a separate PR for 2.13 in #251. That one is independent of master and should be safe to merge.

Though I'm not sure how well the merge into master will work with this PR already included there.
