Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicit type specification for unions #36

Open
altendky opened this issue Jan 20, 2020 · 54 comments
Open

Explicit type specification for unions #36

altendky opened this issue Jan 20, 2020 · 54 comments
Labels
design discussion enhancement New feature or request

Comments

@altendky
Copy link
Member

Sometimes it would be good to store a type identifier in union fields. This would be especially true for sparsely populated data and unions containing str (described below).

Consider:

import typing

import attr
import desert


@attr.dataclass
class A:
    color: str


@attr.dataclass
class B:
    color: str


@attr.dataclass
class C:
    things: typing.List[typing.Union[A, B]]

We then create some instances:

instances = C(
    things=[
        A(color='red'),
        A(color='green'),
        B(color='blue'),
    ],
)

Theoretically that would serialize to (it doesn't seem to so there will presumably be another issue report):

{"things": [{"color": "red"}, {"color": "green"}, {"color": "blue"}]}

Which leaves us wondering what any of the things are.

On the flip side, if we take an example in the tests where there is a field t.Union[int, str] then adjust it to include a custom class after str such as t.Union[int, str, MyClass] then str will 'always' work and MyClass instances will be serialized to a str such as 'MyClass()' (well, for an attrs class with such a repr() anyways).

tl;dr, marshmallow-union has some pretty significant limitations. I got myself confused and started creating a solution over in marshmallow-polyfield Bachmann1234/marshmallow-polyfield#34 which would introduce an extra layer to the serialization and include a string to designate the type of that element. No more guessing.

{
    "things": [
        {
            "type": "A",
            "value": {
                "color": "red",
            }
        },
        {
            "type": "A",
            "value": {
                "color": "green",
            }
        },
        {
            "type": "B",
            "value": {
                "color": "blue",
            }
        }
    ]
}

Perhaps desert is interested in using marshmallow-polyfield for that feature? Perhaps desert would rather the feature be added to marshmallow-union?

In the mean time I'll try to find time to work my way through reporting and possibly fixing some other issues.

@python-desert
Copy link
Collaborator

python-desert commented Jan 20, 2020

Polyfield looks interesting. I think it'd be worthwhile to make sure we're not duplicating ideas that may have already been well worked out in, say, marshmallow-jsonschema.

@altendky
Copy link
Member Author

To get it on the record and out of my tab list...

OAI/OpenAPI-Specification#57 (comment)

@python-desert
Copy link
Collaborator

OpenAPI 3.0 uses an extended subset of JSON Schema Specification Wright Draft 00 (aka Draft 5) to describe the data formats. “Extended subset” means that some keywords are supported and some are not, some keywords have slightly different usage than in JSON Schema, and additional keywords are introduced.

https://swagger.io/docs/specification/data-models/keywords/

@altendky
Copy link
Member Author

As I think about looking at my own files I do really like the lack of an extra layer and the explicit type specification of every item. So I'm thinking that perhaps both options would be good. Of course, two ways has plenty of cost so...

I suspect the wrapper-layer approach is only going to be of significant interest to people that do not want types indicated in the serialized data but that also want functional unions. The wrapper layer is the least intrusive to the serialized form. This is a property of a field. Just union fields? Maybe, I haven't thought about any others. The identifer/type relationship is local.

The integrated form is nice when you do it everywhere. It is only one extra line per object and no indentation. This is more of a property of the class to be serialized. The identifier/type relationship is more 'global'. It doesn't have to literally be global but it isn't part of a field or a class.

I think the two may be able to cleanly coexist not only as features in desert but even in single serialized data blobs. I suppose there could be some silliness if you used different names for the two methods.

{
    "type": "MyClass",
    "value": {
        "_type": "color_stuff",
        "red": 0,
        "green": 1,
        "blue": 0.5
    }
}

So... meh. Maybe if all classes to be serialized in the union are internally typed then the union automatically drops the external typing layer? Too implicit? Maybe there are hazards in some scenarios such as serializing with one set of classes then adding a new option that isn't internally typed and the layers are expected but not present? The developer must choose but we give warnings or otherwise provide a means of testing that if their choice isn't obviously sensible?

There's also the matter of subclasses that I put in the WIP list over in polyfield but haven't thought through yet. IIRC, the types registered with the union are expecting to dict-lookup match the types of the actual objects.

No conclusions here I suppose, just some thoughts to revisit later.

@python-desert
Copy link
Collaborator

Where's the indication that 1 is a float? Do we only want that if the defining schema has a union field for "green" (which it presumably doesn't here)?

@altendky
Copy link
Member Author

Correct, no union for green. It was meant to just be a float.

The indication is in the schema (unless marshmallow does 1.0, I don't know). For people that want the integrated type form for native JSON (or more general) types, we could provide fields that have integrated types and a way to trigger them. Or I suppose the same for wrapped-layer.

For reference, I use decimal.Decimals as strings for all my 'floats'.

@python-desert
Copy link
Collaborator

python-desert commented Jan 27, 2020

If we are assuming the loader's schema is the same as the dumpers, and key ordering is canonicalized or maintained, there could be a List[int] as a top-level value where the ith element of the list gives the index of the Union[...] entry of the ith union-typed value in the data.

For example,

{
   "data": {"things": [{"color": "red"}, {"color": "green"}, {"color": "blue"}]},
   "union_indexes": [0,0,1],
   "schema_hash": "01abef8" # or at least "schema_name"
}

@obi1kenobi
Copy link

Hi all, I'm new to this project 👋Very interested in seeing this issue (and #89) get resolved. I've tried both dataclass-json and marshmallow-dataclass and gotten pretty disappointed, so I'm hoping desert can be the production-grade dataclass serialization library I need for the projects I maintain. Thank you for all the hard work you all have put into this library 🙏

I happened to come across a good writeup of the various options for explicit type specification for unions, courtesy of the Serde library that handles serialization in the Rust programming language: https://serde.rs/enum-representations.html

I think supporting a choice of explicit representation (and if possible, compatibility with Serde for cross-language applications) would really set this library apart from the competition, and if that's the direction chosen by the maintainers, I'd love to be part of that journey.

@python-desert
Copy link
Collaborator

Hey @obi1kenobi, welcome!

That Serde link gives a great rundown of our options. @altendky's polyfield link above looks like what Serde calls "adjacently tagged", and marshmallow-union looks like the "untagged" option.

I'm +1 on providing built-in support for all four options discussed there.

@altendky
Copy link
Member Author

So let's see if my original example can still be used to show the four options here.

Class Definitions

import typing

import attr
import desert


@attr.dataclass
class A:
    color: str


@attr.dataclass
class B:
    color: str


@attr.dataclass
class C:
    things: typing.List[typing.Union[A, B]]

Sample Data

instances = C(
    things=[
        A(color='red'),
        A(color='green'),
        B(color='blue'),
    ],
)

I'll use some 'non-standard' formatting and hopefully it'll end up more readable including considerations of fitting on the screen. I am also only applying it to the members of the attribute list. I think that may be an orthogonal question. 1) How to tag when being explicit and 2) whether to tag even when there isn't any ambiguity.

Externally Tagged

{
    "things": [
        {"A": {"color": "red"}},
        {"A": {"color": "green"}},
        {"B": {"color": "blue"}}
    ]
}

Internally Tagged

{
    "things": [
        {"type": "A", "color": "red"},
        {"type": "A", "color": "green"},
        {"type": "B", "color": "blue"}
    ]
}

Adjacently Tagged

{
    "things": [
        {"t": "A", "c": {"color": "red"}},
        {"t": "A", "c": {"color": "green"}},
        {"t": "B", "c": {"color": "blue"}}
    ]
}

Untagged

{
    "things": [
        {"color": "red"},
        {"color": "green"},
        {"color": "blue"}
    ]
}

If something looks wrong I think it will be worth updating in place here so just let me know and I'll tweak it.

Are the specific strings "type", "t", and "c" significant? For the internally tagged it seems maybe useful to default to something that can't be an attribute (normally anyways) like "@type" or whatever. For adjacent tagging I would have (well did :]) use full words.

As noted in the provided link (thanks), internal tagging doesn't apply to things serialized as JSON arrays rather than objects. If we wanted we could make the first element be the tag I suppose.

Side note, should we talk more about marshalling to Python dict/list/int/float/str/bool/None rather than serializing to JSON object/array/number/true/false/null? If we go big (instead of going home) and try to allow for multiple backends etc this seems maybe more relevant? I dunno, I don't have this big architecture plan in my head yet. Going big might also suggest it would be useful to implement the disambiguation in desert itself rather than delegating to a new implementation specific to each backend. Maybe...

@python-desert
Copy link
Collaborator

python-desert commented Mar 10, 2020

@altendky

Are the specific strings "type", "t", and "c" significant?

I'm guessing they come from the #[serde(tag = "t", content = "c")] declaration just above the enum.

@python-desert
Copy link
Collaborator

python-desert commented Mar 10, 2020

@altendky

Side note, should we talk more about marshalling to Python dict/list/int/float/str/bool/None rather than serializing to JSON object/array/number/true/false/null?

I think that's what we're doing already, since schema.dump(obj) returns a dict (or other appropriate type). Or do you mean something else?

@altendky
Copy link
Member Author

I may just mean that I've barely used desert yet, yeah. :| Though the terminology question still stands albeit only at 8 search hits to 1 for serialize vs. marshal.

@python-desert
Copy link
Collaborator

python-desert commented Mar 10, 2020

@altendky

  1. whether to tag even when there isn't any ambiguity.

Edit: moved this idea into a new issue.

@python-desert
Copy link
Collaborator

Will it be easiest for us to use polyfield as a dependency to get this feature or should we do it independently?

@python-desert
Copy link
Collaborator

After a brief look, I think polyfield is focused on unioning over schemas, not other types. So while that'd work fine in the example above, it'd be awkward in the case of int, float, or other non-struct types.

@python-desert
Copy link
Collaborator

python-desert commented Mar 10, 2020

If I'm thinking about this right we need

class AdjacentlyTaggedUnion(marshmallow.fields.Field): ...
class ExternallyTaggedUnion(marshmallow.fields.Field): ...
class InternallyTaggedUnion(marshmallow.fields.Field): ...
class UntaggedUnion(marshmallow.fields.Field): ...

and to pick a default for what typing.Union should go to. Since "internally tagged" doesn't work in arrays, "adjacent" and "external" ought to have higher development priority.

@sveinse
Copy link
Collaborator

sveinse commented Mar 10, 2020

I think this is a feature which is missing from desert, so I really like where this is going, but I can't help thinking that this is blurring the lines between desert and marshmallow. Or to put it another way: Shouldn't really these fields belong in marshmallow or a variant thereof? My impression is that desert exists as a shim on top of marshmallow to allow for dataclass and attr, but rely on marshmallow for the actual serialization. Won't putting type fields on unions into desert make desert and marshmallow incompatible? What is the intended relationship between desert and marshmallow?

@obi1kenobi
Copy link

I think this is a feature which is missing from desert, so I really like where this is going, but I can't help thinking that this is blurring the lines between desert and marshmallow. Or to put it another way: Shouldn't really these fields belong in marshmallow or a variant thereof? My impression is that desert exists as a shim on top of marshmallow to allow for dataclass and attr, but rely on marshmallow for the actual serialization. Won't putting type fields on unions into desert make desert and marshmallow incompatible? What is the intended relationship between desert and marshmallow?

I'm agnostic to the final decision on where this code should live — I'd be happy with anything that makes for working code. I just wanted to share this link, where as of about a year ago it seems to say that marshmallow has no plans to introduce a union field of any kind in the near future:
marshmallow-code/marshmallow#1191

I think @python-desert's suggested approach would produce field types that are compatible with marshmallow, and in principle could even be contributed upstream (or extracted into a separate marshmallow extension package) if everyone involved is amenable to that. So I think the first priority is getting to a working and stable implementation, since the paths from there onward don't seem to be too bad.

If I'm thinking about this right we need

class AdjacentlyTaggedUnion(marshmallow.fields.Field): ...
class ExternallyTaggedUnion(marshmallow.fields.Field): ...
class InternallyTaggedUnion(marshmallow.fields.Field): ...
class UntaggedUnion(marshmallow.fields.Field): ...

and to pick a default for what typing.Union should go to. Since "internally tagged" doesn't work in arrays, "adjacent" and "external" ought to have higher development priority.

This looks great to me, and I've given the choice of default some thought. I'd like to make a case for adjacently-tagged to be the default.

I think the default choice should always work, with minimum tweaking of configuration and other "if"s and "but"s.

  • This rules out untagged, since it's impossible to differentiate the string "123" from the string serialization of Decimal(123).
  • It also rules out interally-tagged, because we have to name the internal key that holds the name of the type — picking anything, including something like @type, means that we're betting that our users can't construct an object that has a value with that as its key, and that's not a bet I'd be willing to make (see this link for a situation where someone bet that users can't make a certain value occur, and lost).
  • That leaves adjacently-tagged and externally-tagged. Both of these are good options, and externally-tagged is Serde's default so it has that going for it as well. But externally-tagged unions have to cope with validating the serialized input in one additional way compared to adjacently-tagged — to reuse the example setup from above:
{
    "things": [
        {
            "A": {"color": "green"}, 
            "B": {"color": "blue"}
        },
    ]
}

The object inside things is invalid: it is supposed to be a single union field, but holds two union variants, an A and a B, within. This input wasn't generated by valid code, but people deserialize invalid things all the time — they might have been erroneously hand-written, or generated by another library. It's impossible for the deserializer to figure out which of the two variants should be the output here, and I'd argue this needs to result in an exception rather than a "best-effort" parse of some sort.

This situation isn't possible with adjacently-tagged unions. There, the "type" key can only have one value, and the "value" key only holds the data for one value. While there may be additional, unexpected keys in the dict, this is also true of externally-tagged unions, so the same error-handling code has to exist in both cases. But the "multiple-variants" error case is impossible by design, and I'd argue that's a win we'd like for our default choice.

So long as we pick an always-safe choice for the default, and allow users to consciously change that choice (if the new choice is also safe for their dataclasses) to get advantages like more compact representation, I think that will result in happy users. My team and I definitely will be happy 😄

@altendky
Copy link
Member Author

It seemed that if I finished it up that Bachmann1234/marshmallow-polyfield#34, which added adjacently tagged as an explicit union option, would get accepted. I would expect they would accept the rest. adamboche/python-marshmallow-union#27 refers over to marshmallow-polyfield. Or we could make our own third library for this... I have some hesitation if there are plans to support several backends that the number of libraries would grow annoying to tack on the features we need for each backend, but I may well be looking way too far ahead. And yeah, the code could be developed here in a PR, compared with marshmallow-polyfield, and then a decision made where the code should actually live. It would probably be nice to just work in one repo while developing unless we think that the probable final destination is polyfield. I'll have to look back at that, I don't really remember the structure over there but iirc I would have done it a bit differently if working from scratch.

@python-desert
Copy link
Collaborator

@sveinse

Shouldn't really these fields belong in marshmallow or a variant thereof?

It could make sense to put these changes in a more narrowly marshmallow-focused library. marshmallow-union could be a good fit. Though as @altendky says, working in one library is easier than coordinating changes across multiple (at least while we're still settling an API).

Won't putting type fields on unions into desert make desert and marshmallow incompatible?

Nothing particularly jumps out at me. What's the incompatibility you're thinking of?

@python-desert
Copy link
Collaborator

python-desert commented Mar 10, 2020

Let's talk interface.

thing: Union[A, B]

will be treated like

thing: Union[A, B] = AdjacentlyTaggedUnion(...)

What's the signature?

Edit: removed comments that didn't make sense.

@altendky
Copy link
Member Author

So yeah, there's the super flexible 'callable that decides how to deserialize' but that feels possibly excessively flexible to me. Though perhaps it is just a case of giving a default implementation that can be passed that doesn't involve that flexibility. I suspect that most of the time the user should have a registry of classes and their tags that are used to go back and forth between the two. That could be specific to the particular union but I would think that often it could just be 'global'. In graham I specify that in the class decorator. The registry could even allow for colliding names and only have an issue if the colliding classes end up used in an ambiguous place. Might want to have that be an opt-in loosening of the rules and error by default though.

So signature? Let's see if refreshing myself as to what I did over in polyfield ends up creating a useful summary for others. You need (with nominal defaults as reference):

  • class -> schema
    • desert.schema(cls)
  • class (or schema) -> tag
    • cls.__name__
  • tag -> class (or schema)
    • reverse above...
  • schema -> class
    • reverse above...

I reduced this to a class to schema mapping and a class to tag mapping that acted as an override for the default cls.__name__ behavior. Perhaps this 'override' ought to be brought out and handled separately. Another function that just returns a default class to name dict that you can .update(). Or maybe this is just another argument for an explicit registry class. I'll note that I did make the adjacent tagged schema a parameter as well so that it was somewhat configurable. Though I think not starting with that would be good. Once we have separate implementations we can identify overlap and ways to avoid repetition.

So if there's a registry, a parameter with it would simply be the signature for the fields? Or maybe leave it as a class-to-tag and a tag-to-class callable where normally people would specify my_registry.to_tag and my_registry.to_class. Except normally they wouldn't even do that and would just accept some global state holding a global registry? Or am I getting too evil here... either way, having the global registry is just a matter of registry = desert.Registry() and defaulting to it or whatever so it doesn't make a huge difference to design of the rest I don't think.

@obi1kenobi
Copy link

There may be some room for reasonable defaults and avoiding registries (especially global ones) in many common cases. While classes' __name__ values may collide, I suspect this is rather rare — and definitely something that can be checked for at schema construction time. So perhaps by default we could check the union variants and if their names are distinct, simply use them? And obviously allow (and if conflicting names are present, require) the user to provide their own variant-to-name mapping.

I have a mild preference for using dict() rather than callables to map (e.g.) between variants and tags, for the sake of easier validation and testing. Inverting a dict is relatively straightforward and easily checkable at schema construction time, while inverting general functions programmatically is beyond my skill level :)

@python-desert
Copy link
Collaborator

python-desert commented Mar 11, 2020

As I mentioned above, I think we want to diverge from polyfield in that all of those "schema"s in @altendky's bullet list should be fields for us. That way we can support non-dataclass union entries.

Do we just trust that python type<->field, python type <-> tag, field <-> tag are all one-to-one correspondences (bijective)? If they are, then it doesn't matter whether the tag comes from the python type or the field, but otherwise maybe it does. There are a bunch of "what if there are two _ for a given _" cases which I haven't thought through. Is it the type's job to determine the tag or the field's job?

@obi1kenobi
Copy link

If my understanding of the bijective correspondences question is correct, it seems to me that it may be possible to locally (i.e. on a per-union-field basis) at schema construction time verify the bijective property for the default (type-provided) mappings, or request that the user provide a custom (i.e. field-provided + also verified) one otherwise.

Does that sound reasonable, or am I missing some subtlety in the question? I admit I'm a newbie when it comes to marshmallow and especially desert — apologies for my ignorance.

@python-desert
Copy link
Collaborator

python-desert commented Mar 11, 2020

Suppose we're doing this for marshmallow independent of the dataclass logic. Do we allow

AdjacentlyTaggedUnion([AField(), BField()], ...)

or only

AdjacentlyTaggedUnion({'A': AField(), 'B': BField()}, ...)

@python-desert
Copy link
Collaborator

Eh let's start with the latter, we can always add more stuff if we want to.

@python-desert
Copy link
Collaborator

A concrete implementation of AdjacentlyTaggedUnion will probably give us more to talk about. Anybody interested in taking a crack at it?

@altendky
Copy link
Member Author

So we do have one over in the polyfield PR, but yes, I was considering needing some code to help me think. If someone else wants to take a stab, go for it. I don't know offhand if it will be hours or days or... until I get to it. Otherwise, yeah, I'll probably get to it at some point.

As to the registry and having a default, a piece of that idea is keeping all of the control in one place. If you can pass in a thing that has some logic and then the union field also provides some default logic then you have to go two places to figure out what is going on. If you put the default logic in a default registry then you only look at the registry. Also, the 'this set of "mappings" is consistent' check would be the responsibility of the registry object so the two callables passed to the field wouldn't be a thing the field would validate. It either works well or doesn't.

@altendky
Copy link
Member Author

altendky commented Mar 11, 2020

Mmm, not the 'default registry' being the 'one place' but that using a registry allows it to implement any logic and any validity checks it wants and it is the registry's responsibility to do the mapping. But yeah, I can't completely convince myself that a global registry is an evil thing for someone to want to use. If we don't provide one they'll just make it themselves. But yeah, using a registry for this purpose and having a global one are still two separate discussions I think.

@altendky
Copy link
Member Author

I guess there's also the fact that we'll have multiple fields with, for 'normal' cases, the same needs in converting between the tag and the field. So that mapping functionality and any default behavior should in some way be separated from the fields. Of course, a registry class is only one of the ways to do that.

@altendky
Copy link
Member Author

Haven't reflected on #94 yet myself but it's code and it passes two whole tests. I am tempted to try to do this without custom stuff. Might be able to pull it off with pluck or such.

@obi1kenobi
Copy link

I tried to go through #94 but I suspect my unfamiliarity with desert and marshmallow internals is getting in the way. I was a bit confused by the fact that the test data didn't seem to have a union type we're attempting to serialize — it seemed like the data being serialized and deserialized was just a float?

Since I understand this is early WIP code, I don't think I got a good sense of what the "real world" use of this code would look like. I found it difficult to untangle the test fixtures and test setup from the "the user would write this in the real world" bits that would allow me to have an informed opinion about the API and implementation.

@altendky
Copy link
Member Author

Yeah, this is both very early WIP and also a thing that may well end up in another library that desert would use. There is only one type presently tested and it was just making sure that when serialized it added the adjacently tagged layer and stripped it when deserializing. The registry is more or less the 'union', in a sense. It contains the type/tag->field mappings.

I do feel like this implementation requires us knowing about some marshmallow 'internals' (that I'm presumably not handling completely now) and as such I do plan to spend some time trying to use 'higher level' marshmallow mechanisms. Maybe pluck, maybe method, maybe function... I need to look over them.

https://marshmallow.readthedocs.io/en/stable/custom_fields.html

@python-desert
Copy link
Collaborator

@altendky
Thanks for doing this. I'm still mulling it over, but having working code is already a big improvement regardless.

@altendky
Copy link
Member Author

I hit up against the question of how to handle objects to be serialized that have the same type but different contents. The committed examples are a pair of lists, one with strings and another with integers. The registry I wrote is dual keyed by the type and the tag with the field used for serialization as the value. Not a (type, tag) tuple as a key but rather each individually is expected to be unique and maps to the field because both directions must, as planned now anyways, map to the same field. So, when you add both a string list and an integer list, poof, broken, because they are both just being checked as lists. Further it made me think about situations where you maybe have a list and a tuple but you want them serialized and deserialized the same. They are after all fairly frequently interchangeable elsewhere in Python.

So, custom classes are relatively straightforward. You don't usually want to differentiate the serialization of MyClass(x='abc') from MyClass(x=123) in a union situation. I mean, I can't say it's pointless but I expect it's a corner case. I'm not so sure the same can be said for collections. It's not working well yet but I'm adding a second registry to implement subclass checks where you could register a 'sequence (but not string)' type to catch them all to be serialized via a list field. This would help with the tuple/list question. I haven't started exploring the 'what about element types' question.

@obi1kenobi
Copy link

Apologies for being MIA for the last week. If I understand correctly, the concern is that we may not know how to properly deserialize a list because we aren't sure what the type inside it is?

I'm having a bit of trouble visualizing a situation where this arises. In my mind, we always have a schema, so if we are deserializing a List field, there are a few cases where we might end up:

  • It's a List[int] or List[str] or list of another primitive type that has a JSON representation. These we just deserialize directly, as if with json.loads().
  • It's a List[datetime] or List[MyClass] or something similar where the inner type requires further deserialization. For these, we apply the appropriate deserialization to the inner data (as specified by the registry, or by the schema, or some other mechanism), and then bundle all of the inner elements into a list.
  • It's a list of unions, e.g. List[Union[int, str, MyClass]]. This is actually a subtype of the previous case — each of the serialized unions is tagged (e.g. adjacently), so we apply the appropriate deserialization to each inner element and then bundle all of the elements into a list.
  • It's a list of Any or some other insufficiently-specific type like an ABC. Ideally, my preference would be to make it an error to declare these in the schema completely, so that the code fails long before any serialization / deserialization could happen. If a user wants to send in truly opaque data, they could e.g. JSON-ify it up front on their own and then stick it in a str-typed field.

Similarly, you mention the concern around differentiating between lists and tuples, even though they are serialized the same. In my mind, the schema tells us whether the type we are trying to deserialize is Tuple[T] or List[T], so even though they are serialized to the same representation, we always know which of them we should emit when deserializing.

Here's the top-level thing I'd like to make sure we're on the same page about: in the API we're all talking about, when deserializing we already expect to have the schema of the object we are attempting to deserialize, right? Put differently, we're aiming for something more akin to Foo.schema.deserialize(data) which returns an instance of Foo, instead of something more like json.loads(data) which somehow returns an instance of Foo even though Foo was never provided to the function. Is that right?

@altendky
Copy link
Member Author

I'm a bit tired at the moment but I think you are thinking about deserialization, which seems like obviously the troublesome part, but I was talking about serialization.

The schemas don't tell us what we have when deserializing, they tell us what we might have. The data has to tell us what we do have and will do so via the tags. But yes, we will have a field (not schema... I think...) for each tag that we need to deserialize 'from' a union. If we don't, we would raise an exception. This is what being explicit gets us and is certainly nice.

But for serialization... it is maybe a bit different. What I have implemented presently doesn't look 'inside' anything when serializing. As such, [1, 2, 3] and ['a', 'b', 'c'] are both interpreted as list giving the present registry no ability to differentiate when serializing a typing.Union[typing.List[int], typing.List[str]]. So, that's a limitation of my present early WIP exploratory attempt. It's obviously easy for us as humans to handle this specific case. It's also 'easy' to handle it in code. It is also about the most trivial case we could come across. Perhaps there is an isinstance() alike but for type hints we can use? Fail if the object to be serialized passes either zero or more than one possibility? I dunno, that half feels like guessing but making people be an extra layer of explicit in their code seems likely overly onerous.

As to the tuple vs. list question, that is again about serialization. If we have an attribute that is hinted typing.Union[typing.List[int], str] and we are asked to serialize (1, 2, 3), do we really want to fail because it isn't a list? Maybe we do and we allow for flexibility by instead hinting with typing.Sequence[int] and then either having a default sequence we deserialize to (list or tuple or...) and letting the user override somehow when they want to allow multiple sequence types to be serialized through this hint and also control the deserialized type.

But again, these are the most trivial of cases. Certainly we don't have to support every corner but I am trying to express the ones I think of so we can consider if there is a clean universal solution. Or at least decide on a clear and understandable way to handle them even if just raising an exception in some cases.

@python-desert
Copy link
Collaborator

Perhaps there is an isinstance() alike but for type hints we can use?

pytypes.is_of_type() does that.

@python-desert
Copy link
Collaborator

If we have an attribute that is hinted typing.Union[typing.List[int], str] and we are asked to serialize (1, 2, 3), do we really want to fail because it isn't a list?

For the most part Marshmallow doesn't validate anything on serialization.

>>> marshmallow.fields.List(marshmallow.fields.Int()).serialize('x', {'x': (1,2,3)})
[1, 2, 3]

@altendky
Copy link
Member Author

But we need to choose how to serialize it. Via the field for typing.List[int] or str?

@python-desert
Copy link
Collaborator

Ok yeah.

@python-desert
Copy link
Collaborator

python-desert commented Mar 20, 2020

We can do this with two functions.

  • python object -> union member type
  • union member type -> marshmallow field type/instance

Alternatively, we could skip the "union member type" step altogether and just map directly "python object -> marshmallow field type/instance". I think this is harder but might be useful in some situations.

Focusing on the first option, in the simple case where

  • each type in the union has a desert-known field type, and
  • each instance of the union can be mapped to exactly one member type,

it can be just:

foo: Union[int, List[str]]

This suggests we might want to do the registry thing to add more type -> field pairs.

Assuming we won't always be able to automatically determine the best field type, user needs a way to explicitly specify (what amounts to) a function Union[T1, T2] -> Field. Note that's not Union[Type[T1], Type[T2]] -> Field -- the domain is instances of python types.

This could be done a few ways, e.g. in "parallel lists" style, using the default field type for int as a placeholder:

foo: Union[int, List[str]] = \
    Explicit.from_list([desert.DefaultField(), fields.List(fields.Str())])

In mapping style, using the defaults without the need for placeholders:

foo: Union[int, List[str]] = Explicit.from_mapping({int: fields.Int()})

In function style:

def dispatch(x): 
    # XXX Really the fields should be instantiated only once.
   return fields.Int() if isinstance(x, int) else fields.List(fields.Str())

foo: Union[int, List[str]] = Explicit.from_function(dispatch)

@obi1kenobi
Copy link

@altendky you are completely right, I was thinking about deserialization — now this all makes sense. Thank you for explaining 👍

@python-desert your "two functions" suggestion sounds good to me. I like the mapping style best, and the parallel-lists approach second best. I'm a bit worried that the function style might provide too much freedom, in a sense, and may end up being a foot-gun. Besides, the mapping style technically allows arbitrary functions via a custom __getitem__ implementation — but has the benefit of pointing out that something a bit hacky is happening.

I'd maybe even argue that unions where types are not disjoint (i.e. it's not clear which union variant a particular value belongs to) could even be a definition-time error and blow up long before serializing anything. However, I acknowledge that I might not have a full understanding of all use cases, so I think the proposed "user provides a function to disambiguate" approach is sound.

@altendky
Copy link
Member Author

So the existing registry in #94 (TypeDictFieldRegistry), if I remember correctly, is basically Explicit.from_mapping(). the_dict[type(the_object)] will get you a field (roughly). At least that's what I assume is being implied. Explicit.from_list() doesn't address how to get from an object to a field, only hint to field. Explicit.from_function() let's the user do anything which I think should be the core ability but wrapped up with better more controlled implementations the user can just grab and use.

I had started another registry to attempt to resolve the issues I mentioned, but got distracted. The AdjacentlyTaggedUnion takes a object->field callable and a tag->field callable. I think the 'two functions' suggestion would turn that into three parameters? object->hint callable, hint->field callable, and tag->field callable? I don't have a good 'why' in mind for that but it feels like it might be better.

@python-desert
Copy link
Collaborator

Explicit.from_list() doesn't address how to get from an object to a field, only hint to field.

I believe "object to hint" can be done by a predefined mapping with fallback to the pytypes.is_of_type() function I mentioned earlier.

if x in mapping.keys(): 
    return mapping[x]
[hint] = {t for t in types if pytypes.is_of_type(x, t)}
return hint

@altendky
Copy link
Member Author

altendky commented Mar 20, 2020

Predefined type() based default? Are you thinking of these as just other ways to build other types of registries? Or are they somehow distinct?

Are you thinking changing to the following by breaking from_object into two pieces? (for some made up Hint hint...)

class AdjacentlyTaggedUnion(marshmallow.fields.Field):
    def __init__(
        self,
        *,
        hint_from_object: typing.Callable[[typing.Any], Hint],
        field_from_hint: typing.Callable[[Hint], TypeTagField],
        from_tag: typing.Callable[[str], TypeTagField],
        **kwargs,
    ):

@altendky
Copy link
Member Author

https://github.com/marshmallow-code/marshmallow-oneofschema

Maybe a relevant thing to look at. Just glanced quickly.

@altendky
Copy link
Member Author

If you poke at pytypes, be wary. Present PyPI doesn't support 3.7+. Git master does better.

Stewori/pytypes#40
Stewori/pytypes#88

@altendky
Copy link
Member Author

altendky commented Jun 1, 2020

Next funny business to consider. A str will match both str and typing.Sequence[str] hints. Yay!

@altendky
Copy link
Member Author

altendky commented Jun 1, 2020

So I think at some level it's simply going to be possible to have non-distinguishing hints. I guess it's up to the registry (the thing that figures out the hint to use etc, whatever) to decide if it should distinguish or not and even though we ought to provide a good option that'll be entirely configurable. The above could be handled by using an isinstance() check as well to prioritize str. Or just order of registered types (which is what happened implicitly before the latest commit). Or prioritize types over hints... But I think this isn't a structural, it doesn't relate to the interface we need to identify.

@altendky
Copy link
Member Author

altendky commented Jun 1, 2020

So, where are we at and what do we need to decide?

#94 presently has a TaggedUnion field. It is configured by four callables. from_object(), from_tag(), from_tagged(), and to_tagged(). Bad incomplete names but... the first two identify a hint/tag/field setting from an object (we need to serialize [1, 2, 3] so we will use the hint typing.List[int] and associated field) and from a tag (the serialized data had "list_of_ints" so we will use typing.List[int] etc). The last two will add and remove the explicit 'layer' to serialized (marshaled anyways) data. 1 and tag "my_integer" to {"type": "my_integer", "value": 1} for example, and back.

I think some above discussion was considering that maybe there should be three callables with an extra intermediate step?

The next layer around TaggedUnion field are the functions externally_tagged_union(), internally_tagged_union(), and adjacently_tagged_union. They fill in the callables for adding/removing the explicit tagging for each corresponding form of explicitness. You still provide them with the value/hint/tag/field translation callable pair.

The tests give each of those tagged union callables with methods from a OrderedIsinstanceFieldRegistry instance. It uses the type of the value and the type hints to decide what field to use for a given value to be serialized. Deserialization is pretty trivial given the explicit specification of the type by an associated string.

It is presently failing due to ['abc', '2', 'mno'] matching both typing.List[str] and typing.Sequence[str]. This seems like a not-normal case that a developer would hit this and if they do and have a rule that they want to use then they can make their own registry. Or perhaps we break the registry down a bit so they can reuse it's logic and add their own just for further distinguishment.

So, where are we at and what do we need to decide? (sorry, couldn't help being a bit WET here :])

@altendky
Copy link
Member Author

I might work on this a bit. I've at least reread the discussion so far. Just mentioning in case anyone has updates on the topic from the past year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants