Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pydantic validators for string format checks #406

Open
MarshalX opened this issue Oct 1, 2024 · 7 comments · May be fixed by #451
Open

Implement pydantic validators for string format checks #406

MarshalX opened this issue Oct 1, 2024 · 7 comments · May be fixed by #451
Labels
enhancement New feature or request

Comments

@MarshalX
Copy link
Owner

MarshalX commented Oct 1, 2024

lexicon`s string type has the "type" field which directs strict string pattern https://atproto.com/specs/lexicon

here is the list of these formats (https://atproto.com/specs/lexicon#string-formats):

We could make our model validation more strict. For someone who wants to use SDK`s models from the server side. To implement PDS, for example.

snarfed implemented this string validations and kindly shared with us: https://github.com/snarfed/lexrpc/blob/41a858c2c28ad212df64f347270c3a8092743f1b/lexrpc/base.py#L439-L522

    def _validate_string_format(self, val, format):
        """Validates an ATProto string value against a format.

        https://atproto.com/specs/lexicon#string-formats

        Args:
          val (str)
          format (str): one of the ATProto string formats

        Raises:
          ValidationError: if the value is invalid for the given format
        """
        def check(condition):
            if not condition:
                raise ValidationError(f'is invalid for format {format}')

        check(val)

        # TODO: switch to match once we require Python 3.10+
        if format == 'at-identifier':
            check(DID_RE.match(val) or DOMAIN_RE.match(val.lower()))

        elif format == 'at-uri':
            check(len(val) < 8 * 1024)
            check(AT_URI_RE.match(val))
            check('/./' not in val
                  and '/../' not in val
                  and not val.endswith('/.')
                  and not val.endswith('/..'))

        elif format == 'cid':
            # ideally I'd use CID.decode here, but it doesn't support CIDv5,
            # it's too strict about padding, etc.
            check(CID_RE.match(val))

        elif format == 'datetime':
            check('T' in val)

            orig_val = val
            # timezone is required
            val = re.sub(r'([+-][0-9]{2}:[0-9]{2}|Z)$', '', orig_val)
            check(val != orig_val)

            # strip fractional seconds
            val = re.sub(r'\.[0-9]+$', '', val)

            try:
                datetime.fromisoformat(val)
            except ValueError:
                check(False)

        elif format == 'did':
            check(DID_RE.match(val))

        elif format == 'nsid':
            check(len(val) <= 317)
            check(NSID_RE.match(val) and '.' in val)

        elif format in 'handle':
            check(len(val) <= 253)
            check(DOMAIN_RE.match(val.lower()))

        elif format == 'tid':
            check(TID_RE.match(val))
            # high bit, big-endian, can't be 1
            check(not ord(val[0]) & 0x40)

        elif format == 'record-key':
            check(val not in ('.', '..') and RKEY_RE.match(val))

        elif format == 'uri':
            check(len(val) < 8 * 1024)
            check(' ' not in val)
            parsed = urlparse(val)
            check(parsed.scheme
                  and parsed.scheme[0].lower() in string.ascii_lowercase
                  and (parsed.netloc or parsed.path or parsed.query
                       or parsed.fragment))

        elif format == 'language':
            check(LANG_RE.match(val))

        else:
            raise ValidationError(f'unknown format {format}')

the code above is licensed under CC0 1.0 Universal

I see this task as:

  • implement pydantic validators. 1 per each string format
  • tune models generator to apply validators for fields
@MarshalX MarshalX added the enhancement New feature or request label Oct 1, 2024
@zzstoatzz
Copy link
Contributor

hi @MarshalX - is this useful?

the expressions you linked above seemed perhaps too complex for Field's pattern kwarg, so I think the Annotated strategy might be simplest - could be wrong though.

I'd be interested in helping plugging this in here if you could point me in the right direction?

@MarshalX
Copy link
Owner Author

@zzstoatzz hi yes! this what do we need to implement for it. but i am a bit afraid to apply it to all existing models. suspect perf degradation. for example in firehose. so it would be great to make this regex validation optional. and the devs could enable it when they want to implement some server side

@zzstoatzz
Copy link
Contributor

zzstoatzz commented Nov 20, 2024

that makes sense! I updated the example to make use of validation context for opt-in strict mode, which incurs a small penalty for existing and then a bit more for being enabled

│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓                                                                │
│ ┃ Test Type                     ┃ Items   ┃ Time (seconds) ┃ Items/second ┃                                                                │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩                                                                │
│ │ Raw dictionaries              │ 100,000 │ 0.40           │ 247,971      │                                                                │
│ │ Skipped validation (default)  │ 100,000 │ 0.57           │ 176,242      │                                                                │
│ │ Opted-in to strict validation │ 100,000 │ 1.00           │ 99,908       │                                                                │
│ └───────────────────────────────┴─────────┴────────────────┴──────────────┘

fwiw in prefect we made a types module and are incrementally adopting the types in our schemas

@MarshalX
Copy link
Owner Author

@zzstoatzz it looks amazing! you did a really good job. do you want to do a PR? i see the PR like this:

  • put all types and validator to smth like atproto_client/models/string_formats.py
  • attach new string types according to lexicon in the models codegen
  • cover new validators with tests (i hope we can copy some test data from the official atproto repo)
  • add new flag to this function to enable strict string format:
    def get_or_create(
    model_data: ModelData[M], model: t.Optional[t.Type[M]] = None, *, strict: bool = True
    ) -> t.Optional[t.Union[M, UnknownRecordType, DotDict]]:
    """Get model instance from raw data.

    as you can see we already have strict which is enabled. so i suggest to have strict_string_format as a new option
  • open question: we have languages.py, could we use it for validator of language string format?
  • open question: do we want to bring support of re2? if re2 is installed use it instead of built-in re module

@zzstoatzz zzstoatzz linked a pull request Nov 22, 2024 that will close this issue
@zzstoatzz
Copy link
Contributor

@MarshalX cool! I've opened #451 but I'm not sure exactly how I should be plugging into / updating the codegen here - pointers welcome!

I tried

» ipython
In [1]: from atproto_codegen.models.generator import generate_models

In [2]: generate_models()

which appeared to update packages/atproto_client/models/* but CI yelled at me 🙂 so I reverted that for now

@MarshalX
Copy link
Owner Author

@zzstoatzz thank you! After changing the codegen, you need to run it and commit that changes in models as well. CI checks that you do not forgot do to it :)

To run codegens we have CLI. CMD:poetry run atproto gen models. Also, do not forget to run docs gen, since you added new file that probably needs .rst. To run it use make -s -C docs gen

@zzstoatzz
Copy link
Contributor

@MarshalX excellent! thank you for the direction 🙂

I've marked the PR ready - I'll keep an eye out for a review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants