Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add opt-in strict string format #451

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

zzstoatzz
Copy link
Contributor

@zzstoatzz zzstoatzz commented Nov 22, 2024

adds pydantic validation per #406 (comment). see the usage example

format example pattern/constraints
did did:plc:z72i7hdynmk6r22z27h6tvur did:[a-z]+:[a-zA-Z0-9._%-]+ (max 2KB, no /?#[]@)
handle alice.bsky.social Domain name: 2+ segments, ASCII alphanumeric/hyphens, 1-63 chars per segment, max 253 total. Last segment no leading digit.
at-uri at://alice.bsky.social/app.bsky.feed.post/3jxtb5w2hkt2m at:// + handle/DID + optional /collection/rkey. Max 8KB. No query/fragment in Lexicon.
datetime 2024-11-24T06:02:00Z ISO 8601/RFC 3339 with required 'T', seconds, timezone (Z/±HH:MM). No -00:00.
nsid app.bsky.feed.post 3+ segments: reversed domain (lowercase alphanum+hyphen) + name (letters only). Max 317 chars.
tid 3jxtb5w2hkt2m 13 chars of [2-7a-z]. First byte's high bit (0x40) must be 0.
record-key 3jxtb5w2hkt2m 1-512 chars of [A-Za-z0-9._:~-]. "." and ".." forbidden.
uri https://example.com/path RFC-3986 URI with letter scheme + netloc/path/query/fragment. Max 8KB, no spaces.
did:plc did:plc:z72i7hdynmk6r22z27h6tvur Method identifier must be 24 chars of base32 ([a-z2-7]). Cannot include underscore.

does not address open questions regarding the language types or re2. happy to chat on those if they'd be blocking here

may close #406

notes

I ran this to generate models

uvx poetry run atproto gen models

and this to update docs

uvx poetry run make -s -C docs gen

and this to run the new tests

uvx poetry run pytest tests/test_atproto_client/models/tests/test_string_formats.py

if desirable, I can follow up this PR to add contribution instructions to the current README section (for uv at least, and perhaps a TODO for proper poetry usage, since I don't really use that) - let me know!


in/valid examples for tests are sourced from https://github.com/bluesky-social/atproto/tree/main/interop-test-files/syntax

pyproject.toml Outdated Show resolved Hide resolved
@zzstoatzz zzstoatzz changed the title [wip] add opt-in strict string format add opt-in strict string format Nov 23, 2024
@zzstoatzz zzstoatzz marked this pull request as ready for review November 23, 2024 04:49
Copy link
Owner

@MarshalX MarshalX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome hard work! thank you! speaking of your gotten knowledges about how to contribute we need to write CONTRIBUTING.md someday. but this will be separated PR for sure

packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Show resolved Hide resolved
packages/atproto_codegen/models/generator.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@zzstoatzz
Copy link
Contributor Author

thanks for the review @MarshalX! comments should be addressed in 6ad722a

@zzstoatzz zzstoatzz force-pushed the strict-string-format branch 3 times, most recently from 7bcd93e to eb1d28b Compare November 24, 2024 22:49
@pytest.fixture
def invalid_data() -> dict:
return {
'handle': get_test_cases('handle_syntax_invalid.txt')[0],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we parametrize it as well? to use all the test cases from .txt files instead of the first one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha good call. yeah this definitely opened a new can of worms 🙂

there are a couple straggler cases I couldn't quite figure out right away that I've explicitly skipped for now, just a few of the many. let me know if anything jumps at you, ill come back to dig more later

@zzstoatzz zzstoatzz force-pushed the strict-string-format branch 3 times, most recently from 8d7deee to 1c82c73 Compare November 25, 2024 13:18
RKEY_RE = re.compile(r'^[A-Za-z0-9._:~-]{1,512}$')
TID_RE = re.compile(rf'^[2-7a-z]{{{TID_LENGTH}}}$')
CID_RE = re.compile(r'^[A-Za-z0-9+]{8,}$')
AT_URI_RE = re.compile(
Copy link
Owner

@MarshalX MarshalX Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the long response! i am busy at work :(

as I can see this regex is different to official one:

https://github.com/bluesky-social/atproto/blob/53fcc2fbcbda9b325eeffb5705126831baf2a427/packages/syntax/src/aturi.ts#L5)

We use official in our AtUri class:

_ATP_URI_REGEX = r'^(at:\/\/)?((?:did:[a-z0-9:%-]+)|(?:[a-z0-9][a-z0-9.:-]*))(\/[^?#\s]*)?(\?[^#\s]+)?(#[^\s]+)?$'

(btw probably you could just use AtUri class to validate string instead of duplicating regex)

Probably, your regex come from re-impl mentioned in the issues. But it would be awesome to align with official one! And as I can see you can get everything from the syntax package! https://github.com/bluesky-social/atproto/tree/53fcc2fbcbda9b325eeffb5705126831baf2a427/packages/syntax/src. With it all the tests should pass and no need for excluding test cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the long response! i am busy at work :(

no worries! so am I 🙂

ack! I missed that - thank you!

I will update the regexs to follow those as soon as I can 👍

@zzstoatzz
Copy link
Contributor Author

zzstoatzz commented Nov 30, 2024

hi @MarshalX - ok all of the cases are passing with no skipping!

I found that the spec and test cases in the interop valid/invalid .txt examples were not entirely covered by the types in atproto_core, so for now I have tended towards duplicating regex in string_formats.py and documenting with comments why I included the logic I did.

For example, looking at the AT-URI validation, I noticed what appear to be few differences between the regex I added and atproto_core's AtUri that seem important for passing the interop tests:

  1. The spec requires "at://" prefix (see tests like "a://did:plc:asdf123" and "at:/did:plc:asdf123" being invalid)
  2. The path must be a valid NSID format (invalid like "at://did:plc:asdf123/12345" and "at://did:plc:asdf123/short/stuff" show this)
  3. The handle authority section has some idiosyncrasies - tests fail cases like "at://name", "at://name.0", "at://bsky" that a general URI parser might accept

let me know if you think i'm missing something here or if you think I should move some of these requirements into the types in atproto_core instead

also, just in case it comes up: i had a tricky problem with loading the test cases where I was calling .strip() on each line of the .txt files, but at least one of the test cases was checking for whitespace 🙃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement pydantic validators for string format checks
2 participants