Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up kart diff #1018

Open
craigds opened this issue Nov 12, 2024 · 6 comments · May be fixed by #1025
Open

Speed up kart diff #1018

craigds opened this issue Nov 12, 2024 · 6 comments · May be fixed by #1025
Assignees
Labels
enhancement New feature or request

Comments

@craigds
Copy link
Member

craigds commented Nov 12, 2024

Generating a large diff using kart is quite slow

$ time kart diff '[EMPTY]...e13cee173cf1a3d14456c7f900dd8763ec321ff7' -o json-lines | pv > /dev/null
2.12GiB 0:03:50 [9.42MiB/s] [                              <=>                                                            ]

real	3m50.265s
user	0m0.134s
sys	0m1.084s

ie this full diff is 2.1 GB and takes 230s for Kart to generate as JSONL at 9MiB/s.

Describe the solution you'd like

Ideally these diffs could be made 5-10x as fast.

We should start with some profiling, although I suspect the limiting factor is git ODB access, which we've found difficult to speed up in the past. If that's the case then the most obvious speedup would be using multiple threads to fetch objects from the ODB.

@craigds craigds added the enhancement New feature or request label Nov 12, 2024
@craigds craigds changed the title Speed up kart diff [using threads] Speed up kart diff Nov 12, 2024
@olsen232
Copy link
Collaborator

Here is some rough profiling data from running a large diff, in a relatively recent Kart (I think v0.15.0)
Would obviously depend somewhat on the system and on exactly what is being diffed.

15% generate "raw" diff from ODB using pygit2 and convert it to a kart.diff_structs.Diff object (PK's are decoded, but blob data is not yet loaded)
32% load data from ODB blobs and lazily create data for each Delta object
  includes 16% load data for old blobs
  includes 16% load data for new blobs
7% convert geometry to hex WKB
29% dump JSON
10% "handle system exit" ????

total: 93% - leaves 7% "misc"

More threads is certainly worth trying, also we could try a different JSON library.
Eg https://github.com/ijl/orjson

@rcoup
Copy link
Member

rcoup commented Nov 12, 2024

orjson has a good reputation for being the quickest.

10% "handle system exit" ????

garbage collection/deallocation if we build up a lot of dynamic objects?

craigds added a commit that referenced this issue Nov 13, 2024
refs #1018

When generating a large (2GB) diff as JSON-Lines this takes 20-30% less
time than the stdlib.

It may be possible to use this in other places, but note that orjson
doesn't support streaming encoding (iterencode), which means it is of
limited utility where we're trying to stream JSON diffs of huge
datasets.

This change uses it for individual features in JSONL diffs only where
the lack of iterencode() isn't a concern.

orjson is MIT licensed.
craigds added a commit that referenced this issue Nov 13, 2024
refs #1018

When generating a large (2GB) diff as JSON-Lines this takes 20-30% less
time than the stdlib.

It may be possible to use this in other places, but note that orjson
doesn't support streaming encoding (iterencode), which means it is of
limited utility where we're trying to stream JSON diffs of huge
datasets.

This change uses it for individual features in JSONL diffs only where
the lack of iterencode() isn't a concern.

orjson is MIT licensed.
@craigds
Copy link
Member Author

craigds commented Nov 13, 2024

more profile, from my local kart after merging that orjson speedup:

https://gist.github.com/craigds/2b8c9eac356aae8f61744a3cf8ff4c54

@craigds
Copy link
Member Author

craigds commented Nov 14, 2024

msgspec: https://jcristharif.com/msgspec/benchmarks.html#messagepack-serialization

  • sounds like msgspec is 1.6x faster than msgpack (for decoding)
  • msgpack decode is ~41% of kart diff runtime.
  • so switching libraries would mean an additional 1.15x speedup for kart diff
  • and presumably similar speedup for any other things kart does for tabular datasets, not just diff

@rcoup
Copy link
Member

rcoup commented Nov 18, 2024

sounds like msgspec is 1.6x faster than msgpack (for decoding)

40% faster? (0.427s cf 0.799s)

And 70% faster for encoding.

Looks like msgspec does json too? So could potentially not ship it and orjson?

@craigds
Copy link
Member Author

craigds commented Nov 29, 2024

yes, looks like it performs similarly to orjson so we could use it for that too and avoid bundling both with Kart

@craigds craigds linked a pull request Nov 29, 2024 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants