Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why UUIDs? #519

Open
janosh opened this issue Jan 1, 2024 · 34 comments
Open

Why UUIDs? #519

janosh opened this issue Jan 1, 2024 · 34 comments
Labels
question Further information is requested ux User experience

Comments

@janosh
Copy link
Member

janosh commented Jan 1, 2024

I was looking at a wall of job IDs like this earlier and wondering if jobflow actually requires full 36-character UUIDs?

PhononUUIDs(optimization_run_uuid='cc10cd4d-944d-43b1-9651-2e168765c02e', displacements_uuids=['c29abb65-a8e5-47ff-a6d6-6776cd8e4095', 'e2e07788-daed-4201-921b-ec2e57c802c4', 'bf443b3d-6334-4b04-8f8a-29fabc08c37e', '23c36a18-4066-4047-a553-aa9fe2afa714', '81134dc7-a88f-486a-af3b-ddd9c1ba3eb1', '1009897b-a97c-4d4a-9cac-e952766b9608', '15015ddf-7967-410d-8fef-13642d7bd9e5', '1899803f-76c1-4768-bf23-c4d61fb51691', 'd39ce806-7403-495e-89a4-a4de961590bf', '35bb27e8-7f31-44ec-b684-0f9391b4228e', 'c6f188fb-2d26-47ff-b0f8-8438fae8ff58', '06b2249c-e7c6-4cf8-91b0-f572fccc209c', '18bbc0be-be4c-4131-8ef6-f6742c899acd', '8303eb32-162a-4a65-b75f-d8586aa23636', '392282e2-cbe6-447e-b7a9-e20a02e57cbe', '835b35f9-413c-479c-944d-14d9d4649b71', 'c59f17e4-1f9b-4e65-a9f2-dc9fc7314283', '95d9e094-802b-45c5-8630-3b0e8e6c3b64', 'dea7a13c-6921-42f5-adf5-7a4670319a7b', '6faa0b95-0058-4476-bd55-109d3746a334', '6a6fbce4-41b3-4b7a-8d69-0b6ffab7716c', '0ef5726d-3b62-4464-a5e9-f33237f59d83', '3de16018-90ad-4da7-9885-86a783d30c1c', '5c779412-0c29-423b-ae7b-c6eade5124bc', 'adc6ef3c-fc70-43cf-a819-741b45d0ec7f', '5b47a7ff-bb6d-45f4-b442-1183db8a58df', 'e01a19d7-3685-465a-a93b-bfa3e876081c', '79addbaa-32fb-4c5c-82c2-23c01dd9cd89', '6232ed4f-8b57-40a9-9fdf-0956913a84f6', '772a84ea-9f22-4167-8887-dc8ea46b2ad7', 'e74100ac-f17f-4fdc-81b9-18b2e3c69064', '7d0ba392-0eef-4da6-8d31-6eb84223dcc3', '98c688d5-5aa2-49cb-ac77-847b04abd145', '3e4a0fd0-d383-4b1a-b20c-56794fd9a096', 'ed466545-1988-4d52-8dec-835075a6049d', 'bb829a61-6e91-481f-b90e-ffa1baa0e876', '6e49c916-2197-4ff8-a582-52e751d20837', '6a5a7a71-3721-44e3-bd93-08c8128ef52c', '6dcba672-d54e-4900-b807-737f6adfc6e3', '6f13d708-a1cf-4884-b748-dcc932e753b9', '5721c7af-932b-4250-81f9-ab88f219a98f', '6f754a3f-81f6-46be-9160-95ffd0af6f9b', '723b9f8c-11ea-4261-801f-f02e1b4aa518', '90889945-059e-4777-9b94-a473ddacb185', '27066b5b-5c80-408b-a8aa-cae9b0304a26', '3389b0c1-e39d-4fba-b175-986b12c2baeb', '8185facc-f20b-4261-b8d2-cf584f50c151', '23251773-7ea8-42d5-a791-303240af896c', '5605e378-5648-44ab-b2dd-1caf44267c00', '6df6ccb8-d26e-4bbf-8285-18fdfe1ae64e', '2af8cf20-c1ee-48ae-bc52-dacfb60d9bf2', '9725444a-b59f-499b-9293-51641a65f689', '9c262ca0-15d3-4a78-8e53-a8643c0e46da', '39ebce16-b68c-4edc-8e2b-00d4cf677931', '267b87e0-ff6d-4832-a703-6d3cbc127e24', '186e0ac2-57a7-424f-8e95-09ddf6e25a86', '16a2f8cc-73ab-46ea-bc33-b6304250376e', '8fb0d733-8c8c-4df3-97f9-68937ea92b73', '788bc502-840a-4283-995b-6edf6df57458', 'f390865a-da66-4241-9bee-763279e050f9', 'c433c10b-64e1-4ace-aa07-13af868fbbc7', '23384eb4-21e3-4819-86d7-1f60a579d839', 'aa346ded-5d7a-4aff-a16a-413250513b40', 'b5a64bbd-ae68-49e2-8b60-f04e3e80158d', '5d392703-1fcd-42e1-a87f-8da4c2be82df', '6b3c4b43-2968-4a95-b32b-bf9be92aea6b'], static_run_uuid='4afd1757-09f3-44c4-ad15-93d099e989d0', born_run_uuid=None))

YouTube for example uses 11-character IDs?

https://youtu.be/R3N6BocbknM

It's case-sensitive alpha-numeric plus hyphen and underscore giving $64^{11} \approx 7.4 \cdot 10^{19}$ combinations. I think there would be readability and "pastibility" improvements.

@janosh janosh added question Further information is requested ux User experience labels Jan 1, 2024
@utf
Copy link
Member

utf commented Jan 7, 2024

I'm not against using shorter IDs, especially since they're currently encoded as string not bson uuids. Is there a way to generate those short IDs without an additional dependency?

@janosh
Copy link
Member Author

janosh commented Jan 7, 2024

Yes, we could do

import random
import string


def get_random_id():
    """Return a random 11-character YouTube-like ID."""
    chars = random.choices(string.ascii_letters + string.digits + "-_", k=11)
    return "".join(chars)

Maybe pastibility would be better if we drop the dash and underscore.

def get_random_id():
    """Return a random 11-character alpha-numeric ID."""
    chars = random.choices(string.ascii_letters + string.digits, k=11)
    return "".join(chars)

@gpetretto
Copy link
Contributor

I think that making the uuid shorter and easier for copy/paste would indeed be beneficial, but I have a few notes that may be considered before proceeding:

  • in jobflow-remote, in the CLI, I am using the structure of the UUID to distinguish if a user passes a uuid or a db_id. This is not a blocking issue, but I thought it was worth mentioning it. Maybe the new id structure could be made so to be clearly identifiable?
  • in the databases and in the code the UUIDs are always referred as uuid. Changing that would require some refactoring and maybe backward incompatible changes, not changing means that a uuid attribute contains something that is not a proper uuid

@janosh
Copy link
Member Author

janosh commented Jan 8, 2024

@gpetretto Great points both!

  1. We could prefix the IDs with jf- which would allow identifying them via id.startwith('jf-') and would even allow future changes of the ID format without breaking these checks.
  2. If we go ahead with this change, I vote to rename from uuid to id (or jf_id) since keeping uuid would be misleading.

@gpetretto
Copy link
Contributor

Thanks for considering these points. I have a few more comments.

  • There is an error in the estimate of the number of possible combinations. It is not 11^64 but 64^11 ~ 7.37E19 combinations, which is considerably smaller than the number of different UUIDs (5.3E36). I suppose the possibility of collision is still rather low, but if this option is chosen, it may be convenient to increase the size of the string.

  • If changing the id format is an option, it may also be convenient to consider other algorithms and their features. For example something sortable. Here are a few examples I have come across:

    python implementations are typically available and they probably won't require further dependencies, but I suppose it would also be easy to add a simple reimplementation inside jobflow.

@janosh
Copy link
Member Author

janosh commented Jan 8, 2024

I was looking at NanoID when opening this issue but ULID looks even better. Only thing is 26 vs 36 is not that much shorter.

It is not 11^64 but 64^11 ~ 7.37E19

Oops, rookie mistake 😅

@Andrew-S-Rosen
Copy link
Member

Andrew-S-Rosen commented Jan 8, 2024

Maybe I'm not thinking hard enough about it, but is there any reason not to have the unique ID simply be the datetime + a small identifier?

time_now = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S-%f")
f"{time_now}-{SMALL_ID_GOES_HERE}"

Then it would be interpretable but also with low probability of clashing even if SMALL_ID_GOES_HERE is not particularly diverse.

I guess mine isn't shorter though, so I take it back. 😅

@jmmshn
Copy link
Contributor

jmmshn commented Jan 18, 2024

The sortable point from @gpetretto is pretty important.
Some of the more complex workflows in emmet require the IDs to be sorted (ideally based on creating time). So if we move away from UUID4 (which I think is random), I think this is actually a requirement.

Otherwise, we run into problems like you see here:
materialsproject/atomate2#655 (comment)

@utf
Copy link
Member

utf commented Jan 18, 2024

Agreed that given the emmet limitations, having a sortable id is essential.

Of the sortable options mentioned

  • UUID1 is 36 characters
  • ULID is 26 characters
  • Snowflake-id is 19 characters but all numbers and quite difficult to tell the difference between ids.
  • NanoID and CUID are not sortable as far as I can tell.

Are there any other options/suggestions on what to use?

@Andrew-S-Rosen
Copy link
Member

Just chiming in to say that, while it doesn't help now, looks like uuidv7 might address some of these needs in the future (e.g. for other projects). https://buildkite.com/blog/goodbye-integers-hello-uuids

@janosh
Copy link
Member Author

janosh commented Jan 18, 2024

Cool! uuidv7 support in std lib is tracked in python/cpython#89083 with a planned release in Python 3.13.

We could install uuid7 in the meantime or copy-paste this function into jobflow

@Andrew-S-Rosen
Copy link
Member

@janosh just a note: stevesimmons/uuid7#1 (comment)

@jmmshn
Copy link
Contributor

jmmshn commented Jan 18, 2024

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

@Andrew-S-Rosen
Copy link
Member

I think that makes the most sense of the available options.

@utf
Copy link
Member

utf commented Jan 18, 2024

Ok, happy to go with that. @jmmshn would you be able to submit a PR? I will then release a new version of jobflow.

@jmmshn
Copy link
Contributor

jmmshn commented Jan 18, 2024

OK!

@gpetretto
Copy link
Contributor

I think that also the initial points raised by @janosh in the first message are worth considering, i.e. the ability to quickly copy/paste and having a shorter ID. uuid1 and uuid7 address the sortability, but not the other two issues. ULID would address all of them (10 characters shorter, no hyphens, sortable). Why not considering that instead?

@utf
Copy link
Member

utf commented Jan 18, 2024

I would also be happy with ULID. The only drawback is that we would need to either:

  1. Add ULID as a dependency
  2. Copy the code into jobflow.

What I'm not clear about is whether we'd also need to do the same for emmet in order to use the sorting features.

@jmmshn
Copy link
Contributor

jmmshn commented Jan 18, 2024

Ditto @janosh and @gpetretto on the size thing.
since the UUID bit of code in jobflow is so small can't we just move it all to emmet core?

@utf
Copy link
Member

utf commented Jan 18, 2024

Jobflow doesn't depend on emmet though (only atomate2 does).

From some quick reading, it seems like the sorting should be trivial without needing ULID as a dependency.

@davidwaroquiers
Copy link
Contributor

Maybe in monty? Both emmet and joblow depend on monty.

@jmmshn
Copy link
Contributor

jmmshn commented Jan 18, 2024

I think we can just geta PR going with UUID1 for now and add others later.

@gpetretto
Copy link
Contributor

I would like to point out two other smaller issues with uuid1:

  1. it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries.
  2. For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

@davidwaroquiers
Copy link
Contributor

I would like to point out two other smaller issues with uuid1:

  1. it is not really sortable. The order needs to be reconstructed at a later time. Which is definitely a downside as it could be useful in DB queries.
  2. For a single user the number of machines used to generate a new uuid are likely limited. New workflows are typically generated on the same machine and then added to the DB. So for a single user uuid1 can be seen as a very long id that just contains the timestamp.

As a general comment, I would avoid changing multiple times the algorithm for the id generation, as this may be confusing or leading to unexpected issues. Even changing it once may be a reason for concern. According to the website linked above, the definition of the uuid7 standard should be in its final stage. So I am wondering if it would not be better to wait for its release in order to minimize the number of changes.

I also think it might be better to not change things too many times as jobflow (and atomate2) are already used in production by many people. One thing that was discussed above was also the renaming of the uuid attribute (to id or jf_id). This might also be something to keep in mind (as @gpetretto mentioned, having a uuid attribute which is not a "proper" uuid could make things confusing). Such a change could be done earlier with a deprecation. Not sure I would call it id nor jf_id though. Maybe uid (for unique id) ?

@jmmshn
Copy link
Contributor

jmmshn commented Jan 19, 2024

I agree with the naming issue, and a change to uid is probably needed.
I think we can enable support for multiple id types, keep it at uuid4 as default for now, and swap it over to ulid or uuid7 at a later time.
This will give me some kind of sortable id immediately so I can proceed with my stuff (just have a hack in a comparator for the uuid1's) then we can wait a while to swap it over to ulid as default after we have all used it in testing for a while.

Also, if someone is already committed to using uuid4 they can just tweak their jobflow.yaml and not disrupt anything even after the switch.

@mkhorton
Copy link
Member

One important factor here is that MongoDB does have a native UUID datatype, which reduces the number of bytes to store it compared to a string representation. Since ULID is bit-compatible with UUID, it’s probably possible to switch away from using a string too.

@mkhorton
Copy link
Member

mkhorton commented Jan 23, 2024

That said, I’m not convinced about the argument against uuid7 if it’s a standard and implementations are imminent. If the size on disk is the same, I’m not sure it matters if the string representation has hyphens or not — and indeed, the hyphens help with readability.

@dpldgr
Copy link

dpldgr commented Feb 1, 2024

Maybe just use UUID1 for now, and when we add the support for UUIDS in emmet we can write it with both UUID1 and UUID7 in mind.

A few comments on UUIDv1 and UUIDv7:

  • UUIDv1 is not easily sortable in the same way that UUIDv7 is because of it's bit layout, so I would recommend against it's use if performant sorting by timestamp is important.
  • UUIDv7 is close to being made a standard but has not been finalised yet. It's close enough that you could with high probability rely on the bit layout not to change: https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/14/
  • A 3rd option would be to use UUIDv8 which has a much more flexible layout: just about any bit layout is valid as long as the version and variant fields are set correctly.

@mkhorton
Copy link
Member

mkhorton commented Feb 1, 2024

ULID was already merged into emmet, but I agree with your points @dpldgr. I’d much rather we try to stick with a standard, it seems safe to adopt UUIDv7 given how close it is to being finalized.

@jmmshn
Copy link
Contributor

jmmshn commented Feb 1, 2024

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

It might make sense to have a couple of full builds of MP with atomate2 data first before we fully commit to a convention change. But one of the MP builders will have to chime in and participate on that.

@Andrew-S-Rosen
Copy link
Member

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

@mkhorton
Copy link
Member

mkhorton commented Feb 2, 2024

This library looks good:

https://github.com/aminalaee/uuid-utils

I would avoid the “uuid7” library since apparently it’s non-compliant.

Discussion on uuid7 making it into the stdlib here:

https://discuss.python.org/t/add-uuid7-in-uuid-module-in-standard-library/44390/5

@dpldgr
Copy link

dpldgr commented Feb 2, 2024

So things on both emmet and jobflow are kept flexible for the time being and the default behavior is basically "nothing changes" but if you want to use ULID or UUIDv1 for any reason right now you can.

From what I can tell, ULID is very very similiar to UUIDv7 in bit layout. It uses the same bit layout for the timestamp, but omits the variant and version fields that all UUIDs have (they have random values instead). So you could trivially make a ULID into a UUIDv7 by setting the variant field (bits 64-65) to binary 10 and the version field (bits 48-51) to binary 0111 . That would make for a much easier potential migration path than converting from UUIDv1 to UUIDv7.

I would avoid the “uuid7” library since apparently it’s non-compliant.

It is compliant, but with an old version of the draft. It should either be updated or removed as a package because there's obviously plenty of scope for confusion and unexpected results.

@dpldgr
Copy link

dpldgr commented Feb 2, 2024

@jmmshn --- out of curiosity and somewhat independent of integration in jobflow, do you have a recommended library to use UUIDv7 today (acknowledging it's possible, but unlikely, it'll substantially change in the future)? https://github.com/oittaa/uuid6-python?

That one looks like a good choice to me. It's an active repo that has had plenty of releases, so you're much more likely to get support if it's needed. The 'uuid7' repo in comparison has been stagnant for quite some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested ux User experience
Projects
None yet
Development

No branches or pull requests

8 participants