Why? #1

Open
nelsonic opened this issue Nov 2, 2018 · 5 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed question Further information is requested T1d

Comments

@nelsonic
Member

nelsonic commented Nov 2, 2018

At present we are using a UUID in our Append-only Log see: dwyl/alog#15
This is a good "stop gap" to get the alog project to "alpha" so it could be used in the Client Project. :shipit:
But I feel that we should carefully consider the use of UUIDs as IDs in the "long term".

If an ID is meant to be "machine readable" and "guaranteed" unique, then UUIDs are perfect.
If there are additional requirements then we need to capture them.

For example: are we going to display the UUID in the URL for a given content type e.g:

location-app.com/venues/123e4567-e89b-12d3-a456-426655440000

Is this the most user-friendly ID we could display? (is it distinctive or memorable ...?)

Could we instantly improve UX by shortening the URL and making it Base64 instead of Base16? e.g:

location-app.com/sw1x

Where the app would automatically 301 re-direct the request to:

location-app.com/venues/the-islington-roof-garden-sw1x
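As a rough illustration of how much a Base64 encoding shortens an ID compared with Base16, the sketch below re-encodes the same 128-bit UUID using only Python's standard library. The helper names `shorten_uuid` and `expand_uuid` are made up for this example. Note that this yields 22 characters, not a 4-character ID like `sw1x`, which would need a separate, shorter ID scheme.

```python
import base64
import uuid


def shorten_uuid(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as URL-safe Base64 without padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def expand_uuid(short: str) -> uuid.UUID:
    """Reverse shorten_uuid: restore the padding, decode, rebuild the UUID."""
    padded = short + "=" * (-len(short) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


original = uuid.UUID("123e4567-e89b-12d3-a456-426655440000")
short = shorten_uuid(original)
print(len(str(original)), "characters as Base16 with dashes")
print(len(short), "characters as URL-safe Base64")
```

The encoding is reversible, so the short form could be expanded back to the full UUID on the server before the lookup and redirect.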

Can we "re-think" content/record IDs for both Uniqueness and Usability?

Basic Example: Address Book

The append-only log example: https://github.com/dwyl/phoenix-ecto-append-only-log-example
demonstrated the benefits of using an append-only log for an address book.
But we also explained that any app can benefit from having an immutable log as its data store!
see: https://github.com/dwyl/phoenix-ecto-append-only-log-example#examples-where-an-append-only-log-is-useful

What the example did not cover is how to mitigate against saving the same data multiple times.
For example, consider the form:

| Field   | Value                       |
| ------- | --------------------------- |
| Name    | Bruce Wayne                 |
| Address | 1007 Mountain Drive, Gotham |

If Bruce clicks/taps to edit his address,
and we have an auto-save function to prevent loss of changes,
we need a way of checking on the client whether the address has changed
while he is viewing the edit screen.

Yes, this function should only be triggered by the "onChange" DOM event. But for argument's sake, assume that Bruce clicks the "save" button (which, as we have discovered through UX-testing, people still expect to have despite the "autosave").
Should the server attempt to "re-save" the data that has not changed?

I suggest that by using a hash of the content as the Primary Key, Ecto (or PostgreSQL) would "reject" the insert request as a "duplicate" and we would not waste space in the database/table with dupe data.

If the data has not changed then the ContentID (cid) would be the same so no data insert.
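A minimal sketch of the idea, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL/Ecto. The table name `address_book`, the helpers `content_id`, `open_db` and `save`, and the choice of SHA-256 over the joined fields are all assumptions made for illustration, not the alog implementation.

```python
import hashlib
import sqlite3


def content_id(name: str, address: str) -> str:
    """Hypothetical cid: hex SHA-256 of the fields joined by a unit separator."""
    payload = "\x1f".join([name, address]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def open_db() -> sqlite3.Connection:
    """Create an in-memory table whose primary key is the content hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE address_book (cid TEXT PRIMARY KEY, name TEXT, address TEXT)"
    )
    return conn


def save(conn: sqlite3.Connection, name: str, address: str) -> bool:
    """Insert the record; return False if identical content is already stored."""
    cid = content_id(name, address)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO address_book (cid, name, address) VALUES (?, ?, ?)",
                (cid, name, address),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key constraint rejected the duplicate cid.
        return False


conn = open_db()
print(save(conn, "Bruce Wayne", "1007 Mountain Drive, Gotham"))  # first save
print(save(conn, "Bruce Wayne", "1007 Mountain Drive, Gotham"))  # unchanged data
```

Here the second call is rejected by the primary key constraint, so clicking "save" on unchanged data adds no row. The same check could also be done on the client by comparing the cid of the form contents with the cid of the last saved version.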

Use Case: Distributed Learning Platform

Learning systems universally have "vendor" or "platform" Lock-in.
Do a lesson on
Ask a question on StackOverflow? If your account gets banned for any reason, you lose "ownership" of any questions you had asked.
Want to export your data and take it with you? Tough!

In our "product roadmap" we have begun to detail the creation of an Open Source Distributed Collaborative Learning platform:
https://github.com/dwyl/product-roadmap/blob/master/collaborative-learning-community.md

By using a cid where each item of learning has a unique "fingerprint",
everyone learning on the platform can clearly see what each of their fellow learners has "covered".
If anyone wants to either import or export their learning, they can do it easily.

This is game-changing for making your "learning log" portable from kindergarten to the grave!
No longer will your new school, university or employer/team have to rely on a "report card" or (incomplete) "transcript"; everyone will be able to see that Alex

  • "learned quadratic equations on 2017-01-14 10:09 at Hyde Park Middle School."
    cid: sha2-256-6e6ff7950a36187a801613426e858dce686cd7d7e3c0fc42ee0330072d245c95

When Alex moves from Middle School to High School, all his teachers and classmates can easily see exactly what he has learned.

The teacher can be far more effective because they know what each student has already covered vs. what they still don't grasp. And instead of teaching more advanced topics that half the class will be confused by, they can attempt to cover the areas that still require work.

This is highly relevant in the workplace too!
As a "hiring manager" or "team lead", I want to know exactly what my (potential) team members already know and what they are still trying to learn. I will delegate tasks to them that stretch their current abilities enough to encourage learning, but not so much that it would "overwhelm" them!

Closing the Gender/Class/Minority Divide in STEM/Tech

I feel that having a complete "learning log" will help to eliminate the gulf that exists in STEM/Tech because it will always be immediately obvious who has the knowledge/skill and it won't be a matter of which person has the loudest voice and most over-confidence.

@hyprstack

Maybe use something like a hash, or nanoId?

Or configure the database to generate a non-sequential unique id that is not a uuid when creating the resource?

This is also an interesting blog with some insight into making the uuid more "pretty" for the user.

This might be overkill but you could also, perhaps (I have not tested this), create a base64 decoding/encoding method that uses a salt or an iv to sign and create an AES-192 cipher/decipher of your id.

Another solution would be to ditch the uuid completely and create the resource id based on pre-determined criteria that is designed to be user friendly OR have a linked table that would map directly to your uuid?

@nelsonic
Member Author

nelsonic commented Nov 3, 2018

@hyprstack great points!
Thanks for sharing links and the Coding Horror Blog post; I found the comments especially insightful.

NanoID https://github.com/ai/nanoid (initial commit 5 Aug 2017)

Under the hood it's using crypto.randomFillSync(buffer)
see: https://github.com/ai/nanoid/random.js#L11

Similar to https://github.com/nelsonic/perma (initial commit 14 Feb 2015)
which uses crypto.createHash('sha512')
perma removes "confusing characters" by default and avoids difficult special chars e.g: ~ (tilde)
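To make the "confusing characters" point concrete, here is a small sketch of generating a random ID from a restricted alphabet with Python's `secrets` module. The alphabet below is an assumption chosen for illustration (digits and lowercase letters with the look-alikes `0`, `1`, `i`, `l` and `o` removed); it is not the actual character set or algorithm used by perma or Nano ID.

```python
import secrets

# Assumed alphabet: 2-9 and a-z without the look-alikes 0/o and 1/i/l.
ALPHABET = "23456789abcdefghjkmnpqrstuvwxyz"


def random_id(length: int = 12) -> str:
    """Build an ID by picking each character with a cryptographically secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(random_id())    # a different 12-character ID on every run
print(random_id(6))   # shorter IDs are easier to type but collide sooner
```

Removing characters shrinks the alphabet (31 symbols here instead of 64), so an ID needs to be somewhat longer to keep the same collision probability.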

Nano ID has a good illustration of the Birthday Problem (Probability of ID Collision) to explain ID length:
https://zelark.github.io/nano-id-cc/

Anyone following along needing a recap or intro to the Birthday Paradox,
see: https://betterexplained.com/articles/understanding-the-birthday-paradox

If we could configure/program PostgreSQL to auto-generate the non-sequential ID it would be good
because the responsibility for managing/preventing collisions would be handled by the DB
and we wouldn't need to think about it. 👍

Hashids are good for obfuscating integers to strings and the reversibility is useful for many situations.
The official use case is for "when you don't want to expose your database ids to the user".

My reasoning for not wanting to use Hashids is twofold:

  1. "confusing characters" in the default charset means there will be ambiguity in URLs e.g: 0 vs. O
    see: https://ux.stackexchange.com/questions/53341/are-there-any-letters-numbers-that-should-be-avoided-in-an-id

We could remedy this by providing the charset as the 3rd argument.

  2. Only useful for integers, which implies that the IDs are still integers. In our case we want our DB IDs to be Strings with a low chance of collision. We could use the Hashids as the IDs in the DB, but then generating the IDs would be reversed because we would first need to create an INT which is then fed into the hashids.encode function. This by definition reduces the entropy to the upper bound of JS numbers: 9007199254740991 (16 digits).
    So if we used Hashids in an app creating 1k records/sec (e.g. in a reasonably popular analytics/chat app) there would be a collision after only 4 hours.
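The "4 hours" figure can be sanity-checked with the usual birthday-problem approximation. The sketch below assumes that IDs are drawn uniformly at random from the 2^53 - 1 integers JavaScript can represent safely, and that "a collision" means a 1% chance of at least one collision; the 1% threshold is an assumption here, picked because it reproduces the figure quoted above.

```python
import math


def ids_until_collision(space_size: int, probability: float) -> float:
    """Approximate how many uniformly random IDs can be drawn from a space of
    `space_size` values before the chance of any collision reaches `probability`
    (birthday approximation: n ~ sqrt(2 * N * ln(1 / (1 - p))))."""
    return math.sqrt(2 * space_size * math.log(1 / (1 - probability)))


MAX_SAFE_INTEGER = 2**53 - 1   # 9007199254740991
RATE_PER_SECOND = 1_000        # 1k records/sec, as in the example above

n = ids_until_collision(MAX_SAFE_INTEGER, 0.01)
hours = n / RATE_PER_SECOND / 3600
print(f"{n:,.0f} IDs, about {hours:.1f} hours at {RATE_PER_SECOND} IDs/sec")
```

Under these assumptions the result is a little under four hours. IDs derived from a sequential counter would not collide in this way, so the estimate only applies when the integers themselves are generated randomly.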

I agree that using a cryptographic hash function for creating IDs is a good idea for creating random strings with low chance of collisions.

We considered the use of a salt/iv when hashing - and it's a good suggestion in general when hashing to avoid rainbow table attacks - but it would only be useful in the case where we wanted to make ID reversibility (by an attacker) more difficult.
I'm not discounting the use-case of a salt when hashing content for ID creation yet, just not having it for "MVP" because I don't currently envisage ID cracking to be the weakest link in the system. 🤔

Requirements ?

I can summarise the requirements I have in mind in the following 3 statements:

  • Distributed (so cannot be Auto-incrementing/Serial INT or collisions would be guaranteed)
  • URL-safe and Human-friendly (easy to type and short to share on social, sms etc.)
  • Record Revision Tracking or "Blockchain" capable - "parent" node reference in ID ...
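One way the third requirement could be sketched: include the parent's ID in the data that gets hashed, so each revision's ID commits to its whole history. This is an illustrative assumption, not a settled design; the function name `revision_cid`, the JSON serialisation and the use of SHA-256 are all hypothetical choices.

```python
import hashlib
import json
from typing import Optional


def revision_cid(content: dict, parent: Optional[str]) -> str:
    """Hash the content together with the parent revision's cid (None for the first)."""
    payload = json.dumps(
        {"content": content, "parent": parent},
        sort_keys=True,          # stable key order so equal data hashes equally
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


v1 = revision_cid({"name": "Bruce Wayne", "address": "1007 Mountain Drive, Gotham"}, None)
v2 = revision_cid({"name": "Bruce Wayne", "address": "Wayne Manor, Gotham"}, v1)
print(v1)
print(v2)
```

A trade-off worth noting: once the parent is part of the hash, identical content saved at two different points in the history gets two different IDs, so this variant no longer deduplicates on content alone. The full hex digest is also far from short, so a human-friendly form would need truncation or a denser encoding.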

I think I need to start writing some code to illustrate my thoughts on this. 💭 |>📝
Thank you very much for chiming in with relevant links and insight @hyprstack 🥇
Hope your weekend is going well. 🎯

@nelsonic
Member Author

nelsonic commented Dec 7, 2018

Discovered https://github.com/nocursor/ex-cid/ 🎉
while reading: https://github.com/ipld/cid#implementations via: dwyl/learn-ipfs#6 (comment)
Tried to clone and run it locally ... failed. 😞
Opened issue: nocursor/ex-cid#1 (to ask for help from the author!) ⏳

@nelsonic
Member Author

Sadly, after a week of waiting for the author of ex-cid to respond to this issue: nocursor/ex-cid#1
I feel that I have no other choice ... I cannot afford to keep waiting indefinitely for this to work. ⏳
I can either fork it, dramatically improve the quality of the tests, code and documentation, and then wait for the author to respond to the issue, or I can finish my own implementation. 🤔

Sadly, ex-cid was accepted as the "recommended" elixir implementation in this PR: multiformats/multibase#39 so now in addition to duplicating effort I have to inform the Multiformats community that the rush to publish ex-cid and subsequent non-responsiveness from the author is hindering adoption ... 😢

I hate it when people rush code and don't make the effort to add reliability guarantees via continuous integration with 100% test coverage and comprehensive examples.
I've had to "fill in the gaps" a couple of times before ...

e.g:

But I definitely don't "like" doing it! I would much rather someone else write and maintain this!
I want to build Incredible User Experiences/Interfaces not infrastructure...! 🙄

Anyway, GOTO: #11

@nelsonic
Member Author

Update on nocursor/ex-cid#1

More than a month has passed since I opened the issue nocursor/ex-cid#1 informing the author/creator that the code does not compile nocursor/ex-cid#1 (comment) thus rendering it unusable.

@SimonLab created the Pull Request that would fix the failing test and make the package usable: nocursor/ex-cid#2
but there has been no response from the author.

As much as I would like to avoid duplicating effort, I/we (@dwyl & the wider Elixir community) cannot afford to rely on a package that has an unresponsive author. 😞
The cid functionality is going to be at the core of all the code we will be writing for the next few years and if a bug (or security vulnerability) is discovered and we are unable to fix it because the "owner" of the package is AWOL, that will have a disastrous impact on our work.

I feel that this has left us with little choice but to re-do the work on cid (and any dependencies) to ensure that the project is maintained.

Again, GOTO: #11
