A concept of default MerkleDb instance seems redundant #17002

Open
artemananiev opened this issue Dec 9, 2024 · 0 comments
Labels: Platform Data Structures, Platform Virtual Map, Platform (tickets pertaining to the platform), Tech Debt Reduced (issues which reduce technical debt)

artemananiev commented Dec 9, 2024

Every virtual map has a data source, and all data sources for the virtual maps in the state are stored in a single folder on disk. This folder is a MerkleDb instance. I claim that the MerkleDb instance concept is not needed: it adds complexity without providing any benefit. Each data source can reside in its own temp folder, as it did previously.

There are several good reasons for this:

  • Once all virtual maps are combined into a mega map, there will be just one map and one data source
  • Taking a snapshot. Data source snapshots are taken one by one anyway, even though they all belong to a single MerkleDb instance
  • Restoring a snapshot. When a data source snapshot is restored, it is restored into the default MerkleDb instance. On real consensus nodes this is not an issue, since the process has only one instance, created at startup. In tests, however, there may be multiple instances, and tests often conflict with each other. This is why the MerkleDb.resetDefaultInstancePath() workaround exists, but it's ugly
  • Data source copies. During reconnect, a data source copy is created for every virtual map, on both the teacher and the learner side. These copies are put into the same default MerkleDb instance, but they are short-lived and should be removed once the reconnect is done. Copies have the same table names as the corresponding data sources, so tables need unique IDs to tell them apart. ID tracking is also messy
  • A MerkleDb database has its own metadata: a set of table IDs and per-table metadata. This database metadata isn't used for any purpose (other than, maybe, some integrity checks), yet it's a source of constant pain: all the "table already exists" and "table doesn't exist" exceptions in tests are very hard to debug
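
To illustrate the test-conflict problem described above, here is a toy model of a process-wide default instance. It is only a sketch: `SharedInstanceDemo`, `createTable` and `resetDefault` are hypothetical names made up for this example and are not the real MerkleDb API. It shows how two unrelated callers that reuse a table name collide unless someone remembers to reset the shared state.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a process-wide default database instance; NOT the real MerkleDb API. */
final class SharedInstanceDemo {
    // One registry shared by every caller in the process, like a default instance.
    private static Map<String, Integer> defaultTables = new HashMap<>();
    private static int nextTableId = 0;

    /** Registers a table in the shared registry; fails if the name is already taken. */
    static int createTable(String name) {
        if (defaultTables.containsKey(name)) {
            throw new IllegalStateException("Table already exists: " + name);
        }
        int id = nextTableId++;
        defaultTables.put(name, id);
        return id;
    }

    /** Hypothetical analogue of resetting the default instance between tests. */
    static void resetDefault() {
        defaultTables = new HashMap<>();
    }

    public static void main(String[] args) {
        createTable("accounts"); // first test creates its table
        try {
            createTable("accounts"); // second, unrelated test reuses the name
        } catch (IllegalStateException e) {
            System.out.println("Conflict: " + e.getMessage());
        }
        resetDefault(); // the workaround every test has to remember
        createTable("accounts");
        System.out.println("After reset the same name can be used again");
    }
}
```

The failure depends only on what ran earlier in the same process, which is why such errors are hard to trace back to their cause.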

This ticket proposes a much simpler approach, which is somewhat similar to what was used in JasperDB times:

  • Each data source has a dedicated folder on disk to store its data files
  • These folders are all in temp file space
  • Data source copies just create new independent temp folders
  • There must be a way to snapshot a data source to a specific folder (snapshot folder / table name)
  • To restore from a snapshot, yet another temp folder is used
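
The points above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: `FolderDataSource` and all of its methods are hypothetical names, and plain file copies stand in for whatever the real snapshot and copy logic would do.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Hypothetical data source that owns a dedicated temp folder; NOT the real MerkleDb API. */
final class FolderDataSource {
    private final String tableName;
    private final Path storageDir;

    private FolderDataSource(String tableName, Path storageDir) {
        this.tableName = tableName;
        this.storageDir = storageDir;
    }

    /** Each data source gets its own folder in temp file space. */
    static FolderDataSource create(String tableName) throws IOException {
        return new FolderDataSource(tableName, Files.createTempDirectory(tableName + "-"));
    }

    /** A copy (e.g. for reconnect) is just another independent temp folder. */
    FolderDataSource copy() throws IOException {
        FolderDataSource copy = create(tableName);
        copyTree(storageDir, copy.storageDir);
        return copy;
    }

    /** Snapshots this data source to snapshotDir / tableName. */
    Path snapshot(Path snapshotDir) throws IOException {
        Path target = snapshotDir.resolve(tableName);
        copyTree(storageDir, target);
        return target;
    }

    /** Restores from snapshotDir / tableName into yet another temp folder. */
    static FolderDataSource restore(Path snapshotDir, String tableName) throws IOException {
        FolderDataSource restored = create(tableName);
        copyTree(snapshotDir.resolve(tableName), restored.storageDir);
        return restored;
    }

    Path storageDir() {
        return storageDir;
    }

    // Recursively copies a directory tree; parents are visited before children.
    private static void copyTree(Path from, Path to) throws IOException {
        Files.createDirectories(to);
        try (Stream<Path> paths = Files.walk(from)) {
            for (Path src : (Iterable<Path>) paths::iterator) {
                Path dst = to.resolve(from.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FolderDataSource source = create("accounts");
        Files.writeString(source.storageDir().resolve("data.bin"), "v1");
        Path snapshotRoot = Files.createTempDirectory("snapshot-");
        source.snapshot(snapshotRoot);
        FolderDataSource restored = restore(snapshotRoot, "accounts");
        System.out.println(Files.readString(restored.storageDir().resolve("data.bin")));
    }
}
```

In this sketch there is no shared registry and no table IDs: two data sources with the same table name never see each other, because each one only knows its own folder.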