One database for every experiment #309

Open
aidanheerdegen opened this issue Nov 4, 2022 · 23 comments
Labels
🥞 database Related to the structure of the database itself 🧜🏽‍♀️ enhancement

Comments

@aidanheerdegen
Collaborator

What about moving to a model where every experiment has its own DB file? The central DB would query each individual DB for the information it needed for explorer functions (a rough sketch follows the list below).

  • Potentially scales better: main DB keeps a minimal amount of data required to discover experiments
  • Embarrassingly parallel indexing
  • Aligns better with other cataloguing approaches, like intake, which are discrete per-experiment catalogues
  • Support intake catalogues as a source for the cookbook meta-catalogue
  • Maybe look at just using intake for the per-experiment DB/catalogue, and support the best of all worlds
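
For concreteness, a minimal sketch of the kind of lookup the meta-DB could do, deferring the detail to each experiment's own SQLite file. The meta-DB path, table names and per-experiment `variables` schema below are illustrative assumptions, not the existing cookbook schema:

```python
import sqlite3

META_DB = "/g/data/ik11/databases/meta.db"  # hypothetical location of the meta-DB


def experiments_with_variable(varname):
    """Return paths of experiments whose own index DB contains `varname`."""
    meta = sqlite3.connect(META_DB)
    hits = []
    for expt_path, db_path in meta.execute(
        "SELECT experiment_path, db_path FROM experiments"
    ):
        expt_db = sqlite3.connect(db_path)
        (count,) = expt_db.execute(
            "SELECT COUNT(*) FROM variables WHERE name = ?", (varname,)
        ).fetchone()
        expt_db.close()
        if count:
            hits.append(expt_path)
    meta.close()
    return hits
```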

Whaddya reckon @angus-g @rbeucher @micaeljtoliveira

(Yes I did think about putting this on the forum, and could still do so, but thought it was quite technical)

@aidanheerdegen aidanheerdegen added 🧜🏽‍♀️ enhancement 🥞 database Related to the structure of the database itself labels Nov 4, 2022
@rbeucher
Contributor

rbeucher commented Nov 4, 2022

I was approaching this differently, but I haven't given it much thought to be honest.
The main issue in my opinion is that the COSIMA Cookbook reindexes everything periodically, while we only want to add new experiments.

I have not looked in detail, but FREVA (Germany) uses a combination of an SQL database (MariaDB) and an enterprise search service like Solr. Solr has some capability to do live indexing. Could be worth looking into that.

@rbeucher
Contributor

rbeucher commented Nov 4, 2022

I think one database is still the way to go. Indexing is the issue. Breaking it into multiple DBs introduces as many new points of failure.

@aidanheerdegen
Collaborator Author

What are you trying to solve by doing this?

  • Slow indexing
  • Monolithic design
  • Scaling

This approach is not only embarrassingly parallel, but also potentially asynchronous: when syncing new data to /g/data the same script can then run a command to reindex the data.

It is a "separation of concerns" argument: each experiment is responsible for keeping it's DB up to date. Hence modular and usable in different ways: access a single experiment DB directly, or discover via database of databases.

Potentially this approach also puts less stress on the Lustre metadata servers (but I know very little about Lustre). Finding all new files to index by doing this

find . -newer index.db

in a single experiment directory is asking a lot less of the filesystem than doing the same thing over many subdirectories.
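
The equivalent step in Python terms, scoped to one experiment directory, might look something like this (the index.db name and the *.nc glob are assumptions for illustration):

```python
from pathlib import Path


def files_to_reindex(expt_dir):
    """Outputs in a single experiment directory newer than its index.db."""
    expt_dir = Path(expt_dir)
    db = expt_dir / "index.db"
    since = db.stat().st_mtime if db.exists() else 0.0
    return [p for p in expt_dir.rglob("*.nc") if p.stat().st_mtime > since]
```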

This approach might also scale better. It would be possible to have multiple meta-DBs that contain different collections of data, but there can be overlap between them. e.g. Antarctic datasets, BGC datasets. This could be seen as a negative, with balkanisation of data, but as the number of experiments grows it might not be feasible to have them all in a single DB.

Breaking it into multiple DBs introduces as many new points of failure.

It is possible to see it the other way around: a single point of failure is bad, and multiple separate DBs are potentially a good thing. If one DB gets corrupted it's no big deal to reindex it.

Some drawbacks to one DB per experiment:

Duplication

Metadata is duplicated across databases, e.g. variable names, attributes, etc.

Version synchronisation

The database schema can be updated, which would require the tools either to robustly support old versions (potentially a lot of work), or all the individual DBs would have to be updated. Neither is nice.
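
One way to soften the version-synchronisation drawback would be for each experiment DB to record its schema version, so the tools can refuse (or migrate) on a mismatch. A minimal sketch, assuming a hypothetical schema_info table:

```python
import sqlite3

SUPPORTED_SCHEMA = 3  # hypothetical current schema version of the tools


def check_schema(db_path):
    """Raise if an experiment DB was built with a different schema version."""
    conn = sqlite3.connect(db_path)
    (version,) = conn.execute("SELECT version FROM schema_info").fetchone()
    conn.close()
    if version != SUPPORTED_SCHEMA:
        raise RuntimeError(
            f"{db_path}: schema v{version}, tools expect v{SUPPORTED_SCHEMA}; "
            "reindex or migrate this experiment"
        )
```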

@micaeljtoliveira
Contributor

Sounds like you are trying to solve the problem of sqlite not scaling well nor allowing concurrent access by using many sqlite databases. Sure, it could work, but it does look like an ugly hack if you consider that there are other SQL implementations out there that don't have these issues... Maybe it would be time to have a serious discussion with NCI about this?

@angus-g
Collaborator

angus-g commented Nov 10, 2022

For sqlite concurrent access, has anybody enabled write-ahead logging? Particularly because it's only reader/writer concurrency we care about:

WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
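
For reference, WAL is switched on with a single pragma, and the journal mode is persistent, so it only needs to be set once per database file (the path below is illustrative):

```python
import sqlite3

conn = sqlite3.connect("cosima_master.db")  # illustrative path
conn.execute("PRAGMA journal_mode=WAL")     # persistent: survives across connections
conn.close()
```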

@aidanheerdegen
Collaborator Author

Sounds like you are trying to solve the problem of sqlite not scaling well nor allowing concurrent access by using many sqlite databases. Sure, it could work, but it does look like an ugly hack if you consider that there are other SQL implementations out there that don't have these issues...

Sort of, but once I thought about it I realised there are other things going for it, as I said above. I like the idea of experiments updating their own DB. It has some graceful degradation: the main DB is not required to access a specific experiment if the location of the experiment is known.

There is also a pretty horrendous name-space issue already happening, where experiments have increasingly convoluted names for uniqueness. Clearly some of that is from payu, but even if that weren't the case it would be necessary for indexing into a single monolithic DB. I'd brought up the idea of "collections" previously: another level of organisation on top of experiments. They are a common idea in data catalogues. The different index-DBs naturally map to the idea of a collection.

I honestly think this might be a better design, but that also happens to help with scaling and indexing.

From an ACCESS-NRI point of view it might be desirable to generate a meta-DB/index-DB that only contains "supported" or published datasets, whereas the COSIMA community will want to have the bleeding-edge stuff as well.

Equally there may be subject-specific index-DBs, e.g. land surface, CMIP6. You can have datasets appearing in more than one index-DB; there is virtually no cost to this once each experiment has its own index.

I'm not insisting subject-specific index-DBs are necessarily the way to go, but you could not even contemplate doing this with the current approach: it would be too cumbersome to index an experiment more than once, and it might get the filesystem folks at NCI quite upset.

Maybe it would be time to have a serious discussion with NCI about this?

Sure. They're already talking about some indexing stuff, and @rbeucher had a meeting with them about this. Probably best to discuss offline I would say.

For sqlite concurrent access, has anybody enabled write-ahead logging?

Yeah I did look into that, but I think:

  1. All processes using a database must be on the same host computer; WAL does not work over a network filesystem.

is a show-stopper, but happy to be set straight if I've got the wrong end of the stick.

@aidanheerdegen
Collaborator Author

I don't want to keep banging on, but with the approach I'm advocating in this issue an error such as COSIMA/master_index#30 no longer stuffs up indexing for other experiments, as the error is contained to the problematic experiment.

Just sayin'

@rbeucher
Contributor

I don't like it... I second @micaeljtoliveira calling it an ugly hack... ;-)
Let's explore a bit more...

@aidanheerdegen
Collaborator Author

I second @micaeljtoliveira calling it an ugly hack... ;-)

I am feeling bullied :p

@micaeljtoliveira
Contributor

I don't want to keep banging on, but with the approach I'm advocating in this issue an error such as COSIMA/master_index#30 no longer stuffs up indexing for other experiments, as the error is contained to the problematic experiment.

Glad you mentioned it. In this case I would actually argue that the underlying problem lies with the indexing script, not the database. In my opinion the indexing script should not return an error code in the first place when failing to index a file, but instead issue a warning. It should only return an error for something that could compromise the integrity of the database, like losing the connection to the database while updating it.

@aidanheerdegen
Collaborator Author

In this case I would actually argue that the underlying problem lies with the indexing script, not the database. In my opinion the indexing script should not return an error code in the first place when failing to index a file, but instead issue a warning. It should only return an error for something that could compromise the integrity of the database, like losing the connection to the database while updating it.

Agreed. If we're throwing around the epithet "ugly hack" the original shell script for this qualifies in spades (to be clear, I am referring to the script I wrote).

Hasn't changed my mind about separately indexing experiments though :)

@aidanheerdegen
Collaborator Author

In case it isn't obvious, you could also add a DB update step to the sync script. Each experiment DB is then only updated when the contents of the directory are updated.

The Cookbook DB script then just has to check a modest number (50–200?) of experiment DBs and see which have been updated since the last time they were slurped in.

That is a significant reduction in disk access.
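
A rough sketch of that check, assuming the meta-DB records the mtime each experiment DB had when it was last ingested (table and column names are made up):

```python
import os
import sqlite3


def stale_experiment_dbs(meta_db_path):
    """Yield experiment DBs that have changed since they were last slurped in."""
    meta = sqlite3.connect(meta_db_path)
    rows = meta.execute(
        "SELECT db_path, last_ingested_mtime FROM experiments"
    ).fetchall()
    meta.close()
    for db_path, last_ingested_mtime in rows:
        if os.path.getmtime(db_path) > last_ingested_mtime:
            yield db_path
```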

@aidanheerdegen
Collaborator Author

Also can't fathom why I hadn't invoked this before

[embedded image]

@rmholmes

Another advantage of the database-per-experiment approach might be that people could use it more easily for their personal simulations (that they don't necessarily plan to share with others).

@aidanheerdegen
Collaborator Author

Another advantage of the database-per-experiment approach might be that people could use it more easily for their personal simulations (that they don't necessarily plan to share with others).

I am feeling some momentum building .. :)

@adele-morrison

adele-morrison commented Nov 17, 2022 via email

@aekiss
Collaborator

aekiss commented Nov 17, 2022

Will querying a database of databases slow down the explorer?

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/access-nri-postgres-db-for-capturing-provenance-usage-metrics-and-cosima-cookbook-index/153/7

@aidanheerdegen
Collaborator Author

Will querying a database of databases slow down the explorer?

My assumption was that it would speed it up, as I'm proposing the DBDB/index would contain significantly less information than it does currently. It would only have enough info to help users decide which experiment to use. This is pretty much how the DB explorer currently works: it creates a mapping from variable to experiment, to enable users to then decide which experiment to query.
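
A sketch of how small that mapping could be, assuming a hypothetical variable_map table in the meta-DB and a variables table in each experiment DB (neither is the actual cookbook schema):

```python
import sqlite3


def build_variable_map(meta_db_path, experiment_dbs):
    """experiment_dbs maps an experiment name to the path of its index DB."""
    meta = sqlite3.connect(meta_db_path)
    meta.execute(
        "CREATE TABLE IF NOT EXISTS variable_map "
        "(variable TEXT, experiment TEXT, UNIQUE(variable, experiment))"
    )
    for expt, db_path in experiment_dbs.items():
        edb = sqlite3.connect(db_path)
        for (var,) in edb.execute("SELECT DISTINCT name FROM variables"):
            meta.execute(
                "INSERT OR IGNORE INTO variable_map VALUES (?, ?)", (var, expt)
            )
        edb.close()
    meta.commit()
    meta.close()
```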

@dougiesquire
Collaborator

I think this approach is somewhat like what is used for the Pangeo Intake catalogue. They have a master catalogue that references other catalogues of "climate" (CMIP), "oceanographic" etc datasets.

@rbeucher
Contributor

rbeucher commented Feb 1, 2023

I have looked at using the COSIMA-Cookbook to index the ACCESS-ESM/CM experiments.
That can work but it is very slow and inefficient.

I spent a bit of time looking at the Lustre architecture and found this article:
https://people.cs.vt.edu/~butta/docs/CCGrid2020-BRINDEXER.pdf

This is about file metadata indexing but it has some good ideas that are somewhat related to the idea of having multiple databases.
I would probably not go for a one database per experiment design as suggested by @aidanheerdegen, but database sharding using a simple RDBMS like SQLite could work.
It does not look too complicated to set up.

There is some information about sharding in the SQLAlchemy documentation:
https://docs.sqlalchemy.org/en/14/_modules/examples/sharding/separate_databases.html
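
As a rough illustration (not the SQLAlchemy horizontal-shard extension linked above), experiments could be routed to a fixed set of SQLite files by hashing the experiment name; the shard paths and experiment name below are made up:

```python
import sqlite3
from zlib import crc32

N_SHARDS = 8
SHARD_PATHS = [f"/g/data/ik11/databases/shard_{i}.db" for i in range(N_SHARDS)]


def shard_for(experiment):
    """Deterministically pick the shard file that holds a given experiment."""
    return SHARD_PATHS[crc32(experiment.encode()) % N_SHARDS]


# e.g. all reads and writes for this (made-up) experiment go to one shard file
conn = sqlite3.connect(shard_for("my_expt_01deg_iaf"))
```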

Anyway, indexing is a priority for the MED team so we will definitely look into this.

@aidanheerdegen
Collaborator Author

I would probably not go for a one database per experiment design as suggested by @aidanheerdegen

:|

@rbeucher
Contributor

rbeucher commented Feb 1, 2023

You know what I mean 🤪
