One database for every experiment #309
Comments
I was approaching this differently, but I haven't given it much thought to be honest. I haven't looked at it in detail, but FREVA (Germany) uses a combination of a SQL database (MariaDB) and an enterprise search service like Solr. Solr has some capabilities to do live indexing. Could be worth looking into that. |
I think one database is still the way to go. Indexing is the issue. Breaking it into multiple DBs introduces as many new points of failure. |
This approach is not only embarrassingly parallel, but also potentially asynchronous: when new data is synced to an experiment directory, that experiment's DB can be updated at the same time. It is a "separation of concerns" argument: each experiment is responsible for keeping its DB up to date. Hence it is modular and usable in different ways: access a single experiment DB directly, or discover it via a database of databases. Potentially this approach also puts less stress on the Lustre metadata servers (but I know very little about Lustre): finding all new files to index in a single experiment directory is asking a lot less of the filesystem than doing the same thing over many subdirectories.
This approach might also scale better. It would be possible to have multiple meta-DBs that contain different collections of data, with overlap between them, e.g. Antarctic datasets, BGC datasets. This could be seen as a negative, with balkanisation of data, but as the number of experiments grows it might not be feasible to have them all in a single DB.
It is also possible to see it the other way around: a single point of failure is bad, so multiple separate DBs are potentially a good thing. If one DB gets corrupted, it's no big deal to reindex.
Some drawbacks to one DB per experiment:
- Duplication: meta-data is duplicated across databases, e.g. variable names, attributes etc.
- Version synchronisation: the database schema can be updated, which would require tools to either robustly support old versions (potentially a lot of work), or all the individual DBs to be updated. This is not nice. |
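The per-experiment indexing step described above can be sketched as follows. This is a minimal illustration, not the Cookbook's actual implementation; the function name and the use of a stored "last indexed" timestamp are assumptions.

```python
# Hypothetical sketch: find files in a single experiment directory that are
# newer than the experiment's last index time. Scanning one directory tree
# per experiment keeps each index job small and independent.
from pathlib import Path


def find_unindexed(experiment_dir, last_indexed_mtime):
    """Return .nc files modified since the experiment was last indexed."""
    return [
        p for p in Path(experiment_dir).rglob("*.nc")
        if p.stat().st_mtime > last_indexed_mtime
    ]
```

Because each experiment is scanned independently, these jobs can run in parallel or be triggered only when that experiment's data changes.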
Sounds like you are trying to solve the problem of sqlite not scaling well nor allowing concurrent access by using many sqlite databases. Sure, it could work, but it does look like an ugly hack if you consider that there are other SQL implementations out there that don't have these issues... Maybe it would be time to have a serious discussion with NCI about this? |
For sqlite concurrent access, has anybody enabled write-ahead logging? Particularly because it's only reader/writer concurrency we care about:
|
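For reference, enabling write-ahead logging is a one-line pragma. A minimal sketch (the database path is a placeholder):

```python
# Enable write-ahead logging on an sqlite database so one writer and
# multiple readers can work concurrently. WAL mode is persistent: once
# set on a database file, later connections inherit it.
import sqlite3


def enable_wal(db_path):
    """Switch the database at db_path to WAL mode; return the resulting mode."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.close()
    return mode
```

Note that WAL requires all connections to be on the same host with a filesystem that supports shared-memory mapping, which is a relevant caveat on a network filesystem like Lustre.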
Sort of, but once I thought about it I realised there are other things going for it, as I said above. I like the idea of experiments updating their own DB. It has some graceful degradation: the main DB is not required to access a specific experiment if the location of the experiment is known.
There is also a pretty horrendous name-space issue already happening, where experiments have increasingly convoluted names for uniqueness. Clearly some of that comes from needing unique names in a single shared database. I honestly think this might be a better design, one that also happens to help with scaling and indexing.
From an ACCESS-NRI point of view it might be desirable to generate a meta-DB/index-DB that only contains "supported" or published datasets, whereas the COSIMA community will want to have the bleeding edge stuff as well. Equally there may be subject-specific index-DBs, e.g. land surface, CMIP6. You can have datasets appearing in more than one index-DB; there is virtually no cost to this once each experiment has its own index. I'm not insisting subject-specific index-DBs are necessarily the way to go, but you could not even think of doing this with the current approach. It would just be too cumbersome to index an experiment more than once, and it would maybe get the filesystem folks at NCI quite upset.
Sure. They're already talking about some indexing stuff, and @rbeucher had a meeting with them about this. Probably best to discuss offline I would say.
Yeah I did look into that, but I think:
is a show-stopper, but happy to be set straight if I've got the wrong end of the stick. |
I don't want to keep banging on, but with the approach I'm advocating in this issue an error such as COSIMA/master_index#30 no longer stuffs up indexing for other experiments, as the error is contained to the problematic experiment. Just sayin' |
I don't like it... I second @micaeljtoliveira calling it an ugly hack... ;-) |
I am feeling bullied :p |
Glad you mentioned it. In this case I would actually argue that the underlying problem lies with the indexing script, not the database. In my opinion the indexing script should not return an error code in the first place when failing to index a file, but instead issue a warning. It should only return an error for something that could compromise the integrity of the database, like losing the connection to the database while updating it. |
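The error-handling policy described above (warn and skip per-file failures, rather than abort the whole run) could be sketched like this. The `index_file` callback and its exception behaviour are illustrative placeholders, not the Cookbook's actual API:

```python
# Sketch: per-file indexing failures are logged as warnings and skipped,
# so one bad file cannot abort indexing of an entire experiment.
import logging

logger = logging.getLogger("indexer")


def index_files(paths, index_file):
    """Index each path with index_file(); warn and continue on failure."""
    indexed, skipped = [], []
    for path in paths:
        try:
            index_file(path)
            indexed.append(path)
        except Exception as exc:
            logger.warning("Could not index %s: %s", path, exc)
            skipped.append(path)
    return indexed, skipped
```

An error exit code would then be reserved for failures that affect the database itself, not individual files.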
Agreed. If we're throwing around the epithet "ugly hack", the original shell script for this qualifies in spades (to be clear, I am referring to the script I wrote). Hasn't changed my mind about separately indexing experiments though :) |
In case it isn't obvious, you could also include a DB update step in the sync script. Each experiment DB is then only updated when the contents of the directory are updated. The Cookbook DB script then just has to check a number (50-200?) of experiment DBs and see which have been updated since the last time they were slurped in. That is a significant reduction in disk access. |
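The meta-DB refresh check described above amounts to comparing each experiment DB's modification time against the time it was last ingested. A hypothetical sketch (the function name and the `last_slurped` mapping are assumptions for illustration):

```python
# Sketch: decide which experiment DBs need re-ingesting into the meta-DB
# by comparing file mtimes against a record of when each was last slurped.
import os


def stale_experiment_dbs(db_paths, last_slurped):
    """Return experiment DBs modified since they were last ingested."""
    return [
        p for p in db_paths
        if os.path.getmtime(p) > last_slurped.get(p, 0.0)
    ]
```

Checking the mtime of a few hundred DB files is far cheaper than re-walking every experiment's data directory.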
Another advantage of the database-per-experiment approach might be that people could use it more easily for their personal simulations (that they don't necessarily plan to share with others). |
I am feeling some momentum building .. :) |
I haven't weighed in, because I don't feel qualified. But if you want a non-expert opinion, it sounds like a great solution to me Aidan! This has been an ongoing issue for some time now, and this approach of multiple databases is very easy to implement and could fix our problems tomorrow. I often have my own databases anyway, because of new simulations that aren't in the master and issues with the existing database not updating or being available.
|
Will querying a database of databases slow down the explorer? |
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: |
My assumption was that it would speed it up, as I'm proposing the DBDB/index would contain significantly less information than it does currently. It would only have enough info to help users decide which experiment to use. This is pretty much how the DB explorer currently works: it creates a mapping from variable to experiment, to enable users to then decide which experiment to query. |
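The variable-to-experiment mapping described above only needs a small schema. A sketch under assumptions (table and column names are illustrative, not the Cookbook's actual schema):

```python
# Sketch of a minimal index DB: just enough to map a variable name to the
# experiments (and their DB paths) that provide it.
import sqlite3


def build_index(entries):
    """entries: iterable of (variable, experiment, db_path) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idx (variable TEXT, experiment TEXT, db_path TEXT)"
    )
    conn.executemany("INSERT INTO idx VALUES (?, ?, ?)", entries)
    return conn


def experiments_with(conn, variable):
    """Return sorted experiment names that provide the given variable."""
    rows = conn.execute(
        "SELECT DISTINCT experiment FROM idx WHERE variable = ?", (variable,)
    )
    return sorted(r[0] for r in rows)
```

Detailed per-file metadata stays in each experiment's own DB; the index only answers "which experiment should I look at?".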
I think this approach is somewhat like what is used for the Pangeo Intake catalogue. They have a master catalogue that references other catalogues of "climate" (CMIP), "oceanographic" etc datasets. |
I have looked at using the COSIMA-Cookbook to index the ACCESS-ESM/CM experiments. I spent a bit of time looking at the Lustre architecture and found an article about file metadata indexing that has some good ideas somewhat related to the idea of having multiple databases. There is also some information about sharding in the SQLAlchemy documentation. Anyway, indexing is a priority for the MED Team, so we will definitely look into this. |
:| |
You know what I mean 🤪 |
What about moving to a model where every experiment has its own DB file? The central DB would query each individual DB for the information it needs for explorer functions.
Whaddya reckon @angus-g @rbeucher @micaeljtoliveira
(Yes I did think about putting this on the forum, and could still do so, but thought it was quite technical)
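One way the central DB could query the individual experiment DB files is sqlite's `ATTACH DATABASE`, which exposes several database files under one connection. A sketch under assumptions: the table name `files` and the paths are illustrative, not the Cookbook's actual schema.

```python
# Sketch: attach each experiment's DB file to one connection and query it.
# Here we just count rows in an assumed "files" table per experiment.
import sqlite3


def attach_and_count(main_path, experiment_dbs):
    """Attach each experiment DB and count rows in its files table."""
    conn = sqlite3.connect(main_path)
    counts = {}
    for i, path in enumerate(experiment_dbs):
        alias = f"exp{i}"
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (path,))
        counts[path] = conn.execute(
            f"SELECT COUNT(*) FROM {alias}.files"
        ).fetchone()[0]
    conn.close()
    return counts
```

Note sqlite limits the number of simultaneously attached databases (10 by default, configurable at compile time), so for 50-200 experiments the central script would attach and detach in batches, or query each DB with a separate connection.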