Implement a CLI #53
Comments
@timrobertson100 Nice! How about a second command:
|
I'm not entirely sure how that would work. The input is a zip file that, on opening, requires file sorting in order to implement joins without reading the whole thing into memory. The reader will also disregard supplementary files not of interest from the zip manifest. Streaming the output would be no issue. I think it'd need to be:
Or am I missing something, please? |
I can see your point about the auxiliary files. However, I figure you can stream the content of the DwC-A into temporary files instead of keeping it in memory. If the meta.xml occurs first, then you can already start ignoring auxiliary files and even start building / sorting that star schema model. My main point is that, especially given the size of most datasets, streaming processing is pretty important to keep storage/memory overhead low, as you noticed in the Jenkins/Spark setup you have now. |
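To make the "ignore auxiliary files once meta.xml is known" idea concrete, here is a minimal sketch. It is not how dwca-io does it; the function names `referenced_files` and `extract_referenced` are made up for illustration. It also assumes a seekable archive, because Python's `zipfile` reads the zip central directory rather than consuming a pure stream, so it only approximates the streaming case discussed above.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
from typing import List, Set

# Namespace used by Darwin Core Archive meta.xml descriptors.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def referenced_files(meta_xml: bytes) -> Set[str]:
    """Return the data file names that meta.xml declares for core and extensions."""
    root = ET.fromstring(meta_xml)
    return {
        loc.text.strip()
        for loc in root.iter(DWC_TEXT_NS + "location")
        if loc.text
    }


def extract_referenced(archive, target_dir: Path) -> List[Path]:
    """Copy only the files listed in meta.xml into target_dir, chunk by chunk.

    Hypothetical helper: `archive` is a path or a seekable file-like object.
    """
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        wanted = referenced_files(zf.read("meta.xml"))
        for info in zf.infolist():
            if info.filename not in wanted:
                continue  # auxiliary file (e.g. eml.xml), never written to disk
            out_path = target_dir / Path(info.filename).name
            with zf.open(info) as src, open(out_path, "wb") as dst:
                # Fixed-size chunks keep memory use flat regardless of file size.
                while True:
                    chunk = src.read(64 * 1024)
                    if not chunk:
                        break
                    dst.write(chunk)
            extracted.append(out_path)
    return extracted
```

Only the data files named in the descriptor reach the temporary directory, so the later sort and join steps never see the supplementary files.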
Perhaps worth a hacking session . . . ; ) |
Thanks @jhpoelen
You may not be aware, but this is what the library already does internally when there are extensions (it uses a sort to temporary files followed by a streaming join). When no extensions exist it just streams directly from the file. Large archives will run on minimal memory and I still don't see an obvious way that this can be improved. The main issues I'm aware of for efficient DwC-A reading are 1) the sorting and 2) the deflate compression step, as it is applied to individual files and not to chunks of the files. Both of these are inherently single-threaded operations and are where the processing overhead comes from. Sorting is better in Linux environments if there is access to a native sort. These limitations are why GBIF converts DwC-A into Avro as a first step, after which we can run processing in parallel - e.g. in Spark. I suspect this is what you're referring to, but I don't think that is an issue with this library. |
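The "sort to temporary files, then streaming join" approach can be sketched roughly as follows. This is an illustration of the general technique only, not the dwca-io implementation: the helper names are hypothetical, the files are assumed to be headerless tab-separated text with unique core ids, and the sort here loads one file into memory for brevity where a real tool would use an external merge sort or a native sort.

```python
import itertools
from pathlib import Path
from typing import Iterator, List, Tuple

Row = List[str]


def read_rows(path: Path) -> Iterator[Row]:
    """Stream tab-separated rows from a file, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")


def sort_to_temp(src: Path, key_col: int, tmp_dir: Path) -> Path:
    """Write a copy of src sorted by key_col (in-memory sort, for brevity only)."""
    rows = sorted(read_rows(src), key=lambda r: r[key_col])
    out = tmp_dir / (src.name + ".sorted")
    with open(out, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")
    return out


def streaming_join(
    core_sorted: Path, ext_sorted: Path, core_key: int = 0, ext_key: int = 0
) -> Iterator[Tuple[Row, List[Row]]]:
    """Yield (core_row, extension_rows), holding one record's group at a time."""
    groups = itertools.groupby(read_rows(ext_sorted), key=lambda r: r[ext_key])
    ext_id, ext_rows = next(groups, (None, None))
    for core in read_rows(core_sorted):
        core_id = core[core_key]
        # Skip extension groups that have no matching core record.
        while ext_id is not None and ext_id < core_id:
            ext_id, ext_rows = next(groups, (None, None))
        if ext_id == core_id:
            yield core, list(ext_rows)
            ext_id, ext_rows = next(groups, (None, None))
        else:
            yield core, []
```

Because both files are sorted on the same key with the same string ordering, the join reads each file exactly once and never needs more than the current core record and its extension rows in memory.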
(Where a DWCA contains extensions, extracting and sorting the data files in parallel would give some improvement to the overall speed.) I think we could take some inspiration from Avro tools. At least
|
@MattBlissett @timrobertson100 Am pretty excited about all this. I think having a Swiss army knife for DwC-A would be neat. Sort of like the ffmpeg for DwC. Re: streaming vs. files - I am somewhat aware of the internals of the current dwca-io - like you say, it does a lot: expanding files, sorting, schema interpretation with some special magic (merging synonymous terms), and merging the various related files to populate a data model. Re: Avro tools inspiration - yes, this looks great. I think there are a lot of similarities between Avro and DwC-A, in the way that both formats contain structured data. Here are some rough ideas -
You can stream Supported formats: tsv, json, avro ? |
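As a rough illustration of streaming records out in one of several formats, here is a minimal sketch. It assumes records arrive as dictionaries keyed by term name; `write_records` and its parameters are hypothetical and not part of any existing CLI. Avro is left out because it would need a third-party library, and tab characters inside values are not escaped.

```python
import json
import sys
from typing import Dict, Iterable, Optional, Sequence, TextIO


def write_records(
    records: Iterable[Dict[str, str]],
    fmt: str,
    out: Optional[TextIO] = None,
    columns: Optional[Sequence[str]] = None,
) -> None:
    """Write records one at a time as JSON lines or TSV (hypothetical helper)."""
    stream = out if out is not None else sys.stdout
    if fmt == "json":
        for record in records:
            stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    elif fmt == "tsv":
        if columns is None:
            raise ValueError("tsv output needs an explicit column list")
        stream.write("\t".join(columns) + "\n")  # header row
        for record in records:
            # Missing terms become empty cells.
            stream.write("\t".join(record.get(c, "") for c in columns) + "\n")
    else:
        raise ValueError("unsupported format: " + fmt)
```

Since `records` can be any iterable, a generator that reads the archive lazily can be passed straight in and each record is written before the next one is read.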
Note that I ended up implementing my own cli for handling dwca ; ( |
Thanks @jhpoelen |
@timrobertson100 for sure. For now, the DwC-A streaming functionality is part of the Preston CLI. A small part of your dwca-io library is re-used (e.g., reading meta.xml / record iterator). Works great! Would be neat to have a small, focused library that only does that. Right now, all these other dependencies are pulled in. Is dwca-io still the library you use in the GBIF infrastructure? |
@timrobertson100 thanks for pointing out that you are no longer using the https://github.com/gbif/dwc-io module, but are using a copied version of it embedded in the "core" module of https://github.com/gbif/pipelines instead. Thanks for pointing out the "frictionless" data experiments. Does that mean that GBIF is departing from DwC? How are you planning to transition? Would I be correct to assume that this "frictionless" data experiment is related to the big splash you made earlier this month re: https://discourse.gbif.org/t/use-case-biotic-interactions-sottunga-island-melitaea-cinxia-population-study/3312 and other use cases? |
Also, just wondering - why didn't you refactor and re-use the https://github.com/gbif/dwca-io module in the pipelines project? And, how are you planning to keep them in sync? |
I think you've misread that - pipelines does use dwca-io, so consistency is not an issue. GBIF is not departing from Darwin Core, as it's the core standard for much of what GBIF does. We're exploring richer data exchange formats as part of the work to diversify the data model, and frictionless looks like a reasonable packaging format to explore. |
We also have simple DWCA->AVRO CLI implementation here https://github.com/gbif/pipelines/blob/dev/tools/archives-converters/src/main/java/org/gbif/converters/DwcaToAvroConverter.java#L28 |
@timrobertson100 I am glad I misread that and thanks for clarifying. Great to see you are re-using existing libraries. I was confused by what I thought were cloned implementations of the DwcaReader classes. Am still hoping I can convince you and your colleagues to publish to Maven Central to make it easier to discover and use your valuable libraries (#55). Free coffee and cookies? An ice cream? What will it take? And great to hear that you are extending support for additional schemas beyond DwC. I've very much enjoyed using W3C CSV for many years now. I am curious to see how you'll end up managing the schemas (e.g., versioning) and data (e.g., provenance). |
First command: