Implement a CLI #53

Open
timrobertson100 opened this issue Oct 21, 2020 · 16 comments

Comments

@timrobertson100
Member

First command:

dwca-tools --format JSON /input.dwca /tmp/output.json
@jhpoelen

jhpoelen commented Oct 21, 2020

@timrobertson100 Nice! How about a second command:

$ cat input.dwca | dwca-tools --format JSON > output.json 

@timrobertson100
Member Author

timrobertson100 commented Oct 21, 2020

I'm not entirely sure how a cat input.dwca would/could work here @jhpoelen

The input is a zip file that, once opened, needs its files sorted so that joins can be implemented without reading the whole thing into memory. The reader also disregards supplementary files in the zip manifest that are not of interest. Streaming the output would be no issue.

I think it'd need to be:

dwca-tools --format JSON /input.dwca > output.json

Or am I missing something, please?

@jhpoelen

I can see your point about the auxiliary files. However, I figure you can stream the content of the dwca into temporary files instead of keeping in memory. If the meta.xml occurs first, then you can already start ignoring auxiliary files and even start building / sorting that star schema model. My main point is that, especially given the size of most datasets, streaming processing is pretty important to keep storage/memory overhead low, as you noticed in the Jenkins/Spark setup you have now.
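
For readers following along, here is a minimal sketch of the "spool to a temporary file, read meta.xml first, ignore the rest" idea. It is written in Python only for brevity and is not how dwca-io does it. A zip archive keeps its index (the central directory) at the end of the file, so input arriving on a pipe has to be copied somewhere seekable before entries can be listed. The function names are hypothetical; the sketch assumes meta.xml sits at the archive root and that data files are declared in `<location>` elements in the Darwin Core text namespace.

```python
import io
import shutil
import tempfile
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by DwC-A meta.xml descriptors.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def declared_data_files(meta_xml):
    """Return the file names that meta.xml declares in <location> elements."""
    root = ET.fromstring(meta_xml)
    return {
        loc.text.strip()
        for loc in root.iter(DWC_TEXT_NS + "location")
        if loc.text
    }


def iter_declared_members(stream):
    """Yield (name, file object) for each declared data file in a zip stream.

    The stream may be non-seekable (e.g. sys.stdin.buffer); it is copied to
    a temporary file on disk first, so memory use stays small.
    Each yielded file object is only valid until the next iteration.
    """
    with tempfile.TemporaryFile() as spool:
        shutil.copyfileobj(stream, spool)
        spool.seek(0)
        with zipfile.ZipFile(spool) as archive:
            wanted = declared_data_files(archive.read("meta.xml"))
            for info in archive.infolist():
                # Auxiliary files not named in meta.xml are skipped.
                if info.filename in wanted:
                    with archive.open(info) as member:
                        yield info.filename, member
```

With something like this, `cat input.dwca | some-tool` becomes possible at the cost of one full copy of the archive to temporary storage before any record is read.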

@jhpoelen

Perhaps worth a hacking session . . . ; )

@timrobertson100
Member Author

timrobertson100 commented Oct 22, 2020

Thanks @jhpoelen

can stream the content of the dwca into temporary files instead of keeping in memory

You may not be aware, but this is what the library already does internally when there are extensions (it sorts to temporary files and then runs a streaming join). When no extensions exist it just streams directly from the file. Large archives will run on minimal memory and I still don't see an obvious way that this can be improved.
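
As an illustration of the general technique (a sort to temporary files followed by a streaming join), here is a rough Python sketch. It is not the dwca-io implementation. It assumes tab-separated lines with no header row, the record id in the first column, and unique ids in the core file; `key_of`, `external_sort` and `streaming_join` are hypothetical names.

```python
import heapq
import itertools
import tempfile


def key_of(line):
    """Join key: the first tab-separated column (assumed to be the core id)."""
    return line.rstrip("\n").split("\t", 1)[0]


def external_sort(lines, chunk_size=100_000):
    """Yield lines ordered by key_of, holding at most one chunk in memory."""
    runs = []
    source = iter(lines)
    try:
        while True:
            chunk = sorted(itertools.islice(source, chunk_size), key=key_of)
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            run.writelines(line + "\n" for line in chunk)
            run.seek(0)
            runs.append(run)
        # heapq.merge reads the sorted runs lazily, one line per run at a time.
        for line in heapq.merge(*runs, key=key_of):
            yield line.rstrip("\n")
    finally:
        for run in runs:
            run.close()


def streaming_join(core_lines, ext_lines):
    """Merge-join two key-sorted streams, yielding (core_line, [ext_lines])."""
    groups = itertools.groupby(ext_lines, key=key_of)
    ext_key, ext_group = next(groups, (None, None))
    for core in core_lines:
        core_key = key_of(core)
        # Drop extension rows whose id has no matching core record.
        while ext_key is not None and ext_key < core_key:
            ext_key, ext_group = next(groups, (None, None))
        if ext_key == core_key:
            yield core, list(ext_group)
            ext_key, ext_group = next(groups, (None, None))
        else:
            yield core, []
```

Only one chunk per sort and one group of extension rows per core record are held in memory at a time, which is why archive size matters much less than the sort itself.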

The main issues I'm aware of for efficient DwC-A reading are 1) the sorting and 2) the deflate compression step, as it is applied to whole files and not to chunks of the files. Both of these are inherently single-threaded operations and are where the processing overhead comes from. Sorting is better in Linux environments if there is access to a native gnuSort process.

These limitations are why GBIF converts DwC-A into Avro as a first step, after which we can run processing in parallel - e.g. in Spark. I suspect this is what you're referring to, but I don't think that is an issue with this library.

@MattBlissett
Member

(Where a DWCA contains extensions, extracting and sorting the data files in parallel would give some improvement to the overall speed.)
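
A hedged sketch of what that could look like, in Python rather than the library's own language: each data file is decompressed and sorted by its own worker, so the per-file deflate and sort steps overlap. The function names are hypothetical, the sort is done in memory purely to keep the example short, and whether threads give a real speed-up depends on the runtime and the I/O involved.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor


def key_of(line):
    """Sort key: the first tab-separated column (assumed to be the record id)."""
    return line.split("\t", 1)[0]


def extract_sorted(archive_path, member):
    """Decompress one archive member and return its lines sorted by id."""
    # Each task opens its own handle so workers never share a file position.
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as raw:
            lines = raw.read().decode("utf-8").splitlines()
    return sorted(lines, key=key_of)


def extract_all_sorted(archive_path, members, max_workers=4):
    """Return {member: sorted lines}, handling the members concurrently."""
    members = list(members)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda m: extract_sorted(archive_path, m), members)
        return dict(zip(members, results))
```

An archive with only a core file gains nothing from this, since there is a single file to decompress.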

I think we could take some inspiration from Avro tools. At least cat, getmeta, getschema, random, tojson, totext seem like the kind of things that would be useful in a CLI.

java -jar avro-tools-1.8.2.jar
Version 1.8.2
 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
----------------
Available tools:
          cat  extracts samples from files
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
 idl2schemata  Extract JSON schemata of the types from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
       random  Creates a file with randomly generated instances of a schema.
      recodec  Alters the codec of a data file.
       repair  Recovers data from a corrupt Avro Data file
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, record per line or pretty.
       totext  Converts an Avro data file to a text file.
     totrevni  Converts an Avro data file to a Trevni file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.

@jhpoelen

jhpoelen commented Oct 22, 2020

@MattBlissett @timrobertson100 Am pretty excited about all this. I think having a Swiss Army knife for dwca would be neat. Sort of like the ffmpeg for dwc.

Re: streaming vs. files - I am somewhat aware of the internals of the current dwca-io - like you say, it does a lot: expanding files, sorting, schema interpretation with some special magic (merging synonymous terms), and merging the various related files to populate a data model.

Re: avro tools inspiration - yes, this looks great. I think there are a lot of similarities between avro and dwca, in the way that both formats contain structured data.

Here are some rough ideas -

dwca-cli      description                                     avro equivalent
schema        prints meta.xml in some format                  getschema
meta          prints eml.xml if available in some format      getmeta
occurrences   print occurrences if available                  n/a
taxa          print checklist if available in some format     n/a
media         print media if available in some format         n/a
data          print entire populated schema in some format    n/a
...           ...                                             ...

You can stream dwca occurrences provided (1) the meta.xml occurs before the data files or (2) the schema is provided up front via cat archive.zip | dwca getschema.

Supported formats: tsv, json, avro ?
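
To give the rough ideas above a concrete shape, here is a minimal argument-parsing sketch in Python. Everything in it is an assumption for illustration: the program name `dwca`, the `--format` flag, the `-` convention for standard input, and the row-as-dictionary model are not part of any existing tool. Avro output is left out because it was only floated as a question.

```python
import argparse
import itertools
import json

# Subcommands and descriptions taken from the rough ideas listed above.
COMMANDS = {
    "schema": "print meta.xml in some format",
    "meta": "print eml.xml if available",
    "occurrences": "print occurrences if available",
    "taxa": "print checklist if available",
    "media": "print media if available",
    "data": "print entire populated schema",
}


def build_parser():
    """Build a parser for a hypothetical `dwca <command> [archive]` CLI."""
    parser = argparse.ArgumentParser(prog="dwca")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        cmd = sub.add_parser(name, help=help_text)
        cmd.add_argument("archive", nargs="?", default="-",
                         help="archive path, or '-' to read standard input")
        cmd.add_argument("--format", choices=["tsv", "json"], default="tsv")
    return parser


def render(rows, fmt):
    """Yield one output line per record; rows are dicts of term to value."""
    rows = iter(rows)
    if fmt == "json":
        for row in rows:
            yield json.dumps(row, sort_keys=True)
    elif fmt == "tsv":
        first = next(rows, None)
        if first is None:
            return
        columns = sorted(first)  # header comes from the first record
        yield "\t".join(columns)
        for row in itertools.chain([first], rows):
            yield "\t".join(str(row.get(col, "")) for col in columns)
    else:
        raise ValueError("unsupported format: " + fmt)
```

An invocation such as `dwca occurrences --format json input.zip` would then parse to the command `occurrences`, the format `json` and the archive `input.zip`, leaving the actual reading to whichever library sits underneath.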

@jhpoelen

Note that I ended up implementing my own cli for handling dwca ; (
I guess sometimes you just have to roll your own to fit a specific use case.
With the cmd, you can now stream all of gbif using a command like the one described in bio-guoda/preston#148.
Holler if you change your mind on building a cli, happy to give it a spin.

@timrobertson100
Member Author

Thanks @jhpoelen
If your implementation is something that could be reused, and you think it would be useful to others to have in this library, then PRs are always welcome.

@jhpoelen

@timrobertson100 for sure.

For now, the dwca streaming functionality is part of the preston cli. A small part of your dwca-io library is re-used (e.g., reading meta.xml / record iterator). Works great! Would be neat to have a small focused library that only does that. Right now, all these other dependencies are pulled in.

Is dwc-io still the library you use in the gbif infrastructure?

@timrobertson100
Member Author

Is dwc-io still the library you use in the gbif infrastructure?

Yes, it is e.g. here. We turn everything to Avro in the first stage of processing in GBIF though.

Unrelated to this issue, but for background info - we're exploring Frictionless Data as a replacement for the limited DwC-A format.

@jhpoelen

@timrobertson100 thanks for pointing out that you are no longer using the https://github.com/gbif/dwc-io module, but are using a copied version of it embedded in the "core" module of the https://github.com/gbif/pipelines instead.

Thanks for pointing out the "frictionless" data experiments. Does that mean that GBIF is departing from DwC? How are you planning to transition? Would I be correct to assume that this "frictionless" data experiment is related to the big splash you made earlier this month re: https://discourse.gbif.org/t/use-case-biotic-interactions-sottunga-island-melitaea-cinxia-population-study/3312 and other use cases?

@jhpoelen

Also, just wondering - why didn't you refactor and re-use the https://github.com/gbif/dwca-io module in the pipelines project? And, how are you planning to keep them in sync?

@timrobertson100
Member Author

I think you've misread that - pipelines uses dwc-io, so consistency is not an issue.

GBIF is not departing from Darwin Core as it's the core standard for much of what GBIF does. We're exploring richer data exchange formats as part of the work to diversify the data model and frictionless looks like a reasonable packaging format to explore.

@muttcg
Member

muttcg commented Apr 26, 2022

@jhpoelen

@timrobertson100 I am glad I misread that and thanks for clarifying. Great to see you are re-using existing libraries. I was confused by what I thought were cloned implementations of the DwcaReader classes. Am still hoping I can convince you and your colleagues to publish to maven central to make it easier to discover and use your valuable libraries (#55). Free coffee and cookies? An ice cream? What will it take?

And great to hear that you are extending support for additional schemas beyond dwc. I've very much enjoyed using W3C CSV for many years now. I am curious to see how you'll end up managing the schemas (e.g., versioning) and data (e.g., provenance).
