Add in flags which enable creation of nodes grouped by source #138

DnlRKorn · 2024-07-08T19:41:04Z

Posting as a result of KG Construction Crew discussion on July 8, 2024.

Koza currently generates one large TSV file for all nodes parsed from a single data source. To help with debugging and certain use cases, the ability to output a node file for each individual data source could be useful.

In addition to this behavior, adding a flag which could be used to disable the creation of the large node TSV file may also be helpful.

Summary of request:

Introduction of a flag or variable that enables creation of per-data-source node TSV files.
Introduction of a flag or variable that can disable creation of the large monolithic node TSV file.

Please reach out to @hrshdhgd for more details of the advantages of this approach!

kevinschaper · 2024-07-08T20:59:16Z

What does data source mean in this context? Should we be thinking in terms of just supplying a field or list of fields that would be used to split into separate files?

I don't know if partition is the right term here, but something like:

node_partition:
  fields: 
    - primary_knowledge_source
    - taxon
  write_combined: false
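
To make the idea concrete, here is a rough sketch of how the listed fields might be turned into a per-row partition key. The `node_partition` block above is only a proposal, not an existing Koza option, and `partition_key`, its `missing` default, and the example row are all hypothetical names chosen for illustration.

```python
from typing import Mapping, Sequence


def partition_key(row: Mapping[str, str], fields: Sequence[str], missing: str = "unknown") -> str:
    """Build a file-name-safe key from the values of the partition fields.

    Hypothetical helper: not part of Koza's API.
    """
    parts = []
    for field in fields:
        # Fall back to a placeholder when the field is absent or empty.
        value = row.get(field) or missing
        # Replace characters that are awkward in file names (e.g. the colon in CURIEs).
        parts.append("".join(c if c.isalnum() or c in "-_." else "_" for c in value))
    return "__".join(parts)


# Made-up example row, for illustration only.
row = {
    "id": "NCBIGene:1",
    "primary_knowledge_source": "infores:ncbi-gene",
    "taxon": "NCBITaxon:9606",
}
print(partition_key(row, ["primary_knowledge_source", "taxon"]))
# prints: infores_ncbi-gene__NCBITaxon_9606
```

Each distinct key would then map to its own output file, and `write_combined: false` would skip the combined file.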

hrshdhgd · 2024-07-08T23:56:43Z

A data source example would be this one.

We would like to have something like

├── output/
    ├── all-traits-nodes.tsv
    ├── all-traits-edges.tsv
    ├── OrganismTaxonPathways/
    │   ├── nodes.tsv
    │   ├── edges.tsv
    ├── OrganismTaxonCarbonSubstrate/
    │   ├── nodes.tsv
    │   ├── edges.tsv
    ├── OrganismTaxon Motility/
        ├── nodes.tsv
        ├── edges.tsv

where tax_id => OrganismTaxon (from biolink)
This is a very crude example based on column names (file names could be more informative than just nodes and edges). The idea is to generate all KGs: an all-inclusive one plus its components. I have shown just 3 columns, but we would have OrganismXXX, with XXX being every other column name.

The idea is to generate KGs of everything possible with respect to what's available in the data source. This will allow downstream projects to pick and choose either the individual KGs of interest (sort of like building a bouquet of flowers) or the whole thing, based on their requirements. Hope this makes sense.

kevinschaper · 2024-07-09T03:26:52Z

I spent a little time looking at refactoring to go from a single writer to a dict of writers, but it wasn’t the kind of refactor that just easily falls into place. I might start with a cli command to split the files after the ingest, because that’s much more straightforward to implement, and much less likely to break the existing behavior.
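
For illustration, a post-ingest split along these lines could look roughly like the sketch below. It is a minimal example using only the standard library, not the implementation of any Koza command. `split_tsv`, the output naming scheme, and the `"unknown"` fallback are assumptions made for the sketch.

```python
import csv
from contextlib import ExitStack
from pathlib import Path


def split_tsv(path, field, out_dir):
    """Split a TSV into one file per distinct value of `field`.

    Hypothetical helper; returns a dict mapping each field value to its output path.
    """
    src = Path(path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    writers = {}  # field value -> csv.DictWriter (the "dict of writers")
    paths = {}    # field value -> output file path
    with ExitStack() as stack, src.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            value = row.get(field) or "unknown"
            if value not in writers:
                # Make the value safe for use in a file name. Note that two
                # distinct values could collapse to the same name; a real
                # tool would need to handle that case.
                safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in value)
                target = out / f"{safe}_{src.name}"
                fh = stack.enter_context(target.open("w", newline=""))
                writer = csv.DictWriter(
                    fh,
                    fieldnames=reader.fieldnames,
                    delimiter="\t",
                    lineterminator="\n",
                    extrasaction="ignore",
                )
                writer.writeheader()
                writers[value] = writer
                paths[value] = target
            writers[value].writerow(row)
    return paths
```

Because this reads a finished file instead of changing how rows are written during the ingest, the existing single-writer path is left untouched.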

kevinschaper mentioned this issue Jul 12, 2024

Add koza split cli command to split up a kgx file based on field values #139

Merged
