Add in flags which enable creation of nodes grouped by source #138

Open
DnlRKorn opened this issue Jul 8, 2024 · 3 comments

@DnlRKorn
Contributor

DnlRKorn commented Jul 8, 2024

Posting as a result of KG Construction Crew discussion on July 8, 2024.

Koza is currently configured to generate one large TSV file for all nodes parsed from a single data source. For debugging and certain other use cases, it would be useful to be able to write a separate node file for each individual data source.

In addition, a flag that disables creation of the large node TSV file may also be helpful.

Summary of request:

  • Introduce a flag or variable that enables creation of per-data-source node TSV files.
  • Introduce a flag or variable that disables creation of the large monolithic node TSV file.

Please reach out to @hrshdhgd for more details of the advantages of this approach!

@kevinschaper
Member

What does data source mean in this context? Should we be thinking in terms of just supplying a field or list of fields that would be used to split the output into separate files?

I don't know if partition is the right term here, but something like:

node_partition:
  fields: 
    - primary_knowledge_source
    - taxon
  write_combined: false
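
To make the proposal above concrete, here is a minimal Python sketch of how such a block might be interpreted. Everything in it is an assumption for illustration: `NodePartitionConfig`, `partition_key`, `route_rows`, the `"combined"` output name, and the sample rows are hypothetical and are not part of Koza.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class NodePartitionConfig:
    """Hypothetical mirror of the proposed `node_partition` block."""
    fields: list = field(default_factory=list)
    write_combined: bool = True


def partition_key(row, fields, missing="unknown"):
    """Join the values of the partition fields into one output name."""
    return "__".join(str(row.get(name) or missing) for name in fields)


def route_rows(rows, config):
    """Group rows by output name; 'combined' stands for the monolithic file."""
    outputs = defaultdict(list)
    for row in rows:
        if config.write_combined:
            outputs["combined"].append(row)
        if config.fields:
            outputs[partition_key(row, config.fields)].append(row)
    return dict(outputs)


# Made-up rows, for illustration only.
nodes = [
    {"id": "EX:1", "primary_knowledge_source": "infores:a", "taxon": "NCBITaxon:9606"},
    {"id": "EX:2", "primary_knowledge_source": "infores:a", "taxon": "NCBITaxon:10090"},
    {"id": "EX:3", "primary_knowledge_source": "infores:b"},
]
config = NodePartitionConfig(
    fields=["primary_knowledge_source", "taxon"],
    write_combined=False,
)
grouped = route_rows(nodes, config)
for name, members in sorted(grouped.items()):
    print(name, [node["id"] for node in members])
```

With `write_combined` set to false, no `"combined"` group is produced, which would correspond to skipping the monolithic file. Rows missing a partition field fall into an `unknown` bucket here; whether that is the right behavior is an open design question.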

@hrshdhgd
Collaborator

hrshdhgd commented Jul 8, 2024

A data source example would be this one.

We would like to have something like

output/
├── all-traits-nodes.tsv
├── all-traits-edges.tsv
├── OrganismTaxonPathways/
│   ├── nodes.tsv
│   └── edges.tsv
├── OrganismTaxonCarbonSubstrate/
│   ├── nodes.tsv
│   └── edges.tsv
└── OrganismTaxonMotility/
    ├── nodes.tsv
    └── edges.tsv

where tax_id => OrganismTaxon (from biolink)
This is a very crude example based on column names (filenames could be more informative than just nodes and edges). But the idea is to generate all KGs: an all-inclusive one and its components. I have shown just 3 columns, but we would have OrganismXXX, with XXX being every other column name.

The idea is to generate KGs of everything possible w.r.t. what's available in the data source. This will allow downstream projects to pick and choose either individual KGs of interest (sort of like building a bouquet of flowers) or the whole thing, based on their requirements. Hope this makes sense.
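
As a rough illustration of the layout described above, the following Python sketch derives one output directory per non-ID column plus the combined files. The function names (`camel`, `plan_outputs`), the snake_case column names, and the `all-traits` prefix as a parameter are assumptions made for the example; only the resulting paths are taken from the tree shown earlier.

```python
from pathlib import PurePosixPath


def camel(name):
    """Turn a column name such as 'carbon_substrate' into 'CarbonSubstrate'."""
    parts = name.replace("-", "_").split("_")
    return "".join(part.capitalize() for part in parts if part)


def plan_outputs(columns, id_column="tax_id", subject="OrganismTaxon",
                 root="output", combined_prefix="all-traits"):
    """Map each component KG (and the combined KG) to its node and edge paths."""
    base = PurePosixPath(root)
    plan = {
        "combined": [
            str(base / f"{combined_prefix}-nodes.tsv"),
            str(base / f"{combined_prefix}-edges.tsv"),
        ]
    }
    for column in columns:
        if column == id_column:
            continue  # the ID column is the subject, not a component KG
        directory = base / f"{subject}{camel(column)}"
        plan[directory.name] = [
            str(directory / "nodes.tsv"),
            str(directory / "edges.tsv"),
        ]
    return plan


# Hypothetical column names chosen to match the three directories in the example.
plan = plan_outputs(["tax_id", "pathways", "carbon_substrate", "motility"])
for name, paths in plan.items():
    print(name, paths)
```

This only plans file locations; deciding which nodes and edges belong in each component KG is a separate step that depends on the ingest.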

@kevinschaper
Member

I spent a little time looking at refactoring to go from a single writer to a dict of writers, but it wasn’t the kind of refactor that just easily falls into place. I might start with a CLI command to split the files after the ingest, because that’s much more straightforward to implement and much less likely to break the existing behavior.
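
A post-ingest split along these lines could look roughly like the sketch below, which uses only the Python standard library and keeps a dict of writers keyed by the value of one column. This is not Koza code: `split_tsv`, `safe_name`, the output file naming, and the sample data (including the `provided_by` column) are all assumptions for illustration, and it assumes a well-formed TSV with a header row.

```python
import csv
import re
import tempfile
from pathlib import Path


def safe_name(value):
    """Reduce a field value to something usable in a file name."""
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", value).strip("_")
    return cleaned or "unknown"


def split_tsv(input_path, output_dir, field):
    """Write one TSV per distinct value of `field`; return {name: path}."""
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    handles, writers, paths = {}, {}, {}
    try:
        with open(input_path, newline="", encoding="utf-8") as src:
            reader = csv.DictReader(src, delimiter="\t")
            if field not in (reader.fieldnames or []):
                raise KeyError(f"column {field!r} not found in {input_path}")
            for row in reader:
                # Values that sanitize to the same name share one file.
                name = safe_name(row[field] or "")
                if name not in writers:
                    paths[name] = output_dir / f"{name}_{input_path.stem}.tsv"
                    handles[name] = open(paths[name], "w", newline="", encoding="utf-8")
                    writers[name] = csv.DictWriter(
                        handles[name],
                        fieldnames=reader.fieldnames,
                        delimiter="\t",
                        lineterminator="\n",
                        extrasaction="ignore",
                    )
                    writers[name].writeheader()
                writers[name].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return paths


# Made-up data, written to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example_nodes.tsv"
    source.write_text(
        "id\tcategory\tprovided_by\n"
        "EX:1\tbiolink:OrganismTaxon\tsource_a\n"
        "EX:2\tbiolink:OrganismTaxon\tsource_b\n"
        "EX:3\tbiolink:OrganismTaxon\tsource_a\n",
        encoding="utf-8",
    )
    written = split_tsv(source, Path(tmp) / "split", "provided_by")
    contents = {name: path.read_text(encoding="utf-8") for name, path in written.items()}

print(sorted(contents))
```

Because it reads the finished file rather than hooking into the writer, this approach leaves the existing ingest untouched, at the cost of a second pass over the data and one open file handle per distinct value.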
