Review kg-phenio ID prefix subset #468
Comments
It may not make an immense impact - I see 1,661 edges in kg-phenio involving PATO which aren't subclass_of or those redundant category edges. That set includes 504 different PATO terms, most of them participating in related_to edges with UPHENO, UBERON, FBbt and MONDO. For many (most? nearly all?) PATO terms, like PATO:0000389 (acute) or PATO:0000634 (unilateral), it's not enabling a path to exist between these or other ontologies, it's just acting more like a qualifier.
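A count like the one above could be reproduced with something along these lines. This is only a sketch: it assumes the edges are available as dicts with KGX-style `subject`, `predicate` and `object` keys, the function names are hypothetical, and the predicates to skip are passed in because the exact predicate of the redundant category edges isn't stated here.

```python
from collections import Counter


def curie_prefix(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def summarize_prefix_edges(edges, target="PATO",
                           skip_predicates=frozenset({"biolink:subclass_of"})):
    """Count edges touching `target` terms, ignoring the skipped predicates.

    Returns (edge count, set of distinct target terms, Counter of the
    prefixes found on the other end of those edges).
    """
    terms = set()
    partners = Counter()
    kept = 0
    for edge in edges:
        if edge["predicate"] in skip_predicates:
            continue
        subj, obj = edge["subject"], edge["object"]
        subj_hit = curie_prefix(subj) == target
        obj_hit = curie_prefix(obj) == target
        if not (subj_hit or obj_hit):
            continue
        kept += 1
        if subj_hit:
            terms.add(subj)
            partners[curie_prefix(obj)] += 1
        if obj_hit:
            terms.add(obj)
            partners[curie_prefix(subj)] += 1
    return kept, terms, partners
```

If most of the partner prefixes only ever appear on related_to edges, that would support the reading that PATO is acting as a qualifier rather than as a bridge between ontologies.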
@kevinschaper - In order to tell you more, I need to know why you subset at all. There are a number of problems with this, including
In general, I would recommend requesting a PHENIO subset on the phenio issue tracker, describing the characteristics you need and why, and integrating that instead of the whole thing.
@matentzn That makes a lot of sense. I think the reason that I did this filtering initially was that I had errors because there were gene nodes coming in, and when I looked further it seemed like an opt-in list was more practical than an opt-out list. If I remove filtering, here are prefixes w/counts
I'm definitely getting genes from HGNC, FlyBase (used to be MGI, but they look like they're gone now). I'm getting nodes for biolink classes, predicates and even enum permissible values. Should I have 1 Orphanet ID? 2 MESH IDs? 6 GENO IDs? Do I want singleton nodes for I don't feel confident in my strategy from either direction, but I think initially opt-in was just less of a time sink.
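For reference, a prefix tally of this kind, including the singleton check, can be sketched as below. It assumes node IDs are plain CURIE strings; the function names are hypothetical, and an ID without a colon is counted under its whole text.

```python
from collections import Counter


def prefix_counts(node_ids):
    """Tally node IDs by CURIE prefix (the text before the first colon)."""
    return Counter(node_id.split(":", 1)[0] for node_id in node_ids)


def rare_prefixes(counts, max_count=1):
    """Return prefixes that occur at most `max_count` times, sorted by name."""
    return sorted(p for p, n in counts.items() if n <= max_count)
```

Raising `max_count` makes it easy to list the handful-of-IDs cases (like the Orphanet, MESH and GENO ones above) alongside the true singletons.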
I just noticed that I have these LinkML nodes (also singletons):
Oh, those are definitely coming in through the Biolink merge, since BL imports them.
@kevinschaper what you are observing can be answered only on the phenio level - any weird ID can come in through imports, and we do not really control the prefix space in PHENIO at all. ODK does, though, come with a way to drop specific prefixes from the pipeline!
I'm doing an interactive kg build and started looking at phenio filtering. I'm going to reverse my include list and switch to this very short exclude list:
I'll also write out a file in qc for what was excluded by prefix.
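A minimal sketch of that approach, an opt-out filter plus a per-prefix QC report, might look like the following. The function names, the report columns and the output path are assumptions for illustration and are not taken from monarch-ingest; nodes are assumed to be dicts with an `id` key and edges dicts with `subject` and `object` keys.

```python
import csv
from collections import Counter
from pathlib import Path


def filter_by_excluded_prefix(nodes, edges, excluded):
    """Drop nodes whose prefix is excluded, and edges touching such nodes.

    Returns (kept nodes, kept edges, Counter of dropped nodes per prefix).
    """
    def prefix(curie):
        return curie.split(":", 1)[0]

    dropped = Counter()
    kept_nodes = []
    for node in nodes:
        node_prefix = prefix(node["id"])
        if node_prefix in excluded:
            dropped[node_prefix] += 1
        else:
            kept_nodes.append(node)
    kept_edges = [
        edge for edge in edges
        if prefix(edge["subject"]) not in excluded
        and prefix(edge["object"]) not in excluded
    ]
    return kept_nodes, kept_edges, dropped


def write_exclusion_report(dropped, path):
    """Write a two-column TSV (prefix, excluded node count), sorted by prefix."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["prefix", "excluded_nodes"])
        for name, count in sorted(dropped.items()):
            writer.writerow([name, count])
```

Keeping the report separate from the filter means the same tally can be reviewed after each build to see whether the exclude list is removing more than intended.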
Why are you excluding HGNC, for example? Are you not losing some Mondo->HGNC links this way?
We're filtering out nodes and edges from kg-phenio by prefix currently:
https://github.com/monarch-initiative/monarch-ingest/blob/40020ec11b892d929632aa1b98b529b87511bd32/src/monarch_ingest/cli_utils.py#LL158C1-L161C1
I created this list initially to include ontology prefixes that we use in ingests. I realized recently that we don't have PATO terms in the graph, which makes me think that my list is filtering too aggressively and that there are probably ontologies that provide connections within phenio that are important to have in the graph.
Do you have feedback @cmungall @matentzn?
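One way to check how aggressive an include list is would be to tally the edges it drops by the prefix that caused the drop. The sketch below assumes edges are dicts with `subject` and `object` CURIEs; the function name is hypothetical and this is not the code at the link above.

```python
from collections import Counter


def prefixes_lost_to_include_list(edges, included):
    """Count edges an opt-in prefix list would drop, per offending prefix.

    An edge is counted once for each distinct endpoint prefix that is
    missing from `included`.
    """
    included = set(included)
    lost = Counter()
    for edge in edges:
        endpoint_prefixes = {
            curie.split(":", 1)[0]
            for curie in (edge["subject"], edge["object"])
        }
        for missing in endpoint_prefixes - included:
            lost[missing] += 1
    return lost
```

A prefix such as PATO showing up with a large count here would flag it as a candidate for adding to the include list.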