Review kg-phenio ID prefix subset #468

Open
kevinschaper opened this issue May 25, 2023 · 8 comments

@kevinschaper
Member

We're filtering out nodes and edges from kg-phenio by prefix currently:

https://github.com/monarch-initiative/monarch-ingest/blob/40020ec11b892d929632aa1b98b529b87511bd32/src/monarch_ingest/cli_utils.py#LL158C1-L161C1

    prefixes = ["MONDO", "OMIM", "HP", "ZP", "MP", "CHEBI", "FBbt",
                "FYPO", "WBPhenotype", "GO", "MESH", "XPO",
                "ZFA", "UBERON", "WBbt", "ORPHA", "EMAPA"]

I created this list initially to include the ontology prefixes that we use in ingests. I realized recently that we don't have PATO terms in the graph, which makes me think that my list is filtering too aggressively, and that there are probably ontologies providing connections within phenio that are important to have in the graph.
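
For readers following along, here is a minimal sketch of what an opt-in prefix filter of this kind might look like. It is not the code in `cli_utils.py`: the function names, the dict-shaped records, and the rule that an edge is kept only when both ends pass are assumptions made for illustration, and the prefix set is shortened.

```python
# Shortened opt-in list for the example; the real list is the one quoted above.
INCLUDE_PREFIXES = {"MONDO", "HP", "UBERON"}


def curie_prefix(curie: str) -> str:
    """Return the text before the first colon, e.g. 'HP' for 'HP:0000118'."""
    return curie.split(":", 1)[0]


def filter_by_prefix(nodes, edges, prefixes):
    """Keep nodes with an allowed prefix, and edges whose subject AND object are allowed.

    Assumption: both ends must pass; the actual ingest may apply a different rule.
    """
    kept_nodes = [n for n in nodes if curie_prefix(n["id"]) in prefixes]
    kept_edges = [
        e
        for e in edges
        if curie_prefix(e["subject"]) in prefixes
        and curie_prefix(e["object"]) in prefixes
    ]
    return kept_nodes, kept_edges


# Illustrative records only, not taken from kg-phenio.
nodes = [{"id": "HP:0000118"}, {"id": "PATO:0000389"}, {"id": "MONDO:0000001"}]
edges = [
    {"subject": "HP:0000118", "predicate": "biolink:related_to", "object": "PATO:0000389"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

kept_nodes, kept_edges = filter_by_prefix(nodes, edges, INCLUDE_PREFIXES)
```

With this rule the PATO node and the HP-to-PATO edge are both dropped, which is the behaviour that hides PATO terms from the graph.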

Do you have feedback @cmungall @matentzn?

@kevinschaper kevinschaper self-assigned this May 25, 2023
@caufieldjh
Member

caufieldjh commented May 25, 2023

It may not make an immense impact - I see 1,661 edges in kg-phenio involving PATO which aren't subclass_of or those redundant category edges. That set includes 504 different PATO terms, most of them participating in related_to edges with UPHENO, UBERON, FBbt and MONDO. For many (most? nearly all?) PATO terms, like PATO:0000389 (acute) or PATO:0000634 (unilateral), it's not enabling a path to exist between these or other ontologies, it's just acting more like a qualifier.
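
A rough sketch of how a count like this could be reproduced, assuming the edges are available as dicts with `subject`, `predicate`, and `object` keys. The function name and the sample edges are hypothetical, and the "redundant category edges" mentioned above are not modelled here.

```python
from collections import Counter


def summarize_prefix_edges(edges, prefix="PATO", skip_predicates=("biolink:subclass_of",)):
    """Count edges touching `prefix`, the distinct terms involved, and partner prefixes."""
    tag = prefix + ":"
    n_edges = 0
    terms = set()
    partners = Counter()
    for edge in edges:
        if edge["predicate"] in skip_predicates:
            continue
        ends = (edge["subject"], edge["object"])
        hits = [curie for curie in ends if curie.startswith(tag)]
        if not hits:
            continue
        n_edges += 1
        terms.update(hits)
        # Record the prefix of whatever sits on the other end of the edge.
        for curie in ends:
            if not curie.startswith(tag):
                partners[curie.split(":", 1)[0]] += 1
    return n_edges, terms, partners


# Illustrative edges only, not taken from kg-phenio.
sample_edges = [
    {"subject": "PATO:0000389", "predicate": "biolink:subclass_of", "object": "PATO:0000001"},
    {"subject": "UBERON:0000061", "predicate": "biolink:related_to", "object": "PATO:0000634"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "PATO:0000389"},
    {"subject": "HP:0000118", "predicate": "biolink:related_to", "object": "MONDO:0000001"},
]

n_edges, pato_terms, partner_prefixes = summarize_prefix_edges(sample_edges)
```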

@matentzn
Member

@kevinschaper - In order to tell you more, I need to know why you subset at all. There are a number of problems with this, including:

  • Many species-specific vocabularies are missing (XAO, FBcv, DPO)
  • Maintaining a hard-coded list of prefixes creates another point of failure that needs to be maintained if new sources are added (instead, create a phenio-monarch version that only contains the terms you want)
  • Maintaining closure and links across branches when you "remove" links from a KG. This is a highly complex issue.
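
To make the closure point concrete, here is a small sketch of how dropping one prefix can silently break a path between two terms that are both kept. The IDs are hypothetical placeholders, and UPHENO is used as the intermediate only because it is absent from the opt-in list at the top of this issue.

```python
from collections import defaultdict, deque


def reachable(edges, start):
    """Return every node reachable from `start` by following (subject, object) pairs."""
    graph = defaultdict(set)
    for subject, obj in edges:
        graph[subject].add(obj)
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def drop_prefix(edges, prefix):
    """Remove every edge that touches a node with the given prefix."""
    tag = prefix + ":"
    return [(s, o) for s, o in edges if not s.startswith(tag) and not o.startswith(tag)]


# Placeholder IDs: an HP term linked to an MP term only through a UPHENO term.
edges = [("HP:example", "UPHENO:example"), ("UPHENO:example", "MP:example")]

before = reachable(edges, "HP:example")
after = reachable(drop_prefix(edges, "UPHENO"), "HP:example")
```

Before the filter the MP term is reachable from the HP term; afterwards nothing is, even though neither HP nor MP was excluded.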

In general, I would recommend requesting a Phenio subset on the phenio issue tracker, describing the characteristics you need and why, and integrating that instead of the whole thing.

@kevinschaper
Member Author

@matentzn That makes a lot of sense. I think the reason that I did this filtering initially was that I had errors because there were gene nodes coming in, and when I looked further it seemed like an opt-in list was more practical than an opt-out list.

If I remove the filtering, here are the prefixes with node counts:

36602 ZP
25990 MONDO
20061 XPO
19918 FBbt
19632 UPHENO
16954 HP
15719 UBERON
14049 MP
10458 GO
8745 EMAPA
7556 WBbt
6506 FMA
5158 OBA
3683 CHEBI
3382 HGNC
3218 ZFA
3072 MA
2695 WBPhenotype
1634 NCBITaxon
1622 CL
1605 XAO
 763 PR
 666 biolink
 597 PATO
 501 RO
 299 NBO
 237 FlyBase
 232 HSAPDV
 203 CHR
 115 FAO
 103 STY
 103 OBO
  80 BSPO
  61 IAO
  60 PO
  52 MPATH
  40 https
  39 SO
  38 ZFS
  35 ECO
  33 STATO
  33 OBI
  31 WD_Entity
  31 UBPROP
  31 NCIT
  30 BFO
  28 TS
  27 SIO
  21 ENVO
  17 ECTO
  10 http
  10 OIO
   9 LINKML
   9 CARO
   7 dc
   6 MFOMD
   6 GENO
   4 foaf
   4 MF
   3 rdfs
   3 owl
   3 UMLS
   3 NIF.EXT
   2 dcterms
   2 PROV
   2 OMO
   2 OGMS
   2 MESH
   2 MAXO
   1 dctypes
   1 dcat
   1 WBLS
   1 WBBT
   1 TO
   1 SNOMEDCT
   1 SEPIO
   1 RNORDV
   1 PW
   1 PHENIO
   1 PCO
   1 Orphanet
   1 NLX.SUB
   1 NLX.OEN
   1 NIF.STD
   1 FYPO
   1 FOODON
   1 FBcv
   1 DOID
   1 CLO
   1 CIO
   1 APO
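
A tally like the one above can be sketched in a few lines, assuming the node IDs are available as plain strings. The function name is hypothetical. Splitting on the first colon would also explain the `http` and `https` rows, since full IRIs split the same way as CURIEs.

```python
from collections import Counter


def prefix_counts(node_ids):
    """Return (prefix, count) pairs, most frequent first."""
    return Counter(node_id.split(":", 1)[0] for node_id in node_ids).most_common()


# Illustrative IDs only.
counts = prefix_counts(
    ["ZP:0000001", "ZP:0000002", "https://example.org/term", "rdfs:label"]
)
```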

I'm definitely getting genes from HGNC, FlyBase (used to be MGI, but they look like they're gone now). I'm getting nodes for biolink classes, predicates and even enum permissible values. Should I have 1 Orphanet ID? 2 MESH IDs? 6 GENO IDs?

Do I want singleton nodes for rdfs:isDefinedBy, rdfs:label, rdfs:seeAlso?
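
One way to list singletons for review, as a sketch: collect every ID that appears as the subject or object of some edge, then report the nodes that never do. The function name and the sample data are assumptions.

```python
def singleton_nodes(node_ids, edges):
    """Return node IDs that are not the subject or object of any edge, in input order."""
    connected = set()
    for edge in edges:
        connected.add(edge["subject"])
        connected.add(edge["object"])
    return [node_id for node_id in node_ids if node_id not in connected]


# Illustrative data only.
node_ids = ["rdfs:label", "LINKML:Boolean", "HP:0000118", "MONDO:0000001"]
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

singles = singleton_nodes(node_ids, edges)
```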

I don't feel confident in my strategy from either direction, but I think initially opt-in was just less of a time sink.

@kevinschaper
Member Author

I just noticed that I have these LinkML nodes (also singletons):

LINKML:Boolean  biolink:NamedThing
LINKML:Date     biolink:NamedThing
LINKML:Double   biolink:NamedThing
LINKML:Float    biolink:NamedThing
LINKML:Integer  biolink:NamedThing
LINKML:String   biolink:NamedThing
LINKML:Time     biolink:NamedThing
LINKML:Uriorcurie       biolink:NamedThing
LINKML:mixin    biolink:NamedThing

@caufieldjh
Member

Oh, those are definitely coming in through the Biolink merge, since BL imports them.
They can be omitted during the Phenio build.

@matentzn
Member

@kevinschaper what you are observing can be answered only on the phenio level - any weird ID can come in through imports, we do not really control the prefix space in PHENIO at all. ODK does come with a way though to drop specific prefixes from the pipeline!

@kevinschaper
Member Author

I'm doing an interactive kg build and started looking at phenio filtering. I'm going to reverse my include list and switch to this very short exclude list:

    exclude_prefixes = [
        "HGNC",
        "FlyBase",
        "http",
        "biolink"
    ]

I'll also write out a file in qc for what was excluded by prefix.
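
A minimal sketch of the opt-out filter together with the QC report, assuming nodes are dicts with an `id` key. The function name, the TSV layout, and the file name are hypothetical; the example writes to a temporary directory rather than a real `qc` folder.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

EXCLUDE_PREFIXES = {"HGNC", "FlyBase", "http", "biolink"}


def exclude_by_prefix(nodes, exclude, qc_path):
    """Drop nodes whose prefix is excluded and write per-prefix counts to a TSV file."""
    kept = []
    excluded = Counter()
    for node in nodes:
        prefix = node["id"].split(":", 1)[0]
        if prefix in exclude:
            excluded[prefix] += 1
        else:
            kept.append(node)
    qc_path = Path(qc_path)
    qc_path.parent.mkdir(parents=True, exist_ok=True)
    with qc_path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["prefix", "excluded_nodes"])
        writer.writerows(sorted(excluded.items()))
    return kept, excluded


# Illustrative nodes only; the HGNC ID is a placeholder.
nodes = [
    {"id": "HGNC:5"},
    {"id": "https://example.org/x"},
    {"id": "http://example.org/y"},
    {"id": "PATO:0000389"},
    {"id": "biolink:NamedThing"},
]

with tempfile.TemporaryDirectory() as tmp:
    qc_file = Path(tmp) / "qc" / "excluded_prefixes.tsv"
    kept, excluded = exclude_by_prefix(nodes, EXCLUDE_PREFIXES, qc_file)
    report = qc_file.read_text()
```

Because the match is exact, `http` in the exclude list does not remove IDs whose prefix is `https`; in this sketch those are kept.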

@matentzn
Member

Why are you excluding HGNC, for example? Are you not losing some Mondo->HGNC links this way?
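
This question can be checked before committing to an exclude list. The sketch below counts, per (subject prefix, object prefix) pair, the edges that would be lost if a set of prefixes were dropped. The function names, the predicate, and the sample IDs are placeholders rather than data from kg-phenio.

```python
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the text before the first colon."""
    return curie.split(":", 1)[0]


def lost_edge_pairs(edges, exclude):
    """Count edges that touch an excluded prefix, grouped by (subject, object) prefix."""
    lost = Counter()
    for edge in edges:
        pair = (curie_prefix(edge["subject"]), curie_prefix(edge["object"]))
        if pair[0] in exclude or pair[1] in exclude:
            lost[pair] += 1
    return lost


# Placeholder edges only.
edges = [
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HGNC:5"},
    {"subject": "MONDO:0000001", "predicate": "biolink:related_to", "object": "HP:0000118"},
]

lost = lost_edge_pairs(edges, {"HGNC", "FlyBase", "http", "biolink"})
```

A non-zero count for `("MONDO", "HGNC")` on the real graph would confirm that Mondo-to-gene links disappear with the HGNC exclusion.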
