Dangling edges for Panther #446
Comments
The simplest explanation here is that many (most) of the dangling edges have ENSEMBL subject or object gene identifiers. A tempting brute-force solution is simply to have the ingest script filter out all edges with ENSEMBL-prefixed identifiers. That said, a simple grep of the dangling edges file versus the ingest output file itself, isolating single deviant Panther protein groups, shows a slight discrepancy in the counts. Oddly enough, the ‘dangling edges’ file has many entries that totally lack a (mapped) original_subject or original_object node identifier (i.e. the column value is empty for the edge), but some of the entries still have one non-blank identifier, which generally seems to be from the ENSEMBL namespace (so it was not removed from the edge during mapping?). This oddity likely explains the count difference, at least partly. Unless we think otherwise, the necessary patch of the Panther ingest is simply to filter out edges with an ENSEMBL identifier in either the subject or object node. I can issue a PR to do this and we could rerun the ingest to see if this handles most of the dangling edges. I don’t know if we’ll lose any legitimate edges. I guess that if we don’t commonly rely on ENSEMBL identifiers for gene nodes, but rather on model-organism curated nodes only, then we should be fine. |
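For illustration only, the brute-force filter described in this comment could be sketched as below. This is a minimal sketch, not the actual Panther ingest code: the edge is assumed to be a plain dict with `subject` and `object` keys holding CURIEs, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Assumption: ENSEMBL identifiers appear as CURIEs with this exact prefix.
ENSEMBL_PREFIX = "ENSEMBL:"


def has_ensembl_node(edge: Dict[str, str]) -> bool:
    """Return True if the subject or object of the edge is an ENSEMBL CURIE."""
    return any(
        (edge.get(key) or "").startswith(ENSEMBL_PREFIX)
        for key in ("subject", "object")
    )


def drop_ensembl_edges(edges: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield only the edges whose subject and object are both non-ENSEMBL."""
    for edge in edges:
        if not has_ensembl_node(edge):
            yield edge
```

A filter like this discards the whole edge, which is exactly why it risks losing legitimate orthology edges whose ENSEMBL identifier could have been mapped instead.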
Attempted resolution in #456 by filtering out edges that contain ENSEMBL identifiers; however, after @kevinschaper and I reviewed this, we concluded that simply discarding ENSEMBL gene identifiers is not the best solution. That said, @kevinschaper has made some progress in reducing the dangling edges(?). |
Taking a fresh look today (June 19, 2023): the file downloaded today shows the first dangling edge record:
Searching the latest gene2ensembl file:
$ gunzip -c gene2ensembl.gz | grep ENSSSCG | grep 24292
9823 100516001 ENSSSCG00000007596 XM_003124292.3 ENSSSCT00000008336.5 XP_003124340.1 ENSSSCP00000008117.2
9823 100521003 ENSSSCG00000026422 XM_003124244.5 ENSSSCT00000022811.4 XP_003124292.1 ENSSSCP00000027235.2
9823 100624292 ENSSSCG00000028172 XM_013993942.2 ENSSSCT00000023759.4 XP_013849396.1 ENSSSCP00000020943.1
9823 100736682 ENSSSCG00000004854 XM_021098883.1 ENSSSCT00000024292.4 XP_020954542.1 ENSSSCP00000019275.3
The four hits match 24292 only in other columns (RefSeq accessions, the NCBI Gene ID, an Ensembl transcript ID), never in the Ensembl gene column. This shows that the gene record is simply missing from the gene2ensembl file. That said, a direct https://www.ebi.ac.uk/ebisearch/search using this identifier brings up the following (note, the first TCF-19 link is broken, but the other one works). UniProtKB entry Q9TSV4 does have a link to the Ensembl gene record ENSSSCG00070024292.
|
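Because the grep on 24292 matches that substring anywhere in a row, an exact match on the Ensembl gene column gives a cleaner answer to "is this gene in gene2ensembl at all?". The sketch below assumes the file is tab-separated with the Ensembl gene identifier in the third column (consistent with the rows shown above) and that header lines start with `#`; the function name and file path are hypothetical.

```python
import gzip
from typing import List


def find_ensembl_gene(path: str, ensembl_gene_id: str) -> List[List[str]]:
    """Return gene2ensembl rows whose Ensembl gene column equals the given ID."""
    hits: List[List[str]] = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            # Assumption: column 3 (index 2) holds the Ensembl gene identifier.
            if len(fields) > 2 and fields[2] == ensembl_gene_id:
                hits.append(fields)
    return hits


if __name__ == "__main__":
    # Hypothetical local copy of the NCBI file.
    rows = find_ensembl_gene("gene2ensembl.gz", "ENSSSCG00070024292")
    print(len(rows), "matching rows")
```

An empty result here would confirm the conclusion above without the incidental substring hits.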
Another random use case: ANIA loci (Dictyostelium genomic loci).
These make up a modest subset of the dangling edges. For example, we look at the first one at the top of the list, ANIA_10586:
$ gunzip -c monarch-kg-dangling-edges.tsv.gz | grep ANIA_ | head -1
uuid:b401c492-0dd0-11ee-bd34-f39d5ac7a30a biolink:orthologous_to biolink:GeneToGeneHomologyAssociation infores:monarchinitiative PANTHER.FAMILY:PTHR43765 infores:panther panther_genome_orthologs_edges SGD:S000002605 ENSEMBL:ANIA_10586
Again, assuming that this is ENSEMBL, nothing is found in gene2ensembl.gz:
$ gunzip -c gene2ensembl.gz | grep ANIA_ | wc -l
0
However, a UniProtKB search ignoring the ENSEMBL prefix has a hit: C8VAR7, including a Panther family mapping: PTHR43765. Thus, for these pseudo-ENSEMBL CURIEs that have object IDs beginning with ANIA_ (locus identifiers from the original Dictyostelium gene set?), we'd simply want to strip off the ENSEMBL prefix and conduct a direct match on UniProtKB. There is perhaps a caveat here in that this locus is deemed uncurated TrEMBL. |
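The prefix-stripping step suggested here could look roughly like the following. It is a sketch under assumptions: the helper name is hypothetical, and the list of locus prefixes contains only `ANIA_` because that is the only case examined in this thread.

```python
from typing import Optional, Tuple


def bare_locus_id(curie: str, locus_prefixes: Tuple[str, ...] = ("ANIA_",)) -> Optional[str]:
    """Strip the ENSEMBL prefix from pseudo-ENSEMBL CURIEs such as ENSEMBL:ANIA_10586.

    Returns the bare locus identifier for a direct UniProtKB match, or None
    when the CURIE is not ENSEMBL-prefixed or is not a known locus pattern.
    """
    prefix, sep, local_id = curie.partition(":")
    if sep and prefix == "ENSEMBL" and local_id.startswith(locus_prefixes):
        return local_id
    return None
```

Genuine Ensembl gene identifiers are deliberately left alone (the function returns `None` for them), so they can go through whatever separate mapping path is chosen.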
The common thread between the two examples I have chosen so far seems to be searching UniProtKB for the identifier mappings. Note, however, that the search is slightly different for each one, since the original identifiers are distinct in character. We are using UniProtKB already, but given the size of the ID mapping file (>11 GB?), we likely need to be a bit clever (working iteratively through each use case we find, maybe one subset of missing identifiers at a time?). |
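One way to be "a bit clever" with a file that large is to stream it once and keep only the rows for the current subset of missing identifiers, rather than loading it all into memory. The sketch below assumes a gzipped, tab-separated mapping file with three columns (UniProtKB accession, ID type, external identifier); that layout, the function name, and the file path are assumptions to check against the actual file used by the ingest.

```python
import gzip
from typing import Dict, Iterable, Set


def map_subset(path: str, wanted_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Stream the mapping file, collecting UniProtKB accessions for wanted IDs only."""
    wanted = set(wanted_ids)
    found: Dict[str, Set[str]] = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                continue  # skip rows that do not fit the assumed layout
            accession, _id_type, external_id = fields
            if external_id in wanted:
                found.setdefault(external_id, set()).add(accession)
    return found


if __name__ == "__main__":
    # Hypothetical file name; the subset would come from the dangling edges report.
    print(map_subset("idmapping.dat.gz", ["ANIA_10586"]))
```

Memory use then scales with the size of the subset rather than with the mapping file, so each use case can be handled in its own pass.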
Related to monarch-initiative/monarch-app#351, which was closed? Donkey: "Are we there yet?" Shrek: "Shut up!" |
@sagehrke @kevinschaper, I don't have any more insights to add beyond the above analyses. The ENSEMBL team - if I recall - didn't seem to think that the identifiers in question are missing at their end. The most fruitful approach here may be to leverage UniProtKB in an SSSOM kind of way? I leave this with you... |
Dangling edges found for Panther. See https://monarch-initiative.github.io/monarch-qc/ for the report; https://data.monarchinitiative.org/monarch-kg-dev/ for the data.
Find out why and suggest a repair.