Dangling edges for Panther #446

RichardBruskiewich · 2023-04-24T20:25:22Z

Dangling edges found for Panther. See https://monarch-initiative.github.io/monarch-qc/ for the report; https://data.monarchinitiative.org/monarch-kg-dev/ for the data.

Find out why and suggest a repair.

RichardBruskiewich · 2023-05-11T15:19:14Z

The simplest explanation here is that many (most) of the dangling edges have ENSEMBL subject or object gene identifiers.

A brute force solution - tempting to apply - is simply to have the ingest script filter out all edges with ENSEMBL prefixed identifiers.

That said, a simple grep of the dangling edges versus the ingest output file itself, isolating single deviant Panther protein groups, shows a slight discrepancy in the counts. Oddly enough, the ‘dangling edges’ file has many entries that totally lack a (mapped) original_subject or original_object node identifier (i.e. the column value is empty for the edge), but some of the entries still have one non-blank identifier, which generally seems to be from the ENSEMBL namespace (so it was not removed from the edge during mapping?). This oddity likely (at least partly) explains the count difference.
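To put numbers on that breakdown, a tally along these lines might help. It is only a sketch: the column names `subject`, `object` and `provided_by` are assumptions about the header of the dangling-edges TSV, and the function names are made up for illustration.

```python
import csv
import gzip
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the namespace prefix of a CURIE, or '(empty)' for a blank value."""
    curie = curie.strip()
    if not curie:
        return "(empty)"
    return curie.split(":", 1)[0]


def tally_prefixes(rows, columns=("subject", "object")) -> Counter:
    """Count (column, prefix) pairs over an iterable of dict-like edge rows."""
    counts = Counter()
    for row in rows:
        for column in columns:
            counts[(column, curie_prefix(row.get(column) or ""))] += 1
    return counts


def tally_file(path: str, provided_by: str = "panther_genome_orthologs_edges") -> Counter:
    """Tally prefixes for one ingest's edges in a gzipped TSV with a header row.

    The column names used here are assumed, not checked against the real file.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = (r for r in reader if r.get("provided_by") == provided_by)
        return tally_prefixes(rows)
```

Something like `tally_file("monarch-kg-dangling-edges.tsv.gz").most_common(10)` would then list the most frequent prefixes; passing `columns=("original_subject", "original_object")` to `tally_prefixes` would count the empty mapped columns instead.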

Unless we think otherwise, the necessary patch of the Panther ingest is simply to filter out ENSEMBL identifiers in either the subject or object node. I can issue a PR to do this and we could rerun the ingest to see if this handles most of the dangling edges.

I don’t know if we’ll lose any legitimate edges - I guess if we don’t commonly rely on ENSEMBL identifiers for gene nodes, but rather, model organism curated nodes only, then we should be fine.
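For concreteness, the proposed filter could look roughly like the following. This is a minimal sketch, not the actual ingest code: it assumes each edge is a dict with `subject` and `object` keys, and `drop_ensembl_edges` is a hypothetical helper name.

```python
def has_prefix(curie: str, prefix: str = "ENSEMBL") -> bool:
    """True if the CURIE belongs to the given namespace."""
    return curie.startswith(prefix + ":")


def drop_ensembl_edges(edges, prefix: str = "ENSEMBL"):
    """Return only the edges whose subject and object are both outside `prefix`.

    `edges` is assumed to be an iterable of dicts with 'subject' and 'object' keys.
    """
    kept = []
    for edge in edges:
        if has_prefix(edge["subject"], prefix) or has_prefix(edge["object"], prefix):
            continue  # brute-force: discard the whole edge
        kept.append(edge)
    return kept
```

Note that this discards an edge even when only one end is in the ENSEMBL namespace, which is exactly where legitimate edges could be lost.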

RichardBruskiewich · 2023-06-12T20:28:02Z

Attempted resolution in #456 by filtering out edges that contain ENSEMBL identifiers; however, after @kevinschaper and I reviewed this, we concluded that simply discarding ENSEMBL gene identifiers is not the best solution.

That said, @kevinschaper has made some progress in reducing the dangling edges(?).

RichardBruskiewich · 2023-06-19T17:59:13Z

Taking a fresh look today (June 19, 2023):

The file downloaded today shows this as the first dangling edge record:

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep panther |less
uuid:b401c46f-0dd0-11ee-bd34-f39d5ac7a30a               biolink:orthologous_to          biolink:GeneToGeneHomologyAssociation   infores:monarchinitiative       PANTHER.FAMILY:PTHR15464        infores:panther panther_genome_orthologs_edges                                                                         HGNC:11629      ENSEMBL:ENSSSCG00070024292

Searching the latest gene2ensembl file:

$ gunzip -c gene2ensembl.gz |grep ENSSSCG |grep 24292
9823    100516001       ENSSSCG00000007596      XM_003124292.3  ENSSSCT00000008336.5    XP_003124340.1  ENSSSCP00000008117.2
9823    100521003       ENSSSCG00000026422      XM_003124244.5  ENSSSCT00000022811.4    XP_003124292.1  ENSSSCP00000027235.2
9823    100624292       ENSSSCG00000028172      XM_013993942.2  ENSSSCT00000023759.4    XP_013849396.1  ENSSSCP00000020943.1
9823    100736682       ENSSSCG00000004854      XM_021098883.1  ENSSSCT00000024292.4    XP_020954542.1  ENSSSCP00000019275.3

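None of these rows is for ENSSSCG00070024292 (they match only because "24292" occurs elsewhere in the line), so the gene record is simply missing from the gene2ensembl file.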
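A stricter check than the double grep is to compare the Ensembl gene column exactly. The sketch below assumes the standard tab-separated gene2ensembl layout, with the Ensembl gene identifier in the third column; the function names are illustrative only.

```python
import gzip

# Zero-based index of the Ensembl gene identifier column in gene2ensembl
# (assumed layout: tax_id, GeneID, Ensembl gene, RNA acc, Ensembl RNA, ...).
ENSEMBL_GENE_COLUMN = 2


def find_gene_rows(lines, ensembl_gene_id: str):
    """Return the rows whose Ensembl gene column equals the identifier exactly."""
    matches = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > ENSEMBL_GENE_COLUMN and fields[ENSEMBL_GENE_COLUMN] == ensembl_gene_id:
            matches.append(fields)
    return matches


def find_in_file(path: str, ensembl_gene_id: str):
    """Run the exact-match lookup against a gzipped gene2ensembl file."""
    with gzip.open(path, "rt") as handle:
        return find_gene_rows(handle, ensembl_gene_id)
```

With this, `find_in_file("gene2ensembl.gz", "ENSSSCG00070024292")` would return an empty list if the identifier is truly absent, with no incidental substring hits to sift through.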

That said, a direct search at https://www.ebi.ac.uk/ebisearch/search using this identifier brings up the following (note: the first TCF-19 link is broken, but the other one works).

UniProtKB entry Q9TSV4 does have a link to the Ensembl gene record ENSSSCG00070024292.

Genomes & metagenomes (1 results)

Source: Ensembl Gene (ID: ENSSSCG00070024292)
[TCF19](https://www.ensembl.org/pig_usmarc/geneview?gene=ENSSSCG00070024292)

transcription factor 19 [Source:NCBI gene;Acc:100152381]

Cross References: Samples & ontologies (3) Nucleotide sequences (2) Protein sequences (2)

Protein sequences (1 results)

Source: UniProtKB (ID: TCF19_PIG)
[Q9TSV4](https://www.uniprot.org/uniprot/Q9TSV4)

Transcription factor 19 TCF-19
Sus scrofa(Reviewed)
Secondary accession number(s): O19083

Cross References: Protein families (22) Bioactive molecules (8) Protein sequences (6) show more

Formats:[ in FASTA format ](https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=Q9TSV4&format=fasta&style=raw)in Feature Viewer in Interpro Matches

RichardBruskiewich · 2023-06-19T18:25:58Z

Another random use case: ANIA loci (these look like Aspergillus nidulans locus identifiers rather than Dictyostelium ones).

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep ANIA_ |wc -l
29950

A modest subset of the dangling edges.

For example, we look at the first one at the top of the list: ANIA_10586

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep ANIA_ |head -1
uuid:b401c492-0dd0-11ee-bd34-f39d5ac7a30a               biolink:orthologous_to          biolink:GeneToGeneHomologyAssociation   infores:monarchinitiative       PANTHER.FAMILY:PTHR43765        infores:panther panther_genome_orthologs_edges                                                                         SGD:S000002605   ENSEMBL:ANIA_10586

Again, assuming that this is ENSEMBL, nothing is found in gene2ensembl.gz:

$ gunzip -c gene2ensembl.gz |grep ANIA_|wc -l
0

However, a UniProtKB search that ignores the ENSEMBL prefix has a hit: C8VAR7, which includes a Panther family mapping: PTHR43765.

Thus, for these pseudo-ENSEMBL CURIEs that have object ids beginning with ANIA_ (locus identifiers from the original Aspergillus nidulans gene set?), we'd simply want to strip off the ENSEMBL prefix and conduct a direct match on UniProtKB. There is perhaps a caveat here in that the matching UniProtKB entry is unreviewed (TrEMBL).
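The prefix stripping itself is simple; a rough sketch follows. The tuple of locus prefixes and the name `uniprot_query_term` are hypothetical, and only `ANIA_` comes from the example above.

```python
# Local-id prefixes that should be looked up in UniProtKB without the
# ENSEMBL namespace. Only "ANIA_" is taken from the case examined here;
# other entries would be added one use case at a time.
LOCUS_PREFIXES = ("ANIA_",)


def uniprot_query_term(curie: str, locus_prefixes=LOCUS_PREFIXES) -> str:
    """Strip the ENSEMBL prefix from pseudo-ENSEMBL CURIEs; leave others unchanged."""
    prefix, _, local_id = curie.partition(":")
    if prefix == "ENSEMBL" and local_id.startswith(tuple(locus_prefixes)):
        return local_id
    return curie
```

For the record above, `uniprot_query_term("ENSEMBL:ANIA_10586")` gives `ANIA_10586`, which is the term that found C8VAR7; a genuine Ensembl identifier such as `ENSEMBL:ENSSSCG00070024292` passes through untouched.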

RichardBruskiewich · 2023-06-19T20:14:31Z

The common thread between the two examples I chose so far seems to be to search in UniProtKB for the identifier mappings. Note, however, that the search is slightly different for each one since the original identifiers are distinct in character.

We are already using UniProtKB, but given the size of the id mapping file (>11 GB?), we likely need to be a bit clever (working iteratively, based on each use case we find, maybe one subset of missing identifiers at a time?).
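One way to stay within memory is to stream the mapping file once and keep only the rows for the identifiers we are currently missing. This is a sketch under assumptions: it expects the three-column, tab-separated layout (UniProtKB accession, ID type, external ID) of the UniProtKB idmapping file, and the helper names are invented here.

```python
import gzip


def select_mappings(lines, wanted_ids):
    """Collect (accession, id_type) pairs for each wanted external identifier.

    Only matching rows are retained, so memory use scales with the number
    of hits rather than with the size of the mapping file.
    """
    wanted = set(wanted_ids)
    hits = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed rows
        accession, id_type, external_id = parts
        if external_id in wanted:
            hits.setdefault(external_id, []).append((accession, id_type))
    return hits


def select_from_file(path: str, wanted_ids):
    """Stream a gzipped mapping file line by line and filter it."""
    with gzip.open(path, "rt") as handle:
        return select_mappings(handle, wanted_ids)
```

The set of wanted identifiers would come from one subset of the dangling edges at a time. Identifiers in the mapping file may differ in form from ours (version suffixes, for instance), which this sketch does not handle.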

RichardBruskiewich · 2023-10-03T20:05:10Z

Related to monarch-initiative/monarch-app#351 which was closed?

Donkey: "Are we there yet?" Shrek: "Shut up!"

RichardBruskiewich · 2024-01-24T02:44:00Z

@sagehrke @kevinschaper, I don't have any more insights to add beyond the above analyses. The ENSEMBL team - if I recall - didn't seem to think that the identifiers in question are missing at their end.

The most fruitful approach here may be to leverage UniProtKB in an SSSOM kind of way? I leave this with you...

RichardBruskiewich self-assigned this Apr 24, 2023

RichardBruskiewich removed their assignment Jan 24, 2024

kevinschaper mentioned this issue Dec 11, 2024

Investigate possible missing orthology monarch-initiative/monarch-app#923

Open

3 tasks
