Medgen updates #7

Merged
merged 3 commits into from
Aug 14, 2023

Conversation

joeflack4
Contributor

@joeflack4 joeflack4 commented Jul 23, 2023

Updates

a4eff72a96a4a018fb46a1222f25c968312cecb9

    - Update: medgen2obo.pl: (i) Abstracted adding of classes and their triples as a function, (ii) updated namespacing of classes based on what type of MedGen/UMLS identifier they are.
    - Update: Namespaces MedGen, MedGen_UI (removed), MedGen_CUI

f461d52abf3c7eb4981c43ab0a63aca653a333db

    - Update: new classes: duplicated some UMLS: classes as Medgen:, if they started with 'C' and a number.
    - Update: prefixes: In addition to the new classes above, renamed the UMLS prefix to Medgen for all other classes (which all happen to start with 'CN:').
    - Update: prefixes: Renamed prior MEDGEN: xref prefixes to Medgen_UID: These IDs don't start with C (CUI; Concept Unique Identifier) or CN (Common Name?). These are internal Medgen UIDs that are duplicative and not for clinical or analytical use.
    - Rename: bin/ -> src/
    - Add: output/: For both release outputs and non-release.
    - Rename: release/ -> output/release/
    - Add: mondo_mapping_status.py: For generating artefacts related to the reporting and management of mappings between Mondo and Medgen.
    - Add: Python dependency requirements files.
    - Add: run.sh: For running commands in ODK
    - Add: config/medgen.sssom-metadata.yml
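
The prefix rules described in the two commits above can be sketched roughly as follows. This is an illustration in Python, not the Perl in medgen2obo.pl; the function name `curies_for` and the CUI regular expression are assumptions, and the prefixes are the ones named in these commit notes.

```python
import re

# Assumption based on the commit notes: a UMLS CUI is 'C' followed by digits.
CUI_PATTERN = re.compile(r"^C\d+$")


def curies_for(identifier: str) -> list[str]:
    """Return the CURIE(s) an identifier would get under the rules above.

    Hypothetical helper for illustration only.
    """
    if CUI_PATTERN.match(identifier):
        # UMLS CUIs keep their UMLS: class and are duplicated as Medgen:
        return [f"UMLS:{identifier}", f"Medgen:{identifier}"]
    if identifier.startswith("CN"):
        # 'CN' identifiers only get the Medgen prefix
        return [f"Medgen:{identifier}"]
    # Everything else is treated as an internal MedGen UID
    return [f"Medgen_UID:{identifier}"]
```

Keeping the rule in one function mirrors the first commit's move to a single place where classes and their triples are added.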

@joeflack4 joeflack4 marked this pull request as draft July 23, 2023 21:17
@joeflack4 joeflack4 requested a review from matentzn July 23, 2023 21:17
@joeflack4 joeflack4 self-assigned this Jul 23, 2023
@joeflack4 joeflack4 added the enhancement New feature or request label Jul 23, 2023
@joeflack4 joeflack4 force-pushed the update1 branch 5 times, most recently from df9d980 to f208076 Compare July 24, 2023 01:07
@@ -0,0 +1,115 @@
"""Mapping status between Medgen and Mondo"""
Contributor Author

@joeflack4 joeflack4 Jul 24, 2023

Repurposed some of my GARD code to create the attached artefacts.

This feels a bit duplicative, but it's probably fine for now. However, I'm likely to do more of this for other ingests as well.
I did simplify some of it. Most of the code is the same between Medgen and GARD, but some things are specific to each.

medgen_terms_mapping_status.tsv.zip
obsoleted_medgen_terms_in_mondo.txt

Some counts:
# tot_medgen_only = len(existing_overlap_df[existing_overlap_df['status'] == 'medgen']) # n=66,224
# tot_mondo_only = len(existing_overlap_df[existing_overlap_df['status'] == 'mondo']) # n=2,362
# tot_both_only = len(existing_overlap_df[existing_overlap_df['status'] == 'both']) # n=14,263
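
A minimal sketch of how a per-term status like the one counted above could be derived. It uses plain Python sets rather than the DataFrame in mondo_mapping_status.py, and it assumes 'medgen' means the term is only in MedGen, 'mondo' means it only appears in Mondo's mappings, and 'both' means it is in each. The function and argument names are hypothetical.

```python
from collections import Counter


def mapping_status(medgen_ids: set[str], mondo_mapped_ids: set[str]) -> dict[str, str]:
    """Label every term as 'medgen', 'mondo', or 'both' (illustrative only)."""
    status = {}
    for term in medgen_ids | mondo_mapped_ids:
        if term in medgen_ids and term in mondo_mapped_ids:
            status[term] = "both"
        elif term in medgen_ids:
            status[term] = "medgen"
        else:
            status[term] = "mondo"
    return status


def status_counts(status: dict[str, str]) -> Counter:
    """Tally terms per status, analogous to the tot_* counts above."""
    return Counter(status.values())
```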

mondo_df['prefix'] = mondo_df['object_id'].apply(lambda x: x.split(':')[0])
mondo_df = mondo_df[mondo_df['prefix'].isin(MEDGEN_PREFIXES)] # n=16,627
del mondo_df['prefix']
# preds = list(mondo_df['predicate_id'].unique()) # only skos:exactMatch
Contributor Author

exactMatch thoughts

I imagine we want to do the same thing here as with GARD, where we only care about exact matches.
However, a couple of things I've discovered so far are:

  • Our prior Mondo->Medgen mappings were only of skos:exactMatch
  • All of the mappings coming out of Chris' pipeline are only of: oboInOwl:hasDbXref or owl:equivalentClass

Member

Replace all owl:equivalentClass with skos:exactMatch in the Medgen ingest. owl:equivalentClass is no longer relevant anywhere across our pipelines.

Contributor Author

I believe I've figured out how to do this. In OBO, this:
equivalent_to: <CURIE>
Should be changed to this:
property_value: exactMatch <CURIE>

I just need to change the Perl that creates the OBO like this. Then the pipeline will generate the correct OWL.
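
For illustration, the line-level change described above could look like this in Python. The actual fix is in the Perl that generates the OBO; this sketch instead post-processes already-written lines, the function name is hypothetical, and the `exactMatch` property name is copied as written in this comment.

```python
def equivalent_to_exact_match(obo_lines: list[str]) -> list[str]:
    """Rewrite `equivalent_to: X` lines as `property_value: exactMatch X`.

    Illustrative only; all other lines are passed through unchanged.
    """
    prefix = "equivalent_to: "
    rewritten = []
    for line in obo_lines:
        if line.startswith(prefix):
            curie = line[len(prefix):].strip()
            rewritten.append(f"property_value: exactMatch {curie}")
        else:
            rewritten.append(line)
    return rewritten
```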

@joeflack4 joeflack4 force-pushed the update1 branch 4 times, most recently from a89f466 to 05e8ecd Compare August 7, 2023 23:51
…riples as a function, (ii) updated namespacing of classes based on what type of MedGen/UMLS identifier they are.

- Update: Namespaces MedGen, MedGen_UI (removed), MedGen_CUI
- Bugfix: SSSOM metadata yaml had a typo preventing conversion
- Bugfix: Makefile: (i) needed to rename a dependency, (ii) needed to run 'analyze' step after 'stage'
- Update: Makefile: Simplified some goals
- Bugfix: For UMLS CUIs (i.e. IDs that start with C followed by numbers), we chose to duplicate classes under the namespaces UMLS and MedGen. However, references (e.g. xrefs) to those classes were not duplicated until now, e.g. MedGen:1 now maps to both MedGen:2 and UMLS:2.
@joeflack4 joeflack4 marked this pull request as ready for review August 13, 2023 23:26
joeflack4 commented Aug 13, 2023

@matentzn I'm going to merge this one for now. A handful of misc work was done on this ingest, but we're going to be pausing it for now, holding off to see if the MedGen team is able to handle a lot of the work that this ingest would otherwise be doing.

For future work on this ingest, when/if ever needed, I'll open up new PRs.

I did wrap up a couple of things from our last meeting though. Had to change a conditional block, and updated the namespaces, e.g. MedGen -> MEDGEN.

@joeflack4 joeflack4 merged commit b9e8bc5 into main Aug 14, 2023
@joeflack4 joeflack4 deleted the update1 branch August 14, 2023 00:49