berkeley-schema-fy24: make squeaky-clean all output begins with error message about missing file local/gold-study-ids.yaml #2177

Open
eecavanna opened this issue Sep 6, 2024 · 10 comments
Labels: backlog, bug, cleanup, unfinished, X SMALL


@eecavanna
Collaborator

In my local clone of the berkeley-schema-fy24 repo, when I run $ make squeaky-clean all, the console output begins with an error message:

$ make squeaky-clean all
Error: open local/gold-study-ids.yaml: no such file or directory
rm -rf project
rm -rf tmp
# ...

Screenshot: [image omitted]

@ssarrafan
Collaborator

@turbomam @eecavanna this hasn't been touched in 2 weeks
Removing from sprint, adding backlog label

ssarrafan added the backlog and unfinished labels on Sep 20, 2024
@turbomam
Member

turbomam commented Sep 24, 2024

I would like to get some buy-in from @mbthornton-lbl because I designed this buggy feature in support of automating the validation of data that can be retrieved with methods he wrote. I'm not using it, and removing that whole workflow would really slim down the nmdc-schema's Makefile.

Update: I intended for this to help with bulk get-study-related-records operations

get-study-related-records = "src.scripts.nmdc_database_tools:cli" # todo recheck

We could also remove the targets that interact with fuseki as part of this.

@eecavanna
Collaborator Author

Some paths forward I see:

  1. If the file has moved, update the reference and close issue
  2. If the file is gone, remove the reference and close issue
  3. If the file is obsolete, remove the reference and close issue

@turbomam
Member

.PHONY: pre-build
pre-build: local/gold-study-ids.yaml create-nmdc-tdb2-from-app

## getting a report of GOLD study identifiers, which might have been used as Study ids in legacy (pre-Napa) data
local/gold-study-ids.json:
	curl -X 'GET' \
		--output $@ \
		'https://api-napa.microbiomedata.org/nmdcschema/study_set?max_page_size=999&projection=id%2Cgold_study_identifiers' \
		-H 'accept: application/json'

local/gold-study-ids.yaml: local/gold-study-ids.json
	yq -p json -o yaml $< | cat > $@

# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_IDS := $(shell yq '.resources.[].id' local/gold-study-ids.yaml  | awk '{printf "%s ", $$0} END {print ""}')

# can't ever be used without generating local/gold-study-ids.yaml first
print-discovered-study-ids:
	@echo $(STUDY_IDS)

# Replace colons with hyphens in study IDs
# can't ever be used without generating local/gold-study-ids.yaml first
STUDY_YAML_FILES := $(addsuffix .yaml,$(addprefix local/study-files/,$(subst :,-,$(STUDY_IDS))))

# can't ever be used without generating local/gold-study-ids.yaml first
create-study-yaml-files-from-study-ids-list: $(STUDY_YAML_FILES)

# can't ever be used without generating local/gold-study-ids.yaml first
print-intended-yaml-files: local/gold-study-ids.yaml
	@echo $(STUDY_YAML_FILES)
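
Note on the Makefile excerpt above: STUDY_IDS is assigned with `:=` and `$(shell ...)`, which GNU Make evaluates while it parses the Makefile. That means yq runs on every invocation, including `make squeaky-clean all`, and reports the missing file whenever local/gold-study-ids.yaml has not been generated yet. For comparison, here is a minimal Python sketch of the same ID-to-filename mapping that tolerates a missing input. It reads the JSON response rather than the YAML so it can stay within the standard library, and the function name `study_yaml_files` is hypothetical, not something that exists in the repo.

```python
import json
from pathlib import Path


def study_yaml_files(ids_json: Path, out_dir: Path = Path("local/study-files")) -> list[Path]:
    """Map study ids to per-study YAML paths, replacing ':' with '-'.

    Returns an empty list when the ids file has not been generated yet,
    instead of failing the way the parse-time $(shell yq ...) call does.
    """
    if not ids_json.exists():
        return []
    resources = json.loads(ids_json.read_text()).get("resources", [])
    return [out_dir / f"{r['id'].replace(':', '-')}.yaml" for r in resources]
```

With the three ids shown later in this thread, this would yield paths such as local/study-files/nmdc-sty-11-8fb6t785.yaml, matching what the dry run prints.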

@turbomam
Member

turbomam commented Sep 26, 2024

PS: API calls with arbitrary, high max_page_size are risky
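
One alternative to a single large page is to request modest pages and follow the `next_page_token` that the API returns in its responses (one is visible in the data_generation_set response further down this thread). The sketch below assumes a caller-supplied `fetch_page` function that performs the actual HTTP GET and passes the token back to the API; the query parameter used for the token is not shown in this thread, so it is deliberately left out. `iter_resources` and `FAKE_PAGES` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterator, Optional


def iter_resources(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield every resource, following next_page_token from page to page."""
    token: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(token)
        yield from page.get("resources", [])
        token = page.get("next_page_token")
        if not token:
            return  # no token means this was the last page
    # Safety stop so a misbehaving server cannot keep us looping forever.
    raise RuntimeError(f"stopped after {max_pages} pages; listing may be incomplete")


# In-memory stand-in for the HTTP call, keyed by the incoming page token.
FAKE_PAGES = {
    None: {"resources": [{"id": "nmdc:sty-11-8fb6t785"}], "next_page_token": "t1"},
    "t1": {"resources": [{"id": "nmdc:sty-11-33fbta56"}]},
}

study_ids = [r["id"] for r in iter_resources(lambda token: FAKE_PAGES[token])]
```

Unlike max_page_size=999, this does not silently truncate the result once the collection grows past the chosen page size.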

@turbomam
Member

turbomam commented Sep 26, 2024

wc -l local/gold-study-ids.yaml

63 local/gold-study-ids.yaml

head local/gold-study-ids.yaml

resources:
  - id: nmdc:sty-11-8fb6t785
    gold_study_identifiers:
      - gold:Gs0114675
  - id: nmdc:sty-11-33fbta56
    gold_study_identifiers:
      - gold:Gs0110138
  - id: nmdc:sty-11-aygzgv51
    gold_study_identifiers:
      - gold:Gs0114663

@turbomam
Member

make --dry-run create-study-yaml-files-from-study-ids-list

mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml
sed -i.bak 's/gold:/GOLD:/' local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml # kludge modify data to match (old!) schema
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.bak
poetry run linkml-validate --schema nmdc_schema/nmdc_materialized_patterns.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml > local/study-files/nmdc-sty-11-8fb6t785.yaml.validation.log.txt
time poetry run migration-recursion \
        --schema-path nmdc_schema/nmdc_materialized_patterns.yaml \
        --input-path local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml \
        --output-path local/study-files/nmdc-sty-11-8fb6t785.yaml # kludge masks ids that contain whitespace
rm -rf local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml.bak

mkdir -p local/study-files
study_file_name=`echo local/study-files/nmdc-sty-11-33fbta56.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-33fbta56.yaml.tmp.yaml

etc.

@turbomam
Member

turbomam commented Sep 26, 2024

study_file_name=`echo local/study-files/nmdc-sty-11-8fb6t785.yaml` ; \
        echo $study_file_name ; \
        study_id=`poetry run get-study-id-from-filename $study_file_name` ; \
        echo $study_id ; \
        date ; \
        time poetry run get-study-related-records \
                --api-base-url https://api-berkeley.microbiomedata.org \
                extract-study \
                --study-id $study_id \
                --output-file local/study-files/nmdc-sty-11-8fb6t785.yaml.tmp.yaml

local/study-files/nmdc-sty-11-8fb6t785.yaml
nmdc:sty-11-8fb6t785
Thu Sep 26 11:10:57 AM EDT 2024
STUDY-ID: nmdc:sty-11-8fb6t785
SCHEMA-VERSION: 11.0.0rc22
Got study nmdc:sty-11-8fb6t785 from the NMDC database.
Got 0 biosamples part_of nmdc:sty-11-8fb6t785.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 261, in extract_study
    raise e
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 253, in extract_study
    omics_processing_records = api_client.get_omics_processing_records_part_of_study(study_id)
  File "/home/mark/gitrepos/berkeley-schema-fy24/src/scripts/nmdc_database_tools.py", line 75, in get_omics_processing_records_part_of_study
    response.raise_for_status()
  File "/home/mark/.cache/pypoetry/virtualenvs/nmdc-schema-gXr5ogK9-py3.10/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api-berkeley.microbiomedata.org/nmdcschema/omics_processing_set?filter=%7B%22part_of%22%3A+%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=1000

real 0m3.892s
user 0m1.407s
sys 0m0.124s

@turbomam
Member

turbomam commented Sep 26, 2024

nmdc:sty-11-8fb6t785 appears to be a real study: https://api-berkeley.microbiomedata.org/nmdcschema/ids/nmdc%3Asty-11-8fb6t785

but the command above is trying to find OmicsProcessings that are part of nmdc:sty-11-8fb6t785, and OmicsProcessing has been replaced with DataGeneration subclasses as of berkeley-schema-fy24

Also maybe there really are no DataGeneration subclass instances that are part of that Study?

https://api-berkeley.microbiomedata.org/nmdcschema/data_generation_set?filter=%7B%22part_of%22%3A%22nmdc%3Asty-11-8fb6t785%22%7D&max_page_size=20

In fact, maybe DataGeneration subclass instances can't be part_of anything any more?

https://api-berkeley.microbiomedata.org/nmdcschema/data_generation_set?max_page_size=1

{
  "resources": [
    {
      "id": "nmdc:omprc-11-0003fm52",
      "name": "1000S_WLUP_FTMS_SPE_BTM_1_run2_Fir_22Apr22_300SA_p01_149_1_3506",
      "description": "High resolution MS spectra only",
      "has_input": [
        "nmdc:bsm-11-jht0ty76"
      ],
      "has_output": [
        "nmdc:dobj-11-cp4p5602"
      ],
      "processing_institution": "EMSL",
      "type": "nmdc:MassSpectrometry",
      "analyte_category": "nom",
      "associated_studies": [
        "nmdc:sty-11-28tm5d36"
      ],
      "instrument_used": [
        "nmdc:inst-14-mwrrj632"
      ]
    }
  ],
  "next_page_token": "nmdc:sys0qphf9j29"
}

see also

https://microbiomedata.github.io/berkeley-schema-fy24/MassSpectrometry/

so if that's still hard-coded into https://github.com/microbiomedata/berkeley-schema-fy24/blob/cd6acbee87b627b439d068b6bfeb8cb002f05d99/src/scripts/nmdc_database_tools.py#L64-L83

then maybe that script should be considered unmaintained?
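
If the script were kept, the lookup would presumably need to target data_generation_set and filter on associated_studies (the field that appears in the record above) instead of part_of. A minimal sketch of building such a query URL follows. It assumes the endpoint accepts a filter on associated_studies in the same way it accepted one on part_of, which has not been verified here; `data_generation_query_url` is a hypothetical helper, not an existing function in nmdc_database_tools.py.

```python
import json
from urllib.parse import urlencode

API_BASE_URL = "https://api-berkeley.microbiomedata.org"


def data_generation_query_url(
    study_id: str,
    max_page_size: int = 20,
    api_base_url: str = API_BASE_URL,
) -> str:
    """Build a data_generation_set query filtered by study id.

    Assumption: the API matches records whose associated_studies list
    contains study_id when given this filter.
    """
    params = {
        "filter": json.dumps({"associated_studies": study_id}),
        "max_page_size": max_page_size,
    }
    return f"{api_base_url}/nmdcschema/data_generation_set?{urlencode(params)}"


url = data_generation_query_url("nmdc:sty-11-8fb6t785")
```

The filter is encoded the same way as in the failing omics_processing_set URL from the traceback, so only the collection name and the field name differ.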
