Error getting DrexelMetadata in episode 10 #10

Open
thompsonmj opened this issue Mar 16, 2023 · 5 comments

@thompsonmj
Contributor

(/fs/ess/PAS2136/Workshops/Snakemake/conda_env) [thompsonmj@o0647 SnakemakeWorkflow]$ snakemake -c1 --use-singularity DrexelMetadata/bj373514.json
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
generate_metadata        1              1              1
total                    1              1              1
Select jobs to execute...
[Thu Mar 16 14:26:09 2023]
rule generate_metadata:
    input: Images/bj373514.jpg
    output: DrexelMetadata/bj373514.json, Mask/bj373514_mask.png
    log: logs/generate_metadata_bj373514.log
    jobid: 0
    reason: Missing output files: DrexelMetadata/bj373514.json
    wildcards: image=bj373514
    resources: tmpdir=/tmp/slurmtmp.23961738
Activating singularity image /users/PAS2136/thompsonmj/SnakemakeWorkflow/.snakemake/singularity/48c2d571fde349f4656aa5ab95dccc30.simg
WARNING: Environment variable LD_PRELOAD already has value [], will not forward new value [/usr/local/xalt/xalt/lib64/libxalt_init.so] from parent process environment
Waiting at most 5 seconds for missing files.
MissingOutputException in rule generate_metadata in file https://raw.githubusercontent.com/hdr-bgnn/BGNN_Core_Workflow/1.0.0/workflow/Snakefile, line 19:
Job 0 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
Mask/bj373514_mask.png
Removing output files of failed job generate_metadata since they might be corrupted:
DrexelMetadata/bj373514.json
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
@johnbradley
Collaborator

@thompsonmj Could you check the log mentioned above, logs/generate_metadata_bj373514.log, to see if it contains anything helpful?
The error is that gen_metadata.py created DrexelMetadata/bj373514.json but not Mask/bj373514_mask.png.
What does your Snakefile look like right now?

@johnbradley
Collaborator

@thompsonmj Were you able to figure out what caused your problem? If not, you could also check Images/bj373514.jpg to see if it's a valid image.

@johnbradley
Collaborator

This problem could be caused by a typo in the download_image rule

rule download_image:
    params: url=get_image_url    
    output:"images/bj373514.jpg"
    shell: "wget -O {output} {params.url}"

If you used a lowercase -o, wget would write the log of the download to the output file instead of the image.
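
A quick way to test both suggestions (whether the file is a valid image, and whether a download log ended up in it) is to look at the first bytes of the file. The sketch below is only a rough check, not part of the workshop materials: the helper name `looks_like_jpeg` is hypothetical, and it tests only for the JPEG signature bytes `FF D8 FF` rather than decoding the image.

```python
# Hypothetical helper, not from the workshop: a cheap signature check only.
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def looks_like_jpeg(path):
    """Return True if the file at `path` begins with the JPEG signature."""
    with open(path, "rb") as handle:
        return handle.read(len(JPEG_MAGIC)) == JPEG_MAGIC
```

For example, `looks_like_jpeg("Images/bj373514.jpg")` returning False would suggest the file holds something other than image data, such as log text.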

@thompsonmj
Contributor Author

I'll check on this today. I had accidentally overwritten my Snakefile by copying your solution. I got it recovered, so I'll check if it was that typo or something else. The solution does work fine though.

@thompsonmj
Contributor Author

Here was the Snakefile I had built up after going through the episodes:

import pandas as pd

def get_image_url(wildcards):
        filename = config["filter_multimedia"]
        df = pd.read_csv(filename)
        row = df[df["arkID"] == wildcards.ark_id]
        url = row["accessURI"].item()
        return url

def get_image_filenames(wildcards):
	filename = config["filter_multimedia"]
	df = pd.read_csv(filename)
	ark_ids = df["arkID"].tolist()
	return expand("Images/{ark_id}.jpg", ark_id=ark_ids)

configfile: "config.yaml"

rule all:
	input: get_image_filenames

rule reduce:
	input: "multimedia.csv"
	params: rows="11"
	output: "reduce/multimedia.csv"
	resources:
		mem_mb=200
	shell: "head -n {params.rows} {input} > {output}"

rule download_image:
	input: config["filter_multimedia"]
	params: url=get_image_url
	output: "Images/{ark_id}.jpg"
	container: "docker://quay.io/biocontainers/gnu-wget:1.18--h60da905_7"
	shell: "wget -O {output} {params.url}"

checkpoint filter:
	input:
		script = "Scripts/FilterImages.R",
		fishes = config["reduce_multimedia"]
	output: config["filter_multimedia"]
	shell: "Rscript {input.script}"

module bgnn_core:
	snakefile:
		github("hdr-bgnn/BGNN_Core_Workflow", path="workflow/Snakefile", tag="1.0.0")

use rule generate_metadata from bgnn_core
use rule transform_metadata from bgnn_core
use rule crop_image from bgnn_core
use rule segment_image from bgnn_core

def get_summary_inputs(wildcards):
	filename = checkpoints.filter.get().output[0]
	df = pd.read_csv(filename)
	ark_ids = df["arkID"].tolist()
	return expand('Segmented/{arkID}_segmented.png', arkID=ark_ids)

rule summary:
	input:
		scripts="Scripts/SummaryReport.R",
		markdown="Scripts/Summary.Rmd",
		morphology=get_summary_inputs
	output: config["summary_report"]
	container: "docker://ghcr.io/rocker-org/tidyverse:4.2.2"
	shell: "Rscript {input.script}"

compared to the solution Snakefile:

import pandas as pd

configfile: "config.yaml"

rule all:
    input: config["summary_report"]

rule reduce:
    input: "multimedia.csv"
    params: rows="11"
    output: "reduce/multimedia.csv"
    shell: "head -n {params.rows} {input} > {output}"

checkpoint filter:
    input:
        script="Scripts/FilterImages.R",
        fishes=config["reduce_multimedia"]
    output: config["filter_multimedia"]
    shell: "Rscript {input.script}" 

def get_image_url(wildcards):
    filename = checkpoints.filter.get().output[0]
    df = pd.read_csv(filename)
    row = df[df["arkID"] == wildcards.ark_id]
    url = row["accessURI"].item()
    return url 

rule download_image:
    input: config["filter_multimedia"]
    params: url=get_image_url
    output: "Images/{ark_id}.jpg"
    container: "docker://quay.io/biocontainers/gnu-wget:1.18--h60da905_7"
    shell: "wget -O {output} {params.url}"


module bgnn_core:
    snakefile:
        github("hdr-bgnn/BGNN_Core_Workflow", path="workflow/Snakefile", tag="1.0.0")

use rule generate_metadata from bgnn_core
use rule transform_metadata from bgnn_core
use rule crop_image from bgnn_core
use rule segment_image from bgnn_core

def get_segmentation_files(wildcards):
    filename = checkpoints.filter.get().output[0]
    df = pd.read_csv(filename)
    ark_ids = df["arkID"].tolist()
    return expand("Segmented/{ark_id}_segmented.png", ark_id=ark_ids)

rule summary:
    input:
       script="Scripts/SummaryReport.R", 
       segmentation=get_segmentation_files
    output: config["summary_report"]
    container: "docker://ghcr.io/rocker-org/tidyverse:4.2.2"
    shell: "Rscript {input.script}"
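
In both Snakefiles, get_image_url does the same lookup: find the row whose arkID matches the wildcard and return its accessURI. The two versions differ in where the CSV filename comes from (config["filter_multimedia"] in mine, checkpoints.filter.get().output[0] in the solution). For testing that lookup outside Snakemake, here is a minimal sketch that swaps pandas for the standard-library csv module. The function name `lookup_access_uri` is hypothetical; the column names are the ones used above.

```python
import csv


def lookup_access_uri(csv_path, ark_id):
    """Return the accessURI of the single row whose arkID equals ark_id.

    Hypothetical stand-in for the pandas lookup in get_image_url.
    Raises ValueError unless exactly one row matches.
    """
    with open(csv_path, newline="") as handle:
        matches = [
            row["accessURI"]
            for row in csv.DictReader(handle)
            if row["arkID"] == ark_id
        ]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one row for {ark_id!r}, found {len(matches)}"
        )
    return matches[0]
```

Running this against the filtered CSV with an ID such as "bj373514" shows which URL the download rule would receive, without starting a workflow run.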
