Merge pull request #140 from tillenglert/release1.0.0_review_changes
Release1.0.0 review changes
tillenglert authored Oct 9, 2024
2 parents 68258d0 + 954de99 commit 65372b7
Showing 46 changed files with 177 additions and 184 deletions.
14 changes: 7 additions & 7 deletions .github/workflows/ci.yml
@@ -10,7 +10,7 @@ on:

env:
NXF_ANSI_LOG: false
NFTEST_VER: "0.8.4"
NFT_VER: "0.8.4"

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -50,9 +50,9 @@ jobs:
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
uses: nf-core/setup-nf-test@v1
with:
version: ${{ env.NFT_VER }}

- name: Run nf-test
run: |
@@ -101,9 +101,9 @@ jobs:
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
uses: nf-core/setup-nf-test@v1
with:
version: ${{ env.NFT_VER }}

- name: Run nf-test
env:
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -3,9 +3,9 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.0.0 - [2022-01-20]
## v1.0.0 - nf-core/metapep "Golden Megalodon" - [2022-01-20]

First release of [nf-core/metapep](https://nf-co.re/metapep), created based on [nf-core](https://nf-co.re) standards and [nf-core/tools](https://nf-co.re/tools) template version 1.14.1.
First release of [nf-core/metapep](https://nf-co.re/metapep), based on [nf-core](https://nf-co.re) standards and [nf-core/tools](https://nf-co.re/tools) template version 1.14.1.

### `Added`

50 changes: 27 additions & 23 deletions CITATIONS.md
@@ -8,28 +8,20 @@

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
## Pipeline tools
## [nf-test](https://www.biorxiv.org/content/10.1101/2024.05.25.595877v1)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
> L. Forer, S. Schönherr Improving the Reliability and Quality of Nextflow Pipelines with nf-test. bioRxiv 2024.05.25.595877; doi: 10.1101/2024.05.25.595877
> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
## Pipeline tools

- [Entrez](https://pubmed.ncbi.nlm.nih.gov/15608257/)

> Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D54-8. doi: 10.1093/nar/gki031. Update in: Nucleic Acids Res. 2007 Jan;35(Database issue):D26-31. PMID: 15608257; PMCID: PMC539985.
- [Prodigal](https://pubmed.ncbi.nlm.nih.gov/20211023/)

> Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119. PMID: 20211023; PMCID: PMC2848648.
- [Epytope](https://academic.oup.com/bioinformatics/article/32/13/2044/1743767)

> Schubert, B., Walzer, M., Brachvogel, H-P., Sozolek, A., Mohr, C., and Kohlbacher, O. (2016). FRED 2 - An Immunoinformatics Framework for Python. Bioinformatics 2016; doi: 10.1093/bioinformatics/btw113
- [SYFPEITHI](https://pubmed.ncbi.nlm.nih.gov/10602881/)

> Hans-Georg Rammensee, Jutta Bachmann, Niels Nikolaus Emmerich, Oskar Alexander Bachor, Stefan Stevanovic: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics (1999) 50: 213-219
- [MHCflurry](https://dx.doi.org/10.1016/j.cels.2018.05.014)

> Timothy J. O’Donnell, Alex Rubinsteyn, Maria Bonsack, Angelika B. Riemer, Uri Laserson, Jeff Hammerbacher. MHC flurry: open-source class I MHC binding affinity prediction. Cell systems 7(1), 129-132 (2018). doi: 10.1016/j.cels.2018.05.014.
Expand All @@ -38,8 +30,20 @@

> Xiaoshan M. Shao, Rohit Bhattacharya, Justin Huang, I.K. Ashok Sivakumar, Collin Tokheim, Lily Zheng, Dylan Hirsch, Benjamin Kaminow, Ashton Omdahl, Maria Bonsack, Angelika B. Riemer, Victor E. Velculescu, Valsamo Anagnostou, Kymberleigh A. Pagel and Rachel Karchin. High-throughput prediction of MHC class i and ii neoantigens with MHCnuggets. Cancer Immunology Research 8(3), 396-408 (2020). doi: 10.1158/2326-6066.CIR-19-0464.
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- [pigz](https://zlib.net/pigz/)

- [Prodigal](https://pubmed.ncbi.nlm.nih.gov/20211023/)

> Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119. PMID: 20211023; PMCID: PMC2848648.
- [SYFPEITHI](https://pubmed.ncbi.nlm.nih.gov/10602881/)

> Hans-Georg Rammensee, Jutta Bachmann, Niels Nikolaus Emmerich, Oskar Alexander Bachor, Stefan Stevanovic: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics (1999) 50: 213-219
## Python Packages

- [Python](https://www.python.org/)
Expand All @@ -50,35 +54,31 @@

> Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423. https://doi.org/10.1093/bioinformatics/btp163.
- [pandas](https://doi.org/10.5281/zenodo.3509134)

> The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.0.3). Zenodo. https://doi.org/10.5281/zenodo.8092754
- [numpy](https://www.nature.com/articles/s41586-020-2649-2)

> Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. https://www.nature.com/articles/s41586-020-2649-2.
- [pandas](https://doi.org/10.5281/zenodo.3509134)

> The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.0.3). Zenodo. https://doi.org/10.5281/zenodo.8092754
## R Packages

- [R](https://www.R-project.org/)

> R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- [ggplot2](https://cran.r-project.org/package=ggplot2)
- [data.table](https://cran.r-project.org/package=data.table)

> H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
> Dowle Matt (2022). data.table: Extension of 'data.frame'.
- [dplyr](https://dplyr.tidyverse.org)

> Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
- [data.table](https://cran.r-project.org/package=data.table)

> Dowle Matt (2022). data.table: Extension of 'data.frame'.
- [stringr](https://stringr.tidyverse.org)
- [ggplot2](https://cran.r-project.org/package=ggplot2)

> Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org, https://github.com/tidyverse/stringr.
> H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
- [ggpubr](https://cran.r-project.org/package=ggpubr)

Expand All @@ -88,6 +88,10 @@

> Trevor L Davis (2022). optparse: Command Line Option Parser.
- [stringr](https://stringr.tidyverse.org)

> Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org, https://github.com/tidyverse/stringr.
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
1 change: 1 addition & 0 deletions bin/assign_entity_weights.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/check_samplesheet_create_tables.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/collect_stats.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/concat_tsv.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
3 changes: 2 additions & 1 deletion bin/download_proteins_entrez.py
@@ -1,6 +1,7 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# for each strain: select largest assembly (for now)
# for each strain: select largest assembly or given specific (for now)

import argparse
import csv
1 change: 1 addition & 0 deletions bin/epytope_predict.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import contextlib
1 change: 1 addition & 0 deletions bin/fasta_to_tsv.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import gzip
1 change: 1 addition & 0 deletions bin/finalize_microbiome_entities.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
34 changes: 17 additions & 17 deletions bin/gen_prediction_chunks.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import os
Expand All @@ -7,7 +8,8 @@
import pandas as pd

####################################################################################################

global cur_chunk
####################################################################################################

def parse_args():
"""Parses the command line arguments specified by the user."""
@@ -80,45 +82,44 @@ def parse_args():
return parser.parse_args()


def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None):
def write_chunks(data, alleles, max_task_per_allele, max_chunk_size, outdir, remainder=False, pbar=None):
"""Takes data in form of a table of peptide_id, peptide_sequence and
identical allele_name values. The data is partitioned into chunks and
written into individual output files, prepended with a comment line (#)
indicating the allele name."""
global cur_chunk

max_chunk_size = args.max_chunk_size

# Dynamically increase the chunk size dependent on the maximum number of allowed processes.
if len(data)/max_chunk_size > max_task_per_allele:
print("WARN: Chunk size is too small and too many chunks are generated. Chunksize is increased to match the maximum number of chunks.")
max_chunk_size = int(len(data)/max_task_per_allele)+1 # Make sure that all peptides end up in chunks

if remainder and len(data) > max_chunk_size:
print("ERROR: Something went wrong!", file=sys.stderr)
print("ERROR: Something went wrong! The remainder is larger than the allowed chunk size.", file=sys.stderr)
sys.exit(1)

allele_name = alleles[alleles["allele_id"] == data.iloc[0].allele_id]["allele_name"].iloc[0]
written = pd.Index([])
for start in range(0, len(data), max_chunk_size):
# if not handling remainder: only write out full chunks here
if remainder or len(data) - start >= max_chunk_size:
with open(os.path.join(args.outdir, "peptides_" + str(cur_chunk).rjust(5, "0") + ".txt"), "w") as outfile:
with open(os.path.join(outdir, "peptides_" + str(globals()["cur_chunk"]).rjust(5, "0") + ".txt"), "w") as outfile:
print(f"#{allele_name}#{data.iloc[0].allele_id}", file=outfile)
write = data.iloc[start : start + max_chunk_size]
written = written.append(data.index[start : start + max_chunk_size])
if pbar:
pbar.update(len(write))
write[["peptide_id", "peptide_sequence"]].to_csv(outfile, sep="\t", index=False)
cur_chunk = cur_chunk + 1
globals()["cur_chunk"] = globals()["cur_chunk"] + 1

# delete chunks that were written out already
data.drop(written, inplace=True)
return data
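The reworked `write_chunks` receives `max_chunk_size` and `outdir` as explicit parameters and grows the chunk size whenever the requested chunking would exceed `max_task_per_allele` tasks. A minimal sketch of that sizing rule and the zero-padded file naming (the helper names here are illustrative, not part of the pipeline):

```python
import math

def plan_chunks(n_items, max_chunk_size, max_task_per_allele):
    """Mirror the check in write_chunks: enlarge the chunk size so that
    at most max_task_per_allele chunks are generated."""
    if n_items / max_chunk_size > max_task_per_allele:
        # +1 makes sure every item still lands in some chunk
        max_chunk_size = n_items // max_task_per_allele + 1
    return max_chunk_size, math.ceil(n_items / max_chunk_size)

def chunk_filename(cur_chunk):
    """Zero-padded output names, e.g. peptides_00003.txt."""
    return "peptides_" + str(cur_chunk).rjust(5, "0") + ".txt"
```

With 1000 peptides, a chunk size of 10 and a 20-task limit, the naive split would produce 100 chunks, so the size is bumped to 51 and exactly 20 chunks result.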

####################################################################################################

try:
def main():


# Parse command line arguments
args = parse_args()

@@ -193,7 +194,7 @@ def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None)
print("\nInfo: proteins_allele_info", flush=True)
proteins_allele_info.info(verbose=False, memory_usage=print_mem)

cur_chunk = 0
globals()["cur_chunk"] = 0
requests = 0
keep = pd.DataFrame()

@@ -232,19 +233,18 @@ def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None)
keep = (
pd.concat([keep, to_predict], ignore_index=True)
.groupby("allele_id", group_keys=False)
.apply(lambda x: write_chunks(x, alleles, max_task_per_allele))
.apply(lambda x: write_chunks(x, alleles, max_task_per_allele, max_chunk_size=args.max_chunk_size, outdir=args.outdir))
)
# use group_keys=False to avoid generation of extra index with "allele_id"

print("Info: keep", flush=True)
keep.info(verbose=False, memory_usage=print_mem)

# Write out remaining peptides
keep.groupby("allele_id", group_keys=False).apply(lambda x: write_chunks(x, alleles, max_task_per_allele, remainder=True))
keep.groupby("allele_id", group_keys=False).apply(lambda x: write_chunks(x, alleles, max_task_per_allele, remainder=True, max_chunk_size=args.max_chunk_size, outdir=args.outdir))

# We're happy if we got here
print(f"All done. Written {requests} peptide prediction requests into {cur_chunk} chunks.")
sys.exit(0)
except KeyboardInterrupt:
print("\nUser aborted.", file=sys.stderr)
sys.exit(1)
print(f"All done. Written {requests} peptide prediction requests into {globals()['cur_chunk']} chunks.")

if __name__ == "__main__":
sys.exit(main())
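The refactor moves the script body into `main()` and shares the chunk counter across calls via `globals()`. An alternative sketch of the same sequential-ID pattern (not what the diff does) uses `itertools.count`, which avoids the mutable global lookup:

```python
import itertools

# module-level counter handed to every chunk-writing call
chunk_ids = itertools.count()

def next_chunk_name():
    """Hand out sequential, zero-padded chunk file names."""
    return f"peptides_{next(chunk_ids):05d}.txt"
```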
1 change: 1 addition & 0 deletions bin/generate_peptides.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import gzip
1 change: 1 addition & 0 deletions bin/generate_protein_and_entity_ids.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# NOTE
# entrez proteins of all microbiome input files already within one file (proteins.entrez.tsv.gz)
1 change: 1 addition & 0 deletions bin/plot_entity_binding_ratios.R
@@ -1,4 +1,5 @@
#!/usr/bin/env Rscript
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

library(optparse)
library(ggplot2)
1 change: 1 addition & 0 deletions bin/plot_score_distribution.R
@@ -1,4 +1,5 @@
#!/usr/bin/env Rscript
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

library(ggplot2)
library(data.table)
1 change: 1 addition & 0 deletions bin/prepare_entity_binding_ratios.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import datetime
1 change: 1 addition & 0 deletions bin/prepare_score_distribution.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import datetime
1 change: 1 addition & 0 deletions bin/show_supported_models.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# This script originates from the nf-core/epitopeprediction pipeline and is modified and refactored for use in nf-core/metapep

2 changes: 1 addition & 1 deletion bin/unify_model_lengths.py
@@ -1,5 +1,5 @@
#!/usr/bin/env python3

# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
2 changes: 1 addition & 1 deletion main.nf
@@ -17,7 +17,7 @@ nextflow.enable.dsl = 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

include { METAPEP } from './workflows/metapep'
include { METAPEP } from './workflows/metapep'
include { PIPELINE_INITIALISATION } from './subworkflows/local/utils_nfcore_metapep_pipeline'
include { PIPELINE_COMPLETION } from './subworkflows/local/utils_nfcore_metapep_pipeline'

6 changes: 3 additions & 3 deletions modules/local/assign_nucl_entity_weights.nf
@@ -12,8 +12,8 @@ process ASSIGN_NUCL_ENTITY_WEIGHTS {
path weights_files

output:
path "microbiomes_entities.nucl.tsv", emit: ch_nucl_microbiomes_entities // entity_name, microbiome_id, entity_weight
path "versions.yml" , emit: versions
path "microbiomes_entities.nucl.tsv", emit: ch_nucl_microbiomes_entities // entity_name, microbiome_id, entity_weight
path "versions.yml" , emit: versions


script:
Expand All @@ -27,7 +27,7 @@ process ASSIGN_NUCL_ENTITY_WEIGHTS {
cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version | sed 's/Python //g')
pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
pandas: \$(python -c "import pandas; print(pandas.__version__)")
END_VERSIONS
"""
}
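The `versions.yml` change swaps the `pkg_resources` lookup (deprecated in recent setuptools) for the module's own `__version__` attribute. A hedged sketch of that convention, with a fallback for modules that do not define one (the helper name is illustrative):

```python
import importlib

def module_version(name):
    """Return a module's __version__ attribute, or '' when absent."""
    mod = importlib.import_module(name)
    return getattr(mod, "__version__", "")
```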