Merge pull request #140 from tillenglert/release1.0.0_review_changes
Release1.0.0 review changes
tillenglert authored Oct 9, 2024
2 parents 68258d0 + 954de99 commit 65372b7
Showing 46 changed files with 177 additions and 184 deletions.
14 changes: 7 additions & 7 deletions .github/workflows/ci.yml
@@ -10,7 +10,7 @@ on:

env:
NXF_ANSI_LOG: false
NFTEST_VER: "0.8.4"
NFT_VER: "0.8.4"

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -50,9 +50,9 @@ jobs:
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
uses: nf-core/setup-nf-test@v1
with:
version: ${{ env.NFT_VER }}

- name: Run nf-test
run: |
@@ -101,9 +101,9 @@ jobs:
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
uses: nf-core/setup-nf-test@v1
with:
version: ${{ env.NFT_VER }}

- name: Run nf-test
env:
4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -3,9 +3,9 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.0.0 - [2022-01-20]
## v1.0.0 - nf-core/metapep "Golden Megalodon" - [2022-01-20]

First release of [nf-core/metapep](https://nf-co.re/metapep), created based on [nf-core](https://nf-co.re) standards and [nf-core/tools](https://nf-co.re/tools) template version 1.14.1.
First release of [nf-core/metapep](https://nf-co.re/metapep), based on [nf-core](https://nf-co.re) standards and [nf-core/tools](https://nf-co.re/tools) template version 1.14.1.

### `Added`

50 changes: 27 additions & 23 deletions CITATIONS.md
@@ -8,28 +8,20 @@

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
## Pipeline tools
## [nf-test](https://www.biorxiv.org/content/10.1101/2024.05.25.595877v1)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
> L. Forer, S. Schönherr Improving the Reliability and Quality of Nextflow Pipelines with nf-test. bioRxiv 2024.05.25.595877; doi: 10.1101/2024.05.25.595877
> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
## Pipeline tools

- [Entrez](https://pubmed.ncbi.nlm.nih.gov/15608257/)

> Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D54-8. doi: 10.1093/nar/gki031. Update in: Nucleic Acids Res. 2007 Jan;35(Database issue):D26-31. PMID: 15608257; PMCID: PMC539985.
- [Prodigal](https://pubmed.ncbi.nlm.nih.gov/20211023/)

> Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119. PMID: 20211023; PMCID: PMC2848648.
- [Epytope](https://academic.oup.com/bioinformatics/article/32/13/2044/1743767)

> Schubert, B., Walzer, M., Brachvogel, H-P., Sozolek, A., Mohr, C., and Kohlbacher, O. (2016). FRED 2 - An Immunoinformatics Framework for Python. Bioinformatics 2016; doi: 10.1093/bioinformatics/btw113
- [SYFPEITHI](https://pubmed.ncbi.nlm.nih.gov/10602881/)

> Hans-Georg Rammensee, Jutta Bachmann, Niels Nikolaus Emmerich, Oskar Alexander Bachor, Stefan Stevanovic: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics (1999) 50: 213-219
- [MHCflurry](https://dx.doi.org/10.1016/j.cels.2018.05.014)

> Timothy J. O’Donnell, Alex Rubinsteyn, Maria Bonsack, Angelika B. Riemer, Uri Laserson, Jeff Hammerbacher. MHC flurry: open-source class I MHC binding affinity prediction. Cell systems 7(1), 129-132 (2018). doi: 10.1016/j.cels.2018.05.014.
Expand All @@ -38,8 +30,20 @@

> Xiaoshan M. Shao, Rohit Bhattacharya, Justin Huang, I.K. Ashok Sivakumar, Collin Tokheim, Lily Zheng, Dylan Hirsch, Benjamin Kaminow, Ashton Omdahl, Maria Bonsack, Angelika B. Riemer, Victor E. Velculescu, Valsamo Anagnostou, Kymberleigh A. Pagel and Rachel Karchin. High-throughput prediction of MHC class i and ii neoantigens with MHCnuggets. Cancer Immunology Research 8(3), 396-408 (2020). doi: 10.1158/2326-6066.CIR-19-0464.
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- [pigz](https://zlib.net/pigz/)

- [Prodigal](https://pubmed.ncbi.nlm.nih.gov/20211023/)

> Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119. PMID: 20211023; PMCID: PMC2848648.
- [SYFPEITHI](https://pubmed.ncbi.nlm.nih.gov/10602881/)

> Hans-Georg Rammensee, Jutta Bachmann, Niels Nikolaus Emmerich, Oskar Alexander Bachor, Stefan Stevanovic: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics (1999) 50: 213-219
## Python Packages

- [Python](https://www.python.org/)
Expand All @@ -50,35 +54,31 @@

> Cock PA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B and de Hoon MJL (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423. https://doi.org/10.1093/bioinformatics/btp163.
- [pandas](https://doi.org/10.5281/zenodo.3509134)

> The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.0.3). Zenodo. https://doi.org/10.5281/zenodo.8092754
- [numpy](https://www.nature.com/articles/s41586-020-2649-2)

> Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: 10.1038/s41586-020-2649-2. https://www.nature.com/articles/s41586-020-2649-2.
- [pandas](https://doi.org/10.5281/zenodo.3509134)

> The pandas development team. (2023). pandas-dev/pandas: Pandas (v2.0.3). Zenodo. https://doi.org/10.5281/zenodo.8092754
## R Packages

- [R](https://www.R-project.org/)

> R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- [ggplot2](https://cran.r-project.org/package=ggplot2)
- [data.table](https://cran.r-project.org/package=data.table)

> H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
> Dowle Matt (2022). data.table: Extension of 'data.frame'.
- [dplyr](https://dplyr.tidyverse.org)

> Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
- [data.table](https://cran.r-project.org/package=data.table)

> Dowle Matt (2022). data.table: Extension of 'data.frame'.
- [stringr](https://stringr.tidyverse.org)
- [ggplot2](https://cran.r-project.org/package=ggplot2)

> Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org, https://github.com/tidyverse/stringr.
> H. Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
- [ggpubr](https://cran.r-project.org/package=ggpubr)

Expand All @@ -88,6 +88,10 @@

> Trevor L Davis (2022). optparse: Command Line Option Parser.
- [stringr](https://stringr.tidyverse.org)

> Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org, https://github.com/tidyverse/stringr.
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
1 change: 1 addition & 0 deletions bin/assign_entity_weights.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/check_samplesheet_create_tables.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/collect_stats.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
1 change: 1 addition & 0 deletions bin/concat_tsv.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
3 changes: 2 additions & 1 deletion bin/download_proteins_entrez.py
@@ -1,6 +1,7 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# for each strain: select largest assembly (for now)
# for each strain: select largest assembly or given specific (for now)

import argparse
import csv
1 change: 1 addition & 0 deletions bin/epytope_predict.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import contextlib
1 change: 1 addition & 0 deletions bin/fasta_to_tsv.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import gzip
1 change: 1 addition & 0 deletions bin/finalize_microbiome_entities.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
34 changes: 17 additions & 17 deletions bin/gen_prediction_chunks.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import os
Expand All @@ -7,7 +8,8 @@
import pandas as pd

####################################################################################################

global cur_chunk
####################################################################################################

def parse_args():
"""Parses the command line arguments specified by the user."""
@@ -80,45 +82,44 @@ def parse_args():
return parser.parse_args()


def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None):
def write_chunks(data, alleles, max_task_per_allele, max_chunk_size, outdir, remainder=False, pbar=None):
"""Takes data in form of a table of peptide_id, peptide_sequence and
identical allele_name values. The data is partitioned into chunks and
written into individual output files, prepended with a comment line (#)
indicating the allele name."""
global cur_chunk

max_chunk_size = args.max_chunk_size

# Dynamically increase the chunk size dependent on the maximum number of allowed processes.
if len(data)/max_chunk_size > max_task_per_allele:
print("WARN: Chunk size is too small and too many chunks are generated. Chunksize is increased to match the maximum number of chunks.")
max_chunk_size = int(len(data)/max_task_per_allele)+1 # Make sure that all peptides end up in chunks

if remainder and len(data) > max_chunk_size:
print("ERROR: Something went wrong!", file=sys.stderr)
print("ERROR: Something went wrong! The remainder is larger than the allowed chunk size.", file=sys.stderr)
sys.exit(1)

allele_name = alleles[alleles["allele_id"] == data.iloc[0].allele_id]["allele_name"].iloc[0]
written = pd.Index([])
for start in range(0, len(data), max_chunk_size):
# if not handling remainder: only write out full chunks here
if remainder or len(data) - start >= max_chunk_size:
with open(os.path.join(args.outdir, "peptides_" + str(cur_chunk).rjust(5, "0") + ".txt"), "w") as outfile:
with open(os.path.join(outdir, "peptides_" + str(globals()["cur_chunk"]).rjust(5, "0") + ".txt"), "w") as outfile:
print(f"#{allele_name}#{data.iloc[0].allele_id}", file=outfile)
write = data.iloc[start : start + max_chunk_size]
written = written.append(data.index[start : start + max_chunk_size])
if pbar:
pbar.update(len(write))
write[["peptide_id", "peptide_sequence"]].to_csv(outfile, sep="\t", index=False)
cur_chunk = cur_chunk + 1
globals()["cur_chunk"] = globals()["cur_chunk"] + 1

# delete chunks that were written out already
data.drop(written, inplace=True)
return data
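The reworked `write_chunks` receives `max_chunk_size` and `outdir` as explicit parameters and grows the chunk size whenever the requested chunking would exceed `max_task_per_allele` tasks. A minimal sketch of that sizing rule and the zero-padded file naming (the helper names here are illustrative, not part of the pipeline):

```python
import math

def plan_chunks(n_items, max_chunk_size, max_task_per_allele):
    """Mirror the check in write_chunks: enlarge the chunk size so that
    at most max_task_per_allele chunks are generated."""
    if n_items / max_chunk_size > max_task_per_allele:
        # +1 makes sure every item still lands in some chunk
        max_chunk_size = n_items // max_task_per_allele + 1
    return max_chunk_size, math.ceil(n_items / max_chunk_size)

def chunk_filename(cur_chunk):
    """Zero-padded output names, e.g. peptides_00003.txt."""
    return "peptides_" + str(cur_chunk).rjust(5, "0") + ".txt"
```

With 1000 peptides, a chunk size of 10 and a 20-task limit, the naive split would produce 100 chunks, so the size is bumped to 51 and exactly 20 chunks result.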

####################################################################################################

try:
def main():


# Parse command line arguments
args = parse_args()

@@ -193,7 +194,7 @@ def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None)
print("\nInfo: proteins_allele_info", flush=True)
proteins_allele_info.info(verbose=False, memory_usage=print_mem)

cur_chunk = 0
globals()["cur_chunk"] = 0
requests = 0
keep = pd.DataFrame()

@@ -232,19 +233,18 @@ def write_chunks(data, alleles, max_task_per_allele, remainder=False, pbar=None)
keep = (
pd.concat([keep, to_predict], ignore_index=True)
.groupby("allele_id", group_keys=False)
.apply(lambda x: write_chunks(x, alleles, max_task_per_allele))
.apply(lambda x: write_chunks(x, alleles, max_task_per_allele, max_chunk_size=args.max_chunk_size, outdir=args.outdir))
)
# use group_keys=False to avoid generation of extra index with "allele_id"

print("Info: keep", flush=True)
keep.info(verbose=False, memory_usage=print_mem)

# Write out remaining peptides
keep.groupby("allele_id", group_keys=False).apply(lambda x: write_chunks(x, alleles, max_task_per_allele, remainder=True))
keep.groupby("allele_id", group_keys=False).apply(lambda x: write_chunks(x, alleles, max_task_per_allele, remainder=True, max_chunk_size=args.max_chunk_size, outdir=args.outdir))

# We're happy if we got here
print(f"All done. Written {requests} peptide prediction requests into {cur_chunk} chunks.")
sys.exit(0)
except KeyboardInterrupt:
print("\nUser aborted.", file=sys.stderr)
sys.exit(1)
print(f"All done. Written {requests} peptide prediction requests into {globals()['cur_chunk']} chunks.")

if __name__ == "__main__":
sys.exit(main())
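The refactor moves the script body into `main()` and shares the chunk counter across calls via `globals()`. An alternative sketch of the same sequential-ID pattern (not what the diff does) uses `itertools.count`, which avoids the mutable global lookup:

```python
import itertools

# module-level counter handed to every chunk-writing call
chunk_ids = itertools.count()

def next_chunk_name():
    """Hand out sequential, zero-padded chunk file names."""
    return f"peptides_{next(chunk_ids):05d}.txt"
```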
1 change: 1 addition & 0 deletions bin/generate_peptides.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import gzip
1 change: 1 addition & 0 deletions bin/generate_protein_and_entity_ids.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# NOTE
# entrez proteins of all microbiome input files already within one file (proteins.entrez.tsv.gz)
1 change: 1 addition & 0 deletions bin/plot_entity_binding_ratios.R
@@ -1,4 +1,5 @@
#!/usr/bin/env Rscript
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

library(optparse)
library(ggplot2)
1 change: 1 addition & 0 deletions bin/plot_score_distribution.R
@@ -1,4 +1,5 @@
#!/usr/bin/env Rscript
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

library(ggplot2)
library(data.table)
1 change: 1 addition & 0 deletions bin/prepare_entity_binding_ratios.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import datetime
1 change: 1 addition & 0 deletions bin/prepare_score_distribution.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import datetime
1 change: 1 addition & 0 deletions bin/show_supported_models.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

# This script originates from the nf-core/epitopeprediction pipeline and is modified and refactored for use in nf-core/metapep

2 changes: 1 addition & 1 deletion bin/unify_model_lengths.py
@@ -1,5 +1,5 @@
#!/usr/bin/env python3

# Written by Sabrina Krakau, Leon Kuchenbecker, and Till Englert under the MIT license

import argparse
import sys
2 changes: 1 addition & 1 deletion main.nf
@@ -17,7 +17,7 @@ nextflow.enable.dsl = 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

include { METAPEP } from './workflows/metapep'
include { METAPEP } from './workflows/metapep'
include { PIPELINE_INITIALISATION } from './subworkflows/local/utils_nfcore_metapep_pipeline'
include { PIPELINE_COMPLETION } from './subworkflows/local/utils_nfcore_metapep_pipeline'

6 changes: 3 additions & 3 deletions modules/local/assign_nucl_entity_weights.nf
@@ -12,8 +12,8 @@ process ASSIGN_NUCL_ENTITY_WEIGHTS {
path weights_files

output:
path "microbiomes_entities.nucl.tsv", emit: ch_nucl_microbiomes_entities // entity_name, microbiome_id, entity_weight
path "versions.yml" , emit: versions
path "microbiomes_entities.nucl.tsv", emit: ch_nucl_microbiomes_entities // entity_name, microbiome_id, entity_weight
path "versions.yml" , emit: versions


script:
Expand All @@ -27,7 +27,7 @@ process ASSIGN_NUCL_ENTITY_WEIGHTS {
cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version | sed 's/Python //g')
pandas: \$(python -c "import pkg_resources; print(pkg_resources.get_distribution('pandas').version)")
pandas: \$(python -c "import pandas; print(pandas.__version__)")
END_VERSIONS
"""
}
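The `versions.yml` change swaps the `pkg_resources` lookup (deprecated in recent setuptools) for the module's own `__version__` attribute. A hedged sketch of that convention, with a fallback for modules that do not define one (the helper name is illustrative):

```python
import importlib

def module_version(name):
    """Return a module's __version__ attribute, or '' when absent."""
    mod = importlib.import_module(name)
    return getattr(mod, "__version__", "")
```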