This set of scripts aligns multiple samples of the same species to a reference genome, preprocesses the alignments, and then calls variants. The scripts use the following software, for the given purposes and in the given order:
- `trimgalore` to trim adapters, clip the ends of the reads, and generate fastqc reports
- `bwa mem` for aligning
- `samtools sort -n` for sorting by name
- `samtools fixmate` for fixing mate information
- `samtools sort` for sorting by coordinate
- `samtools markdup` for marking duplicates
- `picard-tools AddOrReplaceReadGroups` for adding and replacing RG tags
- `picard-tools CleanSam` for setting the mapping quality to 0 for reads that are not aligned
- `samtools index` for indexing
- `samtools coverage` for coverage reports
- `bamtools stats` for alignment reports
- `bcftools` for variant calling
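The order above can be sketched for a single paired-end sample as follows. This is a minimal dry-run sketch, not the repository's actual code: the function name `print_pipeline`, the file names, and the read-group values are hypothetical, and the flags shown are common choices rather than necessarily the ones the scripts use. It only prints the commands so they can be reviewed before anything is run. Note that the Trim Galore executable is usually called `trim_galore`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one plausible command sequence for a single sample.
# All names below (print_pipeline, file names, RG values) are hypothetical
# placeholders and are not taken from the repository.

print_pipeline() {
    local sample="$1" ref="$2" threads="${3:-1}"
    # Unquoted heredoc: variables are expanded, but nothing is executed.
    cat <<EOF
trim_galore --paired --fastqc ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz
bwa mem -t ${threads} ${ref} ${sample}_R1_val_1.fq.gz ${sample}_R2_val_2.fq.gz > ${sample}.sam
samtools sort -n -o ${sample}.namesorted.bam ${sample}.sam
samtools fixmate -m ${sample}.namesorted.bam ${sample}.fixmate.bam
samtools sort -o ${sample}.sorted.bam ${sample}.fixmate.bam
samtools markdup ${sample}.sorted.bam ${sample}.markdup.bam
picard-tools AddOrReplaceReadGroups I=${sample}.markdup.bam O=${sample}.rg.bam RGID=${sample} RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=${sample}
picard-tools CleanSam I=${sample}.rg.bam O=${sample}.clean.bam
samtools index ${sample}.clean.bam
samtools coverage ${sample}.clean.bam
bamtools stats -in ${sample}.clean.bam
bcftools mpileup -f ${ref} ${sample}.clean.bam | bcftools call -mv -o ${sample}.vcf
EOF
}
```

For example, `print_pipeline sampleA reference.fa 4` prints the twelve commands for a sample called `sampleA`, using four threads for `bwa mem`.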
git clone https://github.com/evolozzy/NGS-Pipeline.git
- Make a subdirectory named `Data` in the folder containing your scripts and copy your files there, or change the line containing `DATASOURCE` in your `PARAMETERS` file and set it to the folder that contains your data.
- If you have two or more sets of reads to merge, keep them in separate directories in the `Data` directory.
- Make sure you have your reference file.
- Edit the `RGTAGS` file carefully: files belonging to the same sample should have the same SM (sample name).
- Carefully change the `PARAMETERS` file.
  - Set `REFERENCEFILE` to the path to the reference.
  - If you are running on multiple threads, set `THREADS` to the number of cores you want to use.
- Set the directories to be used in the `DIRECTORIES` file.
  - If you're not running the scripts from the directory that contains them, change the line containing `WD` to the path that contains your scripts.
- Install the required software, and set `PROGRAMPATHS`.
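As a rough sketch of the `Data` layout these steps describe: the helper name `make_data_layout` and the subdirectory names `run1` and `run2` are illustrative assumptions, and the scripts may expect different names.

```shell
#!/usr/bin/env bash
# Sketch: create a Data directory with one subdirectory per set of reads.
# make_data_layout and the subdirectory names are hypothetical.

make_data_layout() {
    local scripts_dir="$1"; shift
    local read_set
    # Data lives next to the scripts unless DATASOURCE points elsewhere.
    mkdir -p "${scripts_dir}/Data"
    for read_set in "$@"; do
        mkdir -p "${scripts_dir}/Data/${read_set}"
    done
}
```

For example, `make_data_layout NGS-Pipeline run1 run2` creates `NGS-Pipeline/Data/run1` and `NGS-Pipeline/Data/run2`, into which the read files for each set can then be copied.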
Inside the folder:
./runall.sh
Or outside the folder:
/path/to/scripts/runall.sh
If you encounter any errors during the process and want to clean all the files created by the script:
./resetanalysis.sh
- Before running `runall.sh`, use `trimall.sh` to quality-control the trimming process. Check the fastqc reports after trimming and set `PARAMETERS` accordingly.
- Make sure that the core numbers are set properly. Try to run more jobs in parallel, although the best setting depends on the number of files. For low numbers of files
- The script
  - checks if the files are in place
  - checks if the software is installed
  - calculates a good way to use the available cores
  - builds references from the reference file
- Trimming is done with `trimgalore`.
- Aligning is done with `bwa`.
- Preprocessing is done with `samtools` and `picard-tools`.
  - First, the files are sorted by name and mate info is fixed.
  - Second, the files are sorted by coordinate and duplicates are marked.
  - Third, the files are cleaned of reads that were not aligned.
  - Last, RG tags are added.
- Variants are called with `bcftools`.
- The intermediate files can be kept, deleted, or archived to another location.
- The code also generates reports of trimming (fastqc reports), alignment, and coverage.
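
The pre-flight checks and core allocation mentioned in the list above can be sketched along these lines. This is an illustration under stated assumptions, not the repository's code: the function names `check_tools`, `check_file`, and `split_cores`, and the rule of giving each parallel job an equal share of the cores, are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; names and heuristics are illustrative only.

# Report every program that is not on PATH; return non-zero if any is missing.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Succeed only if the given file exists and is not empty.
check_file() {
    [ -s "$1" ]
}

# Split CORES across FILES: run at most one job per file and give each job
# an equal share of the cores (at least 1). Prints "jobs threads_per_job".
split_cores() {
    local cores="$1" files="$2" jobs threads
    jobs=$(( files < cores ? files : cores ))
    if [ "$jobs" -lt 1 ]; then jobs=1; fi
    threads=$(( cores / jobs ))
    if [ "$threads" -lt 1 ]; then threads=1; fi
    echo "$jobs $threads"
}
```

With this heuristic, `split_cores 8 3` prints `3 2` (three jobs with two threads each), and `split_cores 4 10` prints `4 1`. A call such as `check_tools bwa samtools bcftools` lists any of those programs that cannot be found.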
Note: I haven't developed this project for some time, but I have plans to convert it to a snakemake pipeline when I have some extra time. If you need help with it, send me an email.