CZID mNGS Pipeline Stage #1: Host Filtering and QC


The IDseq pipeline supports Illumina sequencing reads from single- or paired-end sequencing runs. The required input format is .fastq. Prior to the host filtering procedure, raw .fastq input is validated.

The sequence of steps referred to collectively as "Host Filtering and QC" are as follows:

  1. Initial host filtration using STAR
  2. Trim sequencing adapters using Trimmomatic
  3. Quality filter using PriceSeq
  4. Identify duplicate reads using idseq-dedup (note: prior to pipeline version 6.0, IDseq used CD-HIT-DUP for duplicate identification)
  5. Filter out low complexity sequences using LZW
  6. Filter out remaining host sequences using Bowtie2
  7. Subsampling to 1 million fragments (reads/read-pairs) if > 1M remain after step (6)
  8. Filter out human sequences, regardless of host (using STAR, Bowtie2, and GSNAP)

The default host filtering databases for a human host can be found here

Alignment of reads to NCBI NR and NT databases is computationally intensive. In many metagenomic NGS samples for pathogen identification, the majority of sequencing reads belong to the host from which the sample was taken. Therefore, to reduce the total input to downstream microbial alignment steps, rapid host filtration and QC steps are applied.

Several metrics are collected during host filtering.

  • The "Passed QC" percentage exclusively refers to steps (2) & (3). It refers to the fraction of reads that came out of PriceSeq, step (3), compared to what went in at step (1).
  • The "Passed Filters" percentage refers to the fraction of reads that came out of step (8) compared to what went in at step (1). For example, any reads that got collapsed as duplicates will look as if they had been removed during "host filtering", even though they DIDN'T map to the host genome. Same for anything that got weeded out in the LZW filter steps.
  • You can have complete visibility into the numbers of reads remaining after each step by navigating to the "Results folder" in the sample Details tab.
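
As an illustration, the two percentages can be computed from per-step read counts like this (a minimal Python sketch; the counts below are hypothetical):

# Hypothetical fragment counts surviving each stage of host filtering.
counts = {
    "star_in": 1_000_000,       # input to step (1), STAR
    "trimmomatic_in": 400_000,  # output of step (1) = input to step (2)
    "priceseq_out": 350_000,    # output of step (3), PriceSeq
    "final_out": 60_000,        # output of step (8), final human removal
}

# "Passed QC": PriceSeq output relative to the reads entering the QC steps.
passed_qc = counts["priceseq_out"] / counts["trimmomatic_in"]

# "Passed Filters": final output relative to the original input to step (1).
passed_filters = counts["final_out"] / counts["star_in"]

print(f"Passed QC: {passed_qc:.1%}")            # 87.5%
print(f"Passed Filters: {passed_filters:.1%}")  # 6.0%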

Generating host removal databases

There are two alignment steps for host removal - (1) uses STAR and (6) uses Bowtie2. Each of these steps requires a pre-generated host database. New “host” organisms may be added by the IDseq team upon request. The currently available hosts are listed below, along with links to the raw .fasta file used to generate the STAR and Bowtie2 databases.

Input Fasta Files

  • Human: hg38
  • Mosquito: multiple genome and mitochondrial genome assemblies were used, along with a Drosophila genome assembly.
  • Tick:
  • ERCC only:
  • Mouse: mm10
  • Cat:
  • Pig: pig genome
  • C. elegans:

The process for generating host removal databases is as follows.

Build the STAR database:

STAR
--runThreadN {cpu_count}
--runMode genomeGenerate
--sjdbGTFfile {gtf_command_part}
--genomeDir {star_genome_part_dir}
--genomeFastaFiles {fasta_file_list[i]}

note: the --sjdbGTFfile option and the .gtf file are only used for database generation in cases where this file is available.

ERCCs in the STAR database: The External RNA Control Consortium (ERCC) developed a set of spike-in controls that can be used to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts (manuscript available here). Quantification of ERCCs is done in the first host removal pipeline step, using STAR. The ERCC reference sequences are appended to the host reference .fasta file and a .gtf file is used to enable quantification of ERCC abundance via the built-in STAR gene-count capabilities.
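
For example, ERCC read counts can be pulled out of STAR's gene-count output with a few lines of Python (a minimal sketch; the file name and the "ERCC-" ID prefix are assumptions based on the standard STAR GeneCounts table format and ERCC naming):

# Extract ERCC spike-in counts from a STAR gene-count table.
# Column 1 is the sequence/gene ID; column 2 is the unstranded count.
ercc_counts = {}
with open("reads_per_gene.tab") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("ERCC-"):
            ercc_counts[fields[0]] = int(fields[1])

print(f"{len(ercc_counts)} ERCC sequences, {sum(ercc_counts.values())} reads")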

Build the Bowtie2 database:

bowtie2-build {fasta_file} {host_name}

Build the GSNAP database for final human removal: This was generated once, using the following command:

# gsnap version 2017-08-25
gmap_build 
-d hg38_pantro5_k16 
-D ./gsnap 
-k 16 
hg38_pantro5.fa

Pipeline Input

RunValidateInput

Input: raw .fastq files

Process: Validates that the input files are in .fastq format and truncates them to 75 million fragments (specifically, 75 million reads for single-end libraries or 150 million reads for paired-end libraries). The validation process counts the number of sequences that fall into specified length buckets, which informs the parameters used downstream for initial host removal by STAR (a sketch of the bucket counting follows the output list below).

Output:

  • validate_input1.fastq
  • validate_input2.fastq
  • validate_input_summary.json: {"<50": 0, "50-500": 93621550, "500-10000": 0, "10000+": 0}
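
A minimal Python sketch of how such a bucket summary can be produced (the bucket boundaries are taken from the example above; which side of each boundary is inclusive is an assumption):

import json

buckets = {"<50": 0, "50-500": 0, "500-10000": 0, "10000+": 0}
with open("validate_input1.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
            n = len(line.strip())
            if n < 50:
                buckets["<50"] += 1
            elif n <= 500:
                buckets["50-500"] += 1
            elif n <= 10000:
                buckets["500-10000"] += 1
            else:
                buckets["10000+"] += 1

print(json.dumps(buckets))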

Host Filtering Procedure

Step 1: RunStar

Reasoning: The STAR aligner is used for rapid first-pass host filtration. Unmapped reads are passed to the subsequent step. The current implementation of STAR will fail to remove host sequences that map to multiple regions; these are instead filtered out by the subsequent Bowtie2 host filtration step.

Input:

  • validate_input1.fastq
  • validate_input2.fastq

Process: Different parameters are required for alignment of short vs long reads using STAR. Therefore, based on the initial input validation, the appropriate parameters are selected.

If short reads:

STAR 
--outFilterMultimapNmax 99999 
--outFilterScoreMinOverLread 0.5 
--outFilterMatchNminOverLread 0.5
--outReadsUnmapped Fastx
--outFilterMismatchNmax 999
--outSAMmode None 
--clip3pNbases 0
--runThreadN {cpus}
--genomeDir {genome_dir}
--readFilesIn {input files}

If long reads (specifically, if there is more than 1 read with length greater than READ_LEN_CUTOFF_HIGH, as determined during the input validation step):

STARlong 
--outFilterMultimapNmax 99999 
--outFilterScoreMinOverLread 0.5 
--outFilterMatchNminOverLread 0.5
--outReadsUnmapped Fastx
--outFilterMismatchNmax 999
--outSAMmode None 
--clip3pNbases 0
--runThreadN {cpus}
--genomeDir {genome_dir}
--readFilesIn {input files}
--seedSearchStartLmax 20
--seedPerReadNmax 100000
--seedPerWindowNmax 1000
--alignTranscriptsPerReadNmax 100000

STAR documentation can be found here
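
The choice between STAR and STARlong can be sketched from the validation summary (a minimal Python sketch, assuming READ_LEN_CUTOFF_HIGH corresponds to the 500 bp bucket boundary shown earlier; the pipeline's actual constant may differ):

import json

with open("validate_input_summary.json") as fh:
    summary = json.load(fh)

# Reads longer than the cutoff fall into the two upper length buckets.
long_reads = summary.get("500-10000", 0) + summary.get("10000+", 0)
star_executable = "STARlong" if long_reads > 1 else "STAR"
print(star_executable)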

Output:

  • unmapped1.fastq
  • unmapped2.fastq
  • reads_per_gene.tab (only available to project owner, on request)

Step 2: RunTrimmomatic

Reasoning: Remove adapter sequences

Input:

  • unmapped1.fastq
  • unmapped2.fastq

Process:

java -jar /usr/local/bin/trimmomatic-0.38.jar 
PE|SE 
-phred33 
[input_files] 
[output_files] 
ILLUMINACLIP:{adapter_fasta}:2:30:10:8:true
MINLEN:35

Where the adapter_fasta for single-end reads is: illumina_TruSeq3-SE.fasta

>TruSeq3_IndexedAdapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>TruSeq3_UniversalAdapter
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

And for paired-end reads: illumina_TruSeq3-PE-2_NexteraPE-PE.fasta

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PE1_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PE2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE2_rc
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>PrefixNX/1
AGATGTGTATAAGAGACAG
>PrefixNX/2
AGATGTGTATAAGAGACAG
>Trans1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Trans1_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Trans2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>Trans2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC

Output:

  • trimmomatic1.fastq
  • trimmomatic2.fastq

note: the output reads at this step include full-length reads that have no adapter, plus reads where the adapter has been identified and lopped off, leaving a shorter read (but not shorter than 35nt). In cases where the insert size is small, resulting in adapter read-through, R2 will be a direct reverse complement of R1; the "true" parameter enables these reads to be saved for downstream analysis.
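
To illustrate the read-through case, R2 being a direct reverse complement of R1 looks like this (a minimal Python sketch; the sequences are made up):

# With a small insert, sequencing reads through into the adapter and,
# after trimming, R2 is the reverse complement of R1.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

r1 = "ACGTTGCAACGTAGCT"
r2 = reverse_complement(r1)  # what R2 looks like after read-through
assert reverse_complement(r2) == r1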

Step 3: RunPriceSeq

Reasoning: Remove low-quality reads to reduce their impact on downstream alignment and analysis.

Input:

  • trimmomatic1.fastq
  • trimmomatic2.fastq

Process: The default input is a .fastq file, which enables the use of the -rqf parameter for filtering on base quality.

PriceSeqFilter 
-a 12 
-rnf 90 
-log c 
-fp {input files}
-op {output files} 
-rqf 85 0.98

Per the PriceSeq Documentation, available here, this command uses 12 threads to

  1. filter out pairs of reads if either has an unacceptably high number of uncalled nucleotides (N's), requiring that 90% of nucleotides are called per read.
  2. filter out sequences with an unacceptably high number of low-quality nucleotides, as defined by the provided quality score, requiring that 85% of nucleotides have a probability of being correct of greater than 0.98.
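
For reference, the 0.98 correctness cutoff corresponds to roughly Phred Q17 and above (a quick Python sketch of the standard Phred relationship; how PriceSeqFilter rounds at the boundary is not specified here):

# Probability that a base call is correct, given its Phred quality score Q.
def phred_to_p_correct(q: int) -> float:
    return 1.0 - 10 ** (-q / 10.0)

for q in (16, 17, 20, 30):
    print(q, phred_to_p_correct(q))
# Q16 (~0.9749) fails the 0.98 cutoff; Q17 (~0.98005) and above pass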

Output:

  • priceseq1.fa
  • priceseq2.fa

note: These output files are now in .fasta format, as opposed to the previous .fastq format.

Step 4: RunCDHitDup

Reasoning: Remove / identify duplicate reads. In IDseq versions prior to v4.0, cd-hit-dup was used to remove duplicates. Since IDseq v4.0, duplicates are only identified (which allows computing the duplicate compression ratio, DCR) but are not removed from the analysis. In particular, cd-hit-dup was used for duplicate identification prior to pipeline v6.0, and idseq-dedup is used in pipeline v6.0 and later.

Input:

  • priceseq1.fa
  • priceseq2.fa

Process:

Prior to IDseq v4.0

cd-hit-dup 
-i {input_fasta}
-o {output_fasta}
-e 0.05
-u 70

Per the CDHit Documentation, available here, this command uses a 0.05 threshold for the number of mismatches, indicating that if two reads are > 95% similar they will be considered duplicates. It uses only the first/last 70 bases of each read to assess sequence similarity.

After IDseq v4.0

cd-hit-dup 
-i {input_fasta}
-o {output_fasta}
-e 0.0
-u 70

Per the CDHit Documentation, available here, this command requires identical matches for reads to be considered duplicates. It uses only the first/last 70 bases of each read to assess sequence similarity.

After IDseq v6.0

idseq-dedup 
-l 70 
-i my-fasta.fasta 
-o my-deduped-fasta.fasta

idseq-dedup is configured to replicate the above behavior, using the first/last 70 bases of each read. A sketch of this prefix-based approach follows.
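
A minimal Python sketch of prefix-keyed duplicate identification (assuming duplicates are reads whose first 70 bases match exactly; the real tools additionally consider the read end and paired-end mates):

PREFIX_LEN = 70

def find_duplicates(reads):
    """Map each duplicate read ID to the ID of its representative.
    reads: iterable of (read_id, sequence) tuples."""
    seen = {}        # 70-base prefix -> first read ID observed with it
    duplicates = {}  # duplicate read ID -> representative read ID
    for read_id, seq in reads:
        key = seq[:PREFIX_LEN]
        if key in seen:
            duplicates[read_id] = seen[key]
        else:
            seen[key] = read_id
    return duplicates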

Output:

  • dedup1.fa
  • dedup2.fa

Step 5: RunLZW

Reasoning: Remove low-complexity reads to mitigate challenges in aligning repetitive sequences.

Input:

  • dedup1.fa
  • dedup2.fa

Process: LZW refers to the Lempel-Ziv-Welch (LZW) algorithm, which provides lossless data compression. The ratio of the LZW-compressed sequence length to the original sequence length is computed for each read. Reads with a compression ratio greater than the specified threshold of 0.45 are retained. In the case where all reads are filtered out by this threshold, a reduced threshold of 0.42 is used. To avoid penalizing long reads, the raw LZW score of any read longer than 150 basepairs is multiplied by an adjustment heuristic, (1 + (seq_length - 150) / 1000). A sketch of the score follows.
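
A minimal Python sketch of the score (the exact tokenization used by the pipeline may differ):

def lzw_ratio(seq: str) -> float:
    """Ratio of LZW token count to read length, adjusted for long reads."""
    dictionary = {c: True for c in set(seq)}  # seed with single characters
    word, tokens = "", 0
    for c in seq:
        if word + c in dictionary:
            word += c  # keep extending the current dictionary match
        else:
            tokens += 1
            dictionary[word + c] = True  # learn the new substring
            word = c
    if word:
        tokens += 1
    ratio = tokens / len(seq)
    if len(seq) > 150:  # scale up so long reads are not penalized
        ratio *= 1 + (len(seq) - 150) / 1000
    return ratio

# Repetitive reads compress well and score low, so they are filtered out.
print(lzw_ratio("AT" * 75))  # low-complexity read: scores far below 0.45
print(lzw_ratio("ACGGTACGTTAGCCTAGACGTTCAGTCAAGGATCGT"))  # scores higher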

Output:

  • lzw1.fa
  • lzw2.fa

Step 6: RunBowtie2

Reasoning: While STAR provides an initial, rapid, removal of host sequences, it is possible that some host sequences will make it past this step. Therefore, to improve filter sensitivity, an additional host removal step is performed using Bowtie2 to clean up any remaining sequences.

Input:

  • lzw1.fa
  • lzw2.fa

Process: Remove remaining host reads.

bowtie2
-q 
-x {genome_basename}
-f
--very-sensitive-local
-S {output_sam_file}
--seed random_seed
-p {number_of_cpus}
-1 [input R1]
-2 [input R2] 

If the input is single-end, the -U [input reads] argument is used in place of -1 and -2. Bowtie2 documentation can be found here

Output:

  • bowtie2_1.fa
  • bowtie2_2.fa

Step 7: RunSubsample

Reasoning: For samples with a high fraction of non-host reads (e.g., stool samples), the .fasta outputs following Bowtie2 alignment may contain large numbers of sequences, and downstream alignment to the NT and NR databases is resource-intensive. To reduce computational time, the reads are randomly subsampled to 1 million total fragments (i.e., 1 million single-end reads, or 2 million reads for paired-end libraries).

Input:

  • bowtie2_1.fa
  • bowtie2_2.fa

Process: Randomly subsample to 1 million fragments, as sketched below.
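
A minimal Python sketch of the subsampling logic (fragments are held in memory for simplicity; the pipeline operates on files):

import random

MAX_FRAGMENTS = 1_000_000

def subsample(fragments):
    """fragments: list of reads (single-end) or (R1, R2) tuples (paired-end)."""
    if len(fragments) <= MAX_FRAGMENTS:
        return fragments
    return random.sample(fragments, MAX_FRAGMENTS)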

Output:

  • subsampled_1.fa
  • subsampled_2.fa
  • subsampled_merged.fa

Step 8: RunGsnapFilter

Reasoning: Regardless of the specified “host” organism, it is essential to remove all potentially human sequences for privacy reasons. Thus, a final GSNAP alignment is performed against the human genome for samples from all host types.

Input:

  • subsampled_1.fa
  • subsampled_2.fa
  • subsampled_merged.fa

Process:

gsnapl
-A sam
--batch=0
--use-shared-memory=0
--gmap-mode=all
--npaths=1
--ordered
-t 32
--max-mismatches=40
-D {gsnap_base_dir}
-d {gsnap_index_name}
-o {output_sam_file} 
{input_fas}

Two input FASTA files indicate paired-end reads. The GSNAP documentation can be found here.

Output:

  • gsnap_filter_1.fa - R1 fasta file containing reads that remain after host removal and QC, input to GSNAP
  • gsnap_filter_2.fa - R2 fasta file containing reads that remain after host removal and QC, input to GSNAP
  • gsnap_filter_merged.fa - combined fasta file containing reads that remain after host removal and QC, input to Rapsearch2