Skip to content

A collection of scripts for filtering annotated variant call format files

License

Notifications You must be signed in to change notification settings

scholl-lab/vcf-filtering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

vcf-filtering

This repository contains scripts and documentation for filtering variant call format (VCF) files to identify rare genetic variants in genes of interest using a streamlined bash script, filter_variants.sh.

Overview

The filter_variants.sh script performs the following steps:

  1. Extract Genes of Interest: Uses snpEff genes2bed to produce a BED file containing the genes of interest.
  2. Sort BED File: Sorts the generated BED file.
  3. Modify BED File: Adds "chr" prefix to the entries in the BED file if add_chr is set to true.
  4. Extract Variants: Uses bcftools to extract the variants in the BED file from the VCF.
  5. Filter for Rare Variants: Uses SnpSift to filter for rare variants based on the provided filter string.
  6. Extract Fields of Interest: Uses SnpSift again to extract the specified fields of interest.
  7. Modify Header: Removes the "ANN[0]" and "GEN[*]" prefixes from the header.
  8. Replace GT Values: Uses replace_gt_with_sample.sh to replace the GT values with the sample names.
  9. Save Output: Saves the output to a specified file.

Example of the shell pipeline the script is composing:

snpEff genes2bed GRCh38.mane.1.0.refseq OFD1 | sortBed | awk '{print "chr"$0}' | bcftools view ann.dbnsfp.vcf.gz -R - | SnpSift -Xmx8g filter " (( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')) " | SnpSift -Xmx4g extractFields -s "," -e "NA" - CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT | sed -e '1s/ANN\[0\]\.//g; s/GEN\[\*\]\.//g' | ./replace_gt_with_sample.sh samples.txt 21 > OFD1_rare_variants.GCKD.tsv

Usage of filter_variants.sh

./filter_variants.sh [--config config_file] <gene_name> <vcf_file_location> [reference] [add_chr] [filters] [fields_to_extract] [sample_file] [replace_script_location] [output_file]

Parameters:

  • --config config_file: (Optional) The path to the configuration file containing default values for parameters.
  • gene_name: The name of the gene of interest, e.g., "BICC1".
  • vcf_file_location: The location of the VCF file.
  • reference: (Optional, default: "GRCh38.mane.1.0.refseq") The reference to use.
  • add_chr: (Optional, default: true) Whether or not to add "chr" to the chromosome name. Use "true" or "false".
  • filters: (Optional, default: Filters for rare and moderate/high impact variants) The filters to apply.
  • fields_to_extract: (Optional, default: Various fields including gene info, predictions, allele counts) The fields to extract.
  • sample_file: (Optional, default: "samples.txt") The path to the file containing the sample values to use for replacement.
  • replace_script_location: (Optional, default: "./replace_gt_with_sample.sh") The location of the replace_gt_with_sample.sh script.
  • replace_script_options: (Optional) Additional options to pass to the replace_gt_with_sample.sh script. This can be used to append genotype values to sample names for non-"0/0" genotypes.
  • output_file: (Optional, default: "variants.tsv") The name of the output file.

Configuration File

The script allows users to provide a configuration file containing default values for parameters. The configuration file is sourced if provided, and the values specified in it are used as defaults.

Example of a configuration file:

reference=GRCh38.mane.1.0.refseq
add_chr=true
filters=(( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))
fields_to_extract=CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT
sample_file=samples.txt
replace_script_location=./replace_gt_with_sample.sh
replace_script_options="--append-genotype"
output_file=variants.tsv

Generating the Sample File

To generate the sample file from a multi-sample VCF, you can use the following command:

bcftools view -h /path/to/your_multi_sample.vcf.gz | awk -F'	' '{ for (i=10; i<=NF; ++i) printf "%s%s", $i, (i==NF ? RS : ",") }' > /path/to/samplefile.txt

Requirements

  • GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
  • GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
  • bcftools 1.17
  • snpEff version SnpEff 5.1d (build 2022-04-19 15:49)
  • SnpSift version 5.1d (build 2022-04-19 15:50)

Releases

No releases published