vcf-filtering

This repository contains scripts and documentation for filtering variant call format (VCF) files to identify rare genetic variants in genes of interest using a streamlined bash script, filter_variants.sh.

Overview

The filter_variants.sh script performs the following steps:

Extract Genes of Interest: Uses snpEff genes2bed to produce a BED file containing the genes of interest.
Sort BED File: Sorts the generated BED file.
Modify BED File: Adds "chr" prefix to the entries in the BED file if add_chr is set to true.
Extract Variants: Uses bcftools to extract the variants in the BED file from the VCF.
Filter for Rare Variants: Uses SnpSift to filter for rare variants based on the provided filter string.
Extract Fields of Interest: Uses SnpSift again to extract the specified fields of interest.
Modify Header: Removes the "ANN[0]" and "GEN[*]" prefixes from the header.
Replace GT Values: Uses replace_gt_with_sample.sh to replace the GT values with the sample names.
Save Output: Saves the output to a specified file.

Example of the shell pipeline the script is composing:

snpEff genes2bed GRCh38.mane.1.0.refseq OFD1 | sortBed | awk '{print "chr"$0}' | bcftools view ann.dbnsfp.vcf.gz -R - | SnpSift -Xmx8g filter " (( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')) " | SnpSift -Xmx4g extractFields -s "," -e "NA" - CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT | sed -e '1s/ANN\[0\]\.//g; s/GEN\[\*\]\.//g' | ./replace_gt_with_sample.sh samples.txt 21 > OFD1_rare_variants.GCKD.tsv

Usage of filter_variants.sh

./filter_variants.sh [--config config_file] <gene_name> <vcf_file_location> [reference] [add_chr] [filters] [fields_to_extract] [sample_file] [replace_script_location] [output_file]

Parameters:

--config config_file: (Optional) The path to the configuration file containing default values for parameters.
gene_name: The name of the gene of interest, e.g., "BICC1".
vcf_file_location: The location of the VCF file.
reference: (Optional, default: "GRCh38.mane.1.0.refseq") The reference to use.
add_chr: (Optional, default: true) Whether or not to add "chr" to the chromosome name. Use "true" or "false".
filters: (Optional, default: Filters for rare and moderate/high impact variants) The filters to apply.
fields_to_extract: (Optional, default: Various fields including gene info, predictions, allele counts) The fields to extract.
sample_file: (Optional, default: "samples.txt") The path to the file containing the sample values to use for replacement.
replace_script_location: (Optional, default: "./replace_gt_with_sample.sh") The location of the replace_gt_with_sample.sh script.
replace_script_options: (Optional) Additional options to pass to the replace_gt_with_sample.sh script. This can be used to append genotype values to sample names for non-"0/0" genotypes.
output_file: (Optional, default: "variants.tsv") The name of the output file.

Configuration File

The script allows users to provide a configuration file containing default values for parameters. The configuration file is sourced if provided, and the values specified in it are used as defaults.

Example of a configuration file:

reference=GRCh38.mane.1.0.refseq
add_chr=true
filters=(( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))
fields_to_extract=CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT
sample_file=samples.txt
replace_script_location=./replace_gt_with_sample.sh
replace_script_options="--append-genotype"
output_file=variants.tsv

Generating the Sample File

To generate the sample file from a multi-sample VCF, you can use the following command:

bcftools view -h /path/to/your_multi_sample.vcf.gz | awk -F'	' '{ for (i=10; i<=NF; ++i) printf "%s%s", $i, (i==NF ? RS : ",") }' > /path/to/samplefile.txt

Requirements

GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
GNU bash, version 4.4.20(1)-release (x86_64-redhat-linux-gnu)
bcftools 1.17
snpEff version SnpEff 5.1d (build 2022-04-19 15:49)
SnpSift version 5.1d (build 2022-04-19 15:50)

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analyze_variants.R		analyze_variants.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vcf-filtering

Overview

Usage of filter_variants.sh

Parameters:

Configuration File

Generating the Sample File

Requirements

About

Releases

Languages

License

scholl-lab/vcf-filtering

Folders and files

Latest commit

History

Repository files navigation

vcf-filtering

Overview

Usage of filter_variants.sh

Parameters:

Configuration File

Generating the Sample File

Requirements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Languages