VCF Merge Sequential Pipeline

Overview

This repository contains a pipeline that merges a large number of VCF files with bcftools. It splits the input files into manageable batches, merges each batch, and then combines the batch results into a single merged VCF file.

Setup

Configuration

Before running the pipeline, you need to set up the config.yaml file:

vcf_folder: "/path/to/vcf/files"
output_folder: "/desired/output/folder"
vcfs_per_batch: 500
conda_env: "bcftools"

vcf_folder: The path to the directory containing your .vcf.gz files.
output_folder: The desired directory where the results will be saved.
vcfs_per_batch: The number of VCFs processed in each batch.
conda_env: The name of the Conda environment to be used, which must have bcftools installed.
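To illustrate what vcfs_per_batch controls, here is a minimal sketch of splitting a folder of .vcf.gz files into batches. The function names list_vcfs and make_batches are hypothetical and are not taken from the pipeline itself; the actual Snakefile may group files differently.

```python
from pathlib import Path


def list_vcfs(vcf_folder):
    """Return the .vcf.gz files found directly in vcf_folder, sorted by name."""
    return sorted(Path(vcf_folder).glob("*.vcf.gz"))


def make_batches(vcf_paths, vcfs_per_batch):
    """Split vcf_paths into consecutive batches of at most vcfs_per_batch items."""
    if vcfs_per_batch < 1:
        raise ValueError("vcfs_per_batch must be at least 1")
    return [
        vcf_paths[i:i + vcfs_per_batch]
        for i in range(0, len(vcf_paths), vcfs_per_batch)
    ]


# Hypothetical input: 1,200 file names, batched 500 at a time.
names = [f"sample_{n:04d}.vcf.gz" for n in range(1200)]
batches = make_batches(names, 500)
print([len(batch) for batch in batches])  # [500, 500, 200]
```

With the example configuration above, 1,200 input files would produce three batches, the last one smaller than the others.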

Conda Environment

Ensure that the Conda environment named under conda_env in config.yaml has bcftools installed.
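One quick way to confirm this is to activate the environment and check whether bcftools is on the PATH. The snippet below is a small sketch of such a check; the helper name tool_available is hypothetical and the check is not part of the pipeline.

```python
import shutil


def tool_available(name):
    """Return True if an executable called name can be found on PATH."""
    return shutil.which(name) is not None


# Run this inside the activated Conda environment.
print("bcftools found:", tool_available("bcftools"))
```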

Running the Pipeline

Directly with Snakemake

If you have Snakemake installed, you can run the pipeline directly:

snakemake -s merge_sequential.smk --use-conda

With Slurm Scheduler

To run the pipeline on a system with the Slurm scheduler, use the provided shell script:

sbatch run_merge_sequential.sh

The script sets up necessary temporary directories, logs, and other parameters before invoking Snakemake. The logs for the Snakemake jobs are saved in the slurm_logs directory.

Logging

The script logs the start and end times of each rule to facilitate performance profiling and troubleshooting. Log files are stored in the directory specified under log_subfolder in the configuration file.
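The idea of recording start and end times per step can be sketched as a small context manager. This is an illustration only: the names timed_step and merge_batch_0 are hypothetical, and the pipeline's own logging may be implemented differently (for example, directly in shell commands).

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed_step(name, logger):
    """Log when a step starts and ends, together with its elapsed time."""
    start = time.time()
    logger.info("%s started", name)
    try:
        yield
    finally:
        # Runs even if the step raises, so the end time is always recorded.
        logger.info("%s finished after %.1f s", name, time.time() - start)


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("merge_sequential")

with timed_step("merge_batch_0", log):
    time.sleep(0.1)  # stand-in for the real work of a rule
```

Comparing the timestamps of the "started" and "finished" lines for each step shows where the run spends most of its time.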

Contribution

Feel free to fork the repository and submit pull requests for any enhancements or bug fixes. Contributions to improve the script or documentation are welcome.

TODO

Implement md5sum calculation for all files to verify data integrity.
Remove intermediate files to save storage space.
Ensure proper sequence of index and VCF file creation to maintain organization.
Add error handling to manage potential issues gracefully, such as missing input files or unsuccessful command executions.
Consider making the file extensions configurable to allow for more flexible input.
Explore options for dynamic memory allocation based on the number of threads, possibly through a configuration setting or automatic calculation.
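As a starting point for the md5sum item above, a checksum can be computed with the Python standard library alone. This is a sketch only and is not part of the pipeline; the function name md5sum and the chunk size are assumptions.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for large .vcf.gz files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest of each output file next to it would allow a later comparison to detect corrupted or truncated files.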
