A Nextflow pipeline for generating Tree Sequences from PLINK and VCF files.
This pipeline is designed to infer Tree Sequences from genotype data. It is currently tailored for PLINK genotype files, where all relevant samples are contained within a single file. The pipeline converts the PLINK file into a VCF file, corrects ALT/REF alleles, and checks chromosome sizes. It then uses Beagle to impute and phase any missing data before running tsinfer to create Tree Sequences from the VCF file.
tsinfer requires ancestral alleles to generate tree sequence files. Currently, the pipeline supports three methods for determining ancestral alleles:
- Using the reference genome: The REF allele in the VCF file is used as the ancestral allele.
- Using est-sfs: This method estimates the site frequency spectrum and infers ancestral alleles. It requires the presence of outgroup samples (ancestral to the rest of the data) in the dataset.
- Using compara: This method requires an additional CSV file containing the ancestral alleles.
You can obtain this pipeline by cloning the GitHub repository:
git clone https://github.com/cnr-ibba/nf-treeseq.git
Alternatively, you can use the nextflow pull command:
nextflow pull cnr-ibba/nf-treeseq
For more information on installing and running Nextflow pipelines, including dealing with revisions, refer to the Nextflow documentation.
While all parameters can be passed via the command line, it is recommended to use a configuration file. The configuration file should be a simple JSON file containing at least the following parameters:
{
  "plink_bfile": "<binary plink prefix>",
  "plink_species": "<plink species options>",
  "plink_keep": "<plink keep file>",
  "plink_geno": 0.1,
  "genome": "<genome file>"
}
- plink_bfile: The binary PLINK file prefix used as the --bfile parameter.
- plink_species: Species-specific options for PLINK (e.g., --species sheep or --chr-set 26 no-xy no-mt --allow-no-sex).
- plink_keep: A TSV file with FID and IID columns indicating the samples to keep.
- plink_geno: The PLINK --geno parameter (default: 0.1), which excludes SNPs with a higher missing rate.
- genome: The genome file used by bcftools for allele normalization (setting ALT/REF alleles) and chromosome size correction.
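To make the plink_geno threshold concrete, the following Python sketch mimics a per-SNP missingness filter on a toy genotype matrix. It is an illustration only, not how PLINK or this pipeline implements the filter: the function name, the SNP IDs, and the data are hypothetical, and missing calls are represented as None.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # allele dosage 0/1/2, or None for a missing call


def snps_passing_geno(
    genotypes: dict[str, Sequence[Genotype]], max_missing: float = 0.1
) -> list[str]:
    """Return the SNP IDs whose missing-call rate does not exceed max_missing."""
    kept = []
    for snp_id, calls in genotypes.items():
        missing_rate = sum(call is None for call in calls) / len(calls)
        if missing_rate <= max_missing:
            kept.append(snp_id)
    return kept


# Toy data: 10 samples per SNP (hypothetical IDs and calls).
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],           # no missing calls
    "snp_b": [0, None, 2, 0, 1, 2, 0, 1, 2, 0],        # 1 of 10 missing
    "snp_c": [None, None, 2, 0, 1, None, 0, 1, 2, 0],  # 3 of 10 missing
}
print(snps_passing_geno(toy, max_missing=0.1))
```

With the default threshold of 0.1, snp_c is dropped because 30% of its calls are missing, while snp_b sits exactly at the threshold and is kept.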
The pipeline requires ancestral alleles to generate tree sequences. At least one of the following methods must be used to infer ancestral alleles:
To use the reference genome for inferring ancestral alleles, simply set the reference_ancestor flag:
{
  "reference_ancestor": true
}
To infer ancestral alleles using est-sfs, enable the with_estsfs flag and specify one or more outgroup sample files (TSV format with FID and IID columns). You can provide up to three outgroup files:
{
  "with_estsfs": true,
  "outgroup1": "<outgroup1 samples file>",
  "outgroup2": "<outgroup2 samples file>",
  "outgroup3": "<outgroup3 samples file>"
}
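If the outgroup files are chosen programmatically, the est-sfs parameters can be assembled with a small helper. The Python sketch below is a hypothetical convenience, not part of the pipeline: the function name and file names are placeholders, and it only enforces the one-to-three outgroup limit described above.

```python
import json


def estsfs_params(outgroup_files: list[str]) -> dict:
    """Build the est-sfs parameters from one to three outgroup sample files."""
    if not 1 <= len(outgroup_files) <= 3:
        raise ValueError("est-sfs needs between one and three outgroup files")
    params = {"with_estsfs": True}
    # Keys follow the outgroup1..outgroup3 naming used in the configuration file.
    for index, path in enumerate(outgroup_files, start=1):
        params[f"outgroup{index}"] = path
    return params


# Placeholder file names, for illustration only.
print(json.dumps(estsfs_params(["outgroup_a.tsv", "outgroup_b.tsv"]), indent=2))
```

The resulting dictionary can be merged into the rest of the configuration before it is written out as JSON.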
To use compara for inferring ancestral alleles, provide a CSV file with the following format:
chrom,position,alleles,anc_allele
26,209049,A/G,C
26,268822,A/G,C
26,285471,A/G,G
26,361728,G/T,G
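Before handing such a file to the pipeline, it can be useful to check that it parses as expected. The Python sketch below reads the four columns shown above and records whether each ancestral allele is one of the listed alleles. The function name is hypothetical, and how the pipeline treats sites whose ancestral allele is not among the listed alleles is not stated here, so the flag is informational only.

```python
import csv
import io

EXPECTED_COLUMNS = ["chrom", "position", "alleles", "anc_allele"]


def read_ancestral_csv(handle) -> list[dict]:
    """Parse the CSV and flag whether each ancestral allele is a listed allele."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        alleles = row["alleles"].split("/")
        rows.append(
            {
                "chrom": row["chrom"],
                "position": int(row["position"]),
                "alleles": alleles,
                "anc_allele": row["anc_allele"],
                # True when the ancestral allele matches one of the listed alleles.
                "anc_in_alleles": row["anc_allele"] in alleles,
            }
        )
    return rows


# Two rows taken from the example above, read from an in-memory file.
example = io.StringIO(
    "chrom,position,alleles,anc_allele\n"
    "26,209049,A/G,C\n"
    "26,285471,A/G,G\n"
)
for record in read_ancestral_csv(example):
    print(record["chrom"], record["position"], record["anc_in_alleles"])
```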
After generating the file, specify it using the compara_ancestor parameter:
{
  "compara_ancestor": "<compara file>"
}
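A quick sanity check of the configuration can catch omissions before a run starts. The Python sketch below treats the five PLINK and genome parameters described earlier as required and verifies that at least one ancestral allele method is selected. The function name and the example values are hypothetical, and the pipeline's own parameter validation remains the authority.

```python
import json

REQUIRED = ["plink_bfile", "plink_species", "plink_keep", "plink_geno", "genome"]


def check_config(params: dict) -> list[str]:
    """Return a list of problems found in a parameter dictionary (empty if none)."""
    problems = [f"missing parameter: {key}" for key in REQUIRED if key not in params]
    has_method = (
        params.get("reference_ancestor") is True
        or params.get("with_estsfs") is True
        or bool(params.get("compara_ancestor"))
    )
    if not has_method:
        problems.append("no ancestral allele method selected")
    return problems


# Hypothetical values, for illustration only.
config = json.loads(
    """
    {
      "plink_bfile": "data/my_samples",
      "plink_species": "--species sheep",
      "plink_keep": "data/keep.tsv",
      "plink_geno": 0.1,
      "genome": "data/genome.fa",
      "reference_ancestor": true
    }
    """
)
print(check_config(config) or "configuration looks complete")
```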
Additional parameters can be set in the configuration file to control the pipeline or specify the output directory. To see all available options, run:
nextflow run cnr-ibba/nf-treeseq --help
For more advanced options, including hidden parameters:
nextflow run cnr-ibba/nf-treeseq --help --validationShowHiddenParams
Once your configuration file is set up, run the pipeline with:
nextflow run cnr-ibba/nf-treeseq -profile <profile> -params-file <config.json>
- <profile>: The execution environment profile (e.g., docker or singularity).
- <config.json>: The configuration file you created.
You can also override specific parameters directly on the command line:
nextflow run cnr-ibba/nf-treeseq -profile singularity -params-file config.json --plink_geno 0.2
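When runs are launched from a script, the same command can be assembled as an argument list. The Python sketch below is a hypothetical wrapper, not part of the pipeline: it only handles simple scalar overrides and prints the command rather than executing it.

```python
import shlex
from typing import Optional


def build_command(
    profile: str, params_file: str, overrides: Optional[dict] = None
) -> list[str]:
    """Assemble the nextflow command line as a list of arguments."""
    command = [
        "nextflow", "run", "cnr-ibba/nf-treeseq",
        "-profile", profile,
        "-params-file", params_file,
    ]
    # Each override becomes a double-dash pipeline parameter, as in the example above.
    for name, value in (overrides or {}).items():
        command += [f"--{name}", str(value)]
    return command


command = build_command("singularity", "config.json", {"plink_geno": 0.2})
print(shlex.join(command))
```

The list can then be passed to subprocess.run(command, check=True) from the standard library to start the run.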