This guide covers merging, concatenating, sorting, comparing, and subsetting VCF files with bcftools.
- bcftools installed (conda install -c bioconda bcftools)
- Input VCFs should be compressed (.vcf.gz) and indexed for most operations
- Index files with bcftools index input.vcf.gz
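
The indexing prerequisite can be checked before any other step runs. The following is a minimal sketch, assuming the index sits next to the VCF with a .csi suffix (the bcftools index default) or a .tbi suffix. The helper names has_index and ensure_indexed are hypothetical and are not part of bcftools.

```bash
# has_index: hypothetical helper, not a bcftools command.
# Succeeds if FILE has a .csi or .tbi index next to it.
has_index() {
    [ -e "$1.csi" ] || [ -e "$1.tbi" ]
}

# ensure_indexed: run bcftools index only when no index was found.
ensure_indexed() {
    if has_index "$1"; then
        echo "indexed: $1"
    else
        echo "indexing: $1"
        bcftools index "$1"
    fi
}
```

Calling ensure_indexed on each input before a merge or an isec run avoids re-indexing files that already have an index.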
Tell your AI agent what you want to do:
- "Merge VCF files from different samples into a single cohort VCF"
- "Concatenate per-chromosome VCFs into a genome-wide file"
- "Compare two variant callsets and find shared and unique variants"
- "Extract specific samples from a multi-sample VCF"
bcftools merge and bcftools concat are commonly confused:
| Operation | Use When | Example |
|---|---|---|
| merge | Same regions, different samples | Combining per-sample VCFs from the same caller |
| concat | Same samples, different regions | Combining per-chromosome VCFs |
You have variants called separately for each sample:
sample1.vcf.gz → contains only sample1
sample2.vcf.gz → contains only sample2
sample3.vcf.gz → contains only sample3
Use bcftools merge to combine into one multi-sample VCF.
You have variants called in parallel by chromosome:
chr1.vcf.gz → all samples, chromosome 1
chr2.vcf.gz → all samples, chromosome 2
chr3.vcf.gz → all samples, chromosome 3
Use bcftools concat to combine into one genome-wide VCF.
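
The choice between the two can be made from the sample lists alone. The sketch below is one possible check, assuming each VCF's sample names have already been written to a text file, one name per line (for example with bcftools query -l a.vcf.gz > a.samples.txt). The function name suggest_operation and its three answers are hypothetical.

```bash
# suggest_operation: hypothetical helper, not part of bcftools.
# Args: two text files, one sample name per line.
suggest_operation() {
    local shared
    # Identical names in identical order: the files look like
    # regional slices of one cohort.
    if cmp -s "$1" "$2"; then
        echo "concat"
        return
    fi
    # Count names that occur in both lists.
    shared=$(sort "$1" "$2" | uniq -d | wc -l | tr -d ' ')
    if [ "$shared" -eq 0 ]; then
        echo "merge"   # disjoint sample sets
    else
        echo "check"   # partial overlap or different order: inspect manually
    fi
}
```

A result of check usually means the sample names need renaming or reordering before either command will behave as intended.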
bcftools merge sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz -Oz -o merged.vcf.gz
bcftools index merged.vcf.gz

For many samples, create a file list:

# Create list of VCF files
ls *.vcf.gz > vcf_list.txt
# Merge all
bcftools merge -l vcf_list.txt -Oz -o merged.vcf.gz

When samples have variants at different positions:

# Default: missing genotypes shown as ./.
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
# Treat missing as homozygous reference
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

If VCFs have the same sample name:

# Force merge (keeps both with modified names)
bcftools merge --force-samples file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz

bcftools merge --threads 4 -l vcf_list.txt -Oz -o merged.vcf.gz

# Files must be in order
bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz

bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz chrM.vcf.gz -Oz -o genome.vcf.gz

# files.txt must list VCFs in genomic order
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz

When files may contain overlapping regions:

bcftools concat -a file1.vcf.gz file2.vcf.gz -Oz -o combined.vcf.gz

# Remove exact duplicate records
bcftools concat -a -d exact file1.vcf.gz file2.vcf.gz -Oz -o combined.vcf.gz
# Remove records duplicated at the same position, regardless of alleles
bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o combined.vcf.gz

bcftools sort input.vcf -Oz -o sorted.vcf.gz
bcftools index sorted.vcf.gz

Use a temporary directory and a memory limit:

bcftools sort -T /scratch/tmp -m 4G input.vcf.gz -Oz -o sorted.vcf.gz

bcftools isec -p comparison_dir caller1.vcf.gz caller2.vcf.gz

This creates:
- comparison_dir/0000.vcf - Variants only in caller1
- comparison_dir/0001.vcf - Variants only in caller2
- comparison_dir/0002.vcf - Shared variants (from caller1)
- comparison_dir/0003.vcf - Shared variants (from caller2)
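
The four output files can be reduced to a short concordance summary. This is a minimal sketch, assuming the directory holds uncompressed .vcf files from a two-file comparison (with -Oz the files are .vcf.gz and would need decompressing first). The helper names count_records and summarize_isec are hypothetical, and the Jaccard index is an illustrative choice of overlap metric, not something bcftools reports.

```bash
# count_records: number of non-header lines in an uncompressed VCF.
# grep exits non-zero when the count is 0, hence the "|| true".
count_records() {
    grep -vc '^#' "$1" || true
}

# summarize_isec: hypothetical helper for a two-file isec directory.
summarize_isec() {
    local dir=$1 only1 only2 shared
    only1=$(count_records "$dir/0000.vcf")
    only2=$(count_records "$dir/0001.vcf")
    shared=$(count_records "$dir/0002.vcf")
    echo "only_file1=$only1 only_file2=$only2 shared=$shared"
    # Jaccard index: shared records over all distinct records.
    awk -v a="$only1" -v b="$only2" -v s="$shared" 'BEGIN {
        t = a + b + s
        if (t == 0) print "jaccard=NA"
        else printf "jaccard=%.3f\n", s / t
    }'
}
```

Running summarize_isec comparison_dir prints the three counts on one line and the overlap ratio on the next.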
bcftools isec -p comparison_dir -Oz caller1.vcf.gz caller2.vcf.gz

# Variants present in both files
bcftools isec -n=2 -w1 caller1.vcf.gz caller2.vcf.gz -Oz -o shared.vcf.gz

# Variants only in file1 (not in file2)
# -w1 writes the caller1 records as VCF; without -w or -p, isec prints a site list
bcftools isec -C -w1 caller1.vcf.gz caller2.vcf.gz -Oz -o only_in_caller1.vcf.gz

# Find variants present in all 3 files
bcftools isec -n=3 -w1 file1.vcf.gz file2.vcf.gz file3.vcf.gz -Oz -o in_all_three.vcf.gz
# Find variants present in at least 2 of 3 files
bcftools isec -n+2 -w1 file1.vcf.gz file2.vcf.gz file3.vcf.gz -Oz -o in_at_least_two.vcf.gz

| Option | Meaning |
|---|---|
| -n=2 | Present in exactly 2 files |
| -n+2 | Present in 2 or more files |
| -n-2 | Present in 2 or fewer files |
| -n~11 | Present in file1 AND file2 |
| -n~10 | Present in file1 but NOT file2 |
| -n~01 | Present in file2 but NOT file1 |
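
The ~ form takes one character per input file, in command-line order, which gets error-prone with more than two or three files. As a rough sketch, the bitmap can be built from the 1-based positions of the files that must contain the variant. The helper name isec_bitmap is hypothetical and not part of bcftools.

```bash
# isec_bitmap: hypothetical helper, not part of bcftools.
# Usage: isec_bitmap TOTAL_FILES INDEX [INDEX...]
# Prints TOTAL_FILES characters: 1 at each 1-based INDEX, 0 elsewhere.
isec_bitmap() {
    local total=$1
    shift
    local bitmap="" i idx bit
    for ((i = 1; i <= total; i++)); do
        bit=0
        for idx in "$@"; do
            if [ "$idx" -eq "$i" ]; then
                bit=1
            fi
        done
        bitmap+=$bit
    done
    echo "$bitmap"
}

# Example call (requires bcftools and real inputs), kept as a comment:
# bcftools isec -n~"$(isec_bitmap 3 1 3)" file1.vcf.gz file2.vcf.gz file3.vcf.gz
```

The commented call asks for variants found in file1 and file3 but not in file2.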
bcftools view -s sample1,sample2,sample3 input.vcf.gz -Oz -o subset.vcf.gz

bcftools view -s ^sample4,sample5 input.vcf.gz -Oz -o without.vcf.gz

# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz
# Exclude samples in list
bcftools view -S ^samples.txt input.vcf.gz -Oz -o without.vcf.gz

bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz

# From BED file
bcftools view -R targets.bed input.vcf.gz -Oz -o targets.vcf.gz
# Multiple regions
bcftools view -r chr1:1000-2000,chr2:3000-4000 input.vcf.gz -Oz -o regions.vcf.gz

# Format: old_name new_name (whitespace-separated, one pair per line)
cat > rename.txt << EOF
SRR123456 patient_001
SRR123457 patient_002
SRR123458 patient_003
EOF

bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz

# Before
bcftools query -l input.vcf.gz
# After
bcftools query -l renamed.vcf.gz

for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
    bcftools index "${sample}.vcf.gz"
done

for chr in chr1 chr2 chr3 chr4 chr5; do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
    bcftools index "${chr}.vcf.gz"
done

# List all sample VCFs
ls samples/*.vcf.gz > sample_vcfs.txt
# Merge with missing handled as reference
bcftools merge -l sample_vcfs.txt --missing-to-ref -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
# Verify samples
bcftools query -l cohort.vcf.gz

# After calling variants in parallel by chromosome
for chr in chr{1..22} chrX chrY chrM; do
    bcftools index "output/${chr}.vcf.gz"
done
# Concatenate all in genomic order (a chr*.vcf.gz glob would sort chr10 before chr2)
bcftools concat output/chr{1..22}.vcf.gz output/chrX.vcf.gz output/chrY.vcf.gz output/chrM.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz

# Compare GATK vs bcftools
bcftools isec -p caller_comparison gatk.vcf.gz bcftools.vcf.gz
# Count variants in each category
echo "GATK only: $(bcftools view -H caller_comparison/0000.vcf | wc -l)"
echo "bcftools only: $(bcftools view -H caller_comparison/0001.vcf | wc -l)"
echo "Shared: $(bcftools view -H caller_comparison/0002.vcf | wc -l)"

# samples_case.txt and samples_control.txt contain sample names
# Extract cases
bcftools view -S samples_case.txt cohort.vcf.gz -Oz -o cases.vcf.gz
bcftools index cases.vcf.gz
# Extract controls
bcftools view -S samples_control.txt cohort.vcf.gz -Oz -o controls.vcf.gz
bcftools index controls.vcf.gz

You're using concat when you should use merge, or vice versa:
- merge = same regions, different samples
- concat = same samples, different regions

Input files must be sorted for concat:

bcftools sort input.vcf.gz -Oz -o sorted.vcf.gz

Or allow overlapping or out-of-order files with -a (inputs must be indexed):

bcftools concat -a file1.vcf.gz file2.vcf.gz -Oz -o combined.vcf.gz

Many operations need indexed input:

bcftools index input.vcf.gz

Samples have the same name in different files:
# Either rename samples first
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
# Or force merge
bcftools merge --force-samples file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz

- bcftools merge documentation
- bcftools concat documentation
- bcftools isec documentation
- vcf-basics - View and query VCF files
- filtering-best-practices - Filter variants before manipulation