This repository contains a simple, linear Next-Generation Sequencing (NGS) pipeline using bash for paired-end FastQ data processing, optimized for Whole Exome Sequencing (WES).
The pipeline incorporates the following standard industry practices:
0. Quality Control & Trimming: fastqc assesses read quality and trim_galore removes sequencing adapters and low-quality sequences (defaulting to --quality 20 and --length 50) to minimize noise. For advanced trimming options, consult the Trim Galore User Guide.
- Alignment: Trimmed reads are aligned to the reference genome (
hg38.fa) usingbwa mem, then sorted viasamtools sort. - Duplicate Marking: Eliminates PCR or optical duplicates using
gatk MarkDuplicatesto prevent biased variant calling. - Base Quality Score Recalibration (BQSR): Detects and corrects systematic errors made by the sequencing machine in estimating base quality scores using
gatk BaseRecalibratorandApplyBQSR. - WES Quality Metrics: Calculates general alignment statistics, insert sizes, and on-target capture efficiency using
gatk CollectAlignmentSummaryMetrics,CollectInsertSizeMetrics,CollectHsMetricsandDepthOfCoverage. - Variant Calling: Scans the recalibrated alignment to identify SNPs and Indels using
gatk HaplotypeCaller. - Variant Filtration: Applies rigorous, fixed thresholds adapted specifically for high-coverage WES data using
gatk VariantFiltration. - Quality Extraction: Excludes heavily filtered calls, leaving only high-confidence (
PASS) variants usingbcftools view. - CSV Conversion: Converts the final
PASSvariants VCF into an Excel-compatible Comma-Separated Values (.csv) file. - Variant Annotation: Structurally and functionally annotates the resulting variants to understand their biological impact.
First, clone this pipeline repository to your local machine and navigate into it:
git clone https://github.com/Waqaskhan600/Bioinformatics-DNA-seq-data-analysis-WES-.git
cd Bioinformatics-DNA-seq-data-analysis-WES-We highly recommend using Conda to manage your bioinformatics tools.
Included in this repository is an environment.yaml file containing all the necessary dependencies (fastqc, trim-galore, bwa, samtools, gatk4, and bcftools).
Create and activate the environment quickly via:
conda env create -f environment.yaml
conda activate wes_pipelineYou need to download and index a reference genome before running the scripts. The Broad Institute provides the standard hg38 resource bundle.
Run the following to download and prepare your reference:
# Create a folder for your references
mkdir -p resources/reference
cd resources/reference
# Download the Fasta genome sequence
wget https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -O hg38.fa
# Create a BWA Index (Warning: this step can take 2-3 hours)
bwa index hg38.fa
# Create a Fasta sequence dictionary for GATK
gatk CreateSequenceDictionary -R hg38.fa
# Create a samtools expected FASTA index (.fai)
samtools faidx hg38.faNote: Make sure that REFERENCE inside ngs_pipeline.sh points to the absolute path of the hg38.fa file you just downloaded.
Whole Exome Sequencing explicitly targets specific regions of the genome using probes or "baits". Critical Point: Always use the exact target region file (.bed or .interval_list) that matches your capture kit. Using the wrong target file will lead to incorrect quality metrics, missed variants, and suboptimal variant calling.
You must place your target regions BED file at resources/reference/target_regions.bed (or configure TARGET_REGIONS in the shell script). Common panels include:
- Agilent SureSelect – Uses RNA baits, excellent uniformity (https://earray.chem.agilent.com/suredesign/)
- Illumina Nextera – DNA baits, good for degraded samples (https://support.illumina.com/sequencing/sequencing_kits/nextera-dna-exome/downloads.html)
- Roche SeqCap – DNA baits, customizable designs (https://diagnostics.roche.com/us/en/eservice-overview/technical-documents.html). Note: If the file is not listed, users often have to contact Roche Technical Support directly to request legacy design files.
- Twist Bioscience – Synthetic DNA baits, latest technology (Human Core Exome: https://www.twistbioscience.com/resources/data-files/ngs-human-core-exome-panel-bed-file) (Mitochondrial: https://www.twistbioscience.com/resources/data-files/twist-mitochondrial-dna-panel-bed-files) (Methylome: https://www.twistbioscience.com/resources/data-files/twist-human-methylome-panel-target-bed-file)
GATK Interval List Format:
Critical quality control tools like CollectHsMetrics require the BED file to be converted into a GATK-specific .interval_list format with a standard SAM/BAM header. If you only have the .bed file, you can easily convert it using:
# Create interval list for GATK
gatk BedToIntervalList \
-I resources/reference/target_regions.bed \
-O resources/reference/target_regions.interval_list \
-SD resources/reference/hg38.dictWhile target files are highly recommended for WES to calculate capture efficiency and restrict variant calling to valid exons, they are not strictly mandatory to generate a VCF. If you absolutely do not have the BED file from the sequencing center:
- You must comment out or remove
gatk CollectHsMetricsandgatk DepthOfCoveragein the script. - You must remove the
-L ${TARGET_REGIONS}flag fromgatk HaplotypeCaller(Step 5).
Note: Without the -L flag, HaplotypeCaller will blindly search the entire genome for variants instead of just the exome, which will take significantly longer. Furthermore, you will lose all metrics on whether the physical capture kit actually worked and whether you achieved enough depth to trust your variants!
Note: The pipeline includes built-in Pre-Run Validation Checks that will ensure you have correctly downloaded the reference genome, installed all software dependencies, and placed your input data before executing any bioinformatics tools. If anything is missing (for example, if you forgot to run conda activate wes_pipeline), the script will catch the missing dependencies, halt immediately, and provide the exact commands you need to fix the issue!
- Configure the pipeline: Open
ngs_pipeline.shand define thePROJECT_DIRvariable to point to your project's absolute path. If you are running the script directly from within your project directory, you can simply set it toPROJECT_DIR="$PWD". All other directories (data, outputs, etc.) will branch off from this folder. - Place your raw input reads in the
data/directory inside your project folder (e.g.,sample_1_R1.fastq.gz,sample_1_R2.fastq.gz).- Tip on Storage: While this pipeline dynamically accepts uncompressed
.fastqfiles automatically, they are incredibly large. It is considered an industry best practice to compress them yourself to save immense disk space (70-80% reduction) by runninggzip input/*.fastqbefore starting.
- Tip on Storage: While this pipeline dynamically accepts uncompressed
- Update the
REFERENCEandSAMPLEvariables insidengs_pipeline.shto match your reference genome and sample name. - Make the script executable:
chmod +x ngs_pipeline.sh
- Run the script:
./ngs_pipeline.sh
A companion ngs_config.yaml has also been created. Future versions of this pipeline can be modified to parse variables straight out of the YAML using tools like yq, making configuration much easier across multiple samples.
The pipeline automatically generates a structured output directory system branching off your PROJECT_DIR:
qc/: FastQC HTML quality reports for the raw input reads.trimmed/: Cleaned paired-end fastq files produced by trim_galore.aligned/: Sorted BAMs, GATK duplicate metrics, and duplicate-marked BAMs.variants/: Unfiltered VCFs, hard-filtered VCFs, finalPASSVCFs, and Excel-compatible.csvvariant sheets.logs/: Detailed terminal console outputs for every single step of the pipeline. If a step fails or produces unexpected results, check these log files for exact tool error messages.
Once you have your final VCF, you can heavily annotate your variants to predict their clinical or biological consequences through various external tools such as:
- wANNOVAR
- Ensembl VEP
- SnpEff
- GATK Funcotator
No single tool provides complete information. Each has unique strengths:
- GATK Funcotator – GATK’s native annotator with a clinical database focus
- Ensembl VEP – Comprehensive consequence prediction with extensive options
- SnpEff – Fast annotation with powerful filtering capabilities
- ANNOVAR – Excellent for population frequency analysis
By using multiple tools, you gain different perspectives on each variant and build confidence in your interpretations.
When evaluating the results/qc/ reports (_hs_metrics.txt, _alignment_summary.txt, and _insert_size_metrics.txt), keep an eye on these critical WES-specific metrics:
- PCT_PF_READS_ALIGNED (Alignment Summary): Percentage of reads aligned to the reference.
- Standard: >95% for reliable human read data.
- PF_MISMATCH_RATE (Alignment Summary): Rate of mismatches to the reference.
- Standard: <5% (high mismatch rates imply DNA degradation or species contamination).
- INSERT_SIZE (Insert Size Metrics): Distance between paired reads.
- Expected: WES target kits typically fragment DNA smaller than WGS to tightly fit exons. Expect a clean bell curve around 150-250bp.
- ON_TARGET_BASES (HsMetrics): Percentage of sequenced bases that fall within your target capture regions.
- Excellent: >80%
- Good: 70-80%
- Poor: <70%
- MEAN_TARGET_COVERAGE: Average coverage depth across target regions.
- Clinical applications: >100x
- Research applications: >50x
- PCT_TARGET_BASES_10X/20X/30X: Percentage of targets successfully covered at various depths.
- Clinical quality: >95% at 20x coverage
- FOLD_ENRICHMENT: How much more coverage the targeted regions received versus the genome average.
- Good enrichment: >40-fold
- Poor enrichment: <20-fold
- AT_DROPOUT and GC_DROPOUT: Coverage uniformity across different GC content.
- Good uniformity: <10% dropout
- Problematic: >20% dropout
Coverage Requirements (Review your qc/*_hs_metrics.txt):
- Mean target coverage: >100x for clinical applications, >50x for research
- Coverage uniformity: >95% of targets at 20x depth
- On-target rate: >80% for optimal results
Performance Benchmarks:
- Variant Calling Sensitivity: >95% for variants >20x coverage (via
gatk HaplotypeCaller) - Specificity: >99% after applying the hard filters generated by
gatk VariantFiltration
-
Low On-Target Rate (<70%)
- Causes: Wrong interval list, poor physical capture, sample degradation.
- Solutions: Verify your
TARGET_INTERVALSandTARGET_REGIONSvariables point precisely to the correct manufacturer capture kit. - Prevention: Always validate protocols using exact manufacturer BED files.
-
Uneven Coverage Across Targets
- Causes: GC bias, capture probe efficiency, PCR artifacts.
- Solutions: Review the AT/GC dropout metrics in the QC folder
hs_metrics. - Monitoring: Track coverage uniformity metrics output by
gatk DepthOfCoverage.
-
High False Positive Rate
- Causes: Insufficient filtering, capture artifacts, sequence biases.
- Solutions: Adjust the
gatk VariantFiltrationparameters specifically for your sample's coverage depth. The defaults we configured (QD < 2.0,DP < 20) are a baseline for high-coverage WES. - Validation: Confirm highly suspicious output "PASS" variants with Sanger sequencing.
-
Poor Raw Read Quality or Technical Bias at Read Extremes
- Indicator: The raw FastQC report shows erratic "Per Base Sequence Content" at the 5' end, or adapters are failing to be removed.
- Solutions: The pipeline defaults to standard, safe trimming (
--quality 20and--length 50). If you spot persistent issues in FastQC, you can add advanced flags to thetrim_galorecommand inStep 0.2of the script:--illuminaor--nextera: By default,trim_galoreis incredibly smart and will auto-detect which adapter your sequencing center used. However, if you know for an absolute fact you used a Nextera Exome kit, you can force it with--nextera.--clip_R1 5and--clip_R2 5: If the "Per Base Sequence Content" graph is chaotic for the very first 5 to 10 base pairs (very common in Illumina WES), this flag acts like a guillotine and mindlessly chops off the first 5 bases of every single read, regardless of quality, to remove that technical bias.
- Missing Expected Germline Variants
- Causes: Low coverage in specific exons, or the variant is located in a non-captured region outside the baits.
- Solutions: Manually review the
DepthOfCoveragefile for the specific gene coordinate to ensure it was properly sequenced. - Prevention: Always review your capture kit's exact probe content for genes of interest beforehand.
Thank you for using this pipeline!
If you found this useful and informative, or if you are looking for collaboration and research assistance, please feel free to reach out to me on LinkedIn: Waqas Khan - LinkedIn Waqas Khan Upwork