Skip to content

gmboowa/rMAP-TB

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

89 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rMAP-TB: Rapid Mycobacterial Analysis Pipeline for Tuberculosis & Mycobacterial Genomic Surveillance

rMAP-TB is a reproducible, Dockerized WDL/Cromwell workflow for public-health-oriented analysis of Mycobacterium tuberculosis complex (MTBC) and non-MTBC Mycobacteria genomic data. It supports paired-end Illumina FASTQ inputs & integrates read preprocessing, species typing, MTBC/NTM routing, TB drug-resistance profiling, lineage interpretation, MTBC-only sample filtering, core-SNP phylogenomics & interactive surveillance reporting.

The workflow first performs read trimming, sequence quality control & Kraken2/Bracken-based Mycobacteria species typing. Species typing is used to route samples before TB-Profiler execution: MTBC-supported samples proceed to TB-Profiler resistance, species & lineage profiling, while non-MTBC Mycobacteria are summarized separately through an NTM speciation branch. This allows the workflow to report the most probable NTM species and supporting evidence while excluding non-MTBC samples from MTBC-specific downstream analyses.

For MTBC-supported samples, rMAP-TB performs Snippy-based variant calling, mean-depth extraction, variant summary generation, Snippy-core core-genome alignment, drug-resistance-associated non-synonymous mutation summarization, pairwise SNP distance estimation, SNP cluster interpretation, lineage distribution analysis, optional Gubbins recombination filtering, IQ-TREE2 maximum-likelihood phylogeny & ETE3 phylogenetic tree visualization.

rMAP-TB generates integrated HTML reports & downloadable public-health surveillance outputs, including QC filtering rationale, Mycobacteria species typing summaries, NTM speciation summaries, TB-Profiler mutation-level resistance evidence, resistance-profile summaries, lineage distribution summaries, pairwise SNP distance tables, SNP cluster summaries, SNP distance heatmaps, phylogenetic tree visualizations & surveillance metadata TSV files.

Visual summary of the rMAP-TB workflow

rMAP-TB workflow

Detailed workflow logic & sample routing

Paired-end FASTQ files
          ⬇
Read trimming with Trimmomatic
          ⬇
Sequence quality control with FastQC
          ⬇
QC aggregation with MultiQC
          ⬇
Mycobacteria species typing with Kraken2 + Bracken
          ⬇
MTBC / non-MTBC Mycobacteria routing
          │
          ├──────────────────────────────────────────────▶ Non-MTBC Mycobacteria / NTM branch
          │                                                       ⬇
          │                                               NTM speciation summary
          │                                                       ⬇
          │                                               Most probable NTM species identified
          │                                                       ⬇
          │                                               Species-level evidence & MTBC support
          │                                                       ⬇
          │                                               Exclusion from MTBC-specific analysis
          │                                                       ⬇
          │                                               Non-MTBC Mycobacteria species summary
          │                                                       ⬇
          │                                               Integrated HTML report
          │
          ⬇
MTBC-supported samples only
          ⬇
TB-Profiler species, lineage & AMR profiling
          ⬇
MTBC-only sample filtering
          ⬇
Snippy per-sample variant calling
          ⬇
Mean-depth extraction & variant summary generation
          ⬇
Snippy-core core-genome alignment
          ⬇
Drug-resistance-associated non-synonymous mutation summary
          ⬇
Pairwise SNP distance estimation
          ⬇
SNP cluster interpretation
          ⬇
Lineage distribution summary
          ⬇
SNP distance heatmap generation
          ⬇
QC filtering rationale & surveillance metadata export
          ⬇
Optional Gubbins recombination filtering
          ⬇
IQ-TREE2 maximum-likelihood phylogeny
          ⬇
ETE3 phylogenetic tree visualization
          ⬇
Integrated HTML report with downloadable surveillance outputs

Key features

rMAP-TB provides a reproducible, modular workflow for Mycobacteria species typing, MTBC-focused tuberculosis genomic surveillance, drug-resistance interpretation, SNP analysis & phylogenomic reporting.

Input, preprocessing & quality control

  • Supports paired-end Illumina FASTQ inputs
  • Performs adapter trimming & read preprocessing using Trimmomatic
  • Runs per-sample sequence quality assessment using FastQC
  • Generates aggregated quality-control reports using MultiQC

Mycobacteria species typing & sample routing

  • Performs Mycobacteria species typing using Kraken2 & Bracken before TB-Profiler execution
  • Routes samples as MTBC or non-MTBC Mycobacteria based on Kraken2/Bracken species-typing results
  • Reports non-MTBC Mycobacteria / NTM speciation summaries
  • Reports the most probable NTM species with species-level evidence & MTBC support status
  • Excludes non-MTBC Mycobacteria from MTBC-specific downstream analyses

TB-Profiler resistance & lineage interpretation

  • Runs TB-Profiler for MTBC-supported samples
  • Reports MTBC species, lineage, sub-lineage and drug-resistance profiles
  • Provides WHO-aligned TB drug-resistance classification, including Hr-TB, RR-TB, MDR/RR-TB, Pre-XDR-TB & XDR-TB
  • Reports mutation-level TB-Profiler resistance evidence, including drug, gene, mutation, confidence & evidence fields
  • Filters MTBC-supported samples for downstream phylogenomics

Variant calling, SNP analysis & clustering

  • Performs Snippy-based reference-guided per-sample variant calling
  • Extracts mean depth & generates per-sample variant summaries
  • Generates core-genome SNP alignments using Snippy-core
  • Reports non-synonymous mutations in key TB drug-resistance-associated genes
  • Estimates pairwise SNP distances from MTBC core-genome alignments
  • Interprets SNP clusters using configurable SNP-distance thresholds
  • Generates SNP distance heatmaps for genomic relatedness assessment

Phylogenomics & visualization

  • Summarizes & visualizes lineage distributions
  • Supports optional recombination filtering using Gubbins
  • Performs maximum-likelihood phylogenetic inference using IQ-TREE2
  • Generates ETE3-based phylogenetic tree visualizations with lineage & resistance metadata

Reporting & reproducibility

  • Produces downloadable QC filtering rationale & surveillance metadata TSV outputs
  • Generates an integrated interactive HTML report suitable for GitHub Pages deployment
  • Uses Dockerized modular WDL/Cromwell execution for reproducible analysis

Repository structure

rMAP-TB/
├── README.md
├── LICENSE
├── .dockstore.yml
├── .gitignore
├── rMAP_TB.wdl
├── examples/
│   └── inputs.example.json
├── resources/
│   ├── adapters.fa
│   ├── H37Rv.gb
│   └── README.md
└── docs/
    ├── index.html
    ├── DEPLOYMENT.md
    ├── reports/
    │   ├── small_dataset/
    │   ├── medium_dataset/
    │   └── large_dataset/
    └── assets/
        ├── workflow/
        ├── images/
        └── css/

Requirements

Requirement Purpose
Java Required to run the Cromwell workflow engine
Cromwell Executes the WDL workflow locally or on supported backends
Docker Runs the containerized bioinformatics tools used by each WDL task
Paired-end Illumina FASTQ files Primary input sequencing data for trimming, QC, species typing, TB-Profiler & variant calling
Adapter FASTA file Required for Trimmomatic adapter trimming
MTBC GenBank reference Required for Snippy reference-guided variant calling & Snippy-core alignment
Kraken2/Bracken Mycobacteria database Required for Mycobacteria species typing; embedded in the workflow Docker image if using the recommended container
TB-Profiler database Required for MTBC lineage & drug-resistance profiling; provided within the TB-Profiler container
Sufficient local compute resources Needed for read processing, variant calling, SNP alignment, recombination filtering, phylogeny & HTML report generation

Main workflow inputs

Input Description
input_reads Array of paired-end Illumina FASTQ files, ordered as R1 followed immediately by the matching R2 file
adapters Adapter FASTA file used by Trimmomatic during read trimming
mtbc_reference_genbank MTBC reference genome in GenBank format for Snippy variant calling & core-SNP alignment
do_trimming Enables adapter trimming & read preprocessing
do_quality_control Enables FastQC quality assessment & MultiQC aggregation
do_species_typing Enables Mycobacteria species typing using Kraken2 + Bracken
do_tb_profiler Enables TB-Profiler-based MTBC species, lineage, sub-lineage & drug-resistance profiling
do_phylogeny Enables MTBC-only SNP phylogenomics, including Snippy, Snippy-core, IQ-TREE2 & tree visualization
use_gubbins Enables optional recombination filtering before phylogenetic reconstruction
tbprofiler_docker Docker image used for TB-Profiler AMR & lineage profiling
species_typing_docker Docker image used for Kraken2 + Bracken Mycobacteria species typing
snippy_reference_type Reference format used by Snippy; use genbank when providing a GenBank reference
iqtree2_model IQ-TREE2 nucleotide substitution model used for maximum-likelihood phylogeny
iqtree2_bootstraps Number of bootstrap replicates used for phylogenetic support estimation
min_mtbc_samples_for_tree Minimum number of MTBC-positive samples required to proceed with tree reconstruction
likely_transmission_snp_threshold SNP-distance threshold for identifying genomically close sample pairs requiring epidemiological review
possible_transmission_snp_threshold SNP-distance threshold for identifying intermediate-distance sample pairs requiring metadata review
tb_drug_resistance_genes Comma-separated list of TB drug-resistance-associated genes used for non-synonymous mutation reporting
tree_title Title displayed on the rendered MTBC phylogenetic tree
tree_image_format Output format for the ETE3-rendered phylogenetic tree image

Example input JSON

An example Cromwell input file is provided here:

examples/inputs.example.json

The input FASTQ files must be ordered like this:

  "rMAP_TB.input_reads": [
    "~/sample1_1.fastq.gz",
    "~/sample1_2.fastq.gz",
    "~/sample2_1.fastq.gz",
    "~/sample2_2.fastq.gz"
  ],

  "rMAP_TB.adapters": "~/adapters.fa",
  "rMAP_TB.mtbc_reference_genbank": "~/H37Rv.gb",

  "rMAP_TB.do_trimming": true,
  "rMAP_TB.do_quality_control": true,
  "rMAP_TB.do_species_typing": true,
  "rMAP_TB.do_tb_profiler": true,
  "rMAP_TB.do_phylogeny": true,
  "rMAP_TB.use_gubbins": true,

  "rMAP_TB.tbprofiler_docker": "staphb/tbprofiler:6.6.6",
  "rMAP_TB.species_typing_docker": "gmboowa/mycobacterium-kraken2-bracken:2026.05",
  "rMAP_TB.snippy_reference_type": "genbank",

  "rMAP_TB.iqtree2_model": "GTR+G",
  "rMAP_TB.iqtree2_bootstraps": 1000,
  "rMAP_TB.min_mtbc_samples_for_tree": 3,

  "rMAP_TB.likely_transmission_snp_threshold": 5,
  "rMAP_TB.possible_transmission_snp_threshold": 12,

  "rMAP_TB.report_nonsynonymous_drug_gene_mutations": true,
  "rMAP_TB.tb_drug_resistance_genes": "rpoB,katG,inhA,fabG1,ahpC,embB,pncA,rpsL,rrs,gyrA,gyrB,eis,ethA,ethR,thyA,folC,alr,ddl,gidB,tlyA,rrl,atpE,rv0678,pepQ",

  "rMAP_TB.max_cpus": 8,
  "rMAP_TB.max_memory_gb": 16,
  "rMAP_TB.min_read_length": 50,
  "rMAP_TB.min_mapping_quality": 20,

  "rMAP_TB.tree_title": "MTBC Core-SNP Phylogeny",
  "rMAP_TB.tree_width": 2400,
  "rMAP_TB.tree_height": 1600,
  "rMAP_TB.tree_image_format": "png"
}

Docker Images Used in the Workflow

Workflow Component Docker Image Purpose
Read trimming quay.io/biocontainers/trimmomatic:0.39--hdfd78af_2 Adapter trimming & read-quality filtering
FastQC staphb/fastqc:0.11.9 Per-sample read-level quality-control assessment
MultiQC ewels/multiqc:latest Aggregated QC reporting across samples
Species typing gmboowa/mycobacterium-kraken2-bracken:2026.05 Mycobacterium species identification using Kraken2 & Bracken
TB-Profiler staphb/tbprofiler:6.6.6 MTBC species, lineage, sub-lineage, drug-resistance prediction & mutation-level resistance evidence
Snippy staphb/snippy:4.6.0 Reference-guided per-sample SNP calling
Snippy-core staphb/snippy:4.6.0 Core-genome SNP alignment generation across MTBC-positive samples
Non-synonymous mutation summary python:3.11-slim Extraction & reporting of non-synonymous mutations in TB drug-resistance-associated genes
Pairwise SNP distance & clustering python:3.11-slim Pairwise SNP distance estimation, reference-sequence exclusion & SNP cluster interpretation
Surveillance summary visuals python:3.11-slim Lineage distribution plots, SNP heatmap generation, QC filtering rationale & surveillance metadata TSV export
Gubbins staphb/gubbins:3.4.1 Optional recombination filtering before phylogenetic reconstruction
IQ-TREE2 gmboowa/iqtree2-python:2.3.4 Maximum-likelihood phylogenetic inference with bootstrap support
Tree visualization gmboowa/ete3-render:1.18 ETE3-based phylogenetic tree rendering with lineage, resistance & bootstrap metadata
Report merging python:3.11-slim Final integrated interactive HTML report generation

Running the workflow

From the repository root:

java -jar cromwell-<version>.jar run rMAP_TB.wdl --inputs ~/inputs.example.json

For example:

java -jar cromwell-92.jar run rMAP_TB.wdl --inputs ~/inputs.example.json

Recommended local Docker resources

For small to moderate MTBC datasets on a local workstation:

CPUs:   8
Memory: 16 GB or higher

For larger datasets, especially when using Gubbins & IQ-TREE2, consider increasing memory & CPU allocation where possible.

Main outputs

rMAP-TB produces modular intermediate outputs and a final integrated HTML surveillance report covering quality control, species typing, TB drug-resistance interpretation, SNP analysis, phylogenomics & public-health reporting.

1. Quality control & read preprocessing

  • Trimmed paired-end FASTQ files
  • FastQC per-sample HTML and ZIP reports
  • MultiQC aggregated quality-control report
  • Trimming summary table
  • QC summary HTML report

2. Mycobacteria species typing and MTBC / non-MTBC routing

  • Kraken2 classification outputs & species-level reports
  • Bracken abundance outputs
  • Mycobacteria species typing TSV & HTML summaries
  • Most probable Mycobacteria species call per sample
  • Evidence supporting species assignment
  • MTBC support status per sample
  • MTBC / non-MTBC Mycobacteria routing summary

3. Non-MTBC Mycobacteria / NTM speciation

  • Non-MTBC Mycobacteria / NTM sample list
  • Most probable NTM species identified per sample
  • Species-level evidence supporting NTM assignment
  • Non-MTBC Mycobacteria species summary TSV & HTML section
  • Rationale for exclusion from MTBC-specific downstream analyses
  • Integrated report output when all samples are non-MTBC Mycobacteria / NTM

4. TB-Profiler, lineage & MTBC filtering

  • TB-Profiler JSON & text outputs for MTBC-supported samples
  • Combined TB-Profiler HTML report
  • TB-Profiler summary TSV
  • MTBC species, lineage & sub-lineage summary
  • WHO-aligned TB drug-resistance profile summary
  • Predicted resistant drugs summary
  • Mutation-level resistance evidence TSV and HTML report
  • MTBC-positive sample list
  • MTBC-filtered FASTQ files for downstream phylogenomics
  • MTBC selection & exclusion rationale

5. Variant calling & core-SNP alignment

  • Per-sample Snippy variant-calling directories
  • Per-sample VCF, aligned FASTA & tabular variant files
  • Snippy logs
  • Variant summary HTML report
  • Mean-depth summary TSV
  • Snippy-core full alignment
  • Snippy-core SNP alignment
  • Core SNP VCF & tabular summary

6. Drug-resistance-associated mutation summaries

  • Non-synonymous mutation TSV summary
  • Non-synonymous mutation HTML report
  • Per-sample collapsible mutation summaries
  • Drug-resistance-associated gene-level mutation reporting

7. Pairwise SNP distance & cluster interpretation

  • Pairwise SNP distance matrix TSV
  • Pairwise SNP distance pairs TSV
  • SNP cluster summary TSV
  • SNP distance cluster HTML report
  • Pairwise SNP heatmap PNG
  • Reference & non-sample sequence exclusion log
  • SNP distance task status log

8. Surveillance summary outputs

  • Lineage distribution TSV & SVG plot
  • SNP distance heatmap SVG
  • QC filtering rationale TSV
  • Surveillance metadata TSV
  • Surveillance summary HTML report
  • Mean depth, MTBC support, lineage, resistance profile & tree-inclusion metadata
  • Combined MTBC & non-MTBC Mycobacteria reporting metadata, where applicable

9. Optional recombination filtering

  • Gubbins recombination-filtered polymorphic-sites alignment
  • Gubbins recombination-filtered final tree
  • Gubbins log files
  • Recombination-filtering status outputs

10. Phylogenetic inference & visualization

  • IQ-TREE2 maximum-likelihood tree file
  • IQ-TREE2 report & log files
  • Bootstrap-supported Newick tree
  • Exportable Newick tree for downstream visualization tools such as iTOL
  • ETE3-rendered MTBC phylogenetic tree image
  • Cleaned tree file used for visualization
  • Tree rendering log

11. Integrated reports & downloadable public-health outputs

  • Final integrated interactive HTML report
  • Run metadata file
  • Downloadable TB surveillance metadata TSV
  • Downloadable QC filtering rationale TSV
  • Downloadable Mycobacteria species typing summary TSV
  • Downloadable non-MTBC Mycobacteria / NTM species summary TSV, where applicable
  • Embedded lineage distribution plot
  • Embedded SNP distance heatmap
  • Embedded MTBC phylogenetic tree
  • Embedded NTM speciation section when non-MTBC Mycobacteria are detected
  • GitHub Pages-compatible report outputs Example final report output:
integrated_report.html

GitHub Pages report site

https://gmboowa.github.io/rMAP-TB/

Interpretation guidance

The integrated report should be interpreted using multiple complementary layers of genomic, resistance, quality-control & epidemiological evidence:

Key outputs and visualizations

rMAP-TB generates an integrated HTML surveillance report with:

Species typing & sample selection

  • Mycobacteria species typing results
  • MTBC selection & filtering rationale
  • TB-Profiler species, lineage & sub-lineage calls

Drug-resistance interpretation

  • TB-Profiler drug-resistance profiles
  • Mutation-level resistance evidence, including drug, gene, mutation/change, confidence & evidence fields
  • Non-synonymous mutations in key TB drug-resistance-associated genes

Quality control & sequencing metrics

  • Mean depth & sample-level QC indicators
  • Tree-inclusion status & sample filtering notes

SNP analysis & genomic relatedness

  • Pairwise SNP distances between MTBC isolates
  • SNP cluster interpretation using configured SNP-distance thresholds
  • SNP distance heatmap for genomic relatedness assessment

Phylogenomics

  • Core-SNP phylogenetic clustering
  • Bootstrap support values on the phylogenetic tree
  • Recombination-filtered alignment & tree, if Gubbins is enabled

Surveillance metadata

  • Country
  • Year
  • Collection site
  • Sample source
  • Lineage
  • Resistance profile
  • Tree-inclusion status, where available

Close SNP clustering or close placement on a phylogenetic tree should not be interpreted as proof of direct transmission on its own. Transmission interpretation should be made only after considering epidemiological linkage, sampling density, collection dates, geography, lineage, resistance profile, sequence quality, SNP distances & bootstrap support.

Suggested citation / acknowledgement

If you use this workflow, please cite or acknowledge the associated manuscript:

rMAP-TB: a reproducible WDL/Cromwell workflow for Mycobacterium tuberculosis complex genomic surveillance and drug-resistance interpretation.

License

▪ MIT License for permissive open-source reuse

About

A robust, Dockerized WDL/Cromwell workflow for Mycobacterium tuberculosis complex genomic surveillance, integrating read trimming, QC, species typing, TB-Profiler resistance & lineage profiling, NTM/MTBC identification, mutation evidence reporting, core-SNP phylogenomics, SNP distance analysis & interactive HTML reporting for TB AMR surveillance.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages