Target eukaryotic genome assembly from contaminated PacBio HiFi read sets.
target-asm is a Nextflow DSL2 workflow for recovering target eukaryotic genomes from mixed long-read sequencing libraries. It is intended for samples where the organism of interest is sequenced together with host, symbiont, microbial, environmental, or culture-associated DNA.
The workflow is reference-independent with respect to the target genome. It first assembles the full read set as a metagenome, removes non-target contigs with NCBI FCS-GX, recruits original reads that support the target-enriched draft, optionally downsamples those reads, reassembles with hifiasm, and runs a final FCS-GX screen.
The motivating benchmark was contaminated PacBio HiFi sequencing of obligate biotrophic oomycetes, but the strategy is not oomycete-specific. Any eukaryotic target with an appropriate NCBI taxonomy ID for FCS-GX can be used.
- metaMDBG assembles the input HiFi reads as a metagenome.
- FCS-GX removes contigs outside the requested target taxon.
- minimap2 maps the original HiFi reads back to the cleaned draft.
- samtools extracts mapped reads for target-enriched reassembly.
- rasusa optionally downsamples mapped reads to a target base count.
- hifiasm reassembles the recruited reads.
- FCS-GX runs a final contamination screen.
- Optional Compleasm and QUAST reports track assembly quality across steps.
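The read-recruitment step in the list above (minimap2 mapping followed by samtools extraction) can be sketched as a shell function. This is a minimal sketch, not the workflow's actual process definition: the function name `recruit_reads`, the `map-hifi` preset, and the `-F 0x904` filter are assumptions chosen for illustration.

```shell
# Hypothetical helper: recruit reads that map to a cleaned draft assembly.
# Usage: recruit_reads <draft.fasta> <reads.fastq.gz> <out.fastq.gz> [threads]
recruit_reads() {
    draft=$1
    reads=$2
    out=$3
    threads=${4:-24}   # 24 mirrors the documented --threads default

    # map-hifi is minimap2's preset for PacBio HiFi reads.
    # -F 0x904 drops unmapped (0x4), secondary (0x100), and supplementary
    # (0x800) records, so each mapped read is written exactly once.
    minimap2 -t "$threads" -ax map-hifi "$draft" "$reads" \
        | samtools fastq -F 0x904 - \
        | gzip -c > "$out"
}
```

Called as `recruit_reads cleaned_draft.fasta sample.fastq.gz recruited.fastq.gz 24`, this would write the mapped reads as a gzipped FASTQ file for reassembly. The file names here are placeholders.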
- Nextflow
- One of Docker, Singularity, or Apptainer
- PacBio HiFi reads in fastq.gz format
- An NCBI FCS-GX database
- An NCBI taxonomy ID for the target clade
All workflow tools are configured as containers in nextflow.config.
git clone https://github.com/Yixuan39/target-asm.git
cd target-asm
nextflow run main.nf --help

nextflow run main.nf \
--reads /path/to/sample.fastq.gz \
--gx_db /path/to/fcs-gx-db-prefix \
--tax_id <target_ncbi_tax_id> \
--target_bases <target_base_count> \
--outdir results \
-profile apptainer

Set --tax_id to the NCBI taxonomy ID for the target organism or target clade. Set --target_bases to the expected genome size multiplied by the desired coverage. For example, a 90 Mb genome at 60x coverage gives 5.4e9 bases.
To use all mapped reads for hifiasm reassembly, omit --target_bases.
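The arithmetic behind --target_bases can be checked in the shell before launching a run. The sketch below reuses the 90 Mb and 60x figures from the example; the variable names are illustrative and are not part of the workflow.

```shell
# Illustrative values taken from the example above; replace with your own.
genome_size_bp=90000000   # 90 Mb expected genome size
coverage=60               # desired fold coverage

# Bases to retain = expected genome size x desired coverage.
target_bases=$((genome_size_bp * coverage))
echo "--target_bases $target_bases"   # prints: --target_bases 5400000000
```

The printed value, 5400000000, is the same quantity as 5.4e9 written out as a plain integer.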
Use target-asm when contamination is too complex for read-length filtering or whole-library assembly alone. It is designed for cases where:
- the target is a eukaryote represented by a minority or mixed fraction of reads;
- contaminant reads overlap the target reads in length or quality;
- a close reference genome is unavailable or should not drive assembly;
- taxonomic cleaning plus read recruitment is preferable to manual contig filtering.
For clean single-organism HiFi libraries, running hifiasm directly is usually enough.
| Parameter | Required | Default | Description |
|---|---|---|---|
| --reads | yes | null | PacBio HiFi reads, compressed as .fastq.gz. |
| --gx_db | yes | null | Path prefix for the NCBI FCS-GX database. |
| --tax_id | yes | null | NCBI taxonomy ID used as the target taxon for FCS-GX. |
| --target_bases | no | null | Number of bases to retain with rasusa before hifiasm reassembly. |
| --rasusa_seed | no | 0 | Random seed for rasusa. |
| --threads | no | 24 | CPU threads used by threaded processes. |
| --fcs_gx_memory | no | 512 GB | Memory requested for each FCS-GX process. |
| --hifiasm_option | no | -l 2 | Extra options passed to hifiasm. |
| --outdir | no | . | Output directory for published files. |
| --keep_intermediates | no | false | Publish intermediate assemblies and mapped reads. |
| --quality_library | no | null | Compleasm database path. Enables assembly QC when supplied. |
| --quality_lineage | with QC | '' | Compleasm lineage name. |
| --help | no | false | Print command-line help. |
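Parameters from the table can also be collected in a file and passed with Nextflow's `-params-file` option, which accepts YAML or JSON. The sketch below writes such a file from the shell. The file name `params.yaml` is arbitrary, the paths are placeholders, and the tax_id and target_bases values are simply the benchmark and coverage examples used elsewhere in this document.

```shell
# Write a Nextflow params file; each key matches a --<name> parameter
# from the table above. Paths are placeholders to be replaced.
cat > params.yaml <<'EOF'
reads: /path/to/sample.fastq.gz
gx_db: /path/to/fcs-gx-db-prefix
tax_id: 4762
target_bases: 5400000000
outdir: results
EOF
```

The run could then be launched with `nextflow run main.nf -params-file params.yaml -profile apptainer`, which keeps the parameter set under version control alongside the results.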
Add Compleasm and QUAST summaries to a full workflow run:
nextflow run main.nf \
--reads /path/to/sample.fastq.gz \
--gx_db /path/to/fcs-gx-db-prefix \
--tax_id <target_ncbi_tax_id> \
--target_bases 5.4e9 \
--quality_library /path/to/compleasm_db \
--quality_lineage <busco_lineage> \
--outdir results \
-profile apptainer

When QC is enabled, target-asm evaluates the metaMDBG assembly, the first FCS-GX-cleaned assembly, the hifiasm assembly, and the final FCS-GX-cleaned assembly.
With the default settings, the main output is:
results/
  <sample>.fasta.gz
  fcs_gx/
    fcs_initial.fcs_gx_report.txt
    fcs_final.fcs_gx_report.txt
When --quality_library and --quality_lineage are provided:
results/
  quality/
    quality_trace.csv
    quality_final.csv
When --keep_intermediates is set, target-asm also publishes intermediate outputs under metaMDBG/, minimap2/, rasusa/, hifiasm/, and fcs_gx/.
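After a run, the default outputs listed above can be checked with a short shell function. This is a sketch under two assumptions: the helper name `check_outputs` is hypothetical, and the two FCS-GX reports are taken to sit inside the fcs_gx/ directory under the output directory.

```shell
# Hypothetical helper: confirm that the default outputs exist.
# Usage: check_outputs <outdir> <sample>
check_outputs() {
    outdir=$1
    sample=$2
    missing=0
    for f in \
        "$outdir/$sample.fasta.gz" \
        "$outdir/fcs_gx/fcs_initial.fcs_gx_report.txt" \
        "$outdir/fcs_gx/fcs_final.fcs_gx_report.txt"
    do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

For example, `check_outputs results sample` returns a non-zero status and names each absent file on standard error if the run did not finish cleanly.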
The workflow includes these profiles:
| Profile | Description |
|---|---|
| standard | Local execution. |
| slurm | SLURM execution. |
| docker | Enable Docker containers. |
| singularity | Enable Singularity containers with automounts. |
| apptainer | Enable Apptainer containers with automounts. |
Profiles can be combined, for example -profile slurm,apptainer, when running on a SLURM cluster. On SLURM systems, the slurm profile is recommended because only the FCS-GX steps require high-memory nodes, while the remaining workflow steps can run with ordinary scheduler resources.
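Choosing a profile combination can be scripted when the same launch script is used on several machines. The helper below is a hypothetical sketch, not part of target-asm: it assumes that finding a runtime or `sbatch` on the PATH is a good enough signal, and the order in which runtimes are tried is arbitrary.

```shell
# Hypothetical helper: build a -profile value from what is installed.
pick_profile() {
    runtime=""
    # Try Apptainer, then Singularity, then Docker (arbitrary preference).
    for candidate in apptainer singularity docker; do
        if command -v "$candidate" >/dev/null 2>&1; then
            runtime=$candidate
            break
        fi
    done
    if [ -z "$runtime" ]; then
        echo "no container runtime found" >&2
        return 1
    fi
    # Add the slurm profile only when the SLURM submit command is present.
    if command -v sbatch >/dev/null 2>&1; then
        echo "slurm,$runtime"
    else
        echo "$runtime"
    fi
}
```

A launch script could then pass the result straight to Nextflow, for example `nextflow run main.nf ... -profile "$(pick_profile)"`, with the remaining parameters supplied as in the commands above.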
- NCBI recommends 512 GiB shared memory for FCS-GX with the standard database; running below this can be extremely slow. target-asm requests --fcs_gx_memory '512 GB' by default.
- target-asm removes non-target taxonomic contamination, but target-derived organellar contigs may remain and should be handled downstream if nuclear-only assemblies are required.
- For the downy mildew benchmark, Oomycota was used as the target clade (--tax_id 4762) and stramenopiles was used for Compleasm QC. For other targets, choose the matching NCBI taxon and Compleasm/BUSCO lineage.
