Heamanthus_pumilo_hifiasm.log
Hi there,
We are attempting to assemble the genome of the plant Haemanthus pumilo using Hifiasm. The estimated genome size is approximately 30 Gb and our HiFi sequencing coverage is only about 10×, so reducing the input dataset is not a practical option.
The assembly progresses substantially but consistently runs out of memory before reaching the error-correction stage. This is a critical checkpoint because reaching it would allow the assembly to be resumed after hitting the wall-time limit on our HPC system. Unfortunately, the job exhausts all available RAM (1 TB) before this stage is completed.
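For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate of the raw read volume. This is only a sketch under our own assumptions: the 2-bit and 8-bit per-base encodings are illustrative, the 1 TB is taken as a decimal terabyte, and none of this models Hifiasm's actual internal data structures (overlaps, indexes, and so on), which we do not know in detail.

```python
# Back-of-the-envelope read volume; NOT a model of Hifiasm's real memory use.
GENOME_SIZE_BP = 30_000_000_000   # ~30 Gb estimated genome size
COVERAGE = 10                     # ~10x HiFi coverage
RAM_BYTES = 1_000_000_000_000     # 1 TB node, assumed decimal

def total_read_bases(genome_size_bp: int, coverage: int) -> int:
    """Total sequenced bases implied by genome size and coverage."""
    return genome_size_bp * coverage

def footprint_bytes(total_bases: int, bits_per_base: int) -> int:
    """Bytes needed to hold the bases alone at a given encoding width."""
    return total_bases * bits_per_base // 8

total = total_read_bases(GENOME_SIZE_BP, COVERAGE)
packed = footprint_bytes(total, 2)   # hypothetical 2-bit packed encoding
plain = footprint_bytes(total, 8)    # hypothetical 1 byte per base

print(f"total bases:     {total:,}")
print(f"2-bit packed:    {packed / 1e9:.0f} GB ({packed / RAM_BYTES:.1%} of RAM)")
print(f"1 byte per base: {plain / 1e9:.0f} GB ({plain / RAM_BYTES:.1%} of RAM)")
```

Even under these simplified assumptions the reads alone fit within 1 TB, so we assume the memory pressure comes from the additional structures built on top of them.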
We understand that Hifiasm's memory requirements are expected to increase with genome size, particularly for very large and repetitive plant genomes. However, we would appreciate any guidance on the following:
- Are there any recommended parameters, strategies, or workflow modifications that could reduce memory usage for genomes of this scale without sacrificing already limited coverage?
- Is support for genomes in the ~30 Gb range considered within the intended scope of Hifiasm, or are we approaching practical memory limits of the current implementation?
- More generally, are there any plans or prospects for future versions of Hifiasm to support distributed-memory execution across multiple compute nodes?
Our HPC center has noted that sourcing sufficiently large single-node memory systems is becoming increasingly challenging and expensive. We are aware that distributed-memory approaches (e.g., PGAS-based frameworks and distributed hash-table implementations) exist for applications with very large in-memory data structures, and we were curious whether such approaches have ever been considered for Hifiasm or if there are architectural constraints that would make this impractical.
Any advice or insight would be greatly appreciated.
Thank you in advance.