Here's a structured plan for your Node.js/TypeScript CLI document-to-PDF converter:
- Core Architecture
interface CrawlerConfig {
maxConcurrency: number;
sameDomain: boolean;
outputDir: string;
}
interface PageData {
url: string;
content: string;
links: string[];
}- Implementation Phases
A. CLI Setup
- Use
commanderfor argument parsing - Required arguments:
--url,--output - Optional:
--concurrency(default: 10)
B. Enhanced Crawler
class ParallelCrawler {
private visited = new Set<string>();
private queue = new PQueue({ concurrency: config.maxConcurrency });
async crawl(url: string): Promise<PageData[]> {
// Implementation with:
// 1. Domain validation using URL module
// 2. Fetch with timeout (5s)
// 3. Link extraction with cheerio
// 4. Automatic queue management
}
}C. Conversion Pipeline
-
HTML → Markdown:
turndownwith custom rules -
Content aggregation:
- Frontmatter with source URL
- Section headers with page titles
-
PDF Generation:
md-to-pdfwith table of contents -
Key Optimizations
- Link deduplication with bloom filter
- Connection reuse with
undicifetch - Memory management for large documents
- Parallel processing pipeline:
Crawl → Convert → Aggregate (parallel stages)
- Dependencies
{
"dependencies": {
"commander": "CLI parsing",
"p-queue": "Parallel control",
"turndown": "HTML→Markdown",
"md-to-pdf": "PDF generation",
"cheerio": "HTML processing",
"undici": "High-performance fetch"
}
}- Validation Plan
- Test matrix: Static sites, SPAs, paginated content
- Benchmark: 100-page doc under 15s (non-IO bound)
- Failure modes: Redirects, auth walls, robots.txt
- Execution Flow
$ doc-export --url https://example.com/docs --output manual.pdf
[1/4] Initializing crawler (10 parallel workers)
[2/4] Found 42 pages in 3.2s
[3/4] Converted 42 pages to Markdown (1.8MB)
[4/4] PDF generated: manual.pdf (82 pages)
Done in 4.9s
- Error Safeguards
- Network: Retry with exponential backoff
- Memory: Chunked PDF rendering
- Content: Sanitization before conversion
- Scaling Considerations
- Cluster mode for >1k pages
- Partial export resume capability
- CDN cache integration points
Implementation Time Estimate: 6-8 hours (production-grade)
Would you like me to elaborate on any specific component or provide sample code for critical sections?