A lightweight, stealthy web scraping tool powered by Obscura (headless browser) that crawls websites using BFS or DFS algorithms and extracts structured content.
- Features
- Installation
- Usage
- CLI Commands
- API Documentation
- Configuration
- Output Structure
- Examples
- Limitations
- Development
- Contributing
- License
- 🔍 Intelligent Crawling: Choose between BFS (Breadth-First Search) or DFS (Depth-First Search) algorithms
- 📝 Content Extraction: Extracts metadata, headings, paragraphs, links, images, and all text content
- 🎯 Domain-Scoped: Only crawls internal links within the same domain
- 🚀 Interactive CLI: User-friendly command-line interface with input validation
- 💾 Multiple Output Formats: Save as Markdown, JSON, or CSV
- 🏷️ Custom Tag Selectors: Limit extraction to specific HTML tags or CSS selectors
- ⏱️ Configurable Delay: Set delay between requests to avoid overwhelming servers
- 🕵️ Stealth & Anti-Detection: Masks automation flags, spoofs navigator properties, realistic User-Agent
- 🔒 Tracker Blocking: Blocks analytics domains and unnecessary resource types for faster, private crawling
- 🔄 Duplicate Prevention: Tracks visited URLs to avoid redundant scraping
- 🎨 SEO Metadata: Extracts Open Graph, Twitter Cards, and other meta tags
- ⏱️ Timeout Handling: Built-in timeout management for unresponsive pages
- 🪶 Lightweight Browser: Uses Obscura — a lightweight headless browser via CDP (no heavy Chromium download needed)
npm install -g @harshvz/crawlernpm install @harshvz/crawlergit clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run build
npm install -g .Note: This package uses Obscura — a lightweight headless browser connected via CDP. It is significantly faster and smaller than Chromium. The only limitation is that it does not support screenshots (headless-only).
Simply run the command and follow the prompts:
# Primary command (recommended)
crawler
# Alternative (for backward compatibility)
scraperYou'll be prompted to enter:
- URL: The website URL to scrape (e.g.,
https://example.com) - Algorithm: Choose between
bfsordfs(default: bfs) - Format: Output format —
md,json, orcsv(default: md) - Depth: Maximum crawl depth (-1 for infinite)
- Delay: Milliseconds to wait between requests (0 for none)
- Tags: Custom HTML tags or CSS selectors to extract (comma-separated, blank for all)
- Output Directory: Custom save location (default:
~/knowledgeBase)
# Show version
crawler --version
crawler -v
# Show help
crawler --help
crawler -hNote: Both
crawlerandscrapercommands work identically. We recommend usingcrawlerfor new projects.
import Scraper from '@harshvz/crawler';
const scraper = new Scraper('https://example.com', {
depth: 2,
format: 'md',
delay: 500,
tags: 'h1, p, a, img',
});
// Using BFS
const results = await scraper.bfs('/');
// Using DFS
const results = await scraper.dfs('/');
await scraper.close();# Run in development mode with auto-reload
npm run dev
# Build the project
npm run build
# Start the built version
npm startThe main class that orchestrates crawling, content extraction, and file storage.
new Scraper(website: string, options?: ScraperOptions)Parameters:
website(string): The base URL of the website to scrapeoptions(ScraperOptions, optional):depth(number): Maximum depth relative to base URL (-1 = infinite, default: -1)format("md" | "json" | "csv"): Output format (default: "md")delay(number): Milliseconds between requests (default: 0)outputPath(string): Output directory (default: ~/knowledgeBase)tags(string): Comma-separated tags/selectors to extract (default: all)selectors(string[]): Additional CSS selectors for extra CSV output
Crawls the website using Breadth-First Search algorithm.
Crawls the website using Depth-First Search algorithm.
Closes the browser and cleans up all resources.
Extracts structured content from a Playwright Page instance.
import { ContentExtractor } from '@harshvz/crawler';
const extractor = new ContentExtractor(page);
const details = await extractor.getBasicDetails();
const content = await extractor.getStructuredContent('h1, p, a');getBasicDetails()— Returns title, description, robots, OG/Twitter metadatagetStructuredContent(customSelector?)— Extracts all content with attributes (href on links, src/alt on images)getContentBySelectors(selectors[])— Extracts text from custom CSS selectorstoJson(data)— Format as JSON stringtoMarkdown(metadata, content)— Format as Markdown stringtoCsv(data)— Format as CSV string
Handles file output with automatic directory creation.
import { FileService } from '@harshvz/crawler';
const fs = new FileService('./output');
fs.saveJson(url, endpoint, data);
fs.saveMarkdown(url, endpoint, content);
fs.saveCsv(url, endpoint, content);Choose between md, json, or csv. Each format includes:
- Markdown (
.md): Rich text with headings, links, and images - JSON (
.json): Full structured data with metadata and content arrays - CSV (
.csv): Tabular format with tag, text, href, src, and alt columns
By default, all content tags are extracted (h1-h6, p, a, img, li, code, pre, etc.). You can limit extraction to specific tags or CSS selectors:
const scraper = new Scraper('https://example.com', {
tags: 'h1, .product-title, a.product-link, img',
});Set a delay between requests to avoid rate-limiting:
const scraper = new Scraper('https://example.com', {
delay: 500, // 500ms between each page visit
});Depth is calculated relative to the base URL's pathname. If your base URL is https://site.com/blog/post/, then /blog/post/1 is depth 1 and /blog/post/1/2 is depth 2.
By default, all scraped data is stored in:
~/knowledgeBase/
Each website gets its own folder based on its hostname.
~/knowledgeBase/
└── examplecom/
├── home.md # Extracted content from homepage
├── home.json # (if json format selected)
├── home.csv # (if csv format selected)
├── _about.md # Extracted content from /about
└── _contact.md # Extracted content from /contact
Each file includes:
- Page title and URL
- Meta description
- Open Graph tags
- Twitter Card tags
- Extracted text content (with href on links and src/alt on images)
- Robots directives
import Scraper from '@harshvz/crawler';
const scraper = new Scraper('https://docs.example.com');
await scraper.bfs('/');
await scraper.close();const scraper = new Scraper('https://blog.example.com', {
depth: 2,
delay: 1000,
});
await scraper.dfs('/');
await scraper.close();const scraper = new Scraper('https://example.com', {
format: 'json',
tags: 'h1, h2, p, img',
});
await scraper.bfs('/');
await scraper.close();const scraper = new Scraper('https://example.com', {
depth: -1,
outputPath: '/custom/output/path',
});
await scraper.bfs('/');
await scraper.close();- No screenshots: Obscura is a lightweight headless-only browser — screenshot capture is not supported
- Sequential crawling: Pages are processed one at a time (concurrent crawling not yet implemented)
- In-memory queue: Queue is held in memory — very large crawls may exhaust available RAM
- Node.js >= 16.x
- npm >= 7.x
git clone https://github.com/harshvz/crawler.git
cd crawler
npm install
npm run devcrawler/
├── src/
│ ├── index.ts # CLI entry point
│ └── Services/
│ ├── Scraper.ts # Main orchestrator (crawling logic)
│ ├── ContentExtractor.ts # Page content extraction
│ ├── FileService.ts # Output file storage
│ └── BrowserService.ts # Browser lifecycle & stealth
├── dist/ # Compiled JavaScript
├── package.json
├── tsconfig.json
└── README.md
npm run buildThis compiles TypeScript files to JavaScript in the dist/ directory.
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
ISC © Harshvz
- Built with Obscura — lightweight headless browser via CDP
- Browser automation by Playwright
- CLI powered by Inquirer.js
Made with ❤️ by harshvz
