Configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — geographic filtering, two-pass deduplication, data validation, and professionally formatted three-sheet Excel export. Point it at a new API with a config change. No code edits required.
⚠️ API access required. This tool fetches data from an API endpoint you configure — it does not scrape HTML pages. You need an API URL that returns JSON records. Seeconfig.yaml.examplefor the full configuration reference.
Found this useful? A ⭐ on GitHub helps other developers find it.
- Preview
- What It Does
- Use Cases
- How It Works
- Features
- Performance
- What Data You Get
- Quick Start
- Configuration Reference
- Output
- Runtime Controls
- Auto-Protection Features
- Resuming a Run
- Extending
- Tech Stack
- Project Structure
- Running Tests
- Troubleshooting
- Part of the B2B Lead Toolkit
- License
| Terminal — live progress bar | Excel Output — three sheets |
|---|---|
![]() |
![]() |
- Reads config.yaml — one file controls the API endpoint, pagination mode, geographic bounds, field mapping, validation rules, and output format.
- Fetches all records from the configured JSON API via paginated GET or POST requests, with configurable inter-page delay and request timeout.
- Filters by geography using a lat/lng bounding box — keeps only records whose coordinates fall within the configured region.
- Deduplicates in two passes — first by unique ID field, then by Name + Postcode composite key.
- Validates each record against configurable rules (name length, postcode requirement, postcode regex, or any custom pattern).
- Exports to Excel — a styled Data sheet (clean records), a Flagged sheet (failed validation), and a Summary sheet (run statistics).
Everything is config-driven. The Python source files contain zero API-specific strings — every field path, URL, and cleaning rule lives in config.yaml. Switching to a new directory API requires only a new config file, not a code change.
| Who uses it | What they do | Example config |
|---|---|---|
| Property data analysts | Harvest national directory APIs filtered to a city bounding box | geo_filter: {enabled: true, lat_min: 51.4, lat_max: 51.6} |
| Lead gen teams | Extract and validate business records from membership directory APIs | Any B2B directory with a JSON API |
| Market researchers | Pull structured datasets from industry directories for analysis | Configure once, re-run weekly for fresh data |
| CRM admins | Automate nightly imports from a members or listings API into Excel | Schedule with cron or Task Scheduler |
| Data engineers | Use as a lightweight ETL layer for JSON API → Excel pipelines | Extend exporter.py for custom output formats |
| Developers | Adapt to any new JSON directory API in minutes using a new config.yaml | Copy config.yaml.example, fill in 5 fields, run |
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1 — Fetch & Prepare │
│ │
│ config.yaml ──► API endpoint (POST or GET) │
│ │ paginated, configurable response path │
│ ▼ │
│ raw_records[] │
│ │ │
│ ├──► geo_filter() ← lat/lng bounding box │
│ ├──► dedup_records() ← ID, then name+post │
│ └──► checkpoint.save() ← Phase 1 done │
└──────────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────┐
│ PHASE 2 — Process Records │
│ │
│ records[] ──► extract_row() ← field mapping, HTML strip, │
│ phone normalise, https prefix │
│ │ │
│ ├──► validate_row() ──► CLEAN or FLAGGED │
│ └──► checkpoint.save() every N records │
└──────────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────────▼──────────────────────────────────┐
│ PHASE 3 — Export │
│ │
│ clean_rows[] ──► Excel Sheet 1: Data │
│ flagged_rows[] ──► Excel Sheet 2: Flagged (with Flag Reason) │
│ run stats{} ──► Excel Sheet 3: Summary │
│ │
│ {prefix}_{YYYYMMDD}.xlsx + rotating .log file │
└─────────────────────────────────────────────────────────────────┘
| Feature | Detail |
|---|---|
| Zero-code API targeting | Point at any JSON directory API by editing config.yaml — no Python changes needed |
| POST and GET | Configurable HTTP method per endpoint |
| Nested JSON navigation | Dot-path response_path traverses any response structure: ["data", "results"] → response["data"]["results"] |
| Flexible pagination | page_in_path: false (adds page as payload param) · page_in_path: true (substitutes {page} in URL path) |
| Geographic bounding-box filter | Optional lat/lng filter — restrict output to any city, region, or country |
| Two-pass deduplication | Pass 1: unique ID · Pass 2: Name + Postcode composite key |
| Configurable validation | Name length, postcode requirement, postcode regex — all config-driven; failures go to Flagged sheet |
| Atomic checkpoint writes | Write-to-temp-then-rename — checkpoint survives a crash mid-write |
| Three-sheet Excel export | Data · Flagged · Summary — styled, dated, named from config |
| Checkpoint / resume | Atomic JSON saves every N records — resume after any interruption with zero re-fetching |
| Interactive keyboard controls | P pause · R resume · S status · Q quit — no Enter needed on any platform |
| Auto-protection | Stop time · low-disk guard · consecutive-failure cap · retry queue |
| Rotating log file | 5 MB cap, 3 backups — full DEBUG to file, clean INFO to console |
| Dry-run mode | --dry-run reports record counts without writing any files |
| Environment variable overrides | SCRAPER_API_URL, SCRAPER_API_KEY — secrets never in config.yaml |
| Ruff linting in CI | Linter enforced on every push across Ubuntu, Windows, and macOS |
| 102 pure-function tests | Full test suite runs offline in under 3 seconds — no API key needed |
| Dataset size | Records fetched | Processing | Time |
|---|---|---|---|
| Small directory | 200–500 records | Geo-filter + dedup + validate | Under 2 min |
| Medium directory | 1,000–5,000 records | Full pipeline | 5–15 min |
| Large directory | 10,000+ records | Resumable overnight run | 1–4 hours |
Actual speed depends on the target API's response time per page. The tool adds a configurable inter-page delay (
inter_page_delayinruntime:) to stay polite. Processing (filtering, dedup, validation) adds negligible overhead on top of API fetch time.
| Field | Example |
|---|---|
| Name | Acme Property Management Ltd |
| Phone | 0161 234 5678 |
| Website | https://www.acme-property.co.uk |
| Postcode | M1 1AA |
| Category | Property Management |
| Source | Directory API |
See Assets/sample_output.csv for realistic sample output.
git clone https://github.com/FAAQJAVED/json-directory-harvester.git
cd json-directory-harvester
pip install -r requirements.txtcp config.yaml.example config.yamlEdit config.yaml — set your API URL, response path, field mapping, geo bounds, and output preferences. Every option is annotated in the example file.
cp .env.example .envAdd your API key to .env. It is injected at runtime via SCRAPER_API_KEY and never stored in config.yaml.
# Standard run
python scraper.py
# Dry run — reports counts without writing files
python scraper.py --dry-run
# Use a different config file
python scraper.py --config my_config.yaml
# Start fresh (clears any saved checkpoint)
python scraper.py --reset| Key | Description |
|---|---|
url |
Full API endpoint URL. Can be overridden with SCRAPER_API_URL |
method |
POST or GET |
headers |
Dict of request headers (User-Agent, Content-Type, Referer, Origin, Authorization) |
payload |
POST body / GET query params |
response_path |
List of keys to navigate to the records list: ["data", "results"] |
| Key | Description |
|---|---|
enabled |
Set true to enable multi-page fetching |
page_param |
Payload key for the page number (used when page_in_path: false) |
max_pages |
Hard ceiling on total pages fetched |
page_in_path |
If true, substitute {page} in the URL instead of adding a payload param |
| Key | Description |
|---|---|
enabled |
Toggle geographic filtering |
lat_min/max, lng_min/max |
Bounding box coordinates |
lat_field, lng_field |
Dot-path to lat/lng in each record (e.g. "coordinates.latitude") |
Maps the scraper's logical field names (id, name, phone, website, postcode) to the actual key names in the API's JSON records.
| Key | Description |
|---|---|
directory |
Output folder for all generated files |
filename_prefix |
Prefix for Excel and log filenames |
category |
Value written to the Category column |
source |
Value written to the Source column |
| Key | Description |
|---|---|
min_name_length |
Records with shorter names are flagged |
require_postcode |
Flag records with no postcode |
postcode_regex |
Regex pattern postcodes must match (blank = disabled) |
| Key | Default | Description |
|---|---|---|
stop_at |
"23:00" |
HH:MM auto-stop time |
save_every |
10 |
Checkpoint every N records |
progress_every |
10 |
Log progress every N records |
request_timeout |
15 |
HTTP timeout in seconds |
low_disk_mb |
500 |
Auto-pause threshold (MB free) |
max_consec_fail |
3 |
Auto-pause after N consecutive failures |
inter_page_delay |
0.5 |
Seconds to wait between paginated API requests |
| Sheet | Contents |
|---|---|
| Data | Clean, validated records — frozen header row, navy header, alternating row shading, auto-width columns |
| Flagged | Records that failed validation, each annotated with a Flag Reason column |
| Summary | Run metadata: date, record counts, elapsed time, source endpoint, version, status |
Rotating log file (5 MB max, 3 backups). Full DEBUG output with timestamps to file. Clean INFO-level only to console.
While the scraper is running, press a key (no Enter needed on any platform):
| Key | Action |
|---|---|
P |
Pause processing |
R |
Resume after pause |
S |
Print current status (progress bar, counts, rate, ETA) |
Q |
Quit cleanly — saves checkpoint for resumption |
| Feature | Trigger | Behaviour |
|---|---|---|
| Stop time | Configured HH:MM | Saves checkpoint and exits cleanly |
| Low disk guard | Free disk < low_disk_mb MB |
Auto-pauses; resumes when R is pressed |
| Consecutive failure cap | N failures in a row | Auto-pauses; resets counter on resume |
| Retry queue | Any record-level exception | Failed records retried once after main loop |
If a run is interrupted — by Q, stop time, disk guard, or any crash after Phase 1 completes — the checkpoint file is saved automatically. The next run detects it and resumes Phase 2 from where it stopped. Phase 1 (API fetching) is never repeated.
python scraper.py # automatically resumes if checkpoint exists
python scraper.py --reset # discard checkpoint and start fresh| Goal | Where to change |
|---|---|
| Add a new output column | processor.extract_row() and exporter.DATA_FIELDS |
| Add a new validation rule | processor.validate_row() |
| Add a new field normaliser | New function in processor.py alongside strip_html() |
| Support a new auth scheme | config._apply_env_overrides() |
| Add a new runtime protection | Top of Phase 2 loop in scraper.py |
| Library | Role |
|---|---|
requests |
HTTP GET/POST to the configured API endpoint |
pyyaml |
YAML config loading |
openpyxl |
Three-sheet Excel output (Data · Flagged · Summary) |
python-dotenv |
Optional — loads SCRAPER_API_URL and SCRAPER_API_KEY from .env |
json-directory-harvester/
├── scraper.py # CLI entry point — three-phase orchestrator
├── fetcher.py # Paginated HTTP fetching (POST/GET, two pagination modes)
├── processor.py # Geo filter, dedup, field extraction, validation (pure functions)
├── exporter.py # Three-sheet Excel workbook builder
├── checkpoint.py # Atomic JSON checkpoint save/load/clear
├── controls.py # Cross-platform keyboard listener and audio feedback
├── config.py # YAML loader with validation and env-var overrides
├── config.yaml.example # Fully annotated configuration template
├── .env.example # Secret injection template
├── requirements.txt # Runtime dependencies
├── requirements-dev.txt # Development and testing dependencies
├── pyproject.toml # Package metadata and tool configuration
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guidelines
├── LICENSE # MIT
├── Assets/
│ ├── terminal_progress.png # Screenshot — live progress bar
│ ├── output_preview.png # Screenshot — Excel output (three sheets)
│ └── sample_output.csv # Sample output rows for reference
└── tests/
├── __init__.py
├── test_processor.py # 44 tests — all pure functions
├── test_fetcher.py # 20 tests — mocked HTTP, both pagination modes
├── test_checkpoint.py # 15 tests — save/load/clear/atomic write
├── test_config.py # 12 tests — load_config, env overrides, validation
└── test_exporter.py # 11 tests — schema constants, Excel file generation
pip install -r requirements-dev.txt
pytest tests/ -v --cov=. --cov-report=term-missing102 tests. No API key required. Full suite runs offline in under 3 seconds.
- Python 3.9+
requests— HTTP fetchingpyyaml— YAML config loadingopenpyxl— Excel workbook outputpython-dotenv—.envsecret injection
See requirements.txt for pinned minimum versions.
"Config file not found: config.yaml"
Copy the annotated example: cp config.yaml.example config.yaml
Then open it and replace all YOUR_* placeholder values.
API returning zero records
Wrong response_path. Open browser DevTools → Network tab → find the API call → inspect the JSON structure. If records are at {"data": {"results": [...]}}, set response_path: ["data", "results"]. Use --dry-run first to confirm counts.
Checkpoint not resuming
Re-run python scraper.py — checkpoint is detected automatically.
To start fresh: python scraper.py --reset
Excel output locked / PermissionError
Close the previous output file in Excel before running. Excel holds an exclusive lock on open .xlsx files.
Keyboard controls not responding on macOS
macOS requires Accessibility permissions for raw keypress reading. System Settings → Privacy & Security → Accessibility → add your terminal app. Restart the terminal after granting permission.
Inter-page delay too fast — API is rate-limiting
Increase inter_page_delay in your config.yaml under runtime:.
Default is 0.5 seconds. For strict APIs, try 2.0 or higher:
runtime:
inter_page_delay: 2.0This tool is one component of a broader B2B lead generation pipeline targeting UK property management companies, letting agents, block managers, and HMO landlords.
| Repo | What it does |
|---|---|
| JSON Directory Harvester ← you are here | Configurable harvester for any JSON directory API |
| Google Maps Business Scraper | Extracts and enriches business listings from Google Maps |
| Email Phone Enrichment Tool | Converts a website list into a verified email + phone database |
| HTML Directory Scrapers | Two-engine toolkit for HTML and WordPress AJAX directories |
| LeadHunter Pro | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
| Trustpilot Business Scraper | Extracts business contact data from Trustpilot search results |
All five tools share the same Excel output schema (Data + Summary sheets) — results can be combined directly in Excel or imported together into a CRM.

