JSON Directory Harvester

Configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — geographic filtering, two-pass deduplication, data validation, and professionally formatted three-sheet Excel export. Point it at a new API with a config change. No code edits required.

⚠️ API access required. This tool fetches data from an API endpoint you configure — it does not scrape HTML pages. You need an API URL that returns JSON records. See config.yaml.example for the full configuration reference.

Found this useful? A ⭐ on GitHub helps other developers find it.

Preview

Terminal — live progress bar	Excel Output — three sheets

What It Does

Reads config.yaml — one file controls the API endpoint, pagination mode, geographic bounds, field mapping, validation rules, and output format.
Fetches all records from the configured JSON API via paginated GET or POST requests, with configurable inter-page delay and request timeout.
Filters by geography using a lat/lng bounding box — keeps only records whose coordinates fall within the configured region.
Deduplicates in two passes — first by unique ID field, then by Name + Postcode composite key.
Validates each record against configurable rules (name length, postcode requirement, postcode regex, or any custom pattern).
Exports to Excel — a styled Data sheet (clean records), a Flagged sheet (failed validation), and a Summary sheet (run statistics).

Everything is config-driven. The Python source files contain zero API-specific strings — every field path, URL, and cleaning rule lives in config.yaml. Switching to a new directory API requires only a new config file, not a code change.

Use Cases

Who uses it	What they do	Example config
Property data analysts	Harvest national directory APIs filtered to a city bounding box	`geo_filter: {enabled: true, lat_min: 51.4, lat_max: 51.6}`
Lead gen teams	Extract and validate business records from membership directory APIs	Any B2B directory with a JSON API
Market researchers	Pull structured datasets from industry directories for analysis	Configure once, re-run weekly for fresh data
CRM admins	Automate nightly imports from a members or listings API into Excel	Schedule with cron or Task Scheduler
Data engineers	Use as a lightweight ETL layer for JSON API → Excel pipelines	Extend `exporter.py` for custom output formats
Developers	Adapt to any new JSON directory API in minutes using a new config.yaml	Copy config.yaml.example, fill in 5 fields, run

How It Works

┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1 — Fetch & Prepare                                      │
│                                                                 │
│  config.yaml ──► API endpoint (POST or GET)                    │
│                      │  paginated, configurable response path   │
│                      ▼                                          │
│                 raw_records[]                                   │
│                      │                                          │
│                      ├──► geo_filter()   ← lat/lng bounding box │
│                      ├──► dedup_records() ← ID, then name+post  │
│                      └──► checkpoint.save()  ← Phase 1 done     │
└──────────────────────────────┬──────────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────────┐
│  PHASE 2 — Process Records                                      │
│                                                                 │
│  records[] ──► extract_row()  ← field mapping, HTML strip,     │
│                                  phone normalise, https prefix  │
│                    │                                            │
│                    ├──► validate_row() ──► CLEAN or FLAGGED     │
│                    └──► checkpoint.save() every N records       │
└──────────────────────────────┬──────────────────────────────────┘
                               │
┌──────────────────────────────▼──────────────────────────────────┐
│  PHASE 3 — Export                                               │
│                                                                 │
│  clean_rows[]   ──► Excel Sheet 1: Data                        │
│  flagged_rows[] ──► Excel Sheet 2: Flagged (with Flag Reason)  │
│  run stats{}    ──► Excel Sheet 3: Summary                     │
│                                                                 │
│  {prefix}_{YYYYMMDD}.xlsx  +  rotating .log file               │
└─────────────────────────────────────────────────────────────────┘

Features

Feature	Detail
Zero-code API targeting	Point at any JSON directory API by editing config.yaml — no Python changes needed
POST and GET	Configurable HTTP method per endpoint
Nested JSON navigation	Dot-path `response_path` traverses any response structure: `["data", "results"]` → `response["data"]["results"]`
Flexible pagination	`page_in_path: false` (adds page as payload param) · `page_in_path: true` (substitutes `{page}` in URL path)
Geographic bounding-box filter	Optional lat/lng filter — restrict output to any city, region, or country
Two-pass deduplication	Pass 1: unique ID · Pass 2: Name + Postcode composite key
Configurable validation	Name length, postcode requirement, postcode regex — all config-driven; failures go to Flagged sheet
Atomic checkpoint writes	Write-to-temp-then-rename — checkpoint survives a crash mid-write
Three-sheet Excel export	Data · Flagged · Summary — styled, dated, named from config
Checkpoint / resume	Atomic JSON saves every N records — resume after any interruption with zero re-fetching
Interactive keyboard controls	P pause · R resume · S status · Q quit — no Enter needed on any platform
Auto-protection	Stop time · low-disk guard · consecutive-failure cap · retry queue
Rotating log file	5 MB cap, 3 backups — full DEBUG to file, clean INFO to console
Dry-run mode	`--dry-run` reports record counts without writing any files
Environment variable overrides	`SCRAPER_API_URL`, `SCRAPER_API_KEY` — secrets never in `config.yaml`
Ruff linting in CI	Linter enforced on every push across Ubuntu, Windows, and macOS
102 pure-function tests	Full test suite runs offline in under 3 seconds — no API key needed

Performance

Dataset size	Records fetched	Processing	Time
Small directory	200–500 records	Geo-filter + dedup + validate	Under 2 min
Medium directory	1,000–5,000 records	Full pipeline	5–15 min
Large directory	10,000+ records	Resumable overnight run	1–4 hours

Actual speed depends on the target API's response time per page. The tool adds a configurable inter-page delay (inter_page_delay in runtime:) to stay polite. Processing (filtering, dedup, validation) adds negligible overhead on top of API fetch time.

What Data You Get

Field	Example
Name	Acme Property Management Ltd
Phone	0161 234 5678
Website	https://www.acme-property.co.uk
Postcode	M1 1AA
Category	Property Management
Source	Directory API

See Assets/sample_output.csv for realistic sample output.

Quick Start

1. Clone and install

git clone https://github.com/FAAQJAVED/json-directory-harvester.git
cd json-directory-harvester
pip install -r requirements.txt

2. Configure

cp config.yaml.example config.yaml

Edit config.yaml — set your API URL, response path, field mapping, geo bounds, and output preferences. Every option is annotated in the example file.

3. Set secrets (optional)

cp .env.example .env

Add your API key to .env. It is injected at runtime via SCRAPER_API_KEY and never stored in config.yaml.

4. Run

# Standard run
python scraper.py

# Dry run — reports counts without writing files
python scraper.py --dry-run

# Use a different config file
python scraper.py --config my_config.yaml

# Start fresh (clears any saved checkpoint)
python scraper.py --reset

Configuration Reference

`api` section

Key	Description
`url`	Full API endpoint URL. Can be overridden with `SCRAPER_API_URL`
`method`	`POST` or `GET`
`headers`	Dict of request headers (User-Agent, Content-Type, Referer, Origin, Authorization)
`payload`	POST body / GET query params
`response_path`	List of keys to navigate to the records list: `["data", "results"]`

`api.pagination` section

Key	Description
`enabled`	Set `true` to enable multi-page fetching
`page_param`	Payload key for the page number (used when `page_in_path: false`)
`max_pages`	Hard ceiling on total pages fetched
`page_in_path`	If `true`, substitute `{page}` in the URL instead of adding a payload param

`geo_filter` section

Key	Description
`enabled`	Toggle geographic filtering
`lat_min/max`, `lng_min/max`	Bounding box coordinates
`lat_field`, `lng_field`	Dot-path to lat/lng in each record (e.g. `"coordinates.latitude"`)

`field_mapping` section

Maps the scraper's logical field names (id, name, phone, website, postcode) to the actual key names in the API's JSON records.

`output` section

Key	Description
`directory`	Output folder for all generated files
`filename_prefix`	Prefix for Excel and log filenames
`category`	Value written to the Category column
`source`	Value written to the Source column

`validation` section

Key	Description
`min_name_length`	Records with shorter names are flagged
`require_postcode`	Flag records with no postcode
`postcode_regex`	Regex pattern postcodes must match (blank = disabled)

`runtime` section

Key	Default	Description
`stop_at`	`"23:00"`	HH:MM auto-stop time
`save_every`	`10`	Checkpoint every N records
`progress_every`	`10`	Log progress every N records
`request_timeout`	`15`	HTTP timeout in seconds
`low_disk_mb`	`500`	Auto-pause threshold (MB free)
`max_consec_fail`	`3`	Auto-pause after N consecutive failures
`inter_page_delay`	`0.5`	Seconds to wait between paginated API requests

Output

Excel workbook — `{prefix}_{YYYYMMDD}.xlsx`

Sheet	Contents
Data	Clean, validated records — frozen header row, navy header, alternating row shading, auto-width columns
Flagged	Records that failed validation, each annotated with a `Flag Reason` column
Summary	Run metadata: date, record counts, elapsed time, source endpoint, version, status

Log file — `{prefix}_{YYYYMMDD}.log`

Rotating log file (5 MB max, 3 backups). Full DEBUG output with timestamps to file. Clean INFO-level only to console.

Runtime Controls

While the scraper is running, press a key (no Enter needed on any platform):

Key	Action
`P`	Pause processing
`R`	Resume after pause
`S`	Print current status (progress bar, counts, rate, ETA)
`Q`	Quit cleanly — saves checkpoint for resumption

Auto-Protection Features

Feature	Trigger	Behaviour
Stop time	Configured HH:MM	Saves checkpoint and exits cleanly
Low disk guard	Free disk < `low_disk_mb` MB	Auto-pauses; resumes when R is pressed
Consecutive failure cap	N failures in a row	Auto-pauses; resets counter on resume
Retry queue	Any record-level exception	Failed records retried once after main loop

Resuming a Run

If a run is interrupted — by Q, stop time, disk guard, or any crash after Phase 1 completes — the checkpoint file is saved automatically. The next run detects it and resumes Phase 2 from where it stopped. Phase 1 (API fetching) is never repeated.

python scraper.py          # automatically resumes if checkpoint exists
python scraper.py --reset  # discard checkpoint and start fresh

Extending

Goal	Where to change
Add a new output column	`processor.extract_row()` and `exporter.DATA_FIELDS`
Add a new validation rule	`processor.validate_row()`
Add a new field normaliser	New function in `processor.py` alongside `strip_html()`
Support a new auth scheme	`config._apply_env_overrides()`
Add a new runtime protection	Top of Phase 2 loop in `scraper.py`

Tech Stack

Library	Role
`requests`	HTTP GET/POST to the configured API endpoint
`pyyaml`	YAML config loading
`openpyxl`	Three-sheet Excel output (Data · Flagged · Summary)
`python-dotenv`	Optional — loads `SCRAPER_API_URL` and `SCRAPER_API_KEY` from `.env`

Project Structure

json-directory-harvester/
├── scraper.py            # CLI entry point — three-phase orchestrator
├── fetcher.py            # Paginated HTTP fetching (POST/GET, two pagination modes)
├── processor.py          # Geo filter, dedup, field extraction, validation (pure functions)
├── exporter.py           # Three-sheet Excel workbook builder
├── checkpoint.py         # Atomic JSON checkpoint save/load/clear
├── controls.py           # Cross-platform keyboard listener and audio feedback
├── config.py             # YAML loader with validation and env-var overrides
├── config.yaml.example   # Fully annotated configuration template
├── .env.example          # Secret injection template
├── requirements.txt      # Runtime dependencies
├── requirements-dev.txt  # Development and testing dependencies
├── pyproject.toml        # Package metadata and tool configuration
├── CHANGELOG.md          # Version history
├── CONTRIBUTING.md       # Contribution guidelines
├── LICENSE               # MIT
├── Assets/
│   ├── terminal_progress.png   # Screenshot — live progress bar
│   ├── output_preview.png      # Screenshot — Excel output (three sheets)
│   └── sample_output.csv       # Sample output rows for reference
└── tests/
    ├── __init__.py
    ├── test_processor.py  # 44 tests — all pure functions
    ├── test_fetcher.py    # 20 tests — mocked HTTP, both pagination modes
    ├── test_checkpoint.py # 15 tests — save/load/clear/atomic write
    ├── test_config.py     # 12 tests — load_config, env overrides, validation
    └── test_exporter.py   # 11 tests — schema constants, Excel file generation

Running Tests

pip install -r requirements-dev.txt
pytest tests/ -v --cov=. --cov-report=term-missing

102 tests. No API key required. Full suite runs offline in under 3 seconds.

Requirements

Python 3.9+
requests — HTTP fetching
pyyaml — YAML config loading
openpyxl — Excel workbook output
python-dotenv — .env secret injection

See requirements.txt for pinned minimum versions.

Troubleshooting

"Config file not found: config.yaml"

Copy the annotated example: cp config.yaml.example config.yaml Then open it and replace all YOUR_* placeholder values.

API returning zero records

Wrong response_path. Open browser DevTools → Network tab → find the API call → inspect the JSON structure. If records are at {"data": {"results": [...]}}, set response_path: ["data", "results"]. Use --dry-run first to confirm counts.

Checkpoint not resuming

Re-run python scraper.py — checkpoint is detected automatically. To start fresh: python scraper.py --reset

Excel output locked / PermissionError

Close the previous output file in Excel before running. Excel holds an exclusive lock on open .xlsx files.

Keyboard controls not responding on macOS

macOS requires Accessibility permissions for raw keypress reading. System Settings → Privacy & Security → Accessibility → add your terminal app. Restart the terminal after granting permission.

Inter-page delay too fast — API is rate-limiting

Increase inter_page_delay in your config.yaml under runtime:. Default is 0.5 seconds. For strict APIs, try 2.0 or higher:

runtime:
  inter_page_delay: 2.0

Part of the B2B Lead Toolkit

This tool is one component of a broader B2B lead generation pipeline targeting UK property management companies, letting agents, block managers, and HMO landlords.

Repo	What it does
JSON Directory Harvester ← you are here	Configurable harvester for any JSON directory API
Google Maps Business Scraper	Extracts and enriches business listings from Google Maps
Email Phone Enrichment Tool	Converts a website list into a verified email + phone database
HTML Directory Scrapers	Two-engine toolkit for HTML and WordPress AJAX directories
LeadHunter Pro	Multi-engine search scraper with HOT/WARM/COLD lead scoring
Trustpilot Business Scraper	Extracts business contact data from Trustpilot search results

All five tools share the same Excel output schema (Data + Summary sheets) — results can be combined directly in Excel or imported together into a CRM.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
Assets		Assets
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
checkpoint.py		checkpoint.py
config.py		config.py
config.yaml.example		config.yaml.example
controls.py		controls.py
exporter.py		exporter.py
fetcher.py		fetcher.py
processor.py		processor.py
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
scraper.py		scraper.py

Folders and files

Latest commit

History

Repository files navigation

JSON Directory Harvester

Table of Contents

Preview

What It Does

Use Cases

How It Works

Features

Performance

What Data You Get

Quick Start

1. Clone and install

2. Configure

3. Set secrets (optional)

4. Run

Configuration Reference

api section

api.pagination section

geo_filter section

field_mapping section

output section

validation section

runtime section

Output

Excel workbook — {prefix}_{YYYYMMDD}.xlsx

Log file — {prefix}_{YYYYMMDD}.log

Runtime Controls

Auto-Protection Features

Resuming a Run

Extending

Tech Stack

Project Structure

Running Tests

Requirements

Troubleshooting

Part of the B2B Lead Toolkit

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`api` section

`api.pagination` section

`geo_filter` section

`field_mapping` section

`output` section

`validation` section

`runtime` section

Excel workbook — `{prefix}_{YYYYMMDD}.xlsx`

Log file — `{prefix}_{YYYYMMDD}.log`

Packages