Commit e9056cd

full README upgrade — tagline, disclaimer, TOC, use cases, performance, what data you get, tech stack, troubleshooting, 102 badge
1 parent 0364926 commit e9056cd

1 file changed

Lines changed: 172 additions & 27 deletions

README.md
@@ -1,15 +1,47 @@
11
# JSON Directory Harvester
22

3-
> A configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — with geographic filtering, deduplication, data validation, and formatted three-sheet Excel export.
3+
**Configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — geographic filtering, two-pass deduplication, data validation, and professionally formatted three-sheet Excel export. Point it at a new API with a config change. No code edits required.**
4+
5+
> **⚠️ API access required.** This tool fetches data from an API endpoint you configure — it does not scrape HTML pages. You need an API URL that returns JSON records. See `config.yaml.example` for the full configuration reference.
46
57
[![CI](https://github.com/FAAQJAVED/json-directory-harvester/actions/workflows/ci.yml/badge.svg)](https://github.com/FAAQJAVED/json-directory-harvester/actions/workflows/ci.yml)
68
[![Python](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://python.org)
79
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
8-
[![Tests](https://img.shields.io/badge/tests-79%20passing-brightgreen)](tests/)
10+
[![Tests](https://img.shields.io/badge/tests-102%20passing-brightgreen)](tests/)
11+
[![Coverage](https://codecov.io/gh/FAAQJAVED/json-directory-harvester/graph/badge.svg)](https://codecov.io/gh/FAAQJAVED/json-directory-harvester)
912
[![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey)](https://github.com/FAAQJAVED/json-directory-harvester)
1013

1114
---
1215

16+
> Found this useful? A ⭐ on GitHub helps other developers find it.
17+
18+
---
19+
20+
## Table of Contents
21+
22+
- [Preview](#preview)
23+
- [What It Does](#what-it-does)
24+
- [Use Cases](#use-cases)
25+
- [How It Works](#how-it-works)
26+
- [Features](#features)
27+
- [Performance](#performance)
28+
- [What Data You Get](#what-data-you-get)
29+
- [Quick Start](#quick-start)
30+
- [Configuration Reference](#configuration-reference)
31+
- [Output](#output)
32+
- [Runtime Controls](#runtime-controls)
33+
- [Auto-Protection Features](#auto-protection-features)
34+
- [Resuming a Run](#resuming-a-run)
35+
- [Extending](#extending)
36+
- [Tech Stack](#tech-stack)
37+
- [Project Structure](#project-structure)
38+
- [Running Tests](#running-tests)
39+
- [Troubleshooting](#troubleshooting)
40+
- [Part of the B2B Lead Toolkit](#part-of-the-b2b-lead-toolkit)
41+
- [License](#license)
42+
43+
---
44+
1345
## Preview
1446

1547
| Terminal — live progress bar | Excel Output — three sheets |
@@ -20,9 +52,27 @@
2052

2153
## What It Does
2254

23-
JSON Directory Harvester fetches records from **any JSON-based directory API**, filters them by geographic bounding box, deduplicates them in two passes, validates each record against configurable rules, and exports the results to a professionally formatted Excel workbook.
55+
1. **Reads config.yaml** — one file controls the API endpoint, pagination mode, geographic bounds, field mapping, validation rules, and output format.
56+
2. **Fetches all records** from the configured JSON API via paginated GET or POST requests, with configurable inter-page delay and request timeout.
57+
3. **Filters by geography** using a lat/lng bounding box — keeps only records whose coordinates fall within the configured region.
58+
4. **Deduplicates in two passes** — first by unique ID field, then by Name + Postcode composite key.
59+
5. **Validates each record** against configurable rules (name length, postcode requirement, postcode regex, or any custom pattern).
60+
6. **Exports to Excel** — a styled Data sheet (clean records), a Flagged sheet (failed validation), and a Summary sheet (run statistics).
61+
62+
Everything is config-driven. The Python source files contain zero API-specific strings — every field path, URL, and cleaning rule lives in `config.yaml`. Switching to a new directory API requires only a new config file, not a code change.
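
Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration, not the project's actual `processor.py`: the record keys (`lat`, `lng`, `id`, `name`, `postcode`), the `lng_min`/`lng_max` bounds, and the "first occurrence wins" tie-break are all assumptions made for the example.

```python
from typing import Iterable


def in_bounds(record: dict, bounds: dict) -> bool:
    """Return True if the record's coordinates fall inside the bounding box."""
    lat, lng = record.get("lat"), record.get("lng")
    if lat is None or lng is None:
        return False  # assumption: records without coordinates are dropped
    return (
        bounds["lat_min"] <= lat <= bounds["lat_max"]
        and bounds["lng_min"] <= lng <= bounds["lng_max"]
    )


def dedupe(records: Iterable[dict]) -> list:
    """Pass 1: unique ID. Pass 2: Name + Postcode composite key."""
    by_id = {}
    for rec in records:
        by_id.setdefault(rec["id"], rec)  # assumption: first occurrence wins

    seen, unique = set(), []
    for rec in by_id.values():
        # Normalise case and spacing so "M1 1AA" and "m11aa" compare equal.
        key = (
            rec["name"].strip().lower(),
            rec["postcode"].replace(" ", "").upper(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Running the ID pass first keeps the composite-key pass cheap, since exact duplicates are already gone before any string normalisation happens.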
63+
64+
---
65+
66+
## Use Cases
2467

25-
Everything — the API endpoint, pagination style, field names, geo bounds, validation rules, and output format — is controlled by a single `config.yaml` file. **No source code changes are needed to point the tool at a new API.**
68+
| Who uses it | What they do | Example config |
69+
|---|---|---|
70+
| **Property data analysts** | Harvest national directory APIs filtered to a city bounding box | `geo_filter: {enabled: true, lat_min: 51.4, lat_max: 51.6}` |
71+
| **Lead gen teams** | Extract and validate business records from membership directory APIs | Any B2B directory with a JSON API |
72+
| **Market researchers** | Pull structured datasets from industry directories for analysis | Configure once, re-run weekly for fresh data |
73+
| **CRM admins** | Automate nightly imports from a members or listings API into Excel | Schedule with cron or Task Scheduler |
74+
| **Data engineers** | Use as a lightweight ETL layer for JSON API → Excel pipelines | Extend `exporter.py` for custom output formats |
75+
| **Developers** | Adapt to any new JSON directory API in minutes using a new config.yaml | Copy config.yaml.example, fill in 5 fields, run |
2676

2777
---
2878

@@ -69,20 +119,50 @@ Everything — the API endpoint, pagination style, field names, geo bounds, vali
69119

70120
| Feature | Detail |
71121
|---|---|
72-
| **Config-driven** | API endpoint, pagination, field names, geo bounds, validation — all in `config.yaml`. Zero code changes to target a new API |
122+
| **Zero-code API targeting** | Point at any JSON directory API by editing config.yaml — no Python changes needed |
73123
| **POST and GET** | Configurable HTTP method per endpoint |
74124
| **Nested JSON navigation** | Key-path `response_path` traverses any response structure: `["data", "results"]` → `response["data"]["results"]` |
75-
| **Two pagination modes** | `page_in_path: false` (adds page as payload param) · `page_in_path: true` (substitutes `{page}` in URL path) |
125+
| **Flexible pagination** | `page_in_path: false` (adds page as payload param) · `page_in_path: true` (substitutes `{page}` in URL path) |
76126
| **Geographic bounding-box filter** | Optional lat/lng filter — restrict output to any city, region, or country |
77-
| **Two-pass deduplication** | Pass 1: by record ID (richer record wins) · Pass 2: by name + postcode (case-insensitive) |
78-
| **Configurable validation** | Name length, postcode requirement, postcode regex — all config-driven |
79-
| **Three-sheet Excel export** | Data / Flagged / Summary with frozen headers, alternating row shading, and auto-width columns |
127+
| **Two-pass deduplication** | Pass 1: unique ID · Pass 2: Name + Postcode composite key |
128+
| **Configurable validation** | Name length, postcode requirement, postcode regex — all config-driven; failures go to Flagged sheet |
129+
| **Atomic checkpoint writes** | Write-to-temp-then-rename — checkpoint survives a crash mid-write |
130+
| **Three-sheet Excel export** | Data · Flagged · Summary — styled, dated, named from config |
80131
| **Checkpoint / resume** | Atomic JSON saves every N records — resume after any interruption with zero re-fetching |
81132
| **Interactive keyboard controls** | P pause · R resume · S status · Q quit — no Enter needed on any platform |
82133
| **Auto-protection** | Stop time · low-disk guard · consecutive-failure cap · retry queue |
83134
| **Rotating log file** | 5 MB cap, 3 backups — full DEBUG to file, clean INFO to console |
84-
| **Dry-run mode** | Reports counts without writing any files |
135+
| **Dry-run mode** | `--dry-run` reports record counts without writing any files |
85136
| **Environment variable overrides** | `SCRAPER_API_URL`, `SCRAPER_API_KEY` — secrets never in `config.yaml` |
137+
| **Ruff linting in CI** | Linter enforced on every push across Ubuntu, Windows, and macOS |
138+
| **102 offline tests** | Full suite runs offline in under 3 seconds; no API key needed |
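
The write-to-temp-then-rename pattern named in the table can be sketched with the standard library alone. The function names `save_checkpoint` and `load_checkpoint` are hypothetical and may not match the project's checkpoint module; the sketch only illustrates the technique.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes reach disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Because the rename either fully happens or does not happen at all, a crash mid-write leaves the previous checkpoint intact rather than a half-written JSON file.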
139+
140+
---
141+
142+
## Performance
143+
144+
| Dataset size | Records fetched | Processing | Time |
145+
|---|---|---|---|
146+
| Small directory | 200–500 records | Geo-filter + dedup + validate | Under 2 min |
147+
| Medium directory | 1,000–5,000 records | Full pipeline | 5–15 min |
148+
| Large directory | 10,000+ records | Resumable overnight run | 1–4 hours |
149+
150+
> Actual speed depends on the target API's response time per page. The tool adds a configurable inter-page delay (`inter_page_delay` in `runtime:`) to stay polite. Processing (filtering, dedup, validation) adds negligible overhead on top of API fetch time.
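
To see how the delay shapes total run time, here is a rough sketch of a paginated fetch loop. It is an illustration only: `fetch_page` is a hypothetical callable standing in for the real HTTP request, and stopping on the first empty page is an assumption about the API.

```python
import time
from typing import Callable, Iterator


def fetch_all(
    fetch_page: Callable[[int], list],
    inter_page_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records page by page, pausing between requests."""
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break  # assumption: an empty page marks the end of the data
        yield from records
        page += 1
        sleep(inter_page_delay)  # stay polite before the next request
```

With a 0.5 second delay, the pauses alone add roughly half a second per page, so a 200-page directory spends about 100 seconds waiting on top of the API's own response time.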
151+
152+
---
153+
154+
## What Data You Get
155+
156+
| Field | Example |
157+
|---|---|
158+
| Name | Acme Property Management Ltd |
159+
| Phone | 0161 234 5678 |
160+
| Website | https://www.acme-property.co.uk |
161+
| Postcode | M1 1AA |
162+
| Category | Property Management |
163+
| Source | Directory API |
164+
165+
See [`Assets/sample_output.csv`](Assets/sample_output.csv) for realistic sample output.
86166

87167
---
88168

@@ -190,6 +270,7 @@ Maps the scraper's logical field names (`id`, `name`, `phone`, `website`, `postc
190270
| `request_timeout` | `15` | HTTP timeout in seconds |
191271
| `low_disk_mb` | `500` | Auto-pause threshold (MB free) |
192272
| `max_consec_fail` | `3` | Auto-pause after N consecutive failures |
273+
| `inter_page_delay` | `0.5` | Seconds to wait between paginated API requests |
193274

194275
---
195276

@@ -244,6 +325,29 @@ python scraper.py --reset # discard checkpoint and start fresh
244325

245326
---
246327

328+
## Extending
329+
330+
| Goal | Where to change |
331+
|---|---|
332+
| Add a new output column | `processor.extract_row()` and `exporter.DATA_FIELDS` |
333+
| Add a new validation rule | `processor.validate_row()` |
334+
| Add a new field normaliser | New function in `processor.py` alongside `strip_html()` |
335+
| Support a new auth scheme | `config._apply_env_overrides()` |
336+
| Add a new runtime protection | Top of Phase 2 loop in `scraper.py` |
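
As a rough idea of what a validation rule might look like, the sketch below checks name length, postcode presence, and postcode format. The function name, its signature, and the rule keys (`min_name_length`, `require_postcode`, `postcode_regex`) are hypothetical; check `processor.validate_row()` for the real interface before adapting it.

```python
import re


def validate_row_sketch(row: dict, rules: dict) -> list:
    """Return a list of problems; an empty list means the row is clean."""
    problems = []

    name = (row.get("name") or "").strip()
    if len(name) < rules.get("min_name_length", 1):
        problems.append("name too short")

    postcode = (row.get("postcode") or "").strip()
    if rules.get("require_postcode") and not postcode:
        problems.append("postcode missing")

    pattern = rules.get("postcode_regex")
    if postcode and pattern and not re.fullmatch(pattern, postcode):
        problems.append("postcode format")

    return problems
```

Returning a list of reasons rather than a boolean makes it straightforward to record why a row was flagged.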
337+
338+
---
339+
340+
## Tech Stack
341+
342+
| Library | Role |
343+
|---|---|
344+
| `requests` | HTTP GET/POST to the configured API endpoint |
345+
| `pyyaml` | YAML config loading |
346+
| `openpyxl` | Three-sheet Excel output (Data · Flagged · Summary) |
347+
| `python-dotenv` | Optional — loads `SCRAPER_API_URL` and `SCRAPER_API_KEY` from `.env` |
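
The environment-variable override can be pictured as a small merge step after the YAML is loaded. This is a sketch under assumptions: the `api.url` and `api.key` config layout and the function name are invented for illustration, and only the variable names `SCRAPER_API_URL` and `SCRAPER_API_KEY` come from this README.

```python
import os


def apply_env_overrides(config: dict, environ=os.environ) -> dict:
    """Return a copy of config with secrets taken from the environment."""
    merged = {**config, "api": dict(config.get("api", {}))}
    if environ.get("SCRAPER_API_URL"):
        merged["api"]["url"] = environ["SCRAPER_API_URL"]  # hypothetical key
    if environ.get("SCRAPER_API_KEY"):
        merged["api"]["key"] = environ["SCRAPER_API_KEY"]  # hypothetical key
    return merged
```

Passing `environ` as a parameter keeps the function testable without touching the real process environment.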
348+
349+
---
350+
247351
## Project Structure
248352

249353
```
@@ -265,12 +369,15 @@ json-directory-harvester/
265369
├── LICENSE # MIT
266370
├── Assets/
267371
│ ├── terminal_progress.png # Screenshot — live progress bar
268-
│ └── output_preview.png # Screenshot — Excel output (three sheets)
372+
│ ├── output_preview.png # Screenshot — Excel output (three sheets)
373+
│ └── sample_output.csv # Sample output rows for reference
269374
└── tests/
270375
    ├── __init__.py
271376
    ├── test_processor.py # 44 tests — all pure functions
272377
    ├── test_fetcher.py # 20 tests — mocked HTTP, both pagination modes
273-
    └── test_checkpoint.py # 15 tests — save/load/clear/atomic write
378+
    ├── test_checkpoint.py # 15 tests — save/load/clear/atomic write
379+
    ├── test_config.py # 12 tests — load_config, env overrides, validation
380+
    └── test_exporter.py # 11 tests — schema constants, Excel file generation
274381
```
275382

276383
---
@@ -282,17 +389,7 @@ pip install -r requirements-dev.txt
282389
pytest tests/ -v --cov=. --cov-report=term-missing
283390
```
284391

285-
---
286-
287-
## Extending
288-
289-
| Goal | Where to change |
290-
|---|---|
291-
| Add a new output column | `processor.extract_row()` and `exporter.DATA_FIELDS` |
292-
| Add a new validation rule | `processor.validate_row()` |
293-
| Add a new field normaliser | New function in `processor.py` alongside `strip_html()` |
294-
| Support a new auth scheme | `config._apply_env_overrides()` |
295-
| Add a new runtime protection | Top of Phase 2 loop in `scraper.py` |
392+
102 tests. No API key required. Full suite runs offline in under 3 seconds.
296393

297394
---
298395

@@ -308,22 +405,70 @@ See `requirements.txt` for pinned minimum versions.
308405

309406
---
310407

408+
## Troubleshooting
409+
410+
**"Config file not found: config.yaml"**
411+
412+
Copy the annotated example: `cp config.yaml.example config.yaml`
413+
Then open it and replace all `YOUR_*` placeholder values.
414+
415+
---
416+
417+
**API returning zero records**
418+
419+
The most likely cause is a wrong `response_path`. Open browser DevTools → Network tab → find the API call → inspect the JSON structure. If records are at `{"data": {"results": [...]}}`, set `response_path: ["data", "results"]`. Run with `--dry-run` first to confirm the record counts before writing any files.
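
A quick way to check a candidate `response_path` is to walk it against a saved copy of the API response. The helper below is a hypothetical stand-in, not the project's fetcher code; it returns an empty list whenever the path does not lead to a list, which is the "zero records" symptom described here.

```python
def extract_records(response: dict, response_path: list) -> list:
    """Follow response_path key by key and return the list found there."""
    node = response
    for key in response_path:
        if not isinstance(node, dict) or key not in node:
            return []  # path does not exist in this response
        node = node[key]
    return node if isinstance(node, list) else []


sample = {"data": {"results": [{"id": 1}, {"id": 2}]}}
print(len(extract_records(sample, ["data", "results"])))  # correct path
print(len(extract_records(sample, ["results"])))          # wrong path
```

If the second call style matches what you see in a run, compare each key in your `response_path` against the JSON shown in DevTools.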
420+
421+
---
422+
423+
**Checkpoint not resuming**
424+
425+
Re-run `python scraper.py` — checkpoint is detected automatically.
426+
To start fresh: `python scraper.py --reset`
427+
428+
---
429+
430+
**Excel output locked / PermissionError**
431+
432+
Close the previous output file in Excel before running. Excel holds an exclusive lock on open `.xlsx` files.
433+
434+
---
435+
436+
**Keyboard controls not responding on macOS**
437+
438+
macOS requires Accessibility permissions for raw keypress reading.
439+
System Settings → Privacy & Security → Accessibility → add your terminal app.
440+
Restart the terminal after granting permission.
441+
442+
---
443+
444+
**Inter-page delay too short: API is rate-limiting**
445+
446+
Increase `inter_page_delay` in your `config.yaml` under `runtime:`.
447+
Default is 0.5 seconds. For strict APIs, try 2.0 or higher:
448+
449+
```yaml
450+
runtime:
451+
  inter_page_delay: 2.0
452+
```
453+
454+
---
455+
311456
## Part of the B2B Lead Toolkit
312457
313458
This tool is one component of a broader B2B lead generation pipeline targeting UK property management companies, letting agents, block managers, and HMO landlords.
314459
315460
| Repo | What it does |
316461
|---|---|
317-
| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** ← *you are here* | Harvests records from any JSON-based directory API |
462+
| **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** ← *you are here* | Configurable harvester for any JSON directory API |
318463
| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
319-
| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails and phones from company websites |
464+
| **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Converts a website list into a verified email + phone database |
320465
| **[LeadHunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)** | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
321-
| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
466+
| **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business contact data from Trustpilot search results |
322467
323468
All five tools share the same Excel output schema (Data + Summary sheets) — results can be combined directly in Excel or imported together into a CRM.
324469
325470
---
326471
327472
## License
328473
329-
MIT © 2026 [FAAQJAVED](https://github.com/FAAQJAVED)
474+
MIT © 2026 [FAAQJAVED](https://github.com/FAAQJAVED) — see [LICENSE](LICENSE)
