> A configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — with geographic filtering, deduplication, data validation, and formatted three-sheet Excel export.
+ **Configurable, resumable Python pipeline for harvesting records from any JSON-based directory API — geographic filtering, two-pass deduplication, data validation, and professionally formatted three-sheet Excel export. Point it at a new API with a config change. No code edits required.**
+
+ > **⚠️ API access required.** This tool fetches data from an API endpoint you configure — it does not scrape HTML pages. You need an API URL that returns JSON records. See `config.yaml.example` for the full configuration reference.
+ - [Part of the B2B Lead Toolkit](#part-of-the-b2b-lead-toolkit)
+ - [License](#license)
+
+ ---
+
## Preview

| Terminal — live progress bar | Excel Output — three sheets |
@@ -20,9 +52,27 @@
## What It Does

- JSON Directory Harvester fetches records from **any JSON-based directory API**, filters them by geographic bounding box, deduplicates them in two passes, validates each record against configurable rules, and exports the results to a professionally formatted Excel workbook.
+ 1. **Reads config.yaml** — one file controls the API endpoint, pagination mode, geographic bounds, field mapping, validation rules, and output format.
+ 2. **Fetches all records** from the configured JSON API via paginated GET or POST requests, with configurable inter-page delay and request timeout.
+ 3. **Filters by geography** using a lat/lng bounding box — keeps only records whose coordinates fall within the configured region.
+ 4. **Deduplicates in two passes** — first by unique ID field, then by Name + Postcode composite key.
+ 5. **Validates each record** against configurable rules (name length, postcode requirement, postcode regex, or any custom pattern).
+ 6. **Exports to Excel** — a styled Data sheet (clean records), a Flagged sheet (failed validation), and a Summary sheet (run statistics).
+
+ Everything is config-driven. The Python source files contain zero API-specific strings — every field path, URL, and cleaning rule lives in `config.yaml`. Switching to a new directory API requires only a new config file, not a code change.
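
The filtering, deduplication, and validation steps described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the function names (`in_bounds`, `dedupe`, `validation_errors`), the `lat`/`lng` record keys, the `lng_min`/`lng_max` bounds, the sample records, and the UK-style postcode pattern are all assumptions made for the example.

```python
import re

# Hypothetical UK-style postcode pattern; a real rule would come from config.yaml.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")


def in_bounds(record, bounds):
    """Return True when the record's coordinates fall inside the bounding box."""
    lat, lng = record.get("lat"), record.get("lng")
    if lat is None or lng is None:
        return False
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lng_min"] <= lng <= bounds["lng_max"])


def dedupe(records):
    """Pass 1 drops repeated IDs; pass 2 drops repeated name + postcode pairs."""
    seen_ids, first_pass = set(), []
    for rec in records:
        rec_id = rec.get("id")
        if rec_id is not None:
            if rec_id in seen_ids:
                continue
            seen_ids.add(rec_id)
        first_pass.append(rec)

    seen_keys, unique = set(), []
    for rec in first_pass:
        # Normalise case and spacing so "m1 1aa" and "M1 1AA" compare equal.
        key = ((rec.get("name") or "").strip().lower(),
               (rec.get("postcode") or "").replace(" ", "").upper())
        if key in seen_keys:
            continue
        seen_keys.add(key)
        unique.append(rec)
    return unique


def validation_errors(record, min_name_len=2):
    """Return the reasons a record should be flagged (empty list means clean)."""
    errors = []
    name = (record.get("name") or "").strip()
    postcode = (record.get("postcode") or "").strip().upper()
    if len(name) < min_name_len:
        errors.append("name too short")
    if not postcode:
        errors.append("postcode missing")
    elif not POSTCODE_RE.match(postcode):
        errors.append("postcode format")
    return errors


# Made-up records and a made-up Manchester-area bounding box.
bounds = {"lat_min": 53.3, "lat_max": 53.6, "lng_min": -2.5, "lng_max": -2.0}
records = [
    {"id": 1, "name": "Acme Property Management Ltd", "postcode": "M1 1AA", "lat": 53.48, "lng": -2.24},
    {"id": 1, "name": "Acme Property Management Ltd", "postcode": "M1 1AA", "lat": 53.48, "lng": -2.24},
    {"id": 2, "name": "acme property management ltd", "postcode": "m1 1aa", "lat": 53.48, "lng": -2.24},
    {"id": 3, "name": "X", "postcode": "", "lat": 53.47, "lng": -2.25},
    {"id": 4, "name": "London Lettings", "postcode": "SW1A 1AA", "lat": 51.5, "lng": -0.14},
]

kept = dedupe([r for r in records if in_bounds(r, bounds)])
clean = [r for r in kept if not validation_errors(r)]    # would go to the Data sheet
flagged = [r for r in kept if validation_errors(r)]      # would go to the Flagged sheet
```

In this sketch the duplicate ID is removed in the first pass, the differently cased copy of the same name and postcode is removed in the second, and the record with a one-letter name and no postcode ends up in `flagged`.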
+
+ ---
+
+ ## Use Cases

- Everything — the API endpoint, pagination style, field names, geo bounds, validation rules, and output format — is controlled by a single `config.yaml` file. **No source code changes are needed to point the tool at a new API.**
+ | Who uses it | What they do | Example config |
+ |---|---|---|
+ | **Property data analysts** | Harvest national directory APIs filtered to a city bounding box | `geo_filter: {enabled: true, lat_min: 51.4, lat_max: 51.6}` |
+ | **Lead gen teams** | Extract and validate business records from membership directory APIs | Any B2B directory with a JSON API |
+ | **Market researchers** | Pull structured datasets from industry directories for analysis | Configure once, re-run weekly for fresh data |
+ | **CRM admins** | Automate nightly imports from a members or listings API into Excel | Schedule with cron or Task Scheduler |
+ | **Data engineers** | Use as a lightweight ETL layer for JSON API → Excel pipelines | Extend `exporter.py` for custom output formats |
+ | **Developers** | Adapt to any new JSON directory API in minutes using a new config.yaml | Copy config.yaml.example, fill in 5 fields, run |

---
@@ -69,20 +119,50 @@ Everything — the API endpoint, pagination style, field names, geo bounds, vali
| Feature | Detail |
|---|---|
- | **Config-driven** | API endpoint, pagination, field names, geo bounds, validation — all in `config.yaml`. Zero code changes to target a new API |
+ | **Zero-code API targeting** | Point at any JSON directory API by editing config.yaml — no Python changes needed |
| **POST and GET** | Configurable HTTP method per endpoint |
| **Three-sheet Excel export** | Data · Flagged · Summary — styled, dated, named from config |
| **Checkpoint / resume** | Atomic JSON saves every N records — resume after any interruption with zero re-fetching |
| **Interactive keyboard controls** | P pause · R resume · S status · Q quit — no Enter needed on any platform |
| **Auto-protection** | Stop time · low-disk guard · consecutive-failure cap · retry queue |
| **Rotating log file** | 5 MB cap, 3 backups — full DEBUG to file, clean INFO to console |
- | **Dry-run mode** | Reports counts without writing any files |
+ | **Dry-run mode** | `--dry-run` reports record counts without writing any files |
| **Environment variable overrides** | `SCRAPER_API_URL`, `SCRAPER_API_KEY` — secrets never in `config.yaml` |
+ | **Ruff linting in CI** | Linter enforced on every push across Ubuntu, Windows, and macOS |
+ | **102 pure-function tests** | Full test suite runs offline in under 3 seconds — no API key needed |
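
One common way to get the atomic checkpoint saves listed in the table above is to write the state to a temporary file and then rename it over the old checkpoint. The sketch below illustrates that general technique only; `save_checkpoint`, `load_checkpoint`, and the shape of the state dictionary are hypothetical, and the project's own implementation may differ.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # readers see the old file or the new one
    except BaseException:
        # Clean up the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

A run loop could call `save_checkpoint("checkpoint.json", {"page": page, "seen_ids": ids})` every N records and call `load_checkpoint` at startup to decide where to resume. If the process dies mid-write, only the temporary file is affected.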
+
+ ---
+
+ ## Performance
+
+ | Dataset size | Records fetched | Processing | Time |
+ |---|---|---|---|
+ | Small directory | 200–500 records | Geo-filter + dedup + validate | Under 2 min |
+ | Medium directory | 1,000–5,000 records | Full pipeline | 5–15 min |
+ | Large directory | 10,000+ records | Resumable overnight run | 1–4 hours |
+
+ > Actual speed depends on the target API's response time per page. The tool adds a configurable inter-page delay (`inter_page_delay` in `runtime:`) to stay polite. Processing (filtering, dedup, validation) adds negligible overhead on top of API fetch time.
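
A back-of-the-envelope estimate shows how fetch time scales with page count and delay. The helper below is hypothetical and not part of the tool; `page_size` and `response_seconds` are assumed inputs that depend entirely on the target API, and the sketch assumes the delay is applied once between consecutive pages.

```python
import math


def estimate_fetch_seconds(total_records, page_size, response_seconds,
                           inter_page_delay=0.5):
    """Estimate total fetch time: one response per page, one delay between pages."""
    pages = math.ceil(total_records / page_size)
    return pages * response_seconds + max(pages - 1, 0) * inter_page_delay


# Assumed example: 5,000 records, 50 per page, 1 second per response.
default_delay = estimate_fetch_seconds(5000, 50, 1.0)                       # 149.5 seconds
polite_delay = estimate_fetch_seconds(5000, 50, 1.0, inter_page_delay=2.0)  # 298.0 seconds
```

Under these assumed numbers, raising the delay from 0.5 to 2.0 seconds roughly doubles the run time, which is why the API's own response time and the chosen delay dominate the figures in the table.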
+
+ ---
+
+ ## What Data You Get
+
+ | Field | Example |
+ |---|---|
+ | Name | Acme Property Management Ltd |
+ | Phone | 0161 234 5678 |
+ | Website | https://www.acme-property.co.uk |
+ | Postcode | M1 1AA |
+ | Category | Property Management |
+ | Source | Directory API |
+
+ See [`Assets/sample_output.csv`](Assets/sample_output.csv) for realistic sample output.

---
@@ -190,6 +270,7 @@ Maps the scraper's logical field names (`id`, `name`, `phone`, `website`, `postc
| `request_timeout` | `15` | HTTP timeout in seconds |
| Add a new output column | `processor.extract_row()` and `exporter.DATA_FIELDS` |
- | Add a new validation rule | `processor.validate_row()` |
- | Add a new field normaliser | New function in `processor.py` alongside `strip_html()` |
- | Support a new auth scheme | `config._apply_env_overrides()` |
- | Add a new runtime protection | Top of Phase 2 loop in `scraper.py` |
+ 102 tests. No API key required. Full suite runs offline in under 3 seconds.

---
@@ -308,22 +405,70 @@ See `requirements.txt` for pinned minimum versions.
---

+ ## Troubleshooting
+
+ **"Config file not found: config.yaml"**
+
+ Copy the annotated example: `cp config.yaml.example config.yaml`
+ Then open it and replace all `YOUR_*` placeholder values.
+
+ ---
+
+ **API returning zero records**
+
+ Wrong `response_path`. Open browser DevTools → Network tab → find the API call → inspect the JSON structure. If records are at `{"data": {"results": [...]}}`, set `response_path: ["data", "results"]`. Use `--dry-run` first to confirm counts.
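
To see why a wrong `response_path` yields zero records, it helps to picture the path as a list of keys walked one at a time through the decoded JSON. The function below is a minimal sketch under that assumption; `extract_records` is a hypothetical name, and the tool's real lookup may behave differently (for example, it might raise an error instead of returning an empty list).

```python
def extract_records(payload, response_path):
    """Follow each key in response_path; return the list found there, else []."""
    node = payload
    for key in response_path:
        # A missing key or a non-dict level means the path does not match.
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


payload = {"data": {"results": [{"id": 1}, {"id": 2}]}}

matched = extract_records(payload, ["data", "results"])   # the two records
mismatched = extract_records(payload, ["data", "items"])  # empty: no "items" key
too_shallow = extract_records(payload, ["data"])          # empty: a dict, not a list
```

A path that stops one level too early or names a key the API does not return produces an empty result rather than a crash, which matches the symptom described above.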
+
+ ---
+
+ **Checkpoint not resuming**
+
+ Re-run `python scraper.py` — checkpoint is detected automatically.
+ To start fresh: `python scraper.py --reset`
+
+ ---
+
+ **Excel output locked / PermissionError**
+
+ Close the previous output file in Excel before running. Excel holds an exclusive lock on open `.xlsx` files.
+
+ ---
+
+ **Keyboard controls not responding on macOS**
+
+ macOS requires Accessibility permissions for raw keypress reading.
+ System Settings → Privacy & Security → Accessibility → add your terminal app.
+ Restart the terminal after granting permission.
+
+ ---
+
+ **Inter-page delay too fast — API is rate-limiting**
+
+ Increase `inter_page_delay` in your `config.yaml` under `runtime:`.
+ Default is 0.5 seconds. For strict APIs, try 2.0 or higher:
+
+ ```yaml
+ runtime:
+   inter_page_delay: 2.0
+ ```
+
+ ---
+
## Part of the B2B Lead Toolkit

This tool is one component of a broader B2B lead generation pipeline targeting UK property management companies, letting agents, block managers, and HMO landlords.

| Repo | What it does |
|---|---|
- | **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** ← *you are here* | Harvests records from any JSON-based directory API |
+ | **[JSON Directory Harvester](https://github.com/FAAQJAVED/json-directory-harvester)** ← *you are here* | Configurable harvester for any JSON directory API |
| **[Google Maps Business Scraper](https://github.com/FAAQJAVED/Google-Maps-Business-Scraper)** | Extracts and enriches business listings from Google Maps |
- | **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Scrapes contact emails and phones from company websites |
+ | **[Email Phone Enrichment Tool](https://github.com/FAAQJAVED/Email-Phone-Number-Enrichment-Tool)** | Converts a website list into a verified email + phone database |
| **[LeadHunter Pro](https://github.com/FAAQJAVED/Leadhunter_Pro)** | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
- | **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business listings from Trustpilot search results |
+ | **[Trustpilot Business Scraper](https://github.com/FAAQJAVED/trustpilot-business-scraper)** | Extracts business contact data from Trustpilot search results |

All five tools share the same Excel output schema (Data + Summary sheets) — results can be combined directly in Excel or imported together into a CRM.