All notable changes to this project will be documented in this file. Format follows Keep a Changelog.
- Keyboard controls now work reliably: the `P`, `R`, `S`/`Q`, and `W` keys previously had no effect unless pressed during a narrow window between page-loop iterations. Root cause: the listener thread stored keys in `_key`, but the main loop only called `get_key()` once per page (every 10–30 s), so most keypresses were missed. Fix: `P` and `S`/`Q` now set `threading.Event` flags (`_pause_event`, `_stop_event`) directly inside the listener thread, so they take effect immediately, even when the main loop is blocked in `time.sleep()` or a `tqdm` iteration. `R` clears `_pause_event` immediately. `W` remains stored in `_key` and is consumed by the main loop to print stats. `InputController` gains `is_paused()` and `is_stopped()` helper methods; both scrapers updated to poll these instead of a local `paused` variable.
- Dead-website retry: 0 → 1. Profile/crawl fetches now retry once on `ReadTimeout` and general connection errors. `ConnectTimeout` (TCP-level failure, host is unreachable) still returns `("", 0)` immediately with zero retries, so truly dead sites never stall the run.
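
The event-flag approach described in the keyboard fix can be sketched roughly as follows. This is a minimal illustration, not the project's actual `InputController`: the `handle_key` method name and the lock around `_key` are assumptions, and real key capture (msvcrt / select) is omitted.

```python
import threading


class InputController:
    """Illustrative stand-in: only the flag handling is shown, not key capture."""

    def __init__(self):
        self._pause_event = threading.Event()
        self._stop_event = threading.Event()
        self._key = None               # only W is left for the main loop
        self._lock = threading.Lock()  # assumption: guards _key across threads

    def handle_key(self, key):
        """Hypothetical hook, called from the listener thread per keypress."""
        key = key.upper()
        if key == "P":
            self._pause_event.set()    # visible to the main loop immediately
        elif key == "R":
            self._pause_event.clear()
        elif key in ("S", "Q"):
            self._stop_event.set()
        elif key == "W":
            with self._lock:
                self._key = key        # consumed later via get_key()

    def get_key(self):
        """Return and clear the stored key, or None if nothing is pending."""
        with self._lock:
            key, self._key = self._key, None
        return key

    def is_paused(self):
        return self._pause_event.is_set()

    def is_stopped(self):
        return self._stop_event.is_set()
```

Because the flags are set in the listener thread, a main loop that polls `is_paused()` and `is_stopped()` sees the latest state regardless of when the key was pressed.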
- `docs/` folder added to the repository root for visual assets. Contains `docs/README.md` with step-by-step instructions for recording `terminal-demo.gif` (asciinema) and capturing `excel-preview.png`. `docs/.gitkeep` ensures Git tracks the empty folder.
- README Output Preview: placeholder text replaced with an HTML comment block that is ready to uncomment once assets are placed in `docs/`.
- README Project Structure: updated to show the `docs/` folder.
- `pyproject.toml` build backend corrected: changed from the non-existent `setuptools.backends.legacy:build` to the correct `setuptools.build_meta`. The toolkit is now installable via `pip install .` and `pip install -e .`.
- `|| true` removed from the mypy CI step: type checking is now fully enforcing in GitHub Actions.
- `types-PyYAML` and `types-requests` stubs added so mypy resolves third-party annotations cleanly.
- `.gitignore` updated: `.mypy_cache/`, `.coverage`, `.coverage.*`, `.pytest_cache/`, and `htmlcov/` are now excluded from version control.
- UK postcode regex removed from `parser.parse_cards()`: the hardcoded UK pattern `\b[A-Z]{1,2}[0-9][0-9A-Z]?\s*[0-9][A-Z]{2}\b` is replaced by an optional `postcode_regex` config key. The scraper now produces an empty Postcode column for non-UK directories rather than silently failing to match.
- NYC geo example removed from `config.yaml.example`: `geo_bounds` now uses `YOUR_LAT_MIN` / `YOUR_LAT_MAX` / `YOUR_LNG_MIN` / `YOUR_LNG_MAX` placeholders so the config template is genuinely neutral.
- CLI entry points: `html-scraper` and `wp-scraper` commands available after `pip install -e .` from the repo root.
- `engines/__init__.py`, `engines/html/__init__.py`, `engines/wordpress/__init__.py`: engines are now proper Python packages.
- `postcode_regex` config key (WordPress engine): optional regex added to `config.yaml.example`; `parse_cards()` signature updated to accept `cfg` as an optional third argument; `pc_match.group(1)` corrected to `pc_match.group(0)` so config-supplied patterns without capture groups work correctly (fixes `IndexError: no such group` at runtime).
- macOS runner added to CI: `smoke-platforms` job matrix now covers `windows-latest` and `macos-latest`.
- Coverage threshold raised to 80%: both test jobs enforce `--cov-fail-under=80`.
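
The optional `postcode_regex` behaviour and the `group(0)` fix can be illustrated with a small sketch. The helper name `extract_postcode` and the dict-shaped `cfg` are assumptions made for this example; the real logic lives inside `parse_cards()`.

```python
import re


def extract_postcode(text, cfg=None):
    """Return the first postcode match, or "" when no regex is configured.

    Hypothetical helper: `cfg` is assumed to be a plain dict loaded from YAML.
    """
    pattern = (cfg or {}).get("postcode_regex")
    if not pattern:
        return ""  # no regex configured: leave the Postcode column empty
    match = re.search(pattern, text)
    # group(0) is the whole match, so patterns without capture groups work;
    # group(1) would raise IndexError on such patterns.
    return match.group(0) if match else ""
```

With the former UK pattern supplied through config, `extract_postcode("12 High St, London SW1A 1AA", cfg)` returns the postcode, while `cfg=None` or a config without the key yields an empty string.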
- README expanded to 15 sections: added Prerequisites, Performance & Benchmarks, Configuration Reference, Known Limitations, FAQ, and Contributing.
- Output Preview: placeholder text replaced with a representative terminal output block and an Excel sheet description table.
- Configuration Reference tables: full key/type/default/description tables for both engines.
- Contributing section: links to `CONTRIBUTING.md`.
- `TestParseCardsPostcodeRegex` (3 tests): covers regex extraction, absent regex, and `cfg=None` in the WordPress engine.
- `TestCheckStopTime` fixed in both test suites: replaced bare string comparison with time-invariant assertions. `check_stop_time("00:00")` is always `True` (the current time is always at or past midnight); `check_stop_time("")` is always `False` (an empty string disables the feature). This avoids `mock.patch("controls.datetime")`, which fails when `controls` uses `from datetime import datetime` (no module-level `datetime` attribute).
- `TestFetcherRateEta` (2 tests): covers the insufficient-data and sufficient-data branches of `rate_eta()`.
- `TestExtractEmailPriority` (3 tests): verifies that the CF href takes precedence over `data-cfemail`, which in turn takes precedence over plain `mailto:`.
- `TestSafeDecodeEdgeCases` (4 tests): empty bytes, partial gzip header, partial zlib header, plain UTF-8 pass-through.
- `TestDecodeEntitiesComprehensive` (4 tests): numeric entity, named entity, no-entity pass-through, multiple entities.
- `TestProgressBar` (3 tests): bracket delimiters, 100% display, zero-total guard.
- Total test count raised to 120+ across both engines.
- Concurrent profile fetching: `ThreadPoolExecutor(max_workers=3)` fetches all profile pages on a listing/AJAX page in parallel. Each worker uses its own HTTP client/session to avoid shared-state issues. The `--workers N` CLI flag (default: 3, cap: 8) overrides the worker count at runtime. Expected throughput improvement: ~3× (from ~12/min to ~35/min on a typical run).
- tqdm live progress bar: the inner profile-fetch loop now shows a live tqdm bar (`p{page}/{total} [████░░] N rec/s`) that updates in place without breaking log output. Falls back gracefully if tqdm is not installed.
- W key for live stats: pressing `W` during a run prints a full stats snapshot (saved, flagged, email count, phone count, website count, current sector, page, elapsed time) to the log without interrupting the scrape.
- tqdm-safe logging: when tqdm is active, console log output is routed through `tqdm.write()` so the progress bar is never broken by log lines.
- `tqdm>=4.66.0` added to both `requirements.txt` files.
- `--workers N` CLI flag added to both `scraper.py` files.
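
The concurrent-fetch pattern above can be sketched with the standard library alone. Everything except `ThreadPoolExecutor` is an assumption for illustration: `clamp_workers`, `fetch_profile`, and the placeholder "session" dict are hypothetical, and the lower bound of 1 is a guess, since only the cap of 8 is stated.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 3
MAX_WORKERS_CAP = 8

_local = threading.local()  # per-thread storage, so sessions are never shared


def clamp_workers(requested):
    """Keep the --workers value within 1..8 (lower bound is an assumption)."""
    return max(1, min(requested, MAX_WORKERS_CAP))


def _get_session():
    """Return this worker thread's own session (a dict stands in for a client)."""
    if not hasattr(_local, "session"):
        _local.session = {"owner": threading.current_thread().name}
    return _local.session


def fetch_profile(url):
    """Hypothetical worker: a real one would issue the request via the session."""
    session = _get_session()
    return url, f"<html>fetched {url} via {session['owner']}</html>"


def fetch_all(urls, workers=DEFAULT_WORKERS):
    """Fetch every profile URL of one listing page in parallel, in input order."""
    with ThreadPoolExecutor(max_workers=clamp_workers(workers)) as pool:
        return list(pool.map(fetch_profile, urls))
```

`pool.map` returns results in the order of the input URLs, which keeps downstream record handling deterministic even though the fetches overlap.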
- Keyboard controls fully wired up: `P` (pause), `R` (resume), `S` (stop), `W` (stats) are now read via `controller.get_key()` in both the outer page loop and the inner result-processing loop. Previously the `InputController` thread ran but its output was never consumed by the HTML scraper.
- Category from badge images: `scraper.py` now reads `card["services"][0]` (populated by `parse_cards()` from badge image keyword matching) as the record category instead of always falling back to the first category in config. Fixes the bug where all records showed "Residential Sales" regardless of the member's actual service types.
- Location from listing-card meta text: `parse_cards()` now reads a `selectors.card_meta` CSS selector (e.g. `span.meta`) from the config and populates `card["location"]` from it. `scraper.py` writes this to the `Location` field when the profile-page regex returns empty. Fixes the bug where the Location column was always blank for sites like Propertymark.
- `--fresh` CLI flag added to HTML `scraper.py` (previously only in the WordPress engine).
- Keyboard controls fully wired up: `P`, `R`, `S`/`Q`, `W` keys are read in both the outer AJAX page loop and the inner result-processing loop.
- Website crawl timeout reduced from 25 s to 6 s: third-party company websites now use a 6-second timeout. This reduces per-dead-site stall time from up to 7.5 minutes to under 6 seconds.
- `ConnectTimeout` no longer retried: a TCP-level connection timeout means the host is unreachable, so retrying wastes time. HTML engine: `httpx.ConnectTimeout` in `fetcher.safe_get(is_profile=True)` now returns `("", 0)` immediately. WordPress engine: `requests.exceptions.ConnectTimeout` in `fetcher.http_get(is_crawl=True)` now returns `("", 0)` immediately with no sleep. This eliminates the primary cause of the scraper appearing "stuck".
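
The fail-fast rule for connect timeouts can be sketched as below. To stay self-contained, the example uses stand-in exception classes rather than the real `httpx` or `requests` ones, and `fetch_with_policy`, its `retries` and `delay` parameters, and the `do_request` callable are all hypothetical.

```python
import time


class FetchError(Exception):
    """Stand-in for a general connection error."""


class ReadTimeout(FetchError):
    """Stand-in: the host accepted the connection but responded too slowly."""


class ConnectTimeout(FetchError):
    """Stand-in: the TCP connection itself timed out (host unreachable)."""


def fetch_with_policy(do_request, retries=1, delay=0.0):
    """Call do_request() and return (body, status), or ("", 0) on failure.

    `retries` is the number of extra attempts for retryable errors only.
    """
    for attempt in range(retries + 1):
        try:
            return do_request()
        except ConnectTimeout:
            # Checked before the broader class: give up at once, no sleep.
            return "", 0
        except FetchError:
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    return "", 0
```

The handler order matters: because the connect-timeout class is a subclass of the broader error class, it has to be caught first, otherwise an unreachable host would fall into the retry branch.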
- `from bs4 import BeautifulSoup` moved to a top-level import: it was incorrectly placed inside the main `for page` loop in the previous version.
- Unused `key` variable removed: dead code in the record-save block.
- `sound_sequence` / `beep_raw` ordering fixed: `beep_raw` is now defined before `sound_sequence`, which calls it.
- `crawl_for_email()` circular import eliminated: replaced the deferred `from parser import ...` inside `crawl_for_email` with self-contained inline helpers `_is_valid` and `_find_emails`.
- `decode_entities` made public: renamed from `_decode_entities` to `decode_entities` in `parser.py`; `scraper.py` call updated accordingly.
- `seen: set = set()` annotation cleaned: inconsistent whitespace in the inline type annotation corrected.
- Banner now shows `Workers: N concurrent profile threads` and `Keys: P=pause R=resume S=stop W=stats`.
- Per-record log line now shows the `loc:` field with up to 12 characters.
- `make_session()` log level for cookie/proxy messages changed from `INFO` to `DEBUG` to reduce noise in concurrent runs.
- Modular 7-file architecture (scraper, config, fetcher, parser, exporter, checkpoint, controls)
- Config-driven CSS selectors — no code changes needed to retarget any paginated HTML directory
- Two-phase crawl: listing pages → individual profile pages
- Cloudflare XOR email decoding (handles `/cdn-cgi/l/email-protection` and `data-cfemail` patterns)
- Generic phone normalisation (7–15 digit, international E.164-compatible)
- Geographic regex filter on extracted location text
- `command.txt` runtime controls: pause / resume / stop / fresh / status (no interactive terminal needed)
- Exponential backoff + circuit breaker (3 consecutive failures → auto-pause)
- Optional SMTP email verification via SMTP RCPT handshake (`dnspython`)
- Daily auto-stop time and low-disk space guard
- 3-sheet Excel output: Data, Flagged (geo-filtered + failed fetches), Summary
- `CheckpointManager` class with atomic `.tmp` → rename write
- `InputController` cross-platform keyboard listener (Windows msvcrt + Unix select/tty)
- `--config` CLI flag
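
For readers unfamiliar with the Cloudflare email obfuscation mentioned above, the decoding step can be sketched as follows. The function names are hypothetical and this is not necessarily how the project's parser is written. It relies on the commonly observed encoding: a hex string whose first byte is the XOR key for all remaining bytes.

```python
def decode_cfemail(encoded_hex):
    """Decode a data-cfemail style hex string into a plain address."""
    data = bytes.fromhex(encoded_hex)
    key = data[0]  # the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def decode_cf_href(href):
    """Decode the fragment of a /cdn-cgi/l/email-protection#<hex> link."""
    _, _, fragment = href.partition("#")
    return decode_cfemail(fragment) if fragment else ""
```

For example, `decode_cfemail("107150723e73")` uses key `0x10` and yields `a@b.c`; the same hex string appended after `#` in a protection link decodes identically through `decode_cf_href`.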
- Modular 7-file architecture (scraper, config, fetcher, parser, exporter, checkpoint, controls)
- Config-driven sectors and AJAX parameters — retarget any WordPress directory via config only
- WordPress nonce auto-extraction (3 regex patterns)
- Mid-run nonce refresh on empty/failed AJAX response
- `admin-ajax.php` POST pagination
- Manual gzip/zlib decompression of AJAX responses
- Geographic bounding-box filter using lat/lng from AJAX markers
- Email enrichment via company website crawl
- Deduplication by (name, postcode)
- Exponential-backoff retry on all GET and POST requests
- `P`/`R`/`S`/`Q` keyboard controls (Windows msvcrt + Unix select/tty)
- Configurable Excel header colour
- 3-sheet Excel output: Data, Flagged, Summary
- `CheckpointManager` with atomic `.tmp` → rename write
- `--fresh` and `--config` CLI flags
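
The atomic `.tmp` → rename write used by `CheckpointManager` in both engines follows a common pattern, sketched here under assumptions: the function names and the JSON file format are illustrative, not taken from the project.

```python
import json
import os


def save_checkpoint(path, state):
    """Write `state` to `path` so readers never see a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, path)  # atomic swap on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or an empty dict if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}
```

If the process is killed mid-write, only the `.tmp` file is affected; the previous checkpoint stays intact and the run can resume from it.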