This script monitors changes on a set of websites, comparing current content and links against previous snapshots. It identifies added or removed text and links, saves differences, and generates a formatted Excel report summarizing changes. It supports rebuilding reports from saved diffs or using sample data for testing.
Function Descriptions
- main(): Controls execution flow: uses sample data, rebuilds from diffs, or scrapes sites, then creates the report.
- collect_records(from_diffs, with_paths): Either rebuilds records from change logs (diffs) or scrapes websites and processes entries.
- get_urls(path, sheet, website_col, grantee_col, agency_col): Reads an Excel sheet to collect grantee names, agencies, and website URLs.
- process_entry(entry): Fetches a site, compares it to its previous snapshot, saves diffs, and returns a structured summary of changes.
- fetch_site(url): Uses Selenium to load a website, capture its HTML, clean it, and extract text blocks and links.
- remove_html_noise(document): Removes unwanted HTML elements (scripts, styles) to reduce noise before extracting content.
- extract_blocks(document, base_url): Extracts meaningful text blocks and associated links from cleaned HTML, skipping noise keywords.
- css_path(el): Generates a CSS-like path for an HTML element, used to locate text blocks.
- url_to_filename(url): Converts a URL into a safe filename for storing snapshots.
- load_snapshot(url) / save_snapshot(fname, data): Loads or saves JSON snapshots of site content for comparison.
- compare(old, new): Compares old and new snapshots to find added or removed text and links.
- load_diff_records(with_paths): Loads previously saved diffs to rebuild a report without re-scraping.
- make_report(df): Generates an Excel report from a DataFrame, applying formatting, hyperlinks, and summary styling.
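
The snapshot helpers listed above (url_to_filename, save_snapshot, load_snapshot) could look roughly like the sketch below. This is only an illustration, not the script's actual code: the snapshots directory name, the slug-plus-hash naming scheme, and the optional directory parameter are all assumptions added for the example.

```python
import hashlib
import json
import re
from pathlib import Path
from typing import Optional

# Assumed storage location; the real script may use a different folder.
SNAPSHOT_DIR = Path("snapshots")


def url_to_filename(url: str) -> str:
    """Map a URL to a filesystem-safe filename."""
    # Replace every run of characters outside [A-Za-z0-9._-] with "_",
    # then cap the length so long URLs do not exceed filename limits.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_")[:80]
    # A short hash of the full URL keeps truncated slugs distinct.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
    return f"{slug}_{digest}.json"


def save_snapshot(fname: str, data: dict, directory: Optional[Path] = None) -> None:
    """Write a snapshot as JSON (the directory argument is hypothetical)."""
    target = directory or SNAPSHOT_DIR
    target.mkdir(parents=True, exist_ok=True)
    text = json.dumps(data, ensure_ascii=False, indent=2)
    (target / fname).write_text(text, encoding="utf-8")


def load_snapshot(url: str, directory: Optional[Path] = None) -> Optional[dict]:
    """Return the stored snapshot for a URL, or None on the first run."""
    path = (directory or SNAPSHOT_DIR) / url_to_filename(url)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Returning None when no snapshot exists lets the caller treat a first visit as "everything is new" instead of raising an error.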
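
The comparison step described for compare(old, new) can be sketched as a pair of set differences. The snapshot shape used here (a dict with "blocks" and "links" lists of strings) and the result keys are assumptions made for the example; the real script may store richer entries, such as blocks paired with their CSS paths.

```python
from typing import Optional


def compare(old: Optional[dict], new: dict) -> dict:
    """Report items added to or removed from each snapshot field."""
    old = old or {}  # a missing snapshot means everything counts as added
    result = {}
    for key in ("blocks", "links"):  # assumed snapshot fields
        # dict.fromkeys drops duplicates while keeping page order.
        before = list(dict.fromkeys(old.get(key, [])))
        after = list(dict.fromkeys(new.get(key, [])))
        before_set, after_set = set(before), set(after)
        result[f"added_{key}"] = [x for x in after if x not in before_set]
        result[f"removed_{key}"] = [x for x in before if x not in after_set]
    return result
```

Exact string matching is the simplest choice; it reports a reworded paragraph as one removal plus one addition rather than as an edit.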