intercalaris/site_change_tracker

This script monitors changes on a set of websites, comparing current content and links against previous snapshots. It identifies added or removed text and links, saves differences, and generates a formatted Excel report summarizing changes. It supports rebuilding reports from saved diffs or using sample data for testing.
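The core comparison step can be illustrated with plain set differences. This is a minimal sketch, not the repository's code: the function name compare_snapshots and the snapshot layout (a dict with "text" and "links" lists) are assumptions made for illustration.

```python
def compare_snapshots(old, new):
    """Report what was added to or removed from a site snapshot.

    Assumed snapshot shape (hypothetical, not confirmed by the repository):
    {"text": [str, ...], "links": [str, ...]}
    """
    old_text, new_text = set(old.get("text", [])), set(new.get("text", []))
    old_links, new_links = set(old.get("links", [])), set(new.get("links", []))
    return {
        # Items present now but absent from the previous snapshot
        "added_text": sorted(new_text - old_text),
        "added_links": sorted(new_links - old_links),
        # Items present in the previous snapshot but missing now
        "removed_text": sorted(old_text - new_text),
        "removed_links": sorted(old_links - new_links),
    }


previous = {"text": ["About us", "Apply by May"], "links": ["https://example.org/apply"]}
current = {"text": ["About us", "Apply by June"], "links": ["https://example.org/apply"]}
diff = compare_snapshots(previous, current)
```

Here diff reports "Apply by June" as added text and "Apply by May" as removed text, with no link changes. Set differences ignore ordering and duplicates, so a block that only moves on the page would not count as a change under this approach.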

Function Descriptions

  1. main(): Controls execution flow: uses sample data, rebuilds from diffs, or scrapes sites, then creates the report.
  2. collect_records(from_diffs, with_paths): Either rebuilds records from change logs (diffs) or scrapes websites and processes entries.
  3. get_urls(path, sheet, website_col, grantee_col, agency_col): Reads an Excel sheet to collect grantee names, agencies, and website URLs.
  4. process_entry(entry): Fetches a site, compares it to its previous snapshot, saves diffs, and returns a structured summary of changes.
  5. fetch_site(url): Uses Selenium to load a website, capture its HTML, clean it, and extract text blocks and links.
  6. remove_html_noise(document): Removes unwanted HTML elements (scripts, styles) to reduce noise before extracting content.
  7. extract_blocks(document, base_url): Extracts meaningful text blocks and associated links from cleaned HTML, skipping noise keywords.
  8. css_path(el): Generates a CSS-like path for an HTML element, used to locate text blocks.
  9. url_to_filename(url): Converts a URL into a safe filename for storing snapshots.
  10. load_snapshot(url) / save_snapshot(fname, data): Loads or saves JSON snapshots of site content for comparison.
  11. compare(old, new): Compares old and new snapshots to find added or removed text and links.
  12. load_diff_records(with_paths): Loads previously saved diffs to rebuild a report without re-scraping.
  13. make_report(df): Generates an Excel report from a DataFrame, applying formatting, hyperlinks, and summary styling.
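
The snapshot helpers in items 9 and 10 might look roughly like the following. This is a hedged sketch using only the standard library: the function names match the list above, but the bodies, the SNAPSHOT_DIR location, the extra directory parameter, and the filename scheme are all assumptions, not the repository's actual implementation.

```python
import json
import re
from pathlib import Path
from urllib.parse import urlparse

SNAPSHOT_DIR = Path("snapshots")  # assumed storage location (hypothetical)


def url_to_filename(url):
    """Turn a URL into a filesystem-safe JSON filename (assumed scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of characters outside a conservative safe set into "_"
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "site") + ".json"


def save_snapshot(fname, data, directory=SNAPSHOT_DIR):
    """Write snapshot data as JSON, creating the directory if needed."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / fname).write_text(json.dumps(data, indent=2), encoding="utf-8")


def load_snapshot(url, directory=SNAPSHOT_DIR):
    """Return the stored snapshot for a URL, or None if none exists yet."""
    path = Path(directory) / url_to_filename(url)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Returning None for a missing snapshot lets a caller treat the first visit to a site as a baseline rather than as a change. This sketch drops the query string when building the filename, so two URLs that differ only in their query would share one snapshot; a real implementation may need to handle that case.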
