KBNLwikimedia/photographs-by-jacob-olie

License: CC0-1.0 · Python 3.8+ · MediaWiki API · Memorix Mediabank API · Wikimedia Commons

Photographs by Jacob Olie — Wikimedia Commons to Stadsarchief Amsterdam link rot fixer and metadata extractor

For roughly 3,600 Commons photos by Jacob Olie (1834–1905), this tool replaces outdated, broken source links with links to the current source records on the Stadsarchief Amsterdam Beeldbank, and extracts descriptive metadata from those records.

Jacob Olie was an Amsterdam photographer whose extensive body of work — street scenes, cityscapes, and portraits — is a key visual record of late-19th-century Amsterdam. Over 3,600 of his photographs are available on Wikimedia Commons, with source references pointing to the Stadsarchief Amsterdam collection.

Example photographs by Jacob Olie, from the Stadsarchief Amsterdam collection: Damrak 38-41, Keizersgracht, Brouwersgracht, Dam-noordzijde, Paleis voor Volksvlijt, Centraal Station.

What this does

The pipeline covers four things:

  1. The old Beeldbank URLs (e.g. http://beeldbank.amsterdam.nl/afbeelding/10019A001542) embedded in the {{Photograph}} templates on Commons (example) no longer resolve to the correct pages in the Stadsarchief Amsterdam image bank. This pipeline extracts those URLs and resolves them to the new persistent detail page URLs on beta.archief.amsterdam.

  2. From those new detail pages, 11 descriptive, structured metadata fields are extracted and added to the Excel workbook; see the column details below.

  3. The new beta.archief.amsterdam URLs are written back into the source = parameter of the {{Photograph}} template on each Commons file page, fixing the link rot at the source. See REPLACE_SOURCE_URLS.md for the full design and operating procedure.

  4. The pipeline can be adapted to other Stadsarchief Amsterdam collections, or to other Memorix-based archives; see MANUAL.md for detailed usage instructions.
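As a rough illustration of the first step, the sketch below pulls the legacy Beeldbank identifier out of a page's wikitext. It is not taken from extract_sources.py: the function name, the regular expression, and the assumption that identifiers are purely alphanumeric are all illustrative, based only on the example URL shown above.

```python
import re
from typing import Optional

# Assumed shape of a legacy URL, inferred from the single example above:
# http://beeldbank.amsterdam.nl/afbeelding/<alphanumeric identifier>
OLD_URL_RE = re.compile(
    r"https?://beeldbank\.amsterdam\.nl/afbeelding/([A-Za-z0-9]+)"
)


def extract_old_identifier(wikitext: str) -> Optional[str]:
    """Return the identifier from the first legacy Beeldbank URL, or None."""
    match = OLD_URL_RE.search(wikitext)
    return match.group(1) if match else None


# Hypothetical wikitext; real {{Photograph}} templates carry more parameters.
sample = """{{Photograph
 |photographer = Jacob Olie
 |source = http://beeldbank.amsterdam.nl/afbeelding/10019A001542
}}"""

print(extract_old_identifier(sample))  # 10019A001542
```

Searching the whole wikitext rather than parsing the template keeps the sketch short, at the cost of also matching legacy URLs that appear outside the source = parameter.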

Input: a list of Wikimedia Commons filenames (filelist.txt)

Output: an Excel workbook (jacob_olie_sources.xlsx) with 18 columns. Column order is not load-bearing — every script resolves columns by header name, so the workbook can be safely reordered. The order shown here matches the current workbook:

Which URL should I use?

The Beta Archief Amsterdam Detail URL is the canonical, user-facing URL for each Commons image and is present on 3,640 of 3,656 rows (~99.6%). It is the only URL a reader of this workbook should look at or share. The other three URL columns should not be used to refer to a record: the Archief Amsterdam Detail URL is the same record on the legacy interface and is kept for traceability only; the Source URL is the original (now broken) Beeldbank URL that was replaced and is deprecated; the Archief Amsterdam URL is an intermediate search-result URL used during pipeline construction.

| Column | Content |
|---|---|
| Filename | Commons filename (e.g. File:Jacob Olie 001.jpg) |
| File URL (Commons) | Link to the Commons file page |
| Beta Archief Amsterdam Detail URL | Canonical detail page URL on beta.archief.amsterdam (this is the one to use) |
| Archief Amsterdam Detail URL | Same record on the legacy archief.amsterdam interface (kept for traceability, do not use) |
| Source URL | Original (now broken) Beeldbank URL from the {{Photograph}} template (deprecated, do not use) |
| Afbeeldingsbestand (identifier) | Image file identifier |
| Archief Amsterdam URL | Transformed search URL used during pipeline construction (intermediate, do not use) |
| Titel (dc_title) | Title of the photograph |
| Beschrijving (dc_description) | Description |
| Datering (dc_date) | Date of the photograph |
| Documenttype (sk_documenttype) | Document type (e.g. "foto") |
| Vervaardiger (sk_vervaardiger) | Creator |
| Collectie (dc_provenance) | Collection name |
| Geografische aanduiding (geografische_aanduiding) | Geographic location (street, area) |
| Gebouw (sk_gebouw) | Building name(s) |
| Inventarissen (dc_source) | Link to the archival inventory |
| Rechthebbende (sr_rechthebbende) | Rights holder |
| URL replacement succes | Outcome of writing the Beta URL back to Commons: TRUE (written or already present) or FALSE (could not or did not write; see the run log for the reason) |
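Resolving columns by header name, as described above, can be sketched as follows. This is a minimal illustration rather than the scripts' actual code: header_index is a hypothetical helper, the rows are plain tuples standing in for worksheet rows, and the title value is placeholder data.

```python
from typing import Dict, Sequence


def header_index(header_row: Sequence[object]) -> Dict[str, int]:
    """Map each non-empty header cell to its 0-based column position."""
    index: Dict[str, int] = {}
    for position, cell in enumerate(header_row):
        name = "" if cell is None else str(cell).strip()
        if name:
            index[name] = position
    return index


# Two layouts of the same (placeholder) data, with the columns swapped.
sheet_a = [
    ("Filename", "Titel (dc_title)"),
    ("File:Jacob Olie 001.jpg", "example title"),
]
sheet_b = [
    ("Titel (dc_title)", "Filename"),
    ("example title", "File:Jacob Olie 001.jpg"),
]

for sheet in (sheet_a, sheet_b):
    columns = header_index(sheet[0])
    # The lookup goes through the header name, so both layouts print the same value.
    print(sheet[1][columns["Filename"]])
```

Because every lookup goes through the header text, renaming a header is a breaking change even though reordering is not.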

Quick start

pip install requests openpyxl python-dotenv pyyaml

# Step 1: Extract source URLs from Wikimedia Commons
python extract_sources.py

# Step 2: Add transformed Archief Amsterdam search URLs (see MANUAL.md)

# Step 3: Resolve to detail page URLs via the Memorix API
python add_detail_urls.py

# Step 4: Extract full metadata from the Memorix API
python add_metadata.py

# Step 5: Write the new Beta URLs back to the {{Photograph}} source= on Commons.
# Defaults to dry-run; --live performs the actual edits.
# See REPLACE_SOURCE_URLS.md for the full procedure.
python replace_source_urls.py --sample 2          # preview 2 proposed diffs
python replace_source_urls.py --live --limit 2    # live test on 2 rows
python replace_source_urls.py --live              # full run (resumable)

Scripts

| Script | Description |
|---|---|
| extract_sources.py | Queries the MediaWiki API to extract source URLs from {{Photograph}} templates on Commons file pages. Processes in batches of 50 with rate limiting. |
| add_detail_urls.py | Queries the Memorix Mediabank API to resolve Stadsarchief Amsterdam identifiers to detail page UUIDs. |
| add_metadata.py | Queries the Memorix Mediabank API to extract 11 metadata fields (title, date, description, location, etc.) for each record. |
| replace_source_urls.py | Writes the new beta.archief.amsterdam URLs back into the source = parameter of the {{Photograph}} template on each Commons file page. Dry-run by default; --live performs the edits. Configurable via replace_config.yaml. |
| integrity_check.py | Read-only post-flight check over the workbook. Verifies structural invariants (TRUE rows must have a Beta URL, UUIDs match between Detail URL columns, URL hosts/paths are well-formed, etc.) and reports coverage-quality observations (per-field empty counts, TRUE rows missing required metadata, duplicates). Writes a copy of the report to integrity_report.txt. |

URL replacement on Wikimedia Commons

Once the workbook contains current Beta Archief Amsterdam Detail URL values, the script replace_source_urls.py writes those URLs back into the source = parameter of the {{Photograph}} template on each Commons file page. The script defaults to dry-run; live edits require an explicit --live flag.
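The core of such a writeback, computing a proposed edit and showing it as a diff without saving anything, might look like the sketch below. It is an assumption-laden illustration, not the logic of replace_source_urls.py: the function name, the one-parameter-per-line template layout, the https scheme, and the all-zero placeholder UUID are all made up for the example.

```python
import difflib
import re
from typing import Optional

# Assumes one template parameter per line, e.g. " |source = <value>".
SOURCE_LINE_RE = re.compile(
    r"^([ \t]*\|[ \t]*source[ \t]*=[ \t]*)(.*)$", re.MULTILINE
)


def propose_source_edit(wikitext: str, old_url: str, new_url: str) -> Optional[str]:
    """Return wikitext with old_url swapped on the source line, or None if unchanged."""

    def swap(match):
        return match.group(1) + match.group(2).replace(old_url, new_url)

    updated = SOURCE_LINE_RE.sub(swap, wikitext)
    return updated if updated != wikitext else None


OLD_URL = "http://beeldbank.amsterdam.nl/afbeelding/10019A001542"
# Placeholder UUID and assumed scheme; real values come from the workbook.
NEW_URL = "https://beta.archief.amsterdam/detail/00000000-0000-0000-0000-000000000000"

page = (
    "{{Photograph\n"
    " |photographer = Jacob Olie\n"
    " |source = " + OLD_URL + "\n"
    "}}\n"
)

proposed = propose_source_edit(page, OLD_URL, NEW_URL)
if proposed is not None:
    # Dry run: print the proposed change instead of saving it.
    diff = difflib.unified_diff(
        page.splitlines(), proposed.splitlines(), "before", "after", lineterm=""
    )
    print("\n".join(diff))
```

Returning None when nothing changes makes a rerun over already-updated pages a no-op, which fits a resumable run.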

Artifacts:

  • replace_source_urls.py — the replacement script.
  • replace_config.yaml — runtime parameters (paths, column headers, delays, retries, edit summary, bot flag).
  • REPLACE_SOURCE_URLS.md — full design, all quality and safety mechanisms, and the operating procedure.
  • A new column URL replacement succes is added to the workbook with TRUE or FALSE per row.
  • A timestamped backup of the workbook taken at the start of every run.
  • A per-row CSV run log with timestamps, before/after URLs, HTTP status, action taken, error, and new revision id.

Wikimedia credentials are read from .env (COMMONS_USERNAME, COMMONS_PASSWORD).

Integrity checking

After the pipeline runs, integrity_check.py verifies the workbook is internally consistent. It is read-only — openpyxl opens the file with read_only=True and never writes back.

python integrity_check.py

The script prints the full report to stdout and writes a copy to integrity_report.txt. It runs two layers of checks:

Structural invariants (violation counts must all be 0)

  • Every TRUE row has a Beta URL; no row has an empty result cell.
  • UUIDs in Archief Amsterdam Detail URL and Beta Archief Amsterdam Detail URL match.
  • Per-column URL format validation: Beta Archief Amsterdam Detail URL: beta.archief.amsterdam/detail/<uuid>; Archief Amsterdam Detail URL: archief.amsterdam/beeldbank/detail/<uuid>; File URL (Commons): commons.wikimedia.org/wiki/File:…; Inventarissen: archief.amsterdam/archief/<fonds>/<id>.
  • No leftover ERROR ... strings in metadata cells; no leading/trailing whitespace; no control / non-printable characters (BOMs, zero-width chars, line separators, etc.); no empty Filename; no duplicate Filenames; the File URL (Commons) matches the Filename.
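Two of the invariants above, URL format and matching UUIDs, can be approximated with regular expressions. The sketch below is illustrative and not integrity_check.py's implementation: the function name is hypothetical, the http/https scheme is an assumption, trailing path segments after the UUID are tolerated by choice, and the UUIDs are placeholders.

```python
import re
from typing import List

UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"
# Host and path patterns follow the list above; the scheme is an assumption.
BETA_RE = re.compile(r"^https?://beta\.archief\.amsterdam/detail/(" + UUID + r")")
LEGACY_RE = re.compile(r"^https?://archief\.amsterdam/beeldbank/detail/(" + UUID + r")")


def check_detail_urls(beta_url: str, legacy_url: str) -> List[str]:
    """Return a list of violations for one row; an empty list means the row passes."""
    problems: List[str] = []
    beta = BETA_RE.match(beta_url)
    legacy = LEGACY_RE.match(legacy_url)
    if beta is None:
        problems.append("malformed Beta URL")
    if legacy is None:
        problems.append("malformed legacy URL")
    if beta and legacy and beta.group(1).lower() != legacy.group(1).lower():
        problems.append("UUID mismatch")
    return problems


# Placeholder UUIDs for illustration only.
UUID_1 = "00000000-0000-0000-0000-000000000001"
UUID_2 = "00000000-0000-0000-0000-000000000002"

print(check_detail_urls(
    "https://beta.archief.amsterdam/detail/" + UUID_1,
    "https://archief.amsterdam/beeldbank/detail/" + UUID_2,
))  # ['UUID mismatch']
```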

Coverage-quality observations (informational, may be non-zero)

  • Outcome distribution and FALSE sub-bucket breakdown (no_beta_url, no_filename, no_file_url, no_source_url, other).
  • Metadata coverage histogram (out of 11 fields per row).
  • Per-field empty-cell count — pinpoints which fields drive the gaps.
  • TRUE rows missing one of the required fields (Titel, Datering, Documenttype, Vervaardiger, Collectie, Afbeeldingsbestand).
  • Duplicate Source URLs and Beta URLs — multiple Commons files pointing at the same record (often legitimate; worth a spot-check).
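The coverage histogram and the per-field empty-cell count can be derived in one pass over the rows. The following is a small sketch under stated assumptions: rows are represented as dictionaries keyed by header, only three of the 11 fields are used, and the cell values are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def coverage(rows: Iterable[Row], fields: List[str]) -> Tuple[Counter, Counter]:
    """Return (filled-fields-per-row histogram, empty-cell count per field)."""
    histogram: Counter = Counter()
    empty_per_field: Counter = Counter()
    for row in rows:
        filled = 0
        for field in fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                empty_per_field[field] += 1
            else:
                filled += 1
        histogram[filled] += 1
    return histogram, empty_per_field


# Shortened field list and made-up values, for illustration only.
fields = ["Titel (dc_title)", "Datering (dc_date)", "Gebouw (sk_gebouw)"]
rows = [
    {"Titel (dc_title)": "a", "Datering (dc_date)": "1890", "Gebouw (sk_gebouw)": ""},
    {"Titel (dc_title)": "b", "Datering (dc_date)": None, "Gebouw (sk_gebouw)": None},
    {"Titel (dc_title)": "c", "Datering (dc_date)": "1895", "Gebouw (sk_gebouw)": "  "},
]

histogram, empty_per_field = coverage(rows, fields)
print(dict(histogram))        # {2: 2, 1: 1}
print(dict(empty_per_field))  # {'Gebouw (sk_gebouw)': 3, 'Datering (dc_date)': 1}
```

Whitespace-only cells count as empty here; that is a choice made for the sketch, not a documented rule of the check.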

See INTEGRITY_REPORT.md for the latest results and a per-section explanation of what every check looks for.

Documentation

See MANUAL.md for detailed usage instructions, including how to adapt this pipeline for other collections or other Memorix-based archives. See REPLACE_SOURCE_URLS.md for the design and operating procedure of the Commons writeback step.

APIs used

  • MediaWiki Action API — to fetch wikitext from Wikimedia Commons (batched, with proper User-Agent)
  • Memorix Mediabank API by Vitec — to resolve record identifiers to UUIDs and extract metadata (public API key, embedded in the Beeldbank page source)
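For the MediaWiki side, a batched wikitext request can be assembled as sketched below. The query parameters (action=query, prop=revisions, rvprop=content, rvslots=main, pipe-separated titles) are standard Action API parameters, but the function name and the User-Agent string are hypothetical, and the sketch only builds the request without sending it. The Memorix API is left out because its endpoints are not described here.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

API_URL = "https://commons.wikimedia.org/w/api.php"
# Hypothetical User-Agent; use one that identifies your tool and a contact.
USER_AGENT = "jacob-olie-example/0.1 (contact: you@example.org)"


def build_wikitext_request(titles: List[str]) -> Request:
    """Build (but do not send) a request for the wikitext of up to 50 pages."""
    if not 1 <= len(titles) <= 50:
        raise ValueError("expected between 1 and 50 titles per request")
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    return Request(API_URL + "?" + urlencode(params), headers={"User-Agent": USER_AGENT})


request = build_wikitext_request(["File:Jacob Olie 001.jpg"])
query = parse_qs(urlsplit(request.full_url).query)
print(query["titles"])  # ['File:Jacob Olie 001.jpg']
```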

Rate limiting

All scripts include rate limiting to be respectful to the servers:

  • Wikimedia Commons (read, batched): 1 second between batches of 50
  • Wikimedia Commons (write, per edit): 2 seconds between edits, plus maxlag=5 and exponential backoff on retry
  • Memorix API: 0.5 seconds between individual requests
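The batching and pacing described above can be sketched as follows. The helper names are hypothetical, and the backoff base, retry count, and cap are assumptions (the real values would come from the scripts or replace_config.yaml). The sleep function is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def backoff_delays(base: float, retries: int, cap: float = 60.0) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def run_batched(
    items: Sequence[str],
    handle: Callable[[List[str]], None],
    pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `handle` once per batch, pausing between batches (not before the first)."""
    for number, batch in enumerate(batches(items)):
        if number > 0:
            sleep(pause)
        handle(batch)


titles = ["File:Example %d.jpg" % n for n in range(120)]  # placeholder names
sizes: List[int] = []
pauses: List[float] = []
# Record the pauses instead of sleeping, so the demo finishes instantly.
run_batched(titles, lambda batch: sizes.append(len(batch)), sleep=pauses.append)

print(sizes)                                # [50, 50, 20]
print(pauses)                               # [1.0, 1.0]
print(backoff_delays(base=2.0, retries=4))  # [2.0, 4.0, 8.0, 16.0]
```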

License

This repository is dedicated to the public domain under the CC0 1.0 Universal license. The photographs by Jacob Olie are in the public domain.
