Photographs by Jacob Olie — Wikimedia Commons to Stadsarchief Amsterdam link rot fixer and metadata extractor
For over 3,600 Commons photos by Jacob Olie (1834–1905), this tool replaces outdated, broken source links with links to the current records on the Stadsarchief Amsterdam Beeldbank, and extracts descriptive metadata from those records.
Jacob Olie was an Amsterdam photographer whose extensive body of work — street scenes, cityscapes, and portraits — is a key visual record of late-19th-century Amsterdam. Over 3,600 of his photographs are available on Wikimedia Commons, with source references pointing to the Stadsarchief Amsterdam collection.
Example photographs by Jacob Olie, from the Stadsarchief Amsterdam collection: Damrak 38-41, Keizersgracht, Brouwersgracht, Dam-noordzijde, Paleis voor Volksvlijt, Centraal Station.
This pipeline does four things:

- The old Beeldbank URLs (e.g. `http://beeldbank.amsterdam.nl/afbeelding/10019A001542`) embedded in the `{{Photograph}}` templates on Commons no longer resolve to the correct pages in the Stadsarchief Amsterdam image bank. This pipeline extracts those URLs and resolves them to the new persistent detail page URLs on `beta.archief.amsterdam`.
- From those new detail pages, 11 descriptive, structured metadata fields are extracted and added to the Excel workbook; see the column details below.
- The new `beta.archief.amsterdam` URLs are written back into the `source =` parameter of the `{{Photograph}}` template on each Commons file page, fixing the link rot at the source. See REPLACE_SOURCE_URLS.md for the full design and operating procedure.
- This pipeline can easily be adapted for other collections from Stadsarchief Amsterdam, or for other Memorix-based archives; see MANUAL.md for detailed usage instructions.
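As a rough illustration of the first step, the record identifier can be recovered from a legacy Beeldbank URL before it is looked up in the new image bank. This is a minimal sketch, not the code in `extract_sources.py`: the function name `beeldbank_identifier` is hypothetical, and it assumes the identifier is always the final path segment, as in the example URL above.

```python
from urllib.parse import urlparse


def beeldbank_identifier(old_url: str) -> str:
    """Return the last path segment of a legacy Beeldbank URL.

    Assumption: the identifier is the final path segment, as in
    http://beeldbank.amsterdam.nl/afbeelding/10019A001542
    """
    path = urlparse(old_url).path.rstrip("/")
    identifier = path.rsplit("/", 1)[-1]
    if not identifier:
        raise ValueError(f"no identifier found in {old_url!r}")
    return identifier


print(beeldbank_identifier("http://beeldbank.amsterdam.nl/afbeelding/10019A001542"))
# prints: 10019A001542
```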
Input: a list of Wikimedia Commons filenames (filelist.txt)
Output: an Excel workbook (jacob_olie_sources.xlsx) with 18 columns. Column order is not load-bearing — every script resolves columns by header name, so the workbook can be safely reordered. The order shown here matches the current workbook:
Which URL should I use? The `Beta Archief Amsterdam Detail URL` is the canonical, user-facing URL for each Commons image and is present on 3,640 of 3,656 rows (~99.6%). It is the only URL a reader of this workbook should look at or share. The `Archief Amsterdam Detail URL` is the same record on the legacy interface and is kept for traceability only; the `Source URL` is the original (now broken) Beeldbank URL we replaced and is deprecated; the `Archief Amsterdam URL` is an intermediate search-result URL used during pipeline construction. None of these three should be used to refer to a record.
| Column | Content |
|---|---|
| Filename | Commons filename (e.g. File:Jacob Olie 001.jpg) |
| File URL (Commons) | Link to the Commons file page |
| Beta Archief Amsterdam Detail URL | Canonical detail page URL on beta.archief.amsterdam — this is the one to use |
| Archief Amsterdam Detail URL | Same record on the legacy archief.amsterdam interface — kept for traceability, do not use |
| Source URL | Original (now broken) Beeldbank URL from the {{Photograph}} template — deprecated, do not use |
| Afbeeldingsbestand (identifier) | Image file identifier |
| Archief Amsterdam URL | Transformed search URL used during pipeline construction — intermediate, do not use |
| Titel (dc_title) | Title of the photograph |
| Beschrijving (dc_description) | Description |
| Datering (dc_date) | Date of the photograph |
| Documenttype (sk_documenttype) | Document type (e.g. "foto") |
| Vervaardiger (sk_vervaardiger) | Creator |
| Collectie (dc_provenance) | Collection name |
| Geografische aanduiding (geografische_aanduiding) | Geographic location (street, area) |
| Gebouw (sk_gebouw) | Building name(s) |
| Inventarissen (dc_source) | Link to the archival inventory |
| Rechthebbende (sr_rechthebbende) | Rights holder |
| URL replacement succes | Outcome of writing the Beta URL back to Commons: TRUE (written or already present) or FALSE (could not / did not write — see the run log for the reason) |
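Because columns are resolved by header name rather than by position, a reader of the workbook can do the same when picking out the canonical URL. The sketch below works on plain tuples shaped like the rows that openpyxl's `ws.iter_rows(values_only=True)` yields; the helper names and the placeholder row values are hypothetical and are not taken from the real workbook.

```python
BETA_COLUMN = "Beta Archief Amsterdam Detail URL"


def header_index(header_row):
    """Map each header name to its zero-based position in the row."""
    return {name: pos for pos, name in enumerate(header_row) if name}


def canonical_url(row, columns):
    """Return the Beta detail URL of a data row, or None if the cell is empty."""
    return row[columns[BETA_COLUMN]] or None


# Placeholder data (not real records), shaped like openpyxl value tuples.
PLACEHOLDER_URL = (
    "https://beta.archief.amsterdam/detail/00000000-0000-0000-0000-000000000000"
)
header = ("Filename", "Source URL", BETA_COLUMN)
row = ("File:Example.jpg", "http://beeldbank.amsterdam.nl/afbeelding/EXAMPLE", PLACEHOLDER_URL)

columns = header_index(header)
print(canonical_url(row, columns))
```

Looking cells up through the `columns` mapping is what makes reordering safe: moving a column changes the mapping, not the code that reads it.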
```bash
pip install requests openpyxl python-dotenv pyyaml

# Step 1: Extract source URLs from Wikimedia Commons
python extract_sources.py

# Step 2: Add transformed Archief Amsterdam search URLs (see MANUAL.md)

# Step 3: Resolve to detail page URLs via the Memorix API
python add_detail_urls.py

# Step 4: Extract full metadata from the Memorix API
python add_metadata.py

# Step 5: Write the new Beta URLs back to the {{Photograph}} source= on Commons.
# Defaults to dry-run; --live performs the actual edits.
# See REPLACE_SOURCE_URLS.md for the full procedure.
python replace_source_urls.py --sample 2         # preview 2 proposed diffs
python replace_source_urls.py --live --limit 2   # live test on 2 rows
python replace_source_urls.py --live             # full run (resumable)
```

| Script | Description |
|---|---|
| `extract_sources.py` | Queries the MediaWiki API to extract source URLs from `{{Photograph}}` templates on Commons file pages. Processes in batches of 50 with rate limiting. |
| `add_detail_urls.py` | Queries the Memorix Mediabank API to resolve Stadsarchief Amsterdam identifiers to detail page UUIDs. |
| `add_metadata.py` | Queries the Memorix Mediabank API to extract 11 metadata fields (title, date, description, location, etc.) for each record. |
| `replace_source_urls.py` | Writes the new `beta.archief.amsterdam` URLs back into the `source =` parameter of the `{{Photograph}}` template on each Commons file page. Dry-run by default; `--live` performs the edits. Configurable via `replace_config.yaml`. |
| `integrity_check.py` | Read-only post-flight check over the workbook. Verifies structural invariants (TRUE rows must have a Beta URL, UUIDs match between Detail URL columns, URL hosts/paths are well-formed, etc.) and reports coverage-quality observations (per-field empty counts, TRUE rows missing required metadata, duplicates). Writes a copy of the report to `integrity_report.txt`. |
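To illustrate the batching that `extract_sources.py` is described as doing, the sketch below splits a file list into groups of 50 and builds the query parameters for one MediaWiki Action API request per group. It is an assumption-laden outline rather than the script's actual code: the helper names are hypothetical, and no request is sent here.

```python
from typing import Iterator


def batches(items: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def query_params(titles: list[str]) -> dict[str, str]:
    """Build Action API parameters that fetch the wikitext of several pages."""
    return {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(titles),  # multiple titles are pipe-separated
    }


# Hypothetical file list standing in for filelist.txt.
filenames = [f"File:Example {n}.jpg" for n in range(120)]
for batch in batches(filenames):
    params = query_params(batch)
    print(len(batch), "titles in this request")
```

In a real run, each `params` dict would be passed to an HTTP client together with a descriptive User-Agent, with a pause between batches.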
Once the workbook contains current Beta Archief Amsterdam Detail URL values, the script replace_source_urls.py writes those URLs back into the source = parameter of the {{Photograph}} template on each Commons file page. The script defaults to dry-run; live edits require an explicit --live flag.
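The core of the writeback can be pictured as a targeted substitution inside the `source =` line of the template. The following is only a sketch under stated assumptions: it uses a simple regular expression instead of a full wikitext parser, the function name and the action labels (`replaced`, `already_present`, and so on) are hypothetical, and the real script's safety mechanisms are described in REPLACE_SOURCE_URLS.md.

```python
import re

# Matches a "| source = value" template line; [ \t] keeps the match on one line.
SOURCE_PARAM = re.compile(
    r"^([ \t]*\|[ \t]*source[ \t]*=[ \t]*)(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def replace_source(wikitext: str, old_url: str, new_url: str) -> tuple[str, str]:
    """Return (wikitext, action); the text changes only when action is 'replaced'."""
    match = SOURCE_PARAM.search(wikitext)
    if match is None:
        return wikitext, "no_source_param"
    value = match.group(2)
    if new_url in value:
        return wikitext, "already_present"
    if old_url not in value:
        return wikitext, "old_url_not_found"
    start, end = match.span(2)
    return wikitext[:start] + value.replace(old_url, new_url) + wikitext[end:], "replaced"


# Placeholder values for illustration only.
old = "http://beeldbank.amsterdam.nl/afbeelding/10019A001542"
new = "https://beta.archief.amsterdam/detail/00000000-0000-0000-0000-000000000000"
page = "{{Photograph\n |title = Example\n |source = " + old + "\n}}"

updated, action = replace_source(page, old, new)
print(action)
print(updated)
```

A dry run would stop after computing `updated` and show the difference; only a live run would submit the new text as an edit.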
Artifacts:

- `replace_source_urls.py`: the replacement script.
- `replace_config.yaml`: runtime parameters (paths, column headers, delays, retries, edit summary, bot flag).
- `REPLACE_SOURCE_URLS.md`: full design, all quality and safety mechanisms, and the operating procedure.
- A new column `URL replacement succes` is added to the workbook with `TRUE` or `FALSE` per row.
- A timestamped backup of the workbook taken at the start of every run.
- A per-row CSV run log with timestamps, before/after URLs, HTTP status, action taken, error, and new revision id.
Wikimedia credentials are read from .env (COMMONS_USERNAME, COMMONS_PASSWORD).
After the pipeline runs, integrity_check.py verifies the workbook is internally consistent. It is read-only — openpyxl opens the file with read_only=True and never writes back.
```bash
python integrity_check.py
```

Prints the full report to stdout and writes a copy to `integrity_report.txt`. Two layers of checks:
Structural invariants (must all be 0)
- Every `TRUE` row has a Beta URL; no row has an empty result cell.
- UUIDs in `Archief Amsterdam Detail URL` and `Beta Archief Amsterdam Detail URL` match.
- Per-column URL format validation: `Beta Archief Amsterdam Detail URL` → `beta.archief.amsterdam/detail/<uuid>`; `Archief Amsterdam Detail URL` → `archief.amsterdam/beeldbank/detail/<uuid>`; `File URL (Commons)` → `commons.wikimedia.org/wiki/File:…`; `Inventarissen` → `archief.amsterdam/archief/<fonds>/<id>`.
- No leftover `ERROR ...` strings in metadata cells; no leading/trailing whitespace; no control / non-printable characters (BOMs, zero-width chars, line separators, etc.); no empty `Filename`; no duplicate `Filename`s; the `File URL (Commons)` matches the `Filename`.
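Two of the structural checks above, URL format validation and UUID agreement between the two detail URL columns, can be sketched with regular expressions. This is an illustration, not the code in `integrity_check.py`: it assumes `http` or `https` URLs, lowercase hexadecimal UUIDs, and no trailing path or query string, and the names used are hypothetical.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
BETA_RE = re.compile(rf"^https?://beta\.archief\.amsterdam/detail/({UUID})$")
LEGACY_RE = re.compile(rf"^https?://archief\.amsterdam/beeldbank/detail/({UUID})$")


def uuids_match(legacy_url: str, beta_url: str) -> bool:
    """True when both URLs are well-formed and carry the same UUID."""
    legacy = LEGACY_RE.match(legacy_url)
    beta = BETA_RE.match(beta_url)
    return bool(legacy and beta and legacy.group(1) == beta.group(1))


# Placeholder UUID for illustration only.
uuid = "123e4567-e89b-12d3-a456-426614174000"
print(uuids_match(
    f"https://archief.amsterdam/beeldbank/detail/{uuid}",
    f"https://beta.archief.amsterdam/detail/{uuid}",
))
```

A row for which `uuids_match` returns `False` would count as one violation of the corresponding invariant.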
Coverage-quality observations (informational, may be non-zero)
- Outcome distribution and `FALSE` sub-bucket breakdown (`no_beta_url`, `no_filename`, `no_file_url`, `no_source_url`, `other`).
- Metadata coverage histogram (out of 11 fields per row).
- Per-field empty-cell count, which pinpoints the fields that drive the gaps.
- `TRUE` rows missing one of the required fields (`Titel`, `Datering`, `Documenttype`, `Vervaardiger`, `Collectie`, `Afbeeldingsbestand`).
- Duplicate `Source URL`s and `Beta URL`s: multiple Commons files pointing at the same record (often legitimate; worth a spot-check).
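The per-field empty counts and the duplicate detection listed above are simple tallies. A minimal sketch follows, assuming each row has already been read into a dict keyed by column header; the function names and sample rows are hypothetical and do not come from the real workbook.

```python
from collections import Counter


def empty_counts(rows, fields):
    """Count, per field, how many rows have a missing or blank value."""
    counts = Counter({field: 0 for field in fields})
    for row in rows:
        for field in fields:
            if not str(row.get(field) or "").strip():
                counts[field] += 1
    return counts


def duplicates(rows, field):
    """Return the values of `field` that occur in more than one row."""
    seen = Counter(row[field] for row in rows if row.get(field))
    return {value: n for value, n in seen.items() if n > 1}


# Placeholder rows for illustration only.
sample = [
    {"Titel": "A", "Datering": "", "Beta URL": "u1"},
    {"Titel": None, "Datering": "1890", "Beta URL": "u1"},
    {"Titel": "C", "Datering": "1891", "Beta URL": "u2"},
]
print(dict(empty_counts(sample, ["Titel", "Datering"])))
print(duplicates(sample, "Beta URL"))
```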
See INTEGRITY_REPORT.md for the latest results and a per-section explanation of what every check looks for.
See MANUAL.md for detailed usage instructions, including how to adapt this pipeline for other collections or other Memorix-based archives. See REPLACE_SOURCE_URLS.md for the design and operating procedure of the Commons writeback step.
- MediaWiki Action API — to fetch wikitext from Wikimedia Commons (batched, with proper User-Agent)
- Memorix Mediabank API by Vitec — to resolve record identifiers to UUIDs and extract metadata (public API key, embedded in the Beeldbank page source)
All scripts include rate limiting to be respectful to the servers:
- Wikimedia Commons (read, batched): 1 second between batches of 50
- Wikimedia Commons (write, per edit): 2 seconds between edits, plus `maxlag=5` and exponential backoff on retry
- Memorix API: 0.5 seconds between individual requests
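The write-side policy, a fixed pause between edits plus exponential backoff on retry, might look roughly like the sketch below. The 2-second base mirrors the delay stated above; the retry count, the 60-second cap, and the function names are assumptions for illustration, and the real values live in `replace_config.yaml`.

```python
import time


def backoff_delays(base: float = 2.0, retries: int = 4, cap: float = 60.0) -> list[float]:
    """Delays before each retry: base, 2*base, 4*base, ... limited to `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]


def call_with_retry(action, retries: int = 4, base: float = 2.0, sleep=time.sleep):
    """Call action(); after a failure, wait with exponential backoff and retry."""
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(min(base * 2 ** attempt, 60.0))


print(backoff_delays())
# prints: [2.0, 4.0, 8.0, 16.0]
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in a live run, `action` would be the edit request, sent with `maxlag=5` so that the servers can ask the client to back off when they are lagged.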
This repository is dedicated to the public domain under the CC0 1.0 Universal license. The photographs by Jacob Olie are in the public domain.
