feat: Reep ID migration - self-minted universal entity IDs by rahulkeerthi · Pull Request #8 · withqwerty/reep

rahulkeerthi · 2026-04-02T22:04:40Z

Summary

Migrates Reep from Wikidata QIDs as primary keys to self-minted Reep IDs (reep_<type_prefix><8hex>). Wikidata becomes a provider mapping, not the identity backbone.

Closes #7

Changes

- 488,063 entities with reep_id as PK (was 475K keyed on (qid, type))
- provider_ids table (1.7M rows) replaces external_ids (dropped)
- 353K custom_ids rekeyed from (qid, type) to reep_id
- 13,232 Opta-only players imported (333 duplicates resolved)
- API v2.0.0: /lookup?id= auto-detects Reep IDs and QIDs; legacy ?qid= still works
- CSVs: reep_id as first column, key_wikidata as a regular provider column
- Pipeline: seed, incremental update, fetch, and export scripts all updated for reep_id

Migration phases

| Phase | Commits | What |
|-------|---------|------|
| 1 | `e75d192`, `b65cf6c` | Mint 475K reep_ids, update worker |
| 2 | `1af712c`, `3f09c19` | provider_ids, custom_ids rekey, pipeline updates |
| 3 | `264ae56`, `9bcb9f6` | Import Opta-only players, resolve 333 dupes |
| 4 | `f7b4c17`, `81e9019` | Staging rehearsal passed, production cutover |
| docs | `9a642bf`, `3a2f4c9` | README, schema, openapi, data files |

Design

See docs/plan-reep-id.md for the full design rationale (4 rounds of adversarial review). Follows the Chadwick Baseball Bureau Register model.

- Add mint-reep-ids.py: generates reep_<type_prefix><8hex> IDs for all entities. Idempotent (only processes rows where reep_id IS NULL), collision-safe (retries on constraint violation), batched SQL execution.
- Add reep_id column + unique index to schema definition.
- Add reep_id to worker API responses (ENTITY_COLS, search handler). Additive only, no breaking changes.

Dry-run validated: 475,164 entities (386K players, 45K teams, 43K coaches).

Run `python scripts/mint-reep-ids.py` to mint IDs on production D1.

See docs/plan-reep-id.md and docs/plans/2026-04-02-001-feat-reep-id-migration-plan.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
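The idempotent, collision-safe minting described in this commit could look roughly like the sketch below. It is not the code from mint-reep-ids.py: it uses an in-memory SQLite database in place of D1, the prefix mapping beyond `p` is an assumption (only `reep_p...` appears in this PR), and the table layout and the helper names `mint_missing_reep_ids` and `make_demo_db` are hypothetical.

```python
import secrets
import sqlite3

# Assumed prefix mapping; only "p" (reep_p...) is visible in this PR.
TYPE_PREFIXES = {"player": "p", "team": "t", "coach": "c"}


def mint_missing_reep_ids(conn: sqlite3.Connection, max_retries: int = 5) -> int:
    """Assign a reep_id to every entity that lacks one; return how many were minted."""
    # Idempotent: rows that already have a reep_id are never selected.
    pending = conn.execute(
        "SELECT qid, type FROM entities WHERE reep_id IS NULL"
    ).fetchall()
    minted = 0
    for qid, entity_type in pending:
        for _ in range(max_retries):
            # 4 random bytes give the 8 hex characters of the ID suffix.
            candidate = f"reep_{TYPE_PREFIXES[entity_type]}{secrets.token_hex(4)}"
            try:
                conn.execute(
                    "UPDATE entities SET reep_id = ? WHERE qid = ? AND type = ?",
                    (candidate, qid, entity_type),
                )
            except sqlite3.IntegrityError:
                continue  # unique index rejected the candidate, so draw again
            minted += 1
            break
        else:
            raise RuntimeError(f"could not mint a unique ID for {qid}")
    conn.commit()
    return minted


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny stand-in database; the second row is a made-up placeholder."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT, "
        "PRIMARY KEY (qid, type))"
    )
    conn.execute("CREATE UNIQUE INDEX idx_entities_reep_id ON entities (reep_id)")
    conn.executemany(
        "INSERT INTO entities (qid, type, name) VALUES (?, ?, ?)",
        [("Q184586", "player", "Gareth Bale"), ("Q0", "team", "Placeholder FC")],
    )
    return conn
```

Running `mint_missing_reep_ids` twice on the same database mints IDs on the first call and does nothing on the second, which is the property that makes a re-run after a partial failure safe.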

- Add reep_id to OpenAPI Entity schema (nullable) + search example
- Clean up temp files after batch execution (try/finally + unlink)
- Move Counter import to module level
- Add type annotation to escape_sql parameter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The file was deliberately gitignored in f54a80d. Reverts the accidental re-addition from the review fix commit. The reep_id schema update lives locally and gets uploaded to RapidAPI manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- create-provider-ids.py: creates provider_ids table (1.7M rows) from external_ids (rekeyed to reep_id) + wikidata QID mappings. Includes lookup index and count verification.
- rekey-custom-ids.py: migrates custom_ids from (qid, type) to reep_id. Backs up to JSON first, handles orphan detection, two-phase table swap. 780 orphan Club Elo rows dropped (QIDs not in entities).
- .gitignore: exclude data/backups/

Both scripts executed on production D1:
  provider_ids: 1,735,049 rows (verified)
  custom_ids: 339,554 rows rekeyed (verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
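As an illustration of the rekey steps listed above (backup, orphan detection, two-phase swap), here is a minimal sketch against an in-memory SQLite database. The column names `provider` and `external_id`, the `custom_ids_new` table name, and the function name are assumptions, not taken from rekey-custom-ids.py, and the backup is kept in memory rather than written out.

```python
import json
import sqlite3


def rekey_custom_ids(conn: sqlite3.Connection) -> tuple:
    """Rekey custom_ids from (qid, type) to reep_id; return (backup_json, orphans)."""
    old_rows = conn.execute(
        "SELECT qid, type, provider, external_id FROM custom_ids"
    ).fetchall()
    backup_json = json.dumps(old_rows)  # kept in memory here for brevity

    # Orphans: custom_ids rows whose (qid, type) has no entity with a reep_id.
    orphans = conn.execute(
        """
        SELECT c.qid, c.type, c.provider, c.external_id
        FROM custom_ids c
        LEFT JOIN entities e ON e.qid = c.qid AND e.type = c.type
        WHERE e.reep_id IS NULL
        """
    ).fetchall()

    # Phase 1: build the rekeyed table next to the old one.
    conn.execute(
        "CREATE TABLE custom_ids_new ("
        "reep_id TEXT NOT NULL, provider TEXT NOT NULL, external_id TEXT NOT NULL)"
    )
    conn.execute(
        """
        INSERT INTO custom_ids_new (reep_id, provider, external_id)
        SELECT e.reep_id, c.provider, c.external_id
        FROM custom_ids c
        JOIN entities e ON e.qid = c.qid AND e.type = c.type
        WHERE e.reep_id IS NOT NULL
        """
    )
    # Phase 2: swap the new table into place.
    conn.execute("DROP TABLE custom_ids")
    conn.execute("ALTER TABLE custom_ids_new RENAME TO custom_ids")
    conn.commit()
    return backup_json, orphans


def make_rekey_demo() -> sqlite3.Connection:
    """Tiny stand-in data; IDs and external values are made-up placeholders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entities (qid TEXT, type TEXT, reep_id TEXT)")
    conn.execute(
        "CREATE TABLE custom_ids (qid TEXT, type TEXT, provider TEXT, external_id TEXT)"
    )
    conn.execute("INSERT INTO entities VALUES ('Q184586', 'player', 'reep_p00000001')")
    conn.executemany(
        "INSERT INTO custom_ids VALUES (?, ?, ?, ?)",
        [
            ("Q184586", "player", "opta", "placeholder-1"),
            ("Q0", "team", "clubelo", "placeholder-2"),  # no matching entity
        ],
    )
    return conn
```

Building the new table first means the old `custom_ids` stays readable until the final swap, and the orphan list can be reviewed before anything is dropped.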

Seed script (seed-wikidata-d1.py):
- Includes reep_id column in entity schema
- Mints reep_ids for all entities after seeding
- Rebuilds provider_ids from external_ids + wikidata QIDs
- 90% count safety threshold on provider_ids

Incremental update (incremental-update.py):
- Looks up existing reep_ids before INSERT OR REPLACE (preserves them)
- Mints reep_ids for new entities
- Dual-writes to both external_ids and provider_ids
- Includes wikidata QID mappings in provider_ids

Fetch custom IDs (fetch-custom-ids.py):
- Exports using reep_id (new schema)
- Also exports reep_id_map.json for CSV export

Export CSV (export-csv.py):
- reep_id as first column in people.csv and teams.csv
- Loads reep_id map to resolve QID -> reep_id
- Custom IDs merged via reep_id key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
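The lookup before INSERT OR REPLACE matters because a REPLACE deletes the conflicting row and inserts a fresh one, so any column not supplied again is lost. A minimal sketch of that pattern, assuming a simplified `entities` table and a hypothetical `upsert_entity` helper (the single-letter prefix taken from the type name is also an assumption):

```python
import secrets
import sqlite3


def upsert_entity(conn: sqlite3.Connection, qid: str, entity_type: str, name: str) -> str:
    """Insert or refresh an entity while keeping any reep_id it already has."""
    existing = conn.execute(
        "SELECT reep_id FROM entities WHERE qid = ? AND type = ?",
        (qid, entity_type),
    ).fetchone()
    if existing and existing[0]:
        reep_id = existing[0]  # preserve the previously minted ID
    else:
        # New entity: mint an ID. Prefix = first letter of the type (assumed).
        reep_id = f"reep_{entity_type[0]}{secrets.token_hex(4)}"
    conn.execute(
        "INSERT OR REPLACE INTO entities (qid, type, name, reep_id) VALUES (?, ?, ?, ?)",
        (qid, entity_type, name, reep_id),
    )
    conn.commit()
    return reep_id


def make_upsert_demo() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT UNIQUE, "
        "PRIMARY KEY (qid, type))"
    )
    return conn
```

Calling `upsert_entity` twice for the same `(qid, type)` with a changed name updates the name but returns the same `reep_id` both times.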

- import-opta-entities.py: creates 13,232 new entities for Opta players not in Wikidata. Uses reep_id as qid placeholder, source=opta.
- dedup-check.py: compares Opta entities against Wikidata entities via DOB + name similarity. Found 335 potential duplicates (score >= 0.90) that need manual resolution.

Production D1 state:
  entities: 488,396 (475,164 Wikidata + 13,232 Opta)
  custom_ids: 352,787 (339,554 + 13,233 Opta imports)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merged Opta-only entities into their Wikidata counterparts where DOB + name matched at score >= 0.90. Repoints Opta IDs to the Wikidata entity's reep_id, then deletes the duplicate.

Production D1: 488,063 entities (was 488,396, removed 333 dupes)

Spot check: Gareth Bale (Q184586) now has opta ID via Wikidata entity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
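The PR does not say how the similarity score is computed. The sketch below shows one plausible shape for a DOB + name match, using `difflib.SequenceMatcher` as a stand-in scorer; the record fields, the accent-stripping normalization, and the function names are all assumptions rather than the logic in dedup-check.py.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and strip accents so 'Hernández' and 'Hernandez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def name_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for whatever scorer the real script uses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(opta: list, wikidata: list, threshold: float = 0.90) -> list:
    """Pair Opta and Wikidata records that share a DOB and have similar names."""
    by_dob = {}
    for record in wikidata:
        by_dob.setdefault(record["dob"], []).append(record)
    matches = []
    for o in opta:
        # Only compare against records with an identical date of birth.
        for w in by_dob.get(o["dob"], []):
            score = name_score(o["name"], w["name"])
            if score >= threshold:
                matches.append((o["reep_id"], w["reep_id"], round(score, 2)))
    return matches
```

Bucketing the Wikidata side by DOB first keeps the name comparison to a handful of candidates per Opta player instead of a full cross product.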

- cutover-reep-id.py: 10-step numbered script with checkpoint verification. Creates entities_new with reep_id PK, populates via INSERT...SELECT ORDER BY reep_id, drops old entities + external_ids, rebuilds FTS.
- clone-to-staging.py: clones production data to staging D1 for rehearsal.

Rehearsal results on staging D1 (reep-staging):
- 488,063 entities migrated, count exact match
- reep_id confirmed as PK, qid column removed
- FTS search working (Cole Palmer found)
- provider_ids + custom_ids unchanged
- external_ids dropped
- All post-flight checks passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
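The core of the cutover (new table with reep_id as PK, INSERT...SELECT, count checkpoint, drop, rename) can be sketched as follows. This is a simplified illustration on in-memory SQLite, not the 10-step script: the column set is reduced to three assumed columns, the FTS rebuild is omitted, and the function names are hypothetical.

```python
import sqlite3


def cutover(conn: sqlite3.Connection) -> int:
    """Swap entities to a reep_id primary key; return the migrated row count."""
    before = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]

    conn.execute(
        "CREATE TABLE entities_new ("
        "reep_id TEXT PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO entities_new (reep_id, type, name) "
        "SELECT reep_id, type, name FROM entities ORDER BY reep_id"
    )

    # Checkpoint: refuse to drop anything unless the counts match exactly.
    after = conn.execute("SELECT COUNT(*) FROM entities_new").fetchone()[0]
    if after != before:
        raise RuntimeError(f"count mismatch: {before} before, {after} after")

    # Drop the dependent table first, then the old entities table,
    # and issue each statement separately.
    conn.execute("DROP TABLE external_ids")
    conn.execute("DROP TABLE entities")
    conn.execute("ALTER TABLE entities_new RENAME TO entities")
    conn.commit()
    return after


def make_cutover_demo() -> sqlite3.Connection:
    """Tiny stand-in for the pre-cutover schema; rows are made-up placeholders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT UNIQUE, "
        "PRIMARY KEY (qid, type))"
    )
    conn.execute(
        "CREATE TABLE external_ids (qid TEXT, type TEXT, provider TEXT, external_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?, ?, ?)",
        [
            ("Q184586", "player", "Gareth Bale", "reep_p00000002"),
            ("Q0", "team", "Placeholder FC", "reep_t00000001"),
        ],
    )
    return conn
```

After the swap, `PRAGMA table_info(entities)` should show reep_id as the only primary-key column and no qid column, which mirrors the post-flight checks listed in the rehearsal results.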

Worker (worker.ts) rewritten for new schema:
- Queries provider_ids instead of external_ids (dropped)
- lookupByReepId replaces lookupEntity/lookupEntities
- /lookup accepts ?id= with auto-detection: reep_id (reep_p...) or QID (Q...)
- Legacy ?qid= param still works
- /search returns qid as convenience field from provider_ids
- /resolve + /batch/* use provider_ids + custom_ids
- /stats queries provider_ids instead of external_ids
- Version bumped to 2.0.0

Cutover script (cutover-reep-id.py) fixed:
- Drop external_ids before entities (FK constraint)
- Execute DROP and RENAME as separate commands (D1 batch limitation)

Production D1 state:
  entities: 488,063 (reep_id PK, no qid column)
  provider_ids: 1,735,049
  custom_ids: 352,787
  external_ids: dropped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
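The worker itself is TypeScript; purely as an illustration, the `?id=` auto-detection with the legacy `?qid=` fallback could be expressed like this in Python. The regular expressions and both function names are assumptions inferred from the `reep_<type_prefix><8hex>` format and the `Q...` QID shape, not copied from worker.ts.

```python
import re

# Assumed patterns: one lowercase type letter plus 8 hex chars, and Wikidata-style QIDs.
REEP_ID_RE = re.compile(r"reep_[a-z][0-9a-f]{8}")
QID_RE = re.compile(r"Q[1-9][0-9]*")


def detect_id_kind(value: str) -> str:
    """Classify an identifier as 'reep_id' or 'qid', or raise ValueError."""
    if REEP_ID_RE.fullmatch(value):
        return "reep_id"
    if QID_RE.fullmatch(value):
        return "qid"
    raise ValueError(f"unrecognized identifier: {value!r}")


def parse_lookup_params(params: dict) -> tuple:
    """Return (kind, value) for a /lookup request, preferring ?id= over legacy ?qid=."""
    if "id" in params:
        return detect_id_kind(params["id"]), params["id"]
    if "qid" in params:
        # Legacy parameter: always treated as a QID.
        return "qid", params["qid"]
    raise ValueError("expected an 'id' or 'qid' query parameter")
```

For example, `parse_lookup_params({"id": "Q184586"})` classifies the value as a QID, while an `id` starting with `reep_p` followed by eight hex characters is classified as a Reep ID.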

README.md:
- reep_id is now the canonical key (was key_wikidata)
- Updated entity counts to 488K (was 430K)
- New "Reep IDs" section explaining the ID scheme and Chadwick precedent
- Updated people/teams schema tables with reep_id as first column
- Updated usage examples (Python, R, SQL) to use reep_id
- /lookup endpoint now documents ?id= with auto-detection
- Coverage section references provider_ids (was external_ids)

schemas/football-entities.sql:
- entities PK is now reep_id (was qid, type)
- current_team_reep_id replaces current_team_qid
- source column added (wikidata, opta, etc.)
- provider_ids replaces external_ids
- custom_ids schema uses reep_id FK
- All indexes updated

openapi.yaml (gitignored, updated locally):
- Version 2.0.0
- Entity schema has reep_id as primary field, qid nullable
- /lookup documents ?id= with auto-detection
- /batch/lookup accepts { ids: [...] } (was { qids: [...] })
- current_team_reep_id, source field added
- Server URL uses RapidAPI (not direct worker URL)

CLAUDE.md (gitignored, updated locally):
- Architecture diagram shows provider_ids
- D1 tables section reflects new schema and counts
- Scripts table includes all migration scripts
- Identity Model section explains reep_id scheme
- Commands section updated
- Provider coverage query uses JOINs on reep_id

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- people.csv: reep_id as first column (488K rows)
- teams.csv: reep_id as first column (45K rows)
- custom_ids.json: rekeyed from (qid, type) to reep_id (340K rows)
- meta.json: updated counts
- .gitignore: exclude reep_id_map.json and dedup-report.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ulti-position, native name

Four Phase 2 enrichment additions from the Wikidata pipeline review (bundle A). Takes effect on the next weekly refresh; each field is an additive JSON key, so existing consumers aren't broken.

#6 DOB precision capture

Phase 2 player + coach queries now read the P569 precision via p:P569/psv:P569/wikibase:timePrecision alongside the date value. Every DOB'd entity now carries `date_of_birth_precision`: 'day' / 'month' / 'year'. Previously a stored '1995-01-01' was ambiguous between a real Jan 1 birth and a year-precision stub (about 20K reep entities have YYYY-01-01 DOBs today). Match scripts like sync-transfermarkt-datasets.py can now use this to pick the right fallback path instead of heuristically inferring precision from the trailing '-01-01'.

#7 Multi-language label fallback

SPARQL label service was configured for English only, so any entity without an English label had its QID stored as the name (e.g. Q82595 Bundesliga, Q277533 Pablo Hernández). backfill-broken-names.py has been chasing these post-hoc via a full dump scan. Label service now accepts a 30-language chain (en → Romance/Germanic → Slavic/Turkic → Asian/Arabic) and returns the first available. Live-verified: Q82595 now returns 'Lliga alemanya de futbol' (Catalan) instead of the QID. The QID-as-name drop in parse_ids_phase stays as the final safety net.

#8 Multi-position capture

Player P413 (position) was stored as the first-row value only. merge_bio now accumulates all position labels across Phase 2 rows into a set and joins them into a comma-separated `position` field. Players who play both forward and winger no longer lose that information.

#9 Native name (P1559)

Phase 2 player + coach queries now fetch wdt:P1559 and store it as `name_native`. This is the Wikidata equivalent of the salimt dataset's `name_in_home_country` field, so reep now has native names for Eastern European / Asian / Arabic players without needing the salimt cross-reference.

Entity dict additions:
- name_native (was in schema, never populated)
- date_of_birth_precision (new)

Existing fields with new semantics:
- position (was scalar, now comma-separated)

Unit tested end-to-end with synthetic Phase 1 + Phase 2 rows for Messi, Ronaldo, Werner Herzog, and a year-precision stub. All four features verified. Also live-queried the real bio query against Wikidata: dobPrecision=11 and nativeName come through correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
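To make the precision and multi-position handling concrete, here is a small sketch of how Phase 2 rows might be folded into one entity dict. Wikidata encodes time precision as an integer (9 = year, 10 = month, 11 = day). The row keys `dob` and `positionLabel`, the sorted join order, and the function name `merge_bio_sketch` are assumptions; this is not the project's `merge_bio`.

```python
# Wikidata time precision codes: 9 = year, 10 = month, 11 = day.
PRECISION_LABELS = {9: "year", 10: "month", 11: "day"}


def merge_bio_sketch(rows: list) -> dict:
    """Fold several query rows for one person into a single entity dict."""
    entity = {}
    positions = set()
    for row in rows:
        # Take the first DOB seen, together with its precision label.
        if "dob" in row and "date_of_birth" not in entity:
            entity["date_of_birth"] = row["dob"]
            entity["date_of_birth_precision"] = PRECISION_LABELS.get(
                row.get("dobPrecision")
            )
        if row.get("nativeName") and "name_native" not in entity:
            entity["name_native"] = row["nativeName"]
        # Accumulate every position label instead of keeping only the first row's.
        if row.get("positionLabel"):
            positions.add(row["positionLabel"])
    if positions:
        # Sorted so the output is deterministic (an assumption about ordering).
        entity["position"] = ",".join(sorted(positions))
    return entity
```

With this shape, a stored '1995-01-01' that arrives with precision 9 is labeled 'year', so a matcher can compare on birth year alone instead of treating January 1 as a real birthday.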

rahulkeerthi and others added 11 commits April 2, 2026 18:14

rahulkeerthi merged commit 6594873 into main Apr 2, 2026
1 check passed

rahulkeerthi deleted the feat/reep-id-migration branch April 2, 2026 22:05

rahulkeerthi merged 11 commits into main from feat/reep-id-migration
