feat: Reep ID migration - self-minted universal entity IDs #8
Merged
- Add mint-reep-ids.py: generates reep_<type_prefix><8hex> IDs for all entities. Idempotent (only processes rows where reep_id IS NULL), collision-safe (retries on constraint violation), batched SQL execution.
- Add reep_id column + unique index to schema definition.
- Add reep_id to worker API responses (ENTITY_COLS, search handler). Additive only, no breaking changes.
Dry-run validated: 475,164 entities (386K players, 45K teams, 43K coaches).
Run `python scripts/mint-reep-ids.py` to mint IDs on production D1.
See docs/plan-reep-id.md and docs/plans/2026-04-02-001-feat-reep-id-migration-plan.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
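The minting behaviour described in this commit (idempotent, retry on collision) can be sketched roughly as follows. This is not the contents of mint-reep-ids.py: it uses an in-memory SQLite database instead of D1, skips batching, and the `TYPE_PREFIX` mapping, the table layout, and the two "Demo" rows are assumptions made for illustration (the PR only confirms the `reep_p...` form for players).

```python
import secrets
import sqlite3

# Assumed prefixes: only "p" (reep_p...) is confirmed by the PR text.
TYPE_PREFIX = {"player": "p", "team": "t", "coach": "c"}


def new_reep_id(entity_type: str) -> str:
    """Build an ID of the form reep_<type_prefix><8hex>."""
    return f"reep_{TYPE_PREFIX[entity_type]}{secrets.token_hex(4)}"


def mint_missing(conn: sqlite3.Connection, max_retries: int = 5) -> int:
    """Mint IDs only where reep_id IS NULL, so re-running is a no-op."""
    rows = conn.execute(
        "SELECT qid, type FROM entities WHERE reep_id IS NULL"
    ).fetchall()
    minted = 0
    for qid, entity_type in rows:
        for _ in range(max_retries):
            try:
                conn.execute(
                    "UPDATE entities SET reep_id = ? WHERE qid = ? AND type = ?",
                    (new_reep_id(entity_type), qid, entity_type),
                )
                minted += 1
                break
            except sqlite3.IntegrityError:
                continue  # unique index rejected the ID: draw a new one
        else:
            raise RuntimeError(f"could not mint a unique ID for {qid}")
    conn.commit()
    return minted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT, "
    "PRIMARY KEY (qid, type))"
)
conn.execute("CREATE UNIQUE INDEX idx_entities_reep_id ON entities(reep_id)")
conn.executemany(
    "INSERT INTO entities (qid, type, name) VALUES (?, ?, ?)",
    [
        ("Q184586", "player", "Gareth Bale"),
        ("Q00001", "team", "Demo FC"),      # hypothetical row
        ("Q00002", "coach", "Demo Coach"),  # hypothetical row
    ],
)

first_run = mint_missing(conn)   # mints every row
second_run = mint_missing(conn)  # nothing left with reep_id IS NULL
ids = [r[0] for r in conn.execute("SELECT reep_id FROM entities ORDER BY qid")]
print(first_run, second_run, ids)
```

The unique index is what makes the retry loop safe: a colliding draw fails the UPDATE rather than silently duplicating an ID.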
- Add reep_id to OpenAPI Entity schema (nullable) + search example
- Clean up temp files after batch execution (try/finally + unlink)
- Move Counter import to module level
- Add type annotation to escape_sql parameter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The file was deliberately gitignored in f54a80d. Reverts the accidental re-addition from the review fix commit. The reep_id schema update lives locally and gets uploaded to RapidAPI manually.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- create-provider-ids.py: creates provider_ids table (1.7M rows) from external_ids (rekeyed to reep_id) + wikidata QID mappings. Includes lookup index and count verification.
- rekey-custom-ids.py: migrates custom_ids from (qid, type) to reep_id. Backs up to JSON first, handles orphan detection, two-phase table swap. 780 orphan Club Elo rows dropped (QIDs not in entities).
- .gitignore: exclude data/backups/
Both scripts executed on production D1:
  provider_ids: 1,735,049 rows (verified)
  custom_ids: 339,554 rows rekeyed (verified)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
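A rough sketch of the rekey flow described above (orphan detection, then a two-phase table swap), using in-memory SQLite as a stand-in for D1. The column names on `custom_ids`, the sample rows, and the function name are assumptions for illustration; the JSON backup step is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entities (
        qid TEXT, type TEXT, reep_id TEXT UNIQUE, PRIMARY KEY (qid, type)
    );
    -- Assumed pre-migration layout of custom_ids.
    CREATE TABLE custom_ids (qid TEXT, type TEXT, provider TEXT, external_id TEXT);
    INSERT INTO entities VALUES ('Q184586', 'player', 'reep_p0000aaaa');
    INSERT INTO custom_ids VALUES ('Q184586', 'player', 'opta', 'demo-1');
    -- Hypothetical orphan: its QID has no row in entities.
    INSERT INTO custom_ids VALUES ('Q999999999', 'team', 'clubelo', 'demo-2');
    """
)


def rekey_custom_ids(conn: sqlite3.Connection) -> list:
    """Rekey custom_ids from (qid, type) to reep_id; return the dropped orphans."""
    orphans = conn.execute(
        """
        SELECT c.qid, c.type, c.provider, c.external_id
        FROM custom_ids c
        LEFT JOIN entities e ON e.qid = c.qid AND e.type = c.type
        WHERE e.reep_id IS NULL
        """
    ).fetchall()

    # Phase 1: build the rekeyed table next to the old one.
    conn.execute(
        "CREATE TABLE custom_ids_new ("
        "reep_id TEXT NOT NULL, provider TEXT NOT NULL, external_id TEXT NOT NULL)"
    )
    conn.execute(
        """
        INSERT INTO custom_ids_new (reep_id, provider, external_id)
        SELECT e.reep_id, c.provider, c.external_id
        FROM custom_ids c
        JOIN entities e ON e.qid = c.qid AND e.type = c.type
        WHERE e.reep_id IS NOT NULL
        """
    )
    old_count = conn.execute("SELECT COUNT(*) FROM custom_ids").fetchone()[0]
    new_count = conn.execute("SELECT COUNT(*) FROM custom_ids_new").fetchone()[0]
    if new_count + len(orphans) != old_count:
        raise RuntimeError("row counts do not reconcile; aborting before swap")

    # Phase 2: swap only after the counts are verified.
    conn.execute("DROP TABLE custom_ids")
    conn.execute("ALTER TABLE custom_ids_new RENAME TO custom_ids")
    conn.commit()
    return orphans


orphans = rekey_custom_ids(conn)
rekeyed = conn.execute("SELECT * FROM custom_ids").fetchall()
print(orphans, rekeyed)
```

Keeping the old table until the counts reconcile means a failed check leaves the original data untouched.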
Seed script (seed-wikidata-d1.py):
- Includes reep_id column in entity schema
- Mints reep_ids for all entities after seeding
- Rebuilds provider_ids from external_ids + wikidata QIDs
- 90% count safety threshold on provider_ids
Incremental update (incremental-update.py):
- Looks up existing reep_ids before INSERT OR REPLACE (preserves them)
- Mints reep_ids for new entities
- Dual-writes to both external_ids and provider_ids
- Includes wikidata QID mappings in provider_ids
Fetch custom IDs (fetch-custom-ids.py):
- Exports using reep_id (new schema)
- Also exports reep_id_map.json for CSV export
Export CSV (export-csv.py):
- reep_id as first column in people.csv and teams.csv
- Loads reep_id map to resolve QID -> reep_id
- Custom IDs merged via reep_id key
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
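The "look up existing reep_ids before INSERT OR REPLACE" step matters because REPLACE deletes the old row, so an ID that is not carried over explicitly would be lost. A minimal sketch of that idea, assuming a simplified `entities` table and a hypothetical `upsert_entities` helper (not the code in incremental-update.py); the type prefix is taken from the first letter of the type as a shortcut:

```python
import secrets
import sqlite3


def mint_unused(entity_type: str, used: set) -> str:
    """Draw reep_<prefix><8hex> until it is not already in use."""
    while True:
        candidate = f"reep_{entity_type[0]}{secrets.token_hex(4)}"
        if candidate not in used:
            used.add(candidate)
            return candidate


def upsert_entities(conn: sqlite3.Connection, incoming: list) -> None:
    """Refresh entity rows while keeping any reep_id that already exists."""
    existing = {
        (qid, etype): reep_id
        for qid, etype, reep_id in conn.execute(
            "SELECT qid, type, reep_id FROM entities"
        )
    }
    used = {rid for rid in existing.values() if rid}
    for qid, etype, name in incoming:
        # Reuse the stored ID if there is one; mint only for new entities.
        reep_id = existing.get((qid, etype)) or mint_unused(etype, used)
        existing[(qid, etype)] = reep_id
        conn.execute(
            "INSERT OR REPLACE INTO entities (qid, type, name, reep_id) "
            "VALUES (?, ?, ?, ?)",
            (qid, etype, name, reep_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT UNIQUE, "
    "PRIMARY KEY (qid, type))"
)
conn.execute(
    "INSERT INTO entities VALUES ('Q184586', 'player', 'G. Bale', 'reep_p0000aaaa')"
)
upsert_entities(
    conn,
    [
        ("Q184586", "player", "Gareth Bale"),  # existing entity, refreshed name
        ("Q00001", "team", "Demo FC"),         # hypothetical new entity
    ],
)
rows = dict(
    (qid, (name, reep_id))
    for qid, name, reep_id in conn.execute("SELECT qid, name, reep_id FROM entities")
)
print(rows)
```

Checking minted IDs against the in-use set up front avoids relying on the unique constraint here, since under INSERT OR REPLACE a colliding ID would replace the other row instead of raising.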
- import-opta-entities.py: creates 13,232 new entities for Opta players not in Wikidata. Uses reep_id as qid placeholder, source=opta.
- dedup-check.py: compares Opta entities against Wikidata entities via DOB + name similarity. Found 335 potential duplicates (score >= 0.90) that need manual resolution.
Production D1 state:
  entities: 488,396 (475,164 Wikidata + 13,232 Opta)
  custom_ids: 352,787 (339,554 + 13,233 Opta imports)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
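For illustration, a DOB + name-similarity check along these lines could look like the sketch below. The PR does not say which similarity measure dedup-check.py uses; `difflib.SequenceMatcher` is swapped in here as a stand-in, and the record layout, the reep_id values, and the "Demo Player" row are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Strip accents and case so 'Hernández' and 'hernandez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def name_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(opta: list, wikidata: list, threshold: float = 0.90) -> list:
    """Pair Opta and Wikidata entities that share a DOB and have similar names."""
    by_dob = {}
    for entity in wikidata:
        by_dob.setdefault(entity["dob"], []).append(entity)

    hits = []
    for o in opta:
        # Only compare names within the same date of birth.
        for w in by_dob.get(o["dob"], []):
            score = name_score(o["name"], w["name"])
            if score >= threshold:
                hits.append((o["reep_id"], w["reep_id"], round(score, 2)))
    return hits


wikidata_entities = [
    {"reep_id": "reep_p00000002", "name": "Gareth Bale", "dob": "1989-07-16"},
]
opta_entities = [
    {"reep_id": "reep_p00000001", "name": "Gareth Bale", "dob": "1989-07-16"},
    {"reep_id": "reep_p00000003", "name": "Demo Player", "dob": "1989-07-16"},
]
hits = find_duplicates(opta_entities, wikidata_entities)
print(hits)
```

Bucketing by DOB first keeps the comparison count small; near-threshold pairs would still need the manual resolution the commit mentions.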
Merged Opta-only entities into their Wikidata counterparts where DOB + name matched at score >= 0.90. Repoints Opta IDs to the Wikidata entity's reep_id, then deletes the duplicate.
Production D1: 488,063 entities (was 488,396, removed 333 dupes)
Spot check: Gareth Bale (Q184586) now has opta ID via Wikidata entity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cutover-reep-id.py: 10-step numbered script with checkpoint verification. Creates entities_new with reep_id PK, populates via INSERT...SELECT ORDER BY reep_id, drops old entities + external_ids, rebuilds FTS.
- clone-to-staging.py: clones production data to staging D1 for rehearsal.
Rehearsal results on staging D1 (reep-staging):
- 488,063 entities migrated, count exact match
- reep_id confirmed as PK, qid column removed
- FTS search working (Cole Palmer found)
- provider_ids + custom_ids unchanged
- external_ids dropped
- All post-flight checks passed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
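The core of the cutover (new table with reep_id PK, INSERT...SELECT, count checkpoint, then drop and rename) can be sketched as below. This is a simplified stand-in on in-memory SQLite, not cutover-reep-id.py: the column set is reduced, the `cutover` function name and the second sample row are hypothetical, and the external_ids drop and FTS rebuild are left out.

```python
import sqlite3


def cutover(conn: sqlite3.Connection) -> int:
    """Rebuild entities with reep_id as primary key; return the migrated count."""
    before = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]

    conn.execute(
        "CREATE TABLE entities_new ("
        "reep_id TEXT PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO entities_new (reep_id, type, name) "
        "SELECT reep_id, type, name FROM entities ORDER BY reep_id"
    )

    # Checkpoint: refuse to drop anything unless the counts match exactly.
    after = conn.execute("SELECT COUNT(*) FROM entities_new").fetchone()[0]
    if after != before:
        raise RuntimeError(f"count mismatch: {before} before, {after} after")

    # DROP and RENAME are issued as separate statements.
    conn.execute("DROP TABLE entities")
    conn.execute("ALTER TABLE entities_new RENAME TO entities")
    conn.commit()
    return after


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (qid TEXT, type TEXT, name TEXT, reep_id TEXT UNIQUE, "
    "PRIMARY KEY (qid, type))"
)
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?, ?)",
    [
        ("Q184586", "player", "Gareth Bale", "reep_p0000aaaa"),
        ("Q00001", "team", "Demo FC", "reep_t0000bbbb"),  # hypothetical row
    ],
)
migrated = cutover(conn)

# Post-flight check: column name -> primary-key position from PRAGMA table_info.
columns = {row[1]: row[5] for row in conn.execute("PRAGMA table_info(entities)")}
print(migrated, columns)
```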
Worker (worker.ts) rewritten for new schema:
- Queries provider_ids instead of external_ids (dropped)
- lookupByReepId replaces lookupEntity/lookupEntities
- /lookup accepts ?id= with auto-detection: reep_id (reep_p...) or QID (Q...)
- Legacy ?qid= param still works
- /search returns qid as convenience field from provider_ids
- /resolve + /batch/* use provider_ids + custom_ids
- /stats queries provider_ids instead of external_ids
- Version bumped to 2.0.0
Cutover script (cutover-reep-id.py) fixed:
- Drop external_ids before entities (FK constraint)
- Execute DROP and RENAME as separate commands (D1 batch limitation)
Production D1 state:
  entities: 488,063 (reep_id PK, no qid column)
  provider_ids: 1,735,049
  custom_ids: 352,787
  external_ids: dropped
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
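The `?id=` auto-detection could work along the lines of the sketch below. The real worker is TypeScript; this Python version only illustrates the branching. Both regular expressions and both function names are assumptions (in particular, a single-letter type prefix is assumed for reep_ids).

```python
import re

# Assumed shapes: reep_<one letter><8 hex chars>, and Q followed by digits.
REEP_ID_RE = re.compile(r"^reep_[a-z][0-9a-f]{8}$")
QID_RE = re.compile(r"^Q[0-9]+$")


def classify_id(value: str) -> str:
    """Return 'reep_id' or 'qid' depending on the shape of the identifier."""
    if REEP_ID_RE.match(value):
        return "reep_id"
    if QID_RE.match(value):
        return "qid"
    raise ValueError(f"unrecognised identifier: {value!r}")


def resolve_lookup_params(params: dict) -> tuple:
    """Prefer ?id= with auto-detection; fall back to the legacy ?qid= param."""
    if "id" in params:
        return classify_id(params["id"]), params["id"]
    if "qid" in params:
        if not QID_RE.match(params["qid"]):
            raise ValueError("?qid= expects a Wikidata QID")
        return "qid", params["qid"]
    raise ValueError("expected ?id= or ?qid=")


print(resolve_lookup_params({"id": "reep_p1a2b3c4d"}))
print(resolve_lookup_params({"id": "Q184586"}))
print(resolve_lookup_params({"qid": "Q184586"}))
```

A QID result would then be translated to a reep_id through provider_ids before the entity lookup, since the entities table no longer has a qid column.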
README.md:
- reep_id is now the canonical key (was key_wikidata)
- Updated entity counts to 488K (was 430K)
- New "Reep IDs" section explaining the ID scheme and Chadwick precedent
- Updated people/teams schema tables with reep_id as first column
- Updated usage examples (Python, R, SQL) to use reep_id
- /lookup endpoint now documents ?id= with auto-detection
- Coverage section references provider_ids (was external_ids)
schemas/football-entities.sql:
- entities PK is now reep_id (was qid, type)
- current_team_reep_id replaces current_team_qid
- source column added (wikidata, opta, etc.)
- provider_ids replaces external_ids
- custom_ids schema uses reep_id FK
- All indexes updated
openapi.yaml (gitignored, updated locally):
- Version 2.0.0
- Entity schema has reep_id as primary field, qid nullable
- /lookup documents ?id= with auto-detection
- /batch/lookup accepts { ids: [...] } (was { qids: [...] })
- current_team_reep_id, source field added
- Server URL uses RapidAPI (not direct worker URL)
CLAUDE.md (gitignored, updated locally):
- Architecture diagram shows provider_ids
- D1 tables section reflects new schema and counts
- Scripts table includes all migration scripts
- Identity Model section explains reep_id scheme
- Commands section updated
- Provider coverage query uses JOINs on reep_id
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- people.csv: reep_id as first column (488K rows)
- teams.csv: reep_id as first column (45K rows)
- custom_ids.json: rekeyed from (qid, type) to reep_id (340K rows)
- meta.json: updated counts
- .gitignore: exclude reep_id_map.json and dedup-report.json
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rahulkeerthi added a commit that referenced this pull request Apr 10, 2026
…ulti-position, native name
Four Phase 2 enrichment additions from the Wikidata pipeline review (bundle A). Takes effect on the next weekly refresh; each field is an additive JSON key so existing consumers aren't broken.
#6 DOB precision capture
Phase 2 player + coach queries now read the P569 precision via p:P569/psv:P569/wikibase:timePrecision alongside the date value. Every DOB'd entity now carries `date_of_birth_precision`: 'day' / 'month' / 'year'. Previously a stored '1995-01-01' was ambiguous between a real Jan 1 birth and a year-precision stub (about 20K reep entities have YYYY-01-01 DOBs today). Match scripts like sync-transfermarkt-datasets.py can now use this to pick the right fallback path instead of heuristically inferring precision from the trailing '-01-01'.
#7 Multi-language label fallback
SPARQL label service was configured for English only, so any entity without an English label had its QID stored as the name (e.g. Q82595 Bundesliga, Q277533 Pablo Hernández). backfill-broken-names.py has been chasing these post-hoc via a full dump scan. Label service now accepts a 30-language chain (en → Romance/Germanic → Slavic/Turkic → Asian/Arabic) and returns the first available. Live-verified: Q82595 now returns 'Lliga alemanya de futbol' (Catalan) instead of the QID. The QID-as-name drop in parse_ids_phase stays as the final safety net.
#8 Multi-position capture
Player P413 (position) was stored as the first-row value only. merge_bio now accumulates all position labels across Phase 2 rows into a set and joins them into a comma-separated `position` field. Players who play both forward and winger no longer lose that information.
#9 Native name (P1559)
Phase 2 player + coach queries now fetch wdt:P1559 and store it as `name_native`. This is the Wikidata equivalent of the salimt dataset's `name_in_home_country` field; reep now has native names for Eastern European / Asian / Arabic players without needing the salimt cross-reference.
Entity dict additions:
- name_native (was in schema, never populated)
- date_of_birth_precision (new)
Existing fields with new semantics:
- position (was scalar, now comma-separated)
Unit tested end-to-end with synthetic Phase 1 + Phase 2 rows for Messi, Ronaldo, Werner Herzog, and a year-precision stub. All four features verified. Also live-queried the real bio query against Wikidata: dobPrecision=11 and nativeName come through correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
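As a rough illustration of the DOB-precision and multi-position handling described in this commit, the sketch below folds several Phase 2 rows into one entity dict. It is not the project's merge_bio: the function name, the row keys other than `dobPrecision` and `nativeName`, the ", " separator, the sorted ordering, and the sample values are assumptions. The precision codes follow Wikidata's time model, where 9 is year, 10 is month, and 11 is day.

```python
# Wikidata timePrecision codes for the three granularities the commit stores.
PRECISION_LABELS = {9: "year", 10: "month", 11: "day"}


def merge_bio_rows_sketch(rows: list) -> dict:
    """Fold several Phase 2 result rows for one entity into a single dict."""
    entity = {}
    positions = set()
    for row in rows:
        # Collect every position label instead of keeping only the first row's.
        if row.get("position"):
            positions.add(row["position"])
        if row.get("dob") and "date_of_birth" not in entity:
            entity["date_of_birth"] = row["dob"][:10]  # keep YYYY-MM-DD
            entity["date_of_birth_precision"] = PRECISION_LABELS.get(
                int(row["dobPrecision"])
            )
        if row.get("nativeName"):
            entity.setdefault("name_native", row["nativeName"])
    if positions:
        # Sorted so the joined string is stable across runs (an assumption).
        entity["position"] = ", ".join(sorted(positions))
    return entity


# Synthetic year-precision stub with two position rows.
rows = [
    {"dob": "1995-01-01T00:00:00Z", "dobPrecision": "9", "position": "winger"},
    {"dob": "1995-01-01T00:00:00Z", "dobPrecision": "9", "position": "forward",
     "nativeName": "Демо"},
]
merged = merge_bio_rows_sketch(rows)
print(merged)
```

With the precision stored, a matcher can treat this '1995-01-01' as "born in 1995" rather than as a literal 1 January birth.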
Summary
Migrates Reep from Wikidata QIDs as primary keys to self-minted Reep IDs (reep_<type_prefix><8hex>). Wikidata becomes a provider mapping, not the identity backbone.
Closes #7
Changes
reep_id as PK (was 475K with (qid, type))
external_ids (dropped)
(qid, type) to reep_id
/lookup ?id= auto-detects Reep IDs and QIDs, legacy ?qid= still works
reep_id as first column, key_wikidata as regular provider column
Migration phases
e75d192, b65cf6c
1af712c, 3f09c19
264ae56, 9bcb9f6
f7b4c17, 81e9019
9a642bf, 3a2f4c9
Design
See docs/plan-reep-id.md for the full design rationale (4 rounds of adversarial review). Follows the Chadwick Baseball Bureau Register model.