
feat: Reep ID migration - self-minted universal entity IDs #8

Merged: rahulkeerthi merged 11 commits into main from feat/reep-id-migration on Apr 2, 2026

@rahulkeerthi


Summary

Migrates Reep from Wikidata QIDs as primary keys to self-minted Reep IDs (reep_<type_prefix><8hex>). Wikidata becomes a provider mapping, not the identity backbone.

Closes #7

Changes

  • 488,063 entities with reep_id as PK (was 475K with (qid, type))
  • 1.7M provider_ids table replaces external_ids (dropped)
  • 353K custom_ids rekeyed from (qid, type) to reep_id
  • 13,232 Opta-only players imported (333 duplicates resolved)
  • API v2.0.0: /lookup?id= auto-detects Reep IDs and QIDs, legacy ?qid= still works
  • CSVs: reep_id as first column, key_wikidata as regular provider column
  • Pipeline: seed, incremental update, fetch, and export scripts all updated for reep_id

Migration phases

| Phase | Commits | What |
|-------|---------|------|
| 1 | e75d192, b65cf6c | Mint 475K reep_ids, update worker |
| 2 | 1af712c, 3f09c19 | provider_ids, custom_ids rekey, pipeline updates |
| 3 | 264ae56, 9bcb9f6 | Import Opta-only players, resolve 333 dupes |
| 4 | f7b4c17, 81e9019 | Staging rehearsal passed, production cutover |
| docs | 9a642bf, 3a2f4c9 | README, schema, openapi, data files |

Design

See docs/plan-reep-id.md for the full design rationale (4 rounds of adversarial review). Follows the Chadwick Baseball Bureau Register model.

rahulkeerthi and others added 11 commits April 2, 2026 18:14
- Add mint-reep-ids.py: generates reep_<type_prefix><8hex> IDs for all
  entities. Idempotent (only processes reep_id IS NULL), collision-safe
  (constraint violation retry), batched SQL execution.
- Add reep_id column + unique index to schema definition.
- Add reep_id to worker API responses (ENTITY_COLS, search handler).
  Additive only — no breaking changes.

Dry-run validated: 475,164 entities (386K players, 45K teams, 43K coaches).
Run `python scripts/mint-reep-ids.py` to mint IDs on production D1.

See docs/plan-reep-id.md and docs/plans/2026-04-02-001-feat-reep-id-migration-plan.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
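The minting behaviour described in this commit (only rows with `reep_id IS NULL`, retry on a unique-constraint violation) can be sketched as below. This is a minimal illustration against a local SQLite database, not the contents of mint-reep-ids.py: the `TYPE_PREFIX` mapping, the function names, and the use of `secrets.token_hex` are assumptions, and only the `p` prefix (`reep_p...`) is attested in this PR.

```python
import secrets
import sqlite3

# Hypothetical prefixes: only "p" (reep_p...) appears in the PR text.
TYPE_PREFIX = {"player": "p", "team": "t", "coach": "c"}


def new_reep_id(entity_type: str) -> str:
    """Return an ID shaped like reep_<type_prefix><8hex>."""
    return f"reep_{TYPE_PREFIX[entity_type]}{secrets.token_hex(4)}"


def mint_missing(conn: sqlite3.Connection, max_retries: int = 5) -> int:
    """Assign a reep_id to every entity that lacks one; return rows processed."""
    # Idempotent: rows that already have a reep_id are never selected.
    rows = conn.execute(
        "SELECT qid, type FROM entities WHERE reep_id IS NULL"
    ).fetchall()
    for qid, entity_type in rows:
        for _ in range(max_retries):
            try:
                conn.execute(
                    "UPDATE entities SET reep_id = ? WHERE qid = ? AND type = ?",
                    (new_reep_id(entity_type), qid, entity_type),
                )
                break
            except sqlite3.IntegrityError:
                continue  # unique index rejected the ID: draw a fresh one
        else:
            raise RuntimeError(f"could not mint a reep_id for {qid}")
    conn.commit()
    return len(rows)
```

Running `mint_missing` a second time processes zero rows, which is the property that makes a re-run after a partial failure safe.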
- Add reep_id to OpenAPI Entity schema (nullable) + search example
- Clean up temp files after batch execution (try/finally + unlink)
- Move Counter import to module level
- Add type annotation to escape_sql parameter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
openapi.yaml was deliberately gitignored in f54a80d. This reverts its
accidental re-addition in the review fix commit. The reep_id
schema update lives locally and gets uploaded to RapidAPI manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- create-provider-ids.py: creates provider_ids table (1.7M rows) from
  external_ids (rekeyed to reep_id) + wikidata QID mappings. Includes
  lookup index and count verification.
- rekey-custom-ids.py: migrates custom_ids from (qid, type) to reep_id.
  Backs up to JSON first, handles orphan detection, two-phase table swap.
  780 orphan Club Elo rows dropped (QIDs not in entities).
- .gitignore: exclude data/backups/

Both scripts executed on production D1:
  provider_ids: 1,735,049 rows (verified)
  custom_ids: 339,554 rows rekeyed (verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
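The rekey with orphan detection can be illustrated in plain Python. This is a sketch under assumptions rather than the logic of rekey-custom-ids.py: the row fields `provider` and `external_id` and the function name are hypothetical, and the JSON backup and two-phase table swap are left out.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def rekey_custom_ids(
    custom_rows: List[Row],
    reep_id_by_key: Dict[Tuple[str, str], str],
) -> Tuple[List[Row], List[Row]]:
    """Map (qid, type) rows onto reep_id; rows with no matching entity are orphans."""
    rekeyed: List[Row] = []
    orphans: List[Row] = []
    for row in custom_rows:
        reep_id = reep_id_by_key.get((row["qid"], row["type"]))
        if reep_id is None:
            orphans.append(row)  # QID not present in entities: reported and dropped
            continue
        rekeyed.append(
            {
                "reep_id": reep_id,
                "provider": row["provider"],        # hypothetical column name
                "external_id": row["external_id"],  # hypothetical column name
            }
        )
    return rekeyed, orphans
```

Returning the orphans separately, instead of silently skipping them, is what lets a count such as the 780 dropped Club Elo rows be reported and checked.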
Seed script (seed-wikidata-d1.py):
- Includes reep_id column in entity schema
- Mints reep_ids for all entities after seeding
- Rebuilds provider_ids from external_ids + wikidata QIDs
- 90% count safety threshold on provider_ids

Incremental update (incremental-update.py):
- Looks up existing reep_ids before INSERT OR REPLACE (preserves them)
- Mints reep_ids for new entities
- Dual-writes to both external_ids and provider_ids
- Includes wikidata QID mappings in provider_ids

Fetch custom IDs (fetch-custom-ids.py):
- Exports using reep_id (new schema)
- Also exports reep_id_map.json for CSV export

Export CSV (export-csv.py):
- reep_id as first column in people.csv and teams.csv
- Loads reep_id map to resolve QID -> reep_id
- Custom IDs merged via reep_id key

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
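The reason the incremental update looks up reep_ids first is that `INSERT OR REPLACE` deletes the conflicting row before inserting, so an ID that is not carried over would be lost. A minimal sketch of that pattern, assuming a simplified `entities` table and a caller-supplied `mint` function (both hypothetical):

```python
import sqlite3
from typing import Callable


def upsert_entity(
    conn: sqlite3.Connection,
    qid: str,
    entity_type: str,
    name: str,
    mint: Callable[[str], str],
) -> str:
    """Insert or refresh an entity while keeping any reep_id it already has."""
    row = conn.execute(
        "SELECT reep_id FROM entities WHERE qid = ? AND type = ?",
        (qid, entity_type),
    ).fetchone()
    # Reuse the existing ID; mint only for entities seen for the first time.
    reep_id = row[0] if row and row[0] else mint(entity_type)
    conn.execute(
        "INSERT OR REPLACE INTO entities (qid, type, name, reep_id) "
        "VALUES (?, ?, ?, ?)",
        (qid, entity_type, name, reep_id),
    )
    return reep_id
```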
- import-opta-entities.py: creates 13,232 new entities for Opta players
  not in Wikidata. Uses reep_id as qid placeholder, source=opta.
- dedup-check.py: compares Opta entities against Wikidata entities via
  DOB + name similarity. Found 335 potential duplicates (score >= 0.90)
  that need manual resolution.

Production D1 state:
  entities: 488,396 (475,164 Wikidata + 13,232 Opta)
  custom_ids: 352,787 (339,554 + 13,233 Opta imports)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
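A duplicate check of the "DOB + name similarity" kind could look like the sketch below. The commit does not say which similarity metric dedup-check.py uses; `difflib.SequenceMatcher` is a stand-in, and the field names are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Entity = Dict[str, str]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def find_potential_duplicates(
    opta: List[Entity],
    wikidata: List[Entity],
    threshold: float = 0.90,
) -> List[Tuple[str, str, float]]:
    """Pair Opta and Wikidata entities that share a DOB and have similar names."""
    # Index Wikidata entities by DOB so only same-birthday pairs are compared.
    by_dob: Dict[str, List[Entity]] = {}
    for w in wikidata:
        by_dob.setdefault(w["date_of_birth"], []).append(w)

    matches = []
    for o in opta:
        for w in by_dob.get(o["date_of_birth"], []):
            score = name_similarity(o["name"], w["name"])
            if score >= threshold:
                matches.append((o["reep_id"], w["reep_id"], score))
    return matches
```

Bucketing by date of birth keeps the comparison count far below the full cross product of the two entity sets.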
Merged Opta-only entities into their Wikidata counterparts where
DOB + name matched at score >= 0.90. Repoints Opta IDs to the
Wikidata entity's reep_id, then deletes the duplicate.

Production D1: 488,063 entities (was 488,396, removed 333 dupes)
Spot check: Gareth Bale (Q184586) now has opta ID via Wikidata entity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
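The repoint-then-delete merge can be sketched as a single transaction. The table layout here is simplified and the function name is hypothetical; a real merge would also need to handle the case where the surviving entity already holds an ID for the same provider.

```python
import sqlite3


def merge_duplicate(
    conn: sqlite3.Connection, dup_reep_id: str, keep_reep_id: str
) -> None:
    """Move the duplicate's custom IDs to the surviving entity, then delete it."""
    # "with conn" commits on success and rolls back if either statement fails,
    # so the entity is never deleted while its IDs still point at it.
    with conn:
        conn.execute(
            "UPDATE custom_ids SET reep_id = ? WHERE reep_id = ?",
            (keep_reep_id, dup_reep_id),
        )
        conn.execute("DELETE FROM entities WHERE reep_id = ?", (dup_reep_id,))
```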
- cutover-reep-id.py: 10-step numbered script with checkpoint verification.
  Creates entities_new with reep_id PK, populates via INSERT...SELECT
  ORDER BY reep_id, drops old entities + external_ids, rebuilds FTS.
- clone-to-staging.py: clones production data to staging D1 for rehearsal.

Rehearsal results on staging D1 (reep-staging):
  - 488,063 entities migrated, count exact match
  - reep_id confirmed as PK, qid column removed
  - FTS search working (Cole Palmer found)
  - provider_ids + custom_ids unchanged
  - external_ids dropped
  - All post-flight checks passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
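The core of the table swap (new table with `reep_id` as PK, copy, count check, drop, rename) can be rehearsed on a local SQLite file. This sketch covers only a subset of what the 10-step script is described as doing, uses a reduced column set, and omits the FTS rebuild; every name except the table names is an assumption.

```python
import sqlite3


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return COUNT(*) for a table, closing the cursor before returning."""
    cur = conn.execute(f"SELECT COUNT(*) FROM {table}")
    try:
        return cur.fetchone()[0]
    finally:
        cur.close()


def cutover(conn: sqlite3.Connection) -> int:
    """Swap entities for a copy keyed on reep_id; return the migrated row count."""
    before = count_rows(conn, "entities")
    conn.execute(
        "CREATE TABLE entities_new ("
        "reep_id TEXT PRIMARY KEY, type TEXT NOT NULL, name TEXT)"
    )
    conn.execute(
        "INSERT INTO entities_new (reep_id, type, name) "
        "SELECT reep_id, type, name FROM entities ORDER BY reep_id"
    )
    # Checkpoint: refuse to drop anything unless the copy is complete.
    after = count_rows(conn, "entities_new")
    if after != before:
        raise RuntimeError(f"row count mismatch: {before} -> {after}")
    # Each statement runs on its own; external_ids goes first because it
    # references entities.
    conn.execute("DROP TABLE IF EXISTS external_ids")
    conn.execute("DROP TABLE entities")
    conn.execute("ALTER TABLE entities_new RENAME TO entities")
    conn.commit()
    return after
```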
Worker (worker.ts) rewritten for new schema:
- Queries provider_ids instead of external_ids (dropped)
- lookupByReepId replaces lookupEntity/lookupEntities
- /lookup accepts ?id= with auto-detection: reep_id (reep_p...) or QID (Q...)
- Legacy ?qid= param still works
- /search returns qid as convenience field from provider_ids
- /resolve + /batch/* use provider_ids + custom_ids
- /stats queries provider_ids instead of external_ids
- Version bumped to 2.0.0

Cutover script (cutover-reep-id.py) fixed:
- Drop external_ids before entities (FK constraint)
- Execute DROP and RENAME as separate commands (D1 batch limitation)

Production D1 state:
  entities: 488,063 (reep_id PK, no qid column)
  provider_ids: 1,735,049
  custom_ids: 352,787
  external_ids: dropped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
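The `?id=` auto-detection amounts to classifying the incoming string before choosing a lookup path. The worker itself is TypeScript; the Python sketch below only illustrates the idea, and the exact patterns (a single lowercase letter as type prefix, lowercase hex) are assumptions drawn from the `reep_<type_prefix><8hex>` description.

```python
import re

# Assumed shapes: reep_ + one lowercase letter + 8 hex digits, or Q + digits.
REEP_ID_RE = re.compile(r"reep_[a-z][0-9a-f]{8}")
QID_RE = re.compile(r"Q[0-9]+")


def classify_id(value: str) -> str:
    """Return 'reep_id', 'qid', or 'unknown' for a /lookup?id= value."""
    if REEP_ID_RE.fullmatch(value):
        return "reep_id"
    if QID_RE.fullmatch(value):
        return "qid"
    return "unknown"
```

With this split, a QID request would first resolve to a reep_id through provider_ids and then follow the same path as a native reep_id lookup.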
README.md:
- reep_id is now the canonical key (was key_wikidata)
- Updated entity counts to 488K (was 430K)
- New "Reep IDs" section explaining the ID scheme and Chadwick precedent
- Updated people/teams schema tables with reep_id as first column
- Updated usage examples (Python, R, SQL) to use reep_id
- /lookup endpoint now documents ?id= with auto-detection
- Coverage section references provider_ids (was external_ids)

schemas/football-entities.sql:
- entities PK is now reep_id (was qid, type)
- current_team_reep_id replaces current_team_qid
- source column added (wikidata, opta, etc.)
- provider_ids replaces external_ids
- custom_ids schema uses reep_id FK
- All indexes updated

openapi.yaml (gitignored, updated locally):
- Version 2.0.0
- Entity schema has reep_id as primary field, qid nullable
- /lookup documents ?id= with auto-detection
- /batch/lookup accepts { ids: [...] } (was { qids: [...] })
- current_team_reep_id, source field added
- Server URL uses RapidAPI (not direct worker URL)

CLAUDE.md (gitignored, updated locally):
- Architecture diagram shows provider_ids
- D1 tables section reflects new schema and counts
- Scripts table includes all migration scripts
- Identity Model section explains reep_id scheme
- Commands section updated
- Provider coverage query uses JOINs on reep_id

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- people.csv: reep_id as first column (488K rows)
- teams.csv: reep_id as first column (45K rows)
- custom_ids.json: rekeyed from (qid, type) to reep_id (340K rows)
- meta.json: updated counts
- .gitignore: exclude reep_id_map.json and dedup-report.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rahulkeerthi rahulkeerthi merged commit 6594873 into main Apr 2, 2026
1 check passed
@rahulkeerthi rahulkeerthi deleted the feat/reep-id-migration branch April 2, 2026 22:05
rahulkeerthi added a commit that referenced this pull request Apr 10, 2026
…ulti-position, native name

Four Phase 2 enrichment additions from the Wikidata pipeline review
(bundle A). Takes effect on the next weekly refresh; each field is an
additive json key so existing consumers aren't broken.

#6 DOB precision capture
    Phase 2 player + coach queries now read the P569 precision via
    p:P569/psv:P569/wikibase:timePrecision alongside the date value.
    Every DOB'd entity now carries `date_of_birth_precision`:
    'day' / 'month' / 'year'. Previously a stored '1995-01-01' was
    ambiguous between a real Jan 1 birth and a year-precision stub
    (about 20K reep entities have YYYY-01-01 DOBs today). Match scripts
    like sync-transfermarkt-datasets.py can now use this to pick the
    right fallback path instead of heuristically inferring precision
    from the trailing '-01-01'.
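Wikidata encodes time precision as an integer (9 = year, 10 = month, 11 = day), so the mapping to the `date_of_birth_precision` labels is a small lookup. The helper names below are hypothetical, and the sketch shows how a consumer might use the field rather than how the pipeline stores it.

```python
from typing import Optional

# Wikidata timePrecision codes relevant to dates of birth.
WIKIDATA_PRECISION = {9: "year", 10: "month", 11: "day"}


def dob_precision_label(time_precision: int) -> Optional[str]:
    """Translate a wikibase:timePrecision value to 'year' / 'month' / 'day'."""
    return WIKIDATA_PRECISION.get(time_precision)


def is_trustworthy_full_date(dob: str, precision: Optional[str]) -> bool:
    """True only when the stored date is known to the day."""
    # A '1995-01-01' with precision 'year' is a stub, not a Jan 1 birthday.
    return precision == "day" and len(dob) == 10
```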

#7 Multi-language label fallback
    SPARQL label service was configured for English only, so any
    entity without an English label had its QID stored as the name
    (e.g. Q82595 Bundesliga, Q277533 Pablo Hernández).
    backfill-broken-names.py has been chasing these post-hoc via a full
    dump scan.
    Label service now accepts a 30-language chain (en → Romance/
    Germanic → Slavic/Turkic → Asian/Arabic) and returns the first
    available. Live-verified: Q82595 now returns 'Lliga alemanya de
    futbol' (Catalan) instead of the QID. The QID-as-name drop in
    parse_ids_phase stays as the final safety net.
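In the pipeline the fallback happens inside the SPARQL label service; the Python sketch below only mirrors the "first available label in chain order, never a bare QID" behaviour for illustration. The abbreviated `LANGUAGE_CHAIN` and the function name are assumptions, not the 30-language chain itself.

```python
import re
from typing import Dict, Optional, Sequence

# Abbreviated, hypothetical ordering; the real chain has 30 languages.
LANGUAGE_CHAIN = ("en", "es", "fr", "de", "ca", "ru", "tr", "ja", "ar")
QID_RE = re.compile(r"Q[0-9]+")


def pick_label(
    labels: Dict[str, str], chain: Sequence[str] = LANGUAGE_CHAIN
) -> Optional[str]:
    """Return the first usable label in chain order, or None if there is none."""
    for lang in chain:
        label = labels.get(lang)
        # Final safety net: a label that is just a QID counts as missing.
        if label and not QID_RE.fullmatch(label):
            return label
    return None
```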

#8 Multi-position capture
    Player P413 (position) was stored as the first-row value only.
    merge_bio now accumulates all position labels across Phase 2 rows
    into a set and joins them into a comma-separated `position` field.
    Players who play both forward and winger no longer lose that
    information.
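Accumulating positions across rows into a set and joining them can be sketched as follows. The row shape, the `", "` separator, and the alphabetical sort (added here for deterministic output) are assumptions; the commit only states that the result is a comma-separated `position` field.

```python
from typing import Dict, Iterable, Optional


def merge_positions(rows: Iterable[Dict[str, Optional[str]]]) -> str:
    """Collect every distinct position label and join them into one string."""
    positions = {row["position"] for row in rows if row.get("position")}
    # Sets are unordered, so sort to keep the output stable between runs.
    return ", ".join(sorted(positions))
```

Consumers that previously treated `position` as a scalar can recover the individual values by splitting on the separator.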

#9 Native name (P1559)
    Phase 2 player + coach queries now fetch wdt:P1559 and store it
    as `name_native`. This is the Wikidata equivalent of the salimt
    dataset's `name_in_home_country` field — reep now has native
    names for Eastern European / Asian / Arabic players without
    needing the salimt cross-reference.

Entity dict additions:
  - name_native                   (was in schema, never populated)
  - date_of_birth_precision       (new)

Existing fields with new semantics:
  - position                      (was scalar, now comma-separated)

Unit tested end-to-end with synthetic Phase 1 + Phase 2 rows for
Messi, Ronaldo, Werner Herzog, and a year-precision stub. All four
features verified. Also live-queried the real bio query against
Wikidata — dobPrecision=11 and nativeName come through correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>