Commit 9766049

rahulkeerthi and claude committed
fix: incremental-update.py — align changed-QIDs discovery with expanded filters
The previous commit aligned build_scoped_ids_query (the second-stage per-QID fetch) with the shared Phase 1 query shape. But the first-stage discovery query in fetch_changed_qids was still using the old narrow filters, so even if a newly reachable entity got edited on Wikidata, the incremental update wouldn't see it.

Two filters expanded to match fetch-wikidata-entities.py:

1. Team discovery: three-path UNION matching build_team_ids_query:
   - instance of (a subclass of) Q476028 (association football club): existing
   - instance of (a subclass of) Q103229495 (men's association football team): new
   - P641=Q2736 AND instance of (a subclass of) Q847017 (sports club with football): new

   Matters because Wikidata classifies many well-known clubs solely under Q103229495 (Q7156 Barcelona, Q170703 Boca Juniors) or under generic Q847017 (Q8206935 Estudiantes de Río Cuarto). Those were silently unreachable by the old filter.

2. Competition discovery: now mirrors the full-fetch competition query's shape:
   - Class path requires explicit P641 = Q2736 (blocks NHL/PGA/rugby from leaking through via their P31 chain)
   - Property path allows entities with a football-specific provider claim, with the non-football FILTER NOT EXISTS guard
   - FILTER NOT EXISTS { ?e wdt:P3450 ?parentComp } excludes seasonal entities (they're handled by the season query below)

   Matters because the old incremental discovery was picking up seasonal entities as competitions and letting non-football competitions leak through the property path.

Live-verified: the new team discovery query returns 98 entities modified in the past day. The query is syntactically valid, returns a non-zero result, and executes cleanly against Wikidata's endpoint.

Note that this closes the drift for discovery, but the scoped per-QID fetch query (build_scoped_ids_query) still uses VALUES to restrict, which means entities that were modified BEFORE the expanded filter went live still won't be discovered until a full refresh. For those, the dump-based workflow (coming next) is the right tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
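
The three-path team filter described in the commit message can be sketched as a small data-driven query builder. This is a minimal illustration, not the repository's code: the function name `build_team_discovery_query` and the `paths` table are hypothetical, and only the QIDs and properties are taken from the commit message.

```python
def build_team_discovery_query(since: str) -> str:
    """Return a SPARQL query for teams modified after `since` (YYYY-MM-DD).

    Hypothetical helper for illustration; the real script inlines the query.
    """
    # (class QID, whether the path also requires P641 = Q2736 (football))
    paths = [
        ("Q476028", False),     # association football club
        ("Q103229495", False),  # men's association football team
        ("Q847017", True),      # generic sports club, football only
    ]
    branches = []
    for class_qid, needs_sport in paths:
        triples = []
        if needs_sport:
            triples.append("?e wdt:P641 wd:Q2736 .")
        triples.append("?e wdt:P31 ?type .")
        triples.append(f"?type (wdt:P279)* wd:{class_qid} .")
        branches.append("{ " + " ".join(triples) + " }")
    union = " UNION ".join(branches)
    # The modification-date filter sits outside the UNION, so it applies
    # to every path.
    return (
        "SELECT DISTINCT ?e WHERE { "
        f"{union} "
        "?e schema:dateModified ?mod . "
        f'FILTER(?mod > "{since}T00:00:00Z"^^xsd:dateTime) '
        "}"
    )


query = build_team_discovery_query("2024-01-01")
```

Keeping the paths in one table is one way to make a future fourth path a single-line change; whether the project shares such a table between the full fetch and the incremental update is not stated in the commit.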
1 parent 088318e commit 9766049

1 file changed

Lines changed: 52 additions & 10 deletions

scripts/incremental-update.py
@@ -97,7 +97,15 @@ def execute_sql_file(sql_path: str, remote: bool = True) -> bool:
 # ---------------------------------------------------------------------------
 
 def fetch_changed_qids(since: str) -> dict[str, list[str]]:
-    """Fetch QIDs modified since the given date, per entity type."""
+    """Fetch QIDs modified since the given date, per entity type.
+
+    These discovery queries must stay aligned with the initial-filter paths
+    in fetch-wikidata-entities.py's build_*_ids_query functions — any entity
+    reachable in the full fetch should also be discoverable here if it's
+    been edited recently. When the full-fetch filter widens, this file has
+    to widen too, otherwise newly-reachable entities would only arrive via
+    a full refresh.
+    """
     type_queries = {
         "player": f"""
             SELECT DISTINCT ?e WHERE {{
@@ -107,10 +115,27 @@ def fetch_changed_qids(since: str) -> dict[str, list[str]]:
               FILTER NOT EXISTS {{ ?e wdt:P31 wd:Q95074 }}
               FILTER NOT EXISTS {{ ?e wdt:P31 wd:Q15632617 }}
             }}""",
+        # Team: three-path UNION matching build_team_ids_query. Covers
+        # subclass of Q476028 (association football club), Q103229495
+        # (men's association football team — catches Q7156 Barcelona,
+        # Q170703 Boca), and sports clubs with football as sport.
         "team": f"""
             SELECT DISTINCT ?e WHERE {{
-              ?e wdt:P31 ?type .
-              ?type (wdt:P279)* wd:Q476028 .
+              {{
+                ?e wdt:P31 ?type .
+                ?type (wdt:P279)* wd:Q476028 .
+              }}
+              UNION
+              {{
+                ?e wdt:P31 ?type .
+                ?type (wdt:P279)* wd:Q103229495 .
+              }}
+              UNION
+              {{
+                ?e wdt:P641 wd:Q2736 .
+                ?e wdt:P31 ?type .
+                ?type (wdt:P279)* wd:Q847017 .
+              }}
               ?e schema:dateModified ?mod .
               FILTER(?mod > "{since}T00:00:00Z"^^xsd:dateTime)
             }}""",
@@ -122,15 +147,32 @@ def fetch_changed_qids(since: str) -> dict[str, list[str]]:
               FILTER NOT EXISTS {{ ?e wdt:P31 wd:Q95074 }}
               FILTER NOT EXISTS {{ ?e wdt:P31 wd:Q15632617 }}
             }}""",
+        # Competition: class path requires explicit P641 = football
+        # (blocks NHL / PGA / rugby leaks), property path allows any
+        # entity with a football-specific provider claim. Seasons are
+        # excluded via FILTER NOT EXISTS P3450 (they're handled by the
+        # season query below).
         "competition": f"""
             SELECT DISTINCT ?e WHERE {{
-              {{ ?e wdt:P31/wdt:P279* wd:Q15991290 . }}
-              UNION
-              {{ ?e wdt:P12758 [] . }}
-              UNION
-              {{ ?e wdt:P13664 [] . }}
-              UNION
-              {{ ?e wdt:P8735 [] . }}
+              {{
+                {{
+                  ?e wdt:P31/wdt:P279* wd:Q15991290 .
+                  ?e wdt:P641 wd:Q2736 .
+                }}
+                UNION
+                {{
+                  {{ ?e wdt:P12758 [] . }}
+                  UNION
+                  {{ ?e wdt:P13664 [] . }}
+                  UNION
+                  {{ ?e wdt:P8735 [] . }}
+                  FILTER NOT EXISTS {{
+                    ?e wdt:P641 ?sport .
+                    FILTER(?sport != wd:Q2736)
+                  }}
+                }}
+              }}
+              FILTER NOT EXISTS {{ ?e wdt:P3450 ?parentComp . }}
               ?e schema:dateModified ?mod .
               FILTER(?mod > "{since}T00:00:00Z"^^xsd:dateTime)
             }}""",
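
To make the competition filter's logic easier to check by eye, here is a pure-Python model of the same predicate over a simplified entity record. It is only a sketch: the `Entity` dataclass, its field names, and `is_discoverable_competition` are hypothetical, and the real check runs as SPARQL against Wikidata's endpoint.

```python
from dataclasses import dataclass, field

FOOTBALL = "Q2736"  # association football, the P641 value used in the query


@dataclass
class Entity:
    # True if P31/P279* reaches Q15991290 (the competition class path)
    is_competition_class: bool = False
    # All P641 (sport) values on the entity
    sports: set[str] = field(default_factory=set)
    # True if the entity has any of P12758, P13664, P8735
    has_provider_id: bool = False
    # True if P3450 is present, i.e. the entity is a season
    has_parent_competition: bool = False


def is_discoverable_competition(e: Entity) -> bool:
    """Model of the competition discovery filter (date filter omitted)."""
    # Seasons are excluded outright; the season query handles them.
    if e.has_parent_competition:
        return False
    # Class path: must carry an explicit football sport claim.
    class_path = e.is_competition_class and FOOTBALL in e.sports
    # Property path: a provider claim, and no sport other than football.
    # An entity with no P641 at all passes this guard.
    property_path = e.has_provider_id and all(s == FOOTBALL for s in e.sports)
    return class_path or property_path
```

Note the asymmetry the model makes visible: the class path needs a positive football claim, while the property path only rejects entities that carry some other sport.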