Commit 64c3ef7

fix(audit): fold curly apostrophe in citation name matching
The citation auditor flagged `de Chaisemartin & D'Haultfœuille` (DOI 10.1162/rest_a_01414) as missing a co-author because Crossref returns "Xavier D’Haultfœuille" with a curly apostrophe (U+2019), while the docstring uses the straight apostrophe (U+0027).

The previous _normalise treated U+2019 as punctuation and replaced it with a space (the character class [^\w\s'-] preserves only U+0027), so the surname tokenised differently on each side and the name match failed.

Add an _APOSTROPHE_FOLD table that maps typographic apostrophe and hyphen variants (U+2019, U+2018, U+02BC, U+2032, U+00B4, U+2010, U+2011, U+2013) to their ASCII forms before normalisation.

After the fold, a fresh strict audit reports 477 ok / 0 mismatch / 0 unresolved (was 475 ok / 2 mismatch).
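The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the auditor's actual code: it reuses the character class quoted in the message, skips diacritic stripping for brevity, and `tokens_before_fix` is a hypothetical helper name introduced only for illustration.

```python
import re

# Same character class as quoted in the commit message: only the straight
# apostrophe (U+0027) and the ASCII hyphen survive; any other punctuation
# is replaced with a space.
PUNCT_TO_SPACE = re.compile(r"[^\w\s'-]", re.UNICODE)


def tokens_before_fix(name: str) -> list:
    """Hypothetical helper: tokenise a name roughly as the pre-fix code would."""
    return PUNCT_TO_SPACE.sub(" ", name.lower()).split()


# Crossref spelling, curly apostrophe U+2019: the surname splits in two.
crossref = tokens_before_fix("Xavier D\u2019Haultfœuille")
# Source spelling, straight apostrophe U+0027: the surname stays whole.
source = tokens_before_fix("Xavier D'Haultfœuille")

print(crossref)  # ['xavier', 'd', 'haultfœuille']
print(source)    # ['xavier', "d'haultfœuille"]
```

Because the two token lists differ, a comparison on surname tokens cannot line the two spellings up, which is consistent with the false "missing co-author" report.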
1 parent f89a396 commit 64c3ef7

1 file changed

Lines changed: 19 additions & 1 deletion

tools/audit_citations.py

@@ -262,6 +262,21 @@ class Verdict:
 
 _PUNCT_TO_SPACE = re.compile(r"[^\w\s'-]", re.UNICODE)
 
+# Apostrophe / hyphen variants that name fields use interchangeably across
+# sources. Crossref emits curly U+2019 in author names ("D’Haultfœuille"),
+# Python source typically uses straight U+0027 ("D'Haultfœuille"); without
+# this fold the same surname tokenises differently on each side.
+_APOSTROPHE_FOLD = str.maketrans({
+    "’": "'",  # right single quotation mark
+    "‘": "'",  # left single quotation mark
+    "ʼ": "'",  # modifier letter apostrophe
+    "′": "'",  # prime
+    "´": "'",  # acute accent (occasionally misused as apostrophe)
+    "‐": "-",  # hyphen
+    "‑": "-",  # non-breaking hyphen
+    "–": "-",  # en dash
+})
+
 
 def _strip_diacritics(s: str) -> str:
     """NFD + drop combining marks, preserve case and punctuation."""
@@ -274,8 +289,11 @@ def _normalise(s: str) -> str:
 
     Replacing punctuation with *spaces* (not deletion) is important: it
     keeps "(Imbens," from collapsing into one un-splittable token.
-    Apostrophe and hyphen are kept for names like O'Neill, Tabord-Meehan.
+    Apostrophe and hyphen are kept for names like O'Neill, Tabord-Meehan;
+    curly / typographic apostrophe variants are folded to the straight
+    ASCII form so D’Haultfœuille (Crossref) matches D'Haultfœuille (source).
     """
+    s = s.translate(_APOSTROPHE_FOLD)
     s = _strip_diacritics(s).lower()
     s = _PUNCT_TO_SPACE.sub(" ", s)
     return " ".join(s.split())  # collapse whitespace
