Commit 7d88940

fix(audit): close two Citation Audit coverage holes
Two regressions surfaced when the Citation Audit GitHub workflow turned red on the v1.15.6 release commit (run 26375546484, exit 1: "476 ok / 1 mismatch / 1 unresolved").

Root causes, both in tools/audit_citations.py and both pre-existing (the workflow has been red since at least v1.15.5):

1. DOI regex captured markdown autolink closers. ``docs/joss_validation_dossier.md:9`` writes the Zenodo concept DOI as ``<https://doi.org/10.5281/zenodo.19933900>``. The DOI character class ``[^\s()"'\`,;}\]\[]+`` did NOT exclude ``<`` / ``>``, so the match captured the trailing ``>``, then Crossref + DataCite both 404'd on ``10.5281/zenodo.19933900>``. RFC 3986 reserves angle brackets in URIs (must be percent-encoded), so adding ``<>`` to the exclusion class and to the trailing-boundary lookahead is safe.

2. ``_bibtex_markers`` missed ``@software{`` (and adjacent valid BibTeX types ``@unpublished{`` / ``@incollection{``). ``src/statspai/_citation.py:26`` defines ``_CONCEPT_DOI``; four lines below sits a Python string template ``"@software{{...`` carrying the canonical author list. The ±3 claim-block window captured the template but the marker set didn't recognise ``@software{`` as a BibTeX block, so the script ran the missing-author check and flagged "missing author(s): Rozelle, Scott". Adding the three types to the marker tuple lets the script trust the structured ``author={...}`` field for software / unpublished / incollection entries the same way it does for ``@article`` etc.

Verification (all four gates, locally, cache cleared):

- Gate 0: 32/32 auditor unit tests pass
- Gate 1: paper.bib has 0 duplicate keys / DOIs / arXiv ids
- Gate 2: 0 dangling refs across src/ docs/ paper.md
- Gate 3: 478 ok / 0 mismatch / 0 unresolved, exit 0

No package source touched: pure tooling fix, no version bump needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4b8ebe0 commit 7d88940

1 file changed

Lines changed: 11 additions & 5 deletions

tools/audit_citations.py

@@ -92,19 +92,24 @@
 # like ``10.1016/S0169-7218(11)00407-2``, Emerald volume 20 like
 # ``10.1108/S1049-2585(2012)0000020009``). Up to 2 levels of nesting
 # is plenty in practice.
-_DOI_NO_PAREN = r"[^\s()\"'`,;}\]\[]+"
+#
+# ``<`` / ``>`` are excluded so that markdown autolinks of the form
+# ``<https://doi.org/10.xxxx/yyyy>`` don't pull the trailing ``>`` into
+# the DOI body. RFC 3986 reserves angle brackets in URIs (they must be
+# percent-encoded), so no real DOI contains a literal ``<`` or ``>``.
+_DOI_NO_PAREN = r"[^\s()<>\"'`,;}\]\[]+"
 _DOI_PAREN = rf"\(?:{_DOI_NO_PAREN}\)?"  # placeholder, see verbose form
 DOI_RE = re.compile(
     r"""
     \b(?P<id>
         10\.\d{4,9}/                        # DOI prefix
         (?:
-            [^\s()"'`,;}\]\[]                # non-paren body char
-          | \( [^\s()"'`,;}\]\[]* \)         # balanced (...) one level
+            [^\s()<>"'`,;}\]\[]              # non-paren body char
+          | \( [^\s()<>"'`,;}\]\[]* \)       # balanced (...) one level
         )+?
     )
     \.?                                      # optional trailing period
-    (?= [\s)\"'`,;}\]\[] | $ )
+    (?= [\s)<>\"'`,;}\]\[] | $ )
     """,
     re.VERBOSE,
 )
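
The effect of this hunk can be reproduced in isolation. The sketch below is a compact, non-verbose re-assembly of the pattern from the diff, not the auditor's actual module; `build_doi_re` and the `OLD_*` / `NEW_*` names are hypothetical helpers introduced only for the comparison. It runs both variants against the autolink quoted in the commit message and against one of the parenthesised DOIs mentioned in the code comments.

```python
import re

# Character classes copied from the diff. OLD_* is the pre-fix form,
# NEW_* adds "<" and ">" to both the body class and the lookahead class.
OLD_BODY = r"""[^\s()"'`,;}\]\[]"""
NEW_BODY = r"""[^\s()<>"'`,;}\]\[]"""
OLD_STOP = r"""[\s)"'`,;}\]\[]"""
NEW_STOP = r"""[\s)<>"'`,;}\]\[]"""


def build_doi_re(body: str, stop: str) -> re.Pattern:
    """Hypothetical helper: assemble the DOI pattern from its two classes."""
    return re.compile(
        rf"\b(?P<id>10\.\d{{4,9}}/(?:{body}|\({body}*\))+?)\.?(?={stop}|$)"
    )


OLD_RE = build_doi_re(OLD_BODY, OLD_STOP)
NEW_RE = build_doi_re(NEW_BODY, NEW_STOP)

autolink = "<https://doi.org/10.5281/zenodo.19933900>"
old_id = OLD_RE.search(autolink).group("id")
new_id = NEW_RE.search(autolink).group("id")
print(old_id)  # 10.5281/zenodo.19933900>  (closing ">" swallowed)
print(new_id)  # 10.5281/zenodo.19933900

# Balanced parentheses inside a DOI are still matched after the change.
paren_id = NEW_RE.search("10.1016/S0169-7218(11)00407-2").group("id")
print(paren_id)  # 10.1016/S0169-7218(11)00407-2
```

The old pattern keeps consuming until it reaches end of input because `>` is neither a stop character nor excluded from the body, which is how the trailing `>` ended up in the lookup sent to Crossref and DataCite.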
@@ -645,7 +650,8 @@ def diff_citation(c: Citation, truth: PaperMeta) -> list[str]:
         "author={", "title={", "journal={", "booktitle={",
         "year={", "doi={", "volume={", "number={", "pages={",
         "publisher={", "@article{", "@inproceedings{", "@book{",
-        "@misc{", "@techreport{", "@phdthesis{",
+        "@misc{", "@techreport{", "@phdthesis{", "@software{",
+        "@unpublished{", "@incollection{",
     )
     is_bibtex = any(m in c.claim_block for m in _bibtex_markers)
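
A minimal sketch of why the extra markers matter, assuming the claim block is a plain string and the check is the substring test shown in the hunk. The marker tuples are copied from the diff; `is_bibtex_block` and the sample `claim_block` are hypothetical, and the elided template content is left as `...` because the source does not show it.

```python
# Marker tuple as it stood before the commit (copied from the diff context).
OLD_MARKERS = (
    "author={", "title={", "journal={", "booktitle={",
    "year={", "doi={", "volume={", "number={", "pages={",
    "publisher={", "@article{", "@inproceedings{", "@book{",
    "@misc{", "@techreport{", "@phdthesis{",
)
# The commit appends three more BibTeX entry types.
NEW_MARKERS = OLD_MARKERS + ("@software{", "@unpublished{", "@incollection{")


def is_bibtex_block(claim_block: str, markers: tuple[str, ...]) -> bool:
    """Hypothetical wrapper around the substring test used in diff_citation."""
    return any(m in claim_block for m in markers)


# Illustrative claim block: a Python format-string template whose opener is
# written "@software{{" (the doubled brace escapes "{" for str.format).
claim_block = '_CONCEPT_DOI = "..."\n\n_TEMPLATE = (\n    "@software{{...\n'

print(is_bibtex_block(claim_block, OLD_MARKERS))  # False
print(is_bibtex_block(claim_block, NEW_MARKERS))  # True
```

Because the test is a plain substring match, the escaped opener `@software{{` still contains `@software{`, so the template is recognised without any special handling for format-string braces.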