chore(release): cut v1.13.1

brycewang-stanford · claude · brycewang-stanford · commit 4eb680f86963 · 2026-05-05T16:05:53.000-07:00
Bump version 1.14.0 → 1.13.1 and consolidate the [Unreleased],
[1.14.0], and [1.13.0] CHANGELOG sections into a single [1.13.1]
entry dated 2026-05-05. Reason: TestPyPI's 1.13.0 slot was occupied
by the earlier untagged "cut v1.13.0" build, so we ship 1.13.1 to keep
TestPyPI and PyPI artifacts in lockstep. PyPI 1.13.1 will be the
first published 1.13.x release and bundles the stability tiers, the
R/Stata parity dossier, the cold-start sklearn surgery, the weak-IV
preflight gate, and the CS-DiD REG IF-scaling correctness fix. README
headlines, the bibtex version, parity-table captions, and the history
notes of the re-pinned DID numerical fixtures are aligned to 1.13.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/README.md b/README.md
@@ -124,10 +124,24 @@ StatsPAI's focus is **causal inference** — and on this axis we aim to be the m
 
 ---
 
-**📦 v1.14.0 (2026-05-04) — External-validity dossier + cold-start surgery**
-
-v1.14 ships a 36-module R parity harness (`tests/r_parity/`), a 21-module
-Stata parity harness (`tests/stata_parity/`), 4 canonical-dataset
+**📦 v1.13.1 (2026-05-05) — Stability tiers + external-validity dossier + cold-start surgery**
+
+v1.13 stamps every `FunctionSpec` with a `stability` tier (`stable` /
+`experimental` / `deprecated`) plus per-function `limitations`,
+surfaced through `sp.describe_function`, `sp.list_functions(stability=...)`,
+the `statspai list` CLI, and the LLM-facing `sp.function_schema`;
+`sp.recommend` / `sp.causal` / `sp.paper` default to dropping
+`experimental` / `deprecated` entries unless `allow_experimental=True`
+is passed. Eight high-impact estimators (`aipw`, `aggte`,
+`pretrends_test`, `sensitivity_rr`, `mccrary_test`, `oster_bounds`,
+`wild_cluster_bootstrap`, `rd_honest`) are upgraded from
+auto-registered stubs to hand-written specs. A weak-instrument
+preflight gate in `sp.preflight(... "ivreg", formula=...)` flags
+first-stage F below the Staiger–Stock (1997) / Stock–Yogo (2005)
+thresholds, and `sp.recommend(... design='iv')` adaptively reorders
+LIML / AR ahead of 2SLS on weak first stages. v1.13 also ships a
+36-module R parity harness (`tests/r_parity/`), a 21-module Stata
+parity harness (`tests/stata_parity/`), 4 canonical-dataset
 original-paper replays (Card 1995, Callaway–Sant'Anna `mpdta`, Abadie
 Basque, LaLonde NSW + PSID-1 — all bit-equal to the published headline
 numbers), a Track-C performance harness (HDFE / CS-DiD / SCM / DML
@@ -145,7 +159,7 @@ submodules (down from 245). **⚠️ Correctness fix** —
 `sp.callaway_santanna(method='reg')` had a latent influence-function
 scaling bug; `'ipw'` and `'dr'` are unchanged but **re-run any
 v1.10–v1.13 CS-DiD analyses that used `method='reg'`**. Full notes in
-[`CHANGELOG.md`](CHANGELOG.md) under `[1.14.0]`.
+[`CHANGELOG.md`](CHANGELOG.md) under `[1.13.1]`.
 
 **📦 v1.12.2 (2026-05-01) — ML routing for `sp.causal_question` + shared robustness battery + weighted PLIV/IIVM**
 
@@ -1250,7 +1264,7 @@ resolves to the latest version):
   author       = {Wang, Biaoyue},
   title        = {StatsPAI: The Agent-Native Causal Inference \& Econometrics Toolkit for Python},
   year         = {2026},
-  version      = {1.14.0},
+  version      = {1.13.1},
   doi          = {10.5281/zenodo.19933900},
   url          = {https://doi.org/10.5281/zenodo.19933900},
   license      = {MIT},
diff --git a/README_CN.md b/README_CN.md
@@ -46,13 +46,26 @@ StatsPAI focuses on **causal inference**; on this axis, our goal is to become
 
 ---
 
-**📦 v1.14.0 (2026-05-04): External-validity dossier + cold-start surgery**
-
-Adds a 36-module R parity harness (`tests/r_parity/`), a 21-module Stata
-parity harness (`tests/stata_parity/`), original-paper replays on 4 datasets
-(Card / `mpdta` / Basque / LaLonde NSW + PSID-1, all bit-equal to the
-published numbers), a Track-C performance harness (log-log scaling for HDFE /
-CS-DiD / SCM / DML), empirical 95% CI coverage at B=1000 under `tests/coverage_monte_carlo/`
+**📦 v1.13.1 (2026-05-05): Stability tiers + external-validity dossier + cold-start surgery**
+
+v1.13 stamps every `FunctionSpec` with a `stability` tier
+(`stable` / `experimental` / `deprecated`) plus per-function `limitations`,
+surfaced end to end through `sp.describe_function` / `sp.list_functions(stability=...)` /
+the `statspai list` CLI / the LLM-facing description in `sp.function_schema`;
+`sp.recommend` / `sp.causal` / `sp.paper` drop `experimental` /
+`deprecated` entries by default unless `allow_experimental=True` is passed explicitly.
+Eight high-use estimators (`aipw` / `aggte` / `pretrends_test` / `sensitivity_rr` /
+`mccrary_test` / `oster_bounds` / `wild_cluster_bootstrap` / `rd_honest`) are upgraded
+from auto-registered to hand-written specs. `sp.preflight(...
+"ivreg", formula=...)` gains a weak-instrument preflight gate that emits a
+structured warning when the first-stage F falls below the Staiger–Stock (1997) /
+Stock–Yogo (2005) thresholds; `sp.recommend(... design='iv')` adaptively ranks
+LIML / AR ahead of 2SLS on weak first stages. Also new: a 36-module R parity harness
+(`tests/r_parity/`), a 21-module Stata parity harness
+(`tests/stata_parity/`), original-paper replays on 4 datasets (Card / `mpdta` /
+Basque / LaLonde NSW + PSID-1, all bit-equal to the published numbers), a Track-C
+performance harness (log-log scaling for HDFE / CS-DiD / SCM / DML),
+empirical 95% CI coverage at B=1000 under `tests/coverage_monte_carlo/`
 (OLS 0.952 / 2×2 DiD 0.955 / strong IV 0.962, all inside the 99% Wilson band
 [0.935, 0.967]), and a 900-trial CausalAgentBench prompt suite
 (mock mode is ready; `--api` switches it on in one step). Three new top-level meta APIs:
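@@ -64,7 +77,7 @@ StatsPAI's external-validity evidence without leaving the REPL. On cold start, `statspai
 statspai` takes the sklearn submodule count from 245 → **0**. `sp.callaway_santanna(method='reg')`
 fixes a latent influence-function scaling bug (the IPW / DR paths are unaffected): **please re-run
 any v1.10–v1.13 CS-DiD analyses that used `method='reg'`**. Full release notes are in
-[`CHANGELOG.md`](CHANGELOG.md) `[1.14.0]`.
+[`CHANGELOG.md`](CHANGELOG.md) `[1.13.1]`.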
 
 **📦 v1.12.2 (2026-05-01): ML routing for `sp.causal_question` + shared robustness battery + weighted PLIV/IIVM**
 
@@ -515,7 +528,7 @@ sp.__citation__                 # equivalent to sp.citation("bibtex")
   author       = {Wang, Biaoyue},
   title        = {StatsPAI: The Agent-Native Causal Inference \& Econometrics Toolkit for Python},
   year         = {2026},
-  version      = {1.14.0},
+  version      = {1.13.1},
   doi          = {10.5281/zenodo.19933900},
   url          = {https://doi.org/10.5281/zenodo.19933900},
   license      = {MIT},
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "StatsPAI"
-version = "1.14.0"
+version = "1.13.1"
 description = "The Agent-Native Causal Inference & Econometrics Toolkit for Python"
 readme = "README.md"
 license = {text = "MIT"}
diff --git a/src/statspai/__init__.py b/src/statspai/__init__.py
@@ -22,7 +22,7 @@
 >>> sp.outreg2(result, filename="results.xlsx")
 """
 
-__version__ = "1.14.0"
+__version__ = "1.13.1"
 __author__ = "Biaoyue Wang"
 __email__ = "brycew6m@stanford.edu"
 
diff --git a/tests/coverage_monte_carlo/FINDINGS.md b/tests/coverage_monte_carlo/FINDINGS.md
@@ -66,9 +66,9 @@ Findings interpretation:
   Stock–Yogo critical values, HC1 ignores the weak-instrument bias of
   2SLS, so CIs miss truth more often than nominal 0.95.  Recovery
   routes for users are LIML (`method='liml'`) or Anderson–Rubin
-  inference (`sp.iv(.., inference='ar')`); both are on the v1.14
-  roadmap as automatic fall-backs in the design-detect / preflight
-  pipeline.
+  inference (`sp.iv(.., inference='ar')`); both are wired into the
+  v1.13 design-detect / preflight pipeline as automatic fall-backs
+  via the new `first_stage_strength` gate.
 - **CS-DiD passes the heterogeneity stress test.** This is the
   designed behaviour of Callaway–Sant'Anna 2021: cell-by-cell ATT(g, t)
   estimation with the simple-ATT aggregation as an equally-weighted
diff --git a/tests/r_parity/compare.py b/tests/r_parity/compare.py
@@ -529,7 +529,7 @@ def render_tex(modules: list[str]) -> str:
         "% AUTO-GENERATED by tests/r_parity/compare.py\n"
         "% Re-run after any module change to refresh.\n"
         "\\begin{longtable}{p{0.10\\linewidth}p{0.27\\linewidth}p{0.40\\linewidth}p{0.16\\linewidth}}\n"
-        "\\caption{Track A parity headline at \\statspai{} 1.13.0 vs the "
+        "\\caption{Track A parity headline at \\statspai{} 1.13.1 vs the "
         "canonical \\proglang{R} reference on the calibrated replicas. The "
         "``Worst diff'' column reports the worst residual gap across the "
         "module's headline rows (point estimates only; per-row SE diffs "
@@ -610,7 +610,7 @@ def render_tex_3way(modules: list[str]) -> str:
         "\\small\n"
         "\\setlength{\\tabcolsep}{2pt}\n"
         "\\begin{longtable}{@{}p{0.055\\linewidth}p{0.205\\linewidth}p{0.30\\linewidth}p{0.30\\linewidth}p{0.10\\linewidth}@{}}\n"
-        "\\caption{Track A parity headline at \\statspai{} 1.13.0 against the canonical "
+        "\\caption{Track A parity headline at \\statspai{} 1.13.1 against the canonical "
         "\\proglang{R} reference \\emph{and} (where one exists) the canonical \\proglang{Stata} "
         "reference, on the calibrated replicas. The ID column is the two-digit module prefix; "
         "the two diff columns report the worst residual "
diff --git a/tests/stata_parity/README.md b/tests/stata_parity/README.md
@@ -72,7 +72,7 @@ the reason and the 3-way table prints it explicitly:
 The remaining modules (23-36, minus 25/28/30) currently have the
 status "Stata harness not yet built": a Stata sibling is feasible
 (many of them — `stcox`, `melogit`, `var`, `lpirf`, `xtreg`,
-`sfpanel`, etc. — are reachable) but is outside the v1.13.0 scope.
+`sfpanel`, etc. — are reachable) but is outside the v1.13.1 scope.
 
 ## Running
 
diff --git a/tests/test_cs_report_smoke.py b/tests/test_cs_report_smoke.py
@@ -141,8 +141,8 @@ def test_breakdown_M_all_strictly_positive(demo_report):
     assert (demo_report.breakdown["breakdown_M_star"] > 0).all()
     # Most event times should remain robust at one SE on this DGP.
     # We allow at most one boundary event-time to fall short because
-    # the v1.14 simple-ATT influence-function scaling fix
-    # (CHANGELOG ## [1.14.0]) made the SEs larger and therefore
+    # the v1.13 simple-ATT influence-function scaling fix
+    # (CHANGELOG ## [1.13.1]) made the SEs larger and therefore
     # makes the m_star >= se criterion stricter.  Pre-fix this
     # assertion was `.all()`; post-fix the right contract is
     # "essentially all".
diff --git a/tests/test_did_numerical_fixtures.py b/tests/test_did_numerical_fixtures.py
@@ -10,9 +10,9 @@
 The fixtures are checked to 4 decimal places so bit-level floating-point
 differences across BLAS backends do not cause spurious failures.
 
-History note (CHANGELOG ``[1.14.0]``): the SEs in ``PINNED_ATT_GT``,
+History note (CHANGELOG ``[1.13.1]``): the SEs in ``PINNED_ATT_GT``,
 ``PINNED_EVENT_STUDY``, and the overall-ATT SE in
-``test_overall_att_matches_pinned`` were re-pinned in v1.14 to absorb
+``test_overall_att_matches_pinned`` were re-pinned in v1.13 to absorb
 the simple-ATT influence-function scaling fix
 (``Fix CS-DiD parity inference``).  Each group-time IF is now
 multiplied by ``n_total / n_relevant`` when embedded in the full unit
@@ -59,7 +59,7 @@ def cs_fixture():
 
 # (group, time) -> (att, se).  Generated from the current implementation.
 PINNED_ATT_GT = {
-    # Re-pinned 2026-05-05 after the v1.14 simple-ATT IF-scaling fix.
+    # Re-pinned 2026-05-05 after the v1.13 simple-ATT IF-scaling fix.
     # ATT point estimates unchanged; SEs grew by the
     # n_total/n_relevant correction.
     (3, 1): (-0.583666, 0.552161),
@@ -102,7 +102,7 @@ def test_att_gt_matches_pinned_values(cs_fixture):
 
 def test_overall_att_matches_pinned(cs_fixture):
     assert cs_fixture.estimate == pytest.approx(1.282166, abs=1e-4)
-    # SE re-pinned to 0.289142 (was 0.101724 pre-v1.14) following the
+    # SE re-pinned to 0.289142 (was 0.101724 pre-v1.13) following the
     # simple-ATT IF-scaling fix; see module docstring.
     assert cs_fixture.se == pytest.approx(0.289142, abs=1e-4)
 
@@ -112,7 +112,7 @@ def test_overall_att_matches_pinned(cs_fixture):
 # --------------------------------------------------------------------------- #
 
 PINNED_EVENT_STUDY = {
-    # Re-pinned 2026-05-05 after the v1.14 simple-ATT IF-scaling fix.
+    # Re-pinned 2026-05-05 after the v1.13 simple-ATT IF-scaling fix.
     -6: (0.082153,  0.602161),
     -5: (0.284830,  0.536783),
     -4: (0.135512,  0.419463),