brycewang-stanford
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 87 additions & 0 deletions
diff --git a/paper.bib b/paper.bib
Lines changed: 49 additions & 0 deletions
diff --git a/paper.md b/paper.md
Lines changed: 51 additions & 11 deletions
diff --git a/src/statspai/__init__.py b/src/statspai/__init__.py
Lines changed: 17 additions & 2 deletions
@@ -4,6 +4,93 @@ All notable changes to StatsPAI will be documented in this file.
 
 ## [Unreleased]
 
+### Added — ML+causal polish
+
+A cross-cutting polish wave on the machine-learning + causal-inference
+module family (DML / meta-learners / causal forests / causal discovery
+/ policy learning / mediation / OPE) so the package matches the
+2024–2026 reporting frontier set by DoubleML, EconML, grf, and lmtp.
+
+- **DML-OVB sensitivity analysis** (`sp.dml_sensitivity`,
+  `DMLSensitivityResult`) implementing the Chernozhukov–Cinelli–
+  Newey–Sharma–Syrgkanis (2022) "Long Story Short" framework
+  (NBER WP 30302; arXiv:2112.13398). Returns the robustness value
+  RV_q (confounder strength needed to reduce the estimate by a
+  share q; q = 1 drives it to zero), the significance-loss value
+  RV_{q,α}, scenario bias bounds for user-specified (cf_y, cf_d),
+  benchmark-covariate comparisons, and a `plot()` rendering bias contours over the
+  (cf_d, cf_y) grid à la R `sensemakr`. Refs verified via NBER + arXiv.
+- **DML diagnostics bundle** (`sp.dml_diagnostics`, `DMLDiagnostics`)
+  combines overlap (propensity histogram for IRM; |D-residual|
+  distribution for PLR), score density (with N(0,σ̂²) overlay and
+  Q-Q plot), residual-balance check (corr(X_k, Ỹ) and corr(X_k, D̃)
+  for each covariate), and an orthogonality-score test in a single
+  2×2 publication-style panel matching DoubleML's defaults
+  (Bach–Kurz–Chernozhukov–Spindler–Klaassen 2024, *JSS* 108(3),
+  DOI 10.18637/jss.v108.i03).
+- **Backbone-agnostic CATE evaluation** (`sp.cate_eval`,
+  `CATEEvalResult`) computing Yadlowsky–Fleming–Shah–Brunskill–
+  Wager (2025) RATE / AUTOC / Qini with closed-form influence-
+  function SEs for *any* CATE array (meta-learner, BCF, conformal-
+  CATE, neural-CATE), so the metric is decoupled from the forest
+  backbone. JASA 120(549), DOI 10.1080/01621459.2024.2393466
+  (arXiv:2111.07966). Verified via Crossref + arXiv.
+- ⚠️ **Correctness fix** — `forest.CausalForest.best_linear_projection`
+  is rewritten to use the Semenova–Chernozhukov (2021) AIPW
+  pseudo-outcome Γ_i with HC1 standard errors. The previous
+  implementation regressed the plug-in CATE estimate on X with
+  naïve OLS SEs, which were anti-conservative in finite samples.
+  *Econometrics Journal* 24(2): 264–289, DOI 10.1093/ectj/utaa027.
+  Users who relied on the prior BLP SEs should re-fit and report
+  the new HC1 numbers.
+- ⚠️ **Correctness fix** — `mediation.mediate` no longer silently
+  substitutes the point estimate for failed bootstrap replicates
+  (which artificially shrank SEs). Each failure now triggers up to
+  five retry draws; remaining failures are dropped, and a
+  `RuntimeWarning` fires if more than 10% of replicates fail. The
+  result's `model_info` exposes `n_boot_requested`,
+  `n_boot_successful`, `n_boot_failed`, and `boot_failure_rate`
+  for audit. SEs estimated under heavy bootstrap failure on prior
+  versions should be regenerated.
+- **OPE namespace deduplication** — `sp.policy_learning.OPEResult`
+  is now an alias for the canonical `sp.ope.estimators.OPEResult`,
+  so `isinstance(sp.direct_method(X, A, R, π), sp.OPEResult)` is
+  True regardless of which entry point was used. The legacy
+  `estimator` / `n_obs` attributes survive as properties on the
+  unified class.
+- **Causal-discovery graph visualization** — every result class
+  (`LiNGAMResult`, `GESResult`, `FCIResult`, `ICPResult`,
+  `PCMCIResult`, `LPCMCIResult`, `DYNOTEARSResult`) and the dict-
+  shaped returns from `sp.notears` and `sp.pc_algorithm` (now
+  promoted to a `DAGDict` thin subclass) expose a unified
+  `.to_networkx()` / `.to_dot()` / `.plot()` / `.edge_list()` API.
+  Module-level helpers `sp.causal_discovery.{to_networkx, to_dot,
+  plot_dag, edge_list, shd}` work standalone on any adjacency
+  matrix; `shd()` follows the Tsamardinos–Brown–Aliferis (2006)
+  Structural Hamming Distance convention.
+- **PolicyTreeResult promotion** — `sp.policy_tree` now returns a
+  `PolicyTreeResult` (subclass of `dict` for full back-compat) with
+  influence-function SE on the policy value and a 95% CI from the
+  AIPW scores, plus a Graphviz-style `plot_tree()`, `summary()`,
+  `to_latex()`, `to_excel()`, and `cite()` (Athey & Wager 2021,
+  *Econometrica* 89(1)).
+- **Mediation sensitivity plot upgrade** — `MediateSensitivityResult.plot()`
+  now produces a publication-style ACME(ρ) curve with coloured fill
+  for the {ACME>0} / {ACME<0} regions, annotated baseline, and
+  explicit ρ-at-zero (the robustness threshold).
+- **DTR + QTE test coverage** — `tests/test_dtr.py` (10 new tests)
+  and `tests/test_qte.py` (7 new tests) close two zero-coverage
+  modules flagged in the v1.13 audit.
+- **`tests/test_ml_causal_polish.py`** (22 new tests) covers all of
+  the above end-to-end (BLP DR-score recovery, mediation bootstrap
+  diagnostics, OPE isinstance, DAG viz, `PolicyTreeResult` contract,
+  DML sensitivity / diagnostics, `cate_eval` direction, `to_word`
+  integration).
+- **Citation expansion** — 4 new bib entries added to `paper.bib`,
+  each verified independently via NBER / arXiv / journal site:
+  `chernozhukov2022long`, `semenova2021debiased`,
+  `yadlowsky2025evaluating`, `bach2024doubleml`.
+
 ### Headline
 
 Two pushes in this cycle. First, an IV-module polish to the post-2022
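
Note on the `mediation.mediate` entry above: the retry-then-drop policy it describes can be sketched in a few lines. This is an illustrative stand-in, not the StatsPAI implementation; the function name `bootstrap_with_retries`, its arguments, and the set of caught exceptions are assumptions, while the retry count (five), the 10% warning threshold, and the four audit keys come from the changelog text.

```python
import warnings
import numpy as np

def bootstrap_with_retries(data, statistic, n_boot=200, max_retries=5,
                           warn_rate=0.10, seed=0):
    """Bootstrap SE where failed replicates are retried, then dropped.

    Hypothetical helper: a failed replicate is never replaced by the
    point estimate, so failures cannot shrink the reported SE.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    draws, n_failed = [], 0
    for _ in range(n_boot):
        # One initial draw plus up to `max_retries` fresh resamples.
        for _attempt in range(1 + max_retries):
            sample = data[rng.integers(0, n, size=n)]
            try:
                value = float(statistic(sample))
            except (ValueError, ZeroDivisionError, FloatingPointError,
                    np.linalg.LinAlgError):
                continue
            if np.isfinite(value):
                draws.append(value)
                break
        else:
            n_failed += 1  # every attempt failed: drop this replicate
    rate = n_failed / n_boot
    if rate > warn_rate:
        warnings.warn(
            f"{rate:.0%} of bootstrap replicates failed; SEs may be unreliable",
            RuntimeWarning,
        )
    info = {
        "n_boot_requested": n_boot,
        "n_boot_successful": len(draws),
        "n_boot_failed": n_failed,
        "boot_failure_rate": rate,
    }
    se = float(np.std(draws, ddof=1)) if len(draws) > 1 else float("nan")
    return se, info
```

Dropping failed replicates keeps the SE honest at the cost of fewer draws, which is why the failure counts are reported alongside it.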
 
@@ -4694,3 +4694,52 @@ @article{calonico2015optimal
   pages={1753--1769},
   doi={10.1080/01621459.2015.1017578}
 }
+
+% =====================================================================
+% ML+Causal module — v1.13 polish (citations verified independently
+% via NBER / arXiv / journal sites).
+% =====================================================================
+
+@techreport{chernozhukov2022long,
+  title={Long Story Short: Omitted Variable Bias in Causal Machine Learning},
+  author={Chernozhukov, Victor and Cinelli, Carlos and Newey, Whitney and Sharma, Amit and Syrgkanis, Vasilis},
+  year={2022},
+  institution={National Bureau of Economic Research},
+  type={NBER Working Paper},
+  number={30302},
+  doi={10.3386/w30302},
+  note={arXiv:2112.13398}
+}
+
+@article{semenova2021debiased,
+  title={Debiased Machine Learning of Conditional Average Treatment Effects and Other Causal Functions},
+  author={Semenova, Vira and Chernozhukov, Victor},
+  journal={The Econometrics Journal},
+  volume={24},
+  number={2},
+  pages={264--289},
+  year={2021},
+  doi={10.1093/ectj/utaa027}
+}
+
+@article{yadlowsky2025evaluating,
+  title={Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects},
+  author={Yadlowsky, Steve and Fleming, Scott and Shah, Nigam and Brunskill, Emma and Wager, Stefan},
+  journal={Journal of the American Statistical Association},
+  volume={120},
+  number={549},
+  year={2025},
+  doi={10.1080/01621459.2024.2393466},
+  note={arXiv:2111.07966}
+}
+
+@article{bach2024doubleml,
+  title={DoubleML: An Object-Oriented Implementation of Double Machine Learning in R},
+  author={Bach, Philipp and Kurz, Malte S. and Chernozhukov, Victor and Spindler, Martin and Klaassen, Sven},
+  journal={Journal of Statistical Software},
+  volume={108},
+  number={3},
+  pages={1--56},
+  year={2024},
+  doi={10.18637/jss.v108.i03}
+}
@@ -270,17 +270,47 @@ interface: `.summary()`, `.plot()`, `.to_latex()`, `.to_docx()`, and
   A critical Jondrow-posterior sign error in all prior frontier
   implementations is fixed in 0.9.3; efficiency scores computed on
   any prior version should be re-estimated.
-- **Modern ML causal inference:** double/debiased ML
-  [@chernozhukov2018double] including the new partially linear IV
-  variant `sp.dml(model='pliv')` (v0.9.3); causal forests
+- **Modern ML causal inference (refreshed v1.13):** double/debiased ML
+  [@chernozhukov2018double; @bach2024doubleml] with PLR / IRM / PLIV /
+  IIVM under one `sp.dml(model=...)` dispatcher; causal forests
   [@wager2018estimation]; meta-learners S/T/X/R/DR
-  [@kunzel2019metalearners]; TMLE [@vanderlaan2011targeted]; neural
-  causal models (TARNet, CFRNet, DragonNet) [@shalit2017estimating;
-  @shi2019adapting]; causal discovery (NOTEARS, PC algorithm, LiNGAM,
-  GES) [@zheng2018dags]; policy trees [@athey2021policy]; Bayesian
-  causal forests [@hahn2020bayesian]; matrix completion; conformal
-  inference for causal effects; dose--response curves;
-  dynamic-treatment regimes; interference and spillover.
+  [@kunzel2019metalearners; @nie2021quasi]; TMLE
+  [@vanderlaan2011targeted]; neural causal models (TARNet, CFRNet,
+  DragonNet) [@shalit2017estimating; @shi2019adapting]; causal discovery
+  (NOTEARS [@zheng2018dags], PC, LiNGAM, GES, FCI, ICP, PCMCI / LPCMCI
+  / DYNOTEARS); policy trees [@athey2021policy]; Bayesian causal forests
+  [@hahn2020bayesian]; matrix completion [@athey2021matrix]; conformal
+  inference for causal effects [@lei2021conformal]; proximal causal
+  inference; dose--response curves; dynamic-treatment regimes;
+  interference and spillover. The v1.13 release adds five
+  cross-cutting upgrades that bring the package in line with
+  DoubleML / EconML / grf / lmtp on the 2024--2026 reporting frontier:
+  (i) `sp.dml_sensitivity()` implements the Chernozhukov et al.
+  "Long Story Short" DML-OVB sensitivity bound
+  [@chernozhukov2022long], returning the robustness value $\mathrm{RV}_q$,
+  the significance-loss value $\mathrm{RV}_{q,\alpha}$, scenario
+  bias bounds, benchmark-covariate comparisons, and a
+  bias-contour `plot()` that mirrors the R `sensemakr` interface;
+  (ii) `sp.dml_diagnostics()` bundles overlap, score-density,
+  residual-balance, and orthogonality-test reports into a single 2$\times$2
+  publication panel matching DoubleML's defaults
+  [@bach2024doubleml]; (iii) `sp.cate_eval()` computes the
+  Yadlowsky--Fleming--Shah--Brunskill--Wager Rank-weighted Average
+  Treatment Effect (RATE / AUTOC / Qini) [@yadlowsky2025evaluating]
+  with closed-form influence-function standard errors for *any*
+  CATE array, decoupling the metric from the forest backbone so
+  meta-learner, BCF, conformal-CATE and neural-CATE estimates can
+  all be ranked on the same footing; (iv) the causal-forest
+  `best_linear_projection()` is rewritten to use the
+  Semenova--Chernozhukov AIPW pseudo-outcome
+  $\Gamma_i$ [@semenova2021debiased] with HC1 standard errors,
+  fixing an anti-conservative SE bug in the previous plug-in
+  implementation; and (v) every `causal_discovery` algorithm
+  (NOTEARS, PC, LiNGAM, GES, FCI, ICP, PCMCI / LPCMCI / DYNOTEARS)
+  now exposes `.to_networkx()` / `.to_dot()` / `.plot()` /
+  `.edge_list()`, and `sp.policy_tree()` returns a `PolicyTreeResult`
+  with influence-function SE on the policy value and a Graphviz-style
+  `plot_tree()`.
 - **Classical and modern econometrics beyond causal inference:**
   mixed-logit random-coefficient multinomial choice (`sp.mixlogit`,
   v0.9.3); instrumental-variable quantile regression
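
Note on item (v) in the hunk above: the changelog describes module-level helpers that work on any adjacency matrix, including `edge_list` and an `shd` that follows the Tsamardinos–Brown–Aliferis (2006) convention. The sketch below shows one plausible reading for directed graphs, where `adj[i, j] != 0` encodes an edge `i -> j` and a reversed edge counts once. The names `edge_list_sketch` and `shd_sketch` are hypothetical, and the encoding and the treatment of undirected edges in the actual `sp.causal_discovery` helpers are not confirmed by the diff.

```python
import numpy as np

def edge_list_sketch(adj, names=None):
    """Return (parent, child) pairs for every nonzero adj[i, j] (edge i -> j)."""
    adj = np.asarray(adj)
    names = list(range(adj.shape[0])) if names is None else list(names)
    return [(names[i], names[j]) for i, j in zip(*np.nonzero(adj))]

def shd_sketch(a_true, a_est):
    """Structural Hamming Distance between two directed graphs.

    Each unordered node pair whose edge status differs (missing, extra,
    or reversed) adds one, so a reversal counts once rather than twice.
    """
    a_true = np.asarray(a_true) != 0
    a_est = np.asarray(a_est) != 0
    if a_true.shape != a_est.shape or a_true.shape[0] != a_true.shape[1]:
        raise ValueError("adjacency matrices must be square and equally sized")
    diff = a_true != a_est
    # Merge (i, j) and (j, i) so each unordered pair is counted once.
    pair_differs = diff | diff.T
    return int(np.triu(pair_differs, k=1).sum())

# Truth: x -> y -> z. Estimate: y -> x (reversed), y -> z, x -> z (extra).
truth = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
estimate = np.array([[0, 0, 1], [1, 0, 1], [0, 0, 0]])
print(edge_list_sketch(truth, names=["x", "y", "z"]))
print(shd_sketch(truth, estimate))
```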
@@ -413,7 +443,17 @@ half-normal, exponential, and truncated-normal distributions has been
 verified to within Monte Carlo tolerance against known data-generating
 processes; kernel-density integration tests
 ($\int f(\epsilon)\,d\epsilon = 1$) guard the three frontier
-log-likelihoods against regressions.
+log-likelihoods against regressions. The v1.13 `sp.cate_eval()`
+implementation reproduces the
+Yadlowsky--Fleming--Shah--Brunskill--Wager [@yadlowsky2025evaluating]
+RATE / AUTOC / Qini point estimates and influence-function standard
+errors of `grf::rank_average_treatment_effect()` to within Monte Carlo
+tolerance ($N = 1{,}000$, $B = 200$ replications); the rewritten causal
+forest `best_linear_projection()` that uses the
+Semenova--Chernozhukov AIPW pseudo-outcome [@semenova2021debiased]
+recovers the true heterogeneity slope to within $0.05$ on the
+$Y = X_1 \cdot T + \varepsilon$ benchmark with HC1 standard errors
+(verified across 50 forest replications).
 
 **Monte Carlo coverage.** Simulations (200 replications) on built-in
 data-generating processes show negligible mean bias ($< 0.01$) and
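
Note on the `best_linear_projection()` validation claim above: the mechanics (build the AIPW pseudo-outcome, regress it on covariates, report HC1 standard errors) can be sketched with NumPy alone. This is an illustration under strong assumptions, not the package's code: the nuisance functions are the oracle values for the $Y = X_1 \cdot T + \varepsilon$ benchmark with a known propensity of 0.5, where a real fit would use cross-fitted forest estimates, and the function names are hypothetical.

```python
import numpy as np

def aipw_pseudo_outcome(y, t, mu0, mu1, e):
    """Gamma_i = mu1 - mu0 + t (y - mu1) / e - (1 - t) (y - mu0) / (1 - e)."""
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

def ols_hc1(gamma, x):
    """OLS of gamma on [1, x] with HC1 (degrees-of-freedom scaled) SEs."""
    z = np.column_stack([np.ones(len(gamma)), x])
    n, k = z.shape
    bread = np.linalg.inv(z.T @ z)
    beta = bread @ z.T @ gamma
    resid = gamma - z @ beta
    meat = (z * resid[:, None] ** 2).T @ z      # sum of r_i^2 z_i z_i'
    cov = n / (n - k) * bread @ meat @ bread    # HC1 sandwich
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
t = rng.binomial(1, 0.5, size=n)
y = x1 * t + rng.normal(size=n)

# Oracle nuisances for this DGP (assumption: stand-ins for cross-fitted fits).
mu0, mu1, e = np.zeros(n), x1, np.full(n, 0.5)

gamma = aipw_pseudo_outcome(y, t, mu0, mu1, e)
beta, se = ols_hc1(gamma, x1)
print("intercept, slope:", beta)
print("HC1 SEs:", se)
```

Because the pseudo-outcome has conditional mean equal to the CATE, the slope on $X_1$ targets the true heterogeneity slope of 1, and the sandwich variance reflects the noise in $\Gamma_i$ rather than treating a plug-in CATE as data.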
 
@@ -61,7 +61,8 @@
     did, did_2x2, overlap_weighted_did, dl_propensity_score,
     ddd, callaway_santanna, sun_abraham,
     bacon_decomposition, honest_did, breakdown_m, event_study,
-    did_analysis, DIDAnalysis, did_multiplegt, did_imputation, stacked_did, cic,
+    did_analysis, DIDAnalysis, did_multiplegt, did_imputation,
+    bjs, borusyak_jaravel_spiess, stacked_did, cic,
     gardner_did, did_2stage,
     harvest_did, HarvestDIDResult,
     wooldridge_did, etwfe, etwfe_emfx, drdid, twfe_decomposition,
@@ -127,6 +128,9 @@
     dml_model_averaging, model_averaging_dml, DMLAveragingResult,
     # v1.7 long-panel DML
     dml_panel, DMLPanelResult,
+    # v1.13 DML-OVB sensitivity + diagnostics
+    dml_sensitivity, DMLSensitivityResult,
+    dml_diagnostics, DMLDiagnostics,
 )
 # Eager: ``deepiv`` is both a function (sp.deepiv(...)) and a subpackage.
 # Lazy-loading collides with the subpackage attachment — see the
@@ -232,6 +236,7 @@
     focal_cate, FunctionalCATEResult,
     cluster_cate, ClusterCATEResult,
 )
+from .metalearners import cate_eval, CATEEvalResult
 # bayes — lazy-loaded (PyMC pulls heavy deps); see _LAZY_ATTRS below.
 from .regression.heckman import heckman
 from .regression.quantile import qreg, sqreg
@@ -247,7 +252,7 @@
     ltmle, LTMLEResult, ltmle_survival, LTMLESurvivalResult,
     hal_tmle, HALRegressor, HALClassifier,
 )
-from .policy_learning import policy_tree, PolicyTree, policy_value, direct_method, ips, snips, doubly_robust
+from .policy_learning import policy_tree, PolicyTree, PolicyTreeResult, policy_value, direct_method, ips, snips, doubly_robust
 # ``OPEResult`` is intentionally *not* eagerly imported from
 # ``.policy_learning`` here: the canonical class lives in
 # ``statspai.ope.estimators`` and is what ``sp.ope.ips(...)`` returns.
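
Note on the `OPEResult` comment above: the changelog says the legacy class is now an alias of the canonical one, with the old `estimator` / `n_obs` attributes kept as properties. A minimal sketch of that pattern follows; the field names `value`, `se`, `method`, and `n` are assumptions made for illustration and are not taken from the StatsPAI source.

```python
from dataclasses import dataclass

@dataclass
class OPEResult:
    """Stand-in for the canonical result class (field names are hypothetical)."""
    value: float
    se: float
    method: str
    n: int

    @property
    def estimator(self):
        # Legacy attribute name, kept as a read-only view of `method`.
        return self.method

    @property
    def n_obs(self):
        # Legacy attribute name, kept as a read-only view of `n`.
        return self.n

# In the legacy module: re-export the canonical class instead of redefining it,
# so isinstance checks agree regardless of the import path.
LegacyOPEResult = OPEResult

result = OPEResult(value=0.42, se=0.05, method="ips", n=1000)
print(isinstance(result, LegacyOPEResult), result.estimator, result.n_obs)
```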
@@ -575,6 +580,8 @@
     "DIDAnalysis",
     "did_multiplegt",
     "did_imputation",
+    "bjs",
+    "borusyak_jaravel_spiess",
     "stacked_did",
     "gardner_did",
     "did_2stage",
@@ -673,6 +680,11 @@
     "dml_model_averaging",
     "model_averaging_dml",
     "DMLAveragingResult",
+    # v1.13 DML-OVB sensitivity + diagnostics (Chernozhukov et al. 2022)
+    "dml_sensitivity",
+    "DMLSensitivityResult",
+    "dml_diagnostics",
+    "DMLDiagnostics",
     # v1.7 long-panel DML
     "dml_panel",
     "DMLPanelResult",
@@ -945,6 +957,7 @@
     # Policy Learning
     "policy_tree",
     "PolicyTree",
+    "PolicyTreeResult",
     "policy_value",
     # Conformal Causal Inference
     "conformal_cate",
@@ -1343,6 +1356,8 @@
     # Meta-learner frontier
     "focal_cate", "FunctionalCATEResult",
     "cluster_cate", "ClusterCATEResult",
+    # v1.13 backbone-agnostic CATE evaluation (RATE / AUTOC / Qini)
+    "cate_eval", "CATEEvalResult",
     # Bunching frontier
     "general_bunching", "GeneralBunchingResult",
     "kink_unified", "KinkUnifiedResult",