Commit 5d77285

feat(rlasso): faithful hdm::rlassologit port (logistic rigorous Lasso)
The binary-outcome analogue of rlasso. hdm::rlassologit delegates its penalized fit to glmnet's binomial lasso at a single data-driven lambda; StatsPAI reproduces glmnet directly rather than substituting sklearn:

- _glmnet_logit_lasso: IRLS outer loop + weighted coordinate-descent inner loop, 1/n-scaled deviance objective, population-variance standardization, glmnet's pmin probability clamp. Matches R glmnet 4.1: selected support EXACT, coefficients ~1e-6 (glmnet's own tolerance).
- rlassologit + RLassoLogitFit: post=True refits an unpenalized logistic glm on the selected set (IRLS Newton) -> coefficients/intercept/residuals match hdm to ~1e-9; post=False keeps the glmnet fit (~1e-6). predict(type='response'/'link'); §3 result contract (cite/to_latex).
- RlassologitClassifier: a GENUINE (calibrated) logistic propensity for Double-ML, wired as ml_m='rlassologit'; sp.dml(model='irm', ml_m='rlassologit') uses it instead of the linear-probability RlassoClassifier.

Subtlety fixed vs a naive port: hdm's default penalty list carries c=1.1 explicitly, so the post=FALSE -> c=0.5 switch is dead code on a default call; penalty=None uses c=1.1 regardless of post (unlike rlasso).

The high-dim logistic *effect* (rlassologitEffect) is intentionally deferred (a separate multi-day parity exercise) and documented as such.

Coverage: test_rlassologit_parity.py (5 hdm/glmnet pins, via _generate_rlassologit.R) + 5 behavioural tests in test_rlasso.py.

Citations: only chernozhukov2016hdm (verified in paper.bib); glmnet/BCW described in prose, no unverified bib keys (§10).

Lazy-import contract intact (import statspai pulls 0 sklearn submodules). registry 1125->1127.

(CHANGELOG also carries the earlier dispatcher note + a complementary bch-deprecation doc paragraph that were sitting in the working tree.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 5f7e4eb commit 5d77285
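
Reviewer note: the engine described in the commit message (IRLS outer loop, weighted coordinate-descent inner loop, `1/n`-scaled objective, population-variance standardization, probability clamp) can be sketched as follows. This is a minimal illustration, not the `_glmnet_logit_lasso` code from this commit: the function name `glmnet_style_logit_lasso`, its parameters, the stopping rules, and the simple symmetric clip at `pmin` are all assumptions, and it is not expected to reproduce glmnet to the stated tolerances.

```python
import numpy as np


def soft_threshold(a, lam):
    """Soft-thresholding operator S(a, lam) used by the lasso coordinate update."""
    return np.sign(a) * max(abs(a) - lam, 0.0)


def glmnet_style_logit_lasso(X, y, lam, n_outer=50, n_inner=200, tol=1e-10, pmin=1e-9):
    """L1-penalized logistic fit at a single lambda (hypothetical sketch).

    Minimizes -(1/n) * loglik + lam * ||beta||_1 on standardized columns and
    returns (intercept, coef) on the original scale. Assumes no constant column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    mu = X.mean(axis=0)
    sd = X.std(axis=0)            # ddof=0: population standard deviation
    Xs = (X - mu) / sd
    b0 = np.log(y.mean() / (1.0 - y.mean()))   # null-model intercept
    beta = np.zeros(p)

    for _ in range(n_outer):      # IRLS outer loop
        eta = b0 + Xs @ beta
        prob = np.clip(1.0 / (1.0 + np.exp(-eta)), pmin, 1.0 - pmin)
        w = prob * (1.0 - prob)   # IRLS weights
        z = eta + (y - prob) / w  # working response
        beta_old, b0_old = beta.copy(), b0

        for _ in range(n_inner):  # weighted coordinate-descent inner loop
            r = z - b0 - Xs @ beta
            d = np.sum(w * r) / np.sum(w)      # unpenalized intercept update
            b0 += d
            r -= d
            max_delta = abs(d)
            for j in range(p):
                xj = Xs[:, j]
                denom = np.mean(w * xj * xj)
                rho = np.mean(w * xj * r) + denom * beta[j]
                new = soft_threshold(rho, lam) / denom
                delta = new - beta[j]
                if delta != 0.0:
                    r -= delta * xj
                    beta[j] = new
                    max_delta = max(max_delta, abs(delta))
            if max_delta < tol:
                break

        if max(np.max(np.abs(beta - beta_old)), abs(b0 - b0_old)) < tol:
            break

    coef = beta / sd              # undo the standardization
    intercept = b0 - mu @ coef
    return intercept, coef
```

Because the columns are standardized to unit population variance, any `lam >= 1` zeroes every coefficient and leaves only the null-model intercept, which gives a quick sanity check on the objective scaling.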

22 files changed

Lines changed: 1409 additions & 29 deletions

CHANGELOG.md

Lines changed: 23 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -28,11 +28,27 @@ All notable changes to StatsPAI will be documented in this file.
2828
(coef 0.2274, SE 0.2466) — **resolving the previously-tracked
2929
divergence** where the older `iv.bch_post_lasso_iv` was ~17× off
3030
(0.013). All four selection regimes (Z-only, X-only, both, none)
31-
match `hdm` to ~1e-6 on a well-conditioned design.
31+
match `hdm` to ~1e-6 on a well-conditioned design. Also routable
32+
through the IV family dispatcher: `sp.iv(method='rlasso', ...)`
33+
(instrument selection by default; double selection when `exog=`
34+
controls are passed).
3235
- `sp.RlassoRegressor` / `sp.RlassoClassifier` — scikit-learn-compatible
3336
adapters so the rigorous Lasso can serve as a Double-ML nuisance
3437
learner: `sp.dml(model='plr', ml_g='rlasso', ml_m='rlasso')` now
3538
works (clone-safe across cross-fitting folds).
39+
- `sp.rlassologit` — the **logistic** rigorous (post-)Lasso, a faithful
40+
port of `hdm::rlassologit`. hdm delegates the penalized fit to glmnet's
41+
binomial lasso at a single data-driven `λ`; StatsPAI reproduces glmnet
42+
directly (IRLS + weighted coordinate descent, `1/n` deviance,
43+
population-variance standardization, `pmin` clamp). The selected
44+
support matches glmnet **exactly**; engine coefficients match R glmnet
45+
4.1 to ~1e-6; `post=True` coefficients/residuals (unpenalized logistic
46+
refit on the selected set) match `hdm` to ~1e-9. `sp.RlassologitClassifier`
47+
is a *genuine* (calibrated) logistic propensity for Double-ML —
48+
`sp.dml(model='irm', ml_m='rlassologit')` — unlike the linear-probability
49+
`RlassoClassifier`. The high-dim logistic *effect* (`rlassologitEffect`)
50+
is intentionally deferred (a separate parity exercise). 5 hdm/glmnet
51+
parity pins in `test_rlassologit_parity.py`.
3652
Coverage: `tests/reference_parity/test_rlasso_parity.py` (17 pins vs
3753
`hdm`, generated by `_generate_rlasso.R` — including `rlasso_effects`
3854
multi-target and a tight `sp.dml(ml_g='rlasso')` pin against a manual
@@ -77,10 +93,12 @@ All notable changes to StatsPAI will be documented in this file.
7793
Chiang–Kato–Ma–Sasaki (*JBES* 40(3), 2022, doi
7894
`10.1080/07350015.2021.1895815`) references to `paper.bib`; cite `hdm`
7995
from the post-Lasso IV / RD-lasso modules that implement its methods.
80-
The DML guide now documents the relationship to `hdm` openly, including
81-
that `sp.lasso_iv` / `bch_post_lasso_iv` can select fewer instruments
82-
than `hdm::rlassoIV` on weak-instrument designs (a tracked roadmap item,
83-
not a silent divergence).
96+
The DML guide documents the relationship to `hdm` openly. (The original
97+
hand-rolled `bch_post_lasso_iv` under-selected instruments vs
98+
`hdm::rlassoIV` on weak-instrument designs; this is now resolved by the
99+
dedicated `sp.rlasso` port — `sp.rlasso_iv` reproduces `hdm::rlassoIV`
100+
exactly — and `bch_post_lasso_iv` is deprecated. See the `sp.rlasso`
101+
entry above.)
84102

85103
### Deprecated
86104

README.md

Lines changed: 1 addition & 1 deletion
@@ -138,7 +138,7 @@ StatsPAI's focus is **causal inference**. The grid below summarizes method-famil
138138

139139
**Legend**: B = broad API coverage within this comparison table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels, not validation tiers.
140140

141-
**StatsPAI at a glance**: 1,125 registered functions in the live agent registry · 87 submodules · 333k LOC (core) + 178k LOC (tests). All four numbers are reproducible from the canonical generator (`python scripts/registry_stats.py`); the per-module table in [`docs/stats.md`](docs/stats.md) is regenerated from the same script. For the API-breadth matrix (23 method families) and cross-ecosystem line-count comparison, see [`docs/stats.md`](docs/stats.md).
141+
**StatsPAI at a glance**: 1,127 registered functions in the live agent registry · 87 submodules · 333k LOC (core) + 178k LOC (tests). All four numbers are reproducible from the canonical generator (`python scripts/registry_stats.py`); the per-module table in [`docs/stats.md`](docs/stats.md) is regenerated from the same script. For the API-breadth matrix (23 method families) and cross-ecosystem line-count comparison, see [`docs/stats.md`](docs/stats.md).
142142

143143
**Validation tiers matter**: `stability="stable"` means the public API is SemVer-stable; it does not by itself mean R/Stata/paper parity. Use `sp.list_functions(validation_status="certified")` for cross-language or published-reference evidence, and inspect `sp.describe_function(name)["limitations"]` before production use. See [`docs/guides/stability.md`](docs/guides/stability.md).
144144

README_CN.md

Lines changed: 1 addition & 1 deletion
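@@ -46,7 +46,7 @@ StatsPAI focuses on **causal inference**. The table below describes API coverage at the method-family level
4646

4747
**Legend**: B = broad API coverage within the scope of this table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels only, not validation tiers.
4848

49-
**StatsPAI at a glance**: 1,125 registered functions in the live agent registry · 87 submodules · 333k lines of core code + 178k lines of tests. All four numbers can be recomputed on the spot by the single generator (`python scripts/registry_stats.py`); the per-module table in [`docs/stats.md`](docs/stats.md) is written back by the same script. For the API-breadth matrix of 23 method families and the cross-ecosystem line-count comparison, see [`docs/stats.md`](docs/stats.md). These coverage numbers describe API breadth and do not mean every function has R/Stata numerical validation; for production use, check `validation_status`
49+
**StatsPAI at a glance**: 1,127 registered functions in the live agent registry · 87 submodules · 333k lines of core code + 178k lines of tests. All four numbers can be recomputed on the spot by the single generator (`python scripts/registry_stats.py`); the per-module table in [`docs/stats.md`](docs/stats.md) is written back by the same script. For the API-breadth matrix of 23 method families and the cross-ecosystem line-count comparison, see [`docs/stats.md`](docs/stats.md). These coverage numbers describe API breadth and do not mean every function has R/Stata numerical validation; for production use, check `validation_status`
5050

5151
**📦 v1.19.0 (2026-06-20): cross-engine validation, data MCP normalization, social network analysis**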
5252

docs/guides/rigorous_lasso_hdm.md

Lines changed: 38 additions & 13 deletions
@@ -140,19 +140,44 @@ two are not numerically interchangeable.
140140

141141
Ported and parity-tested against `hdm`: `rlasso`, `rlassoEffect` /
142142
`rlassoEffects` (single and multi-target), `rlassoIV` (all four selection
143-
regimes), `tsls`, and the data-driven `lambdaCalculation` for the
144-
homoskedastic and heteroskedastic (X-independent) penalties.
145-
146-
**Not yet ported — `rlassologit`** (the *logistic* rigorous Lasso and its
147-
`rlassologitEffect(s)`). `hdm::rlassologit` delegates the penalized fit to
148-
`glmnet::glmnet(family="binomial")` at a single data-driven `λ`. Reproducing
149-
it faithfully means matching glmnet's logistic-lasso solution — its
150-
standardization, intercept handling and objective scaling differ from
151-
scikit-learn's L1 logistic regression at a fixed `λ` — which is a separate
152-
parity exercise. Rather than ship an unvalidated approximation (the very
153-
failure mode this module was built to avoid), it is intentionally left out
154-
until it can be pinned against `glmnet`. For a binary treatment under
155-
Double-ML, use `sp.dml(model='irm', ...)` with a genuine classifier.
143+
regimes), `tsls`, `rlassologit` (the logistic rigorous Lasso), and the
144+
data-driven `lambdaCalculation` for the homoskedastic and heteroskedastic
145+
(X-independent) penalties.
146+
147+
### `sp.rlassologit` — the logistic rigorous Lasso
148+
149+
`hdm::rlassologit` is the binary-outcome analogue of `rlasso`: its
150+
penalized fit is `glmnet(family="binomial", alpha=1, lambda=λ,
151+
standardize=TRUE)` at a single data-driven `λ`. StatsPAI reproduces
152+
glmnet's binomial lasso at that `λ` **directly** — an IRLS outer loop, a
153+
weighted coordinate-descent inner loop, the `1/n`-scaled deviance
154+
objective, population-variance standardization and glmnet's `pmin`
155+
probability clamp — rather than substituting scikit-learn's L1 logistic
156+
(whose objective/standardization differ at a fixed `λ`).
157+
158+
```python
159+
fit = sp.rlassologit(X, y, post=True) # y binary
160+
fit.predict(X, type="response") # probabilities (or "link" = log-odds)
161+
sp.RlassologitClassifier() # genuine logistic propensity for sp.dml
162+
```
163+
164+
Parity (vs `hdm` 0.3.2 / `glmnet` 4.1): the **selected support matches
165+
exactly**; the glmnet engine's coefficients match to ~1e-6 (glmnet's own
166+
convergence tolerance — no tighter ground truth exists); and `post=True`
167+
(the default) coefficients/intercept/residuals — coming from an
168+
*unpenalized* logistic refit on the selected set — match to ~1e-9.
169+
170+
`sp.RlassologitClassifier` is the principled binary nuisance learner for
171+
Double-ML: `sp.dml(model='irm', ml_m='rlassologit')` uses a *calibrated*
172+
logistic propensity (unlike the linear-probability `RlassoClassifier`).
173+
174+
**Not yet ported — `rlassologitEffect(s)`**, the high-dimensional
175+
*logistic treatment effect* (BCW double-selection with `√σ²`-weighting and
176+
a max-of-two sandwich variance). It layers several `rlassologit` /
177+
`rlasso` fits plus a glm with non-obvious internals; it is a separate
178+
multi-day parity exercise and is intentionally left out rather than
179+
shipped unvalidated. For a binary-treatment causal effect, use
180+
`sp.dml(model='irm', ml_m='rlassologit')`.
156181

157182
**X-dependent penalty simulation** (`penalty={"X.dependent.lambda": True}`)
158183
is implemented but matches `hdm` only *in distribution* — R's
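
Reviewer note: the `post=True` path described in this guide refits an *unpenalized* logistic regression on the selected columns. A minimal sketch of such a refit by IRLS/Newton steps is shown here under stated assumptions: the function name `post_logit_refit` and its arguments are hypothetical, the data are assumed non-separable, and this is not the StatsPAI implementation.

```python
import numpy as np


def post_logit_refit(X, y, support, n_iter=25, tol=1e-12):
    """Unpenalized logistic refit on the columns in `support` (hypothetical sketch).

    Returns (intercept, coef), where coef has length X.shape[1] and is zero
    outside the selected support.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    D = np.column_stack([np.ones(n), X[:, support]])   # intercept + selected columns
    theta = np.zeros(D.shape[1])

    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(D @ theta)))
        w = prob * (1.0 - prob)
        grad = D.T @ (y - prob)                # score vector
        hess = (D * w[:, None]).T @ D          # Fisher information
        step = np.linalg.solve(hess, grad)     # Newton step
        theta += step
        if np.max(np.abs(step)) < tol:
            break

    coef = np.zeros(X.shape[1])
    coef[support] = theta[1:]
    return theta[0], coef
```

At convergence the score equations hold on the selected design, so the response residuals `y - prob` are orthogonal to the intercept and to every selected column; unselected coefficients stay exactly zero.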

docs/reference/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# API Reference — Overview
22

3-
StatsPAI exposes 1,125 registered public functions under a single
3+
StatsPAI exposes 1,127 registered public functions under a single
44
`import statspai as sp` namespace. Reference pages are grouped by
55
methodological area:
66

docs/stats.md

Lines changed: 2 additions & 2 deletions
@@ -90,7 +90,7 @@ Sorted by LOC. This table is generated from the live source tree by `python scri
9090
| `causal_text` | 1,457 | 4 | 4 |
9191
| `target_trial` | 1,457 | 7 | 9 |
9292
| `mediation` | 1,454 | 4 | 6 |
93-
| `rlasso` | 1,442 | 5 | 6 |
93+
| `rlasso` | 2,128 | 6 | 8 |
9494
| `bunching` | 1,437 | 5 | 8 |
9595
| `fairness` | 1,418 | 3 | 6 |
9696
| `power` | 1,404 | 3 | 12 |
@@ -126,7 +126,7 @@ Sorted by LOC. This table is generated from the live source tree by `python scri
126126
| `checks` | 152 | 2 | 0 |
127127
| `causal` | 111 | 1 | 0 |
128128
| `schemas` | 0 | 0 | 0 |
129-
| **Total** | **335,702** | **692** | **1125** |
129+
| **Total** | **336,450** | **693** | **1127** |
130130
## 3 · Causal-inference coverage matrix (full)
131131

132132
Legend: B = broad API coverage within this comparison table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels, not validation tiers.

schemas/functions.json

Lines changed: 87 additions & 0 deletions
@@ -32548,6 +32548,56 @@
3254832548
"type": "object"
3254932549
}
3255032550
},
32551+
{
32552+
"description": "Logistic rigorous (post-)Lasso -- a faithful port of ``hdm::rlassologit``.",
32553+
"name": "rlassologit",
32554+
"parameters": {
32555+
"properties": {
32556+
"X": {
32557+
"description": "Feature matrix or covariate DataFrame.",
32558+
"type": "string"
32559+
},
32560+
"colnames": {
32561+
"description": "Column names (default ``V1..Vp``).",
32562+
"items": {
32563+
"type": "string"
32564+
},
32565+
"type": "array"
32566+
},
32567+
"control": {
32568+
"description": "``threshold`` -- coefficients below it are zeroed (default None).",
32569+
"type": "object"
32570+
},
32571+
"intercept": {
32572+
"default": true,
32573+
"description": "Include an intercept.",
32574+
"type": "boolean"
32575+
},
32576+
"penalty": {
32577+
"description": "Overrides for ``c`` (slack; default 1.1 for ``post=True``, else 0.5), ``gamma`` (default ``0.1/log n``) and ``lambda`` (raw penalty; bypasses the data-driven level).",
32578+
"type": "object"
32579+
},
32580+
"post": {
32581+
"default": true,
32582+
"description": "If ``True``, refit the selected support by *unpenalized* logistic regression (post-Lasso); else keep the glmnet-penalized fit.",
32583+
"type": "boolean"
32584+
},
32585+
"y": {
32586+
"description": "Outcome variable column name or outcome array.",
32587+
"enum": [
32588+
"0",
32589+
"1"
32590+
],
32591+
"type": "string"
32592+
}
32593+
},
32594+
"required": [
32595+
"X",
32596+
"y"
32597+
],
32598+
"type": "object"
32599+
}
32600+
},
3255132601
{
3255232602
"description": "Rigorous (post-)Lasso as a scikit-learn regressor.",
3255332603
"name": "RlassoRegressor",
@@ -32659,6 +32709,43 @@
3265932709
"type": "object"
3266032710
}
3266132711
},
32712+
{
32713+
"description": "Logistic rigorous-Lasso classifier -- a genuine (calibrated) propensity.",
32714+
"name": "RlassologitClassifier",
32715+
"parameters": {
32716+
"properties": {
32717+
"c": {
32718+
"description": "c parameter (Optional[float]).",
32719+
"type": "number"
32720+
},
32721+
"clip": {
32722+
"default": 1e-05,
32723+
"description": "clip parameter (float).",
32724+
"type": "number"
32725+
},
32726+
"gamma": {
32727+
"description": "gamma parameter (Optional[float]).",
32728+
"type": "number"
32729+
},
32730+
"intercept": {
32731+
"default": true,
32732+
"description": "intercept parameter (bool).",
32733+
"type": "boolean"
32734+
},
32735+
"lambda_": {
32736+
"description": "lambda_ parameter (Optional[float]).",
32737+
"type": "number"
32738+
},
32739+
"post": {
32740+
"default": true,
32741+
"description": "post parameter (bool).",
32742+
"type": "boolean"
32743+
}
32744+
},
32745+
"required": [],
32746+
"type": "object"
32747+
}
32748+
},
3266232749
{
3266332750
"description": "Estimate OLS / IV with high-dimensional fixed effects via pyfixest. Validation: certified parity evidence.",
3266432751
"name": "feols",

schemas/index.json

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
11
{
22
"counts": {
33
"agent_cards": 376,
4-
"functions": 1125,
5-
"tools": 510
4+
"functions": 1127,
5+
"tools": 511
66
},
77
"files": [
88
"tools.json",

schemas/tools.json

Lines changed: 50 additions & 0 deletions
@@ -19092,6 +19092,56 @@
1909219092
},
1909319093
"name": "rlasso_iv"
1909419094
},
19095+
{
19096+
"description": "Logistic rigorous (post-)Lasso -- a faithful port of ``hdm::rlassologit``.",
19097+
"input_schema": {
19098+
"properties": {
19099+
"X": {
19100+
"description": "Feature matrix or covariate DataFrame.",
19101+
"type": "string"
19102+
},
19103+
"colnames": {
19104+
"description": "Column names (default ``V1..Vp``).",
19105+
"items": {
19106+
"type": "string"
19107+
},
19108+
"type": "array"
19109+
},
19110+
"control": {
19111+
"description": "``threshold`` -- coefficients below it are zeroed (default None).",
19112+
"type": "object"
19113+
},
19114+
"intercept": {
19115+
"default": true,
19116+
"description": "Include an intercept.",
19117+
"type": "boolean"
19118+
},
19119+
"penalty": {
19120+
"description": "Overrides for ``c`` (slack; default 1.1 for ``post=True``, else 0.5), ``gamma`` (default ``0.1/log n``) and ``lambda`` (raw penalty; bypasses the data-driven level).",
19121+
"type": "object"
19122+
},
19123+
"post": {
19124+
"default": true,
19125+
"description": "If ``True``, refit the selected support by *unpenalized* logistic regression (post-Lasso); else keep the glmnet-penalized fit.",
19126+
"type": "boolean"
19127+
},
19128+
"y": {
19129+
"description": "Outcome variable column name or outcome array.",
19130+
"enum": [
19131+
"0",
19132+
"1"
19133+
],
19134+
"type": "string"
19135+
}
19136+
},
19137+
"required": [
19138+
"X",
19139+
"y"
19140+
],
19141+
"type": "object"
19142+
},
19143+
"name": "rlassologit"
19144+
},
1909519145
{
1909619146
"description": "Robust / unconstrained Synthetic Control. Assumptions: A convex (or regularized) combination of donor units reproduces the treated unit's pre-treatment outcome path; No interference: the treatment does not affect the donor units (SUTVA); No anticipation before the treatment date. Pre-conditions: Panel of one or more treated units plus an untreated donor pool, observed over time; Pre-treatment window long enough to fit donor weights (rule of thumb: more pre-periods than donors used); Outcome observed for every unit in every period. Failure modes: Large pre-treatment RMSPE -- the synthetic unit fails to track the treated unit before treatment -> Add donors / predictors, lengthen the pre-period, or use a bias-corrected estimator (sdid, augsynth); Placebo / permutation inference shows the estimate is not extreme relative to donors -> Report the placebo distribution honestly; the effect may not be distinguishable from noise. Alternatives: sp.sdid, sp.augsynth, sp.gsynth, sp.callaway_santanna. Typical minimum N: 15.",
1909719147
"input_schema": {

src/statspai/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -894,8 +894,10 @@
894894
rlasso_effect,
895895
rlasso_effects,
896896
rlasso_iv,
897+
rlassologit,
897898
RlassoRegressor,
898899
RlassoClassifier,
900+
RlassologitClassifier,
899901
)
900902

901903
# High-dimensional fixed effects (pyfixest backend)
@@ -1651,8 +1653,10 @@
16511653
"rlasso_effect",
16521654
"rlasso_effects",
16531655
"rlasso_iv",
1656+
"rlassologit",
16541657
"RlassoRegressor",
16551658
"RlassoClassifier",
1659+
"RlassologitClassifier",
16561660
# High-dimensional FE (pyfixest backend, optional)
16571661
"feols",
16581662
"fepois",
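
Reviewer note: the commit message states that the lazy-import contract is intact (`import statspai` pulls 0 sklearn submodules) even after the two new names are exported. One way to check a claim of this kind is to import the package in a fresh interpreter and count the loaded modules under a given prefix. The helper `count_loaded` below is a hypothetical sketch, not the project's own test; it also counts the top-level package itself, not only submodules.

```python
import subprocess
import sys


def count_loaded(package, prefix):
    """Import `package` in a fresh interpreter and count the entries in
    sys.modules that are `prefix` itself or live under `prefix.`."""
    code = (
        "import sys, importlib\n"
        f"importlib.import_module({package!r})\n"
        f"hits = [m for m in sys.modules if m == {prefix!r} or m.startswith({prefix!r} + '.')]\n"
        "print(len(hits))\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(out.stdout.strip())


# Intended use, assuming statspai is installed in the current environment:
#     assert count_loaded("statspai", "sklearn") == 0
```

Running the import in a subprocess matters: in the current process `sys.modules` may already contain sklearn from an earlier import, which would make the count meaningless.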
