2 | 2 |
3 | 3 | All notable changes to StatsPAI will be documented in this file. |
4 | 4 |
| 5 | +## [1.14.0] — 2026-05-05 |
| 6 | + |
| 7 | +### Headline |
| 8 | + |
| 9 | +GPU-acceleration sprint. Three workloads now opt into accelerator |
| 10 | +backends without changing their public API: (1) the neural causal |
| 11 | +estimators (`sp.deepiv`, `sp.tarnet`, `sp.cfrnet`, `sp.dragonnet`, |
| 12 | +`sp.cevae`) route through PyTorch CUDA / MPS via a centralised |
| 13 | +device resolver; (2) `sp.fast.feols_jax` runs the full WLS solve on |
| 14 | +JAX / XLA; and (3) `sp.fast.feols_jax_bootstrap` lifts a JIT-compiled |
| 15 | +single-iteration WLS kernel to a `jax.vmap` batched primitive, |
| 16 | +giving a 10–100x speedup on CUDA / TPU over a sequential CPU |
| 17 | +bootstrap at B ≥ 1000. Four bootstrap variants share the same JAX kernel |
| 18 | +infrastructure: pairs (multinomial-weight resampling), cluster |
| 19 | +(Cameron–Gelbach–Miller 2008 §III.A), wild (row-level Rademacher), |
| 20 | +and wild cluster (Cameron–Gelbach–Miller 2008 §III.B); the wild |
| 21 | +variants use the score formulation |
| 22 | +`β* = β̂ + (X'WX)⁻¹ X'W (η ⊙ û)`, which is mathematically identical |
| 23 | +to refitting on `y* = X β̂ + η ⊙ û` but needs one mat-vec per |
| 24 | +iteration instead of a full QR. A new `cluster_meat` Rust kernel |
| 25 | +in `statspai_hdfe` (PyO3 + Rayon, parallel over clusters) is wired |
| 26 | +behind `statspai.core._numba_kernels.cluster_meat` with the existing |
| 27 | +numba kernel as automatic fallback. `sp.iv(absorb=...)` is the new |
| 28 | +2SLS-with-HDFE entry point: residualises `y`, exogenous controls, |
| 29 | +endogenous regressors, and instruments by one or more FE columns |
| 30 | +via the Phase-1 Rust demean kernel before fitting, with the residual |
| 31 | +DOF adjusted by `Σ(G_k - 1)` to charge the absorbed FE rank against |
| 32 | +iid / HC1 / CR1 SEs. A new `docs/guides/gpu_acceleration.md` is the |
| 33 | +canonical landing page for the accelerator story; the README and |
| 34 | +`paper.md` link to it and explicitly bound the GPU promise (most |
| 35 | +estimators are CPU-only by design — DiD / RD / synth / GMM are |
| 36 | +bandwidth-bound or small-K convex programs where a tuned CPU |
| 37 | +kernel matches GPU performance). |
| 38 | + |
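The equivalence claimed for the wild variants in the headline can be checked numerically. The following is a minimal NumPy sketch on simulated data, not StatsPAI code: the helper `wls_fit` and all variable names are illustrative assumptions. It draws one Rademacher vector `η`, computes `β*` once via the score formulation and once by refitting on `y*`, and compares the two.

```python
import numpy as np


def wls_fit(X, y, w):
    """Weighted least squares via the normal equations (hypothetical helper)."""
    XtW = X.T * w  # X'W, shape (k, n): column i of X' scaled by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)


rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
w = rng.uniform(0.5, 2.0, size=n)  # observation weights
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = wls_fit(X, y, w)
u_hat = y - X @ beta_hat  # fitted residuals

# One bootstrap draw: row-level Rademacher multipliers.
eta = rng.choice([-1.0, 1.0], size=n)

# Score formulation: (X'WX)^{-1} is computed once and reused for every draw,
# so each iteration only needs matrix-vector products.
XtW = X.T * w
bread = np.linalg.inv(XtW @ X)
beta_score = beta_hat + bread @ (XtW @ (eta * u_hat))

# Literal formulation: build the pseudo-outcome and refit from scratch.
y_star = X @ beta_hat + eta * u_hat
beta_refit = wls_fit(X, y_star, w)

max_gap = float(np.max(np.abs(beta_score - beta_refit)))
```

The two estimates agree up to floating-point error because `(X'WX)⁻¹ X'W X β̂ = β̂`, so refitting on `y*` only adds the `(X'WX)⁻¹ X'W (η ⊙ û)` term.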
| 39 | +### Added |
| 40 | + |
| 41 | +- `sp.fast.feols_jax` — JAX-backed end-to-end OLS / WLS with HDFE. |
| 42 | + Same formula DSL and `FeolsResult` return type as `sp.fast.feols`; |
| 43 | + the WLS solve and HC1 sandwich run on the default JAX device. |
| 44 | + CR1 cluster sandwich delegates to the existing `crve` (which |
| 45 | + itself dispatches to the new `cluster_meat` Rust kernel when |
| 46 | + built). Default `dtype="float64"` preserves bit-comparable |
| 47 | + numerics; `dtype="float32"` available for the GPU fast path. |
| 48 | +- `sp.fast.feols_jax_bootstrap` — vmap'd bootstrap with four |
| 49 | + variants (`pairs`, `cluster`, `wild`, `wild_cluster`). |
| 50 | + `vmap_chunk_size` parameter for memory control on tight devices. |
| 51 | + Same-seed → bit-identical reproducibility via `jax.random` PRNG. |
| 52 | + Returns a `FeolsBootstrapResult` dataclass with `coef`, `se_boot`, |
| 53 | + percentile `ci_lower` / `ci_upper`, and the full `boot_betas` |
| 54 | + table for custom CI methods. |
| 55 | +- `sp.iv(absorb=...)` — 2SLS with HDFE residualisation. Accepts |
| 56 | + `"firm + year"` string syntax or `["firm", "year"]` list. |
| 57 | + LIML / Fuller / GMM / JIVE raise `NotImplementedError` (Phase 3b). |
| 58 | +- `STATSPAI_TORCH_DEVICE` environment variable (`cpu` / `cuda` / |
| 59 | + `cuda:N` / `mps` / `auto`) routes neural causal estimators |
| 60 | + through the requested device. Default `cpu` preserves existing |
| 61 | + pinned numerics; explicit `cuda` raises if the device is |
| 62 | + unavailable rather than silently falling back. New |
| 63 | + `sp.fast.torch_device_info()` mirrors `sp.fast.jax_device_info()`. |
| 64 | +- `statspai_hdfe::cluster_meat` Rust kernel — Rayon parallel over |
| 65 | + clusters with thread-local k×k upper-triangle accumulator and |
| 66 | + elementwise reduction. Bumped the crate version 0.5.0-alpha.1 → |
| 67 | + 0.7.0-alpha.1. Activation requires a one-time |
| 68 | + `pip install maturin && cd rust/statspai_hdfe && maturin develop --release`; |
| 69 | + Python falls back to the numba kernel transparently when the |
| 70 | + Rust extension is absent. |
| 71 | +- `docs/guides/gpu_acceleration.md` — accelerator landing page |
| 72 | + with activation recipes, a Google Colab quickstart benchmark, |
| 73 | + and an explicit "what is *not* GPU-accelerated and why" table. |
| 74 | + |
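The `Σ(G_k - 1)` degrees-of-freedom charge described for `sp.iv(absorb=...)` can be illustrated with a small pure-Python sketch. The function names are hypothetical and are not part of the StatsPAI API. The sketch also assumes that `n_params` counts the explicitly estimated coefficients including the intercept, which matches a comparison against drop-first dummies but is not confirmed by the changelog.

```python
def absorbed_fe_rank(fe_columns):
    """Sum of (G_k - 1) over FE dimensions, where G_k is the number of levels."""
    return sum(len(set(column)) - 1 for column in fe_columns)


def residual_dof(n_obs, n_params, fe_columns):
    """Residual DOF after charging the absorbed FE rank (assumed convention)."""
    return n_obs - n_params - absorbed_fe_rank(fe_columns)


# Toy panel: 4 firms observed in 2 years each.
firm = ["a", "a", "b", "b", "c", "c", "d", "d"]
year = [2020, 2021] * 4

print(absorbed_fe_rank([firm, year]))    # (4 - 1) + (2 - 1) = 4
print(residual_dof(8, 2, [firm, year]))  # 8 - 2 - 4 = 2
```

Without this adjustment, the residual variance would be divided by too large a denominator, so SEs would come out smaller than those of the equivalent dummy-variable regression.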
| 75 | +### Changed |
| 76 | + |
| 77 | +- `paper.md` adds a fifth bullet to the *Unique features* |
| 78 | + list documenting the accelerator story, and notes the Rust |
| 79 | + HDFE / cluster-meat kernel in the implementation paragraph. |
| 80 | +- `README.md` comparison-table accelerator row now links to the |
| 81 | + new GPU guide; the *What StatsPAI is — and is not* bullet |
| 82 | + expands to explicitly mention `feols_jax`, |
| 83 | + `feols_jax_bootstrap`, and the vmap mechanism. |
| 84 | + |
| 85 | +### Internal |
| 86 | + |
| 87 | +- New helper `_jax_prep_inputs` shares formula-parse + FE-residualise |
| 88 | + logic between `feols_jax` and `feols_jax_bootstrap`. |
| 89 | + `feols_jax` itself is unchanged in this release; consolidation |
| 90 | + into a shared call site is a candidate follow-up. |
| 91 | +- Rust crate adds `src/cluster.rs` (kernel) and a `cluster_meat` |
| 92 | + PyO3 binding in `src/lib.rs`. 3 cargo unit tests cover small-DGP |
| 93 | + reference parity, k=1 closed form, and empty-input safety. |
| 94 | + |
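For orientation, the quantity a cluster-meat kernel computes is usually the sum over clusters of the outer product of per-cluster scores, `Σ_g (X_g' û_g)(X_g' û_g)'`. The NumPy sketch below is a reference implementation of that textbook formula, written under the assumption that no small-sample scaling is applied inside the kernel. It is not the Rust code, and `cluster_meat_reference` is a hypothetical name. It mirrors the upper-triangle accumulation mentioned above by filling only the `k(k+1)/2` unique entries and then symmetrising.

```python
import numpy as np


def cluster_meat_reference(X, resid, cluster_ids):
    """Reference cluster meat: sum_g s_g s_g' with s_g = X_g' u_g."""
    n, k = X.shape
    _, codes = np.unique(cluster_ids, return_inverse=True)
    codes = codes.ravel()
    n_clusters = int(codes.max()) + 1

    # Per-cluster score vectors s_g, accumulated row by row.
    scores = np.zeros((n_clusters, k))
    np.add.at(scores, codes, X * resid[:, None])

    # Accumulate only the upper triangle, then mirror it to the lower half.
    rows, cols = np.triu_indices(k)
    meat = np.zeros((k, k))
    meat[rows, cols] = (scores[:, rows] * scores[:, cols]).sum(axis=0)
    return meat + np.triu(meat, 1).T


rng = np.random.default_rng(1)
X = rng.normal(size=(12, 2))
u = rng.normal(size=12)
ids = np.repeat(np.arange(4), 3)  # 4 clusters of 3 observations each

meat = cluster_meat_reference(X, u, ids)
```

A sketch like this can serve as the slow but transparent baseline in a parity test against a compiled kernel.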
| 95 | +### Verified |
| 96 | + |
| 97 | +- 10 PyTorch device-resolver tests |
| 98 | + (`tests/test_torch_device_resolver.py`); 51 existing neural |
| 99 | + tests pass without numerical drift on default CPU. |
| 100 | +- 9 `cluster_meat` Rust parity tests |
| 101 | + (`tests/test_cluster_meat_rust.py`) — auto-skip when |
| 102 | + `statspai_hdfe` is not built. |
| 103 | +- 13 `sp.iv(absorb=)` parity tests vs explicit drop-first |
| 104 | + dummies (`tests/test_iv_absorb.py`); coefficients agree to |
| 105 | + `atol=1e-9`, iid SE to `rtol=1e-3`, cluster SE to `rtol=1e-2`. |
| 106 | +- 12 `feols_jax` parity tests vs `feols` |
| 107 | + (`tests/test_jax_feols.py`); iid / hc1 / cr1 / weighted / |
| 108 | + float32 / 6 error-path validations. |
| 109 | +- 24 `feols_jax_bootstrap` tests |
| 110 | + (`tests/test_jax_feols_bootstrap.py`); convergence to HC1 SE |
| 111 | + for pairs / wild and CR1 SE for cluster / wild_cluster at |
| 112 | + B=2000 (rtol 10–15%); algebraic identity check that the wild |
| 113 | + score formulation reproduces the literal "refit on pseudo-y" |
| 114 | + bootstrap to within `atol=1e-9` on a no-FE DGP. |
| 115 | + |
5 | 116 | ## [1.13.1] — 2026-05-05 |
6 | 117 |
7 | 118 | ### Headline |