docs(gpu): Phase 4c follow-up — wild + wild_cluster in guide and README; drop unverified Efron 1979 attribution

brycewang-stanford · claude · brycewang-stanford · commit b4ae5934a275 · 2026-05-05T21:02:48.000-07:00
Self-review pass on the v1.14 GPU docs:

1. Citation hygiene (CLAUDE.md §10). The Phase 4 GPU guide contained a textual ``(Efron 1979)`` attribution for pairs bootstrap that is **not** in paper.bib and was pulled from training-corpus memory. Per the zero-hallucination rule, drop it. ``cameron2008bootstrap`` is the only bootstrap citation we use; it has a verified DOI in paper.bib.

2. Phase 4c surface coverage. The ``Wild cluster bootstrap`` line in the GPU guide's "Future GPU candidates" list is no longer future work; it already shipped in 4782152. Move it from the future list into the body description as a sibling of pairs / cluster, with the same score-formulation note we put in CHANGELOG.md and paper.md.

3. Activation table + README accelerator bullet now list all four bootstrap variants (pairs, cluster, wild, wild cluster) instead of just two.

No code changes; no test impact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
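Note on the score-formulation claim in item 2: the guide text added below states that ``β* = β̂ + (X'WX)⁻¹ X'W (η ⊙ û)`` is identical to refitting on ``y* = X β̂ + η ⊙ û``. The following is a minimal standard-library sketch of that identity, not the StatsPAI implementation. It assumes a two-column design ``X = [1, x]`` and unit weights, and the helper names ``xtwx_inv`` and ``project`` are hypothetical. The last lines show how the wild cluster variant would differ, namely one sign per cluster rather than per row.

```python
import random


def xtwx_inv(x, w):
    """Closed-form inverse of X'WX for the illustrative design X = [1, x]."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s0 * s2 - s1 * s1
    return ((s2 / det, -s1 / det), (-s1 / det, s0 / det))


def project(x, w, v):
    """Return (X'WX)^{-1} X'W v for X = [1, x].

    A real implementation would factor X'WX once and reuse it across
    draws; this sketch recomputes it on every call for brevity.
    """
    a = xtwx_inv(x, w)
    g0 = sum(wi * vi for wi, vi in zip(w, v))
    g1 = sum(wi * xi * vi for wi, xi, vi in zip(w, x, v))
    return (a[0][0] * g0 + a[0][1] * g1, a[1][0] * g0 + a[1][1] * g1)


rng = random.Random(0)
n = 50
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
w = [1.0] * n  # unit weights (assumption); any positive weights work

# Original fit and residuals.
beta_hat = project(x, w, y)
resid = [yi - beta_hat[0] - beta_hat[1] * xi for xi, yi in zip(x, y)]

# One wild draw: independent Rademacher sign per row.
eta = [rng.choice((-1.0, 1.0)) for _ in range(n)]
perturbed = [e * u for e, u in zip(eta, resid)]

# Score formulation: beta_hat plus the projected sign-flipped residuals.
delta = project(x, w, perturbed)
beta_score = (beta_hat[0] + delta[0], beta_hat[1] + delta[1])

# Refit formulation: regress y* = X beta_hat + eta * resid on X.
y_star = [beta_hat[0] + beta_hat[1] * xi + p for xi, p in zip(x, perturbed)]
beta_refit = project(x, w, y_star)

# The two routes agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(beta_score, beta_refit))

# Wild cluster variant: one sign per cluster, broadcast to its rows.
cluster = [i % 5 for i in range(n)]  # hypothetical cluster ids
sign_by_cluster = {g: rng.choice((-1.0, 1.0)) for g in set(cluster)}
eta_cluster = [sign_by_cluster[g] for g in cluster]
```

The identity follows because ``(X'WX)⁻¹ X'W X β̂ = β̂``, so only the perturbed-residual term needs to be recomputed per draw.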
diff --git a/README.md b/README.md
@@ -339,7 +339,7 @@ StatsPAI is **not** a wrapper for R. We independently re-implement every algorit
 - **One result object, one API surface.** Every estimator — from `regress()` to `callaway_santanna()` to `causal_forest()` to `notears()` — returns a `CausalResult` with the same `.summary()` / `.plot()` / `.to_latex()` / `.cite()` interface. R users juggle 20+ incompatible S3 classes; StatsPAI users juggle one.
 - **Scope no single R or Python package matches.** DID + RD + Synth + Matching + DML + Meta-learners + TMLE + Neural Causal + Causal Discovery + Policy Learning + Conformal + Bunching + Spillover + Matrix Completion — all consistent, all under `sp.*`.
 - **Agent-native by design.** Self-describing schemas (`list_functions()`, `describe_function()`, `function_schema()`) make StatsPAI the first econometrics toolkit built for LLM-driven research workflows. No other package — in any language — offers this.
-- **Accelerator-ready where it matters.** Selected workloads can opt into accelerator backends without changing the public API: neural causal estimators route through PyTorch CUDA/MPS via `STATSPAI_TORCH_DEVICE`; the HDFE residualizer exposes `backend="jax"`; `sp.fast.feols_jax` runs end-to-end OLS on XLA; and **`sp.fast.feols_jax_bootstrap`** uses `jax.vmap` to lift pairs / cluster bootstrap into a single batched device program — 10–100x faster on CUDA / TPU than a sequential CPU loop at B ≥ 1000. See [GPU acceleration guide](docs/guides/gpu_acceleration.md). This is not a universal GPU-speed claim; most StatsPAI estimators are CPU-only by design (and that's the right choice for them).
+- **Accelerator-ready where it matters.** Selected workloads can opt into accelerator backends without changing the public API: neural causal estimators route through PyTorch CUDA/MPS via `STATSPAI_TORCH_DEVICE`; the HDFE residualizer exposes `backend="jax"`; `sp.fast.feols_jax` runs end-to-end OLS on XLA; and **`sp.fast.feols_jax_bootstrap`** uses `jax.vmap` to lift four bootstrap variants — pairs, cluster, wild, and wild cluster — into a single batched device program, 10–100x faster on CUDA / TPU than a sequential CPU loop at B ≥ 1000. See [GPU acceleration guide](docs/guides/gpu_acceleration.md). This is not a universal GPU-speed claim; most StatsPAI estimators are CPU-only by design (and that's the right choice for them).
 - **Publication pipeline out of the box.** Word + Excel + LaTeX + HTML + Markdown export from every estimator, not a separate `modelsummary`-style dance.
 
 If a method exists in R, we aim to match or exceed its feature set in Python — and then add what Python can uniquely offer: sklearn integration, opt-in JAX/PyTorch accelerator backends, and agent-native schemas.
diff --git a/docs/guides/gpu_acceleration.md b/docs/guides/gpu_acceleration.md
@@ -22,7 +22,7 @@
 | Neural IV: Deep IV (Hartford et al. 2017) | `sp.deepiv` | PyTorch | same env var |
 | HDFE demean (alternating projection) | `sp.fast.demean(backend="jax")` | JAX / XLA | install `jax[cuda]` |
 | OLS / WLS with HDFE | `sp.fast.feols_jax` | JAX / XLA | install `jax[cuda]` |
-| **Bootstrap (pairs / cluster)** | `sp.fast.feols_jax_bootstrap` | JAX / XLA `vmap` | install `jax[cuda]` |
+| **Bootstrap (pairs / cluster / wild / wild_cluster)** | `sp.fast.feols_jax_bootstrap` | JAX / XLA `vmap` | install `jax[cuda]` |
 
 The CPU paths (`sp.fast.demean`, `sp.fast.feols`, `sp.fast.fepois`,
 `sp.fast.boottest`, `sp.iv`, `sp.did`, `sp.rd`, `sp.synth`, …) all
@@ -44,14 +44,25 @@ time`; on CPU JAX it's still ~equal to a numpy sequential bootstrap
 (JIT overhead amortises around B ≈ 100). The speedup curve crosses
 favourably very quickly.
 
-**Pairs bootstrap** (Efron 1979): each draw resamples *rows* with
-replacement; multinomial counts become per-row WLS weights. Asymptotic
-target: HC1 standard errors.
+**Pairs bootstrap**: each draw resamples *rows* with replacement;
+multinomial counts become per-row WLS weights. Asymptotic target:
+HC1 standard errors.
 
 **Cluster bootstrap** (Cameron, Gelbach & Miller 2008 §III.A): each
 draw resamples *clusters* with replacement; observations in a cluster
 sampled k times get weight k. Asymptotic target: CR1 standard errors.
 
+**Wild bootstrap**: each draw assigns independent Rademacher signs
+``η_i ∈ {-1, +1}`` per row and uses the *score formulation*
+``β* = β̂ + (X'WX)⁻¹ X'W (η ⊙ û)``, mathematically identical to
+refitting on ``y* = X β̂ + η ⊙ û`` but with one mat-vec per
+iteration instead of a full QR.
+
+**Wild cluster bootstrap** (Cameron, Gelbach & Miller 2008 §III.B):
+same score formulation as wild, but the Rademacher signs are drawn
+*per cluster*. The standard tool for few-cluster inference (G < 30,
+especially G < 10) where cluster bootstrap can over-reject.
+
 ```python
 import statspai as sp
 
@@ -162,12 +173,11 @@ Performance Shaders) when CUDA is unavailable.
 | Bayesian causal (PyMC) | NumPyro / JAX backend optional | Routing to GPU works *via PyMC*; we don't reimplement. |
 
 Future GPU candidates (open issues welcome):
+
 - **Permutation tests / placebo studies** — `vmap` over permutations is
   the obvious follow-up to bootstrap.
 - **DML cross-fitting** — k-fold parallel nuisance fits.
 - **Synthetic control matrix completion** — large-K SVD on GPU.
-- **Wild cluster bootstrap (Cameron-Gelbach-Miller §III.B)** —
-  Phase 4c; closely related to the existing pairs / cluster bootstrap.
 - **Causal forest training** — wire `xgboost` / `cuml` for tree fits.
 
 ---