brycewang-stanford
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 44 additions & 0 deletions
diff --git a/MIGRATION.md b/MIGRATION.md
Lines changed: 57 additions & 0 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 9 additions & 0 deletions
diff --git a/src/statspai/datasets/__init__.py b/src/statspai/datasets/__init__.py
Lines changed: 97 additions & 13 deletions
@@ -2,6 +2,50 @@
 
 All notable changes to StatsPAI will be documented in this file.
 
+## [Unreleased]
+
+### Added — R-parity opt-in for `sp.rdrobust`
+
+- **New `bwselect='cct'`** in [`sp.rdrobust`](src/statspai/rd/rdrobust.py)
+  delegates the entire estimation (bandwidth selection + bias-corrected
+  inference) to the official `rdrobust>=1.3` Python port (Calonico,
+  Cattaneo & Titiunik 2014). This guarantees **bit-equal alignment with
+  R `rdrobust::rdrobust`** for users who need exact replication of
+  CCT-2014 published numbers — for example the canonical Lee/CCT Senate
+  case where R returns `Conv = 7.4141 / Robust = 7.5065 / h = 17.754`.
+  The internal `bwselect='mserd'` (default) is **kept unchanged for
+  backward compatibility** — it uses StatsPAI's own MSE-optimal recipe
+  which can drift from R's `rdbwselect` by up to ~70% on certain
+  datasets (documented in
+  [`tests/orig_parity/results/parity_table_orig.md`](tests/orig_parity/results/parity_table_orig.md)
+  row 52, module `05_lee_original`).
+- Install with `pip install statspai[rd-cct]` (adds the official
+  `rdrobust>=1.3` dependency).  Calling `bwselect='cct'` without it
+  raises a clear `ImportError` pointing to the install command.
+- See [MIGRATION.md](MIGRATION.md#sp-rdrobust-bwselect-cct-r-parity-opt-in)
+  for guidance on when to switch from `'mserd'` to `'cct'`.
+
+### Tests — `did::aggte` parity lock
+
+- Added [`TestAggteRParity`](tests/external_parity/test_published_replications.py)
+  in `tests/external_parity/test_published_replications.py`. It asserts
+  that `sp.aggte(type='simple')` agrees to within 1e-10 with the R
+  `did::aggte` output recorded in [`tests/orig_parity/results/02_mpdta_original_R.json`](tests/orig_parity/results/02_mpdta_original_R.json),
+  and that `type='dynamic'` matches R's published vignette output to
+  within 1e-3. This prevents future refactors from silently drifting
+  away from R.
+- Added [`TestCCTDelegationParity`](tests/external_parity/test_published_replications.py)
+  and an `ImportError`-guarded test that pin the new `bwselect='cct'`
+  delegation to R `rdrobust` Senate-replication numbers (Conv 7.4141,
+  Robust 7.5065, h=17.754, 1e-3 tolerance).
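A tolerance-based parity check of the kind these tests describe could look roughly like the sketch below. It is not the code in `TestCCTDelegationParity`: the helper `parity_failures` and the `close_enough` / `drifted` dictionaries are hypothetical, and only the reference numbers (Conv 7.4141, Robust 7.5065, h = 17.754, tolerance 1e-3) and the `'mserd'` figures (Conv 12.62, h = 4.6) come from this PR.

```python
import math

# Reference values quoted in the changelog for the Senate replication.
R_REFERENCE = {"conv": 7.4141, "robust": 7.5065, "h": 17.754}


def parity_failures(actual: dict, reference: dict, tol: float) -> list:
    """Return the keys whose values differ from the reference by more than tol."""
    return [
        key
        for key, expected in reference.items()
        if not math.isclose(actual[key], expected, rel_tol=0.0, abs_tol=tol)
    ]


# Hypothetical estimator output that agrees with R to about four decimals.
close_enough = {"conv": 7.41412, "robust": 7.50648, "h": 17.7541}

# Same hypothetical output, with conv and h replaced by the 'mserd' figures
# quoted in MIGRATION.md (robust is left at the hypothetical value).
drifted = dict(close_enough, conv=12.62, h=4.6)

ok = parity_failures(close_enough, R_REFERENCE, tol=1e-3)
bad = parity_failures(drifted, R_REFERENCE, tol=1e-3)
```

Returning the list of failing keys rather than a bare boolean makes a test failure say which quantity drifted.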
+
+### Internal
+
+- Added `[project.optional-dependencies] rd-cct = ["rdrobust>=1.3"]` to
+  [`pyproject.toml`](pyproject.toml).
+- `sp.datasets.list_datasets()` now returns six columns. The new
+  `paper_original` column separates the number published in the paper
+  from the estimate the simulated replica actually produces.
+
 ## [1.15.0] — 2026-05-05
 
 ### Docs — `sp.dml_panel` citation correction
 
@@ -5,6 +5,63 @@ Internal version-to-version migrations are at the top; the long-form
 
 ---
 
+## v1.15 → v1.16 — `sp.rdrobust(bwselect='cct')` R-parity opt-in
+
+**No breaking change.** `sp.rdrobust` keeps `bwselect='mserd'` (StatsPAI's
+own MSE-optimal recipe) as the default — every existing call returns the
+same numbers. A new opt-in value `bwselect='cct'` is added for users who
+need bit-equal R `rdrobust::rdrobust` parity.
+
+### When to switch from `'mserd'` to `'cct'`
+
+Use `bwselect='cct'` when **any** of these apply:
+
+- You're replicating a CCT 2014 / Cattaneo-Idrobo-Titiunik (2018, 2020)
+  paper and need the published numbers to the 4th decimal.
+- A reviewer asks for "the same number R `rdrobust` gives".
+- Your data has features that stress StatsPAI's internal pilot bandwidth
+  (heavy tails, small `n`, mass points). On the canonical Lee/CCT Senate
+  replication, `'mserd'` gives `Conv = 12.62 / h = 4.6`, while `'cct'`
+  gives `Conv = 7.41 / h = 17.75`, which is bit-equal to R's output.
+
+Keep the default `bwselect='mserd'` when:
+
+- You don't need exact R parity, **and**
+- You don't want a soft dependency on the `rdrobust` package, **and**
+- Your downstream tests / pipelines have already been calibrated against
+  StatsPAI's `'mserd'` numbers.
+
+### How to switch
+
+```python
+import statspai as sp
+
+# Before — StatsPAI internal MSE-optimal (kept stable)
+res = sp.rdrobust(data=df, y='y', x='x', c=0)
+# After — R-bit-equal via official rdrobust delegation
+res = sp.rdrobust(data=df, y='y', x='x', c=0, bwselect='cct')
+```
+
+Install the optional dependency once:
+
+```bash
+pip install statspai[rd-cct]   # adds rdrobust>=1.3
+```
+
+Calling `bwselect='cct'` without it raises a clear `ImportError` that
+points you to the install command — no silent fallback.
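The "no silent fallback" behaviour described above is a common optional-dependency pattern. The following is a minimal sketch of it, not StatsPAI's actual implementation: the helper name `require_optional` and the placeholder module name `surely_not_installed_rd_backend` are hypothetical, and only the `rd-cct` extra name and the install command come from this PR.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail loudly with an install hint.

    Hypothetical helper for illustration; not StatsPAI's actual code.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with the install command so there is no silent fallback.
        raise ImportError(
            f"'{module_name}' is required for this option. "
            f"Install it with: pip install statspai[{extra}]"
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json", extra="rd-cct")

# A missing module produces the actionable error message.
try:
    require_optional("surely_not_installed_rd_backend", extra="rd-cct")
except ImportError as err:
    message = str(err)
```

Chaining with `from exc` keeps the original import failure in the traceback, which helps when the package is installed but broken rather than missing.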
+
+### Why we didn't change `'mserd'` itself
+
+Aligning the internal `'mserd'` to R `rdbwselect`'s recursive 3-step
+recipe would shift point estimates on every dataset that exercises
+StatsPAI's RD path (5+ test classes, `r_parity` scripts, downstream
+docs / notebooks). The additive `'cct'` route gives anyone who wants R
+parity an immediate path **and** preserves the 1.x line's numerical
+stability. A future major version may flip the default.
+
+---
+
 ## v1.11 → v1.12 — DML module hardening
 
 `sp.dml`, `sp.dml_panel`, `sp.dml_model_averaging` keep all of their
 
@@ -93,6 +93,12 @@ bayes = [
 tune = [
     "optuna>=3.0",
 ]
+rd-cct = [
+    # Official Calonico-Cattaneo-Titiunik (2014) rdrobust port; opt-in
+    # via ``sp.rdrobust(..., bwselect='cct')`` to get bit-equal R parity
+    # for bandwidth selection + inference.
+    "rdrobust>=1.3",
+]
 
 [project.scripts]
 statspai = "statspai.cli:main"
@@ -110,6 +116,9 @@ where = ["src"]
 [tool.setuptools.package-dir]
 "" = "src"
 
+[tool.setuptools.package-data]
+"statspai.datasets" = ["data/*.csv"]
+
 [tool.black]
 line-length = 88
 target-version = ['py39']
 
@@ -64,10 +64,73 @@
 # Re-export synth-shipped datasets (unchanged DGPs; this is the
 # consolidated namespace)
 from ..synth.datasets import (
-    california_tobacco as california_prop99,
+    california_tobacco as _california_tobacco_simulated,
     basque_terrorism,
     german_reunification,
 )
+from ._canonical import _load_bundled_csv
+
+
+def california_prop99(simulated: bool = True) -> pd.DataFrame:
+    """California Proposition 99 panel (Abadie-Diamond-Hainmueller 2010).
+
+    Parameters
+    ----------
+    simulated : bool, default True
+        If True, return the simulated covariate-rich replica from
+        ``synth.california_tobacco`` (39 states × 31 years, 1970-2000,
+        ADH-shaped DGP).  Default for backward compatibility.
+        If False, load the real ADH (2010) panel bundled in
+        ``statspai/datasets/data/california_prop99.csv`` (39 states ×
+        31 years, with covariates ``cigsale, retprice, lnincome,
+        age15to24, beer``; identical to tidysynth's smoking dataset).
+        Use this for exact paper replication.
+
+    Returns
+    -------
+    pd.DataFrame
+        Columns (both branches): ``state, year, cigsale, retprice,
+        lnincome, age15to24, beer``.  The simulated branch additionally
+        provides ``treated``; on the real branch we derive it as
+        ``(state == 'California') & (year >= 1989)``.
+
+    References
+    ----------
+    Abadie, A., Diamond, A. & Hainmueller, J. (2010).
+    Synthetic Control Methods for Comparative Case Studies.
+    Journal of the American Statistical Association 105(490), 493-505.
+    [@abadie2010synthetic]
+    """
+    if simulated:
+        return _california_tobacco_simulated()
+
+    df = _load_bundled_csv("california_prop99.csv")
+    # The bundled real CSV does not carry a 'treated' indicator; derive
+    # it so downstream callers (synth, synthdid, plotting) work uniformly.
+    if 'treated' not in df.columns:
+        df = df.copy()
+        df['treated'] = (
+            (df['state'] == 'California') & (df['year'] >= 1989)
+        ).astype(int)
+    df.attrs['paper'] = (
+        "Abadie, A., Diamond, A. & Hainmueller, J. (2010). "
+        "Synthetic Control Methods for Comparative Case Studies. "
+        "JASA 105(490), 493-505."
+    )
+    df.attrs['data_source'] = 'real'
+    df.attrs['simulated'] = False
+    df.attrs['source_origin'] = (
+        "Public-domain ADH (2010) California Prop 99 panel; "
+        "byte-identical to tidysynth's smoking dataset (1970-2000)."
+    )
+    df.attrs['notes'] = (
+        "Real ADH panel for exact paper replication.  Use the full "
+        "ADH (2010) predictor recipe via sp.synth(method='classic', "
+        "special_predictors=...) for canonical numbers; the headline "
+        "1989-2000 average gap is roughly -19 packs/capita per ADH "
+        "(2010) Figure 2."
+    )
+    return df
 
 # Convenience alias
 teen_employment = mpdta
@@ -76,40 +139,61 @@
 def list_datasets() -> pd.DataFrame:
     """Return a DataFrame describing all available datasets.
 
-    Columns: name, design, n_obs, paper, expected_main.
+    Columns: name, design, n_obs, paper, paper_original, expected_main.
+
+    - ``paper_original`` is the headline number from the published paper on the
+      ORIGINAL data (what readers expect to see).
+    - ``expected_main`` is what the canonical estimator recovers on this
+      simulated replica (what users will actually observe). The two differ
+      because the bundled replicas are deterministic DGPs calibrated to the
+      neighbourhood of the published values, not the original data.
+
+    For the tests that enforce these numerical neighbourhoods, see
+    ``tests/external_parity/test_published_replications.py`` and
+    ``tests/external_parity/PUBLISHED_REFERENCE_VALUES.md``.
     """
     registry = [
+        # (name, design, n_obs, paper, paper_original, expected_main)
         ('mpdta', 'DID', 2500,
          "Callaway-Sant'Anna (2021)",
-         "Simple ATT ≈ -0.040 (teen employment effect of min-wage)"),
+         "Simple ATT ≈ -0.0454 (R did::att_gt on original mpdta)",
+         "Simple ATT ≈ -0.033, dynamic ATT ≈ -0.034 on this replica"),
         ('card_1995', 'IV', 3010,
          "Card (1995)",
-         "IV returns-to-schooling ≈ 0.132 (OLS ≈ 0.075)"),
+         "IV β_educ ≈ 0.132, OLS ≈ 0.075 (Table 3, NLSYM)",
+         "IV β_educ ≈ 0.142, OLS ≈ 0.110 on this replica"),
         ('nsw_lalonde', 'RCT / matching', 445,
          "LaLonde (1986) / Dehejia-Wahba (1999)",
-         "Experimental ATT ≈ $1,794 (re78)"),
+         "Experimental ATT ≈ $1,794 (DW 1999, re78)",
+         "Naive OLS ≈ $1,556 on this replica (calibrated to $1,794)"),
         ('nsw_dw', 'SOO', 2675,
          "Dehejia-Wahba (1999)",
-         "Naive OLS ≈ -$8,498; PSM ≈ $1,794"),
+         "Naive OLS ≈ -$8,498; PSM ≈ $1,794 (DW 1999)",
+         "Naive OLS ≈ -$8,387; covariate-adjusted ≈ $2,313 on replica"),
         ('lee_2008_senate', 'RD', 6558,
          "Lee (2008)",
-         "Incumbent advantage ≈ 0.08 voteshare points"),
+         "Incumbent advantage ≈ 0.077 voteshare pts (Table 4)",
+         "Conventional ≈ 0.073, CCT robust ≈ 0.062 on this replica"),
         ('angrist_krueger_1991', 'IV', 5000,
          "Angrist-Krueger (1991)",
-         "QOB IV returns-to-schooling ≈ 0.08–0.11"),
+         "QOB IV β_educ ≈ 0.08–0.11 (Table V, range)",
+         "IV β_educ ≈ 0.10 by construction on this replica"),
         ('california_prop99', 'SCM', 1200,
          "Abadie-Diamond-Hainmueller (2010)",
-         "ATT ≈ -15 packs/capita (1988-2000)"),
+         "Mean 1989-2000 ATT ≈ -19 packs/capita (JASA Fig. 2)",
+         "Classic ADH ≈ -13.1, ASCM ≈ -13.3 packs/capita on this replica"),
         ('basque_terrorism', 'SCM', 774,
          "Abadie-Gardeazabal (2003)",
-         "GDP gap ≈ -0.855 (mean 1975-1997)"),
+         "GDP gap ≈ -0.855 (mean 1975-1997)",
+         "GDP gap ≈ -0.855 on this replica (calibrated)"),
         ('german_reunification', 'SCM', 748,
          "Abadie-Diamond-Hainmueller (2015)",
-         "West Germany GDPpc gap ≈ -1,500 (post-1990)"),
+         "West Germany GDPpc gap ≈ -1,500 (post-1990)",
+         "GDPpc gap ≈ -1,500 on this replica (calibrated)"),
     ]
     return pd.DataFrame(registry,
-                        columns=['name', 'design', 'n_obs',
-                                 'paper', 'expected_main'])
+                        columns=['name', 'design', 'n_obs', 'paper',
+                                 'paper_original', 'expected_main'])
 
 
 __all__ = [