Zero-to-hero Jupyter notebooks on causal inference and experimentation for global LLM rollouts. Each notebook starts from a naive baseline and builds up to modern best practice on the same synthetic dataset, so methods are comparable across notebooks.
The starting point was Rudrendu Paul's Product Experimentation series on freeCodeCamp. It is a solid introduction to the methods but stops at the textbook version. I took the same methods and pushed them to where they hold up in real product and business analytics work: staged and opt-in rollouts, clustered standard errors, placebo and sensitivity diagnostics, and a clear account of when each method breaks.
| # | Notebook | Method family | Setting | Source article |
|---|---|---|---|---|
| 01 | Synthetic Control | Vanilla SCM, donor screening, ridge SCM, Augmented SCM, Synthetic DiD, per-unit SCM | Staged rollout, +5pp lift | link |
| 02 | Difference-in-Differences | 2x2 DiD, regression DiD with clustered SE, event study, doubly robust DiD, Honest DiD | Staged rollout, +5pp lift | link |
| 03 | Propensity Scores | IPW ATE/ATT, 1-NN matching, DR-ATT, E-value sensitivity, positivity trimming | Opt-in feature, +8pp effect | link |
| 04 | Regression Discontinuity | Local linear RDD, McCrary density test, bandwidth sweep, placebo cutoffs | Threshold routing, +6pp effect | link |
| 05 | Cluster Randomization | ICC and design effect, CR2 + Satterthwaite, wild cluster bootstrap, CV3 jackknife, randomization inference | Parallel CRT, +5pp lift | link |
Each notebook also carries an inference / diagnostics suite (placebo tests, bootstrap CIs, sensitivity analysis) and a side-by-side comparison of the methods it covers.
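As one illustration of the bootstrap CIs mentioned above, here is a minimal sketch of a percentile bootstrap that resamples whole workspaces rather than individual users. It is not taken from the notebooks: the function name `cluster_bootstrap_ci`, the workspace-level rates, and the noise level are all hypothetical and chosen only for the example.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a difference in means.

    `treated` and `control` hold one summary value per cluster (workspace),
    so each resample draws whole clusters with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical workspace-level conversion rates (assumed values, not the
# notebooks' data): 25 control and 25 treated workspaces, +5pp true gap.
data_rng = random.Random(1)
control_rates = [0.30 + data_rng.gauss(0, 0.02) for _ in range(25)]
treated_rates = [0.35 + data_rng.gauss(0, 0.02) for _ in range(25)]

point_estimate = mean(treated_rates) - mean(control_rates)
ci_low, ci_high = cluster_bootstrap_ci(treated_rates, control_rates)
print(f"estimate={point_estimate:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

Resampling at the workspace level keeps within-workspace correlation intact in every draw, which is the reason to prefer it over a user-level bootstrap when treatment is assigned by workspace.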
Notebooks 01-04 are observational / quasi-experimental: they recover effects from data where the assignment was not under your control. Notebook 05 is experimental: you randomize whole workspaces by design, and the work lies in accounting correctly for the statistical cost of clustering.
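To make the cost of clustering concrete, the standard design effect for equal-sized clusters is DEFF = 1 + (m - 1) * ICC, where m is the cluster size. The sketch below applies it to a 50-workspace, 50k-user setup, assuming equal workspace sizes and a purely hypothetical ICC of 0.02; the notebooks may estimate the ICC and handle unequal sizes differently.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from cluster randomization with equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


n_users = 50_000
n_workspaces = 50
users_per_workspace = n_users // n_workspaces  # assumes equal-sized workspaces
icc = 0.02  # hypothetical intracluster correlation, for illustration only

deff = design_effect(users_per_workspace, icc)
n_eff = effective_sample_size(n_users, users_per_workspace, icc)
print(f"design effect={deff:.2f}, effective n={n_eff:.0f}")
```

Even a small ICC matters when clusters are large: with 1,000 users per workspace, the effective sample size is a small fraction of the nominal 50k.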
Notebooks 01 and 02 share one generator: 50k users, 50 workspaces, a 30-week window, treatment at week 20, a +5pp ground-truth lift. Wave 1 (workspaces 0-24) is treated; Wave 2 (25-49) acts as donors. Notebook 05 reuses the same 50-workspace, 50k-user universe but assigns it as a parallel CRT (25 workspaces randomized to treatment, 25 to control). Notebooks 03 and 04 use their own generators matched to the method: individual-level opt-in for 03, and a Beta(5,2) running variable with a 0.85 cutoff for 04.
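The shared staged-rollout setup can be sketched roughly as follows. This is not the notebooks' generator: the binary conversion outcome, the baseline rates, the linear weekly trend, the equal workspace sizes, and zero-based week indexing are assumptions made for illustration, and the names `simulate_staged_rollout` and `did_2x2` are hypothetical. Only the dimensions, the treatment week, the wave split, and the +5pp lift come from the description above.

```python
import numpy as np


def simulate_staged_rollout(seed=0, n_workspaces=50, users_per_ws=1000,
                            n_weeks=30, treat_week=20, lift=0.05):
    """Simulate weekly conversion rates per workspace for a staged rollout."""
    rng = np.random.default_rng(seed)
    # Assumed baseline: a workspace-specific rate plus a shared weekly trend.
    ws_base = rng.uniform(0.20, 0.40, size=n_workspaces)
    week_trend = 0.002 * np.arange(n_weeks)
    p = ws_base[:, None] + week_trend[None, :]
    treated = np.arange(n_workspaces) < n_workspaces // 2  # Wave 1: 0-24
    post = np.arange(n_weeks) >= treat_week                # weeks 20-29
    p = p + lift * (treated[:, None] & post[None, :])
    # Aggregate user outcomes as one binomial count per workspace-week.
    conversions = rng.binomial(users_per_ws, p)
    return conversions / users_per_ws, treated, post


def did_2x2(rates, treated, post):
    """Naive 2x2 difference-in-differences on cell means."""
    treated_change = rates[treated][:, post].mean() - rates[treated][:, ~post].mean()
    donor_change = rates[~treated][:, post].mean() - rates[~treated][:, ~post].mean()
    return treated_change - donor_change


rates, treated, post = simulate_staged_rollout()
estimate = did_2x2(rates, treated, post)
print(f"2x2 DiD estimate: {estimate:.4f} (ground truth 0.05)")
```

Because the assumed baseline is additive in workspace and week effects, parallel trends hold by construction and the 2x2 estimate should land close to the true lift. The later methods in each notebook earn their keep when that assumption is relaxed.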

    pip install numpy pandas scipy matplotlib scikit-learn
    jupyter lab

Python 3.9. No dependencies outside the list above.
Open any notebook top to bottom. Structure is the same throughout: motivation, naive baseline, progressive method upgrades, inference suite, comparison, and when to use what. Implementations are the shortest clear version, not a production library.
Planned additions (instrumental variables, online experiment toolkit, heterogeneous effects, double machine learning) are in ROADMAP.md.