
firefly1248/causal-inference-for-llm-rollouts

GenAIEvaluation

License: MIT | Python 3.9

Zero-to-hero Jupyter notebooks on causal inference and experimentation for global LLM rollouts. Each notebook starts from a naive baseline and builds up to modern best practice on the same synthetic dataset, so methods are comparable across notebooks.

Background

The starting point was Rudrendu Paul's Product Experimentation series on freeCodeCamp. It was a solid introduction to the methods, but stopped at the textbook version. I took the same methods and pushed them to where they hold up in real product and business analytics work: staged and opt-in rollouts, clustered standard errors, placebo and sensitivity diagnostics, and a clear answer to when each method breaks.

Notebooks

| # | Notebook | Method family | Setting | Source article |
|---|---|---|---|---|
| 01 | Synthetic Control | Vanilla SCM, donor screening, ridge SCM, Augmented SCM, Synthetic DiD, per-unit SCM | Staged rollout, +5pp lift | link |
| 02 | Difference-in-Differences | 2x2 DiD, regression DiD with clustered SE, event study, doubly robust DiD, Honest DiD | Staged rollout, +5pp lift | link |
| 03 | Propensity Scores | IPW ATE/ATT, 1-NN matching, DR-ATT, E-value sensitivity, positivity trimming | Opt-in feature, +8pp effect | link |
| 04 | Regression Discontinuity | Local linear RDD, McCrary density test, bandwidth sweep, placebo cutoffs | Threshold routing, +6pp effect | link |
| 05 | Cluster Randomization | ICC and design effect, CR2 + Satterthwaite, wild cluster bootstrap, CV3 jackknife, randomization inference | Parallel CRT, +5pp lift | link |

Each notebook also carries an inference / diagnostics suite (placebo tests, bootstrap CIs, sensitivity analysis) and a side-by-side comparison of the methods it covers.
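
As one concrete instance of the sensitivity analysis mentioned above, here is a minimal sketch of the E-value listed for notebook 03. It uses the standard risk-ratio formula, E = RR + sqrt(RR * (RR - 1)), and is not taken from the notebook itself. The function name `e_value` and the 40% baseline rate are hypothetical, chosen only to illustrate a +8pp effect.

```python
import math


def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate on the risk-ratio scale.

    Ratios below 1 are inverted first, so protective and harmful
    effects of the same strength get the same E-value.
    """
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# Hypothetical rates: 40% baseline, 48% with the feature (+8pp).
control_rate, treated_rate = 0.40, 0.48
rr = treated_rate / control_rate  # risk ratio of about 1.2
print(f"RR = {rr:.2f}, E-value = {e_value(rr):.2f}")
```

The E-value reads as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the estimate.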

Notebooks 01-04 are observational / quasi-experimental: they recover effects from data where the assignment was not under your control. Notebook 05 is experimental: you randomize whole workspaces by design, and the work is paying the statistical cost of clustering correctly.
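
To make the "statistical cost of clustering" concrete, the sketch below computes the standard design effect, DEFF = 1 + (m - 1) * ICC, for equal-sized clusters. The cluster size of 1,000 assumes the 50k users split evenly across the 50 workspaces, and the ICC of 0.02 is a hypothetical value, not one reported in the notebooks.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from randomizing clusters instead of individuals."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_total: float, cluster_size: float, icc: float) -> float:
    """Number of independent users the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)


# Assumed: 50,000 users spread evenly over 50 workspaces.
N_TOTAL, CLUSTER_SIZE = 50_000, 1_000
ICC = 0.02  # hypothetical intra-cluster correlation

deff = design_effect(CLUSTER_SIZE, ICC)
n_eff = effective_sample_size(N_TOTAL, CLUSTER_SIZE, ICC)
se_inflation = math.sqrt(deff)  # factor by which naive standard errors are too small

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f}")
print(f"SE inflation factor: {se_inflation:.2f}")
```

Even a small ICC is costly when clusters are large: under these assumed numbers the 50,000 users carry roughly the information of a few thousand independent ones.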

Synthetic data

Notebooks 01 and 02 share one generator: 50k users, 50 workspaces, a 30-week window, treatment at week 20, a +5pp ground-truth lift. Wave 1 (workspaces 0-24) is treated; Wave 2 (25-49) acts as donors. Notebook 05 reuses the same 50-workspace, 50k-user universe but assigns it as a parallel CRT (25 workspaces randomized to treatment, 25 to control). Notebooks 03 and 04 use their own generators matched to the method (individual-level opt-in and a Beta(5,2) running variable with a 0.85 cutoff).
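
The shared panel described above can be sketched roughly as follows, together with the naive 2x2 DiD that notebook 02 starts from. This is an illustration under stated assumptions, not the repository's generator: the binary outcome, the 30% baseline rate, the workspace effects, the linear week trend, the equal workspace sizes, and the 0-indexed weeks (post-period is week 20 onward) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Figures from the README: 50 workspaces, 50k users, 30 weeks, treatment at week 20, +5pp.
N_WORKSPACES, USERS_PER_WS, N_WEEKS = 50, 1_000, 30
TREAT_WEEK, TRUE_LIFT = 20, 0.05
BASE_RATE = 0.30  # assumed baseline success rate

weeks = np.arange(N_WEEKS)
post = weeks >= TREAT_WEEK                  # assumed: weeks are 0-indexed
treated_ws = np.arange(N_WORKSPACES) < 25   # Wave 1 = workspaces 0-24

# Assumed structure: time-invariant workspace effects plus a common weekly trend.
ws_effect = rng.normal(0.0, 0.03, size=N_WORKSPACES)
week_trend = 0.002 * weeks

# Success probability per (workspace, week); the lift applies to treated x post cells.
p = BASE_RATE + ws_effect[:, None] + week_trend[None, :]
p = p + TRUE_LIFT * (treated_ws[:, None] & post[None, :])
p = np.clip(p, 0.0, 1.0)

# Aggregate user outcomes to workspace-week success rates.
rate = rng.binomial(USERS_PER_WS, p) / USERS_PER_WS


def did_2x2(rate, treated, post):
    """(treated post - treated pre) - (control post - control pre)."""
    t_change = rate[treated][:, post].mean() - rate[treated][:, ~post].mean()
    c_change = rate[~treated][:, post].mean() - rate[~treated][:, ~post].mean()
    return t_change - c_change


estimate = did_2x2(rate, treated_ws, post)
print(f"2x2 DiD estimate: {estimate:.4f} (ground truth {TRUE_LIFT})")
```

Because the workspace effects are constant over time and the trend is shared, both difference out, so the estimate lands close to the +5pp ground truth. The point estimate alone says nothing about uncertainty, which is where clustered standard errors come in.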

Setup

pip install numpy pandas scipy matplotlib scikit-learn
jupyter lab

Python 3.9. The notebooks need no dependencies outside the list above; the jupyter lab command assumes JupyterLab is already installed.

How to read

Open any notebook top to bottom. Structure is the same throughout: motivation, naive baseline, progressive method upgrades, inference suite, comparison, and when to use what. Implementations are the shortest clear version, not a production library.

Roadmap

Planned additions (instrumental variables, online experiment toolkit, heterogeneous effects, double machine learning) are in ROADMAP.md.
