GenAIEvaluation

Zero-to-hero Jupyter notebooks on causal inference and experimentation for global LLM rollouts. Each notebook starts from a naive baseline and builds up to modern best practice on one synthetic dataset with a known ground-truth effect, so the methods it covers are directly comparable. Notebooks 01, 02, and 05 share a common user universe; 03 and 04 use their own generators.

Background

The starting point was Rudrendu Paul's Product Experimentation series on freeCodeCamp. It was a solid introduction to the methods but stopped at the textbook version. I took the same methods and pushed them to where they hold up in real product and business analytics work: staged and opt-in rollouts, clustered standard errors, placebo and sensitivity diagnostics, and a clear account of when each method breaks.

Notebooks

#	Notebook	Method family	Setting	Source article
01	Synthetic Control	Vanilla SCM, donor screening, ridge SCM, Augmented SCM, Synthetic DiD, per-unit SCM	Staged rollout, +5pp lift	link
02	Difference-in-Differences	2x2 DiD, regression DiD with clustered SE, event study, doubly robust DiD, Honest DiD	Staged rollout, +5pp lift	link
03	Propensity Scores	IPW ATE/ATT, 1-NN matching, DR-ATT, E-value sensitivity, positivity trimming	Opt-in feature, +8pp effect	link
04	Regression Discontinuity	Local linear RDD, McCrary density test, bandwidth sweep, placebo cutoffs	Threshold routing, +6pp effect	link
05	Cluster Randomization	ICC and design effect, CR2 + Satterthwaite, wild cluster bootstrap, CV3 jackknife, randomization inference	Parallel CRT, +5pp lift	link
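Row 03 lists E-value sensitivity. As a rough illustration of what that diagnostic computes, the sketch below uses the standard E-value formula for a risk ratio, RR + sqrt(RR * (RR - 1)). The function name e_value and the 40% baseline rate are assumptions made for this example, not taken from the notebook; the formula works on the risk-ratio scale, so a percentage-point effect has to be converted first.

```python
import math

def e_value(risk_ratio):
    """E-value for a point estimate on the risk-ratio scale (hypothetical helper)."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (RR < 1) are inverted so the formula applies symmetrically.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))

# Assumed baseline of 40%, so a +8pp effect corresponds to 48% vs 40%.
baseline = 0.40
rr_example = (baseline + 0.08) / baseline
print(f"RR = {rr_example:.2f}, E-value = {e_value(rr_example):.2f}")
```

The E-value is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the estimate.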

Each notebook also carries an inference / diagnostics suite (placebo tests, bootstrap CIs, sensitivity analysis) and a side-by-side comparison of the methods it covers.
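A bootstrap CI of the kind mentioned above can be sketched as follows. This is a minimal percentile bootstrap that resamples whole workspaces within each arm, not the notebooks' actual implementation. The function name cluster_bootstrap_ci and the simulated workspace means are hypothetical.

```python
import numpy as np

def cluster_bootstrap_ci(treated_means, control_means, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a difference in workspace-level means."""
    rng = np.random.default_rng(seed)
    t = np.asarray(treated_means, dtype=float)
    c = np.asarray(control_means, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample whole clusters with replacement, separately within each arm.
        t_star = rng.choice(t, size=t.size, replace=True)
        c_star = rng.choice(c, size=c.size, replace=True)
        diffs[b] = t_star.mean() - c_star.mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return t.mean() - c.mean(), lo, hi

# Hypothetical workspace-level outcome rates: 25 control, 25 treated (+5pp).
demo_rng = np.random.default_rng(1)
control = demo_rng.normal(0.40, 0.03, size=25)
treated = demo_rng.normal(0.45, 0.03, size=25)
est, lo, hi = cluster_bootstrap_ci(treated, control)
print(f"estimate = {est:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling workspaces rather than users keeps within-workspace correlation intact in every bootstrap draw, which is why the interval is wider than a user-level bootstrap would suggest.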

Notebooks 01-04 are observational / quasi-experimental: they recover effects from data where the assignment was not under your control. Notebook 05 is experimental: you randomize whole workspaces by design, and the work lies in correctly accounting for the statistical cost of clustering.
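The statistical cost of clustering can be made concrete with the standard design effect, DEFF = 1 + (m - 1) * ICC, where m is the cluster size. The sketch below assumes equal-sized workspaces of 1,000 users (50k users over 50 workspaces) and an ICC of 0.02; both are illustrative assumptions, and the function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomizing clusters of equal size."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent users the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# Assumed values: 1,000 users per workspace and ICC = 0.02.
deff = design_effect(cluster_size=1000, icc=0.02)
n_eff = effective_sample_size(n_total=50_000, cluster_size=1000, icc=0.02)
print(f"design effect = {deff:.2f}, effective n = {n_eff:.0f}")
```

Under these assumptions the design effect is about 21, so 50,000 clustered users carry roughly the information of 2,400 independent ones. Treating users as independent would understate standard errors accordingly.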

Synthetic data

Notebooks 01 and 02 share one generator: 50k users, 50 workspaces, a 30-week window, treatment at week 20, a +5pp ground-truth lift. Wave 1 (workspaces 0-24) is treated; Wave 2 (25-49) acts as donors. Notebook 05 reuses the same 50-workspace, 50k-user universe but assigns it as a parallel CRT (25 workspaces randomized to treatment, 25 to control). Notebooks 03 and 04 use their own generators matched to the method: individual-level opt-in for 03, and a Beta(5,2) running variable with a 0.85 cutoff for 04.
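A generator with the shape described above might look like the sketch below. It is not the notebooks' actual generator. It assumes a binary outcome aggregated to workspace-week rates, a 40% baseline, equal workspaces of 1,000 users, and weeks indexed 0-29 with treatment active from week 20 onward. The name simulate_panel is hypothetical. A naive 2x2 DiD on the result should land close to the +5pp ground truth.

```python
import numpy as np

def simulate_panel(n_workspaces=50, users_per_ws=1000, n_weeks=30,
                   treat_week=20, lift=0.05, seed=0):
    """Simulate workspace-by-week outcome rates for a staged rollout."""
    rng = np.random.default_rng(seed)
    # Assumed baseline: each workspace has its own rate centered on 40%.
    base = np.clip(rng.normal(0.40, 0.05, size=n_workspaces), 0.05, 0.90)
    treated = np.arange(n_workspaces) < n_workspaces // 2  # Wave 1 = first half
    post = np.arange(n_weeks) >= treat_week                # weeks 20-29
    # The lift applies only to Wave 1 workspaces in post-treatment weeks.
    p = base[:, None] + lift * (treated[:, None] & post[None, :])
    successes = rng.binomial(users_per_ws, p)
    return successes / users_per_ws, treated, post

rate, treated, post = simulate_panel()

# 2x2 DiD: (treated post - treated pre) - (donor post - donor pre).
did = ((rate[treated][:, post].mean() - rate[treated][:, ~post].mean())
       - (rate[~treated][:, post].mean() - rate[~treated][:, ~post].mean()))
print(f"2x2 DiD estimate: {did:.4f} (ground truth 0.05)")
```

Because the simulated lift is known, every estimator can be scored against the same target, which is what makes the methods comparable.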

Setup

pip install numpy pandas scipy matplotlib scikit-learn
jupyter lab

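Python 3.9. The notebooks use no dependencies outside the list above; the jupyter lab command additionally requires JupyterLab itself to be installed (for example, pip install jupyterlab).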

How to read

Read any notebook top to bottom. The structure is the same throughout: motivation, naive baseline, progressive method upgrades, inference suite, comparison, and when to use what. Implementations are the shortest clear version of each method, not a production library.

Roadmap

Planned additions (instrumental variables, online experiment toolkit, heterogeneous effects, double machine learning) are in ROADMAP.md.
