This repository presents a statistical inference workflow in Python using a two-way ANOVA (2×2 design).
Case study simulation: we treat the public StudentsPerformance dataset as if it were the result of evaluating some real test preparation course, and we want to estimate whether students who complete this course score higher in math and whether the pattern differs by gender.
Scope note: The dataset is observational (not a randomized experiment) and contains post-test scores only, therefore the results quantify associations and estimated group differences, not strict causality. A causal claim would require randomized assignment or a stronger quasi-experimental design.
Significance level: We use the conventional α = 0.05 for all hypothesis tests.
- Course effect (primary): Do students who complete the test preparation course score higher in math than those who did not?
- Moderation by gender: Is the course effect different for female vs male students?
- Gender main effect (secondary): Are there mean differences by gender, regardless of test preparation course participation?
- File included in the repository:
StudentsPerformance.csv - Source: Students Performance in Exam (Kaggle) – used directly in the code
Sample size: 1 000 students
Missing values: 0
For this analysis, the dependent variable is:
– math score* (0–100, continuous)
Independent variables in the 2×2 design:
– test preparation course* (none vs completed)
– gender (female vs male)
Potential observed confounders (checked for association with test preparation course):
– race/ethnicity*
– parental level of education*
– lunch
Variables not used in the analysis:
– reading score
– writing score
* at the beginning of the analysis, these variable names are standardized (shortened and converted to snake_case)
- Imports, data loading and preparation
- Descriptive statistics per group
- Visual signal check (boxplot and interaction plot)
- Selection / confounding check – Chi-squared test
- Outlier control and flagging
- ANOVA assumptions:
- homogeneity of variances (Brown–Forsythe / Levene test)
- normality of residuals (Q–Q plot + Shapiro–Wilk test)
- Two-way ANOVA (Type II SS)
- Confidence Intervals
- Final conclusions, limitations and recommendations
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.api import qqplot
from scipy.stats import levene, shapiro, chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans
path = kagglehub.dataset_download("spscientist/students-performance-in-exams")
dataset = pd.read_csv((f"{path}\StudentsPerformance.csv"))
dataset = dataset.drop(columns=["reading score", "writing score"])
dataset = dataset.rename(columns={
"race/ethnicity": "race_ethnicity",
"parental level of education": "parent_edu",
"test preparation course": "test_prep",
"math score": "math_score"
})
cat_cols = dataset.select_dtypes(include=["object"]).columns
dataset[cat_cols] = dataset[cat_cols].astype("category")
dataset["test_prep"] = pd.Categorical(
dataset["test_prep"], categories=["none", "completed"], ordered=True
)desc_stat = (dataset
.groupby(["gender", "test_prep"])["math_score"]
.agg(["count", "mean", "std", "median"])
.reset_index())
desc_stat.round(2)| gender | test_prep | count | mean | std | median |
|---|---|---|---|---|---|
| female | none | 334 | 61.67 | 15.82 | 62.0 |
| female | completed | 184 | 67.20 | 14.24 | 67.0 |
| male | none | 308 | 66.69 | 14.05 | 67.0 |
| male | completed | 174 | 72.34 | 14.22 | 73.0 |
Observations: Mean math scores are higher for completed than none in both genders, suggesting a potential course effect.
The male means are also higher than the female means across course groups, suggesting a possible gender main effect.
plt.figure(figsize=(8,5))
ax = sns.boxplot(data=dataset, x="test_prep", y="math_score", hue="gender")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
ax.set_title("Math score by test preparation and gender")
plt.show()
means = (dataset
.groupby(["test_prep", "gender"])["math_score"]
.mean()
.reset_index())
plt.figure(figsize=(7,5))
ax = sns.pointplot(
data=means, x="test_prep", y="math_score", hue="gender",
dodge=False, markers=["o","s"], capsize=0.1
)
ax.set_ylim(0, 100)
ax.set_title("Interaction plot: mean math score")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
plt.show()
Interpretation: The mean lines are approximately parallel, visually suggesting little to no interaction (course effect looks similar in both genders).
Is course participation associated with some student's trait?
Because the dataset does not describe random assignment to the course, we check – with Chi-square (χ²) Test of Independence – whether test_prep is associated with background variables (which could in the end affect the test scores).
- H0: The background variable is independent of
test_prep. - H1: The background variable is associated with
test_prep.
def chi2_and_cramers_v(tab):
chi2, p, df, expected = chi2_contingency(tab)
n = tab.to_numpy().sum()
r, k = tab.shape
v = np.sqrt(chi2 / (n * min(r-1, k-1)))
return chi2, p, df, v
for col in covars:
tab = pd.crosstab(dataset[col], dataset["test_prep"])
chi2, p, df, v = chi2_and_cramers_v(tab)
print(f"{col}: chi2={chi2:.3f}, df={df}, p={p:.3f}, Cramér's V={v:.3f}")| variable | chi2 | df | p | cramers_v |
|---|---|---|---|---|
| race_ethnicity | 5.488 | 4 | 0.241 | 0.074 |
| parent_edu | 9.544 | 5 | 0.089 | 0.098 |
| lunch | 0.221 | 1 | 0.638 | 0.015 |
Interpretation: None of the χ² test is significant (p>0.05), and Cramér’s V values are small, so we do not reject any of the null hypotheses. This suggests no evidence (in these observed variables) that course participation is driven by systematic selection. However, unmeasured confounding may still exist (student's decision about taking part in course could have been caused by low baseline level or high motivation).
Outliers are flagged within each 2×2 cell using the standard 1.5×IQR rule:
def flag_outliers(s, k=1.5):
q1, q3 = np.percentile(s, [25, 75])
iqr = q3 - q1
low = q1 - k * iqr
high = q3 + k * iqr
return (s < low) | (s > high)
dataset["cell"] = dataset["gender"].astype(str) + " × " + dataset["test_prep"].astype(str)
dataset["is_outlier"] = dataset.groupby("cell")["math_score"].transform(flag_outliers)
dataset["is_outlier"].value_counts()Flagged outliers: 11 / 1000
Checking if outliers go beyond the 0–100 range:
n_bad = dataset.loc[dataset["is_outlier"].eq(True) & ((dataset["math_score"]<0) | (dataset["math_score"]>100))]
print(n_bad.shape[0])Outliers beyond range: 0
Decision: We do not remove outliers by default because math scores remain within the valid range (0–100), i.e., they are plausible observations. Instead, we will later run a sensitivity/robustness check by repeating key steps without outliers.
First two assumptions are met:
- independence of observations: every observation represents different student,
- depednent variable is continuous.
Now we need to check another two assumptions.
- H0: variances are equal across the 2×2 cells
- H1: at least one cell has a different variance
groups = [g["math_score"].values for _, g in dataset.groupby("cell")]
F, p = levene(*groups, center="median")
k = len(groups)
N = sum(len(g) for g in groups)
df1 = k - 1
df2 = N - k
print(f"Levene (Brown–Forsythe): F({df1}, {df2}) = {F:.2f}, p = {p:.3f}")Result: F(3, 996) = 1.13, p = 0.638
Interpretation: Levene / Brown-Fortyshe test is not significant (p>0.05), so we do not reject the null hipotesis and assume homogeneity of variances as satisfied.
model = ols("math_score ~ C(gender) * C(test_prep)", data=dataset).fit()
resid = model.residqqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()
Interpretation: The distribution of residuals appears close to normal.
- H0: residuals are normally distributed
- H1: residuals deviate from normality
W, p = shapiro(model.resid)
print(f"Shapiro–Wilk (residuals): W = {W:.3f}, p = {p:.3f}")Result: W = 0.995, p = 0.004
Interpretation: The overall Shapiro–Wilk test is significant (p<0.05), so we should reject the null hipotesis, but with n = 1 000 Shapiro–Wilk can be sensitive (and Q–Q plot shows approximately normal distribution); therefore, we additionally check residual normality within each cell.
dataset["resid"] = model.resid
print("Shapiro-Wilk test per group:\n")
for cell, sub in dataset.groupby("cell"):
W, p = shapiro(sub["resid"])
print(f"{cell}: W = {W:.3f}, p = {p:.3f}")| cell | W | p |
|---|---|---|
| female × completed | 0.989 | 0.180 |
| female × none | 0.990 | 0.021 |
| male × completed | 0.987 | 0.110 |
| male × none | 0.992 | 0.086 |
Interpretation: Only one cell shows as statistically significant deviation (female × none).
To verify whether extreme values drive this deviation, we re-fit the model excluding flagged outliers and repeat Q–Q plot and Shapiro–Wilk test.
model2 = ols("math_score ~ C(gender) * C(test_prep)", data=dataset[dataset["is_outlier"]==False]).fit()
resid2 = model2.residqqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()
W, p = shapiro(resid2)
print(f"Shapiro-Wilk test (no outliers): W = {W:.3f}, p = {p:.3f}")Result: W = 0.997, p = 0.088
Conclusion: In our sample, the normality assessment is driven by outliers: the Shapiro–Wilk test rejects normality when outliers are included, but normality is not rejected after outlier removal. Therefore, we will run the two-way ANOVA, but both with and without outliers and check if it provides stable results.
- Gender main effect
- H0: There is no difference in means between genders.
- H1: There is a difference in means between genders.
- Course main effect
- H0: There is no difference in means between course participants and non-participants.
- H1: There is a difference in means between course participants and non-participants.
- Interaction (gender × course)
- H0: There is no interaction between gender and course (the effect of one factor does not depend on the level of the other factor).
- H1: There is an interaction between gender and course (the effect of one factor depends on the level of the other factor).
anova_table = anova_lm(model, typ=2)
anova_table.round(3)| term | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(gender) | 6399.210 | 1.0 | 29.503 | 0.000 |
| C(test_prep) | 7171.996 | 1.0 | 33.061 | 0.000 |
| C(gender):C(test_prep) | 0.908 | 1.0 | 0.004 | 0.948 |
| Residual | 216036.801 | 996 | NaN | NaN |
- Course main effect: statistically significant → we reject the null hipotesis → students who completed the course score higher on average.
- Gender main effect: statistically significant → we reject the null hipotesis → mean scores differ by gender.
- Interaction: not significant → we fail to reject the null hipotesis → the course effect does not meaningfully differ by gender.
anova_table2 = anova_lm(model2, typ=2)
anova_table2.round(3)| sum_sq | df | F | PR(>F) | |
|---|---|---|---|---|
| C(gender) | 5589.879 | 1.0 | 28.246 | 0.000 |
| C(test_prep) | 6545.141 | 1.0 | 33.073 | 0.000 |
| C(gender):C(test_prep) | 9.645 | 1.0 | 0.049 | 0.825 |
| Residual | 194933.766 | 985.0 | NaN | NaN |
Repeating the ANOVA without flagged outliers yields the same qualitative conclusions (course and gender main effects significant; interaction not significant).
x_c = DescrStatsW(dataset.loc[dataset["test_prep"] == "completed", "math_score"])
x_n = DescrStatsW(dataset.loc[dataset["test_prep"] == "none", "math_score"])
cm_prep = CompareMeans(x_c, x_n)
diff_prep = x_c.mean - x_n.mean
ci_prep = cm_prep.tconfint_diff(alpha=0.05, usevar="pooled")
x_m = DescrStatsW(dataset.loc[dataset["gender"] == "male", "math_score"])
x_f = DescrStatsW(dataset.loc[dataset["gender"] == "female", "math_score"])
cm_gender = CompareMeans(x_m, x_f)
diff_gender = x_m.mean - x_f.mean
ci_gender = cm_gender.tconfint_diff(alpha=0.05, usevar="pooled")
print(f"Course: ΔM = {diff_prep:.2f}, 95% CI [{ci_prep[0]:.2f}, {ci_prep[1]:.2f}]")
print(f"Gender: ΔM = {diff_gender:.2f}, 95% CI [{ci_gender[0]:.2f}, {ci_gender[1]:.2f}]")Results
- Course: ΔM = 5.62, 95% CI [3.69, 7.55]
- Gender: ΔM = 5.10, 95% CI [3.24, 6.95]
Interpretation: Both intervals exclude 0, being confirmation of statistically significant mean differences. In a population of similar students, we would expect the true mean difference to fall within [3.69, 7.55] for course participants vs non-participants and within [3.24, 6.95] for males vs females, with 95% confidence.
- Test preparation course is associated with higher math scores, but the estimated gain is only about +5.6 points on average, which is a modest effect in practical terms (especially on a 0–100 scale). If we were the course authors, we could treat this as a signal to review and strengthen the curriculum (but we should also remember about the possibility that lower-performing students were more likely to enroll, causing the mean difference to be smaller than the true course impact).
- No evidence of a course × gender interaction – the course effect appears similar across genders.
- Gender shows a statistically significant main effect – about +5.1 points for male vs female.
- Causality: The dataset is observational. Even though control variables are not significantly associated with course participation, unmeasured confounding may remain (e.g., baseline ability, motivation).
- Stronger design: To estimate a causal course effect, use a randomized or stratified assignment (e.g., strata by baseline math level).
LinkedIn: Julita Wawreszuk-Chylińska
Thank you for your interest in this project!