Two-Way ANOVA (2×2) – Does a Test Preparation Course Appear to Improve Math Scores?

This repository presents a statistical inference workflow in Python using a two-way ANOVA (2×2 design).

Case study simulation: we treat the public StudentsPerformance dataset as if it were the result of evaluating some real test preparation course, and we want to estimate whether students who complete this course score higher in math and whether the pattern differs by gender.

Scope note: The dataset is observational (not a randomized experiment) and contains post-test scores only, therefore the results quantify associations and estimated group differences, not strict causality. A causal claim would require randomized assignment or a stronger quasi-experimental design.

Significance level: We use the conventional α = 0.05 for all hypothesis tests.

Research Questions

Course effect (primary): Do students who complete the test preparation course score higher in math than those who did not?
Moderation by gender: Is the course effect different for female vs male students?
Gender main effect (secondary): Are there mean differences by gender, regardless of test preparation course participation?

Dataset

File included in the repository: StudentsPerformance.csv
Source: Students Performance in Exam (Kaggle) – used directly in the code

Sample size: 1 000 students
Missing values: 0

For this analysis, the dependent variable is:
– math score* (0–100, continuous)

Independent variables in the 2×2 design:
– test preparation course* (none vs completed)
– gender (female vs male)

Potential observed confounders (checked for association with test preparation course):
– race/ethnicity* – parental level of education* – lunch

Variables not used in the analysis: – reading score – writing score

* at the beginning of the analysis, these variable names are standardized (shortened and converted to snake_case)

Repository Workflow

Imports, data loading and preparation
Descriptive statistics per group
Visual signal check (boxplot and interaction plot)
Selection / confounding check – Chi-squared test
Outlier control and flagging
ANOVA assumptions:
- homogeneity of variances (Brown–Forsythe / Levene test)
- normality of residuals (Q–Q plot + Shapiro–Wilk test)
Two-way ANOVA (Type II SS)
Confidence Intervals
Final conclusions, limitations and recommendations

1) Imports, Data Loading and Preparation

import kagglehub
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.api import qqplot

from scipy.stats import levene, shapiro, chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans


path = kagglehub.dataset_download("spscientist/students-performance-in-exams")
dataset = pd.read_csv((f"{path}\StudentsPerformance.csv"))


dataset = dataset.drop(columns=["reading score", "writing score"])

dataset = dataset.rename(columns={
    "race/ethnicity": "race_ethnicity",
    "parental level of education": "parent_edu",
    "test preparation course": "test_prep",
    "math score": "math_score"
})

cat_cols = dataset.select_dtypes(include=["object"]).columns
dataset[cat_cols] = dataset[cat_cols].astype("category")

dataset["test_prep"] = pd.Categorical(
    dataset["test_prep"], categories=["none", "completed"], ordered=True
)

2) Descriptive Statistics

desc_stat = (dataset
        .groupby(["gender", "test_prep"])["math_score"]
        .agg(["count", "mean", "std", "median"])
        .reset_index())
desc_stat.round(2)

Cell-wise summary (2×2)

gender	test_prep	count	mean	std	median
female	none	334	61.67	15.82	62.0
female	completed	184	67.20	14.24	67.0
male	none	308	66.69	14.05	67.0
male	completed	174	72.34	14.22	73.0

Observations: Mean math scores are higher for completed than none in both genders, suggesting a potential course effect. The male means are also higher than the female means across course groups, suggesting a possible gender main effect.

3) Visual Signal Check

Boxplot by course and gender

plt.figure(figsize=(8,5))
ax = sns.boxplot(data=dataset, x="test_prep", y="math_score", hue="gender")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
ax.set_title("Math score by test preparation and gender")
plt.show()

Boxplot: math score by course and gender

Interaction plot (means)

means = (dataset
         .groupby(["test_prep", "gender"])["math_score"]
         .mean()
         .reset_index())

plt.figure(figsize=(7,5))
ax = sns.pointplot(
    data=means, x="test_prep", y="math_score", hue="gender",
    dodge=False, markers=["o","s"], capsize=0.1
)
ax.set_ylim(0, 100)
ax.set_title("Interaction plot: mean math score")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
plt.show()

Interpretation: The mean lines are approximately parallel, visually suggesting little to no interaction (course effect looks similar in both genders).

4) Selection / confounding check

Is course participation associated with some student's trait?

Because the dataset does not describe random assignment to the course, we check – with Chi-square (χ²) Test of Independence – whether test_prep is associated with background variables (which could in the end affect the test scores).

Hypothesis (per variable)

H0: The background variable is independent of test_prep.
H1: The background variable is associated with test_prep.

def chi2_and_cramers_v(tab):
    chi2, p, df, expected = chi2_contingency(tab)
    n = tab.to_numpy().sum()
    r, k = tab.shape
    v = np.sqrt(chi2 / (n * min(r-1, k-1)))
    return chi2, p, df, v

for col in covars:
    tab = pd.crosstab(dataset[col], dataset["test_prep"])
    chi2, p, df, v = chi2_and_cramers_v(tab)
    print(f"{col}: chi2={chi2:.3f}, df={df}, p={p:.3f}, Cramér's V={v:.3f}")

variable	chi2	df	p	cramers_v
race_ethnicity	5.488	4	0.241	0.074
parent_edu	9.544	5	0.089	0.098
lunch	0.221	1	0.638	0.015

Interpretation: None of the χ² test is significant (p>0.05), and Cramér’s V values are small, so we do not reject any of the null hypotheses. This suggests no evidence (in these observed variables) that course participation is driven by systematic selection. However, unmeasured confounding may still exist (student's decision about taking part in course could have been caused by low baseline level or high motivation).

5) Outlier Control and Flagging

Outliers are flagged within each 2×2 cell using the standard 1.5×IQR rule:

def flag_outliers(s, k=1.5):
    q1, q3 = np.percentile(s, [25, 75])
    iqr = q3 - q1
    low = q1 - k * iqr
    high = q3 + k * iqr
    return (s < low) | (s > high)

dataset["cell"] = dataset["gender"].astype(str) + " × " + dataset["test_prep"].astype(str)
dataset["is_outlier"] = dataset.groupby("cell")["math_score"].transform(flag_outliers)
dataset["is_outlier"].value_counts()

Flagged outliers: 11 / 1000

Checking if outliers go beyond the 0–100 range:

n_bad = dataset.loc[dataset["is_outlier"].eq(True) & ((dataset["math_score"]<0) | (dataset["math_score"]>100))]

print(n_bad.shape[0])

Outliers beyond range: 0

Decision: We do not remove outliers by default because math scores remain within the valid range (0–100), i.e., they are plausible observations. Instead, we will later run a sensitivity/robustness check by repeating key steps without outliers.

6) ANOVA Assumptions

First two assumptions are met:

independence of observations: every observation represents different student,
depednent variable is continuous.

Now we need to check another two assumptions.

6.1 Homogeneity of variances (Brown–Forsythe / Levene median-centered)

H0: variances are equal across the 2×2 cells
H1: at least one cell has a different variance

groups = [g["math_score"].values for _, g in dataset.groupby("cell")]
F, p = levene(*groups, center="median")

k = len(groups)
N = sum(len(g) for g in groups)
df1 = k - 1
df2 = N - k

print(f"Levene (Brown–Forsythe): F({df1}, {df2}) = {F:.2f}, p = {p:.3f}")

Result: F(3, 996) = 1.13, p = 0.638
Interpretation: Levene / Brown-Fortyshe test is not significant (p>0.05), so we do not reject the null hipotesis and assume homogeneity of variances as satisfied.

6.2 Normality of residuals (Q–Q plot + Shapiro–Wilk)

OLS model with residuals

model = ols("math_score ~ C(gender) * C(test_prep)", data=dataset).fit()
resid = model.resid

Q–Q plot

qqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()

Interpretation: The distribution of residuals appears close to normal.

Shapiro–Wilk test

H0: residuals are normally distributed
H1: residuals deviate from normality

W, p = shapiro(model.resid)
print(f"Shapiro–Wilk (residuals): W = {W:.3f}, p = {p:.3f}")

Result: W = 0.995, p = 0.004
Interpretation: The overall Shapiro–Wilk test is significant (p<0.05), so we should reject the null hipotesis, but with n = 1 000 Shapiro–Wilk can be sensitive (and Q–Q plot shows approximately normal distribution); therefore, we additionally check residual normality within each cell.

Shapiro–Wilk test per cell

dataset["resid"] = model.resid

print("Shapiro-Wilk test per group:\n")

for cell, sub in dataset.groupby("cell"):
    W, p = shapiro(sub["resid"])
    print(f"{cell}: W = {W:.3f}, p = {p:.3f}")

cell	W	p
female × completed	0.989	0.180
female × none	0.990	0.021
male × completed	0.987	0.110
male × none	0.992	0.086

Interpretation: Only one cell shows as statistically significant deviation (female × none).
To verify whether extreme values drive this deviation, we re-fit the model excluding flagged outliers and repeat Q–Q plot and Shapiro–Wilk test.

OLS model with residuals (no outliers)

model2 = ols("math_score ~ C(gender) * C(test_prep)", data=dataset[dataset["is_outlier"]==False]).fit()
resid2 = model2.resid

Q–Q plot (no outliers)

qqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()

Shapiro-Wilk test (no outliers)

W, p = shapiro(resid2)
print(f"Shapiro-Wilk test (no outliers): W = {W:.3f}, p = {p:.3f}")

Result: W = 0.997, p = 0.088
Conclusion: In our sample, the normality assessment is driven by outliers: the Shapiro–Wilk test rejects normality when outliers are included, but normality is not rejected after outlier removal. Therefore, we will run the two-way ANOVA, but both with and without outliers and check if it provides stable results.

7) Two-Way ANOVA (Type II)

Hypotheses

Gender main effect
- H0: There is no difference in means between genders.
- H1: There is a difference in means between genders.
Course main effect
- H0: There is no difference in means between course participants and non-participants.
- H1: There is a difference in means between course participants and non-participants.
Interaction (gender × course)
- H0: There is no interaction between gender and course (the effect of one factor does not depend on the level of the other factor).
- H1: There is an interaction between gender and course (the effect of one factor depends on the level of the other factor).

anova_table = anova_lm(model, typ=2)
anova_table.round(3)

term	sum_sq	df	F	PR(>F)
C(gender)	6399.210	1.0	29.503	0.000
C(test_prep)	7171.996	1.0	33.061	0.000
C(gender):C(test_prep)	0.908	1.0	0.004	0.948
Residual	216036.801	996	NaN	NaN

Interpretation

Course main effect: statistically significant → we reject the null hipotesis → students who completed the course score higher on average.
Gender main effect: statistically significant → we reject the null hipotesis → mean scores differ by gender.
Interaction: not significant → we fail to reject the null hipotesis → the course effect does not meaningfully differ by gender.

Robustness: ANOVA without outliers

anova_table2 = anova_lm(model2, typ=2)
anova_table2.round(3)

	sum_sq	df	F	PR(>F)
C(gender)	5589.879	1.0	28.246	0.000
C(test_prep)	6545.141	1.0	33.073	0.000
C(gender):C(test_prep)	9.645	1.0	0.049	0.825
Residual	194933.766	985.0	NaN	NaN

Repeating the ANOVA without flagged outliers yields the same qualitative conclusions (course and gender main effects significant; interaction not significant).

9) 95% Confidence Intervals for Mean Differences

x_c = DescrStatsW(dataset.loc[dataset["test_prep"] == "completed", "math_score"])
x_n = DescrStatsW(dataset.loc[dataset["test_prep"] == "none", "math_score"])
cm_prep = CompareMeans(x_c, x_n)
diff_prep = x_c.mean - x_n.mean
ci_prep = cm_prep.tconfint_diff(alpha=0.05, usevar="pooled")

x_m = DescrStatsW(dataset.loc[dataset["gender"] == "male", "math_score"])
x_f = DescrStatsW(dataset.loc[dataset["gender"] == "female", "math_score"])
cm_gender = CompareMeans(x_m, x_f)
diff_gender = x_m.mean - x_f.mean
ci_gender = cm_gender.tconfint_diff(alpha=0.05, usevar="pooled")

print(f"Course: ΔM = {diff_prep:.2f}, 95% CI [{ci_prep[0]:.2f}, {ci_prep[1]:.2f}]")
print(f"Gender: ΔM = {diff_gender:.2f}, 95% CI [{ci_gender[0]:.2f}, {ci_gender[1]:.2f}]")

Results

Course: ΔM = 5.62, 95% CI [3.69, 7.55]
Gender: ΔM = 5.10, 95% CI [3.24, 6.95]

Interpretation: Both intervals exclude 0, being confirmation of statistically significant mean differences. In a population of similar students, we would expect the true mean difference to fall within [3.69, 7.55] for course participants vs non-participants and within [3.24, 6.95] for males vs females, with 95% confidence.

Final Conclusions

Test preparation course is associated with higher math scores, but the estimated gain is only about +5.6 points on average, which is a modest effect in practical terms (especially on a 0–100 scale). If we were the course authors, we could treat this as a signal to review and strengthen the curriculum (but we should also remember about the possibility that lower-performing students were more likely to enroll, causing the mean difference to be smaller than the true course impact).
No evidence of a course × gender interaction – the course effect appears similar across genders.
Gender shows a statistically significant main effect – about +5.1 points for male vs female.

Limitations and Recommendations

Causality: The dataset is observational. Even though control variables are not significantly associated with course participation, unmeasured confounding may remain (e.g., baseline ability, motivation).
Stronger design: To estimate a causal course effect, use a randomized or stratified assignment (e.g., strata by baseline math level).

Author: Julita Wawreszuk-Chylińska

LinkedIn: Julita Wawreszuk-Chylińska

Thank you for your interest in this project!

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
README.md		README.md
StudentsPerformance.csv		StudentsPerformance.csv

Folders and files

Latest commit

History

Repository files navigation

Two-Way ANOVA (2×2) – Does a Test Preparation Course Appear to Improve Math Scores?

Research Questions

Dataset

Repository Workflow

1) Imports, Data Loading and Preparation

2) Descriptive Statistics

Cell-wise summary (2×2)

3) Visual Signal Check

Boxplot by course and gender

Interaction plot (means)

4) Selection / confounding check

Hypothesis (per variable)

5) Outlier Control and Flagging

6) ANOVA Assumptions

6.1 Homogeneity of variances (Brown–Forsythe / Levene median-centered)

6.2 Normality of residuals (Q–Q plot + Shapiro–Wilk)

OLS model with residuals

Q–Q plot

Shapiro–Wilk test

Shapiro–Wilk test per cell

OLS model with residuals (no outliers)

Q–Q plot (no outliers)

Shapiro-Wilk test (no outliers)

7) Two-Way ANOVA (Type II)

Hypotheses

Interpretation

Robustness: ANOVA without outliers

9) 95% Confidence Intervals for Mean Differences

Final Conclusions

Limitations and Recommendations

Author: Julita Wawreszuk-Chylińska

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages