Skip to content

julita-w-chylinska/Statistical-Analysis-Python-Two-way-ANOVA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 

Repository files navigation

Two-Way ANOVA (2×2) – Does a Test Preparation Course Appear to Improve Math Scores?

This repository presents a statistical inference workflow in Python using a two-way ANOVA (2×2 design).

Case study simulation: we treat the public StudentsPerformance dataset as if it were the result of evaluating some real test preparation course, and we want to estimate whether students who complete this course score higher in math and whether the pattern differs by gender.

Scope note: The dataset is observational (not a randomized experiment) and contains post-test scores only, therefore the results quantify associations and estimated group differences, not strict causality. A causal claim would require randomized assignment or a stronger quasi-experimental design.

Significance level: We use the conventional α = 0.05 for all hypothesis tests.


Research Questions

  1. Course effect (primary): Do students who complete the test preparation course score higher in math than those who did not?
  2. Moderation by gender: Is the course effect different for female vs male students?
  3. Gender main effect (secondary): Are there mean differences by gender, regardless of test preparation course participation?

Dataset

Sample size: 1 000 students
Missing values: 0

For this analysis, the dependent variable is:
math score* (0–100, continuous)

Independent variables in the 2×2 design:
test preparation course* (none vs completed)
gender (female vs male)

Potential observed confounders (checked for association with test preparation course):
race/ethnicity* – parental level of education* – lunch

Variables not used in the analysis: – reading scorewriting score

* at the beginning of the analysis, these variable names are standardized (shortened and converted to snake_case)


Repository Workflow

  1. Imports, data loading and preparation
  2. Descriptive statistics per group
  3. Visual signal check (boxplot and interaction plot)
  4. Selection / confounding check – Chi-squared test
  5. Outlier control and flagging
  6. ANOVA assumptions:
    • homogeneity of variances (Brown–Forsythe / Levene test)
    • normality of residuals (Q–Q plot + Shapiro–Wilk test)
  7. Two-way ANOVA (Type II SS)
  8. Confidence Intervals
  9. Final conclusions, limitations and recommendations

1) Imports, Data Loading and Preparation

import kagglehub
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.api import qqplot

from scipy.stats import levene, shapiro, chi2_contingency
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.weightstats import DescrStatsW, CompareMeans


path = kagglehub.dataset_download("spscientist/students-performance-in-exams")
dataset = pd.read_csv((f"{path}\StudentsPerformance.csv"))


dataset = dataset.drop(columns=["reading score", "writing score"])

dataset = dataset.rename(columns={
    "race/ethnicity": "race_ethnicity",
    "parental level of education": "parent_edu",
    "test preparation course": "test_prep",
    "math score": "math_score"
})

cat_cols = dataset.select_dtypes(include=["object"]).columns
dataset[cat_cols] = dataset[cat_cols].astype("category")

dataset["test_prep"] = pd.Categorical(
    dataset["test_prep"], categories=["none", "completed"], ordered=True
)

2) Descriptive Statistics

desc_stat = (dataset
        .groupby(["gender", "test_prep"])["math_score"]
        .agg(["count", "mean", "std", "median"])
        .reset_index())
desc_stat.round(2)

Cell-wise summary (2×2)

gender test_prep count mean std median
female none 334 61.67 15.82 62.0
female completed 184 67.20 14.24 67.0
male none 308 66.69 14.05 67.0
male completed 174 72.34 14.22 73.0

Observations: Mean math scores are higher for completed than none in both genders, suggesting a potential course effect. The male means are also higher than the female means across course groups, suggesting a possible gender main effect.


3) Visual Signal Check

Boxplot by course and gender

plt.figure(figsize=(8,5))
ax = sns.boxplot(data=dataset, x="test_prep", y="math_score", hue="gender")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
ax.set_title("Math score by test preparation and gender")
plt.show()
Boxplot: math score by course and gender

Interaction plot (means)

means = (dataset
         .groupby(["test_prep", "gender"])["math_score"]
         .mean()
         .reset_index())

plt.figure(figsize=(7,5))
ax = sns.pointplot(
    data=means, x="test_prep", y="math_score", hue="gender",
    dodge=False, markers=["o","s"], capsize=0.1
)
ax.set_ylim(0, 100)
ax.set_title("Interaction plot: mean math score")
ax.set_xlabel("Test preparation course")
ax.set_ylabel("Math score")
plt.show()
Interaction plot: mean math score

Interpretation: The mean lines are approximately parallel, visually suggesting little to no interaction (course effect looks similar in both genders).


4) Selection / confounding check

Is course participation associated with some student's trait?

Because the dataset does not describe random assignment to the course, we check – with Chi-square (χ²) Test of Independence – whether test_prep is associated with background variables (which could in the end affect the test scores).

Hypothesis (per variable)

  • H0: The background variable is independent of test_prep.
  • H1: The background variable is associated with test_prep.
def chi2_and_cramers_v(tab):
    chi2, p, df, expected = chi2_contingency(tab)
    n = tab.to_numpy().sum()
    r, k = tab.shape
    v = np.sqrt(chi2 / (n * min(r-1, k-1)))
    return chi2, p, df, v

for col in covars:
    tab = pd.crosstab(dataset[col], dataset["test_prep"])
    chi2, p, df, v = chi2_and_cramers_v(tab)
    print(f"{col}: chi2={chi2:.3f}, df={df}, p={p:.3f}, Cramér's V={v:.3f}")
variable chi2 df p cramers_v
race_ethnicity 5.488 4 0.241 0.074
parent_edu 9.544 5 0.089 0.098
lunch 0.221 1 0.638 0.015

Interpretation: None of the χ² test is significant (p>0.05), and Cramér’s V values are small, so we do not reject any of the null hypotheses. This suggests no evidence (in these observed variables) that course participation is driven by systematic selection. However, unmeasured confounding may still exist (student's decision about taking part in course could have been caused by low baseline level or high motivation).


5) Outlier Control and Flagging

Outliers are flagged within each 2×2 cell using the standard 1.5×IQR rule:

def flag_outliers(s, k=1.5):
    q1, q3 = np.percentile(s, [25, 75])
    iqr = q3 - q1
    low = q1 - k * iqr
    high = q3 + k * iqr
    return (s < low) | (s > high)

dataset["cell"] = dataset["gender"].astype(str) + " × " + dataset["test_prep"].astype(str)
dataset["is_outlier"] = dataset.groupby("cell")["math_score"].transform(flag_outliers)
dataset["is_outlier"].value_counts()

Flagged outliers: 11 / 1000

Checking if outliers go beyond the 0–100 range:

n_bad = dataset.loc[dataset["is_outlier"].eq(True) & ((dataset["math_score"]<0) | (dataset["math_score"]>100))]

print(n_bad.shape[0])

Outliers beyond range: 0

Decision: We do not remove outliers by default because math scores remain within the valid range (0–100), i.e., they are plausible observations. Instead, we will later run a sensitivity/robustness check by repeating key steps without outliers.


6) ANOVA Assumptions

First two assumptions are met:

  • independence of observations: every observation represents different student,
  • depednent variable is continuous.

Now we need to check another two assumptions.

6.1 Homogeneity of variances (Brown–Forsythe / Levene median-centered)

  • H0: variances are equal across the 2×2 cells
  • H1: at least one cell has a different variance
groups = [g["math_score"].values for _, g in dataset.groupby("cell")]
F, p = levene(*groups, center="median")

k = len(groups)
N = sum(len(g) for g in groups)
df1 = k - 1
df2 = N - k

print(f"Levene (Brown–Forsythe): F({df1}, {df2}) = {F:.2f}, p = {p:.3f}")

Result: F(3, 996) = 1.13, p = 0.638
Interpretation: Levene / Brown-Fortyshe test is not significant (p>0.05), so we do not reject the null hipotesis and assume homogeneity of variances as satisfied.

6.2 Normality of residuals (Q–Q plot + Shapiro–Wilk)

OLS model with residuals

model = ols("math_score ~ C(gender) * C(test_prep)", data=dataset).fit()
resid = model.resid

Q–Q plot

qqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()
Q-Q plot of residuals

Interpretation: The distribution of residuals appears close to normal.

Shapiro–Wilk test

  • H0: residuals are normally distributed
  • H1: residuals deviate from normality
W, p = shapiro(model.resid)
print(f"Shapiro–Wilk (residuals): W = {W:.3f}, p = {p:.3f}")

Result: W = 0.995, p = 0.004
Interpretation: The overall Shapiro–Wilk test is significant (p<0.05), so we should reject the null hipotesis, but with n = 1 000 Shapiro–Wilk can be sensitive (and Q–Q plot shows approximately normal distribution); therefore, we additionally check residual normality within each cell.

Shapiro–Wilk test per cell

dataset["resid"] = model.resid

print("Shapiro-Wilk test per group:\n")

for cell, sub in dataset.groupby("cell"):
    W, p = shapiro(sub["resid"])
    print(f"{cell}: W = {W:.3f}, p = {p:.3f}")
cell W p
female × completed 0.989 0.180
female × none 0.990 0.021
male × completed 0.987 0.110
male × none 0.992 0.086

Interpretation: Only one cell shows as statistically significant deviation (female × none).
To verify whether extreme values drive this deviation, we re-fit the model excluding flagged outliers and repeat Q–Q plot and Shapiro–Wilk test.

OLS model with residuals (no outliers)

model2 = ols("math_score ~ C(gender) * C(test_prep)", data=dataset[dataset["is_outlier"]==False]).fit()
resid2 = model2.resid

Q–Q plot (no outliers)

qqplot(resid, line="r")
plt.title("Q-Q plot of residuals")
plt.show()
Q-Q plot of residuals (no outliers)

Shapiro-Wilk test (no outliers)

W, p = shapiro(resid2)
print(f"Shapiro-Wilk test (no outliers): W = {W:.3f}, p = {p:.3f}")

Result: W = 0.997, p = 0.088
Conclusion: In our sample, the normality assessment is driven by outliers: the Shapiro–Wilk test rejects normality when outliers are included, but normality is not rejected after outlier removal. Therefore, we will run the two-way ANOVA, but both with and without outliers and check if it provides stable results.


7) Two-Way ANOVA (Type II)

Hypotheses

  • Gender main effect
    • H0: There is no difference in means between genders.
    • H1: There is a difference in means between genders.
  • Course main effect
    • H0: There is no difference in means between course participants and non-participants.
    • H1: There is a difference in means between course participants and non-participants.
  • Interaction (gender × course)
    • H0: There is no interaction between gender and course (the effect of one factor does not depend on the level of the other factor).
    • H1: There is an interaction between gender and course (the effect of one factor depends on the level of the other factor).
anova_table = anova_lm(model, typ=2)
anova_table.round(3)
term sum_sq df F PR(>F)
C(gender) 6399.210 1.0 29.503 0.000
C(test_prep) 7171.996 1.0 33.061 0.000
C(gender):C(test_prep) 0.908 1.0 0.004 0.948
Residual 216036.801 996 NaN NaN

Interpretation

  • Course main effect: statistically significant → we reject the null hipotesis → students who completed the course score higher on average.
  • Gender main effect: statistically significant → we reject the null hipotesis → mean scores differ by gender.
  • Interaction: not significant → we fail to reject the null hipotesis → the course effect does not meaningfully differ by gender.

Robustness: ANOVA without outliers

anova_table2 = anova_lm(model2, typ=2)
anova_table2.round(3)
sum_sq df F PR(>F)
C(gender) 5589.879 1.0 28.246 0.000
C(test_prep) 6545.141 1.0 33.073 0.000
C(gender):C(test_prep) 9.645 1.0 0.049 0.825
Residual 194933.766 985.0 NaN NaN

Repeating the ANOVA without flagged outliers yields the same qualitative conclusions (course and gender main effects significant; interaction not significant).


9) 95% Confidence Intervals for Mean Differences

x_c = DescrStatsW(dataset.loc[dataset["test_prep"] == "completed", "math_score"])
x_n = DescrStatsW(dataset.loc[dataset["test_prep"] == "none", "math_score"])
cm_prep = CompareMeans(x_c, x_n)
diff_prep = x_c.mean - x_n.mean
ci_prep = cm_prep.tconfint_diff(alpha=0.05, usevar="pooled")

x_m = DescrStatsW(dataset.loc[dataset["gender"] == "male", "math_score"])
x_f = DescrStatsW(dataset.loc[dataset["gender"] == "female", "math_score"])
cm_gender = CompareMeans(x_m, x_f)
diff_gender = x_m.mean - x_f.mean
ci_gender = cm_gender.tconfint_diff(alpha=0.05, usevar="pooled")

print(f"Course: ΔM = {diff_prep:.2f}, 95% CI [{ci_prep[0]:.2f}, {ci_prep[1]:.2f}]")
print(f"Gender: ΔM = {diff_gender:.2f}, 95% CI [{ci_gender[0]:.2f}, {ci_gender[1]:.2f}]")

Results

  • Course: ΔM = 5.62, 95% CI [3.69, 7.55]
  • Gender: ΔM = 5.10, 95% CI [3.24, 6.95]

Interpretation: Both intervals exclude 0, being confirmation of statistically significant mean differences. In a population of similar students, we would expect the true mean difference to fall within [3.69, 7.55] for course participants vs non-participants and within [3.24, 6.95] for males vs females, with 95% confidence.


Final Conclusions

  1. Test preparation course is associated with higher math scores, but the estimated gain is only about +5.6 points on average, which is a modest effect in practical terms (especially on a 0–100 scale). If we were the course authors, we could treat this as a signal to review and strengthen the curriculum (but we should also remember about the possibility that lower-performing students were more likely to enroll, causing the mean difference to be smaller than the true course impact).
  2. No evidence of a course × gender interaction – the course effect appears similar across genders.
  3. Gender shows a statistically significant main effect – about +5.1 points for male vs female.

Limitations and Recommendations

  • Causality: The dataset is observational. Even though control variables are not significantly associated with course participation, unmeasured confounding may remain (e.g., baseline ability, motivation).
  • Stronger design: To estimate a causal course effect, use a randomized or stratified assignment (e.g., strata by baseline math level).

Author: Julita Wawreszuk-Chylińska

LinkedIn: Julita Wawreszuk-Chylińska

Thank you for your interest in this project!

About

Two-way ANOVA (2×2) in Python: effect of test preparation course and gender on math scores + assumption checks.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors