Skip to content

N3haRao/narac-ra-gwas-prs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🧬 narac-ra-gwas-prs

Sex-Stratified GWAS, Population Stratification Control, and Polygenic Risk Scoring in Rheumatoid Arthritis (NARAC)

This study was carried out as a final project for the course SPH BS 859.

Author: Neha Rao

Tools: PLINK · EIGENSOFT · GMMAT · PRSice2 · LDSC · R · Bash · Python

Environment: HPC (Linux cluster)

Study Type: Case-control GWAS with mixed model association testing


Project Overview

This repository contains a fully reproducible genome-wide association analysis (GWAS) pipeline for Rheumatoid Arthritis (RA) using data from the North American Rheumatoid Arthritis Consortium (NARAC).

The analysis includes:

  • Rigorous genotype quality control
  • Population structure correction via PCA
  • Sex-stratified mixed-model GWAS
  • Polygenic Risk Score (PRS) analysis using external summary statistics
  • SNP heritability estimation using LD Score Regression

This project demonstrates end-to-end genetic epidemiology workflow design, statistical modeling, and HPC-based genomic data processing.


Dataset

North American Rheumatoid Arthritis Consortium (NARAC)

  • 868 RA cases
  • 1,194 controls
  • Predominantly Northern European ancestry
  • Case-control design
  • Unrelated individuals

Cases met American College of Rheumatology criteria. Controls were sourced from the New York Cancer Project.

Note: NARAC genotype data are controlled-access and not included in this repository. This repo provides code and workflow only.

Acknowledgment: Genetic Analysis Workshop 16 (GAW16) https://doi.org/10.1002/gepi.20464


Analysis Workflow

Below is the full analytical pipeline implemented in this project.


Genotype Quality Control (PLINK)

SNP-Level Filters

  • Minor Allele Frequency (MAF) > 0.01
  • Missingness (geno) < 5%
  • Hardy–Weinberg Equilibrium p > 1e-6

Sample-Level Filters

  • Individual missingness (mind) < 5%

Output:

cleaned/ra_cleaned.*

LD Pruning + PCA (EIGENSOFT smartpca)

To control for population stratification:

  • LD pruning (--indep-pairwise)
  • PCA on pruned SNP set
  • Extraction of top 10 PCs
  • Logistic regression of PCs vs case status to identify significant covariates

Output:

pca/ra_pruned.evec
pca/RA_pcs.txt

Sex-Stratified GWAS (Mixed Model: GMMAT)

To account for relatedness and residual structure:

  • Built Genetic Relationship Matrix (GRM) per sex
  • Fitted logistic mixed models (glmmkin)
  • Genome-wide score test (glmm.score)

Models:

Null model

case ~ 1

Covariate model

case ~ PC2 + PC3

(PCs selected based on association testing)

Output:

gwas/test.glmm.score.female.*
gwas/test.glmm.score.male.*

Polygenic Risk Scores (PRSice2)

PRS computed using external GWAS summary statistics:

  • European meta-analysis
  • Asian meta-analysis

Covariates included selected PCs.

Output:

prs/PRS_European.*
prs/PRS_Asian.*

SNP Heritability (LD Score Regression)

  • Summary statistics munged using munge_sumstats.py
  • Heritability estimated using ancestry-matched LD scores
  • Liability-scale correction applied

Output:

RA_h2_European.log
RA_h2_Asian.log

Results Snapshot

Quality Control

  • Initial SNPs: 544,276

  • Post-QC SNPs: 502,304

  • Final individuals retained: 2062

    • 868 RA cases
    • 1194 controls
  • LD-pruned SNPs used for PCA: 79,644

No individuals were removed during mind filtering, preserving full study power.


Population Stratification (PCA)

ANOVA testing of eigenvectors revealed significant case-control differences in:

  • PC1 (p = 1.62e-18)
  • PC2 (p = 3.56e-26)
  • PC4 (p = 2.09e-61)

PC1 and PC4 were selected as covariates in downstream GWAS models to control for population structure.


Sex-Stratified GWAS

Genomic Inflation

Analysis Lambda GC
Female (no covariates) 1.01
Female (with PCs) 1.01
Male (no covariates) 1.02
Male (with PCs) 1.01

These values indicate minimal inflation and effective control of confounding.


Association Results

  • Strong genome-wide significance in the HLA region (chromosome 6p21) in both males and females.
  • No additional loci reached genome-wide significance outside the HLA region.
  • Suggestive signals observed on chromosomes 8, 9, and 20.

These findings are consistent with known RA genetic architecture.


SNP Heritability (LD Score Regression)

Using summary statistics from Okada et al. (2014):

  • European and Asian heritability estimates were calculated
  • Liability-scale conversion applied
  • Cross-ancestry comparison performed

Key Methodological Strengths

  • Mixed-model GWAS accounting for relatedness
  • Explicit correction for population stratification
  • Sex-stratified analysis
  • Cross-ancestry PRS comparison
  • HPC reproducible pipeline design
  • Integration of Bash, R, Python, and external genomic tools

Technical Skills Demonstrated

  • Genome-wide association analysis
  • Logistic mixed modeling
  • Genetic relationship matrices
  • Polygenic risk scoring
  • LD score regression
  • Linux/HPC scripting
  • End-to-end genomic pipeline design
  • R statistical modeling
  • Python file parsing
  • Reproducible research practices

References

  • Purcell S, Neale B, Todd-Brown K, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559-575. doi:10.1086/519795
  • Price AL, Patterson NJ, Plenge RM, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904-909. doi:10.1038/ng1847
  • Anderson CA, Pettersson FH, Clarke GM, et al. Data quality control in genetic case-control association studies. Nat Protoc. 2010;5(9):1564-1573. doi:10.1038/nprot.2010.116
  • Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17(10):1520-1528. doi:10.1101/gr.6665407

About

End-to-end GWAS pipeline for rheumatoid arthritis (NARAC): QC, PCA-based stratification control, sex-stratified mixed-model association testing, PRS, and LD score regression.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors