gwas by subtraction doc edit (#869)

trafalmadorian97 · web-flow · commit 5bd5506a2a3a · 2026-07-02T09:56:16.000-04:00
- Edit some of the GWAS by subtraction document. Includes rewording and
clarifications
diff --git a/docs/Bioinformatics_Concepts/GWAS_By_Subtraction.md b/docs/Bioinformatics_Concepts/GWAS_By_Subtraction.md
@@ -14,8 +14,8 @@ It is useful to understand GWAS-by-subtraction via linear algebra.
 Consider a [Euclidian space](https://en.wikipedia.org/wiki/Euclidean_space) in which:
 
 - GWAS traits are vectors.
-- The [inner product](https://en.wikipedia.org/wiki/Inner_product_space) of two traits is their [genetic covariance](Genetic_Correlation.md).  Denote the inner product of $u$ and $v$ as $\langle u,v \rangle$.
-- We assume all phenotypes have been normalized to have variance of 1.  Under this assumption, a trait's squared [Euclidian norm](https://en.wikipedia.org/wiki/Inner_product_space#Norm_properties) is its heritability: $\lVert v \rVert^2=h^2_v$ where $h^2_v$ is the heritability of $v$.
+- The [inner product](https://en.wikipedia.org/wiki/Inner_product_space) of two traits is their [genetic covariance](Genetic_Correlation.md#genetic-covariance).  Denote the inner product of traits $u$ and $v$ as $\langle u,v \rangle$.
+- We assume all phenotypes have been normalized to have variance of 1.  Under this assumption, a trait's squared [Euclidean norm](https://en.wikipedia.org/wiki/Inner_product_space#Norm_properties) is its heritability: $\lVert v \rVert^2=h^2_v$ where $h^2_v$ is the heritability of trait $v$.
 
 
 
@@ -90,13 +90,14 @@ Where:
 - $x\in\mathbb{R}^M$ is the random genotype. We assume $x$ has mean zero, but unlike in [LDSC](LDSC.md), we do not assume it has been variance standardized.  Let $H_i$ be the variance of the $i$th variant.
 - $\beta_F,\beta_R\in\mathbb{R}^M$ are the underlying causal effects of the genetic variants.
 - $F,R$ are the two orthonormal underlying factors.
-- $\delta_1, \delta_2$ are the non-genetic components of the two traits.  We assume these effects are independent of all genotypes.
+- $a_F,a_R,b\in\mathbb{R}$ are the scalar multipliers that relate the normalized factors $F,R$ to the unnormalized factors $F',R'$.
+- $\delta_1, \delta_2\in\mathbb{R}$ are the random non-genetic components of the two traits.  We assume these effects are independent of all genotypes.
 
 
 ### Marginal Model
 
 
-Let's now focus on SNP $i$, and develop a model around the marginal GWAS regression on this SNP.
+Let's now focus on an arbitrary SNP $i$, and model the marginal GWAS regression on this SNP.
 
 
 Define
@@ -125,32 +126,32 @@ R &= \hat\beta_{R,i}x_i+\zeta_{R,i}\\
 \end{align}
 $$
 
-We assume $\zeta_{F,i},\zeta_{R_i}$ are approximately independent of $x_i$.  This is a good approximation so long as individual variant effects ($\beta_{R,i},\beta_{R,i}$) are small, as is the case for most non-Mendelian traits.
+We assume $\zeta_{F,i},\zeta_{R,i}$ are approximately independent of $x_i$.  While not strictly true, this is a good approximation so long as individual variant effects ($\beta_{F,i},\beta_{R,i}$) are small, as is the case for polygenic traits.
 
 ### Theoretical covariance
 
-Next, let us examine the genetic covariance structure of the random variables $(x_i, T_1, T_2)$.  
+Next, let us examine the genetic covariance structure of the scalar random variables $(x_i, T_1, T_2)$.  
 
-We will denote by  $\mathrm{GCov}$ and $\mathrm{GVar}$ the genetic covariance and variance respectively\footnote{Because of our earlier assumption that phenotype variance has been normalized to 1, genetic variance equals heritability.}.
+We will denote by $\mathrm{GCov}$ and $\mathrm{GVar}$ the genetic covariance and genetic variance, respectively[^covnote].
 
 
 $$
 \begin{align}
-&\mathrm{GCov}(X_i, T_1)\\
-&=\mathrm{GCov}(X_i, a_F F + a_R R + \delta_1)\\
-&=\mathrm{Cov}(X_i, a_F F + a_R R ) & \text{Since $\delta_1$ is non-genetic}\\
-&=\mathrm{Cov}(X_i, a_F (\hat\beta_{F,i}X_i+\zeta_{F,i}) + a_R (\hat\beta_{R,i}X_i+\zeta_{R,i})+\delta_1)\\
+&\mathrm{GCov}(x_i, T_1)\\
+&=\mathrm{GCov}(x_i, a_F F + a_R R + \delta_1)\\
+&=\mathrm{Cov}(x_i, a_F F + a_R R ) & \text{Since $\delta_1$ is non-genetic}\\
+&=\mathrm{Cov}(x_i, a_F (\hat\beta_{F,i}x_i+\zeta_{F,i}) + a_R (\hat\beta_{R,i}x_i+\zeta_{R,i}))\\
 &\approx \left(a_F\hat\beta_{F,i}+a_R\hat\beta_{R,i}\right) H_i & \text{By approximate independence}
 \end{align}
 $$
 
 
 $$
 \begin{align}
-&\mathrm{GCov}(X_i, T_2)\\
-&=\mathrm{GCov}(X_i, b F+\delta_2)\\
-&=\mathrm{Cov}(X_i, b F) & \text{Since $\delta_2$ is non-genetic}\\
-&=\mathrm{Cov}(X_i, b (\hat\beta_{F,i}X_i+\zeta_{F,i}))\\
+&\mathrm{GCov}(x_i, T_2)\\
+&=\mathrm{GCov}(x_i, b F+\delta_2)\\
+&=\mathrm{Cov}(x_i, b F) & \text{Since $\delta_2$ is non-genetic}\\
+&=\mathrm{Cov}(x_i, b (\hat\beta_{F,i}x_i+\zeta_{F,i}))\\
 &\approx b\hat\beta_{F,i} H_i  & \text{By approximate independence}
 \end{align}
 $$
@@ -361,3 +362,4 @@ Of the components of $\theta_i$ and $Q_i$, the most interesting is $\hat\beta_{R
 
 
 
+[^covnote]: Because of our earlier assumption that phenotype variance has been normalized to 1, genetic variance equals heritability.
diff --git a/docs/Bioinformatics_Concepts/Genetic_Correlation.md b/docs/Bioinformatics_Concepts/Genetic_Correlation.md
@@ -8,8 +8,8 @@ The model is
 
 $$
 \begin{align}
-Y_A &= E_A + G_A,\\
-Y_B &= E_B + G_B.
+Y_A &= E_A + G_A,  \label{model1} \\ 
+Y_B &= E_B + G_B.  \label{model2}
 \end{align}
 $$
 
@@ -27,4 +27,8 @@ What does it mean biologically when two traits are genetically correlated? The m
 Besides these straightforward cases, there are more exotic possible causes of genetic correlation, as discussed [here](https://gcbias.org/2016/04/19/what-is-genetic-correlation/).  Briefly,
 
 - Two traits can be genetically correlated because genetics affects the behavior of a parent, which affects the phenotype of their child.
-- Two traits can be genetically correlated because individuals with these traits tend to mate at a higher rate than would be expected under random mating.
+- Two traits can be genetically correlated because individuals with these traits tend to mate at a higher rate than would be expected under random mating.
+
+## Genetic Covariance
+
+Some applications require the calculation of the genetic covariance between two traits.  In the context of the model $(\ref{model1},\ref{model2})$, the genetic covariance is $\mathrm{Cov}(G_A, G_B)$.  Note that genetic covariance depends on how the traits are scaled: rescaling trait $A$ by a constant $c$ (so that $G_A$ becomes $cG_A$) multiplies the genetic covariance by $c$.
diff --git a/docs/Bioinformatics_Concepts/Heritability.md b/docs/Bioinformatics_Concepts/Heritability.md
@@ -22,7 +22,7 @@ Note that in the uncorrelated-additive model above, $h^2$ is equal to the [coeff
 
 $$
 \begin{align}
-\mathrm{cor}(G,Y)^2&=\frac{\mathrm{Cov}(Y,G)^2}{\mathrm{Var}(G) \mathrm{Var(Y)}}\\
+\mathrm{Corr}(G,Y)^2&=\frac{\mathrm{Cov}(Y,G)^2}{\mathrm{Var}(G) \mathrm{Var(Y)}}\\
 &=\frac{\mathrm{Cov}(E+G,G)^2}{\mathrm{Var}(G) \mathrm{Var(Y)}}& \text{ by }(\ref{model})\\
 &=\frac{\mathrm{Var}(G)^2}{\mathrm{Var}(G) \mathrm{Var(Y)}} & \text{$G$ and  $E$ uncorrelated}\\
 &=\frac{\mathrm{Var}(G)}{\mathrm{Var}(Y)}\\
@@ -74,20 +74,20 @@ $$
 \begin{align}
 G&:=\mathbb{E} (Y|g),\\
 E&:=Y-G,\\
-h^2&:= \frac{\mathbb{Var}(G) }{\mathbb{Var}(Y)}.
+h^2&:= \frac{\mathrm{Var}(G) }{\mathrm{Var}(Y)}.
 \end{align}
 $$
 
 We have
 
 $$
 \begin{align}
-\mathbb{Cov}(G,E)&=\mathbb{E}(  \mathbb{E}(Y|g) -\mathbb{E}Y   )(  Y- \mathbb{E}(Y|g) )\\
+\mathrm{Cov}(G,E)&=\mathbb{E}(  \mathbb{E}(Y|g) -\mathbb{E}Y   )(  Y- \mathbb{E}(Y|g) )\\
 &=0,
 \end{align}
 $$
 
-where the last line follows from the Projection Theorem (pg. 345 in Grimmet and Stirzaker[@grimmett2020probability]).  Where before we needed to assume $\mathbb{Cov}(E,G)=0$, here this property is automatic.
+where the last line follows from the Projection Theorem (pg. 345 in Grimmett and Stirzaker[@grimmett2020probability]).  Where before we needed to assume $\mathrm{Cov}(E,G)=0$, here this property is automatic.
 
 
  - This approach has the **advantage** of its mathematical clarity.  Whereas the standard definition of heritability requires some fairly restrictive assumptions, this alternative definition is applicable to any phenotype representable by a random variable in $L_2$.  Mathematically, it is now crystal clear what we mean when we speak of $G$ and $E$.