Making sense of gene and proteins lists with functional enrichment analysis

4 Enrichment Statistics

Enrichment statistics are based on a contingency table like so:

..in term ..not in term Total
..in gene list 30 270 300
..not in gene list (but in background) 470 15230 15700
Total 500 15500 16000

This is based on the 16,000 genes that were measured in your experiment.

Note that some genes may not have been measured; these are excluded entirely from the calculations. For example, there might have been an additional 5,000 genes (some of which could have been annotated with the term of interest), resulting in a total of 21,250 annotated genes.


4.1 Fisher’s Exact Test

Fisher’s Exact Test is a statistical test used to determine if there are nonrandom associations between the proportions of two categorical variables. It calculates the exact probability of observing the given distribution of counts in a 2x2 contingency table, under the null hypothesis of no association between the variables.

Note: This is just a toy calculator for this training, it is quite limited. You can also use some online tools like Social Science Statistics to play with.

Formula: \[P = \frac{(a + b)!(c + d)!(a + c)!(b + d)!}{a!b!c!d!N!}\]

Where:

  • \(a\), \(b\), \(c\), and \(d\) are the observed counts in the 2x2 contingency table.

  • \(N\) is the total number of observations, \(N = a + b + c + d\).

Given this contingency table:

Category 1 Category 2 Total
Group 1 \(a\) \(b\) \(a + b\)
Group 2 \(c\) \(d\) \(c + d\)
Total \(a + c\) \(b + d\) \(a + b + c + d\)
R syntax
a = 30
b = 270
c = 470
d = 15230

data <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)

# One-tailed test
fisher.test(data, alternative = "g") # greater or "less"

4.2 Hypergeometric Test

Hypergeometric test calculates the probability of observing the given number of genes from a specific category (e.g., a pathway) in the gene list (differentially expressed genes) by chance. It models the situation where you draw a sample (the gene list) from a finite population (the background of all genes), and success is defined as a gene being in the category (e.g., belonging to the pathway).

Note: Here is a tool by Stat Trek to play around with the hypergeometric test.

Formula: \[P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}\] Where:

  • \(k\) = Number of success items in the sample (overlap).

  • \(n\) = Number of items in the sample (DE genes).

  • \(K\) = Number of success items in the population (in term).

  • \(N\) = Total number of items in the population (background).

The parameters in our example: k=30; n=300; K=500; N=16000;

Where:

k−1 is the number of observed successes minus 1 (for the “at least” scenario). lower.tail = FALSE gives the probability of getting at least k successes (right-tail).

R syntax
# Observed value
k <- a
n <- a + b
K <- a + c
N <- a + b + c + d

# (P[X >= k])
phyper(k - 1, K, N - K, n, lower.tail = FALSE)

4.3 Activity

Challenge: Interactive Calculator

Link to open toy enrichment calculator.

This calculates enrichment for a single hypothetical genelist (e.g. your RNAseq differentially expressed genelist) against a single hypothetical ‘term’ (some set of interesting genes, e.g. synaptic signaling genes). It makes a Venn diagram and a wordy description of what is being tested.

You can adjust various factors and see their effect on the enrichment p-values.

Questions

  1. If 24 of the 300 differentially expressed genes are annotated with the 500-gene term of interest. Is it significant at p=0.05?

    Show

    No, corrected pval=0.087

     

  2. What about with a smaller background of 5000 genes (e.g. proteomic datasets)?

    Show

    Even less so - corrected pval=1

     

  3. Or, testing against a smaller database of terms; 2000 terms instead of 10000? With the original 16000 gene background.

    Show

    Yes, now corrected pval=0.017

     

  4. 19 out of 200 differentially expressed genes (9.5%), need to hit for a 500-gene term (3.1% of all genes) to be significant at (p=0.048). How many hits would be needed for a more specific 30-gene term?

    Show

    5 hits - 2.5% of the differentially expressed genes vs 0.19% of all genes

4.4 Kolmogorov–Smirnov (KS)–like Test

The Kolmogorov–Smirnov (KS)–like test is used in Functional Class Scoring approaches such as Gene Set Enrichment Analysis (GSEA) to test whether genes from a predefined set are non-randomly distributed near the top or bottom of a ranked list of all genes.

This test computes a running-sum statistic as it walks down the ranked list:

  • The running sum increases when a gene in the set is encountered (weighted by the strength of its rank statistic).

  • It decreases when a gene not in the set is encountered.

  • The maximum deviation from zero is the enrichment score (ES).

The significance of the observed ES is then estimated by permutation testing, generating a null distribution to calculate p-values, followed by FDR correction for multiple comparisons.

Formula:

According to Subramanian et al. (2005), the enrichment score is defined as:

\[ P_{\text{hit}}(S,i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^{p}}{N_R} \]

\[ P_{\text{miss}}(S,i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H} \]

\[ \mathrm{ES}(S) = \max_i \big[ P_{\text{hit}}(i) - P_{\text{miss}}(i) \big] \]

Explanation of the Terms:

Symbol Meaning
\(S\) The gene set being tested (e.g., pathway or GO term).
\(L = \{ g_1, g_2, \dots, g_N \}\) The ranked list of all genes, ordered by correlation with the phenotype.
\(g_j\) The j-th gene in the ranked list.
\(r_j\) The ranking metric for gene \(g_j\) (e.g., signal-to-noise ratio, t-statistic, or correlation).
\(p\) The weighting parameter (usually \(p = 1\) in standard GSEA). Higher \(p\) values emphasize genes with stronger correlations.
\(N_H = |S|\) The number of genes in the gene set \(S\).
\(N\) The total number of genes in the ranked list.
\(N_R = \sum_{g_j \in S} |r_j|^p\) The normalization constant for the “hits” in the gene set.
\(P_{\text{hit}}(i)\) The cumulative fraction of genes in \(S\) found up to position \(i\).
\(P_{\text{miss}}(i)\) The cumulative fraction of genes not in \(S\) found up to position \(i\).
\(\mathrm{ES}(S)\) The enrichment score, i.e., the maximum deviation between \(P_{\text{hit}}(i)\) and \(P_{\text{miss}}(i)\). A positive ES indicates enrichment at the top of the ranked list; a negative ES indicates enrichment at the bottom.

(Adapted from Subramanian et al., 2005, Proc. Natl. Acad. Sci. U.S.A. 102:15545–15550.)