5 Enrichment Statistics
Enrichment statistics are based on a contingency table like so:
..in term | ..not in term | Total | |
---|---|---|---|
..in gene list | 50 | 100 | 150 |
..not in gene list (but in background) | 200 | 15900 | 16100 |
Total | 250 | 16000 | 16250 |
This is based on the 16250 genes that were measured in your experiment.
Note that there might be extra genes that weren’t measured these are excluded from the calculations entirely. E.g. There might have been an extra 5000 terms (some of which might have been annotated with the term of interest), making for 21250 annotated genes.
5.1 Fisher’s Exact Test
Fisher’s Exact Test is a statistical test used to determine if there are nonrandom associations between the proportions of two categorical variables. It calculates the exact probability of observing the given distribution of counts in a 2x2 contingency table, under the null hypothesis of no association between the variables.
Note: This is just a toy calculator for this training, it is quite limited. You can also use some online tools like Social Science Statistics to play with.
Formula: \[P = \frac{(a + b)!(c + d)!(a + c)!(b + d)!}{a!b!c!d!N!}\]
Where:
\(a\), \(b\), \(c\), and \(d\) are the observed counts in the 2x2 contingency table.
\(N\) is the total number of observations, \(N = a + b + c + d\).
Given this contingency table:
Category 1 | Category 2 | Total | |
---|---|---|---|
Group 1 | \(a\) | \(b\) | \(a + b\) |
Group 2 | \(c\) | \(d\) | \(c + d\) |
Total | \(a + c\) | \(b + d\) | \(a + b + c + d\) |
R syntax
5.2 Hypergeometric Test
Hypergeometric test calculates the probability of observing the given number of genes from a specific category (e.g., a pathway) in the gene list (differentially expressed genes) by chance. It models the situation where you draw a sample (the gene list) from a finite population (the background of all genes), and success is defined as a gene being in the category (e.g., belonging to the pathway).
Note: Here is a tool by Stat Trek to play around with the hypergeometric test.
Formula: \[P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}\] Where:
\(N\) = Total number of items in the population.
\(K\) = Number of success items in the population.
\(n\) = Number of items in the sample.
\(k\) = Number of success items in the sample.
The parameters in our example: N=16250; K=250; n=150; k=50
R syntax
Where:
k−1 is the number of observed successes minus 1 (for the “at least” scenario). lower.tail = FALSE gives the probability of getting at least k successes (right-tail).
5.3 Activity
Challenge: Interactive Calculator
Link to open toy enrichment calculator.
This calculates enrichment for a single hypothetical genelist (e.g. your RNAseq differentially expressed genelist) against a single hypothetical ‘term’ (some set of interesting genes, e.g. synaptic signaling genes). It makes a Venn diagram and a wordy description of what is being tested.
You can adjust various factors and see their effect on the enrichment p-values.
Questions
-
If 24 of the 300 differentially expressed genes are annotated with the 500-gene term of interest. Is it significant at p=0.05?
Show
No, corrected pval=0.087
-
What about with a smaller background of 5000 genes (e.g. proteomic datasets)?
Show
Even less so - corrected pval=1
-
Or, testing against a smaller database of terms; 2000 terms instead of 10000? With the original 16000 gene background.
Show
Yes, now corrected pval=0.017
-
19 out of 200 differentially expressed genes (9.5%), need to hit for a 500-gene term (3.1% of all genes) to be significant at (p=0.048). How many hits would be needed for a more specific 30-gene term?
Show
5 hits - 2.5% of the differentially expressed genes vs 0.19% of all genes