Making sense of gene and protein lists with functional enrichment analysis

5 Enrichment Statistics

Enrichment statistics are based on a contingency table like so:

| | In term | Not in term | Total |
|---|---|---|---|
| In gene list | 50 | 100 | 150 |
| Not in gene list (but in background) | 200 | 15900 | 16100 |
| Total | 250 | 16000 | 16250 |

This is based on the 16250 genes that were measured in your experiment.

Note that there may be additional genes that weren’t measured; these are excluded from the calculations entirely. For example, there might be an extra 5000 annotated genes (some of them annotated with the term of interest), giving 21250 annotated genes in total, but only the 16250 measured genes are counted.
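As a minimal sketch, the example table can be entered in base R as a matrix. The object name `tab` and the dimension labels are illustrative choices, not part of the original material. The last step works out how many gene-list genes would fall in the term if there were no association.

```r
# Example contingency table from this section (counts of measured genes).
tab <- matrix(
  c(50,  100,
    200, 15900),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gene_list = c("in_list", "not_in_list"),
    term      = c("in_term", "not_in_term")
  )
)

# Add row and column totals to reproduce the full table.
addmargins(tab)

# Hits expected by chance: (gene list size * term size) / background size.
expected_hits <- sum(tab["in_list", ]) * sum(tab[, "in_term"]) / sum(tab)
expected_hits
```

With these counts, about 2.3 hits are expected by chance, compared with the 50 observed.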


5.1 Fisher’s Exact Test

Fisher’s Exact Test is a statistical test used to determine whether there is a nonrandom association between two categorical variables. It calculates the exact probability of observing the given distribution of counts in a 2x2 contingency table, under the null hypothesis of no association between the variables. The p-value is then the sum of such probabilities over the observed table and all tables at least as extreme.

Note: This is just a toy calculator for this training; it is quite limited. You can also experiment with online tools such as Social Science Statistics.

Formula: \[P = \frac{(a + b)!(c + d)!(a + c)!(b + d)!}{a!b!c!d!N!}\]

Where:

  • \(a\), \(b\), \(c\), and \(d\) are the observed counts in the 2x2 contingency table.

  • \(N\) is the total number of observations, \(N = a + b + c + d\).

Given this contingency table:

| | Category 1 | Category 2 | Total |
|---|---|---|---|
| Group 1 | \(a\) | \(b\) | \(a + b\) |
| Group 2 | \(c\) | \(d\) | \(c + d\) |
| Total | \(a + c\) | \(b + d\) | \(a + b + c + d\) |
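As a numeric check of the formula, the sketch below evaluates it for the example counts from the start of this section (a = 50, b = 100, c = 200, d = 15900). It works on the log scale because factorials of this size overflow ordinary floating point. The variable names are illustrative, and the comparison with `dhyper()` relies on the formula being the hypergeometric probability of one specific table.

```r
# Example counts, named as in the 2x2 table above.
x <- c(a = 50, b = 100, c = 200, d = 15900)
N <- sum(x)

# log P = sum of log-factorials of the margins
#         minus log-factorials of the cells and of N.
log_p <- lfactorial(x[["a"]] + x[["b"]]) + lfactorial(x[["c"]] + x[["d"]]) +
  lfactorial(x[["a"]] + x[["c"]]) + lfactorial(x[["b"]] + x[["d"]]) -
  sum(lfactorial(x)) - lfactorial(N)

# Probability of observing exactly this table.
exp(log_p)

# The same quantity from the built-in hypergeometric density.
log_p_dhyper <- dhyper(x[["a"]], x[["a"]] + x[["c"]], x[["b"]] + x[["d"]],
                       x[["a"]] + x[["b"]], log = TRUE)
all.equal(log_p, log_p_dhyper)
```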

R syntax
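The R call for this section is not shown on this page. A minimal sketch, assuming base R's `fisher.test()` applied to the example table from the start of this section, might look like the following. The choice of `alternative = "greater"` (testing for over-representation only) is an assumption here; the function's default is a two-sided test.

```r
# Example table: rows are gene list membership, columns are term membership.
tab <- matrix(
  c(50,  100,
    200, 15900),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("in gene list", "not in gene list"),
    c("in term", "not in term")
  )
)

# One-sided Fisher's exact test for enrichment of the term in the gene list.
ft <- fisher.test(tab, alternative = "greater")

ft$p.value   # enrichment p-value
ft$estimate  # conditional maximum-likelihood estimate of the odds ratio
```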

5.2 Hypergeometric Test

The hypergeometric test calculates the probability of observing, by chance, at least the given number of genes from a specific category (e.g., a pathway) in the gene list (e.g., differentially expressed genes). It models the situation where you draw a sample (the gene list) from a finite population (the background of all measured genes), and success is defined as a gene being in the category (e.g., belonging to the pathway).

Note: Here is a tool by Stat Trek to play around with the hypergeometric test.

Formula: \[P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}\]

Where:

  • \(N\) = Total number of items in the population.

  • \(K\) = Number of success items in the population.

  • \(n\) = Number of items in the sample.

  • \(k\) = Number of success items in the sample.

The parameters in our example: N=16250; K=250; n=150; k=50

R syntax
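The R call itself is not shown on this page. A minimal sketch consistent with the description that follows, assuming base R's `phyper()` and the example parameters above, is:

```r
N <- 16250  # measured (background) genes
K <- 250    # background genes annotated with the term
n <- 150    # genes in the gene list
k <- 50     # gene-list genes annotated with the term

# P(X >= k). phyper(q, ...) with lower.tail = FALSE returns P(X > q),
# so q is set to k - 1.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_enrich

# Probability of exactly k hits, P(X = k), as in the formula above.
p_exact <- dhyper(k, K, N - K, n)
p_exact
```

Here `phyper()` takes the number of successes in the population (`K`), the number of failures (`N - K`), and the sample size (`n`), in that order.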

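Where:

  • k − 1 is the number of observed successes minus 1. With lower.tail = FALSE the function returns \(P(X > q)\), so passing \(k - 1\) gives \(P(X \ge k)\), the “at least” scenario.

  • lower.tail = FALSE gives the right-tail probability, i.e. the probability of getting at least k successes.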


5.3 Activity

Challenge: Interactive Calculator

Link to open toy enrichment calculator.

This calculates enrichment for a single hypothetical gene list (e.g. your RNA-seq differentially expressed gene list) against a single hypothetical ‘term’ (some set of interesting genes, e.g. synaptic signaling genes). It draws a Venn diagram and gives a written description of what is being tested.

You can adjust various factors and see their effect on the enrichment p-values.
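To explore the same factors offline, here is a rough sketch in R. The function `enrichment_p()` and its arguments are hypothetical names introduced for illustration. It is not the calculator's code, and the Bonferroni-style correction (multiplying by the number of terms tested and capping at 1) is an assumption, so its output need not match the calculator's.

```r
# Hypothetical helper for illustration; not the calculator's own code.
enrichment_p <- function(hits, list_size, term_size, background, n_terms = 1) {
  # One-sided hypergeometric p-value: P(X >= hits).
  p_raw <- phyper(hits - 1, term_size, background - term_size, list_size,
                  lower.tail = FALSE)
  # Assumed Bonferroni correction for n_terms tested terms, capped at 1.
  min(1, p_raw * n_terms)
}

# Same overlap (50 of 150 genes, 250-gene term) with two background sizes.
p_bg_large <- enrichment_p(50, 150, 250, background = 16250)
p_bg_small <- enrichment_p(50, 150, 250, background = 5000)

# Same test, corrected for term databases of different sizes.
p_terms_10000 <- enrichment_p(50, 150, 250, 16250, n_terms = 10000)
p_terms_2000  <- enrichment_p(50, 150, 250, 16250, n_terms = 2000)

c(p_bg_large, p_bg_small, p_terms_10000, p_terms_2000)
```

A smaller background raises the p-value because the term then makes up a larger share of the background. Testing fewer terms lowers the corrected p-value.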

Questions

  1. Suppose 24 of the 300 differentially expressed genes are annotated with the 500-gene term of interest. Is the enrichment significant at p=0.05?

    Answer: No, the corrected p-value is 0.087.

  2. What about with a smaller background of 5000 genes (e.g. proteomic datasets)?

    Answer: Even less so; the corrected p-value is 1.

  3. What about testing against a smaller database of 2000 terms instead of 10000, with the original 16000-gene background?

    Answer: Yes, the corrected p-value is now 0.017.

  4. For a 500-gene term (3.1% of all genes), 19 of 200 differentially expressed genes (9.5%) need to hit the term for the enrichment to be significant (p=0.048). How many hits would be needed for a more specific 30-gene term?

    Answer: 5 hits, which is 2.5% of the differentially expressed genes versus 0.19% of all genes.