Meng's Notes: Simple Enrichment Test -- calculate hypergeometric p-values in R

Wednesday, December 19, 2012

Simple Enrichment Test -- calculate hypergeometric p-values in R

Hypergeometric tests are useful for enrichment analysis. For example, with a gene list in hand, people might want to tell which functions (GO terms) are enriched among these genes. The hypergeometric test (or its equivalent, the one-tailed Fisher's exact test) quantifies the statistical significance of the enrichment as a $p$-value.
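As a rough sketch of what this looks like for a gene list, here is a GO-term enrichment calculation using phyper (its parameters are explained further down). All counts and variable names are made up for illustration and do not come from any real data set.

```r
# Hypothetical counts, for illustration only
genesInGenome <- 20000  # background: all annotated genes
genesWithTerm <- 400    # background genes annotated with the GO term
genesInList   <- 150    # size of the gene list in hand
listWithTerm  <- 10     # genes in the list annotated with the GO term

# Enrichment p-value: P[X >= listWithTerm] under the hypergeometric null.
# lower.tail = FALSE gives P[X > q], hence the "- 1".
pEnrich <- phyper(listWithTerm - 1,
                  genesWithTerm,
                  genesInGenome - genesWithTerm,
                  genesInList,
                  lower.tail = FALSE)
pEnrich
```

With these made-up numbers only about 3 annotated genes would be expected in the list by chance (150 * 400 / 20000), so observing 10 yields a small p-value.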

R provides the functions phyper and fisher.test for the hypergeometric test and Fisher's exact test, respectively. However, it is tricky to get the parameters right, so I spent some time working it out.

Here is a simple example:
Five cards are drawn from a well-shuffled deck of 52 cards, 13 of which are diamonds.
x = the number of diamonds selected.
We use a 2x2 table to represent this case:

            Diamond    Non-Diamond    Total
selected    x          5-x            5  (sampled cards)
left        13-x       34+x           47 (cards left after sampling)
total       13         39             52 (all cards)

We are asking whether diamonds are enriched or depleted among the selected cards, compared to the background.

Here are the different parameters used by phyper and fisher.test:

phyper(x, 13, 39, 5, lower.tail=TRUE);

# Numerical parameters in order:

# (success-in-sample, success-in-bkgd, failure-in-bkgd, sample-size).

fisher.test(matrix(c(x, 13-x, 5-x, 34+x), 2, 2), alternative='less');

# Numerical parameters in order:

# (success-in-sample, success-in-left-part, failure-in-sample, failure-in-left-part).

The hypergeometric test compares the sample to the whole background, while Fisher's exact test compares the sample to the part of the background left after sampling without replacement. They give the same p-value because both rest on the same hypergeometric distribution.
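One way to convince yourself of this is to compare the two functions for every possible x in the card example. This is only a small sketch; the helper makeTable is a name introduced here for illustration.

```r
hitInPop   <- 13
failInPop  <- 39
sampleSize <- 5

# Build the 2x2 table for x diamonds among the selected cards.
# matrix() fills column by column: (x, 13-x) then (5-x, 34+x).
makeTable <- function(x) {
  matrix(c(x, hitInPop - x,
           sampleSize - x, failInPop - sampleSize + x),
         nrow = 2, ncol = 2)
}

x <- 0:5

# Depletion p-values P[X <= x] from both functions
pPhyper <- phyper(x, hitInPop, failInPop, sampleSize, lower.tail = TRUE)
pFisher <- sapply(x, function(k) {
  fisher.test(makeTable(k), alternative = "less")$p.value
})

# Should report TRUE if the two approaches agree for every x
all.equal(pPhyper, pFisher)
```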

Here are the results:

x=1; # x could be 0~5

hitInSample = 1 # could be 0~5

hitInPop = 13

failInPop = 52 - hitInPop

sampleSize = 5

Test for under-representation (depletion)

phyper(hitInSample, hitInPop, failInPop, sampleSize, lower.tail= TRUE);

## [1] 0.6329532

fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='less')$p.value;

## [1] 0.6329532

Test for over-representation (enrichment)

phyper(hitInSample-1, hitInPop, failInPop, sampleSize, lower.tail= FALSE);

## [1] 0.7784664

fisher.test(matrix(c(hitInSample, hitInPop-hitInSample, sampleSize-hitInSample, failInPop-sampleSize +hitInSample), 2, 2), alternative='greater')$p.value;

## [1] 0.7784664

Why hitInSample-1 when testing over-representation?

Because if lower.tail is TRUE (the default), phyper returns P[X ≤ x]; otherwise it returns P[X > x], which excludes x itself. Since X only takes integer values, P[X ≥ x] = P[X > x-1], so we subtract 1 from x when P[X ≥ x] is needed.
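The off-by-one can be checked directly by summing the point probabilities with dhyper. The short variable names (m, n, k) and result names below are just for this sketch.

```r
x <- 1   # diamonds in the sample
m <- 13  # diamonds in the deck
n <- 39  # non-diamonds in the deck
k <- 5   # sample size

# P[X >= x] summed directly from the point probabilities
pDirect <- sum(dhyper(x:k, m, n, k))

# lower.tail = FALSE gives P[X > q], so pass q = x - 1 to include x itself
pUpper <- phyper(x - 1, m, n, k, lower.tail = FALSE)

# Without the subtraction we get P[X > x], which drops the P[X = x] term
pStrict <- phyper(x, m, n, k, lower.tail = FALSE)

all.equal(pDirect, pUpper)                       # the two should agree
all.equal(pUpper - pStrict, dhyper(x, m, n, k))  # the gap is exactly P[X = x]
```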

So does fisher.test have any advantages over phyper, given that they give the same p-values?
Yes, fisher.test can do two other jobs: a two-sided test, and a confidence interval for the odds ratio. Please refer to its manual for details. For one-sided p-values, the two functions give identical results as long as the correct parameters are used.
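For the card example, those two extras could be obtained roughly like this. It is a minimal sketch that reuses the variable names from above; the names tab and res are introduced here.

```r
hitInSample <- 1
hitInPop    <- 13
failInPop   <- 39
sampleSize  <- 5

# Same 2x2 table as before, filled column by column
tab <- matrix(c(hitInSample, hitInPop - hitInSample,
                sampleSize - hitInSample,
                failInPop - sampleSize + hitInSample),
              nrow = 2, ncol = 2)

# Two-sided test with a 95% confidence interval for the odds ratio
res <- fisher.test(tab, alternative = "two.sided", conf.level = 0.95)

res$p.value   # two-sided p-value
res$estimate  # conditional maximum likelihood estimate of the odds ratio
res$conf.int  # confidence interval for the odds ratio
```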

21 comments:

Xianjun Dong, April 16, 2015 at 8:39 AM
Nice note!
flies, September 21, 2015 at 12:06 PM
"We subtract x by 1, when P[X ≥ x] is needed." But you subtract one for both depletion and enrichment tests...
Pablo, January 22, 2016 at 8:17 AM
Thanks for the note. Since you begin the post talking about GO term enrichment in a set of genes (reason why I landed here during a google search), I think the example should reflect this analysis instead of the how many diamonds from a set of cards (those of us who don't play cards might not even know how many diamonds should be in a deck to start with ;)). Also, using the same color for different meaning parameters of the two functions can be a bit misleading! All best, Pablo
Unknown, January 22, 2016 at 9:29 AM
Thank you, Pablo, for the great suggestions! I would have made it better if I had known people actually read this. Again, thanks!
Pablo, January 22, 2016 at 9:40 AM
Your post was actually top-ranked in my search, so people are surely reading it! I got into reading your other posts and the one explaining the PCA is fabulous. Thank you very much for putting this much time and effort on explaining these very important--but often misunderstood--concepts on bioinformatics. Please keep posting, I'll make sure to come back!
Adi B., April 13, 2016 at 10:08 PM
Superb summary, Meng, thanks for that.
Anonymous, June 13, 2017 at 12:24 PM
Hi - thanks for posting, but please update
failInPop = 54-hitInPop
to
failInPop = 52-hitInPop
Anonymous, September 8, 2017 at 9:05 AM
Great post but the colors make it a little hard to compare fisher to phyper. Slight re-order of the colors would make it more clear.
Kumar, October 4, 2017 at 5:46 PM
Useful post.
I have a question regarding the interpretation of p-values.
In both cases (i.e. depletion and enrichment), the p-value is > 0.05 (if 0.05 is my threshold). How should I interpret that? I understand that it is neither enriched nor depleted in the above example. Is that correct? In other words, if the p-value had been less than 0.05, would I say that this sample is enriched or depleted? Thanks
Kumar, October 10, 2017 at 1:00 PM
Yes, it is easier to interpret now. Thank you.
Unknown, July 24, 2018 at 3:43 PM
This has been really helpful to me, thank you!
Anonymous, May 28, 2019 at 9:39 PM
Great post! Thank you (:
Anonymous, February 8, 2024 at 9:27 PM
Thank you for this beautifully lucid blog entry.