This year’s Biology of Genomes (BoG) meeting maintained its high standard with another display of excellent and exciting science. One of my favourite presentations was given by Matthew Stephens from the University of Chicago on the topic of False Discovery Rates (FDRs).

The FDR is a basic concept in statistical testing that we all come across in our research. By controlling the FDR, we aim to limit the expected proportion of false positives among the significant loci identified by association studies. The idea is that under the null hypothesis (H0, that the locus is not associated with the trait), the observed p-values are expected to be uniformly distributed (Fig. 1(a)), while under the alternative hypothesis (H1), more of the p-values should be close to zero (Fig. 1(b)). In other words, the observed distribution of p-values in a genome-wide scan should be a mixture of these two distributions. Existing FDR methods find a maximum cutoff (Fig. 1(c)) such that results with smaller p-values are likely to be true positives from H1.
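This mixture intuition is easy to sketch with a quick simulation: null p-values drawn uniformly, alternative p-values concentrated near zero, and the classical Benjamini-Hochberg step-up procedure choosing the p-value cutoff. The sample sizes, effect distribution and 0.05 level below are illustrative assumptions of mine, not values from the study.

```python
# Toy sketch of the two-group mixture behind FDR control (hypothetical numbers).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_null, n_alt = 9000, 1000

p_null = rng.uniform(size=n_null)                    # H0: uniform p-values
z_alt = rng.normal(loc=2.0, scale=1.0, size=n_alt)   # H1: shifted z-scores
p_alt = norm.sf(z_alt)                               # one-sided p-values near 0
p = np.concatenate([p_null, p_alt])

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the largest p-value cutoff controlling FDR at level alpha."""
    m = len(pvals)
    order = np.sort(pvals)
    # step-up rule: largest k with p_(k) <= alpha * k / m
    below = order <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return 0.0
    return order[np.nonzero(below)[0].max()]

cutoff = benjamini_hochberg(p, alpha=0.05)
print(f"BH cutoff: {cutoff:.4g}, discoveries: {(p <= cutoff).sum()}")
```

Everything with a p-value at or below the reported cutoff would be declared significant at an estimated FDR of 0.05.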

Before encountering this work, I had always taken this concept at face value. Stephens, however, takes a fresh look at the problem, challenging two assumptions in the standard FDR calculation:

  1. Firstly, current methods assume that all p-values near 1 come from the null (the "zero assumption") – something that appears intuitive at first, but is unrealistic in practice.
  2. Secondly, current methods that work directly with p-values do not fully account for the differing precision of different measurements.

To address these problems, Stephens proposes “a new deal” for FDR estimation using empirical Bayes, and he illustrates, with a toy simulated example, the potential of the new method to increase the number of discoveries at a given FDR threshold.

I, on the other hand, decided to test how the method performs on real data. The estimated effect sizes and standard errors that I analyzed with Stephens’ well-coded R package “ashr” come from our low-coverage whole-genome sequencing project, in which ~3,000 Crohn’s disease patients were sequenced at 4x and ~4,000 controls from the UK10K project were sequenced at 6x.

The second drawback of traditional FDR calculations is particularly relevant in this experimental setup, where the effect-size estimates of rare variants carry larger standard errors than those of common ones owing to errors in the sequencing reads. These imprecise tests can dilute the signal and thus inflate the FDR for the other tests. Not surprisingly, our p-values confirm the imprecision criticism (Fig. 2): the distributions of p-values become flatter as the variants become rarer. This is not because there is no signal among the rare variants; rather, the large standard errors make the effect sizes less significant.
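This dilution effect can be reproduced in a toy simulation (all numbers here are hypothetical stand-ins, not from our data): the very same true effect observed with a small versus a large standard error yields sharply different p-value distributions.

```python
# Same true effect, different measurement precision (illustrative values only).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
true_beta = 0.3   # one shared true effect size (hypothetical)
n = 5000          # tests per group

frac_sig = {}
for label, se in [("common", 0.1), ("rare", 0.5)]:
    beta_hat = true_beta + rng.normal(scale=se, size=n)  # noisy effect estimates
    z = beta_hat / se
    p = 2 * norm.sf(np.abs(z))                           # two-sided p-values
    frac_sig[label] = (p < 0.05).mean()
    print(f"{label} (se={se}): fraction with p < 0.05 = {frac_sig[label]:.2f}")
```

With the larger standard error, the p-value distribution flattens towards uniform even though the underlying effect is unchanged, which is exactly the pattern we see going from common to rare variants.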

Whereas the current FDR approaches implicitly assume a bimodal distribution of z-scores under H1, the new FDR method models the effect sizes directly with a unimodal distribution centered at 0. Fig. 3 shows the old and new distributions of z-scores under H1 in our dataset.
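The shrinkage flavour of this idea can be sketched with a much-simplified empirical Bayes model: a single normal prior on the effects, its variance estimated from the data. This is only an assumption-laden stand-in (ashr actually uses a flexible unimodal mixture), but it shows the key behaviour: noisier estimates are shrunk harder towards zero.

```python
# Simplified normal-prior empirical Bayes shrinkage (not the ashr model itself).
import numpy as np

rng = np.random.default_rng(2)
n = 2000
true_beta = rng.normal(scale=0.2, size=n)   # true effects, unimodal at 0
se = rng.uniform(0.05, 0.5, size=n)         # heterogeneous precision per test
beta_hat = true_beta + rng.normal(scale=se) # observed estimates

# Method-of-moments prior variance: Var(beta_hat) = tau^2 + mean(se^2)
tau2 = max(beta_hat.var() - np.mean(se**2), 1e-8)

# Normal-normal posterior mean: shrinkage factor depends on each test's own se.
shrunk = beta_hat * tau2 / (tau2 + se**2)

rmse_raw = np.sqrt(np.mean((beta_hat - true_beta) ** 2))
rmse_shrunk = np.sqrt(np.mean((shrunk - true_beta) ** 2))
print(f"estimated prior sd: {np.sqrt(tau2):.3f}")
print(f"RMSE raw: {rmse_raw:.3f}  RMSE shrunk: {rmse_shrunk:.3f}")
```

Because the shrinkage factor involves each test's own standard error, the imprecise tests no longer masquerade as (or drown out) precise ones.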


After both the ‘zero assumption’ and the ‘imprecision of measurements’ have been accounted for in the new model, here is a summary of the number of findings at an estimated FDR of 0.05, using the traditional ‘fdrtool’ package and the new ‘ashr’ package in R respectively:

[Table: counts of significant findings at an estimated FDR of 0.05 from ‘fdrtool’ vs ‘ashr’, split into common and low-freq & rare variants]
It is notable that the new approach can deliver an unsettling enrichment of true positive signals when the null is mostly false. Do we really have 2,480 common variants associated with Crohn’s disease on Chromosome 16 alone?! Stephens further argues that in such high-signal contexts the False Sign Rate (FSR) is preferable to the FDR: instead of asking whether there is any effect at all, we ask whether we can estimate the sign of the effect correctly. After changing the question, the estimated number of signals falls by 66% to 866 (835 common).
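The sign question can be made concrete with a simplified normal-normal empirical Bayes sketch (a hypothetical simulation, not the ashr implementation or our data): for each test, compute the posterior probability that we would call the sign of the effect wrongly, i.e. the local false sign rate, and declare a discovery only when that probability is small.

```python
# Local false sign rate (lfsr) under a toy normal-prior posterior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 2000
true_beta = rng.normal(scale=0.2, size=n)   # true effects (hypothetical)
se = rng.uniform(0.05, 0.5, size=n)         # per-test standard errors
beta_hat = true_beta + rng.normal(scale=se)

tau2 = max(beta_hat.var() - np.mean(se**2), 1e-8)  # prior: beta ~ N(0, tau2)
post_mean = beta_hat * tau2 / (tau2 + se**2)
post_sd = np.sqrt(tau2 * se**2 / (tau2 + se**2))

# Posterior probability that the sign of the effect is called wrongly.
lfsr = norm.cdf(-np.abs(post_mean) / post_sd)

discoveries = (lfsr < 0.05).sum()
print(f"signals with lfsr < 0.05: {discoveries} of {n}")
```

Since getting the sign right is a strictly harder requirement than the effect merely being nonzero, thresholding on the sign error is more conservative, which matches the drop in the number of signals reported above.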

It is evident that the new FDR approach successfully reduces the conservatism of the old method, but it may also leave us with too many signals to work with.

What’s more important about Stephens’ talk, in my opinion, is that he provides an excellent demonstration of the value of reproducible research. I have had countless experiences where I’ve read or heard about a new method and tried to apply it to my own data, only to be stymied by a few bugs in the code here, some mysterious library dependencies there, and other such silly issues.

In contrast, a method implemented as easy-to-use software, whose published results (or illustrative examples) can be directly replicated thanks to careful preservation of the data, analysis code and associated files, makes one’s research much more appealing and ‘popular’.

So two lessons were learnt: be wary of the FDR, and always spend a little time ensuring your research is reproducible.

1 Comment

  1. Matthew Stephens

    Glad you found it useful! There are some issues with applying these methods to GWAS, where the tests are correlated due to LD… Not sure how severe they are or how much to worry about them… It is on my radar to try to account for the LD.