28. October 2013 · A neat idea to tell polygenic signal from stratification noise · Categories: Conferences, Science

One of my favorite presentations at ASHG this year was a poster given by Brendan Bulik-Sullivan from the Broad. Brendan and his colleagues set out to answer a puzzling question that has come up quite often recently: “If we see an inflation of GWAS test statistics, is it because of polygenic risk (good) or population stratification (bad)?”

This particular puzzle has become prominent lately because GWAS sample sizes are now routinely in the tens of thousands, which means they have some power to detect many effects at sub-genome-wide significance. Since many complex diseases have hundreds (probably thousands) of independent risk loci, these weak signals are spread widely over the genome, and become difficult to tell apart from more insidious causes of overall inflation, like population stratification.

The neat idea the authors had is that while the overall distribution of test statistics looks the same under either polygenic risk or stratification, there are subtle differences in which markers are inflated. Specifically, under polygenic risk, a marker’s association statistic should be correlated with its number of “LD friends” (measured in this case by a sum of pairwise r² values). One can then straightforwardly compare the number of LD friends a marker has to its test statistic: a positive correlation indicates polygenic risk, whereas no correlation indicates stratification. Brendan showed both simulations confirming the approach and an analysis of the (widely discussed, but still unpublished) 100-hit schizophrenia GWAS meta-analysis, demonstrating clearly that it’s not stratification causing all those signals, but real risk genes!
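To make the check concrete, here’s a minimal sketch of how I’d run it myself (my illustration, not the authors’ code). It assumes you’ve already computed two per-marker arrays: chi2 (association χ² statistics) and ld_score (each marker’s sum of pairwise r² values from a reference panel).

```python
# Minimal sketch of the LD-friends check (my illustration, not the
# authors' code). Assumes two NumPy arrays aligned by marker:
#   chi2     -- per-marker association chi-square statistics
#   ld_score -- per-marker sum of pairwise r^2 values ("LD friends")
import numpy as np
from scipy import stats

def polygenic_vs_stratification(chi2, ld_score):
    """Regress each marker's test statistic on its LD score.

    A clearly positive slope suggests polygenic signal (markers with
    more LD friends tag more true associations); a slope near zero
    despite inflated statistics (mean chi2 > 1) points to stratification.
    """
    slope, intercept, r, p, se = stats.linregress(ld_score, chi2)
    print(f"mean chi2 = {np.mean(chi2):.3f} (values > 1 indicate inflation)")
    print(f"slope     = {slope:.4g} +/- {se:.2g} (p = {p:.2g})")
    return slope
```

In practice one would probably also bin markers by LD score and compare mean χ² across bins, but a simple regression captures the idea.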

The one thing I don’t fully understand is the math behind the idea that stratified differences aren’t correlated with LD but disease associations are. The latter is pretty intuitive: if one imagines a large number of true causal alleles, any given marker is more likely to show association if it’s in LD with lots of other sites, since that increases its chance of picking up the “shadow” of a real signal. And I suppose if one imagines markers drifting to different frequencies in different populations, there’s no reason that those with lots of LD friends will be more or less differentiated than those with few. This may not hold if selection were driving the frequency differences, since such a signal would look just like a real disease association, but presumably selection plays a fairly modest role between the populations (mostly of European ancestry) in these studies.
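For what it’s worth, here’s one way to formalize that intuition; this is my own back-of-the-envelope sketch under a simple additive model, not necessarily the derivation the authors used.

```latex
% Sketch under a simple additive model (my back-of-the-envelope, not
% necessarily the authors' derivation). Polygenic case: M causal variants,
% each explaining h^2/M of trait variance in a sample of size N; marker j
% tags causal variant k in proportion to r_{jk}^2, so its expected
% statistic grows linearly with its LD score \ell_j:
\[
  \mathbb{E}\!\left[\chi^2_j\right] \approx 1 + \frac{N h^2}{M}\,\ell_j,
  \qquad \ell_j = \sum_k r_{jk}^2 .
\]
% Pure drift/stratification: confounding inflates every marker by an
% amount that doesn't depend on local LD, so for some constant a
\[
  \mathbb{E}\!\left[\chi^2_j\right] \approx 1 + N a ,
\]
% which is flat in \ell_j. Regressing \chi^2_j on \ell_j therefore
% separates the two sources of inflation.
```

If that sketch is right, it also hints that the two effects could be estimated jointly: the intercept of the regression would absorb stratification while the slope reflects polygenicity.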

In any event, the simulations support the underlying notion convincingly: a very neat idea provides a clean answer to a thorny question.
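To convince myself, I put together a toy version of such a simulation (again, my own sketch, not the authors’ code; genotypes are simplified to standardized Gaussians in LD blocks, and drift is drawn independently per marker, matching the intuition above).

```python
# Toy simulation contrasting polygenic signal with stratification
# (my own sketch, not the authors' code). All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N = 4000                                       # individuals
rho = 0.9                                      # within-block correlation
block_sizes = rng.integers(1, 11, size=300)    # LD blocks of 1-10 markers

# Build genotypes: markers in a block share a common factor, creating LD.
cols, block_of = [], []
for b, size in enumerate(block_sizes):
    z = rng.standard_normal(N)
    for _ in range(size):
        cols.append(rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(N))
        block_of.append(b)
X = np.column_stack(cols)
X = (X - X.mean(0)) / X.std(0)
M = X.shape[1]
block_of = np.asarray(block_of)

# LD score of each marker: sum of r^2 with markers in the same block.
ld_score = np.empty(M)
for b in range(len(block_sizes)):
    idx = np.flatnonzero(block_of == b)
    r2 = np.atleast_2d(np.corrcoef(X[:, idx].T)) ** 2
    ld_score[idx] = r2.sum(axis=1)

def chi2_stats(G, y):
    """Marginal association chi-square per marker: N * cor(g, y)^2."""
    G = (G - G.mean(0)) / G.std(0)
    y = (y - y.mean()) / y.std()
    return N * (G.T @ y / N) ** 2

# Scenario 1: fully polygenic trait; every marker has a weak causal effect.
h2 = 0.5
beta = rng.standard_normal(M) * np.sqrt(h2 / M)
y_poly = X @ beta + rng.standard_normal(N) * np.sqrt(1 - h2)

# Scenario 2: no genetic effects, but two populations whose genotype means
# drift apart (independently per marker) and whose trait means differ.
pop = np.where(rng.random(N) < 0.5, 1.0, -1.0)
X_strat = X + np.outer(pop, 0.05 * rng.standard_normal(M))
y_strat = 0.5 * pop + rng.standard_normal(N)

for label, G, y in [("polygenic", X, y_poly), ("stratified", X_strat, y_strat)]:
    chi2 = chi2_stats(G, y)
    slope = np.polyfit(ld_score, chi2, 1)[0]
    print(f"{label}: mean chi2 = {chi2.mean():.2f}, slope vs LD score = {slope:+.3f}")
```

Both scenarios produce clearly inflated statistics (mean χ² well above 1), but only the polygenic one should show a positive slope against LD score, which is exactly the signature the poster exploits.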
