Out this week in Nature is the first big paper from the inflammatory bowel disease Immunochip project. The international project collected data from over 75 thousand individuals, and brought the total number of known IBD loci to a record-breaking 163. You can read more about the paper on the Sanger Institute website.

One interesting thing about the paper was how difficult it was to visualize the results. With one exception, there was no single image that naturally fell out of any of the analyses, and we had to put quite a bit of work into displaying the messages of the paper in the figures. You can judge for yourself how much success we had, but I can say that up until the last few days before submission we still had images that everyone hated but nobody could think what to replace them with. The last one to be replaced was the evocatively named “Smear-o-venn”, which we were all relieved to see the back of.

One glaring question from early on was how to display all the associations across the genome. Studies like this generally use a so-called Manhattan plot, in which height displays the evidence of association for each variant in the genome, with real associations standing out like skyscrapers in the New York skyline. But our study had a large number of associations with massive variation in significance: variants that were found 10 years ago in a few hundred samples now give test statistics that are hundreds of times larger than those of any of our (still significant) newly discovered variants. The Manhattan plot was thus a blurred mess.

Instead we came up with the above image that we called the Belgravia plot (named for the grand, but flat, Regency terraces in central London). We no longer use height to show the statistical evidence for each locus, and indeed we only include loci that show conclusive evidence of association. Instead, the width of each bar is proportional to the variance explained (an absolute measure of the contribution of that locus to disease risk). This allows us to get an overall idea of the contribution of each locus, and of each chromosome, to each disease.
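For anyone who wants to try the idea, here is a minimal sketch of the layout step in Python. It is not the code behind the published figure. It assumes one stacked horizontal bar per chromosome, loci kept in whatever order they are passed in, a single shared scale so that widths are comparable across chromosomes, and a hypothetical `min_label_width` cut-off below which a bar is too narrow to carry a name. The function name, parameters and example values are all made up for illustration.

```python
from collections import defaultdict


def belgravia_layout(loci, plot_width=100.0, min_label_width=5.0):
    """Lay out one horizontal stacked bar per chromosome.

    loci: iterable of (chromosome, locus_name, variance_explained) tuples.
    Returns {chromosome: [(locus_name, left, width, show_label), ...]}.
    """
    by_chrom = defaultdict(list)
    for chrom, name, var_exp in loci:
        by_chrom[chrom].append((name, var_exp))

    # Assumption: one shared scale, so the chromosome with the most
    # variance explained spans the full plot_width.
    largest_total = max(
        sum(v for _, v in entries) for entries in by_chrom.values()
    )
    scale = plot_width / largest_total

    layout = {}
    for chrom, entries in by_chrom.items():
        left = 0.0
        bars = []
        for name, var_exp in entries:
            width = var_exp * scale  # width is proportional to variance explained
            bars.append((name, left, width, width >= min_label_width))
            left += width
        layout[chrom] = bars
    return layout


# Hypothetical values for illustration only, not results from the paper.
example_loci = [
    ("chr1", "locus_A", 2.0),
    ("chr1", "locus_B", 0.5),
    ("chr2", "locus_C", 1.0),
    ("chr2", "locus_D", 0.25),
]
layout = belgravia_layout(example_loci, plot_width=100.0, min_label_width=15.0)
```

Each tuple in the result gives a bar's name, left edge, width and whether it is wide enough to label, which is all a plotting library needs to draw the rectangles and add the text.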

By putting the Belgravia plots for CD and UC on top of each other we can look at the overall genetic landscape of each disease. We can see that the NOD2 locus, involved in recognising intracellular bacteria, makes a large contribution to CD but very little to UC. Likewise, the HLA locus, involved in the immune system’s recognition of self, contributes strongly to UC but little to CD. Others, such as the IL23R locus involved in immune cell signaling, contribute strongly to both diseases but more strongly to CD. What you can also see is that much of the discovered risk for both diseases is made up of little bars, too small to fit a name onto: this demonstrates how much (but not all) of IBD risk is made up of many loci, each contributing a very small amount to the total risk.

However, even the Belgravia plot started life as something of an eyesore. I started off calling it the “Times Square plot”, and it originally had a somewhat more psychedelic feel to it:

Thanks to the (now ex) Barrett group member Kate for suggesting the somewhat classier colour scheme. It was a great loss for the team when Kate moved on to pastures greener (but tastefully so), as I don’t know who will put a stop to runaway colour schemes in the future.

5 Comments

  1. Pingback: Some updates | Genetic Inference

  2. Oh one other comment: at the point of submission if you zoomed in close enough on the GRAIL network plot, you could see tiny little gene names in the middle of the nodes. This was because I couldn’t figure out how to turn labels off in the software I was using, so just set the font to really, really small. It seems that the Nature copy editors managed to remove these, though. Which was a shame, they felt like tiny little easter eggs.

  3. The singular of loci is locus, goddamnit!

  4. locus schmocus

  5. Nice. I’m totally going to copy this idea for my papers with 20+ hits. Any tips on making it? Guessing you used R based on the horrid initial colors.

    • Yes at heart it is just a sideways barplot in R with labels added in via text(). Maybe I should clean up the code and add it to my R package.

      One thing I couldn’t get right automatically was the chromosome labels – they kept overlapping for the smaller ones. In the end I cleaned them up in Inkscape…