18. September 2016 · Comments Off on About us · Categories: Uncategorized

The Barrett group works on medical genomics research at the Wellcome Trust Sanger Institute. We are interested in how genetic variation affects risk for diseases, and in finding ways to apply that knowledge to improve health care. We analyze genome-wide association studies and next-generation sequence data collected on thousands of individuals, and develop statistical and computational methods for these analyses.

You can keep up to date with news and posts about our research by subscribing to our RSS feed.

Thomas Evans and Laurence Ettwiller of New England Biolabs don’t hesitate to answer that question in their recent paper — it’s right in the title: “DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification”. Indeed, the word “variant” is used 88 times in the paper, including hard-to-decipher phrases like “Variants originating from real in-vivo variants”, because it’s used to mean at least three different things:

  1. a germline variant is a position in an individual’s germline DNA that is different from the reference genome sequence
  2. a somatic variant is a position in a somatic cell’s DNA that is different from that individual’s germline sequence
  3. a sequence read variant is a position in a specific DNA sequencing read that is different from the reference genome sequence, which I’ll call an observed non-reference allele

Much of the interest in this paper I’ve seen on Twitter (and my own spit-take when I saw it as a preprint just after submitting a low-coverage sequencing paper) arises from the potential effects on those first two definitions, which are the foundation of human genetics and cancer genetics, respectively.
More »

The analyses had been run (and re-run), QC was finished, and the only thing left to do was somehow corral all that work into a paper. With that realization, Alice had a vivid flashback to her last collaborative paper:


More »

The bioinformatics community offers a wealth of tools, each honed to perform a specific function. Performing complex tasks will invariably involve passing your data from one of these tools to another – along with suitable parameters – and writing some scripts to connect the pieces. To record this sequence of steps and describe the results, a log or README is usually written. This certainly gets the job done, but I argue there is a better way to create and record workflows involving a mixture of command line tools, scripting languages, and written narrative: Jupyter Notebook. The Jupyter Notebook is a browser-based command shell for interactive computing in several languages: Python, bash, R, Julia, Haskell, Ruby, and more. To provide a feel for what Jupyter Notebook can do, I’ll first present an overview of the user interface. The second part of this blog post will discuss use cases.

More »

14. August 2015 · Comments Off on Two new reviews · Categories: Papers

Summertime was review writing time in the group, with two new papers published recently.

The first was “Strategies for fine-mapping complex traits”. One of the activities that has kept us busy lately has been trying to narrow down GWAS hits to causal variants and genes. Our biggest applied effort in fine-mapping has been, unsurprisingly, in inflammatory bowel disease (IBD), where our longstanding collaboration with the International IBD Genetics Consortium, access to big sample sets, and a generally tractable genetic architecture have made it a fruitful exercise. This review was largely motivated by the idea that our experience might be useful to others working on other diseases.

The second was “Understanding inflammatory bowel disease via immunogenetics”. Sticking with the IBD theme, this is the latest in a number of IBD genetics reviews (things change quickly in this business!), this time as part of a series in the Journal of Autoimmunity, aimed at putting the immunogenetics of many different disorders in context in a single issue.

Congrats to Katie and Sarah on these two papers!

20. March 2015 · Comments Off on Genetic study sheds new light on TB pathogenesis · Categories: Papers, Science

One of the world’s most ancient diseases

Tuberculosis, also known as consumption, was first recorded in Greek literature around 460 BCE. Hippocrates identified it as the most widespread and fatal disease of his time. Tuberculosis (TB) is caused by a pathogen called Mycobacterium tuberculosis (M.tb). In Greek, myco refers to a mushroom-like shape, vividly describing these fungal-looking bacteria that float into the human system through the airways.

TB accounted for approximately 25% of total deaths in Europe from the 17th to 19th centuries. Many of the writers and artists of the Victorian era suffered and died from the disease and painted it with a pathological – yet somehow romantic – extreme: febrile, unrelenting and breathless.

Experiment eleven

It was not until 1943 that a young Ph.D. student called Albert Schatz, from Professor Selman Waksman’s lab at Rutgers University in the US, discovered the first effective cure for TB. In Schatz’s eleventh experiment on a common bacterium found in farmyard soil, the first antibiotic agent for treating TB, streptomycin, was discovered. The battle for the ownership of streptomycin became a famous scientific scandal [Experiment Eleven] when Waksman took credit and the Nobel prize for the discovery, downplaying Schatz’s contributions. Thanks to a sustained effort from government and society, including better nutrition, housing, improved sewage systems and ventilation, the number of TB cases was reduced significantly by the 1980s. The efforts to seek cures for TB have not only brought TB mortality down, but also helped to shape modern medicine and our understanding of infectious illness.
More »

20. March 2015 · Comments Off on ExAC: It’s BIG and easy to use · Categories: Science, Software

There’s a new genomic database in town and everyone should know about it.

As humans, we are each different. With the exception of the majority of monozygous twins, each of us carries our own unique set of ‘variants’, or positions in the genome where the coding bases differ from the known reference sequence. It’s 12 years since the completion of the first draft of the Human Genome Project, but this was just the beginning of getting to the bottom of which of the millions of genetic variants humans carry are benign.

The Exome Aggregation Consortium (ExAC) is a collection of scientists who seek to combine the exomes from global sequencing projects. Each variant, its functional annotation, and its frequency in global populations (African, American, Non-Finnish Europeans, Finnish Europeans, East Asians, South Asians) is made available to the scientific community. Previously, scientists and clinicians have used the 1000 Genomes Project and the NHLBI GO Exome Sequencing Project for annotating and filtering variants, with summary statistics for ~2500 and ~6500 individuals respectively. Impressively, in its current release (2015-02-12), the ExAC dataset includes over 60,000 individuals: by far the single largest aggregation of coding variants in the world. This resource will provide much greater resolution for filtering variants in clinical and population-based cohorts, and further refine our ability to identify functional coding variants relevant to disease.

When we try to interpret variants in a population cohort or clinical exome results, one of the most effective ways to identify highly damaging variants is to use a population frequency filter from a database like ExAC. Quite simply, highly pathogenic variants with a large effect should be selected against and seen at a lower frequency in the general population. For instance, Wiedemann-Steiner syndrome is an autosomal dominant disease with a clinically recognisable phenotype caused by mutations in KMT2A. The data in ExAC come from individuals who are reportedly known not to have severe paediatric disease (but be aware that it does contain data from individuals with psychiatric disorders such as schizophrenia), so we would expect to see no, or only a small number of, individuals with likely pathogenic mutations in this gene. In fact, there are no individuals in ExAC with variants in KMT2A predicted to result in a frameshift or gain of a stop codon, despite hundreds of individuals carrying missense variants. ExAC also allows us to calculate carrier frequency. A quick look for variants in CFTR, the gene implicated in the recessive disease cystic fibrosis, shows that the most common European mutation (delta F508) is carried on 823 alleles, with no homozygotes, consistent with a carrier rate of 1/71 for this specific mutation among the individuals in ExAC. This is of course only one of the mutations that can cause cystic fibrosis, so it is the carrier rate for this specific mutation in this population and not for cystic fibrosis in general, but it illustrates the type of data you have at your fingertips.
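The carrier-rate arithmetic can be sketched in a few lines of Python. The 823 alleles and zero homozygotes are the figures quoted above. The number of genotyped individuals is not stated in this post, so the value below is a hypothetical stand-in chosen only so that the result matches the quoted 1/71. The function name `carrier_rate` is also just illustrative.

```python
from fractions import Fraction


def carrier_rate(allele_count, homozygotes, individuals):
    """Fraction of genotyped individuals carrying exactly one copy of an allele.

    Each homozygote accounts for two observed alleles, so the number of
    heterozygous carriers is allele_count - 2 * homozygotes.
    """
    if individuals <= 0:
        raise ValueError("individuals must be positive")
    heterozygotes = allele_count - 2 * homozygotes
    if heterozygotes < 0:
        raise ValueError("more homozygote alleles than observed alleles")
    return Fraction(heterozygotes, individuals)


# 823 alleles and 0 homozygotes are quoted in the post.
# HYPOTHETICAL: the number of individuals genotyped at this site is an
# assumption made for illustration, not a figure taken from ExAC.
HYPOTHETICAL_INDIVIDUALS = 58_433

rate = carrier_rate(allele_count=823, homozygotes=0,
                    individuals=HYPOTHETICAL_INDIVIDUALS)
print(f"carrier rate: about 1 in {round(1 / rate)}")
```

In practice the denominator should come from the number of individuals actually genotyped at that position, which varies with sequencing coverage.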

Using the ExAC database is incredibly simple! If you have a single variant, gene, or region of interest, just go to http://exac.broadinstitute.org/ and enter the ID or chromosomal position into the search box. (This is something anyone could do very easily!) After a quick search, you can see the sequencing coverage (which tells us about our ability to identify variants using current technologies), the list of variants, their functional annotation (intronic, synonymous, missense, LoF), and their frequency in the database. You can filter the list by using the missense and LoF tabs to look only at putatively damaging variation. Even if you have just sequenced a single exome, it becomes much more interpretable.

If you have a VCF file on hand, you can use a tool called ANNOVAR (http://www.openbioinformatics.org/annovar/) to annotate the frequencies directly. The program can download a local copy of the ExAC database, and with a few simple commands, each line of the VCF file will be annotated with ExAC population frequencies. There is an online version of ANNOVAR as well (http://wannovar.usc.edu/), which allows you to upload a VCF file to the server. And if you want to write your own scripts to do this, you can always download the entire ExAC database from the website.
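For the write-your-own-script route, a minimal sketch of the idea might look like the following. It is not ANNOVAR and does not read the real ExAC download. The frequency table is a small in-memory dictionary with made-up values, the toy VCF lines are invented, and the names `variant_key` and `annotate_vcf_lines` are hypothetical. It only illustrates looking up each variant by chromosome, position, reference and alternate allele, then applying a population frequency filter.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant to a lookup key, dropping any 'chr' prefix."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref, alt)


def annotate_vcf_lines(vcf_lines, frequencies, max_af=None):
    """Yield (chrom, pos, ref, alt, af) for each alternate allele in a VCF.

    ASSUMPTION: variants absent from the table get a frequency of 0.0,
    which conflates "never observed" with "not covered by sequencing".
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # multiallelic sites list several ALTs
            af = frequencies.get(variant_key(chrom, pos, ref, alt), 0.0)
            if max_af is None or af <= max_af:
                yield chrom, int(pos), ref, alt, af


# HYPOTHETICAL toy data: positions and frequencies are made up.
TOY_FREQUENCIES = {
    ("1", 1000, "A", "G"): 0.25,
    ("2", 2000, "C", "T"): 0.0004,
}
TOY_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr2\t2000\t.\tC\tT\t50\tPASS\t.",
    "3\t3000\t.\tG\tA,C\t50\tPASS\t.",
]

# Keep only variants at or below 0.1% frequency in the toy table.
rare = list(annotate_vcf_lines(TOY_VCF, TOY_FREQUENCIES, max_af=0.001))
for record in rare:
    print(record)
```

A real script would also need to check sequencing coverage at each site and make sure indels are represented the same way in both files, which this sketch ignores.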

However, it should be remembered that the ExAC database is not a collection of variants from phenotypically healthy individuals. The studies included in ExAC have some individuals diagnosed with schizophrenia, diabetes, heart disease, and inflammatory bowel disease (to name just a few). And of course even ‘healthy’ individuals in population sequencing studies can be identified as having disease retrospectively, when a detailed look at their exome variant profile leads to a targeted, closer clinical or biochemical inspection. ExAC is also at this stage only a beta version, with some changes between the variants contained in recent releases. However, without a doubt, ExAC is big and easy to use, and it’s the best there is. Use freely and often, but with a bit of caution.

06. August 2014 · Comments Off on 1000 Genomes and Beyond: a retrospective · Categories: Conferences, Science

The 1000 Genomes and Beyond conference, held on 24-26 June 2014 in Cambridge, UK, was the latest of the 1000 Genomes Project community meetings, marking the end of this grandiose endeavor launched in 2008. With the final Phase 3 of the 1000 Genomes Project released on 24 June, this was an excellent opportunity to get an update on the release, to see what has been learned so far by leveraging the genetic variation catalogued in 1000 Genomes, and to get a glimpse of future opportunities and directions.

The final phase 3 release is a catalog of genetic variants identified through low-coverage (8x) whole-genome sequencing, exome sequencing and genotyping arrays in 2,504 individuals from 26 populations (each population is represented by 60-100 individuals). The catalog includes over 79 million variant sites, covering short variants (bi- and multiallelic SNPs, indels), tandem repeats and structural variants, and is expected to contain over 95% of common variants.

The 1000 Genomes Project had a tremendous impact on our understanding of population genetics and human evolution. It also enabled studies of population isolates, an easy and effective study design that has revealed numerous candidate loci for complex traits. One of the most well-rounded studies presented during the meeting concerned the common Greenlandic stop-gain variant in TBC1D4 conferring muscle insulin resistance and T2D. In the discovery analysis, the variant was found to be associated with higher plasma glucose and serum insulin levels in Greenlandic participants without previously known T2D, while the subsequent T2D case-control analysis showed strong association with increased T2D risk. The variant has a MAF of 17% in the Greenlandic cohort, and it has been observed in only one Japanese individual out of all individuals sequenced in the 1000 Genomes Project and several related large sequencing projects. The observed effect sizes are several times larger than any previous finding in large-scale GWASs of these traits, with ~60% of homozygous carriers developing T2D between 40 and 60 years of age, indicating a Mendelian-disease-like pattern of inheritance. The well thought-out design of this study was commended, and it sparked some discussion about the frequent lack of population-based control groups in studies of extreme and rare phenotypes, where the initial ascertainment bias introduces an upward bias in reported effect sizes.

What I found particularly encouraging is that the focus of genetic studies is slowly moving toward large-scale functional and mechanistic studies. This direction has been advocated for years, but results are finally being produced. Studies of genetic variation in human DNA replication timing (rtQTL), high-quality, high-resolution transcriptome analysis, investigation of the effects of loss-of-function variants on the transcriptome, and the development of new software tools for analysing sequenced variants, such as Ensembl’s Variant Effect Predictor (VEP), are all gradually unveiling the functional consequences of genetic variation in humans. Still, some of this work is in its infancy and will require larger, better-powered studies to produce meaningful conclusions. This was nicely demonstrated during Andrew Wood’s talk on epistatic effects of genetic variants influencing gene expression levels. They sought to replicate findings from the first study reporting 30 epistatic interactions affecting traits (Hemani et al., Nature 2014) using genotypes imputed from the 1000 Genomes reference panel. Fourteen interaction effects were replicated; however, in each case a third variant not captured in the initial study could explain the apparent epistasis. A second study reporting epistatic effects was also published in 2014, but although sharing data between these groups helped, Wood concluded that a clear picture of epistatic interaction effects on gene expression will require large sample sizes and whole-genome sequencing.

The overall conclusion of the meeting was that developments in computational biology methods will be of critical importance, and that data sharing and functional characterization will be the biggest challenges in human genetics in the future. Richard Durbin also noted that we should be considering the 1 Million Genomes Project, but that new initiative, goals and leadership are needed.

This year’s Biology of Genomes (BoG) meeting maintained its high standard with another display of excellent and exciting science. One of my favourite presentations was given by Matthew Stephens from the University of Chicago on the topic of False Discovery Rates (FDRs).

The FDR is a basic concept in statistical testing that we all come across in our research. By controlling the FDR, we aim to limit the expected proportion of false positives among significant loci identified by association studies. The idea is that under the null hypothesis (H0, that the locus is not associated with the trait), the observed p-values are expected to be distributed uniformly (Fig. 1(a)); and under an alternative hypothesis (H1), more of the p-values should be close to zero (Fig. 1(b)). In other words, the observed distribution of p-values in a genome-wide scan should be a mixture of these two distributions. The existing FDR methods find a maximum cutoff value (Fig. 1(c)) such that the results with smaller p-values are likely to be true positives from H1.
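As a concrete illustration of finding such a cutoff, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used FDR method. The post does not say which existing methods were discussed, so the choice of Benjamini-Hochberg is an assumption, and the p-values below are made up for the example.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return (cutoff, indices) of tests declared significant at a given FDR.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= (k / m) * fdr, and call every test with p <= p_(k) significant.
    Returns (None, []) when no p-value passes.
    """
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * fdr:
            cutoff = p  # keep updating so the largest passing rank wins
    if cutoff is None:
        return None, []
    significant = [i for i, p in enumerate(p_values) if p <= cutoff]
    return cutoff, significant


# HYPOTHETICAL p-values, invented for illustration only.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                0.060, 0.074, 0.205, 0.212, 0.216]

cutoff, hits = benjamini_hochberg(toy_p_values, fdr=0.05)
print(f"cutoff p-value: {cutoff}, significant tests: {hits}")
```

With these toy values only the two smallest p-values fall under the cutoff, even though five of them are below 0.05 on their own.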

More »

11. April 2014 · Comments Off on Two jobs available in the lab · Categories: Jobs

We’re currently advertising for two positions in the lab:

  1. A postdoc to work on genomic data analysis in inflammatory bowel disease, shared with Carl Anderson’s team
  2. A software developer to help build the tools the community needs for large-scale genomic analysis more generally, and to support the variety of projects happening in the team.

The first ad closes in just a few days, so apply soon if you’re interested (the second one is open for another couple of weeks).