There’s a new genomic database in town and everyone should know about it.
As humans, we are each different. With the exception of most monozygous twins, each of us carries our own unique set of ‘variants’, or positions in the genome where the coding bases differ from the known reference sequence. It’s 12 years since the completion of the first draft of the Human Genome Project, but this was just the beginning of getting to the bottom of which of the millions of genetic variants humans carry are benign.
The Exome Aggregation Consortium (ExAC) is a collection of scientists who seek to combine the exomes from global sequencing projects. Each variant, its functional annotation, and its frequency in global populations (African, American, Non-Finnish Europeans, Finnish Europeans, East Asians, South Asians) are made available to the scientific community. Previously, scientists and clinicians have used the 1000 Genomes Project and the NHLBI GO Exome Sequencing Project for annotating and filtering variants, with summary statistics for ~2500 and ~6500 individuals respectively. Impressively, in its current release (2015-02-12), the ExAC dataset includes over 60,000 individuals: by far the single largest aggregation of coding variants in the world. This resource will provide much greater resolution for filtering variants in clinical and population-based cohorts, and further refine our ability to identify functional coding variants relevant to disease.
When we try to interpret variants in a population cohort or clinical exome results, one of the most effective ways to identify highly damaging variants is to use a population frequency filter from a database like ExAC. Quite simply, highly pathogenic variants with a large effect should be selected against and seen at a lower frequency in the general population. For instance, Wiedemann-Steiner syndrome is an autosomal dominant disease with a clinically recognisable phenotype caused by mutations in KMT2A. The ExAC data come from individuals who are reportedly known not to have severe paediatric disease (but be aware that it does contain data from individuals with psychiatric disorders such as schizophrenia). We would therefore expect to see no individuals, or only a small number, with likely pathogenic mutations in this gene. In fact there are no individuals in ExAC with variants in KMT2A predicted to result in a frameshift or gain of a stop codon, despite hundreds of individuals carrying missense variants. ExAC also allows us to calculate carrier frequency. A quick look for variants in CFTR, the gene implicated in the recessive disease cystic fibrosis, shows that the most common European mutation (delta F508) is carried on 823 alleles, with no homozygotes, consistent with a carrier rate of 1/71 for this specific mutation among the individuals in the ExAC population. This is of course only one of the mutations that can cause cystic fibrosis, so 1/71 is the carrier rate for this specific mutation in this population and not for cystic fibrosis in general, but it illustrates the type of data you have at your fingertips.
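The carrier-rate arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are made up here, and the allele number used in the example (116,866, i.e. 58,433 genotyped individuals) is a placeholder chosen so that the result matches the 1/71 quoted above. It is not a value taken from ExAC, so look up the real allele number for the site before relying on a figure like this.

```python
from fractions import Fraction


def carrier_rate(allele_count, homozygote_count, allele_number):
    """Fraction of genotyped individuals carrying exactly one copy of an allele.

    Assumes a diploid autosomal site, so the number of genotyped
    individuals is half the number of called alleles.
    """
    if allele_number <= 0 or allele_number % 2 != 0:
        raise ValueError("allele_number must be a positive even number")
    # Each homozygote accounts for two copies of the allele; the rest are carriers.
    heterozygotes = allele_count - 2 * homozygote_count
    if heterozygotes < 0:
        raise ValueError("homozygotes account for more alleles than were counted")
    individuals = allele_number // 2
    return Fraction(heterozygotes, individuals)


# 823 alleles and no homozygotes are from the text above;
# 116866 is a hypothetical allele number used for illustration only.
rate = carrier_rate(allele_count=823, homozygote_count=0, allele_number=116866)
print(f"carrier rate: 1/{round(1 / rate)}")
```

Because there are no homozygotes, every one of the 823 alleles sits in a different heterozygous carrier, which is why the allele count can be read directly as a carrier count here.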
Using the ExAC database is incredibly simple! If you have a single variant, gene, or region of interest, just go to http://exac.broadinstitute.org/ and enter the ID or chromosomal position into the search box. (This is something anyone could do very easily!) After a quick search, you can see the sequencing coverage (which tells us about our ability to identify variants using current technologies), the list of variants, their functional annotations (intronic, synonymous, missense, LoF), and their frequencies in the database. You can filter the list by using the missense and LoF tabs to look only at putatively damaging variation. Even if you have sequenced just a single exome, it becomes much more interpretable.
If you have a VCF file on hand, you can use a tool called ANNOVAR (http://www.openbioinformatics.org/annovar/) to annotate the frequencies directly. The program can download a local copy of the ExAC database, and with a few simple commands, each line of the VCF file will be annotated with ExAC population frequencies. There is an online version of ANNOVAR as well (http://wannovar.usc.edu/), which allows you to upload a VCF file to the server. And if you want to write your own scripts to do this, you can always download the entire ExAC database from the website.
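For anyone taking the write-your-own-script route, here is a minimal Python sketch of a population frequency filter over VCF records. It assumes the file carries the standard VCF INFO keys AC (allele count per alternate allele) and AN (total called alleles); the downloaded ExAC file may offer additional or differently named fields, so check its header first. The function names, the 0.001 threshold, and the example records are all made up for illustration.

```python
def parse_info(info_field):
    """Turn an INFO string such as 'AC=3;AN=100;DB' into a dictionary."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Keys without a value are flags; record them as True.
        info[key] = value if sep else True
    return info


def rare_variants(vcf_lines, max_frequency=0.001):
    """Yield (chrom, pos, ref, alt, frequency) for alleles at or below max_frequency."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _variant_id, ref, alts = fields[:5]
        info = parse_info(fields[7])
        if "AC" not in info or "AN" not in info:
            continue  # no frequency information on this record
        allele_number = int(info["AN"])
        if allele_number == 0:
            continue  # no called genotypes at this site
        # AC holds one count per alternate allele, in the same order as ALT.
        for alt, count in zip(alts.split(","), info["AC"].split(",")):
            frequency = int(count) / allele_number
            if frequency <= max_frequency:
                yield chrom, int(pos), ref, alt, frequency


# Invented records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\tAC=1;AN=10000",
    "1\t200\t.\tC\tT,G\t.\tPASS\tAC=5000,2;AN=10000",
]
for record in rare_variants(example):
    print(record)
```

To run it over a real file, pass an open file handle (for example one returned by `gzip.open(path, "rt")`) in place of the `example` list, since the function only needs an iterable of text lines.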
However, it should be remembered that the ExAC database is not a collection of variants from phenotypically healthy individuals. The studies included in ExAC have some individuals diagnosed with schizophrenia, diabetes, heart disease, and inflammatory bowel disease (to name just a few). And of course even ‘healthy’ individuals in population sequencing studies can be identified as having disease retrospectively, when a detailed look at their exome variant profile leads to a closer, targeted clinical or biochemical inspection. ExAC is also at this stage only a beta version, and the variants it contains have changed somewhat between recent releases. However, without a doubt, ExAC is big and easy to use, and it’s the best there is. Use freely and often, but with a bit of caution.