The analyses had been run (and re-run), QC was finished, and the only thing left to do was somehow corral all that work into a paper. With that realization, Alice had a vivid flashback to her last collaborative paper:


More »

The bioinformatics community offers a wealth of tools, each honed to perform a specific function. Performing complex tasks will invariably involve passing your data from one of these tools to another, along with suitable parameters, and writing some scripts to connect the pieces. To record this sequence of steps and describe the results, a log or README is usually written. This certainly gets the job done, but I argue there is a better way to create and record workflows involving a mixture of command line tools, scripting languages, and written narrative: Jupyter Notebook. The Jupyter Notebook is a browser-based command shell for interactive computing in several languages: Python, bash, R, Julia, Haskell, Ruby, and more. To provide a feel for what Jupyter Notebook can do, I’ll first present an overview of the user interface. The second part of this blog post will discuss use cases.

More »

20. March 2015 · Comments Off on ExAC: It’s BIG and easy to use · Categories: Science, Software

There’s a new genomic database in town and everyone should know about it.

As humans, we are each different. With the exception of the majority of monozygotic twins, each of us carries our own unique set of ‘variants’, or positions in the genome where the coding bases differ from the known reference sequence. It’s 12 years since the completion of the first draft of the Human Genome Project, but this was just the beginning of getting to the bottom of which of the millions of genetic variants humans carry are benign.

The Exome Aggregation Consortium (ExAC) is a collection of scientists who seek to combine the exomes from global sequencing projects. Each variant, its functional annotation, and its frequency in global populations (African, American, Non-Finnish Europeans, Finnish Europeans, East Asians, South Asians) are made available to the scientific community. Previously, scientists and clinicians have used the 1000 Genomes Project and the NHLBI GO Exome Sequencing Project for annotating and filtering variants, with summary statistics for ~2500 and ~6500 individuals respectively. Impressively, in its current release (2015-02-12), the ExAC dataset includes over 60,000 individuals: by far the single largest aggregation of coding variants in the world. This resource will provide much greater resolution for filtering variants in clinical and population-based cohorts, and further refine our ability to identify functional coding variants relevant to disease.

When we try to interpret variants in a population cohort or clinical exome results, one of the most effective ways to identify highly damaging variants is to use a population frequency filter from a database like ExAC. Quite simply, highly pathogenic variants with a large effect should be selected against and seen with a lower frequency in the general population. For instance, Wiedemann-Steiner syndrome is an autosomal dominant disease with a clinically recognisable phenotype caused by mutations in KMT2A. The data in ExAC come from individuals reported not to have severe paediatric disease (although be aware that it does include individuals with psychiatric disorders such as schizophrenia), so we would expect to see no, or only a small number of, individuals with likely pathogenic mutations in this gene. In fact there are no individuals in ExAC with variants in KMT2A predicted to result in a frameshift or gain of a stop codon, despite hundreds of individuals carrying missense variants. ExAC also allows us to calculate carrier frequency. A quick look for variants in CFTR, the gene implicated in the recessive disease cystic fibrosis, shows that the most common European mutation (delta F508) is carried on 823 alleles, with no homozygotes, consistent with a carrier rate of 1/71 for this specific mutation among the individuals in ExAC. This is of course only one of the mutations that can cause cystic fibrosis, so the figure is the carrier rate for this specific mutation in this population and not for cystic fibrosis in general, but it illustrates the type of data you have at your fingertips.
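The carrier-rate arithmetic for the CFTR example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the total allele number used here is not an ExAC figure. It is back-calculated from the numbers quoted in the post (823 alleles at a rate of 1/71 implies roughly 58,433 genotyped individuals).

```python
from fractions import Fraction


def heterozygous_carrier_rate(allele_count, homozygote_count, allele_number):
    """Return the fraction of genotyped individuals who are heterozygous carriers.

    allele_count: number of chromosomes carrying the variant
    homozygote_count: number of individuals carrying two copies
    allele_number: total number of chromosomes genotyped at the site
    """
    if allele_number <= 0 or allele_number % 2:
        raise ValueError("allele_number must be a positive, even number")
    individuals = allele_number // 2
    # Each homozygote accounts for two of the observed alleles.
    heterozygotes = allele_count - 2 * homozygote_count
    if heterozygotes < 0:
        raise ValueError("homozygote_count is too large for allele_count")
    return Fraction(heterozygotes, individuals)


# 823 alleles and no homozygotes are taken from the post.
# ASSUMPTION: the allele number is implied by the quoted 1/71 rate
# (823 * 71 = 58,433 individuals, so 116,866 chromosomes), not read from ExAC.
IMPLIED_ALLELE_NUMBER = 2 * 823 * 71

rate = heterozygous_carrier_rate(823, 0, IMPLIED_ALLELE_NUMBER)
print(f"carrier rate: 1 in {float(1 / rate):.0f}")
```

With real data you would replace the implied allele number with the number of chromosomes actually genotyped at that site, which varies with sequencing coverage.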

Using the ExAC database is incredibly simple! If you have a single variant, gene, or region of interest, just go to the ExAC browser and enter the ID or chromosomal position into the search box. (This is something anyone could do very easily!) After a quick search, you can see the sequencing coverage (which tells us about our ability to identify variants using current technologies), the list of variants, their functional annotations (intronic, synonymous, missense, LoF), and their frequencies in the database. You can filter the list by using the missense and LoF tabs to only look at putatively damaging variation. Even if you have just sequenced a single exome, it becomes much more interpretable.

If you have a VCF file on hand, you can use a tool called ANNOVAR to annotate the frequencies directly. The program can download a local copy of the ExAC database, and with a few simple commands, each line of the VCF file will be annotated with ExAC population frequencies. There is an online version of ANNOVAR as well, which allows you to upload a VCF file to the server. And if you want to write your own scripts to do this, you can always download the entire ExAC database from the ExAC website.
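For the write-your-own-script route, a minimal Python sketch of a population frequency filter might look like the following. It assumes a sites-only VCF whose INFO column carries the standard `AC` (allele count) and `AN` (allele number) keys; the function names, the 0.001 threshold, and the two sample records are all hypothetical and are not taken from ExAC.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'AC=5;AN=100' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def rare_variants(vcf_lines, max_af=0.001):
    """Yield (chrom, pos, ref, alt, frequency) for alleles at or below max_af."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        columns = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, info = (columns[0], columns[1], columns[3],
                                       columns[4], columns[7])
        fields = parse_info(info)
        allele_number = int(fields["AN"])
        # AC holds one count per alternate allele, in the same order as ALT.
        counts = [int(count) for count in fields["AC"].split(",")]
        for alt, count in zip(alts.split(","), counts):
            frequency = count / allele_number if allele_number else 0.0
            if frequency <= max_af:
                yield chrom, int(pos), ref, alt, frequency


# Hypothetical records for illustration only; these are not real ExAC entries.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\tPASS\tAC=5;AN=100000",
    "1\t2000\t.\tC\tT,G\t.\tPASS\tAC=5000,1;AN=100000",
]

for record in rare_variants(example_vcf):
    print(record)
```

To run this over a downloaded file, you would pass an open file handle instead of the list, for example `gzip.open(path, "rt")` for a compressed VCF.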

However, it should be remembered that the ExAC database is not a collection of variants from phenotypically healthy individuals. The studies included in ExAC have some individuals diagnosed with schizophrenia, diabetes, heart disease, and inflammatory bowel disease (to name just a few). And of course even ‘healthy’ individuals in population sequencing studies can be identified as having disease retrospectively, when a detailed look at their exome variant profile leads to a closer, targeted clinical or biochemical inspection. ExAC is also at this stage only a beta version, with some changes between the variants contained in recent releases. However, without a doubt, ExAC is big and easy to use, and it’s the best there is. Use freely and often, but with a bit of caution.

07. December 2012 · 2 comments · Categories: Software

Out today in Bioinformatics is an applications note describing our Olorin software. Olorin is an easy-to-use tool for filtering variants identified by high-throughput family sequencing studies. Using Olorin, variants can be prioritized based on haplotype sharing across selected individuals in a pedigree as well as many other measures such as predicted functional consequence and population frequency.

More »

23. April 2012 · Comments Off on Manual genotype calling in Evoker · Categories: Software

Version 2.2 of the Evoker software has just been released through SourceForge and is available to download here.

This is a major new release of Evoker, as it includes an important new feature: users can now manually recall the genotypes of any marker in a dataset. A number of other general improvements and bug fixes are also included in this latest release.

Genotype calling is an automated process that can produce errors. Below is an example of how such errors appear when loaded in Evoker.
[Figure: poorly called marker]
Using the latest version of Evoker, it is now possible to correct such errors with manual calling.

Here is a step-by-step guide to manually recalling such a marker:

More »