Thomas Evans and Laurence Ettwiller of New England Biolabs don’t hesitate to answer that question in their recent paper — it’s right in the title: “DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification”. Indeed, the word “variant” is used 88 times in the paper, including hard-to-decipher phrases like “Variants originating from real in-vivo variants”, because it’s used to mean at least three different things:

  1. a germline variant is a position in an individual’s germline DNA that is different from the reference genome sequence
  2. a somatic variant is a position in a somatic cell’s DNA that is different from that individual’s germline sequence
  3. a sequence read variant is a position in a specific DNA sequencing read that is different from the reference genome sequence, which I’ll call an observed non-reference allele

Much of the interest in this paper I’ve seen on Twitter (and my own spit-take when I saw it as a preprint just after submitting a low-coverage sequencing paper) arises from the potential effects on those first two definitions, which are the foundation of human genetics and cancer genetics, respectively.

The paper begins with a neat observation that some types of DNA damage produce an imbalance of non-reference alleles between the first and second reads of paired-end sequencing experiments (Figure 1). They go on to build a quantitative DNA damage score and show that it captures experimentally induced DNA damage, that a “repair enzyme cocktail” made by New England Biolabs can successfully reverse this damage, and that many public datasets, including the 1000 Genomes Project and The Cancer Genome Atlas, are affected by this problem.
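
To make the idea concrete, here is a toy sketch (my own illustration, not the authors’ implementation) of a read-pair imbalance score: for one substitution type, say G→T, compare the frequency of the non-reference allele on first reads against second reads. Because library-prep damage like 8-oxoguanine affects the two strands asymmetrically, a damaged library shows a lopsided ratio:

```python
def imbalance_score(r1_variant, r1_total, r2_variant, r2_total):
    """Toy damage-imbalance score for one substitution type (e.g. G->T).

    Ratio of the non-reference allele frequency on first reads to that
    on second reads. A score near 1 is consistent with no strand-specific
    damage; a large deviation suggests errors introduced during sample
    preparation rather than true variants.
    """
    f1 = r1_variant / r1_total  # non-reference allele frequency, read 1
    f2 = r2_variant / r2_total  # non-reference allele frequency, read 2
    return f1 / f2

# Undamaged library: G->T appears at similar rates on both reads.
print(imbalance_score(100, 1_000_000, 98, 1_000_000))   # ~1.02

# Damaged library: damage-induced errors pile onto one read of the pair.
print(imbalance_score(500, 1_000_000, 100, 1_000_000))  # 5.0
```

The counts here are invented for illustration; the paper’s actual score is computed per substitution type across a whole sequencing run, but the intuition is the same.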

Crucially, the authors show that this type of DNA damage is stochastic and uncorrelated with local DNA context, so the observed non-reference alleles it causes don’t all pile up at the same position in the genome. This means the issue is potentially very important for detecting somatic variants present in only a small fraction (Figure 3 suggests less than 5%) of cells in a tumor, but pretty unimportant for analysis of germline variants. The authors make this point nicely:

Therefore, the identification of low-frequency variants—e.g., somatic variants—would be affected by damage, whereas variants present at higher frequency—e.g., germline variants— would be unaffected.
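
A back-of-the-envelope binomial calculation shows why uncorrelated errors behave this way (my numbers are illustrative assumptions, not the paper’s). A germline heterozygote at 50× coverage is expected to be supported by roughly 25 reads, a pile that random damage essentially never produces at a single site; a somatic variant at a few percent allele fraction may be supported by only two or three reads, which damage can mimic:

```python
from math import comb

def prob_at_least_k(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p): the chance that at least k of
    n reads covering one site carry the same damage-induced error."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

coverage = 50
damage_rate = 1e-3  # assumed per-read error rate from damage (illustrative)

# Chance damage fakes a germline het (~25 supporting reads): effectively zero.
print(prob_at_least_k(coverage, damage_rate, 25))

# Chance damage produces the 2+ supporting reads a sensitive somatic
# caller might act on: rare per site, but genomes have billions of sites.
print(prob_at_least_k(coverage, damage_rate, 2))
```

Per site the two-read event is still rare, but multiplied across a whole genome it yields a steady stream of false low-fraction candidates, while the germline-scale pile-up never happens.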

This straightforward statement of the problem becomes potentially confusing because human geneticists use the phrase “low-frequency variant” to mean a germline variant present at low frequency among the many individual humans in a population. This ambiguity between groups of researchers isn’t the authors’ fault, but they exacerbate the problem by repeatedly (abstract, early results, conclusion) noting that data from the 1000 Genomes Project are affected by this type of DNA damage. Indeed, Figure 2 makes it clear that they have discovered a cause of observed non-reference alleles, but one that is probably not very relevant to false-positive germline variant calls. The Editor’s summary of the paper also blurs this distinction:

Large-scale sequencing studies have set out to determine the low-frequency pathogenic genetic variants in individuals and populations. However, Chen et al. demonstrate that many so-called low-frequency genetic variants in large public databases may be due to DNA damage.

I think this paper makes an important contribution to understanding NGS errors. More careful choice of nomenclature (and certainly using a word other than “variant” every now and then) would have helped to make it clearer which types of analyses are most dramatically affected.


  1. 8-oxoguanine

    This paper is completely recapitulating a publication from 2013* that observed the exact same artifact mode and presents an identical solution. How the utter lack of novelty of this publication snuck by Science(!) editors and reviewers is beyond me.

    The alarmist claims of this paper that all cancer sequencing data ought to be called into question are absurd, since large cancer genomics studies (e.g. TCGA) have filtered this artifact since its initial discovery five years ago. Only by intentionally going back to raw sequencing data and making no effort to run well-known, longstanding best practice pipelines (as the authors of this paper do) does this artifact rear its ugly head. For newcomers to the field, these best practice filtration/curation processes are well-documented in most publications’ supplementary methods.

    * “Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation”, Costello et al.,

    • 8-oxoguanine

      Typo — that should read “longstanding best practice pipelines (as the authors of this paper *fail to* do).”