This web page was produced as an assignment for an undergraduate course at Davidson College.

Anthony Ciancone's Genomics Second Assignment Page

DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification

Lixin Chen, Pingfang Liu, Thomas C. Evans Jr.,* Laurence M. Ettwiller*


The authors of this paper argue that DNA damage, although it occurs at a low rate, is present in genomic libraries and can be mistaken for somatic mutations. Point mutations in large-scale data sets are identified by deep sequencing and analysis, but low-frequency point mutations show up at the same detection threshold as point DNA damage, so the two are hard to tell apart.



They looked first at the global imbalance between variants detected on read 1 (R1) and read 2 (R2) of paired-end sequencing. They used a Global Imbalance Value (GIV) to quantify this; a larger imbalance indicates more DNA damage. A GIV above 1 suggests damaged DNA (the figures use 1.5 as the cutoff). They prepped DNA containing oxidative damage in the form of 8-oxo-dG lesions, which result in G-to-T transversions after amplification. They also repaired DNA with an enzyme cocktail that is standard in labs. They then confirmed their methodology with tests on publicly available genomes.
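To make the GIV idea concrete, here is a minimal sketch in Python. It assumes the GIV for a variant type is simply the ratio of R1 counts to R2 counts for that type; the paper's exact normalization may differ. The function name `giv_score`, the pseudocount, and the counts are all hypothetical and for illustration only.

```python
from collections import Counter

def giv_score(r1_counts, r2_counts, variant, pseudocount=1.0):
    """Ratio of R1 to R2 counts for one variant type, e.g. "G>T".

    Assumption: GIV = R1 count / R2 count. The pseudocount is an
    illustrative guard against division by zero, not part of the paper.
    """
    return (r1_counts[variant] + pseudocount) / (r2_counts[variant] + pseudocount)

# Hypothetical variant counts tallied separately on R1 and R2 reads.
r1 = Counter({"G>T": 900, "C>A": 300, "A>G": 500})
r2 = Counter({"G>T": 300, "C>A": 900, "A>G": 500})

for variant in ("G>T", "C>A", "A>G"):
    print(variant, round(giv_score(r1, r2, variant), 2))
```

In this toy example the G-to-T score is well above 1 and the complementary C-to-A score is well below 1, while the balanced A-to-G type sits at 1, which is the pattern the summary describes for oxidative damage.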

1000 GP and TCGA

They looked at the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) and found significant G-to-T damage in both, suggesting that up to one third of the identified variant reads are actually DNA damage and not mutations.

They then ran the same oxidative damage and GIV analysis on what appears to be their own data set and on DNA captured with a cancer panel probe. They found that most of the very low frequency variant reads were actually damaged DNA, which was confirmed by enzymatic repair.

They report finding 180 false positives, or about one false-positive call per cancer gene.

Varscan/TCGA Data Set

Varscan is the analysis tool used to identify somatic variants in TCGA tumors. An excess of one mutation type suggests DNA damage, similar to the logic behind GIV. Most of the public data sets showed an excess of G-to-T, especially the ones predicted by the program to be highly damaged.
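The "excess of one mutation type" idea can be sketched as follows. This assumes variant types are tallied in read orientation on R1 reads only, so that genuine variants would contribute about equally to G-to-T and its reverse complement C-to-A. The excess formula and the name `complement_excess` are illustrative assumptions of mine, not the authors' estimator, and the counts are made up.

```python
def complement_excess(counts, variant="G>T", complement="C>A"):
    """Fraction of `variant` calls that exceed the complementary type.

    Assumption: without damage, `variant` and `complement` counts on R1
    would be roughly equal, so any surplus is attributed to damage.
    """
    v = counts.get(variant, 0)
    c = counts.get(complement, 0)
    if v == 0:
        return 0.0
    return max(v - c, 0) / v

# Hypothetical somatic calls supported by R1 reads.
r1_calls = {"G>T": 400, "C>A": 100, "A>G": 250, "T>C": 240}
print(complement_excess(r1_calls))  # 0.75
```

Here three quarters of the G-to-T calls are in excess of what the C-to-A count would predict, which under the stated assumption would flag them as likely damage artifacts.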

They estimated the false positive rate to be 50% in 78% of tumors analyzed. These false positives strongly correlated with DNA damage, suggesting that previously reported somatic variants may be confounded by damage.

Lung Adenocarcinoma TCGA Data Set

They downloaded publicly available lung adenocarcinoma (LAC) TCGA data and looked for damage. They split the data sets into low-to-moderately damaged and highly damaged groups. The highly damaged set contained a moderate increase in expected damage. The Mutect2 calls, however, contained significant damage (9%) in the form of either G-to-T or C-to-A changes.


This piece was a real eye-opener because I too often assume that methodologies are correct simply because researchers in a field collectively agree on them. This paper offers substantial evidence that two widely used, publicly available genome data sets contain significant DNA damage. There may be a real problem with confusing DNA damage for specific somatic variants, which complicates a lot of research already done on these subjects.

I do have some questions I would ask about the authors’ findings, however. For instance, they claim that a significant portion of the damage is due to DNA oxidation. Could a similar oxidative mechanism also be the cause of the mutations in the first place? Perhaps all this shows is that DNA is very susceptible to oxidative damage, inside or outside of the body. Also, how do they know that 8-oxo-dG behaves the same way in vitro as it does in the body? The authors claim that an enzymatic cocktail can repair a lot of this DNA damage. How do they know that this “repairing” does not also erase actual mutations, considering they (as far as I can tell) do not know the exact mechanism by which it operates? In their defense, they do state that their stringent criteria for DNA damage may lead to false negatives for actual somatic mutations.

Overall though, this paper was an interesting read and provides more evidence that the human genome is complicated and that there is still much work to be done in improving the methodologies behind analyzing it. The next logical step might then be to test whether different types of DNA damage are caused by sample preparation.


Figure 1:

A: The figure is split into two panels, both showing a flow chart of the paired-end sequencing technique used to validate their work. The left panel depicts what oxidative damage would look like, whereas the right panel shows what true somatic mutations would look like. They measure the imbalance in base calls between the R1 and R2 reads, which is the basis for their GIV score and quantification of DNA damage.

B: Like figure 1A, this figure is split into two parts, with the left showing data for G-to-T variants on R1 and R2 reads and the right showing the complementary C-to-A variants on the same reads. In the left graph, R1 reads without enzymatic repair show a higher frequency of G-to-T variants, while R2 reads show no G-to-T imbalance. In the right graph, R2 reads without enzymatic repair show a higher frequency of C-to-A variants, and this imbalance is not present for R1 reads. The authors say that this complementarity is what DNA damage should look like with paired-end sequencing.

Figure 2:

Parts A and B show very similar things, so I will discuss them together. A and B are both vertical box and whisker plots which show the log2 GIV score for all twelve nucleotide substitution types (e.g. G-to-T, G-to-A, C-to-A, etc.). Each point on both graphs shows a single GIV score for a sequencing read of 5 million base pairs. Part A depicts data from the 1000 Genomes Project; B a subset from TCGA. The authors have drawn a black line where a GIV score of 1.5 would be, indicative of DNA damage. For both data sets, the C-to-A and G-to-T GIV scores deviate strongly from balance, in both the high and low directions, indicating damaged DNA. In part B, the authors also note that enzymatically repaired DNA was included in the GIV score calculations, explaining the bimodal distribution of C-to-A and G-to-T scores, with areas both above and below the 1.5 GIV score threshold.
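Since the plots are on a log2 scale while the cutoff is quoted as a raw GIV of 1.5, a short sketch of the conversion may help. The threshold of 1.5 comes from the figure description above; the function name `is_damaged` and the run values are hypothetical.

```python
import math

GIV_THRESHOLD = 1.5  # cutoff line described for Figure 2

def is_damaged(giv, threshold=GIV_THRESHOLD):
    """Return True when a GIV score falls above the damage threshold.

    Comparing on the log2 scale mirrors how the figure is plotted;
    it is equivalent to comparing the raw scores directly.
    """
    return math.log2(giv) > math.log2(threshold)

# Hypothetical GIV scores for three sequencing runs.
runs = {"run_a": 1.0, "run_b": 1.6, "run_c": 4.0}

print("threshold on log2 scale:", round(math.log2(GIV_THRESHOLD), 3))
for name, giv in runs.items():
    print(name, round(math.log2(giv), 2), is_damaged(giv))
```

A balanced run (GIV of 1) sits at 0 on the log2 axis, and the 1.5 cutoff sits at about 0.585, so points only slightly above zero on the plot are not yet counted as damaged.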

Figure 3:

A: This was data from the enrichment experiment, which used a commercial cancer panel probe to get an accurate read of 151 cancer genes. Part A looks at the G-to-T variant frequency of R1 and R2 reads at different base positions, comparing samples treated with the repairing enzymatic cocktail against untreated samples. Unrepaired R1 reads appear to have a higher G-to-T variant frequency than any other group of reads.

B: Four paired bar graphs, each showing the relative distribution of all 12 nucleotide variants in repaired and unrepaired reads. Each bar is split into the relative portions of these variants per megabase. The four graphs group the variants by how rare they were in the experiment, with variant frequency increasing from left to right. For variants showing up less than 1% or between 1-5% of the time, unrepaired DNA showed significantly more G-to-T and C-to-A variants across all reads than repaired DNA. For more common variant frequencies, this difference was not detected.

C: The same general graphs as in B but for R1 reads only. Here only G-to-T variants show a significantly increased proportion in unrepaired DNA, and only for the rarer variants.

Figure 4:

A: Here the researchers were looking at the TCGA data set for somatic variants using Varscan, a popular data analysis tool. The graph shows all 1800 sequencing runs analyzed with Varscan, ordered by increasing GIV score for G-to-T imbalances specifically. Most data sets went over the 1.5 GIV score threshold, suggesting widespread DNA damage.

B: This panel confirms the data presented in part A by breaking down the fraction of each type of mutation present. For R1 reads, G-to-T is more common in the samples than C-to-A and every other type of mutation. This includes all the reads/data.

C: The same as B except for high confidence samples only. These samples were already noted by the researchers’ algorithm to be highly damaged.

D: The same as C with R1 reads, except looking only at germline variants called with Varscan. No significant DNA damage is noted.

E: The researchers estimated the false-positive discovery rate of somatic variants as a function of the G-to-T GIV score. They found a strong correlation of 0.79, meaning false positives tracked the estimated damage reasonably well.
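For readers unsure what a correlation of 0.79 measures, here is a small sketch of a correlation coefficient computed by hand. The summary does not say which coefficient the authors used, so Pearson's r is an assumption here, and the GIV scores and false-positive rates below are invented for illustration; they are not values from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-run values: G-to-T GIV score and estimated false-positive rate.
giv_scores = [1.0, 1.4, 2.1, 3.0, 4.2]
false_positive_rates = [0.05, 0.20, 0.35, 0.60, 0.70]

print(round(pearson_r(giv_scores, false_positive_rates), 2))
```

A value near 1 means runs with higher GIV scores consistently have higher estimated false-positive rates; 0.79 would indicate a strong but not perfect relationship.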


Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA Damage Is a Pervasive Cause of Sequencing Errors, Directly Confounding Variant Identification." Science 355 (2017): 752-56. Web. 27 Apr. 2017.


© Copyright 2017 Department of Biology, Davidson College, Davidson, NC 28035