This web page was produced as an assignment for an undergraduate course at Davidson College.
DNA damage is a pervasive cause of sequencing errors, directly
confounding variant identification.
Summary:
“DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification” is a study published in Science showing that DNA damage is the direct cause of most erroneous identifications of somatic variants and other low-frequency variants (Chen, et al., 2017). Another study, looking at the sequencing of high-quality human genomic DNA, had shown that certain library preparation steps cause oxidative damage. Chen et al. wanted a method to measure the DNA damage that occurs in sequencing runs. Mutagenic damage results in a global imbalance between the variants found in read 1 (R1) and read 2 (R2) during paired-end sequencing, so the researchers needed a way to measure this imbalance. They “devised an analysis strategy based on this imbalance to deconvolute both the origin and orientation of variants and computed a metric, the Global Imbalance Value (GIV) score, that is indicative of damage” (Chen, et al., 2017). Using the GIV score, the researchers concluded that widely used data sets such as The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project contain widespread damage that is directly linked to sequencing errors. Furthermore, these sequencing errors affect the identification of somatic variants in these data sets.
Explanation of figures:
Fig 1. GIV score. (replicated Chen, et al., 2017).
This first figure does not ask a question; instead it supports the principle behind the GIV score. Figure 1A illustrates Illumina paired-end sequencing. This technique reads the two ends of the same DNA molecule, known as the “paired ends”: one end is sequenced, then the molecule is turned around and the other end is sequenced. The two resulting sequences are called the “paired-end reads.” Oxidative damage affects only one base of a pair and leads to an excess of G-to-T transversion errors in Read 1 (R1). When R1 reads carry an excess of G-to-T variants, R2 reads carry an excess of C-to-A transversion errors (the reverse complement of G-to-T). The GIV score is important because it measures this imbalance, which indicates DNA damage. Next to the diagram of DNA damage, Figure 1A shows true variation, which does not cause an imbalance. The take-home message of Figure 1B is that oxidative damage is the cause of the excess G-to-T variants in the unrepaired DNA samples.
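The R1/R2 relationship described above can be illustrated with a short Python sketch. It maps a substitution seen on one strand to the substitution that appears on the opposite strand by complementing both bases. The function name and the "G>T" notation are illustrative choices for this page, not something taken from Chen et al.

```python
# Complement of each DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def opposite_strand_substitution(substitution: str) -> str:
    """Return the substitution as it appears on the opposite strand.

    `substitution` uses the illustrative form "G>T" (reference base > variant base).
    For a single base, the reverse complement is simply the complement.
    """
    ref, alt = substitution.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"


# The 12 possible single-nucleotide substitution classes.
ALL_CLASSES = [f"{r}>{a}" for r in "ACGT" for a in "ACGT" if r != a]

if __name__ == "__main__":
    # Print each class next to its opposite-strand counterpart.
    for cls in ALL_CLASSES:
        print(cls, "<->", opposite_strand_substitution(cls))
```

In this notation, G>T pairs with C>A, which is why an excess of G-to-T in R1 shows up as an excess of C-to-A in R2.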
Fig 2. GIV scores (y axis) for the 12 nucleotide substitution classes (x axis). (replicated Chen, et al., 2017).
The question being asked was how widespread the damage that leads to an excess of G-to-T variants is in The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project data sets. In the GIV analysis, a GIV score above 1.5 is defined as damaged, while a GIV score of 1 is defined as undamaged. The 1000 Genomes Project data set reveals widespread damage producing erroneous G-to-T variants (GIV_G_T score ≥ 1.5). The TCGA data set shows the most G-to-T imbalance (GIV_G_T score ≥ 2). A GIV score above 1.5 means that “there are 1.5 times more variants on R1 than on R2, suggesting that at least one-third of the variants are erroneous” (Chen, et al., 2017). The researchers concluded that in most public data sets at least one-third of the G-to-T variant reads are erroneous and result from damage.
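The arithmetic behind the quoted “one-third” can be checked with a minimal sketch. It assumes, as a simplification, that the GIV score for one substitution class is the ratio of variant counts on R1 to variant counts on R2, and that the R2 count approximates the number of true variants. The function names are hypothetical and the counts are made up for illustration.

```python
def giv_score(r1_variants: int, r2_variants: int) -> float:
    """Simplified GIV score: variants seen on R1 divided by variants seen on R2."""
    if r2_variants <= 0:
        raise ValueError("r2_variants must be positive")
    return r1_variants / r2_variants


def min_erroneous_fraction(giv: float) -> float:
    """Estimate the fraction of R1 variants that are in excess of R2.

    If R1 = giv * R2 and R2 approximates the true variant count, the excess is
    (giv - 1) * R2 out of giv * R2 variants on R1, i.e. (giv - 1) / giv.
    A score of 1 or less means no R1 excess under this assumption.
    """
    if giv <= 1.0:
        return 0.0
    return (giv - 1.0) / giv


if __name__ == "__main__":
    score = giv_score(150, 100)  # made-up counts
    print("GIV score:", score)
    print("Estimated erroneous fraction of R1 variants:", min_erroneous_fraction(score))
```

With 150 variants on R1 and 100 on R2, the score is 1.5 and the excess is 50 of the 150 R1 variants, which is the one-third in the quotation. A score of 2 would correspond to one-half under the same assumption.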
Fig 3. Target enrichment experiment. (replicated Chen, et al., 2017).
Their data showed that the damage resulting in G-to-T transversions is random. Because the errors are random, they should appear at low allelic frequency. Somatic variants tend to be low-frequency variants, while germline variants occur at higher frequencies. The researchers therefore tested how the damage affects somatic variants, since germline variants should be unaffected. They did this by repeating the oxidative damage experiments using common library preparation procedures. Figure 3A shows that the somatic variant frequency of G-to-T transversions in R1 is higher without DNA repair than with DNA repair. The only difference between 3B and 3C is that 3C includes variant frequencies for R1 only, which is more G-to-T specific than including both R1 and R2. These two figures demonstrate that over 75% of the G-to-T variant positions at the lower frequencies can be removed by DNA repair. This supports the idea that those positions were erroneous and a result of oxidative damage. These data show that DNA damage directly impacts the accuracy of identifying somatic variants, which occur at very low and low-to-moderate frequencies.
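The “fraction removed by DNA repair” comparison can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the authors' pipeline: the positions, frequencies, and the 0.05 frequency cutoff are all made up, and the function name is hypothetical.

```python
def fraction_removed_by_repair(unrepaired, repaired_positions, max_frequency):
    """Fraction of low-frequency variant positions that vanish after repair.

    unrepaired: dict mapping position -> variant frequency in the unrepaired library
    repaired_positions: set of positions still called in the repaired library
    max_frequency: only positions at or below this frequency are considered
    """
    low_freq = [pos for pos, freq in unrepaired.items() if freq <= max_frequency]
    if not low_freq:
        return 0.0
    removed = [pos for pos in low_freq if pos not in repaired_positions]
    return len(removed) / len(low_freq)


if __name__ == "__main__":
    # Toy G-to-T calls (made-up positions and frequencies).
    unrepaired_calls = {101: 0.01, 205: 0.02, 330: 0.015, 412: 0.03, 578: 0.45}
    still_called_after_repair = {412, 578}
    print(fraction_removed_by_repair(unrepaired_calls, still_called_after_repair, 0.05))
```

In this toy input, three of the four low-frequency positions disappear after repair, while the high-frequency position at 578 is excluded from the calculation, mirroring the expectation that germline-like variants are unaffected.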
Fig 4. Variants identified in TCGA data sets. (replicated Chen, et al., 2017).
The next question the researchers looked at was “the extent that damage affects somatic variant calls in cancer samples [using] Varscan, popular analysis tool, to identify germline and somatic variants for all TCGA tumor samples with matched tumor-normal pairs” (Chen, et al., 2017). Drawing on the concept in Figure 1A (true variation leads to no imbalance), the researchers examined the global balance of somatic mutation calls between R1 and R2 reads. The imbalance is easy to see, and it grows with the G-to-T damage level. In addition, the fraction of G-to-T variants increased with the GIV_G_T score-based damage level, unlike the other variant classes. Figures 4C and 4D contrast with each other. Figure 4D shows no excess in R1 reads because its G-to-T variants are germline variants (which are unaffected because they occur at higher frequency). Figure 4C, on the other hand, shows an R1 excess among high-confidence G-to-T somatic variants (consistent with Figure 3). Figure 4E shows a positive correlation between the percentage of estimated false-positive somatic variants and the GIV_G_T score. The main take-home message of Figure 4 is that DNA damage correlates with false-positive variant calls. From this information, the researchers deduced that DNA damage causes erroneous identification of somatic variants.
Conclusions:
There are several major conclusions from this paper. In Figure 2, GIV scores revealed that both The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project have an excess of G-to-T variants caused by DNA damage. In Figure 3, target enrichment experiments suggest that damage-induced errors mostly affect the ability to find low-frequency variants, such as somatic variants. Perhaps even more important, DNA repair has been shown to almost completely remove the oxidative damage that occurs during common library preparation procedures. This result validated the GIV score as a measure of the damage that confounds somatic variant identification. Figure 4 examines high-confidence mutation calls in the TCGA data sets (Chen, et al., 2017). This paper points out that more attention needs to be paid to differentiating between true and artificial somatic variants in widely used data sets and in future scientific projects that will sequence samples. Detailed criteria must be put in place to reduce the impact of DNA damage on variant calling.
My Opinions of the Paper:
The main issue I had with this paper comes after Figure 4. Throughout the study, readers can see the data being discussed in the various figures. It seems a bit suspicious to me that the authors didn’t create a figure after evaluating how the damage affects current TCGA reference variant files. I’m guessing that the data sets predicted to have heavily damaged variants were not nearly as visually pleasing as the weakly damaged data sets. Nevertheless, the researchers concluded that damage artifacts appear even among high-confidence calls in large-scale variant calling. Overall, I’m glad that this study provided evidence that scientific protocols for differentiating between true and artificial somatic variants need to improve. However, I’m still not completely satisfied by their proposed solution, which feels like a “catch-22” situation. In attempting to further differentiate between true and artificial somatic variants, scientists could start to increase the false-negative rate. This makes me wonder which is the lesser of two evils: keeping erroneous variants caused by DNA damage, or creating variant-calling algorithms that could remove true variants?
References:
Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification." Science 355 (2017): 752-56. Web.
Email Questions or Comments: jomarshall@davidson.edu
© Copyright 2017 Department of Biology, Davidson College, Davidson, NC 28035