This web page was produced as an
assignment for an undergraduate course at Davidson College.
DNA damage is a pervasive cause of sequencing errors, directly
confounding variant identification
Article Summary:
In this article, Chen et al. examine the validity of the low-frequency
variants that have been identified in large databases (particularly The
Cancer Genome Atlas and the 1000 Genomes Project). They ask whether many
of these variants could be the result of oxidative damage to the DNA
during sequencing rather than real variants in the DNA. They look
specifically at G to T and C to A variants and use paired-end sequencing
to test two things: whether these variants are more common than other
nucleotide changes, and whether their counts are imbalanced between
read 1 and read 2 of the same sample. The paper concludes that this
imbalance is indicative of DNA damage and that many of the low-frequency
somatic variants in the genome databases are not real variants but a
result of improper sample treatment during sequencing.
Figure 1:
Figure 1. (A) Schematic of how DNA variants occur through oxidative
damage during sequencing. (B) Evidence showing that during paired-end
sequencing the
Figure 1 shows the overall idea behind the GIV score and how it relates
to DNA damage. The left column of Figure 1A shows how oxidative damage
produces errors in paired-end sequencing. In the example, G's within a
sequence are oxidized. After adaptor ligation and PCR amplification, the
polymerase pairs each damaged G with an A, so the base is ultimately
read as a T. Because the damage sits on only one strand, the error shows
up unevenly in read 1 and read 2, and it is then mapped and identified
as a false variant. This imbalance between the reads gives rise to the
article's GIV (Global Imbalance Value) score, which estimates the amount
of damage to the DNA from the amount of imbalance. The right side of the
figure depicts paired-end sequencing of undamaged DNA, which ends with no
imbalance between read 1 and read 2. Figure 1B shows the fraction of
G to T or C to A variants under conditions that either allow or do not
allow DNA repair. In the left panel, the conditions that do not allow
DNA repair show more G to T variants than those that allow it. Without
repair of the oxidative damage, the number of G to T variants increases
in only one of the reads, which indicates an imbalance. The right panel
of Figure 1B shows the same for C to A variants, again demonstrating an
imbalance between the two reads when the DNA is exposed to oxidative
damage.
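The read imbalance described for Figure 1 can be illustrated with a small tally. This is only a sketch: the mismatch records are invented toy data, and `count_by_read` is a hypothetical helper, not something taken from the paper.

```python
from collections import Counter


def count_by_read(mismatches):
    """Count each substitution type separately for read 1 and read 2."""
    counts = Counter()
    for read, ref_base, alt_base in mismatches:
        counts[(read, ref_base + ">" + alt_base)] += 1
    return counts


# Toy records, invented for illustration:
# (read number, reference base, observed base)
toy_mismatches = [
    (1, "G", "T"), (1, "G", "T"), (1, "G", "T"), (1, "A", "G"),
    (2, "G", "T"), (2, "C", "A"), (2, "C", "A"), (2, "C", "A"),
    (2, "A", "G"),
]

counts = count_by_read(toy_mismatches)
print("G>T in read 1:", counts[(1, "G>T")])
print("G>T in read 2:", counts[(2, "G>T")])
print("C>A in read 1:", counts[(1, "C>A")])
print("C>A in read 2:", counts[(2, "C>A")])
```

In this toy data, G to T changes pile up in read 1 and C to A changes pile up in read 2, while A to G appears equally in both reads, which is the kind of pattern the figure attributes to damage.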
Figure 2:
Figure 2. (A) GIV score data calculated from the 1000 Genomes Project.
(B) GIV score data from The Cancer Genome Atlas.
In Figure 2, Chen et al. show that C to A and G to T changes are
low-frequency variants that occur in both The Cancer Genome Atlas and
the 1000 Genomes Project databases. A GIV score greater than 1.5 is
indicative of DNA damage, and the GIV scores for G to T and C to A
changes are greater than 1.5 in the data from both databases. This
supports their hypothesis that oxidative damage during sequencing could
be the cause of these G to T and C to A variants.
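As a rough illustration of how such a cutoff could be applied, the sketch below assumes that the GIV score for G to T is simply the G to T variant frequency in read 1 divided by the frequency in read 2. That definition, the function names, and the counts are assumptions made for illustration, and the paper's exact calculation may differ. Only the 1.5 cutoff comes from the text above.

```python
DAMAGE_THRESHOLD = 1.5  # cutoff quoted in the summary above


def giv_score(read1_variants, read1_bases, read2_variants, read2_bases):
    """Assumed definition: read 1 variant frequency over read 2 variant frequency."""
    if read1_bases <= 0 or read2_bases <= 0:
        raise ValueError("base counts must be positive")
    if read2_variants <= 0:
        raise ValueError("read 2 needs at least one variant to form a ratio")
    # (v1 / b1) / (v2 / b2), rearranged into a single division
    return (read1_variants * read2_bases) / (read2_variants * read1_bases)


def looks_damaged(score, threshold=DAMAGE_THRESHOLD):
    """True when the imbalance exceeds the damage cutoff."""
    return score > threshold


# Made-up counts: 300 G>T calls in read 1 and 100 in read 2,
# each out of one million sequenced bases.
score = giv_score(300, 1_000_000, 100, 1_000_000)
print(score, looks_damaged(score))
```

Under this assumed definition, a balanced sample gives a score near 1, so a cutoff of 1.5 flags only samples where one read carries clearly more of the variant than the other.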
Figure 3:
Figure 3. (A) G to T variant frequencies with and without DNA repair,
for both read 1 and read 2. (B) Number of variants per Mb, divided by
nucleotide-to-nucleotide variant type and percent frequency. (C) Same
data as part B, but including only read 1.
In Figure 3, Chen et al. demonstrate that the frequency of G to T
variants differs between samples with and without DNA repair after
oxidative damage. Figure 3A shows that read 1 contains more G to T
variants when DNA repair is disabled than when it is allowed. This
imbalance between the reads supports the idea that sequencing
oxidatively damaged DNA causes false variants. Based on Figure 1, it is
implied that read 2 would show C to A variants, but these data are not
shown. Figure 3B highlights the higher number of G to T and C to A
positions per megabase compared with other types of nucleotide
variants. The different panels also show that the number of G to T and
C to A variants increases at lower variant frequencies. Figure 3C
focuses on the data from read 1 alone, showing that G to T variants are
more frequent there than when the reads are considered together. The
point is not just that G to T variants are more common overall, but
that they are more common in one read than in the other, an imbalance
that the authors hypothesize results from oxidative damage.
Figure 4:
Figure 4. (A) Tumor data from The Cancer Genome Atlas (TCGA) show a large range of GIV scores for G to T damage. (B) Number of somatic variants of each nucleotide-to-nucleotide type for various TCGA data sets, ordered by increasing amount of damage. (C) Same as B but using only high-confidence variants from VarScan. (D) Same as B but using only germline variants. (E) Estimated false variant calls for G to T variants based on GIV score.
In Figure 4, Chen et al. set out to demonstrate that oxidative damage
affects these large-scale databases. Figure 4A shows a range of GIV
scores, meaning that the amount of damage is not consistent across data
sets: some may contain many false positives and others few. This
inconsistency calls into question the reliability of the G to T and
C to A variants in these data sets. Figure 4B focuses on read 1 data
from somatic cells and shows the VarScan score of each variant type,
with the data sets in the same arrangement as in Figure 4A. The VarScan
score is indicative of the percentage of somatic variants for that
mutation type. This panel highlights that G to T and C to A variants
make up a large percentage of the somatic variants found in this
database. Figure 4C narrows the data from Figure 4B to high-confidence
variants only. Here the G to T and C to A variants are much less
numerous than among the total variants (although significance is not
discussed). In other words, among variants with higher confidence there
are fewer G to T and C to A variants, which suggests that these calls
are less affected by damage. Figure 4D is the same type of depiction
but for germline variants only, and it shows that G to T and C to A
variants are more frequent among somatic variants. Figure 4E shows the
G to T variants ordered by GIV score, along with the percentage of the
variants found in The Cancer Genome Atlas that are valid. Most of the
data is clustered around 70%, which suggests that a substantial share
of the low-frequency variants in large databases such as The Cancer
Genome Atlas could be invalid.
Conclusions/My Opinion:
Overall, this paper questions the accuracy of large genome databases
with regard to low-frequency variants, because the conditions under
which a genome is sequenced are not tightly regulated before the data
enter a database. The paper begins with the authors' own method and
emphasizes the need for the GIV score. In Figure 1, Chen et al. ask
whether oxidation affects the results of sequencing. By comparing
read 1 and read 2 from paired-end sequencing of DNA that is allowed to
repair and DNA whose repair is disabled, they conclude that oxidative
damage does affect G to T and C to A variants. In Figure 2, Chen et al.
look at two databases, the 1000 Genomes Project and The Cancer Genome
Atlas, and ask whether G to T and C to A variants are more common in
DNA whose GIV score is indicative of damage. The data show that this
holds for both databases. From this they go on to ask whether G to T
and C to A variants are simply more common or whether a sequencing
error is causing an imbalance between the reads. They measure the
number of G to T variants under similar conditions, with DNA repair
allowed in one and disabled in the other, and then determine whether
the reads are imbalanced. The data show that they are. They also
compare G to T and C to A variants with other variant types and by
variant frequency, showing that G to T and C to A variants are more
common than other types among the low-frequency variants, and even
more common in one read than in the other, which further suggests
imbalance. This supports their use of GIV scores and variant
frequencies in Figure 2. Finally, in Figure 4 they focus on data from
The Cancer Genome Atlas to determine whether these variants are more
common in somatic cells, whether they are just as common in data
considered higher confidence, and how this relates to the validity of
low-frequency variants. They conclude that oxidative damage is common,
at varying levels, throughout the data currently in these databases,
and that this lowers confidence in the low-frequency variants.
I think this article makes a very good point: we should not assume that
all of the data in these databases are valid, and there is a need for
more standards for the sequencing data that go into them. The paper
provides a substantial amount of evidence, but not a single figure
shows statistical significance for the data. This made me question how
valid their data are, or whether they arranged the figures to show what
they wanted. I also think they could ask whether effects other than
oxidative damage could skew sequencing data during sample preparation;
another direction could be looking at more ways that sequencing errors
can occur. Another issue I had was with the Figure 4 claims about the
difference between somatic and germline cells. If the cause is
oxidative damage during DNA sequencing, then why are there not a lot of
G to T and C to A variants in germline cells? The figure shows that
there are still G to T and C to A variants in germline cells, but not
many. By my own logic, if this truly were a sequencing error, it would
occur in both somatic and germline cells and affect both types equally.
Seeing significance would have been really helpful on this figure as
well. I also did not like that Figure 4E shows a predicted number of
somatic variants that are false positives. The whole paper is trying to
show that these variants are false positives and that the databases
contain invalid variants, so I would prefer to see real data rather
than the authors' own estimates that support their claims.
References:
Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification." Science 355 (2017): 752-56. Web.
Email Questions or Comments: sthautamaa@davidson.edu
© Copyright 2017 Department of Biology,
Davidson College, Davidson, NC 28035