This web page was produced
as an assignment for an undergraduate course at Davidson College.
Helen Webster's Genomics
Home Page
DNA damage is a pervasive cause of sequencing errors, directly
confounding variant identification
Main Idea
To test whether mutagenic DNA damage causes sequencing inaccuracies and
spurious variation, Chen et al. created a standard score (the GIV score)
that measures imbalances in variant calls between sequencing reads. Public
sequencing datasets contain this damage-induced variation mixed in with
natural variation, and users of the data cannot easily tell which variants
were caused by mutagens. Mutagenic damage produces an imbalance in base
transversions between the reads of the two DNA strands, whereas natural
variation in the population does not. The researchers use this imbalance to
their advantage as a basis for tracking mutagenic damage, and therefore
sequencing errors, in these databases, and for estimating the damage present
in public data sets. Chen et al. conclude that the GIV score accurately
quantifies the mutagenic damage that confounds somatic variant calls, which
occur at very low frequencies. This affects a substantial portion of
My Opinion
I found this paper especially intriguing given the integral role DNA
sequencing has played in the biological research I have conducted at
Davidson and in nearly every part of the work we read about in our genomics
course. It is incredible to me that mutagens could be so detrimental not
just physically to the DNA itself, but also to downstream findings and uses
of DNA reads and sequences. I think this paper did an impressive job of
establishing a much-needed scoring system for damage, one that would have
obvious benefits for improving public data sets. In addition, validating the
scoring system and then applying it to tumor samples thoroughly convinced me
of both the benefit of the GIV score and the credibility of the method.
Finally, the GIV score is a novel idea with a necessary application that I
think can quickly become useful to genomicists and bioinformaticists.
Figure 1
Figure 1A outlines the principle behind the GIV score. Mutagenic damage
causes an imbalance in transversions between the two reads because the base
change does not appear equally on both strands. The left side of Panel A
shows the transversion imbalance that results from sequencing mutagenically
damaged DNA. When the variant is a natural SNP (right side of the panel),
reads from both strands show the change equally. The degree of imbalance
due to damage is the basis of the GIV score. Figure 1B shows the fraction
of G-to-T transversions and the imbalance between reads 1 and 2.
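The imbalance idea in Figure 1 can be sketched as a simple ratio. This is only a minimal illustration and not the authors' implementation: it assumes the score compares the G-to-T variant frequency in read 1 with that in read 2, and the function name and the counts are hypothetical.

```python
def imbalance_ratio(r1_variant, r1_total, r2_variant, r2_total):
    """Ratio of the variant frequency in read 1 to that in read 2.

    Assumption: a balanced (undamaged) sample gives a ratio near 1,
    and damage pushes the ratio away from 1.
    """
    if r1_total <= 0 or r2_total <= 0:
        raise ValueError("total base counts must be positive")
    freq_r1 = r1_variant / r1_total
    freq_r2 = r2_variant / r2_total
    if freq_r2 == 0:
        raise ValueError("no variants observed in read 2; ratio is undefined")
    return freq_r1 / freq_r2


# Made-up counts, for illustration only.
damaged = imbalance_ratio(300, 1_000_000, 100, 1_000_000)   # read 1 has 3x the G-to-T calls
balanced = imbalance_ratio(100, 1_000_000, 100, 1_000_000)  # equal on both reads
print(damaged, balanced)
```

A natural SNP should look like the balanced case, while a damaged library should look like the first case.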
Figure 2
To estimate the amount of damage present in public DNA data sets, Chen et
al. calculated GIV scores for the 1000 Genomes Project data set (Figure 2A)
and a subset of the TCGA data set (Figure 2B). The gray line in both panels
marks a GIV score of 1.5; scores above it indicate damage, and scores below
it indicate no damage. Both data sets show widespread erroneous sequencing
calls, 30% of which were G-to-T variant reads. T-to-A and C-to-T variants
made up 0.5% and 3% of erroneous calls, respectively. Overall, Figure 2
shows that DNA damage is indeed present in public data sets and leads to
erroneous sequencing calls in at least 1/3 of G-to-T variant reads.
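The cutoff in Figure 2 amounts to a simple threshold test. The sketch below assumes a list of per-run GIV scores is already available; the scores shown are invented for illustration and are not taken from the 1000 Genomes Project or TCGA.

```python
DAMAGE_THRESHOLD = 1.5  # the cutoff marked by the gray line in Figure 2


def fraction_damaged(scores, threshold=DAMAGE_THRESHOLD):
    """Fraction of runs whose score is above the damage threshold."""
    if not scores:
        raise ValueError("need at least one score")
    flagged = sum(1 for s in scores if s > threshold)
    return flagged / len(scores)


# Hypothetical scores for five sequencing runs.
example_scores = [0.9, 1.1, 1.6, 2.4, 1.0]
example_fraction = fraction_damaged(example_scores)
print(example_fraction)
```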
Figure 3
Supplementary data showed that G-to-T transversions arise at random
positions, implying that they occur at low allelic fractions (as somatic
variants do, in contrast to high-frequency germline variants). Figure 3
examines how damage affects somatic variant identification. DNA repair
eliminates 82% of G-to-T and C-to-A variant positions in the low-frequency
groups (less than 1% and 1% to 5%), indicating that those positions are
erroneous calls caused by damage rather than true somatic variants. Damage
therefore produces false positives and directly confounds the
identification of variants in sequence reads.
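The comparison in Figure 3 can be pictured as counting, within each allelic-fraction group, how many variant positions disappear after DNA repair. The sketch below is a hypothetical illustration: the positions, allelic fractions, and the extra ">5%" group are assumptions made for the example, not data or categories from the paper.

```python
def frequency_bin(allelic_fraction):
    """Assign a variant to a frequency group (the '>5%' group is an assumption)."""
    if allelic_fraction < 0.01:
        return "<1%"
    if allelic_fraction <= 0.05:
        return "1-5%"
    return ">5%"


def fraction_eliminated(before, after_positions):
    """Per-group fraction of variant positions that are gone after repair.

    before: dict mapping position -> allelic fraction in the untreated sample
    after_positions: set of positions still called in the repaired sample
    """
    totals, removed = {}, {}
    for position, allelic_fraction in before.items():
        group = frequency_bin(allelic_fraction)
        totals[group] = totals.get(group, 0) + 1
        if position not in after_positions:
            removed[group] = removed.get(group, 0) + 1
    return {group: removed.get(group, 0) / totals[group] for group in totals}


# Invented positions and allelic fractions, for illustration only.
untreated = {101: 0.004, 250: 0.008, 377: 0.03, 512: 0.02, 900: 0.45}
repaired = {377, 900}
eliminated = fraction_eliminated(untreated, repaired)
print(eliminated)
```

Positions that vanish after repair are the ones most likely to have been damage artifacts, which is the reasoning behind the 82% figure above.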
Figure 4
Figure 4 sorts approximately 1800 tumor sequencing runs by G-to-T GIV
score. There are more G-to-T somatic variants than C-to-A variants, and that
fraction increases with increasing GIV score. In addition, panel 4D shows
that germline variants remain consistent across GIV scores. Estimated false
positives among somatic variants are strongly correlated with estimated
damage in these tumor samples, which supports using the GIV score to detect
high levels of damage and the resulting false positives in sequence reads.
Citation: Chen L, Liu P, Evans T,
Ettwiller L. (2017). DNA damage is a pervasive cause of sequencing errors,
directly confounding variant identification. Science 355, 752-756.
Email Questions or Comments: hewebster@davidson.edu
© Copyright 2016 Department of Biology,
Davidson College, Davidson, NC 28035