This web page was produced as an assignment
for an undergraduate course at Davidson College.
Anthony Ciancone's Genomics
Second Assignment Page
DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification
Lixin Chen, Pingfang Liu, Thomas C. Evans Jr.,* Laurence M. Ettwiller*
Abstract/Intro:
The authors of this paper argue that DNA is damaged at a low rate and that some of the damaged DNA they have identified in genomic libraries might be mistaken for somatic mutations. Point mutations in large-scale data sets are identified by deep sequencing and analysis, but the detection threshold for these low-frequency point mutations is the same as the level at which point DNA damage shows up.
Summary/Methods:
GIV
They looked first at the global imbalance of variants detected between the two reads of each pair (R1 and R2). They used a Global Imbalance Value (GIV) to quantify this; the greater the imbalance, the greater the DNA damage. A GIV > 1 points to damaged DNA; the figures draw the cut-off at 1.5. They prepared DNA carrying oxidative damage in the form of 8-oxo-dG, which results in G-to-T transversions after amplification. They also tried repairing DNA with an enzyme cocktail that is standard in labs. They confirmed their methodology by testing it on publicly available genomes.
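The imbalance idea above can be sketched in a few lines of Python. This is only an illustration, assuming the GIV for a substitution type is the ratio of how often that variant is seen on R1 to how often it is seen on R2. The function name `giv_score`, its parameters, and the example counts are hypothetical and not taken from the paper.

```python
def giv_score(r1_variants, r1_bases, r2_variants, r2_bases):
    """Ratio of the R1 variant frequency to the R2 variant frequency.

    Assumption: GIV is a simple frequency ratio, so a value near 1 means
    the variant type is balanced between the two reads of each pair.
    """
    if min(r1_bases, r2_bases) <= 0 or r2_variants <= 0:
        raise ValueError("need positive base counts and at least one R2 variant")
    r1_freq = r1_variants / r1_bases
    r2_freq = r2_variants / r2_bases
    return r1_freq / r2_freq


# Hypothetical counts: G-to-T seen three times as often on R1 as on R2.
damaged = giv_score(r1_variants=300, r1_bases=1_000_000,
                    r2_variants=100, r2_bases=1_000_000)
# Hypothetical counts: G-to-T seen equally often on both reads.
balanced = giv_score(r1_variants=100, r1_bases=1_000_000,
                     r2_variants=100, r2_bases=1_000_000)
print(damaged, balanced)
```

With these made-up counts the first score comes out near 3 (imbalanced) and the second near 1 (balanced).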
1000 GP and TCGA
They looked at the 1000 Genomes Project and The Cancer Genome Atlas and found significant G-to-T damage in both, suggesting that up to one-third of the identified variant reads are actually DNA damage and not mutations.
They then ran the same oxidative damage and GIV testing on what appears to be their own data set and on DNA captured with a cancer probe.
They found that most of the very low frequency variant reads were actually damaged DNA, which was confirmed by enzymatic repair.
They claim to have found 180 false positives, or about 1 false read per cancer gene.
Varscan/TCGA Data Set
Varscan is the analysis tool used to identify somatic variants in TCGA tumors. An excess of one mutation type suggests DNA damage, similar to GIV. Most of the public data sets showed an excess of G-to-T, especially the ones predicted by the program to be highly damaged.
They estimated the false positive rate to be 50% in 78% of the tumors analyzed. These false positives correlated strongly with DNA damage, suggesting that previously reported somatic variants may be confounded by damage.
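The "excess of one mutation type" check can be illustrated with a short Python sketch. It assumes variant calls are already available as (reference base, alternate base) pairs. The function name `substitution_fractions` and the toy calls are hypothetical, and this is not a description of how Varscan works internally.

```python
from collections import Counter


def substitution_fractions(calls):
    """Return the fraction of calls belonging to each substitution type.

    `calls` is an iterable of (ref, alt) single-base pairs, e.g. ("G", "T").
    """
    counts = Counter(f"{ref}>{alt}" for ref, alt in calls if ref != alt)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


# Toy, made-up calls in which G>T dominates the spectrum.
toy_calls = [("G", "T")] * 6 + [("C", "A")] * 2 + [("A", "G")] + [("T", "C")]
fractions = substitution_fractions(toy_calls)
most_common = max(fractions, key=fractions.get)
print(most_common, fractions[most_common])
```

A spectrum where a single substitution type takes up most of the calls, as G>T does in this toy input, is the kind of pattern the summary describes as a sign of damage.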
Lung Adenocarcinoma TCGA Data Set
They downloaded publicly available LAC-TCGA data and looked for damage. They split the data sets into a low-to-moderately damaged group and a highly damaged group. The highly damaged set contained a moderate increase in expected damage. The Mutect2 calls, however, contained significant damage (9%) in the form of either G-to-T or C-to-A changes.
Opinion:
This piece was a real eye-opener because I feel like I too often assume that methodologies are correct just because researchers in a field collectively agree on them; that agreement does not make them irrefutably correct. This paper offers substantial evidence that two of the widely used, publicly available data sets for genomes contain significant DNA damage. There may be a real problem with mistaking DNA damage for specific somatic variations, complicating a lot of research already done on these subjects.
I do have some questions I would ask about the authors’ findings, however. For instance, they claim that a significant portion of the damage is due to DNA oxidation. Could a similar oxidative mechanism also be the cause of the mutations in the first place? Perhaps all this shows is that DNA can be very susceptible to oxidative damage, inside or outside of the body. Also, how do they know that 8-oxo-dG operates the same way in vitro as it does in the body? The authors claim that an enzymatic cocktail can repair a lot of this DNA damage. How do they know that this “repairing” does not also erase actual mutations, considering they (as far as I can tell) do not know the exact mechanism by which it operates? In their defense, they do state that their stringent criteria for DNA damage may lead to false negatives for actual somatic mutations.
Overall though, this paper was an interesting read and provides more evidence that the human genome is complicated and that there is still much work to be done in improving the methodologies behind analyzing it. The next logical step might then be to test whether different types of DNA damage are caused by sample preparation.
Figures:
Figure 1
A: The figure is split into two panels, both showing a flow chart of the paired-end sequencing technique used to validate their work. The left side depicts what real oxidative damage would look like, whereas the right side shows what actual somatic mutations would look like. They are measuring the imbalance in base calls between the R1 and R2 reads, which is the basis for their GIV score and their quantification of DNA damage.
B: Like figure 1A, this figure is split into two parts, with the left showing data for G-to-T variants on R1 and R2 reads and the right showing the complementary C-to-A variants on the same reads. In the left graph, R1 reads without enzymatic repair show a higher frequency of G-to-T variants, while R2 reads show no G-to-T imbalance. On the right side of figure 1B, R2 reads without enzymatic repair show a higher frequency of C-to-A variants. This imbalance is not present for R1 reads. The authors say that this complementarity is what DNA damage should look like with paired-end sequencing.
Figure 2:
Parts A and B show very similar things, so I will discuss them together. A and B are both vertical box-and-whisker plots showing the log2 GIV score for all twelve possible nucleotide substitutions (e.g., G-to-T, G-to-A, C-to-A). Each point on both graphs shows a single GIV score for a sequencing read of 5 million base pairs. Part A depicts data from the 1000 Genomes Project; part B, a subset from TCGA. The authors have drawn a black line at a GIV score of 1.5, the level indicative of DNA damage. In both data sets, the C-to-A and G-to-T scores reach both high and low extremes, indicating damaged DNA. In part B, the authors also note that enzymatically repaired DNA was included in the GIV score calculations, which explains the bimodal distribution of C-to-A and G-to-T scores, with clusters both above and below the 1.5 GIV score threshold.
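Because Figure 2 plots log2 GIV, a balanced score of 1 sits at 0 on the axis and the 1.5 threshold sits at log2(1.5), roughly 0.585. The Python sketch below assumes, purely as an illustration, that an imbalance in either direction (above 1.5 or below 1/1.5) is treated as a sign of damage. The function name `classify_giv` and its labels are hypothetical.

```python
import math

THRESHOLD = 1.5  # GIV cut-off drawn as the black line in the figure


def classify_giv(giv, threshold=THRESHOLD):
    """Label a GIV score by its distance from balance on a log2 scale."""
    if giv <= 0:
        raise ValueError("GIV must be positive")
    log_giv = math.log2(giv)
    limit = math.log2(threshold)
    if log_giv > limit:
        return "excess on R1"
    if log_giv < -limit:
        return "excess on R2"
    return "balanced"


# Example scores: balanced, twice as frequent on R1, twice as frequent on R2.
for score in (1.0, 2.0, 0.5):
    print(score, round(math.log2(score), 3), classify_giv(score))
```

Working on the log2 scale makes the two directions symmetric: a score of 2 and a score of 0.5 are equally far from balance.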
Figure 3:
A: This was data from the enrichment experiment, which used a commercial cancer panel probe to get an accurate read of 151 cancer genes. Part A looks at the G-to-T variant frequency of R1 and R2 reads at different base positions, in samples treated or not treated with the repairing enzymatic cocktail. Unrepaired R1 reads appear to have a higher G-to-T variant frequency than any other type of read.
B: Four paired bar graphs, each showing the relative distribution of all 12 nucleotide variants in repaired and unrepaired reads. Each bar is split into the relative portions of these variants per megabase. The four graphs group the variants by how rare they were in the experiment, with frequency increasing from left to right. For variants showing up less than 1% or between 1% and 5% of the time, unrepaired DNA had a significantly higher G-to-T and C-to-A variant frequency across all reads than repaired DNA. For more common variants, this difference was not detected.
C: The same general graphs as in B, but for R1 reads only. Now only G-to-T variants show a significantly increased proportion in unrepaired DNA, and only among the rarer variants.
Figure 4:
A: Here the researchers were looking at the TCGA data set for somatic variants using Varscan, a popular data analysis tool. The graph shows all 1800 sequencing runs analyzed with Varscan, ordered by increasing GIV score and looking specifically at G-to-T imbalances. Most data sets exceeded the 1.5 GIV score threshold, suggesting widespread DNA damage.
B: This figure confirms the data presented in part A by breaking down the fraction of each type of mutation present. For R1 reads, G-to-T is more prevalent in the samples than C-to-A or any other type of mutation. This includes all the reads/data.
C: The same as B, except for high-confidence samples only. These samples had already been flagged by the researchers’ algorithm as highly damaged.
D: The same as C with R1 reads, except looking only at germline variants called with Varscan. No significant DNA damage is apparent.
E: The researchers estimated the false-positive discovery rate of somatic variants and compared it with the G-to-T GIV score. They found a strong correlation of 0.79, meaning false positives tracked estimated damage reasonably well.
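A correlation like the 0.79 reported here is commonly a Pearson coefficient, though this summary does not say which statistic the authors used. Below is a minimal sketch of that calculation in plain Python, assuming Pearson correlation. The function name `pearson_r` is hypothetical and the numbers are made up for illustration; they are not from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only: GIV scores and estimated
# false-positive fractions for five imaginary samples.
toy_giv = [1.0, 1.4, 2.1, 3.0, 4.2]
toy_fp = [0.05, 0.20, 0.35, 0.40, 0.70]
print(round(pearson_r(toy_giv, toy_fp), 2))
```

A value close to 1 means the two quantities rise together almost linearly, while a value near 0 means no linear relationship.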
Reference:
Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA Damage Is a Major Cause of Sequencing Errors, Directly Confounding Variant Identification." Science 355 (2017): 752-56. Web. 27 Apr. 2017.
Email Questions or Comments: anciancone@davidson.edu
© Copyright 2017 Department of Biology,
Davidson College, Davidson, NC 28035