DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification
Figure Analysis:
Figure 1A: The schematic demonstrates the principle that DNA damage leads to read imbalances between R1 and R2 in paired-end sequencing. It shows that when 8-oxo-dG damage occurs, the damaged base is miscopied, producing a G->T conversion, but this switch happens on only one strand. This single-strand change produces the imbalance between R1 and R2 that the group used to identify damaged samples.
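To make the strand logic concrete, here is a minimal Python sketch. The function name `complement_substitution` is my own and does not come from the paper. It only illustrates that a G->T change on one strand reads as C->A on the complementary strand, which is why the two reads of a pair report different conversions.

```python
from typing import Tuple

# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_substitution(ref: str, alt: str) -> Tuple[str, str]:
    """Return the same substitution as it appears on the complementary strand.

    Hypothetical helper for illustration only.
    """
    return COMPLEMENT[ref], COMPLEMENT[alt]


# A G->T change on one strand is a C->A change on the other.
print(complement_substitution("G", "T"))  # ('C', 'A')
```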
Figure 1B: The group experimentally demonstrates the read imbalance principle by sequencing damaged DNA under various conditions. They show that for damaged DNA, read 1 has a disproportionate number of G->T conversions, whereas read 2 has a disproportionate number of C->A conversions. They go on to show that sequencing in the presence of a DNA repair enzyme essentially eliminates this read imbalance. This result reinforces the idea that DNA damage is driving the imbalance, and it indicates that sample preparation conditions can greatly affect the resulting sequence.
Figure 2B: Shows the same analysis applied to The Cancer Genome Atlas (TCGA). The GIV score is a measure of read imbalance for each conversion type, with a GIV score > 1.5 indicating that the sample is damaged. Both of the data sets have GIV > 1.5 for G->T and C->A, indicating that large fractions of the observed G->T conversions were due to damage, not true sequence differences. This figure calls into question the reliability of publicly available genomic data sets.
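The GIV idea can be sketched in a few lines of Python. This is a simplified illustration, assuming the score is just the count of one conversion type on R1 divided by its count on R2. The paper's exact definition may include filtering or normalization that I have left out, and the function names and example counts below are hypothetical.

```python
# Threshold stated in the text above: samples above it are called damaged.
DAMAGE_THRESHOLD = 1.5


def giv_score(r1_count: int, r2_count: int) -> float:
    """Imbalance of one conversion type: count on R1 divided by count on R2.

    Assumption: a plain ratio, with no extra normalization.
    """
    if r2_count == 0:
        raise ValueError("R2 count must be non-zero to compute a ratio")
    return r1_count / r2_count


def is_damaged(r1_count: int, r2_count: int,
               threshold: float = DAMAGE_THRESHOLD) -> bool:
    """Flag a sample whose imbalance exceeds the threshold."""
    return giv_score(r1_count, r2_count) > threshold


# Made-up counts of G->T conversions for two samples.
print(giv_score(300, 100), is_damaged(300, 100))  # 3.0 True
print(giv_score(100, 100), is_damaged(100, 100))  # 1.0 False
```

A score near 1 means the conversion is seen equally often on both reads, which is what a true variant should look like.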
Figure 3B: Shows the candidate variants after sequencing, grouped based on frequency. The data show a large proportion of G->T and C->A variants, especially in the low and moderately low frequency groups. Many of these variants were not present in the repair group, indicating that they are due to DNA damage, and are not true variants.
Figure 3C: Displays the data from Figure 3B again, but using only R1 reads. These data show that there is a massive imbalance between G->T and C->A variants on the R1 read, especially in the low and moderately low frequency groups, an imbalance which was alleviated by the repair enzyme. This is yet another verification that damaged DNA can skew variant numbers in sequencing experiments, and that the use of a repair enzyme during preparation could alleviate these errors.
Figure 4A: Shows the results of 1800 tumor sequences, ordered by GIV score.
Figure 4B: The team used a program called VarScan to identify somatic variants in the tumor samples. The group ordered the VarScan-identified variants based on GIV score and plotted the ratio of G->T and C->A variants to other variants for read 1. They showed that most of the identified variant sets had a huge proportion of G->T variants, indicating that damage-induced errors can confound the somatic variants identified in cancer databases.
Figure 4C: This figure plotted the same data as above, but did so for “high confidence” variants, with the variants still showing a disproportionate amount of G->T conversions. This indicates that even the “high confidence” variants are susceptible to DNA damage related error.
Figure 4D: The group finally did the same analysis of germline variants and found that the proportions of various conversions were normal. This indicates that for high frequency variants the damage associated variation is not present, as would be expected.
Figure 4E: Finally, the group estimated the false positive rate of the identified G->T somatic variants, plotting it against GIV score. The results show that there are a huge number of false positives, with more than 75% of samples having a false positive rate of 50% or more. This result strongly indicates that databases like TCGA contain a large number of false positives, many of which are attributable to DNA damage during sample prep.
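One simple way to see why a high GIV score implies many false positives is the back-of-the-envelope sketch below. It is not necessarily the estimator the authors used. It assumes that true variants appear equally often on R1 and R2, so any excess of G->T calls on R1 over R2 is damage. Under that assumption the false positive fraction on R1 is 1 - 1/GIV. The function name is hypothetical.

```python
def estimated_false_positive_fraction(giv: float) -> float:
    """Rough fraction of R1 calls of one conversion type explained by damage.

    Assumption (mine, not the paper's): true variants are balanced between
    R1 and R2, so the excess on R1 is damage: (R1 - R2) / R1 = 1 - 1/GIV.
    """
    if giv <= 0:
        raise ValueError("GIV must be positive")
    # A balanced or R2-heavy sample has no R1 excess to attribute to damage.
    return max(0.0, 1.0 - 1.0 / giv)


print(estimated_false_positive_fraction(1.0))  # 0.0
print(estimated_false_positive_fraction(2.0))  # 0.5
print(estimated_false_positive_fraction(4.0))  # 0.75
```

Under this simplification, a sample with a GIV score of 2 would already have about half of its R1 G->T calls attributable to damage.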
Final Takeaway:
Overall, the results of this paper are incredibly important, as they indicate that there are widespread errors in public databases. They show that without proper sample care these errors are common, and proper sample care is difficult or impossible to ensure for public databases. This knowledge is particularly important given the prevalence of dry lab research. The data indicate that authors who identify somatic mutations in databases, but do not verify their claims experimentally, should question the validity of their results. This paper is significant enough that I would argue all biological scientists ought to be aware of its findings and, as such, of the shortcomings of public genomic databases.
Works Cited:
Chen, L., Liu, P., Evans, T.C., and Ettwiller, L.M. (2017). DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 355, 752–756.