DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification


Article: "DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification"

Summary:
 

    This paper describes how DNA damage can cause errors in paired-end sequencing experiments and how these errors currently affect publicly available genomic databases. The group begins by showing, conceptually and experimentally, how DNA damage can lead to variant imbalances between the two reads of paired-end sequencing. They then define their own measure of this imbalance, which they call the Global Imbalance Value (GIV). They use this GIV measure to show that DNA damage is very common in publicly available data sets, including the 1000 Genomes Project and The Cancer Genome Atlas (TCGA). They also show that these errors confound the identification of low-frequency somatic variants in database samples.
    I felt the paper was really well done, although I did not fully understand many of the in silico methods. I only wish they had conveyed some of the basics of how they calculated GIV and specified which portion of TCGA they used in Figure 2. However, even if they chose a portion of TCGA with a high error rate, a 75%+ false positive rate in any portion of a public database is of note.
   

Figure 1

Figure Analysis:

Figure 1A: The schematic demonstrates the principle that DNA damage leads to read imbalances between R1 and R2, specifically when using paired-end sequencing. The schematic shows that when 8-oxo-dG damage occurs, the damaged base is misread and produces a G->T conversion, but this switch only happens on one strand. This single-strand switch leads to the imbalances between R1 and R2, which the group was able to use to identify damaged samples.

Figure 1B: The group experimentally demonstrates the read imbalance principle by sequencing damaged DNA under various conditions. They show that for damaged DNA, read 1 has a disproportionate number of G->T conversions, whereas read 2 has a disproportionate number of C->A conversions. They go on to show that treating the DNA with a repair enzyme before sequencing essentially eliminates this read imbalance. This result reinforces the idea that DNA damage is driving the imbalance, and it indicates that sample preparation conditions can greatly affect the resulting sequence.
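The pairing of G->T on read 1 with C->A on read 2 follows from base complementarity: the same change looks like its complement when read from the opposite strand. A minimal sketch of that relationship (the function name is hypothetical and is not taken from the paper):

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_conversion(ref, alt):
    """Return the conversion as it appears when read from the opposite strand."""
    return COMPLEMENT[ref], COMPLEMENT[alt]


# A G->T change on one strand is seen as C->A on the other strand.
opposite = complement_conversion("G", "T")
print("G->T on one strand reads as {}->{} on the other".format(*opposite))
```

This is why an excess of G->T in one read and an excess of C->A in the other can be treated as two views of the same damage signal.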



Figure 2
Figure 2A: Shows the 1000 Genomes Project's GIV scores for each of the possible base conversions.

Figure 2B: Shows the same data for The Cancer Genome Atlas (TCGA). The GIV score is a measure of read imbalance for each conversion, with a GIV score > 1.5 indicating that the sample is damaged. Both data sets have GIV > 1.5 for G->T and C->A, indicating that a large fraction of the observed G->T conversions were due to damage, not true sequence differences. This figure calls into question the reliability of publicly available genomic data sets.
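Because I could not work out the paper's exact GIV calculation, the following is only a rough sketch of the idea. It assumes GIV is the ratio of a conversion's mismatch count in read 1 to its count in read 2. The function names, the pseudocount, and the example counts are all hypothetical, and only the 1.5 threshold comes from the figure.

```python
def giv_scores(r1_counts, r2_counts, pseudocount=1.0):
    """Per-conversion imbalance: (R1 count + pseudocount) / (R2 count + pseudocount).

    ASSUMPTION: this ratio stands in for the paper's GIV score; the
    pseudocount only avoids division by zero and is not from the paper.
    """
    conversions = set(r1_counts) | set(r2_counts)
    return {
        conv: (r1_counts.get(conv, 0) + pseudocount)
        / (r2_counts.get(conv, 0) + pseudocount)
        for conv in conversions
    }


def flag_damage(scores, threshold=1.5):
    """Return conversions whose imbalance exceeds the damage threshold."""
    return sorted(conv for conv, score in scores.items() if score > threshold)


# Hypothetical mismatch counts for one sample (not real data).
r1 = {"G->T": 299, "A->G": 99}
r2 = {"G->T": 99, "A->G": 99}

scores = giv_scores(r1, r2)
flagged = flag_damage(scores)
print(scores["G->T"], scores["A->G"], flagged)
```

In this toy sample, G->T is three times as common in read 1 as in read 2 and is flagged, while A->G is balanced and is not.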



Figure 3

Figure 3A: The group repeats the experiment from Figure 1B, again demonstrating that G->T frequency is higher in read 1, but that this is reduced by using DNA damage repair enzymes during prep.

Figure 3B: Shows the candidate variants after sequencing, grouped based on frequency. The data show a large proportion of G->T and C->A variants, especially in the low and moderately low frequency groups. Many of these variants were not present in the repair group, indicating that they are due to DNA damage, and are not true variants.

Figure 3C: Displays the data from Figure 3B again, but only using R1 reads. These data show that there is a massive imbalance between G->T and C->A variants on the R1 read, especially in the low and moderately low frequency groups, an imbalance which was alleviated by the repair enzyme. This is yet another verification that damaged DNA can skew variant numbers in sequencing experiments, but that the use of a repair enzyme in preparation could alleviate these errors.


Figure 4

Figure 4A: Shows the results of 1800 tumor sequences, ordered by GIV score.

Figure 4B: The team used a program called VarScan to identify somatic variants in the tumor samples. The group ordered the VarScan-identified variants by GIV score and plotted the ratio of G->T and C->A variants to other variants for read 1. They showed that G->T variants made up a disproportionately large share of the identified variants, indicating that damage-related errors can confound the somatic variants identified in cancer databases.

Figure 4C: This figure plots the same data as above, but for “high confidence” variants only, and these variants still show a disproportionate number of G->T conversions. This indicates that even the “high confidence” variants are susceptible to DNA damage related error.

Figure 4D: The group then did the same analysis on germline variants and found that the proportions of the various conversions were normal. This indicates that for high frequency variants the damage associated variation is not present, as would be expected.

Figure 4E: Finally, the group estimated the false positive rate of identified G->T somatic variants, plotting it against GIV score. The results show that there are a huge number of false positives, with more than 75% of samples having a 50%+ false positive rate. This result strongly indicates that databases like TCGA have a large number of false positives, many of which are attributable to DNA damage during sample prep.
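I do not know how the authors actually derived their false positive estimate, so the sketch below is not their method. It only illustrates one way such an estimate could be reasoned about, under the assumption that true variants are balanced between G->T and C->A on read 1, so that any excess of G->T over C->A is attributed to damage. The function name and the example counts are hypothetical.

```python
def estimated_false_positive_fraction(gt_r1, ca_r1):
    """Fraction of read 1 G->T calls attributed to damage.

    ASSUMPTION: true variants contribute equally to G->T and C->A on
    read 1, so the C->A count approximates the number of true G->T calls.
    """
    if gt_r1 <= 0:
        return 0.0
    excess = max(gt_r1 - ca_r1, 0)  # no excess means no inferred damage
    return excess / gt_r1


# Hypothetical counts: 400 G->T calls but only 100 C->A calls on read 1.
fraction = estimated_false_positive_fraction(400, 100)
print(f"Estimated false positive fraction: {fraction:.2f}")
```

Under this assumption, a sample with four times as many G->T as C->A calls on read 1 would have roughly three quarters of its G->T calls attributed to damage.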

Final Takeaway:

    Overall, the results of this paper are incredibly important, as they indicate that there are widespread errors in public databases. They show that without proper sample care these errors are common, and proper sample care is difficult, if not impossible, to ensure for public databases. This knowledge is particularly important given the prevalence of dry lab research. The data indicate that papers which identify somatic mutations in databases, but do not somehow verify their claims in vitro, should question the validity of their results. This paper is significant enough that I would argue all biological scientists ought to be aware of its outcome and, as such, be aware of the shortcomings of public genomic databases.

Works Cited:

Chen, L., Liu, P., Evans, T.C., and Ettwiller, L.M. (2017). DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 355, 752–756.