Conceived and designed the experiments: AWZ EYL. Performed the experiments: AWZ. Analyzed the data: AWZ EYL GMC. Contributed reagents/materials/analysis tools: TZ TC. Wrote the paper: AWZ EYL.
The authors have declared that no competing interests exist.
While it is widely held that an organism's genomic information should remain constant, several protein families are known to modify it. Members of the AID/APOBEC protein family can deaminate DNA. Similarly, members of the ADAR family can deaminate RNA. Characterizing the scope of these events is challenging. Here we use large genomic data sets, such as the two billion sequences in the NCBI Trace Archive, to look for clusters of mismatches of the same type, which are a hallmark of editing events caused by APOBEC3 and ADAR. We align 603,249,815 traces from the NCBI trace archive to their reference genomes. In clusters of mismatches of increasing size, at least one systematic sequencing error dominates the results (G-to-A). It is still present in mismatches with 99% accuracy and only vanishes in mismatches at 99.99% accuracy or higher. The error appears to have entered into about 1% of the HapMap, possibly affecting other users that rely on this resource. Further investigation, using stringent quality thresholds, uncovers thousands of mismatch clusters with no apparent defects in their chromatograms. These traces provide the first reported candidates of endogenous DNA editing in human, further elucidating RNA editing in human and mouse and also revealing, for the first time, extensive RNA editing in
Most biomedical, genomic research begins with the painstaking assembly of a “reference genome” for the organism of interest. Implicit in this process is an assumption that genomic information is constant throughout an organism. There are enzymes, however, that can change, or “edit,” genomic information so that variations from the reference can exist within a single organism. In this work, we analyze the raw data used to assemble the reference genomes of ten organisms to discover evidence for editing. We found candidates for DNA and RNA editing as well as a sequencing error that has become incorporated into commonly used genomic resources. Our analysis demonstrates the utility of raw genomic data for the discovery of some editing events and sets the stage for further analysis as sequencing costs continue to decrease exponentially.
With the exception of infrequent random somatic mutations, it is widely believed that the same genomic content should be fixed in an organism throughout its lifetime. This information will also serve as a template for exact RNA copies. Proteins that can modify genomic content, nevertheless, have been identified in humans and in many other organisms.
RNA editing involves alteration of particular RNA nucleotides by specifically changing Adenosine (A) into Inosine (I), which in turn is read as Guanosine (G)
For many years, the only known human endogenous target of the APOBEC protein family was the apoB RNA transcript. In this case, editing in position 6,666 by APOBEC1 leads to a stop codon and eventually results in two functionally distinct isoforms of apolipoprotein B (ApoB)
Deamination of cytosines to uracils in DNA (DNA editing) by various APOBEC protein families is characterized, in many cases, by clusters of G-to-A mismatches between the reference genome and the edited sequence. These mismatches are the end product of deamination of “C” into “U” in the other DNA strand. Recently, it was found that APOBEC3G can serve as a potent inhibitor of a wide range of retroviruses, including endogenous retrotransposons. This protein introduces large numbers of C-to-U mutations in the minus-strand of the viral DNA, eventually leading to G-to-A mutations after plus-strand synthesis
Although editing of retrotransposons and their integration back into the genome is expected to be rare, very deep DNA sequencing can be used to identify these events. In this paper we report initial results of a novel bioinformatic approach for the detection of candidate endogenous RNA and DNA editing sites in various organisms. We obtained 600 million sequence traces from the NCBI Trace Archive. This data repository contains DNA sequence chromatograms (traces) from various large-scale capillary electrophoresis sequencing projects, together with base calls and quality estimates. Next, we aligned these traces to their consensus reference genomes and searched for clusters of mismatches. Interestingly, we found not only evidence of genuine RNA and DNA editing events but also isolated a very common technical sequencing artifact that leads to such clusters.
One hallmark of editing enzymes is a cluster of mismatches of the same type in the edited substrate. While RNA editing by ADARs results in clusters of A-to-G mismatches, the hallmark of the APOBEC3 protein family is a cluster of G-to-A mismatches in the newly formed DNA strand after reverse transcription. In order to find new endogenous editing events, we looked for such mismatch clusters in the largest available repository of “raw” sequencing data, that is, data that have not yet been processed and assembled. We aligned “raw” sequencing reads from the NCBI Trace Archive to their consensus reference genomes. We repeated this procedure, in parallel, for each of ten organisms (in total more than 600 million reads - see
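The cluster search described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function name `mismatch_runs`, the gap-free equal-length input strings, and the definition of a run (successive mismatches of one type, broken only by a mismatch of another type) are assumptions made for illustration.

```python
from itertools import groupby


def mismatch_runs(reference, read, min_run=3):
    """Return runs of successive mismatches that share one substitution type.

    Assumption: `reference` and `read` are equal-length, gap-free aligned
    strings. Each run is reported as (type, [positions]), with the type
    written as "G-to-A" (reference base first, read base second).
    """
    if len(reference) != len(read):
        raise ValueError("aligned strings must have equal length")
    # Collect every mismatch in trace order as (position, type).
    # Positions holding anything other than A, C, G, T are ignored.
    mismatches = [
        (i, f"{r}-to-{q}")
        for i, (r, q) in enumerate(zip(reference.upper(), read.upper()))
        if r != q and r in "ACGT" and q in "ACGT"
    ]
    runs = []
    # Matching bases between two mismatches do not break a run,
    # but a mismatch of a different type does.
    for kind, group in groupby(mismatches, key=lambda m: m[1]):
        positions = [pos for pos, _ in group]
        if len(positions) >= min_run:
            runs.append((kind, positions))
    return runs


if __name__ == "__main__":
    print(mismatch_runs("AGATTGACCGATT", "AAATTAACCAATT"))
```

A trace would then be kept as a candidate when it contains at least one run of the chosen minimum length.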
In sum, we curated more than 56 gigabases of aligned sequence in human, about 62 gigabases of aligned sequence in mouse and much lower numbers for other organisms reflecting smaller genomes and/or lower coverage. In human, 85,181,171 traces aligned uniquely to the reference genome, 4,626,984 traces aligned to multiple locations, and 123,110,314 traces had no alignment under our strict cutoffs. For all organisms combined, approximately 300 million, out of 603,249,815 traces in total, were analyzed further (See
Organism name | #reference bp (millions) | #unique traces (millions) | Mean coverage | Space (Gb) | Time (millions of node seconds) |
Anopheles gambiae | 260 | 4.3 | 9.9 | 13 | 0.56 |
Callithrix jacchus | 2,900 | 22 | 4.6 | 160 | 1.5 |
Canis familiaris | 2,400 | 33 | 8.3 | 370 | 3.4 |
Drosophila melanogaster | 160 | 0.67 | 2.5 | 2.5 | 0.06 |
Gallus gallus | 1,000 | 12 | 7.2 | 30 | 1.3 |
Homo sapiens | 2,900 | 85 | 18 | 530 | 30 |
Mus musculus | 2,600 | 93 | 21 | 4,200 | 114 |
Pan troglodytes | 2,900 | 32 | 6.6 | 150 | 7.0 |
Takifugu rubripes | 350 | 2.5 | 4.2 | 6.4 | 1.2 |
Xenopus tropicalis | 1,400 | 14 | 6.0 | 360 | 4.8 |
Total | | 298.47 | | 5821.90 | 163.82 |
Total data generated from analysis of 603,249,815 traces, 30% of the total number of traces at NCBI (outside the short-read archive). Approximately half were placed uniquely while applying our cutoffs, with total data consuming six terabytes of disk and more than five “node years” of CPU time. The computation on mouse traces produced the bulk of the data.
Clusters of consecutive mismatches of the same type (C-to-T or G-to-A) are common in APOBEC targets, such as IAP mouse retroelements edited by APOBEC3
Since editing enzymes have a preferred sequence context, the large data set allows us to restrict our search to traces with the same three base-pair motif centered at each mismatch site in the trace
Out of the 53,639 total examples conforming to the above criteria, we found 46,483 (82%) examples of G-to-A traces in human. Thus, the restrictions above reduced the total number of traces more than 12-fold while reducing the number of G-to-A examples less than 5-fold. Moreover, we found a striking preference for either an “AGA-to-AAA” mismatch motif (26,694/53,639 traces) or an “AGG-to-AAG” motif (21,274/53,639 traces). This tendency was observed in traces from all but one (Celera) of the sequencing centers tested (see
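The motif restriction can be sketched in Python as follows. This is an illustrative assumption about how a three base-pair motif centered on a mismatch might be tabulated, not the code used for the analysis; `centered_motifs` and `has_single_motif` are hypothetical names, and the flanking bases are taken from the reference.

```python
from collections import Counter


def centered_motifs(reference, read):
    """Count three-base motifs centered on each mismatch, e.g. "AGA-to-AAA".

    Assumptions: gap-free aligned strings; flanking bases are read from the
    reference; mismatches at the first or last position are skipped because
    they lack a flanking base.
    """
    counts = Counter()
    for i in range(1, min(len(reference), len(read)) - 1):
        ref_base, read_base = reference[i], read[i]
        if ref_base == read_base:
            continue
        left, right = reference[i - 1], reference[i + 1]
        counts[f"{left}{ref_base}{right}-to-{left}{read_base}{right}"] += 1
    return counts


def has_single_motif(reference, read):
    """True when every mismatch in the trace shares one centered motif."""
    return len(centered_motifs(reference, read)) == 1


if __name__ == "__main__":
    print(centered_motifs("TAGACAGATAGAC", "TAAACAAATAAAC"))
```

Requiring a single shared motif per trace mirrors the restriction described above, under the stated assumptions.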
(A) Human traces are mined for clusters of mismatches of the same type. Shown is the percent frequency of clusters by type. The G-to-A mismatch type becomes more dominant with increasing numbers of mismatches (as does T-to-G). (B) Runs of five (or more) mismatches by type and sequencing center with an identical 3bp motif centered on each mismatch. Data from eight sequencing centers is shown. All of these centers had at least 1000 examples that meet the above criteria. (C) Clusters with three (or more) mismatches with at least two very high quality mismatches (Phred 40). A mismatch spectrum consistent with editing can be observed.
Sequence traces are derived from both DNA strands, so a genuine biological source of G-to-A clusters would be expected to produce a symmetric overrepresentation of C-to-T mismatch clusters. The lack of similar numbers of complementary mismatches led us to conclude that most of these mismatches are sequencing artifacts rather than the product of a biological process.
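The symmetry argument can be made concrete with a short Python sketch. The helper names `complementary_type` and `strand_ratio` are hypothetical, and the ratio is only one simple way to express the comparison. The example counts are the hg18 values for the Phred 40 set reported in this paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complementary_type(mismatch_type):
    """Map a mismatch type to the type seen on the opposite strand."""
    ref_base, read_base = mismatch_type.split("-to-")
    return f"{COMPLEMENT[ref_base]}-to-{COMPLEMENT[read_base]}"


def strand_ratio(counts, mismatch_type):
    """Ratio of a mismatch type's count to its complement's count.

    A value near 1 is what a strand-symmetric source would produce;
    a large value hints at a strand-specific artifact.
    """
    partner_count = counts.get(complementary_type(mismatch_type), 0)
    if partner_count == 0:
        return float("inf")
    return counts.get(mismatch_type, 0) / partner_count


if __name__ == "__main__":
    hg18_phred40 = {"G-to-A": 17719, "C-to-T": 16778}
    print(complementary_type("G-to-A"))
    print(round(strand_ratio(hg18_phred40, "G-to-A"), 3))
```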
In order to understand the origin of the artifact, we analyzed sample traces and noticed that traces with “runs” of mismatches sharing an identical three base-pair motif centered on the mismatch often had a peculiar defect in their chromatograms. Such defects can arise when the fluorescent dyes used in DNA sequencing have sequence-specific incorporation differences, which lead to unevenly spaced or shaped peaks in the electronic trace chromatogram after capillary electrophoresis.
(A) A chromatogram, from a trace matching the criteria in
We used strict criteria to construct the artifact set, thus the actual number of these errors is probably much larger than the 260K we found and may disrupt the accuracy of genomic assemblies. Indeed, we found evidence that these common errors influence the consensus sequence of a few genomes. The number of runs of G-to-A mismatches with the AGA motif was much higher in genomes with high coverage, where each position in the reference genome has many traces to support each call. In these cases, the reference is determined according to the “majority voting” of all the supporting traces. Since the reported type of mismatch is much less abundant than the correct call, the reference will have the correct “G” in virtually all cases. In genomic projects with lower coverage, however, such events can become part of the reference genome and therefore could not have been detected by our method. Indeed, we found that genomes with lower coverage tended to be free of G-to-A mismatches. This effect is most striking in Drosophila, where mean coverage of the reference by aligned traces is only 2.5 (See
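The effect of coverage on “majority voting” can be illustrated with a deliberately simplified Python sketch. Real assemblers also weigh quality scores and other evidence; the function `consensus_base` is a hypothetical stand-in that counts only base calls.

```python
from collections import Counter


def consensus_base(calls):
    """Pick the most frequent base among the traces covering one position.

    Simplifying assumption: every trace gets one equal vote.
    """
    if not calls:
        raise ValueError("at least one base call is required")
    return Counter(calls).most_common(1)[0][0]


if __name__ == "__main__":
    # High coverage: one erroneous "A" is outvoted by seventeen correct calls.
    print(consensus_base(["G"] * 17 + ["A"]))
    # Low coverage: a single erroneous call becomes the reference base,
    # so the trace no longer shows a mismatch against the reference.
    print(consensus_base(["A"]))
```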
Reference genome version | G-to-A | C-to-T | A-to-G | T-to-C | Other |
anoGam1 | 2836 | 2830 | 2907 | 3098 | 440 |
calJac1 | 3012 | 3362 | 2735 | 3133 | 145 |
canFam2 | 3170 | 3777 | 3270 | 3027 | 212 |
dm3 | 1 | 1 | 0 | 1 | 0 |
galGal3 | 1290 | 878 | 1026 | 1760 | 48 |
hg18 | 17719(82) | 16778(72) | 13701(188) | 15301(419) | 700(8) |
mm9 | 1801(219) | 1644(272) | 1346(276) | 1411(346) | 76(11) |
panTro2 | 3485 | 3120 | 2918 | 4046 | 240 |
fr2 | 467 | 449 | 390 | 482 | 45 |
xenTro2 | 1483(202) | 1574(262) | 1461(1289) | 1631(1066) | 269(28) |
Number of traces by mismatch type with two or more mismatches at or above a quality threshold of phred 40, spanning 100bp or more. All mismatches belong to runs of three consecutive mismatches of the same type of any quality. The number of traces from the next largest substitution type, or the largest substitution type if it is not one of A-to-G, T-to-C, G-to-A, or C-to-T, is shown in the “other” column for comparison. The numbers in parentheses indicate traces of RNA origin. See
Another effect of this error was found in the assignment of single nucleotide polymorphisms (SNPs). A sequencing error in one genomic trace will not usually lead to the determination of a SNP at this position. However, since many of the “AGA” mismatches have a quality score of phred 20 or higher, which is considered an acceptable quality with an estimated error probability of only 1%
Once we realized that the majority of “AGA” and “AGG” mismatch motifs were caused by a sequencing error, we endeavored to eliminate such errors from our dataset. To do so, we incorporated phred quality scores, also available from the trace archive. We obtained quality scores for all traces with a run of three or more substitutions of the same type. This set contains 20.7 million traces out of the 300 million that aligned uniquely. We then applied various quality score thresholds to the data (see
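For reference, a phred score Q corresponds to an estimated error probability of 10^(-Q/10), so phred 20 means 99% estimated accuracy and phred 40 means 99.99%. The Python sketch below shows this conversion and a simple threshold filter; the function names and the `(position, quality)` tuple layout are assumptions made for illustration.

```python
import math


def phred_to_error_probability(quality):
    """Estimated probability that a base call with this phred score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Inverse conversion: error probability back to a phred score."""
    return -10 * math.log10(probability)


def keep_high_quality(mismatches, threshold):
    """Keep (position, quality) mismatches at or above a phred threshold."""
    return [m for m in mismatches if m[1] >= threshold]


if __name__ == "__main__":
    for q in (10, 20, 40):
        print(q, phred_to_error_probability(q))
    print(keep_high_quality([(5, 12), (9, 40), (30, 49)], 40))
```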
Recently, DNA editing has been reported to be a powerful defense mechanism against the threat of genomic instability imposed by viruses and retrotransposons. However, the full magnitude of the phenomenon
Active retrotransposons exist in human. For example, two edited HERVK elements have been recently discovered
Trace 1735626615 aligns uniquely to chromosome 2 where the known retrotransposon HERVL-A1 is located (chr2: 100697697–100700125). A cluster of 15 G-to-A mismatches (worst mismatch phred 35; best mismatch phred 49) suggests that the trace originates from an edited version of the element. Support for the APOBEC source of the editing comes from the preferred GG-to-AG motif (11 out of the 15 cases) and GA-to-AA (remaining 4 cases) which is the dinucleotide context (in the same order) in an HIV hypermutated genome, and is the sequence motif of APOBEC3G and APOBEC3F
Example of possible DNA editing in human chr21:40977741–40978045. Alignment of trace 1745107496 to the human reference genome led to a large number of G-to-A mismatches, which indicate possible DNA editing in this retrotransposon. All the mismatches are located in high quality sequence positions, reducing the possibility of sequencing errors.
The actual number of edited traces in the trace archive is most probably much higher than we have found, for several reasons: More than half of all traces were rejected with our alignment parameters, at least partially due to the fact that DNA editing tends to lead to hyper-mutation in its target sequences
RNA editing is a general term for the modification of RNA after it is transcribed from DNA. The most common modification in mammals is A-to-I editing by the ADAR protein family. As I (Inosine) is read as a G (Guanosine) after sequencing, this editing type manifests itself as A-to-G substitutions after cDNA sequencing and alignment to the original genomic locus. Recently it was found that the human genome harbors large numbers of editing events that are located in clusters, mainly in
A fraction of the human, mouse and
(A) While no over-representation of the RNA derived mismatch clusters (A-to-G and its complementary T-to-C) is observed in the full set of RNA traces in human (n = 238,370) and
Further evidence that the higher quality set is indeed a result of RNA editing comes from two additional observations. First, a significant under-representation of “G” immediately upstream to the editing sites which is in agreement with the known sequence motif of the ADAR proteins
Significant under-representation of “G” immediately upstream to the editing sites which is in agreement with the known sequence motif of the ADAR proteins.
Detection of RNA editing from short EST sequences has proven to be challenging, due to their relatively low sequence quality
Of the organisms we studied, only human, mouse and
The
(A) Evidence for RNA editing can be seen in this locus as multiple traces of RNA origin align to it with numerous A-to-G mismatches. The trace accession numbers and their coordinates are given in the multiple alignment. (B) Predicted RNA structure of the genomic locus indicates a long and stable dsRNA structure which is a favorite target for editing by ADARs. Each editing site from the multiple alignment is marked by an arrow. The length of the arrow corresponds to the editing level.
The NCBI trace archive serves as a repository of raw data for the assembly of consensus genomes. Recently, it was utilized for a different purpose in the search for structural variation in the human genome
Recently, we did an initial analysis of Illumina's human resequencing reads and the SOLiD reads from the same individual. These reads are available at the NCBI short-read archive and are the basis for the first individual African consensus genome
The availability of computational resources was essential to this project: the analyses required six terabytes of disk for intermediate data and more than five “node years” of CPU time. With further computational effort, combining existing data in the trace archive with next generation sequencing data sets from multiple sequencing platforms and chemistries, it should be possible to greatly improve genomic databases and eliminate the sequencing errors reported here.
By using well-calibrated quality scores and selecting traces with clusters of consecutive mismatches, we are able to investigate the scope of RNA editing sites in human and other genomes. The application of this technique in the search for editing events will make many large EST datasets more accessible for other organisms where quality scores are available. Currently, only a very small number of organisms, with large sets of full length RNA sequences, have been the subject of large-scale editing studies. Using quality scores, many additional genomes can be surveyed for editing with the opportunity for new discoveries in this emerging field.
As a demonstration of the value of using quality data for ESTs, we are able to find a large number of candidate RNA editing events in
Despite the identification of thousands of new RNA editing sites in the current work, it is reasonable to believe that the actual number of editing sites is still significantly under-estimated. Support for this assertion comes from the stringency of our parameters, including length of alignment, percentage of identity and exclusion of insertions or deletions. These choices most likely limited the subset of EST data that we analyzed. Refinement of these criteria could lead to more comprehensive detection of RNA editing levels and, due to the breadth of EST data, even permit the comparison of editing levels in different tissues and disease conditions.
In this work we also found evidence for recent or active events of DNA editing. While the true scope of these phenomena must be explored in future work, our approach, including the use of strict alignment criteria and quality scores, has proved effective at finding many intriguing examples. Using different parameters, mainly lower cutoffs and relaxation of the requirement for unique alignments, more DNA editing sites could be detected in the trace archive. Careful investigation, most likely combined with next-generation sequencing experiments, will help unravel the mechanisms of retroelement defenses in a variety of organisms. Moreover, DNA editing is known not to be limited to retrotransposons and can take place in other genomic loci. The most recognized example is the AID protein, which is a member of the AID/APOBEC protein family and targets single-stranded DNA in the immunoglobulin locus in B-cells. Approaches similar to the ones used here provide an exciting opportunity to survey how leakage of DNA editing events outside retroelements or immunoglobulins could cause many simultaneous mutations in the genome, a process that can eventually lead to cancer.
We obtained all traces for 10 organisms (600M traces in total), in FASTA format, at the NCBI Trace Archive
We augmented the above data by downloading auxiliary information and quality scores for a subset of about 20.7 million traces which were, potentially, enriched for editing events. We used runs of three consecutive mismatches of the same type as the enrichment criterion. The number of high quality traces for each editing type (G-to-A, C-to-T, A-to-G, and T-to-C) is listed in
The complete set of mismatches found in these two sets of traces is available to the community as two files, “all.c2.t100.q40+.bed.gz” (5.95MB) and “all.c2.t100.q0-9.bed.gz” (122MB), respectively. The first set is included on the journal's web-site while the second file is available, on request, from the authors. The files contain: the genomic coordinate of the mismatch, the mismatch type, the position on the trace, the quality of the mismatch, the length of the run in which the mismatch was found, the sequencing center, the trace id, the organism, and the likely origin of the trace, DNA or RNA. In order to be counted, each trace must have at least two mismatches with phred 40 or greater that are separated by 100bp or more. Only mismatches with phred scores of 40 or greater are included in the high quality set (see
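The counting rule for the high quality set can be sketched in Python as follows. This is an illustrative reading of the stated criterion, not the code that produced the files: the name `passes_enrichment` and the `(trace_position, phred_quality)` tuple layout are assumptions, and “separated by 100bp or more” is interpreted as the distance between the outermost qualifying mismatches.

```python
def passes_enrichment(mismatches, min_quality=40, min_span=100):
    """Check whether a trace meets the high quality counting rule.

    `mismatches` is an iterable of (trace_position, phred_quality) pairs.
    Assumption: the trace qualifies when at least two mismatches reach
    `min_quality` and the outermost of them lie `min_span` or more bases apart.
    """
    positions = sorted(pos for pos, quality in mismatches if quality >= min_quality)
    return len(positions) >= 2 and positions[-1] - positions[0] >= min_span


if __name__ == "__main__":
    print(passes_enrichment([(10, 45), (60, 12), (120, 41)]))  # spans 110 bp
    print(passes_enrichment([(10, 45), (50, 41)]))             # spans only 40 bp
```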
For sequence alignment, we used MegaBlast
Two computational clusters were used to perform the analysis. These clusters were built to assist in deploying data intensive web services
Enriched set of editing candidates.
(5.95 MB ZIP)
Xenopus RNA editing sites.
(0.36 MB TXT)
DNA editing of mouse MMTV-int retrotransposons (both clone mates). DNA editing in a mouse retrotransposon. Two traces (ti#71971190 and ti#71976546 which are mate pairs from one sequencing clone) are aligned to the mouse genomic full length MMTV-int retrotransposon (ERVK family) locus (chr6:68193707-68200951). Both aligned with a large number of G-to-A mismatches, an indication of DNA editing in this active retrotransposon. Additional mismatches are present as well, probably due to the activity of DNA damage proteins.
(0.03 MB DOC)
Substitution spectrum, by quality score, sampled from runs of three substitutions of the same type in ten organisms. In all organisms examined, G-to-A mismatches dominate all other substitution types for mismatches with Phred quality scores between 10 and 40. From Phred 40 onward the spectrum becomes more even, with G-to-A, C-to-T, A-to-G and T-to-C each representing roughly 20% of all substitutions.
(0.06 MB TIF)
Absolute abundance of mismatches in human w/100 bp runs. Shows absolute abundance of runs from
(0.03 MB TIF)
Absolute abundance of mismatches in human. Shows absolute abundance of runs from
(0.08 MB TIF)
Summary of traces without enrichment (RNA origin) by mismatch type. “Other” indicates the most abundant type other than those listed. No enrichment for the ADAR derived mismatches is observed in the full set.
(0.03 MB DOC)
Sequence context preceding mismatch (enriched, higher quality, RNA). There is a clear under-representation of the “G” nucleotide upstream to the mismatch in both human and Xenopus, in agreement with known ADAR signatures. RNA editing is known to be less common in mouse, which is consistent with the lack of depletion there.
(0.03 MB DOC)
Sequence context preceding mismatch (not enriched, RNA). The position preceding an edited site is known to be depleted in “g”. We looked at the position preceding an A-to-G or T-to-C mismatch in RNA derived traces. The depletion is clearly visible in the enriched set (see
(0.03 MB DOC)
Summary of Traces without enrichment (RNA origin). “unique bp” indicates the total number of genomic positions covered by the placed traces of the RNA traces.
(0.03 MB DOC)
Editing enriched traces-lower quality. Number of traces, by mismatch type, with two or more mismatches below a quality threshold of Phred 10, spanning 100 bp or more. For Mouse, Human, and
(0.03 MB DOC)