Measuring quality of DNA sequence data via degradation

We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.


Introduction
As public genome databases proliferate, their immense scientific power is tempered by skepticism about their quality. The skepticism is not merely anecdotal: there are documented instances and implications (Commichaux et al., 2021;Langdon, 2014;Steinegger and Salzberg, 2020). Although we argue in Appendix A that data quality should not be construed as comprising only errors in data, the principal contribution of the paper is a novel paradigm for measuring quality of genome sequences by deliberately introducing errors that reduce quality, a process we term degradation. The errors are single nucleotide polymorphisms (SNPs), insertions and deletions that both occur naturally as mutations and arise in next generation sequencing. Our reasoning is that higher quality data are more fragile: the higher the initial quality, the greater the effect of the same amount of degradation. We present evidence that supports this reasoning, as well as demonstrates the scope and consequences of the phenomenon.
Even though the main contribution of the paper is methodological, applicability to bioinformatics problems is its raison d'être. Our exemplar problem is detection of outliers in genome databases: we identify genomes in a 26,953 coronavirus database whose degradation behavior is anomalous, and whose quality, therefore, may be suspect. We detect deliberately inserted low quality genomes, but other genomes in the original database are equally problematic. A second potential application is to thwart adversarial attacks on genome databases that, for instance, insert artificial genomes so that sequences of concern such as those generated by the methods in Farbiash and Puzis (2020) will pass screening tests. Finally, degradation can be used to characterize the quality of synthetic DNA reads that are used to evaluate genome assemblers (Wang et al., 2020).
Our method is rooted in total quality paradigms for official statistics, that is, censuses and surveys conducted by national statistics offices (see Appendix A). In that context, data quality is a longstanding issue, and low quality data are known to be resistant to further errors, such as those introduced by editing, imputation or statistical disclosure limitation (SDL). We also draw on official statistics for techniques to quantify data quality. In experimental settings and because it is intuitive, we measure degradation by distance, appropriately defined, from the stating point. In real databases, this is not possible, so we employ measures of distance from a universal "endpoint" representing the lowest possible quality-pure randomness in the form of maximal entropy, which every genome reaches in the limit of infinite degradation.
The remainder of the paper is organized as follows. To maintain flow, brief background on data quality appears in Appendix A. Our hypothesis that higher quality means greater fragility and initial evidence supporting it appear in §2. In §3, we examine degradation of tuple distributions, linking this paper to work on applying the Markov structure of genome sequences (Karr et al., 2021b). A central degradation measureentropy of triplet distributions-is applied in §4 to outlier detection. In §5, we demonstrate the effects of degradation on higher-level genome structure, viz, repeats and palindromes, and on non-virus genomes. Conclusions appear in §6.

Argumentation
Here are our central hypothesis and initial evidence supporting it.

The Hypothesis
We hypothesize that because high quality data are more fragile than low quality data, 1 the quality of elements of a DNA sequence database can be measured by degrading them, Secondarily, the more complex the characteristic examined, the greater the impact of degradation. As we see below, the effect of degradation increases as we move from base distributions to pair distributions to triplet distributions to quartet distributions to repeats and palindromes.
We perform the degradation by iteratively applying the Mason_variator software (Holtgrewe, 2010). Briefly, the Mason_variator simulates changes to a genome sequence: SNPs, insertions, deletions, inversions, translocations, and duplications, with specified probabilities for each. Such changes occur naturally as mutations as well as in reads produced by next generation sequencers, such as those made by Illumina. Mason_variator runs from a command line interface with user-specified parameters, input files, and output files. For simplicity, in most of our analyses, only SNPs were simulated. 2 Iterative use of Mason_variator means starting with a given genome, running Mason_variator on it, running Mason_variator again on the result, . . . , up to a specified number of iterations, which is usually 2000. 3 We need measures of quality for degraded genomes, and have investigated several. The first two of these are employed in string matching: Hamming distance (Navarro, 2001) is usable when only SNPs are simulated, while Levenshtein distance (Navarro, 2001) allows insertions and deletions that alter the length of the DNA sequence. 4 As discussed below, the origin point for Hamming and Levenshtein distances is crucial. Distances based on distributions of nucleotides, pairs, triplets and quartets are discussed in §3. Entropy of triplet distributions of degraded genomes is examined in detail in §3.2 and forms the basis of §4. Figure 1 visualizes the hypothesis for a single element of the coronavirus database employed in §3 and 4. In the figure, the x-axis is the number of iterations of the Mason_variator, and the y-axis is Levenshtein distance between the degraded genome and the original genome. All forms of errors were allowed. The most salient characteristic of the curve is its concavity: the more degradation already done, the less the effect of each additional iteration. There are issues with the choice of the origin for Levenshtein distances. In Figure 2, there are 21 initial genomes-the one randomly selected coronavirus genome and that genome after 100, 200, . . . , 2000 Mason_variator iterations, representing continually decreasing initial data quality. In the top panel, Levenshtein distance is measured from the parent (0-iteration) genome, and the distance at iteration 0 has been subtracted from each curve. In a sense, however, this is "cheating," because in real databases, there are not known parent genomes. In the middle panel in Figure 2, Levenshtein distance for each curve is measured from its starting point. The curves there differ little, and certainly not systematically. Fortunately, a work-around exists: the bottom panel in Figure 2 shows that (within reason), any fixed genome can be used as the origin. There, all Levenshtein distances are measured from a second randomly selected genome in the National Center for Biotechnology Information (NCBI) dataset. The key point is that the curves in the bottom panel vary significantly and systematically with respect to the initial degradation.

Initial Evidence
In §3.3, we instead measure distance from a fixed "endpoint" interpretable as infinite degradation.

Degradation of Tuple Distributions
Distributions of bases, pairs of successive bases, triplets of successive bases and quartets of successive bases differ across genomes, in ways that support a variety of analyses, including not only outlier identification ( §4.1), but also Bayesian classification of simulated Illumina reads and detection of contamination (Karr et al., 2021b). How tuple distributions behave under degradation supports our hypothesis that application of the Mason_variator decreases data quality. Higher-level genome structure such as repeats and palindromes is discussed in §5.1. As noted in §1, our experimental platform is a coronavirus database containing 26,953 genomes, which was downloaded from the NCBI in November of 2020. To it, we added eleven "known" problem cases: a single adenovirus genome with length 34,125 BP (downloaded as part of the Art read simulator software package (Huang et al., 2012)) and low-quality versions of 10 coronavirus selected randomly from the original 26,953, each created by 2000 iterations of the Mason_variator. Our method detects not only these known outliers-the minimal criterion for credibility, but also others.

Preliminaries
In this paper, a genome G is a character string chosen from the alphabet B = {A, C, G, T }, and represents one strand of the DNA 5 in an organism. The constituent bases (nucleotides) are A = adenine, C = cytosine, G = guanine and T = thymine. We denote the length of a genome G by |G|; the i th base in G is G(i); and the bases from location i to location j > i are G(i : j). Given an integer n ≥ 1, the n-gram distribution is the probability distribution P n (·|G) on the set all sequences of length n chosen from B-there are 4 n of them-constructed by forming a table of all length n contiguous substrings of G and normalizing it so that its entries sum to 1. 6 In §4, we focus on triplets, which are 64-dimensional summaries of genomes, and which also encode amino acids-the building blocks of proteins. The interpretation of P 3 (·|G) is that for where k is chosen at random from {1, . . . , |G| − 2}. Triplets provide a generative model of a genome as a second-order Markov process, since P 3 contains the same information as the pair distribution P 2 and the 5 Or, for viruses, RNA. 6 There are |G|−n+1 such sequences, starting at 1, 2, . . . , |G|−n+1, so the normalization amounts to division by |G|−n+1. Figure 2: Degradation behavior as a function of Mason_variator iterations and initial data quality, for a randomly selected coronavirus genome. Top: Degradation is measured by incremental Levenshtein distance from the parent (0-iteration) genome. Middle: Degradation is measured by Levenshtein distance from the starting point. Bottom: Degradation is measured by incremental Levenshtein distance from a second randomly selected genome in the NCBI database. Color encodes the initial number of Mason_variator iterations.
16 × 4 transition matrix that gives the distribution of each base conditional on its two immediate predecessors.
Other cases of interest are less suited to our purposes. Base (n = 1) and pair (n = 2) distributions are too coarse to be useful on their own. Quartets (n = 4) have been studied extensively (Pride et al., 2003;Teeling et al., 2004a,b). For the problems we address, they are more cumbersome than triplets without being significantly more informative. Finally, although we do not do so here, triplet distributions can usefully be converted to amino acid distributions (Karr et al., 2021b).

General Effects of Degradation
For the adenovirus genome, Figure 3 shows the effect of degradation on the base, pair, triplet and quartet distributions, measured by Hellinger distance (Nikulin, 2001) from corresponding distributions for the original genome. The interpretation is that as the number of Mason_variator iterations increases, base, pair, triplet distributions, and quartet distributions all move farther and farther away from the parent genome, at slower and slower rates. Moreover, confirming the secondary hypothesis in §2.1, the higher the dimensionality, the more rapid the movement: quartets are more fragile than triplets, which are more fragile than pairs, which are more fragile than individual bases. The horizontal lines in Figure 3, whose colors match those of the curves, are simulation-derived 1% p-values: the probability that the distribution matches that of the original genome is less than 0.01 when the distance exceeds the line. Interestingly, the numbers of iterations at which the 1% thresholds are passed (i.e., where the curves cross the lines) are nearly the same for pairs, triplets and quartets and lower than for bases alone.

Measuring Degradation of Triplet Distributions via Entropy
So far, we have confined attention to what degradation moves away from. In many ways, what it moves toward is more useful, because there is a single infinite degradation endpoint representing maximum entropy, which in statistical physics and probability theory, measures disorder. The entropy of a probability distribution P on a finite set S is with the convention that 0 × −∞ = 0. Entropy is minimized by distributions concentrated at a single point and maximized at the uniform distribution on S, with maximizing value log(|S|), where |·| denotes cardinality. The existence of the universal maximizing value enables us to measure degradation as movement toward maximum entropy, removing the common origin issue discussed in §2.2. Figure 4 shows the effect of 500 Mason_variator iterations on entropy of triplet distributionshereafter, just triplet entropy-of the adenovirus genome, starting from the genome itself (black curve) compared to starting from the genome degraded by 250 Mason_variator iterations (blue curve), degraded by 500 MV iterations (green curve), degraded by 1000 Mason_variator iterations (yellow curve), and degraded by 1500 Mason_variator iterations (red curve). The y-axis is the increase in entropy as a function of Mason_variator iterations, so Figure 4 shows movement toward maximal entropy.

Outlier Detection
One effective strategy for identifying elements of a genome database with problematic quality is to search for outliers. We concede that the implicit presumption that the bulk of the database is of high quality may be untested. The key question is, "Outlying with respect to what metric?" In this section, the metric is based on hierarchical cluster analysis of triplet entropy increase resulting from Mason_variator degradation, that is, on the shapes of the curves in Figure 5. The clustering is in three dimensions, as opposed to 64 dimensions for triplet distributions and 21 for amino acid distributions. Possibly unexpectedly, the results in Karr et al. (2021b) and those in §4.2 are very similar.

Outlier Detection Using Triplet Distributions
We showed in Karr et al. (2021b) that clustering of triplet distributions identifies outliers. Briefly, we performed hierarchical clustering, using Euclidean distances and "complete clustering" in R (R Core Team, 2020), on the 26,964-genome database, using as clustering variables the 64 standardized components of the triplet distributions. By means of standard heuristics that trade off model fit and model complexity, the number of clusters was determined to be 23.
The top panel in Figure 6 is a plot of two-dimensional multidimensional scaling (MDS) (Kruskal, 1964;Cox and Cox, 2001) of the 23 cluster centroids. The overwhelming majority of coronavirus genomes-26,433 of the original 26,953, or 98.1%-are in a single cluster. One original coronavirus genome appears by itself, in cluster 12. Cluster 13 contains the adenovirus genome alone, while each of the 10 degraded coronavirus genomes appears in a cluster by itself (clusters 14-23). Thus, the deliberate outliers are not only detected but also distinguished from one another. The dendrogram in the bottom panel of Figure 6 shows that the coronavirus genome in cluster 12 and the deliberate outliers are separated from the remaining 26,952 coronavirus genomes at the first split in the clustering process. Clusters 1-10, which are small, are potential outliers as well. See Karr et al. (2021b) for details and a scientific interpretation.

Outlier Detection Using Entropy Degradation
Here we cluster the genomes on the basis of degradation behavior of triplet entropy, that is, the curves shown in Figure 5. As noted already, the results match closely with those in §4.1.
The clustering is now in only three dimensions, reached by a path that starts with Figure 5. Every one of the 26,964 curves plotted there is based on entropy following 0, 250, 500, 1000 and 2000 Mason_variator iterations. We fitted a quadratic function to each set of 5 values, reducing the dimension to 3. These quadratic models are uniformly good: the smallest of the coefficients of determination, R 2 , is 0.941 and 99% of them exceed 0.976. Hierarchical clustering was then performed on standardized versions of the three quadratic coefficients, using the "ward.D" option in R, resulting in 34 clusters, with counts ranging from 11 to 1470. Statistically, the clustering is extremely good: the cluster numbers alone explain 98.96509%, 99.02426% and 98.45677% of the variation in the quadratic coefficients.
Paralleling Figure 6, Figure 7 shows the result of applying two-dimensional MDS to the cluster centroids, as well as the associated dendrogram. There is nothing comparable to the massive coronavirus cluster in the triplet distribution analysis. As noted above, the largest cluster in the triplet entropy degradation cluster contains only 1470 genomes. The adenovirus outlier and the ten degraded coronavirus outliers are placed together in cluster 34, and Figure 7 shows that they clearly differ from all of the other genomes. Clusters 1-5 are candidate outliers. Not only are they relatively small, but also each differs strongly from all of the other clusters. They are suggestively similar to clusters 1-10 in Figure 6, which we pursue in §4.3. In Figure 7, the 11 deliberate outliers in cluster 34 are distant from the majority of the coronavirus genomes, but no more so than the 18 coronavirus genomes in cluster 2.

Relationships Between the Two Sets of Clusters
Clusters 1-10 in the triplet distribution analysis together contain 519 genomes, a number similar to the number of genomes in clusters 1-5 for the triplet entropy analysis. Moreover, both analyses separate the 11 deliberate outliers from the 26,953 legitimate coronavirus genomes, although differently. The triplet distribution analysis places these outliers in 11 distinct clusters, while the triplet entropy degradation analysis places them all in a single cluster. Table 1 shows the complete and strongly block-diagonal relationship between the two sets of clusters. For clarity, cells in Table 1 containing values of 0 are highlighted in pink. In detail,  1. Triplet entropy degradation cluster 34 is, as noted above, an amalgamation of triplet distribution clusters 13-24; both contain the 11 deliberate outliers.
2. The lone coronavirus genome in triplet distribution cluster 12 is absorbed into entropy degradation cluster 7, along with 843 other coronavirus genomes. Perhaps it is not an outlier after all. 7 3. Triplet entropy degradation clusters 6-33 disaggregate the massive, 26,433-genome triplet distribution cluster 11, modulo four additional genomes.
4. Triplet entropy degradation clusters 1-5, containing 516 genomes and triplet distribution clusters 1-8 and 10, both containing 516 genomes, are identical collectively. These are in the upper-left corner in Table 1. Clearly, the two approaches are detecting the same outliers, with different nuances.
5. Triplet distribution cluster 9, with 3 genomes, is anomalous. Each genome it contains lies in its own large entropy degradation cluster.
Therefore, much of the scientific interpretation of outliers in Karr et al. (2021b), which is based on the text string ID variable in the NCBI database, carries over here.

Higher-Order DNA Structure
We showed in §3 that degradation attenuates (relatively) low-dimensional genome characteristics such as tuple distributions (Figure 3). We see here that more complex structure such as repeats and palindromes is affected even more strongly. As exemplar, we use an E. coli genome of length 4,639,675 downloaded from NCBI; the same genome appears again in Figure 8.

Palindromes
Genomic palindromes, unlike those in ordinary language, consist of a sequence of bases followed immediately by its reverse complement. So, an example is ATTCGATT||AATCGAAT. 8 In what follows, palin-dromes are parameterized by half-length; the example has half-length 8. Their behavior with respect to Mason_variator degradation differs somewhat from that of repeats. Table 3 shows that long palindromes (half-lengths 12, 14 and 16), are not plentiful to begin with, and vanish, modulo noise discussed momentarily, within 100 Mason_variator iterations. Palindromes with half-length 8 and 10 do decline in number, but do not vanish, even by 2000 iterations. Moreover, their numbers can increase, although not enormously. Palindromes of half length 6 barely diminish at all, and fluctuate substantially, The noise and the increases suggest that short palindromes differ from the other genome features discussed in this paper, and especially from repeats. They are resistant to the Mason_variator degradation, and can even be produced by it. This is not surprising because very short repeats are lowdimensional and may, therefore, be too short to be interesting biologically.   Figure 8 demonstrates that degradation behavior is not confined to viruses. It is the analog of Figure 5 for two bacterial genomes-P. gingivalis and E. coli (the same genome in §5.1), for three human chromosomes-1, X and Y, and for human mitochondrial DNA. All six genomes were downloaded from NCBI; the human genome is identified as GRCh38.

Conclusions
In this paper, we have introduced and investigated a new, degradation-based approach to data quality for genome sequence databases, and established that it is sound scientifically and statistically. Our principal application is to outlier detection, and our methods are demonstrably effective. At least three paths for further research are clear. The first is that our paradigm, at this point, does not produce quantified uncertainties about the decisions it may engender, so tools such as Specified Certainty Classification (Karr et al., 2021a) cannot be applied. Following this path also requires, as raised in Appendix A, more explicit attention to the decisions to be made using the data. To consider decision quality fully leads to the second path-better understanding of the effects of data quality on bioinformatics software pipelines. Now that we can create data of demonstrably and quantifiably lower quality, this path is feasible. Third and more speculatively, there is the relationship between data quality and adversarial attacks on genome databases or software pipelines (Biggio and Roli, 2018;Farbiash and Puzis, 2020;Valdivia-Granda, 2019). Attempts to "pollute" databases with (what may turn out to be) low quality genomes are potentially detectable using the outlier identification strategies presented here. Risk-utility paradigms for statistical disclosure limitation (SDL) discussed in Cox et al. (2011) are relevant, especially the need to distinguish attackers from legitimate users of the data.
Finally, there is clear potential to extend our paradigm to contexts other than genomics, provided that a credible generative model for quality degradation can be constructed. To illustrate, in the official statistics context of Appendix A, one simply needs a mechanism, such as the microsimulation tool in Karr and Cox (2012), to simulate one of more forms of total survey error.

A Background on Data Quality
This section draws on Karr et al. (2006b), with additional material from Office of Management and Budget (OMB) documents (Office of Management and Budget, 2006).
Data quality is a complex, multi-dimensional construct driven ultimately by data usage and subsequent decisions (Karr et al., 2001(Karr et al., , 2006bKarr, 2013). There are two key questions: What are the intended uses of the data? What decisions are to be made based in part on the data? In genomics, quality often focuses solely on errors. But, there are multiple hyperdimensions of data quality, each containing multiple dimensions: Process Dimensions related to the generation, assembly, description and maintenance of data-reliability (with several sub-dimensions), metadata, security and confidentiality. 9 Data Dimensions specifically associated with the data themselves. At the record/table level, these comprise accuracy, completeness, consistency and validity. Database-level dimensions are identifiability and joinability.
User Dimensions related to users and user-accessibility, integrability, interpretability, rectifiability, relevance and timeliness.
Objectivity Dimensions describing whether disseminated information is accurate, reliable, scientifically sounds and unbiased in terms of both substance and presentation.
Utility Dimensions addressing usefulness of the information for the intended audience's anticipated purposes.
Integrity Dimensions pertaining to protection of information from unauthorized, unanticipated or unintentional falsification or corruption.
Data quality cannot be disconnected from economics, because of the necessity to ask the question "To what end have the data been collected?" (English, 1999;Karr, 2013). Data quality-associated costs are imposed on multiple classes of stakeholders, among them, data subjects, data collectors and stewards, decision makers and society at large. There are both actual and opportunity costs. Whether incurring them is justified in a particular cases depends on the associated decisions.
Because of the large, public financial consequences, the field in which data quality has received the most attention and deepest investigation is official statistics. The entire field of total survey error (TSE) has emerged in response (Biemer and Lyberg, 2003;Groves, 2004;Biemer et al., 2017), which is dominated by "total quality" thinking.
Measurement of data quality has long been a perplexing issue if construed narrowly to mean error rates, because ground truth is unknown. In §4 and Karr et al. (2021b), we applied clustering to identify outliers that may be-and in synthesized cases, are-data quality problems. In other settings-notably, statistical disclosure limitation, where data are altered deliberately in order to protect confidentiality, quality is measured by analytical utility. While sometimes problematic, because measures based on utility are often either too blunt or too narrow (Cox et al., 2011), this approach has been broadly productive (Karr et al., 2006a;Reiter et al., 2009).
Uncertainty has not dominated data quality strategies, but has lurked in the background for years. Recent approaches such as "fitness for purpose" (Mocnik et al., 2017) and the "decision quality rather than data quality" focus of Karr (2013) account implicitly for uncertainty. Re-formulated as "Total Survey Uncertainty," TSE can address uncertainty. Yet another means of accommodating uncertainties is explicit consideration of risk (Eltinge et al., 2013).