Excap: Maximization of Haplotypic Diversity of Linked Markers

Genetic markers, defined as variable regions of DNA, can be utilized for distinguishing individuals or populations. As long as markers are independent, it is easy to combine the information they provide. For nonrecombinant sequences like mtDNA, choosing the right set of markers for forensic applications can be difficult and requires careful consideration. In particular, one wants to maximize the utility of the markers. Until now, this has mainly been done by hand. We propose an algorithm that finds the most informative subset of a set of markers. The algorithm uses a depth first search combined with a branch-and-bound approach. Since the worst case complexity is exponential, we also propose some data-reduction techniques and a heuristic. We implemented the algorithm and applied it to two forensic caseworks using mitochondrial DNA, which resulted in marker sets with significantly improved haplotypic diversity compared to previous suggestions. Additionally, we evaluated the quality of the estimation with an artificial dataset of mtDNA. The heuristic is shown to provide extensive speedup at little cost in accuracy.


Introduction
Genetic markers are ubiquitous in molecular biology and have many applications, such as forensic analysis, taxonomic barcoding, and detection of inherited diseases. While full-length sequences would be the preferred material for most studies, real-world circumstances sometimes force the usage of a limited set of markers as a proxy. In particular, SNPs and short tandem repeats are commonly used as markers in forensics. How to choose markers is an important question and many factors can affect such a decision. For example, sample availability, application, sequencing technology, cost, time, and practicality has to be taken under consideration [1][2][3][4]. Although high-throughput sequencing has revolutionized molecular biology and genetics, it is not yet an economically feasible route for forensic laboratories.
The costs of the analysis and the amount of work are usually directly dependent on the number of markers that should be examined. This number is affected by marker length, which in turn depends on available sequencing technology and the size of the biological sample.
One specific goal is to maximize the information gained by a set of markers. A common measurement for the information gained from a marker is its haplotypic diversity h [5,6]. It describes the probability that this marker differs in two individuals randomly chosen from a given population. Hence, h is a measurement of the genetic variability of the marker. In forensic sciences this diversity is known as exclusion capacity, because the markers are used to identify individuals or to exclude them from a panel of suspects.
If the markers are found in nuclear DNA, as is chosen for many applications, they can often be regarded as statistically independent, provided the markers are situated on different chromosomes or sufficiently far apart if on the same chromosome. Thus, the haplotypic diversity of a set of markers can be calculated by multiplying the marker's diversities.
In some applications, nuclear DNA is less interesting or unsuitable. In forensics, for example, nuclear DNA is for some sample types, such as hairs, often highly degraded and many markers may therefore not be available (see e.g. [7,8]). Another example can be phylogenetic studies, where other DNA sources have properties more suitable for the investigation's purpose [9]. In both cases, DNA from mitochondria (mtDNA), which is many times more abundant, can be studied. However, a disadvantage with mtDNA is its relatively small size, and it has to be considered as one single linkage group. Thus, the information given by its potential markers may no longer be considered as statistically independent. This raises a question: if you are given an unbiased sample of mtDNA sequences from a population and want to find the most efficient combination of markers, what do you do?
Genetic variability is commonly due to a relatively small and dispersed set of positions, causing potential markers to be dispersed as well. If the variable positions, and hence the potential markers, are linked, the haplotypic diversity of marker sets cannot directly be determined from h-values of single markers. Thus, it is not easy to find the optimal subset of markers. To calculate the most informative marker subset, haplotypic diversities of the markers have to be estimated from a sample set of individuals taken from the population. Since the strength of the informational coupling between the single markers influences the haplotypic diversity of their combination, this has to be estimated from the given sample set as well. Today, this is done by a combination of visual inspection and subjective choice of markers in the lack of suitable programs. With this methodology it is very laborious and virtually impossible to find the best marker sets even for relatively small datasets.
We present algorithms for finding the most informative subset of markers, subject to constraints such as marker size and number, and estimating the markers' haplotypic diversity from a sample set given as multiple sequence alignment (MSA). We implemented the algorithms in the Excap program, which reads an input alignment in FASTA format and returns an optimal set of markers of a requested length and size. This is a significant step forward as compared to current practice.

Haplotypic diversity
The calculation of h is based on an estimation for the diversity of a genetic marker [5]. Considering a sample set of size n and a marker with q different haplotypes, each with frequency x i ,1ƒiƒq, the haplotypic diversity is estimated as: For technical reasons, a new method to calculate the diversity of a marker is introduced. Instead of the frequencies x i , the number of sequences each haplotype comprises, r i~n x i , is used to count the number of possible pairs of different haplotypes in the sample, henceforth designated as separation index s: Its maximum s max is n(n{1) 2 when the number of haplotypes equals the number of sequences (n) and therefore r i~1 ,Vi. The haplotypic diversity can easily be calculated as h~s s max : ð3Þ One advantage of this approach is that all s-values can be handled as integers during the whole calculation without loss of precision. It furthermore simplifies the estimation of the maximum h-value a set of markers can achieve, which is used in our branchand-bound algorithm.

Data representation
The input data, given as an MSA, is reduced to its polymorphic columns. Each of those columns is transformed into an integer vector of length n, representing the different haplotypes. This vector is further designated as haplocode. The different haplotypes are numbered from 0 to the number of haplotypes at this position minus 1. Figure 1 provides a small example.
A marker's width is the number of columns it spans in the MSA. Markers that are wider than 1 bp are allowed to overlap, but we ignore markers that are contained by another marker.
Analoguously to a marker, a set of markers has a haplocode describing its haplotypic information, too. The set's haplocode is easy to calculate in O(n log n) time by combining the haplocodes of the markers it contains. Furthermore, each marker i has a separation index s i which indicates how many sequence pairs this marker separates. It is easy to calculate s i from the marker's haplocode in O(n log n) time. It is analogous to definite a separation index s C for a marker set C.
Before combining single markers, the complexity of the dataset is reduced: 1. Elimination of redundancy. If two markers represent the same information or if the information of one marker is a subset of another marker's information, one of them is removed. The first case is easy to handle because markers that contain the same information will have the same haplocode. The second case is detected by looking at the haplocode of the two markers combined. If the combination's haplocode is identical to the haplocode of one of the markers, the other marker is deleted.
2. Sorting of markers. All markers are sorted by their separation index in decreasing order. This step is necessary for the estimation-step in the Excap algorithm.

Finding the best marker set
Since the diversity of a marker set cannot be calculated from its members' diversities directly, all possible combinations have to be built up to determine their h-value. The number of possible combinations grows exponentially, and we will try to limit how many combinations we have to consider. Later on a heuristic is provided that gives a good approximation to the optimal solution.
To find the best marker set of size k, the sorted markers are combined successively using a depth-first search combined with a branch-and-bound approach. Beginning with the strongest marker, according to its separation index, the next markers with weaker svalues are added recursively step by step until the set has reached size k. In each step it is estimated if the current set can achieve a better separation index than the best one found so far. If the separation index is worse, the recursion returns to the calling step and tries to add the marker with the next best s-value to the set.
The excap algorithm ( Figure 2) describes how markers are combined using a depth-first search. The initial parameters are an empty set of markers (C), a pointer to the start of the list of sorted

A bound for the separation index
In the bounding step, it is estimated how good a set of markers would be if l further markers were added before the set reached maximum size. If the set C separates s C sequence pairs and if the current marker to add i separates s i sequence pairs, then the combination of C and i can at best separate s C zs i sequence pairs. That is a weak upper bound, relying on marker independence, but since the strength of the coupling of the single markers is unknown, it is difficult to estimate how much information they share. Since the markers are sorted decreasingly by their s-values, the separation index is bounded a set C can achieve with l additional markers is bounded by As a heuristic approach, we modified the bounding in order to reduce the number of alternatives. Instead of assuming that each newly added marker can contribute its whole separation index to the separation index of the new combination, the marker's s i is scaled down by a factor 1{f , where f is the heuristic parameter. We require that 0ƒf v1, and f~0 corresponds to the normal, nonheuristic approach, while f~1 would mean that every added marker is assumed to contribute no new information. The separation index is heuristically bounded by Note that f most likely has a non-linear effect on the method's estimation of h, which is probably undesirable from a user's perspective.

Artificial data
An artificial population was generated, represented by 10,000 sequences with a length of 16,000 bp each. Ten markers, each spanning 800 bp, with different sets of polymorphic columns and increasing h-values were introduced in sequences. The mutations were made using the pseudo-random number-generator mt19937 [10] provided by the Boost C++ Library with default parameters.
From the given population, sample sets of different sizes have been drawn (uniformly at random, using mt19937).

Biological data
The biological dataset consisted of 241 full length mtDNA sequences also used by Coble et al [11], which were taken from the mtDB database [12].
All sequences were aligned to the revised Cambridge reference sequence (rCRS) for human mtDNA [13] using the Kalign program [14] with its default values. In all calculations the rCRS   was not considered as a part of the dataset and was only used for indexing. This alignment induced 483 polymorphisms, and 264 remained after reducing redundancy (see Section ).
The 241 input sequences were separated in 18 groups according to their HV1/HV2 type based on the rules of Coble et al. Some calculations were limited to 59 significant positions from the coding region, previously defined by Coble et al.
The disease position exclusions were done according to the information available in the Mitomap project [15].

Artificial data
To test the accuracy of the estimation and the influence of the sample size on the quality of the estimated h-values for a given population an artificial population was created.
All 10,000 sequences were used as input for the Excap program in order to calculate the`true'' h-values of markers in the whole population. In the following step the haplotypic diversity of those markers should be estimated from samples randomly drawn from the population. This estimation was done using the Excap program. The procedure was repeated 100 times for each sample size (except for sample size 1,000 which was drawn 50 times). From each drawn set the markers' diversity in the whole population was estimated. The mean value and standard deviation of h over all 100 (50) samples was recorded for each specific sample size.
The tests revealed a strong correlation between sample size and quality of the estimation, see Figure 3. For weak markers that had an h-value below 0:2, a sample set of size 10 was of limited value due to the high the standard deviation. With a sample size of 50 and stronger markers, the influence of the standard deviation sank and the estimates became more reliable. For a sample of size 10 and a marker with a low h-value (e.g. h~0:179) the standard deviation (0.177)was almost as high as the estimated value (0.191) itself and the marker's utility was therefor relatively uncertain. On the other hand, a sample size of 50 and a stronger marker (e.g. h~0:635) had an acceptable standard deviation (0.072).
It is clear that one must be careful not to overstate the importance of h values estimated from small samples. It is worth noting, however, that standard deviation seems to be independent of h and only determined by sample size.
Detailed information about the correlation between sample size and the quality of the estimation can be found in Table 1.

Tests on biological data
Tests on biological data was a comparison between different panels of mitochondrial single nucleotide polymorphisms (SNP panels) from the work of Coble et al. [11] and sets of SNPs created using the Excap program.
Coble et al. [11] sequenced the mtDNA of 241 individuals of and presented eight different multiplex panels of overall 59 SNPs which were designed to complement the results of the HV1/HV2 testing. The panels were chosen to maximize their ability to separate two individuals from the same subtype. Due to practical reasons, Coble et al [11] excluded all polymorphic positions related to diseases or positions with nonsynonymous mutations. All positions of the hypervariable regions were also disregarded.
To re-evaluate the eight panels, the sequences from their respective HV1/HV2 types, as designated by [11], were given as input to Excap. That is, if panel B was most applicable to HV1/ HV2 types H:2, H:3 and H:6, Excap was run on the sequences of those subtypes. To keep our results comparable, only the 59 positions used in the multiplexed panels proposed by Coble have been taken as input data for the recalculation. Would Excap suggest the same marker sets, or find better combinations?
We found that Excap gave better marker sets for each of the eight panels, either by having a higher haplotypic diversity with the same number of markers (seven cases) or by having a smaller marker set (on case) but the same h. The improvements on diversity ranged from 0 % (but less SNPs used) to 101.81 %. See Table 2 for all results. The actual differences in terms of chosen SNPs are shown in Table 3 and they demonstrate why the automated Excap method does better than visual inspection based on individual sites' diversity. Also note that the actual number of different SNP positions are reduced from 59 to 47.
Even better results were achieved with the input data not limited to the 59 SNPs present in Coble's panels. To achieve similar initial conditions, all positions of the hypervariable regions (73-340 and 16024-16365), the poly AC repeat (515-524) and all positions related to diseases (174 positions) were excluded using information provided by the MitoMap database [15]. Input data was the same sets of sequences as used for the previous calculations. All panels could be improved significantly (see Table  4).
Another important result is the correlation between the haplotypic diversity and the width of the combined markers in biological data. We computed h-values for up to 9 markers f widths 1, 30 or 100 bp, on mitochondrial genome sequences. As few as three SNPs, or two 30 bp markers, have a higher haplotypic diversity than one 100 bp marker, see Figure 4. Also note that to get the same h as from nine SNPs, you would need as much as four 100 bp markers or six 30 bp markers.
With increasing set size k, the h-value of a set converges to a maximum, defined as the diversity of the whole sequence. The size of the marker set has a higher effect to the convergence than the width of the included markers and the speed of convergence decreases with the growing set size. This has an important effect on practical work because, e.g., typing 20 SNPs instead of 10 SNPs   Table 3. Improved panels for HV subtypes. may be much more expensive, but does not really provide more information.

Heuristic analysis
Running Excap on large datasets can be time consuming and a heuristic approach becomes necessary. To get a better understanding of our heuristic's impact on results and execution times, we again turned to the data from [11].
We ran the Excap algorithm to find an optimal marker set of k SNPs with different values on the heuristic parameter f , starting from f~0 (no heuristic) and up to f~1. Each calculation was performed for a single SNP-marker up to a set of seven SNPmarkers.
The more markers a set contained, with consequently higher h, the more stable the estimation of h was, see Figure 5(a). For a set of four markers, f~0:7 still resulted in the optimal solution. For k~10, we could set f~0:9 without notably worse estimation of h. This observation is encouraging, since it is for the larger values on k that the heuristic becomes indispensable.
In the tested cases, the heuristic approach had a substantial improvement of execution time. With increasing f , the execution time dropped exponentially, see Figure 5(b). Calculating the optimal set of seven SNPs in the data provided by Coble et al. took 143 seconds without heuristic. Using a 80 % heuristic reduced runningtime to 5 seconds and the provided solution was still the optimal one. The most dramatic case is for k~10, which dropped from being a day-to-day calculation to executing in a few seconds when f~0:9. As a comparison with Figure 5(c) reveals, f~0:9 gives a result which is still close to optimal.

Discussion
Our experiments with Excap on artificial and biological data showed significant advantages. Most sets of variable markers chosen by hand are not optimal. In almost every case, the most variable markers do not form the most variable set. In many cases, one marker with a weak diversity improved the chosen set significantly, much more than a marker with a much higher hvalue which is not easy to find with a`manual'' approach. Although our main algorithm's execution time is exponential, the related heuristic achieves reasonable and competitive results in many practical cases. Importantly, our parameterized heuristic allows for experimentation and adaption to different datasets.
As long as there is a dataset available that is big enough and representative for a certain population, Excap can be used to estimate the diversity and to optimize the choice of markers. One   should be aware to keep the size and quality of the provided sample set in mind and not to overestimate the results. There are some important consequences arising from having our program suggesting markers. Besides manual labour being reduced, the risk of making mistakes decreases significantly. Achieving the same variability with less and shorter markers also results in a reduction of costs. It is clear that the quality of chosen markers improves and that data is utilized better.
There is an important cost-benefit analysis to do when making decisions on marker sets, since there are limits of utility to including yet another marker. Excap enables a careful consideration regarding how many markers to use by estimating how much more information can be gained.
We believe Excap will also help choosing between technologies for DNA typing, and not just what markers to choose. By systematically trying different marker sizes and set sizes, one can determine the most economical and efficient way to reach a desired haplotypic diversity. As shown in Figure 4, a small number of SNPs may be more informative than hundreds of bases in larger markers.
There are opportunities for improvements to the present work. The suggested heuristic was a first approach to keep computation time in manageable scales, but is not necessarily the most efficient one. It could also be worthwhile to analyze the influence of the heuristic factor f on the quality of the estimated h-value. A heuristic factor f effecting the results in a more sensitive way than the linear one might allow a more fine-grained application of the heuristic. Due to the simple heuristic approach it is possible that a heuristic with a smaller f -value results in a worse set of markers than one with a greater f . We would also like to investigate an approach in which the heuristic was used as a preprocessing step to reduce a large set of input markers. The optimal algorithm could then be applied to this more manageable set of markers. Furthermore, it could be investigated how existing parametrized algorithms, that solve NP-complete problems for a fixed parameter in polynomial time, could be applied to this problem. The size of the optimal set could be such a fixed parameter.

Availability
Excap is written in C++ and the source code is hosted at http:// sourceforge.net/projects/excap, distributed under the GNU Public License.