## Figures

## Abstract

Genetic markers, defined as variable regions of DNA, can be utilized for distinguishing individuals or populations. As long as markers are independent, it is easy to combine the information they provide. For nonrecombinant sequences like mtDNA, choosing the right set of markers for forensic applications can be difficult and requires careful consideration. In particular, one wants to maximize the utility of the markers. Until now, this has mainly been done by hand.

We propose an algorithm that finds the most informative subset of a set of markers. The algorithm uses a depth first search combined with a branch-and-bound approach. Since the worst case complexity is exponential, we also propose some data-reduction techniques and a heuristic.

We implemented the algorithm and applied it to two forensic caseworks using mitochondrial DNA, which resulted in marker sets with significantly improved haplotypic diversity compared to previous suggestions. Additionally, we evaluated the quality of the estimation with an artificial dataset of mtDNA. The heuristic is shown to provide extensive speedup at little cost in accuracy.

**Citation: **Kahles A, Sarqume F, Savolainen P, Arvestad L (2013) Excap: Maximization of Haplotypic Diversity of Linked Markers. PLoS ONE 8(11):
e79012.
https://doi.org/10.1371/journal.pone.0079012

**Editor: **David C. Samuels, Vanderbilt University Medical Center, United States of America

**Received: **December 12, 2012; **Accepted: **September 18, 2013; **Published: ** November 7, 2013

**Copyright: ** © 2013 Kahles et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **The work of AK was funded through a scholarship of the Klaus Murmann Fellowship Programme of the Foundation of German Business (Stiftung der Deutschen Wirtschaft, SDW). PS is a Royal Academy of Sciences Research Fellow supported by a grant from the Knuth and Alice Wallenberg Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Genetic markers are ubiquitous in molecular biology and have many applications, such as forensic analysis, taxonomic barcoding, and detection of inherited diseases. While full-length sequences would be the preferred material for most studies, real-world circumstances sometimes force the usage of a limited set of markers as a proxy. In particular, SNPs and short tandem repeats are commonly used as markers in forensics. How to choose markers is an important question and many factors can affect such a decision. For example, sample availability, application, sequencing technology, cost, time, and practicality has to be taken under consideration [1]–[4]. Although high-throughput sequencing has revolutionized molecular biology and genetics, it is not yet an economically feasible route for forensic laboratories.

The costs of the analysis and the amount of work are usually directly dependent on the number of markers that should be examined. This number is affected by marker length, which in turn depends on available sequencing technology and the size of the biological sample.

One specific goal is to maximize the information gained by a set of markers. A common measurement for the information gained from a marker is its *haplotypic diversity* [5], [6]. It describes the probability that this marker differs in two individuals randomly chosen from a given population. Hence, is a measurement of the genetic variability of the marker. In forensic sciences this diversity is known as *exclusion capacity*, because the markers are used to identify individuals or to exclude them from a panel of suspects.

If the markers are found in nuclear DNA, as is chosen for many applications, they can often be regarded as statistically independent, provided the markers are situated on different chromosomes or sufficiently far apart if on the same chromosome. Thus, the haplotypic diversity of a set of markers can be calculated by multiplying the marker's diversities.

In some applications, nuclear DNA is less interesting or unsuitable. In forensics, for example, nuclear DNA is for some sample types, such as hairs, often highly degraded and many markers may therefore not be available (see e.g. [7], [8]). Another example can be phylogenetic studies, where other DNA sources have properties more suitable for the investigation's purpose [9]. In both cases, DNA from mitochondria (mtDNA), which is many times more abundant, can be studied. However, a disadvantage with mtDNA is its relatively small size, and it has to be considered as one single linkage group. Thus, the information given by its potential markers may no longer be considered as statistically independent. This raises a question: if you are given an unbiased sample of mtDNA sequences from a population and want to find the most efficient combination of markers, what do you do?

Genetic variability is commonly due to a relatively small and dispersed set of positions, causing potential markers to be dispersed as well. If the variable positions, and hence the potential markers, are linked, the haplotypic diversity of marker sets cannot directly be determined from -values of single markers. Thus, it is not easy to find the optimal subset of markers. To calculate the most informative marker subset, haplotypic diversities of the markers have to be estimated from a sample set of individuals taken from the population. Since the strength of the informational coupling between the single markers influences the haplotypic diversity of their combination, this has to be estimated from the given sample set as well. Today, this is done by a combination of visual inspection and subjective choice of markers in the lack of suitable programs. With this methodology it is very laborious and virtually impossible to find the best marker sets even for relatively small datasets.

We present algorithms for finding the most informative subset of markers, subject to constraints such as marker size and number, and estimating the markers' haplotypic diversity from a sample set given as multiple sequence alignment (MSA). We implemented the algorithms in the Excap program, which reads an input alignment in FASTA format and returns an optimal set of markers of a requested length and size. This is a significant step forward as compared to current practice.

## Materials and Methods

### Haplotypic diversity

The calculation of is based on an estimation for the diversity of a genetic marker [5]. Considering a sample set of size and a marker with different haplotypes, each with frequency , the haplotypic diversity is estimated as:(1)

For technical reasons, a new method to calculate the diversity of a marker is introduced. Instead of the frequencies , the number of sequences each haplotype comprises, , is used to count the number of possible pairs of different haplotypes in the sample, henceforth designated as *separation index* :(2)

Its maximum is when the number of haplotypes equals the number of sequences () and therefore . The haplotypic diversity can easily be calculated as(3)

One advantage of this approach is that all -values can be handled as integers during the whole calculation without loss of precision. It furthermore simplifies the estimation of the maximum -value a set of markers can achieve, which is used in our branch-and-bound algorithm.

### Data representation

The input data, given as an MSA, is reduced to its polymorphic columns. Each of those columns is transformed into an integer vector of length , representing the different haplotypes. This vector is further designated as . The different haplotypes are numbered from 0 to the number of haplotypes at this position minus 1. Figure 1 provides a small example.

Each polymorphic column is transformed into an integer vector. **a)** Multiple sequence alignment, polymorphic columns/markers in bold face. **b)** Haplocode representation of the markers. **c)** Haplocode representation of a combination of the markers.

A marker's *width* is the number of columns it spans in the MSA. Markers that are wider than 1 bp are allowed to overlap, but we ignore markers that are contained by another marker.

Analoguously to a marker, a set of markers has a haplocode describing its haplotypic information, too. The set's haplocode is easy to calculate in time by combining the haplocodes of the markers it contains. Furthermore, each marker has a separation index which indicates how many sequence pairs this marker separates. It is easy to calculate from the marker's haplocode in time. It is analogous to definite a separation index for a marker set .

Before combining single markers, the complexity of the dataset is reduced:

1. **Elimination of redundancy.** If two markers represent the same information or if the information of one marker is a subset of another marker's information, one of them is removed. The first case is easy to handle because markers that contain the same information will have the same haplocode. The second case is detected by looking at the haplocode of the two markers combined. If the combination's haplocode is identical to the haplocode of one of the markers, the other marker is deleted.

2. **Sorting of markers.** All markers are sorted by their separation index in decreasing order. This step is necessary for the estimation-step in the Excap algorithm.

### Finding the best marker set

Since the diversity of a marker set cannot be calculated from its members' diversities directly, all possible combinations have to be built up to determine their -value. The number of possible combinations grows exponentially, and we will try to limit how many combinations we have to consider. Later on a heuristic is provided that gives a good approximation to the optimal solution.

To find the best marker set of size , the sorted markers are combined successively using a depth-first search combined with a branch-and-bound approach. Beginning with the strongest marker, according to its separation index, the next markers with weaker - values are added recursively step by step until the set has reached size . In each step it is estimated if the current set can achieve a better separation index than the best one found so far. If the separation index is worse, the recursion returns to the calling step and tries to add the marker with the next best -value to the set.

The *excap* algorithm (Figure 2) describes how markers are combined using a depth-first search. The initial parameters are an empty set of markers (), a pointer to the start of the list of sorted markers (*begin*), the maximum size of the set () and the recursion depth (*depth* = 1).

The worst case running time for finding the best combination of size out of available markers is . To perform a little better for large values of and , we provide a heuristic that modifies the estimation step and thereby carries out the bounding step earlier, see section.

### A bound for the separation index

In the bounding step, it is estimated how good a set of markers would be if further markers were added before the set reached maximum size. If the set separates sequence pairs and if the current marker to add separates sequence pairs, then the combination of and can at best separate sequence pairs. That is a weak upper bound, relying on marker independence, but since the strength of the coupling of the single markers is unknown, it is difficult to estimate how much information they share. Since the markers are sorted decreasingly by their -values, the separation index is bounded a set can achieve with additional markers is bounded by(4)

As a heuristic approach, we modified the bounding in order to reduce the number of alternatives. Instead of assuming that each newly added marker can contribute its whole separation index to the separation index of the new combination, the marker's is scaled down by a factor , where is the *heuristic parameter*. We require that , and corresponds to the normal, non-heuristic approach, while would mean that every added marker is assumed to contribute no new information. The separation index is heuristically bounded by(5)

Note that most likely has a non-linear effect on the method's estimation of , which is probably undesirable from a user's perspective.

### Artificial data

An artificial population was generated, represented by 10,000 sequences with a length of 16,000 bp each. Ten markers, each spanning 800 bp, with different sets of polymorphic columns and increasing -values were introduced in sequences. The mutations were made using the pseudo-random number-generator mt19937 [10] provided by the Boost C++ Library with default parameters.

From the given population, sample sets of different sizes have been drawn (uniformly at random, using mt19937).

### Biological data

The biological dataset consisted of 241 full length mtDNA sequences also used by Coble *et al* [11], which were taken from the mtDB database [12].

All sequences were aligned to the revised Cambridge reference sequence (rCRS) for human mtDNA [13] using the Kalign program [14] with its default values. In all calculations the rCRS was not considered as a part of the dataset and was only used for indexing.

This alignment induced 483 polymorphisms, and 264 remained after reducing redundancy (see Section ).

The 241 input sequences were separated in 18 groups according to their HV1/HV2 type based on the rules of Coble *et al*. Some calculations were limited to 59 significant positions from the coding region, previously defined by Coble et al.

The disease position exclusions were done according to the information available in the Mitomap project [15].

## Results

### Artificial data

To test the accuracy of the estimation and the influence of the sample size on the quality of the estimated -values for a given population an artificial population was created.

All 10,000 sequences were used as input for the Excap program in order to calculate the ``true'' -values of markers in the whole population. In the following step the haplotypic diversity of those markers should be estimated from samples randomly drawn from the population. This estimation was done using the Excap program. The procedure was repeated 100 times for each sample size (except for sample size 1,000 which was drawn 50 times). From each drawn set the markers' diversity in the whole population was estimated. The mean value and standard deviation of over all 100 (50) samples was recorded for each specific sample size.

The tests revealed a strong correlation between sample size and quality of the estimation, see Figure 3. For weak markers that had an -value below , a sample set of size 10 was of limited value due to the high the standard deviation. With a sample size of 50 and stronger markers, the influence of the standard deviation sank and the estimates became more reliable. For a sample of size 10 and a marker with a low -value (e.g. ) the standard deviation (0.177)was almost as high as the estimated value (0.191) itself and the marker's utility was therefor relatively uncertain. On the other hand, a sample size of 50 and a stronger marker (e.g. ) had an acceptable standard deviation (0.072).

The estimated heterogozities and the corresponding standard deviations for sample sizes 10, 50, 100 and 1000.

It is clear that one must be careful not to overstate the importance of values estimated from small samples. It is worth noting, however, that standard deviation seems to be independent of and only determined by sample size.

Detailed information about the correlation between sample size and the quality of the estimation can be found in Table 1.

### Tests on biological data

Tests on biological data was a comparison between different panels of mitochondrial single nucleotide polymorphisms (SNP panels) from the work of Coble *et al.* [11] and sets of SNPs created using the Excap program.

Coble *et al.* [11] sequenced the mtDNA of 241 individuals of and presented eight different multiplex panels of overall 59 SNPs which were designed to complement the results of the HV1/HV2 testing. The panels were chosen to maximize their ability to separate two individuals from the same subtype. Due to practical reasons, Coble *et al* [11] excluded all polymorphic positions related to diseases or positions with nonsynonymous mutations. All positions of the hypervariable regions were also disregarded.

To re-evaluate the eight panels, the sequences from their respective HV1/HV2 types, as designated by [11], were given as input to Excap. That is, if panel B was most applicable to HV1/HV2 types H:2, H:3 and H:6, Excap was run on the sequences of those subtypes. To keep our results comparable, only the 59 positions used in the multiplexed panels proposed by Coble have been taken as input data for the recalculation. Would Excap suggest the same marker sets, or find better combinations?

We found that Excap gave better marker sets for each of the eight panels, either by having a higher haplotypic diversity with the same number of markers (seven cases) or by having a smaller marker set (on case) but the same . The improvements on diversity ranged from 0 % (but less SNPs used) to 101.81 %. See Table 2 for all results. The actual differences in terms of chosen SNPs are shown in Table 3 and they demonstrate why the automated Excap method does better than visual inspection based on individual sites' diversity. Also note that the actual number of different SNP positions are reduced from 59 to 47.

Even better results were achieved with the input data not limited to the 59 SNPs present in Coble's panels. To achieve similar initial conditions, all positions of the hypervariable regions (73-340 and 16024-16365), the poly AC repeat (515-524) and all positions related to diseases (174 positions) were excluded using information provided by the MitoMap database [15]. Input data was the same sets of sequences as used for the previous calculations. All panels could be improved significantly (see Table 4).

Another important result is the correlation between the haplotypic diversity and the width of the combined markers in biological data. We computed -values for up to 9 markers f widths 1, 30 or 100 bp, on mitochondrial genome sequences. As few as three SNPs, or two 30 bp markers, have a higher haplotypic diversity than one 100 bp marker, see Figure 4. Also note that to get the same as from nine SNPs, you would need as much as four 100 bp markers or six 30 bp markers.

The more markers a combination contains the less effect the width has to the overall information of the combination.

With increasing set size , the -value of a set converges to a maximum, defined as the diversity of the whole sequence. The size of the marker set has a higher effect to the convergence than the width of the included markers and the speed of convergence decreases with the growing set size. This has an important effect on practical work because, e.g., typing 20 SNPs instead of 10 SNPs may be much more expensive, but does not really provide more information.

### Heuristic analysis

Running Excap on large datasets can be time consuming and a heuristic approach becomes necessary. To get a better understanding of our heuristic's impact on results and execution times, we again turned to the data from [11].

We ran the Excap algorithm to find an optimal marker set of SNPs with different values on the heuristic parameter , starting from (no heuristic) and up to . Each calculation was performed for a single SNP-marker up to a set of seven SNP-markers.

The more markers a set contained, with consequently higher , the more stable the estimation of was, see Figure 5(a). For a set of four markers, still resulted in the optimal solution. For , we could set without notably worse estimation of . This observation is encouraging, since it is for the larger values on that the heuristic becomes indispensable.

The more markers a combination contains and the more variable the combinations get, the more stable are the estimations also under a stronger heuristic.

In the tested cases, the heuristic approach had a substantial improvement of execution time. With increasing , the execution time dropped exponentially, see Figure 5(b). Calculating the optimal set of seven SNPs in the data provided by Coble *et al.* took 143 seconds without heuristic. Using a 80 % heuristic reduced runningtime to 5 seconds and the provided solution was still the optimal one. The most dramatic case is for , which dropped from being a day-to-day calculation to executing in a few seconds when . As a comparison with Figure 5(c) reveals, gives a result which is still close to optimal.

## Discussion

Our experiments with Excap on artificial and biological data showed significant advantages. Most sets of variable markers chosen by hand are not optimal. In almost every case, the most variable markers do not form the most variable *set*. In many cases, one marker with a weak diversity improved the chosen set significantly, much more than a marker with a much higher -value which is not easy to find with a ``manual'' approach. Although our main algorithm's execution time is exponential, the related heuristic achieves reasonable and competitive results in many practical cases. Importantly, our parameterized heuristic allows for experimentation and adaption to different datasets.

As long as there is a dataset available that is big enough and representative for a certain population, Excap can be used to estimate the diversity and to optimize the choice of markers. One should be aware to keep the size and quality of the provided sample set in mind and not to overestimate the results.

There are some important consequences arising from having our program suggesting markers. Besides manual labour being reduced, the risk of making mistakes decreases significantly. Achieving the same variability with less and shorter markers also results in a reduction of costs. It is clear that the quality of chosen markers improves and that data is utilized better.

There is an important cost-benefit analysis to do when making decisions on marker sets, since there are limits of utility to including yet another marker. Excap enables a careful consideration regarding how many markers to use by estimating how much more information can be gained.

We believe Excap will also help choosing between technologies for DNA typing, and not just what markers to choose. By systematically trying different marker sizes and set sizes, one can determine the most economical and efficient way to reach a desired haplotypic diversity. As shown in Figure 4, a small number of SNPs may be more informative than hundreds of bases in larger markers.

There are opportunities for improvements to the present work. The suggested heuristic was a first approach to keep computation time in manageable scales, but is not necessarily the most efficient one. It could also be worthwhile to analyze the influence of the heuristic factor on the quality of the estimated -value. A heuristic factor effecting the results in a more sensitive way than the linear one might allow a more fine-grained application of the heuristic. Due to the simple heuristic approach it is possible that a heuristic with a smaller -value results in a worse set of markers than one with a greater . We would also like to investigate an approach in which the heuristic was used as a preprocessing step to reduce a large set of input markers. The optimal algorithm could then be applied to this more manageable set of markers. Furthermore, it could be investigated how existing parametrized algorithms, that solve NP-complete problems for a fixed parameter in polynomial time, could be applied to this problem. The size of the optimal set could be such a fixed parameter.

### Availability

Excap is written in C++ and the source code is hosted at http://sourceforge.net/projects/excap, distributed under the GNU Public License.

## Author Contributions

Conceived and designed the experiments: PS LA AK FS. Performed the experiments: AK FS. Analyzed the data: AK FS. Wrote the paper: LA AK. Supervision: PS LA. Planning: PS LA. Programming: AK FS.

## References

- 1. Coble M, Vallone P, Just R, Diegoli T, Smith B, et al. (2006) Effective strategies for forensic analysis in the mitochondrial DNA coding region. Int J Legal Med 120: 27–32.
- 2. Andréeasson H, Nilsson M, Styrman H, Pettersson U, Allen M (2007) Forensic mitochondrial coding region analysis for increased discrimination using pyrosequencing technology. Forensic Sci Int Genet 1: 35–43.
- 3. Anjum G, Du W, Klein R, Amara U, Huber-Lang M, et al. (2010) Pyrosequencing-based strategy for a successful snp detection in two hypervariable regions: Hv-i/hv-ii of the human mitochondrial displacement loop. Electrophoresis 31: 309–314.
- 4. Egeland T, Salas A (2011) A statistical framework for the interpretation of mtDNA mixtures: forensic and medical applications. PLoS One 6: e26723.
- 5. Nei M, Tajima F (1981) DNA polymorphism detectable by restriction endonuclaeses. Genetics 97: 145–163.
- 6.
Nei M (1987) Molecular Evolutionary Genetics. Columbia University Press.
- 7. Lutz S, Weisser H, Heizmann J, Pollak S (1996) mtDNA as a tool for identification of human remains. Int J Legal Med 109: 205–209.
- 8. Szibor R, Michael M, Plate I, Krause D (2000) Efficiency of forensic mtDNA analysis: Case examples demonstrating the identification of traces. Forensic Sci Int 113: 71–78.
- 9. Ingman M, Kaessmann H, Pääbo S, Gyllensten U, Liu L, et al. (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408: 708–13.
- 10. Matsumoto M, Nishimura T (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans Model Comput Simul 8: 3–30.
- 11. Coble M, Just R, O'Callaghan J, Letmanyi I, Peterson C, et al. (2004) Single nucleotide polymorphisms over the entire mtDNA genome that increase the power of forensic testing in Caucasians. Int J Legal Med 118: 137–146.
- 12.
Ingman M, Gyllensten U (2006) mtDB: Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res.
- 13. Andrews R, Kubacka I, Chinnery P, Lightowlers R, Turnbull D, et al. (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 23: 147–148.
- 14. Lassmann T, Sonnhammer E (2005) Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6: 298.
- 15.
Ruiz-Pesini E, Lott M, Procaccio V, Poole J, Brandon M, et al.. (2006) An enhanced MITOMAP with a global mtDNA mutational phylogeny. Nucleic Acids Res.