Conserved PCR Primer Set Designing for Closely-Related Species to Complete Mitochondrial Genome Sequencing Using a Sliding Window-Based PSO Algorithm

Background Complete mitochondrial (mt) genome sequencing is becoming increasingly common for phylogenetic reconstruction and as a model for genome evolution. For long template sequencing, i.e., like the entire mtDNA, it is essential to design primers for Polymerase Chain Reaction (PCR) amplicons which are partly overlapping each other. The presented chromosome walking strategy provides the overlapping design to solve the problem for unreliable sequencing data at the 5′ end and provides the effective sequencing. However, current algorithms and tools are mostly focused on the primer design for a local region in the genomic sequence. Accordingly, it is still challenging to provide the primer sets for the entire mtDNA. Methodology/Principal Findings The purpose of this study is to develop an integrated primer design algorithm for entire mt genome in general, and for the common primer sets for closely-related species in particular. We introduce ClustalW to generate the multiple sequence alignment needed to find the conserved sequences in closely-related species. These conserved sequences are suitable for designing the common primers for the entire mtDNA. Using a heuristic algorithm particle swarm optimization (PSO), all the designed primers were computationally validated to fit the common primer design constraints, such as the melting temperature, primer length and GC content, PCR product length, secondary structure, specificity, and terminal limitation. The overlap requirement for PCR amplicons in the entire mtDNA is satisfied by defining the overlapping region with the sliding window technology. Finally, primer sets were designed within the overlapping region. The primer sets for the entire mtDNA sequences were successfully demonstrated in the example of two closely-related fish species. The pseudo code for the primer design algorithm is provided. Conclusions/Significance In conclusion, it can be said that our proposed sliding window-based PSO algorithm provides the necessary primer sets for the entire mt genome amplification and sequencing.


Introduction
Mitochondrial DNA (mtDNA) is very popular for studying evolutionary relationships [1] because of it's maternal inheritance and very high mutation rate, and lack of recombination [2,3,4,5]. Although establishing the phylogenetic tree is an important tool in studying the evolutionary relationships between organisms, most organisms depend solely on a small part of the entire mtDNA, such as cytochrome b (cyt b) [6,7], cytochrome oxidase subunit I (COI) [8,9], and others. These studies tend to underestimate the contribution of the variation of the entire mitochondrial genome in evolutionary processes.
For example, different parts in mtDNA sequences may show different mutation rates [10]. High homology often occurs in protein-coding genes and high variability appears in non-coding sequence segments [11,12]. Moreover, the mitochondrial proteincoding genes and the D-loop region evolve faster than 12S and 16S rRNA genes [13]. Hence, it is essential to address the phylogenetic relationship among species inferred from the different parts of mtDNA if the whole mt genomic sequences are available.
Although some mt genomic sequencing studies have been published [14,15,16,17,18,19,20], mitochondrial genome sequencing for most species is still incomplete as shown in the GenBank due to the technical problems related to the primer design. This is especially true for closely-related species. Previously, development of the conserved primers for rapid sequencing of the complete mitochondrial genome for several species, i.e. three species of bears, has been proposed [21]. However, the conserved primers were commonly designed by manual inspection of the prealigned mtDNA sequences, especially for the whole mitochondrial genome. To perform mitochondrial genome sequencing for many species without computation is still a challenge.
To solve these problems, we developed a heuristic approach, particle swarm optimization (PSO), coupled with the sliding window mechanism to design the most suitable primer sets for amplification of to several similar mtDNA sequences after multiple sequence alignments from several closely-related species. PSO and the sliding window technique constitute an optimization technique and a randomized search respectively, and derive their working principles from the social behavior of organisms. Several important criteria in primer design, including the melting temperature, the PCR product size, the secondary structure, and the uniqueness of each designed primer [22] were considered in this study. Because the unreliable sequencing data at the start end of the first 30 to 40 nucleotides (nts), our proposed algorithm was designed to avoid these problems. Our strategy is to provide the primer sets that generate the PCR amplicons for each neighboring region with partial overlap. Accordingly, the unreliable sequencing data at the start end in one PCR amplicon can be compensated for by the 39 end sequencing data from its upstream PCR amplicon. Finally, we selected two sets of entire mt genomic sequences from two closely-related fish species to successfully demonstrate that our proposed algorithm was able to effectively identify the common primer sets for amplifying the entire mt genomic sequence.

Alignment of multiple sequences
Sequence alignment is the first step when designing a set of common PCR primer pairs of homologous sequences, which may be derived from closely-related species. A well-known multiple sequence alignment tool, ClustalW [23], was employed in this study. In the example of sequence alignment from three homologous sequences (Figure 1), the length of the match regions with larger than the constraint of the shortest primer length are regarded as the viable regions for primer design. After multiple sequences are aligned, numerous viable regions may be found and conserved. After the sliding window process is performed, the forward and reverse primers within the overlapping and conserved regions can easily to be designed to amplify the region between them.

Primer design using particle swarm optimization
Primer design is of crucial importance for PCR experiments. The quality of primers always influences whether a PCR experiment is successful or not. To obtain high quality primers, many primer design constraints must be satisfied. A heuristic algorithm particle swarm optimization (PSO) is employed. Particle swarm optimization is a population-based stochastic optimization technique, which was developed by Kennedy and Eberhart in 1995 [24]. PSO simulates the social behavior of organisms, such as birds in a flock and fish in a school. This behavior can be described as an automatically and iteratively updated system. In PSO, each single candidate solution can be considered ''an individual bird of the flock'', that is, a particle in the search space. Each particle makes use of its own memory and knowledge gained by the swarm as a whole to find the best (i.e., optimum) solution. All of the particles have fitness values, which are evaluated by a fitness function so that they can be optimized. During movement, each particle adjusts its position by changing its velocity according to its own experience and according to the experience of a neighboring particle, thus making use of the best position encountered by itself and its neighbor. Particles move through the problem space by following a current of optimum particles. The process is then reiterated a fixed number of times or until a predetermined minimum error is achieved [24]. The computational flowchart of the PSO is provided ( Figure 2). Details of the PSO algorithm are discussed is the following sections.

Encoding schemes
As stated above, we employed PSO to design suitable primer sets. In the PSO, each particle is designed in a format that enabled us to express a particular primer pair. The encoding method used was the following: where n represents the size of the population, F s represents the start index of the forward primer, F l represents the length of forward primer, R l represents the length of the reverse primer and P l represents the length of the PCR product. In PSO, a particle thus described represents a primer pair in a specific window. A particle represented by P i = {2, 3, 4, 10} as shown in Figure 3 as an example. Based on the template sequence, the i-th particle representing the PCR product is 'CTTAGCGAAT' in which the forward and reverse primer are 'CTT' and 'ATTC' (with complementary to reverse primer of GAAT), respectively.

Population initialization
To initialize a population, all of the particles are randomly generated and each particle is given a velocity (v) within 0,1. Initially, F s is randomly generated in the template sequence length of the window, and F l and R l are generated within the constraints of the primer length. P l is randomly generated within the constraints of the PCR length. The velocity of each dimension of each particle is randomly generated.

Fitness function
A successful PCR experiment depends on the quality and specificity of the designed primers. The optimal primer must satisfy many criteria [25]. In PSO, a fitness function is used to evaluate the fitness of each particle in order to check whether the primer pairs satisfy the design constraints. We combined many primer design constraint functions with weights into the fitness function. The constraints used are the followings: 1) melting temperature, 2) primer length and GC content, 3) PCR product length, 4) secondary structure, 5) specificity, and 6) terminal limitation. These constraints are described in detail below: 1) Melting temperature. To perform a successful PCR experiment, the melting temperature (Tm) for each primer must be confined to a suitable range. In this study, the value of the melting temperature of the primer is denoted Tm(P i ), referring to the formula based on nearest neighbor thermodynamic theory by Freier et. al (Eq. (2)) [26,27] and adopted from the user manual of ''NetPrimer'' software from PREMIER Biosoft International (http://www.premierbiosoft.com/DATAFILES/NetPrimerMa nual.pdf). In Eq. (2), DH and DS indicate enthalpy and entropy for helix formation, respectively, which are both calculated as the nearest-neighbor model as described [28]; R is molar gas constant (1.987 cal/uC * mol); C is the nucleic acid concentration (default value 250 pM); [K + ] is salt concentration which is equal to total [Na + ] equivalent calculated using the concentration values of the monovalent ion and free [Mg 2+ ] ion (default values 50 and 1.5 mM, respectively) (Eq. (3)). Function Melt_tm(P i ) is used to check whether the melting temperature of a primer pair is between 54 and 65uC, where P iF and P iR denote the forward primer and reverse primer of the i-th particle, and the abs() denotes the absolute value. DMelt tm(P i ) is used to check whether the difference of the melting temperature exceeds 3uC.  which n is sequence selected length.
2) Primer length and GC content. In general, the primer length should be within 18 and 28 nts, and the differential length of a primer pair is restricted to less than 3 nts [29]. |P iF | and |P iR | represent the number of nucleotides of the forward and reverse primers, respectively. In Eq. (8) and (9), the Length(P i ) is used to check whether the length of a primer pair is within 18 to 28 nts, and DLength(P i ) is used to check whether the length difference between the forward and reverse primers exceeds 3 nts or not.
The GC content check function GC%(P i ) is denoted as Eq. (7) GC% 3) PCR product length. MtDNA is a double-stranded circular molecule. PCR amplicons using several primer pairs are surrounded by the entire mtDNA sequences without gap. It is also essential to maintain the coverage for assembly of the individual sequences from different amplicons into the complete mt genome. The length of the PCR product has to be considered. In Eq. (10), P iP j j is the PCR product length. Using forward and reverse primers to perform the two-directional sequencing are helpful for double checking the sequence for 800-1100 nts in reliability. When the PCR product length setting of 800-1100 nts is not available, it is adjustable to reduce or extend the PCR length to a suitable range until it is reached.
4) Secondary structure. In primer design, primers that bind to any sites on the sequence indiscriminately have to be avoided. Furthermore, primers that are self-complementary (self-dimer) or complement each other (cross-dimer) must also be avoided where the dimer was defined to possess over 5 base parings and the hairpin length had to longer than 4 bps. These constraints are defined in Eq. (11) and Eq. (12): , if P iF and P iR are not self À complementary or do notcomplement each other 1, otherwise Hairpin(P i )~0 , if primer a doesn 0 t complement itself in the U form of P iF and P iR 1, otherwise 5) Specificity. The specificity constraint is used to judge whether the sequence repeatedly occurs in primer or not in order to ensure the specificity of the primer. The PCR experiment might fail if the primer is not site-specific or appears more than once in the sequence.
, if P iF and P iR appear in the template sequence of the window once 1, otherwise 6) Terminal limitation. GC_clamp(P i ) is used to judge whether the nucleotides located in the 39 end of primer are G or C: , if 3 0 end of P iF and P iR is G or C 7) End match. Endmatch(P i ) is used to judge whether the three nucleotides located in the 39 end of primer are base parings: , if 3 0 end of three nucleotides are base parings 1, otherwise Based on the several primer design constraints described above, the fitness of each particle is evaluated by the fitness function. A low fitness value means that the particle fits more constraints. The default fitness function can be written as: All the constraints of multiplex PCR primer design are combined into an objective function with weights. Every constraint weight setting is based on our previous researches [30,31].

Particle update
One of the characteristics of PSO is that each particle has a memory of its own best experience. At every iteration the particle's trajectory is updated by two ''best'' values, called pbest and gbest. Each particle keeps track of its coordinates in the problem space, which are associated with the best solution (fitness) the particle has achieved so far. This fitness value is stored, and represents the position called pbest. When a particle takes the whole population as its topological neighborhood, the best value is a global optimum value called gbest.
In this study, the adaptive functional values were based on the particle features representing the feature dimension; this data was evaluated by several constraints to obtain a fitness value. Once the adaptive values pbest and gbest are obtained, the features of the pbest and gbest particles can be tracked with regard to their position and velocity. Each particle is updated according to the following equations.
In these equations, w is the inertia weight, c 1 and c 2 are acceleration (learning) factors, and rand 1 and rand 2 are random numbers between 0 and 1. Velocities v new id and v old id are those of the updated particle and the particle before being updated, respectively, x old id is the original particle position (solution), and x new id is the updated particle position (solution).
In Eq. (19), particle velocities of each dimension are tried to a maximum velocity V max . If the sum of the accelerations causes the velocity of that dimension to exceed V max , the velocity of that dimension is limited to V max . V max and V min are user-specified parameters.

Primer design based on the sliding window technique
The sliding window method is simple and quick technique and has been widely applied in several computation fields, such as 16S ribosomal DNA amplicons in metagenomic studies [32], primer set selection in multiple PCR experiments [33], real-time primer design for DNA chips [34], and accurate location of eukaryotic protein coding regions [35]. The basic concept defines a window which is sliding across the template sequence as a specified shift value. In multiple primer design problems, the window is slid using each previous primer pair at a time along the template sequence. Subsequently, a set of primer pairs can be designed for the coverage of the entire mt genome. A detailed diagram for the sliding window method is provided (Figure 4). First, the window starts from the first nucleotide of the conserved sequences and the PSO algorithm applies it to the designed primer pair within. Then the window shifts based on the previous (first) reverse primer and the next (second) forward primer is designed from the upstream region of the first reverse primer. Each shift provides enough length for fitting PCR product coverage constraints (ranging from 90 to 200 nts of the upstream of the reverse primer). When primers can not be designed within this constrain, the overlapping length for each adjacent PCR product is extended automatically until the primers are suitable. This procedure continues until the entire template sequence is covered by all primer pair sets. The remaining primers are designed in the same way. The pseudocode of the sliding window method based PSO primer design is provided ( Figure 5). The brief JAVA 6.0-based software solution that implements our proposed algorithm was provided in the Supporting Information (Documentation S1 and Software S1).

Results
In this research project, we applied a high-performance PSO algorithm to design fit primer pairs, and used the sliding window technique to cover the entire mt genomic sequences. The test data sets and the detailed experiment result are described below.

Parameter settings
The termination condition of the PSO in this study is reached at a pre-specified number of iterations (in our case the number of iterations was 50). Parameters used here were a population size of 20, rand 1 and rand 2 were random numbers between (0, 1), and c 1 and c 2 were set to c 1~c2~2 . The inertia weight w was 0.8. The velocity constraints were set to V max = 6 and V min = 26.

Test data
Entire mitochondrial genomic DNA sequences of two closelyrelated fish species, i.e., Scarus forsteni (FJ619271.1) and Scarus rubroviolaceus (FJ227899.1) are used as an example for designing the conserved PCR primer sets for closely-related species using a PSO algorithm. These mt genomic sequences were downloaded from NCBI GenBank.

Dry experiment
After multiple sequences were aligned, we implemented the PSO algorithm and the sliding window technique to examine the performance on biological test data, as described above. A set of primers for amplifying entire the mt genome which was approximately 16 kb long, was designed using the proposed algorithm.
As shown in Table 1, these primers had a similar length, GC content and annealing temperature and the constraints fit the general primer design restrictions. The forward 59-ATTTAGC-CAATGACACCTAGCC-39 and the reverse 59-CGATTTG-CACGAGTATTTTCTC-39 of the primer pair No. 1, the length (nts), GC% and Tm (uC) were 22 vs. 22, 45.5 vs. 40.9 and 58.2 vs. 57.3, respectively. All primer sets listed in Table 1  In order to amplify the entire mt genome, it is essential to design a group of primer pairs that are partly overlapping each other. Based on the sliding window technique, the proposed algorithm can easily design primers that to allow PCR amplification for 22 adjacent PCR products that overlap each other. As shown in Figure 6, the proposed algorithm yields the entire sets of forward and reverse primers. The expected length for these PCR products ranged from 800 to 1100 nts, and these primer sets are arranged neatly and overlap with the adjacent PCR products. All designed forward and reverse primers of each PCR product are designed within the conserved regions of the multiple sequence alignment.

Wet experiment
The first ten primer sets for mitochondrial genome of Scarus forsteni provided in Table 1 were examined by PCR amplification. DNA samples (200 ng) were added to the PCR reaction mixture (10 ml) containing 1 ml of 106 PCR buffer, 0.3 ml of 50 mM MgCl 2 , 0.2 ml of 10 mM dNTP each, 0.6 ml of DMSO, 0.14 ml of 5 U Platinum Taq enzyme (Invitogen corp.), 0.12 ml of 350 mg/ml primer mix (1:1), and 7.64 ml of DNA in water. The touch-down PCR program [36] was applied to all the test primer sets (listed in Table 1, sets 1 to 10) and it was slightly modified as follows: 94uC To avoid the very high risk of sample mixup and the resulting artificial recombination when using e.g. many different primer pairs (see Table 1), however, parallel amplification of all targets per specimen would be highly recommended. As shown in Figure 7, the designed primer sets were successfully amplified by PCR.
Recently, GeneFisher2 [48] was developed to design a number of primer pairs for consensus sequence after multi-sequence alignment. However, like most primer design tools, GeneFisher2 only focuses on designing a primer pair or a number of primer pairs on a limited region and does not design a set of primer pairs for amplifying an entire sequence at once. Primer design for  amplification and sequencing of an entire mt genome is currently under development. PSO has been reported to progress faster than genetic algorithm (GA) with crossover, mutation, or both [49]. In our previous work for primer design [31], we found that the average accuracy of PSO was better than memetic algorithm (MA) [30] and genetic algorithm (GA) based on the Wallace formula [50] (100% vs. 98.33% and 74.93%) and the Bolton and McCarthy formula [51] (94.93% vs. 88.93% and 32.40%) for five hundred runs with PCR product length between 150,300 nts, 500,800 nts and 800,1000 nts. Therefore, we applied the PSO rather than the MA and GA to design the whole set of primers for the whole mt genomic sequencing.
In this study, we propose a novel strategy that introduces the sliding window technique coupled with a PSO algorithm. This algorithm provides a sequence alignment function for the entire circular genome and designs a set of primers that generates overlapping PCR amplification for the entire circular genome, e.g. the mt genome. Several primer design constraints for a successful PCR were considered in the proposed algorithm. For example, the melting temperature, the primer pair length, the GC content, the PCR product length, the secondary structure, and the specificity were all considered. The primer design results demonstrated that a set of primers that obeys these design constrain with suitable length and could be found for the PCR amplicons. Other circular genome sequences such as chloroplasts and bacterial chromosomes could be used with this strategy as well.
Some entire mt genome sequences are available for some species in GenBank, and hence they requirement for sequencing is reduced. However, the known mtDNA sequences could be applied to the entire mt genomic sequences of closely-related species based on our experience. For the purpose of sequencing species with an unknown mt genomic sequence, we suggest to collect the entire mt genomic sequences of the same family or order and perform our proposed algorithm. Subsequently, the designed primer sets derived from the conserved regions stand a high possibility of being used to perform successful PCR reactions for the entire mt genomic sequencing to the species with unknown mtDNA sequence.
Utilizing the heuristic algorithm and sliding window technique, we used several primer constraints to appraise the fitness values and, based on their respective significance, each constraint was given a corresponding weight. Through the design of a fitness function, our algorithm was able to design a complete set of primers. The strategy of this algorithm was designed to carefully  . Demonstration of the PCR performance for ten sets of designed conserved primers. Lanes 1 to 10 are the PCR amplification using primer sets 1 to 10 listed in the Table 1. doi:10.1371/journal.pone.0017729.g007 select a set of common PCR primer pairs from the conserved regions by multiple nucleotide sequence alignment. The common PCR primer pairs designed by our algorithm could be applied to amplify conserved sequences of genes from the same or different organisms. It can be used to amplify conserved sequences of genes from other gene families. For example, the whole primer sets for amplifying the entire circular mtDNA for Chimpanzees (Pan troglodytes; accession no. NC 001643) [52] and Bonobo (Pan paniscus; accession no. GU189657) [53] are successfully mined using our proposed methodology (data not shown). In our study, two fish mitochondrial genome data sets were used to test the proposed method and ten of the whole primer sets were successfully amplified to prove the performance of this algorithm. Taken together, our proposed primer design algorithm is an effective method to design primer pair sets for PCR amplification and sequencing for the multiple alignments of circular consensus sequences. The entire mt genome sequencing was easy to complete and could help biologists in identifying the phylogenetic relationship for closely-related species.