A Novel Constraint for Thermodynamically Designing DNA Sequences

Biotechnological and biomolecular advances have introduced novel uses for DNA such as DNA computing, storage, and encryption. For these applications, DNA sequence design requires maximal desired (and minimal undesired) hybridizations, in which two single DNA strands combine into a single new duplex. Here, we propose a novel constraint to design DNA sequences based on thermodynamic properties. Existing constraints for DNA design are based on the Hamming distance, a constraint that does not address the thermodynamic properties of the DNA sequence. Using an improved genetic algorithm, we designed DNA sequence sets which satisfy different distance constraints and employed a free energy gap based on the minimum free energy (MFE) to gauge the thermodynamic properties of each set. When compared to the best constraints of the Hamming distance, our method yielded better thermodynamic qualities. We then used our improved genetic algorithm to obtain lower-bound DNA sequence sets. Here, we discuss the effects of novel constraint parameters on the free energy gap.


Introduction
More than half a century has passed since the double helix configuration of DNA was identified [1]. Presently, such knowledge about DNA contributes to virtually every area of science, including the use of DNA as a computational tool [2]. The hybridization reaction between two DNA sequences is important for advanced DNA applications because its efficiency and accuracy directly influence application reliability; however, false hybridization is an unavoidable artifact of combining DNA strands due to biotechnical limitations. False hybridizations occur as false positives and false negatives [3,4]. A false positive hybridization is a new duplex formed by mismatched single DNA sequences, due to a lack of single-strand similarities. A false negative hybridization involves matching DNA sequences that do not hybridize at all due to biochemical errors [5,6]. DNA sequence design is critical to many biotechnological applications. DNA microarrays rely on accurate DNA design of probes that are immobilized on a surface and bind specifically to complementary targets in a complex mixture [7,8]. Designing DNA sequences which satisfy some constraints could reduce false positives and reduce hybridization uncertainty and inaccuracy between probes and their complementary targets. Designed DNA sequences should satisfy single or combinatorial constraints to ensure DNA sequence quality and permit the shortest DNA sequence to code for each informational unit required. Accurate DNA production also reduces false hybridizations and improves accuracy. The goal of DNA sequence design is to find the maximal number of designs that satisfy single or combinatorial constraints as well as the smallest design that satisfies these constraints.
We propose a novel distance criterion for designing DNA sequences. Using the novel free energy gap constraint, we designed DNA with better thermodynamic properties. Then, an improved genetic algorithm was used to search the lower bounds of DNA sequence sets that satisfy the novel and combinatorial constraints. Finally, we describe the relationship between the thermodynamic properties of DNA sequence sets and the parameters of novel constraints.

Free Energy Gap Criterion
Biotechnical limitations contribute to DNA hybridization uncertainties and inaccuracies, which limit available DNA-based applications. To improve hybridization between two DNA molecules, investigators have explored DNA sequence thermodynamic properties to control the MFE and melting temperature. As a criterion for measuring thermodynamic properties of DNA sequences, the MFE of a sequence or sequences is the minimum value among free energies of all possible conformations of the sequence(s) [9]. Here, we report our efforts using the online freeware PairFold to predict the MFE of two interacting DNA molecules and gauge the quality of DNA sequence sets by using the free energy gap $\delta$.
A DNA sequence $s$ is a string over the alphabet $\Sigma = \{A, C, G, T\}$. $\Delta G(u,v)$ denotes the value of the MFE between DNA sequences $u$ and $v$, which is calculated by PairFold [9]. In addition, $s'$ denotes the Watson-Crick reverse-complement sequence of DNA sequence $s$. $S$ is the set of DNA sequences $s$, and $S'$ is the set of $s'$. To calculate the free energy gap $\delta$, definitions are stated over the following pairs: (1) Sequence-Sequence Constraint: for all pairs of $u_i$, $v_j$ in $S$; (2) Sequence-Complement Constraint: for all pairs of $u_i$ in $S$, $v_j'$ in $S'$, and $i \neq j$; (3) Complement-Complement Constraint: for all pairs of $u_i'$, $v_j'$ in $S'$; (4) Sequence-Self-Complement Constraint: for all pairs of $u_i$ in $S$, $u_i'$ in $S'$, and $u = (u')'$; (5) Free energy gap: denoted by $\delta$, for two DNA sequences $u$ and $v$, where $u, v \in S$ and $u', v' \in S'$. In general, a larger $\delta$ represents a larger gap between the free energy of desired and undesired hybridizations, and thus a better set (DNA sequence set quality) [7,10]. We used the free energy gap to gauge the quality of DNA sequence sets constrained by the Hamming distance and the novel constraint, namely the longest aligning common substring distance constraint (LACS). By comparing the free energy gap, we measured the influence of different distance constraints on DNA sequence designs.

The Hamming Distance Constraint
The Hamming distance constraint is frequently used to reduce DNA sequence similarity for DNA-based applications and mainly includes the word-word Hamming distance constraint (WWH) and the word-complement Hamming distance (WCH).
Garzon first proposed the problem of designing DNA sequences for DNA computing [11] as follows: over the alphabet $\Sigma = \{A, C, G, T\}$, there exists a set $S$ of sequences of length $n$, with size $|S| = 4^n$. A subset $C \subseteq S$ is sought such that any two codes $u, v$ in $C$ satisfy $\tau(u,v) \ge d$, where $d$ is a positive integer and $\tau$ is the constraint criterion (or criteria) for DNA sequences, such as the Hamming distance criterion.
Word-word Hamming distance (WWH). For DNA sequences $u, v$ of given length $n$ (written from the 5′ to the 3′ end), $H(u,v)$ denotes the Hamming distance between $u$ and $v$. $WWH(u_i)$ denotes the minimal $H(u_i, v_j)$ over all DNA sequences in the set and should not be less than the parameter $d$, i.e.,
\[ WWH(u_i) = \min_{1 \le j \le m,\; j \ne i} H(u_i, v_j) \ge d. \]
Word-complement Hamming distance (WCH). For DNA sequences $u, v'$ of given length $n$ (written from the 5′ to the 3′ end), $H(u,v')$ denotes the Hamming distance between $u$ and $v'$. $WCH(u_i)$ denotes the minimal $H(u_i, v_j')$ over all DNA sequences in the set and should not be less than the parameter $d$, i.e.,
\[ WCH(u_i) = \min_{1 \le j \le m} H(u_i, v_j') \ge d. \]
Here $m$ is the number of sequences in the set.
GC content constraint. The GC content constraint approximates the thermodynamic properties of DNA sequences and is combined with the distance constraint. A fixed percentage of nucleotides within each DNA sequence is G or C. Using this constraint, we assume this fraction is $\lfloor n/2 \rfloor / n$.
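The Hamming-based constraints above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this work: the function names are hypothetical, the word-word check is assumed to skip the comparison of a word with itself, the word-complement check is assumed to include each word's own reverse complement, and the GC constraint is assumed to require exactly $\lfloor n/2 \rfloor$ G or C bases.

```python
# Complement table for Watson-Crick base pairing.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the Watson-Crick reverse complement of a 5'->3' sequence."""
    return s.translate(COMPLEMENT)[::-1]


def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v))


def satisfies_wwh(words, d):
    """Word-word check: every pair of distinct words differs in >= d positions."""
    return all(
        hamming(u, v) >= d
        for i, u in enumerate(words)
        for j, v in enumerate(words)
        if i != j
    )


def satisfies_wch(words, d):
    """Word-complement check (assumed to include each word's own complement)."""
    return all(
        hamming(u, reverse_complement(v)) >= d for u in words for v in words
    )


def has_gc_content(s):
    """GC check (assumption: exactly floor(n/2) bases are G or C)."""
    return sum(base in "GC" for base in s) == len(s) // 2


words = ["AACC", "GGTA"]
print(satisfies_wwh(words, 4), satisfies_wch(words, 2))
```

In this toy set the two words differ at all four positions, but AACC is only one substitution away from the reverse complement of GGTA (TACC), so the word-complement check fails for $d = 2$.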

The Novel Distance Constraint
We propose the novel LACS distance constraint for the design of DNA sequences. DNA sequences that satisfy the novel constraint show reduced similarity and exhibit better thermodynamic properties than sequences constrained by Hamming distance.
Similarity. The LACS distance is denoted $[l,k] = \mathrm{LACS}(A,B)$, where $l$ is the length of the longest consecutive common substring between $A$ and $B$, and $k$ is the number of positions (excluding the longest consecutive common substring) at which the corresponding symbols are the same once $A$ and $B$ are aligned at the location of the longest common substring. For example, $A_1$ and $B_1$ are aligned at the location of the longest common substring as in Fig. 1, in which we find that the longest consecutive common substring is 0111, so $l = 4$. At the same time, no other aligned positions are equal, so $k = 0$. $A_2$ and $B_2$ are aligned at the location of the longest common substring as in Fig. 2, in which we find that the longest consecutive common substring is 011, so $l = 3$. After alignment at the longest consecutive common substring, the third subsequence of $A_2$ is equal to the first subsequence of $B_2$, so $k = 1$.
In this paper, $n$ denotes the length of the DNA sequence. By analogy with the Hamming distance $d_H$ used for designing DNA sequences, we define the LACS distance $d_L$ as $d_L = n - l - k$. Generally speaking, the smaller the values of $l$ and $k$, and hence the larger the value of $d_L$, the smaller the similarity between the two strings.
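As a worked illustration (the two strings are hypothetical and chosen only for this example, and the distance is assumed to be $d_L = n - l - k$), take $A = \texttt{ACGTTA}$ and $B = \texttt{GACGAT}$ with $n = 6$. Their only longest common substring is ACG, so $l = 3$. Aligning the two strings at that substring shifts $B$ one position to the left of $A$:

\[
\begin{array}{rccccccc}
A: &   & \texttt{A} & \texttt{C} & \texttt{G} & \texttt{T} & \texttt{T} & \texttt{A} \\
B: & \texttt{G} & \texttt{A} & \texttt{C} & \texttt{G} & \texttt{A} & \texttt{T} &
\end{array}
\]

Outside the common substring, the overlapping columns are (T, A) and (T, T); only the second pair matches, so $k = 1$ and

\[ d_L = n - l - k = 6 - 3 - 1 = 2. \]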
Thermodynamic property. MFE is the minimum free energy of all possible structures and the most effective approach to control for unexpected secondary DNA sequence structures. The algorithm of PairFold and the standard thermodynamic parameters for DNA molecule are based on the nearest-neighbor thermodynamic model [9]; therefore, we employed the LACS distance constraint to approximate the MFE between 2 DNA sequences (not necessarily matching each other) that could form secondary structures.
Word-word LACS distance (WWL). For DNA sequences $u, v$ of given length $n$ (written from the 5′ to the 3′ end), $WWL(u_i)$ denotes the maximal $\mathrm{LACS}(u_i, v_j)$ over all DNA sequences in the set, where the values of $l$ and $k$ should not be more than the parameters $t_l$ and $t_k$, respectively.
Word-complement LACS distance (WCL). For DNA sequences $u, v'$ of given length $n$ (written from the 5′ to the 3′ end), $WCL(u_i)$ denotes the maximal $\mathrm{LACS}(u_i, v_j')$ over all DNA sequences in the set, where the values of $l$ and $k$ should not be more than the parameters $t_l$ and $t_k$, respectively.
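The LACS pair $[l,k]$ and the two constraints built on it can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: all function names are hypothetical, ties between several longest common substrings are resolved by taking the first one found, the word-word check skips the comparison of a word with itself, and the word-complement check includes each word's own reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the Watson-Crick reverse complement of a 5'->3' sequence."""
    return s.translate(COMPLEMENT)[::-1]


def lacs(a, b):
    """Return (l, k) for strings a and b.

    l: length of the longest common (contiguous) substring.
    k: matching positions outside that substring after aligning a and b on it.
    """
    best_len, end_a, end_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    # Dynamic programming: curr[j] is the length of the common substring
    # ending at a[i-1] and b[j-1].
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:  # first longest substring wins ties
                    best_len, end_a, end_b = curr[j], i, j
        prev = curr
    if best_len == 0:
        return 0, 0  # no symbol in common, so no alignment and no matches
    start_a = end_a - best_len
    shift = (end_b - best_len) - start_a  # a[p] is aligned with b[p + shift]
    k = sum(
        1
        for p in range(len(a))
        if not (start_a <= p < end_a)
        and 0 <= p + shift < len(b)
        and a[p] == b[p + shift]
    )
    return best_len, k


def within(pair, t_l, t_k):
    """True if l <= t_l and k <= t_k."""
    return pair[0] <= t_l and pair[1] <= t_k


def satisfies_wwl(words, t_l, t_k):
    """Word-word LACS check over all pairs of distinct words."""
    return all(
        within(lacs(u, v), t_l, t_k)
        for i, u in enumerate(words)
        for j, v in enumerate(words)
        if i != j
    )


def satisfies_wcl(words, t_l, t_k):
    """Word-complement LACS check (assumed to include own complements)."""
    return all(
        within(lacs(u, reverse_complement(v)), t_l, t_k)
        for u in words
        for v in words
    )


l, k = lacs("ACGTTA", "GACGAT")
print(l, k, len("ACGTTA") - l - k)
```

Requiring a unique longest common substring per pair, as stated in the text that follows, would remove the dependence on the tie-breaking rule assumed here.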
Here, we employed the improved genetic algorithm to design DNA sequence sets which satisfy the combinatorial constraints based on the different distances, and gauged the quality of the sets using the free energy gap calculated by PairFold [19]. Comparing free energy gaps can verify which distance constraint is better for DNA design. To improve the quality of DNA sequence design, we require that each pair of DNA sequences has exactly one LACS.

Algorithm Design
Genetic algorithms (GAs) are adaptive heuristic search algorithms based on evolutionary concepts of natural selection and genetics. An improved genetic algorithm to design DNA sequence sets based on the LACS distance constraints could enhance the global search capabilities of a traditional genetic algorithm by exploiting DNA sequence set characteristics. The improvements are as follows. First, the populations are initialized with the evenly distributed method, which enhances population diversity over the global search space: according to the number of populations, the populations are evenly distributed across the value range. Second, the populations are randomly re-initialized when they satisfy certain conditions, which helps to overcome premature convergence. Re-initialization occurs only once, because the additional run time slows the convergence of the algorithm. Third, in the mutation process, we adjusted the probability of the mutation operator with a dynamic method. The traditional genetic algorithm adopts a single fixed value for the mutation operation, which could reduce convergence.
The optimization problem is defined as a maximization problem, and we employ an average weight to manage the evaluation function. We denote the fitness function $f(i)$ as
\[ f(i) = \sum_{j=1}^{m} \omega_j f_j(i), \]
where $\omega_j = 1$ is the weight of each constraint, $m$ is the number of constraints, and $f_j(i)$ are the selected constraints. The algorithm initializes DNA sequences with the evenly distributed method, selects sequences which satisfy the constraint (or constraints), generates new DNA sequences by the selection, crossover, and mutation operators, and finally yields the desired DNA sequence sets. Figure 3 illustrates the process flow.
The steps for designing DNA sequence sets with the improved genetic algorithm are as follows:
Step 1: Set parameters and initialize the population with an evenly distributed method.
Step 2: Calculate the value of the fitness function. We employed MeanF to denote the mean of the fitness function. If $MeanF < \sum_{i=1}^{m} f(i)/m$, then randomly re-initialize the populations.
Step 3: Generate the next generation population by selection, crossover, and mutation. The algorithm uses random tournament selection and the three-point crossover strategy. The size of the tournament is 2 and the number of repetitions is equal to 10% of the total population in the random tournament selection. In the mutation process, if fitness is larger than MeanF, its probability of mutation is 0.01, and if it is equal to MeanF, its probability of mutation is 0.03. Otherwise, its probability of mutation is 0.3. This process yields dynamic adjustment of the probability. If the generation is less than 200, the algorithm proceeds to step 2; if not, the algorithm moves to step 4.
Step 4: End and output results.
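The operators named in Step 3 can be sketched as below. This is an illustrative sketch, not the implementation used in this work: the function names are hypothetical, mutation is assumed to act independently on each base and to replace it with a different base, and the crossover is assumed to swap alternating segments between three randomly chosen cut points. Only the tournament size of 2 and the three mutation probabilities (0.01, 0.03, 0.3) are taken from the text.

```python
import random

ALPHABET = "ACGT"


def mutation_probability(fitness, mean_f):
    """Dynamic mutation rate: fitter individuals mutate less often."""
    if fitness > mean_f:
        return 0.01
    if fitness == mean_f:
        return 0.03
    return 0.3


def mutate(seq, p, rng):
    """Assumption: each base is replaced by a different base with probability p."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def three_point_crossover(a, b, rng):
    """Swap alternating segments of two equal-length parents at three cuts.

    Requires sequences of length >= 4 so that three distinct cuts exist.
    """
    c1, c2, c3 = sorted(rng.sample(range(1, len(a)), 3))
    child1 = a[:c1] + b[c1:c2] + a[c2:c3] + b[c3:]
    child2 = b[:c1] + a[c1:c2] + b[c2:c3] + a[c3:]
    return child1, child2


def tournament_select(population, fitnesses, rng, size=2):
    """Draw `size` random individuals and return the one with highest fitness."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
child1, child2 = three_point_crossover("AAAAAAAA", "CCCCCCCC", rng)
print(child1, child2)
```

A full run would repeat selection, crossover, and mutation for 200 generations, recomputing the fitness and MeanF in each generation as described in Steps 2 and 3.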
Our algorithms were successful with many different combinatorial constraints. Our results were better than those described in previous reports [12,13]. Thus, our algorithm is sufficient to design DNA sequence sets which satisfy the LACS distance constraint.

Results
The parameters of the improved genetic algorithm in our example are as follows: population size 1000, crossover probability 0.45, and initial mutation probability 0.01. To control the run time of the algorithm, the number of generations is 200. We used the PairFold package [9] to calculate the MFE of two DNA sequences. According to recent research, no statistically significant differences exist among free energy approximations in four publicly available and widely used programs [8]. The temperature in the algorithm is 37 °C. To increase the reliability of our experimental results, we performed 50 experiments for every value and reported the mean of these experiments. In the tables, $d$ is the distance based on the Hamming, LACS, or both constraints, and $n$ is the length of the DNA sequence. Blank cells contain the '-' symbol.

Comparing the Free Energy Gaps
In Tables 1 and 2, d is the distance based on the Hamming and LACS constraints, for which d = d H = d L . Data in Table 1 are the free energy gaps of DNA sequence sets which satisfy the WWL and WCL combinatorial constraints. Parenthetical data are the free energy gaps that satisfy the WWH and WCH combinatorial constraints in Table 1. Table 2 data are the free energy gaps that satisfy the WWL, WCL, and GC content combinatorial constraints. Data in parentheses are the free energy gaps that satisfy the WWH, WCH, and GC content combinatorial constraints in Table 2.
Tables 1 and 2 depict data based on the Hamming distance constraint [13], for which we used the same experimental parameters and algorithms, including the experiment run times. In order to verify the influence of the LACS constraint on the design of DNA sequences, we compared the free energy gaps based on the Hamming distance and LACS distance constraints for the same length of DNA sequences and the same distance constraint. A comparison of the free energy gaps based on different distance constraints (Tables 1, 2) suggests that the LACS distance constraint is better than the Hamming distance constraint for designing DNA sequences. The data in the tables suggest that the quality of DNA sequence sets constrained by the LACS distance constraint is significantly better after adding the GC content constraint. Thus, the novel constraints express thermodynamic properties more relevant to the MFE. Comparisons with the Hamming distance constraint do not account for values of $l$ and $k$. Table 3 depicts the lower bounds of DNA sequence sets which satisfy different combinatorial constraints based on the LACS criterion. The data suggest that DNA sequence set sizes would be reduced by adding the GC content constraint. Free energy gap values increased with increases in LACS distance, whereas DNA sequence set sizes decreased, similar to the Hamming distance [13].

The Relations between the Parameters of LACS
When the DNA sequence sets are the same length and have the same distance constraints, their free energy gaps often differ. To investigate the influence of different values of $l$ and $k$ on the free energy gap value in same-length DNA sequences with the same distance constraints, we used DNA sequence sets with $n = 8$ as the analytic example and free energy gaps as the criterion. Tables 4 and 5 depict data for the free energy gaps constrained by the combinatorial constraints. Parenthetical data are values of $k$. Tables 4 and 5 also demonstrate that free energy gaps increase with increasing values of LACS distance (see Tables 1 and 2 for similar characteristics). In addition, free energy gaps decrease with increasing values of $k$ when $l$ is held constant, and likewise decrease with increasing values of $l$ when $k$ is held constant. Finally, the maximum free energy gap was best estimated using a maximum value of l, the LACS distance.

Discussions
The distance constraint (or the similarity constraint) is the chief method for designing DNA sequences. Constraints such as the Hamming distance are used to reduce the similarity of DNA sequences used in hybridization reactions by describing the minimum number of substitutions required to change one DNA strand into the other. Simple mathematical formulae are used to confirm the similarity of a pair of DNA sequences, helping to reduce the likelihood of false positives; however, this technique does not accurately address the thermodynamic properties of DNA sequences, even when accounting for GC content. Addressing thermodynamic constraints for DNA design would increase sequence accuracy more than present design strategies based on distance constraints. Baum proposed the existence of DNA sequence similarity [5] and suggested constraints that could be used in DNA sequence design. He also described maximum DNA sequence sets that would satisfy these constraints. Deaton proposed that DNA sequence design should be combined with biochemical techniques and reported coding reliability problems when information theory was used alone [6]. Deaton suggested an evolutionary genetic algorithm to design DNA sequences. In contrast, Hartemink proposed DNA design based on distance constraints (such as the Hamming distance) and the free-energy criterion [14]. Improving on these discoveries, Zhang used an improved genetic algorithm to design DNA sequences that satisfied combinatorial constraints, including the Hamming distance constraint and accounting for GC content [12]. Shin used a multi-objective evolutionary algorithm to design DNA sequences and developed a system (NACST) using a genetic algorithm [15,16].
These studies describe a combination of distance constraint considerations and GC content to design DNA sequences with better thermodynamic properties; however, these methods only roughly constrain the thermodynamic attributes of DNA sequences. To address this gap, minimum free energy criteria are widely used to measure the thermodynamic properties of DNA sequences. Garzon and Rose proposed a method for measuring the quality of DNA sequence sets based on their thermodynamic properties by using statistical mechanics principles [17,18], and Penchovsky and Ackermann employed combinatorial criteria to design sets of sequences for molecule-based computing [19]. To maximize the desired hybridization and minimize undesired hybridizations, they limited the range of the sequence set melting temperature. They proposed an important new 'free energy gap' measure of set quality, and designed their sets based on this new constraint. Tulpan researched DNA sequence sets based on MFE with the PairFold package, which is freeware available online [9]. They described a new algorithm for designing DNA sequence sets in which sets would satisfy several thermodynamic and combinatorial distance constraints [7,20]. This new technique aimed to maximize desired hybridizations between strands and their complements, while minimizing undesired false hybridizations. Garzon's paper presents exhaustive research to produce DNA sequence sets of sizes comparable to maximal sets while guaranteeing the highest quality, as measured by the MFE between any pair of DNA sequences [21]. A comparison of their experimental results with previous work revealed improved lower bounds of DNA sequence sets based on MFE. Subsequently, Kawashimo [22] employed dynamic neighborhood searches to design DNA sequence sets and further improve Garzon's methods.
Table 3. Size values to satisfy different combinatorial constraints.
Kawashimo introduced a technique to reduce such time-consuming evaluations of MFE, rendering the dynamic neighborhood search strategy applicable to practical thermodynamic constraints. They increased the speed of local-search type algorithms for designing DNA sequence sets based on MFE [23] and their algorithm generated better DNA sequence sets than existing methods. Tulpan presented a quantitative comparison of four published DNA/DNA duplex free energy calculation methods and concluded that no statistically significant differences exist among free energy approximations in these publicly available and widely used programs. In another report, improved genetic algorithms were used to design DNA sequence sets which satisfied different combinatorial constraints and enabled the creation of the highest quality DNA sequence sets thus far [13]. Recently, Bystrykh proposed a method of generalized DNA barcode design based on Hamming codes [3]. In their work, Hamming barcodes could be employed for DNA tag designs in many different ways while preserving minimal distance and error-correcting properties. In Xiao's paper [24], a multi-swarm particle swarm optimization was proposed to deal with the DNA encoding problem. The proposed method used the local PSO with time-varying acceleration coefficients (TVAC) as the search engine for each sub-swarm, and incorporated differential evolution to improve the swarm search space.

Conclusions
Here we propose a novel distance constraint: the LACS distance for designing DNA sequences. This constraint decreases the similarity of DNA sequences and better models the thermodynamic properties of DNA in comparison to current Hamming distance constraints. The thermodynamic properties of different distances are accounted for using an improved genetic algorithm to design DNA sequence sets which satisfy the Hamming and LACS distances. Free energy gaps are used to gauge DNA sequence set quality. According to DNA sequence set sizes obtained using this improved genetic algorithm, we identified the lower bounds of novel constraints which satisfy different combinatorial constraints. Finally, we discussed the effect of different values of l and k on the free energy gaps of DNA sequence sets with identical DNA sequence lengths and distance constraints. We hypothesize that the maximal length of the LACS is even more important for designing DNA sequence sets based on thermodynamic properties.
Our work represents a valuable contribution to DNA sequence design. Future studies will improve our algorithm and the lower bounds based on the novel constraint. Following the proofs established for the Hamming distance, we could theoretically prove the exact lower and upper bounds of the novel constraint and offer proof-of-concept for the theoretical relationship of $l$ and $k$.