Levy Equilibrium Optimizer algorithm for the DNA storage code set

Jianxia Zhang

doi:10.1371/journal.pone.0277139

Abstract

The generation of massive data places higher demands on storage technology. DNA storage is a new storage technology that uses the biological macromolecule DNA as an information carrier. Compared with traditional silicon-based storage, DNA storage has the advantages of large capacity, high density, low energy consumption and high durability. The goal of DNA coding is to store data with as few base sequences as possible and without errors. Coding is a key technology in DNA storage, and its results directly affect storage performance and the integrity of data reading and writing. In this paper, a Levy Equilibrium Optimizer (LEO) algorithm is proposed to construct a DNA storage code set that satisfies combinatorial constraints. The performance of the proposed algorithm is tested on 13 benchmark functions, and 4 new global optima are obtained. A DNA storage code set is then constructed under the same constraints as previous work; compared with that work, the lower bound of the DNA storage code set is improved by 4–13%.

Citation: Zhang J (2022) Levy Equilibrium Optimizer algorithm for the DNA storage code set. PLoS ONE 17(11): e0277139. https://doi.org/10.1371/journal.pone.0277139

Editor: Ziqiang Zeng, Sichuan University, CHINA

Received: June 10, 2022; Accepted: October 10, 2022; Published: November 17, 2022

Copyright: © 2022 Jianxia Zhang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant code and data are available on GitHub (https://github.com/queenbio/LEO-DNAstroagecoding).

Funding: This research is supported by the National Natural Science Foundation of China, grant numbers 61772100 and 62272079, and by the Henan Institute of Technology Doctoral Research Fund Project, grant number KQ1812. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

With the rapid progress of science and technology and the growing popularity of high-speed networks, network data, mobile data, social data and other digital information are increasing exponentially. According to IDC, the total amount of global data will reach 175 ZB by 2025. Storage devices based on physical media cannot cope with this explosive growth. Therefore, how to store such massive data has become a key problem for the long-term sustainable development of information technology. As a new storage method, DNA data storage plays an important role in saving storage energy and promoting the development of big data storage. The idea of using DNA molecules to store information appeared as early as the 1960s. Because it was difficult to read and write DNA information, it was not until 1988 that Davis [1] began to use DNA to store information, and the amount stored was very small. In recent years, with the rapid reduction in the cost of base synthesis and the development of DNA sequencing technology, more practical work has appeared. In 2012, Church et al. [2] adopted a binary model to encode and store digital information, using bases A and C to represent binary 0 and bases G and T to represent binary 1. To reduce the error rate, the encoded information must avoid runs of 4 or more identical consecutive bases and keep the GC content stable. The encoded DNA sequence is synthesized as several short DNA fragments. Then, through second-generation sequencing, the content of the synthesized DNA fragments is read out and converted back into short data fragments, and the position of each fragment in the whole file is found according to its barcode to recover the original file. Goldman et al. [3] built on Church's work with a ternary coding model, in which each unit of information takes one of the states 0, 1 and 2. 
They used homopolymer-free DNA sequences to encode ternary digital information. Goldman's scheme has some error-correction ability, so it reduces the error rate compared with Church's work when reading the information stored in DNA. Yazdi et al. [4] described the first DNA-based storage architecture that allowed random access to blocks of data and rewriting of information stored anywhere within the blocks. The system is based on new constrained coding techniques and corresponding DNA editing methods that ensure data reliability, specificity and access sensitivity. Ceze et al. [5] proposed an architecture for a DNA-based archival storage system. The team encoded data from four image files into the nucleotide sequences of synthetic DNA fragments, retrieved the correct sequences from a larger pool of DNA, and reconstructed the images without losing a single byte of information. Shortly afterward, Microsoft announced that it had saved about 200 megabytes of data using DNA storage technology, including "War and Peace" and 99 classic literary works. In 2017, Erlich and Zielinski [6] proposed the DNA Fountain coding method, achieving error-free storage at a coding rate of 1.6 bit/nt. In the same year, Shipman et al. [7] introduced images and videos encoded as DNA sequences into the genome of Escherichia coli and read the corresponding images and videos back from the genomes of living bacterial cells. In 2018, Organick et al. [8] encoded and stored 35 different files in over 13 million DNA oligonucleotides and could recover each file losslessly using random access. A large primer library was designed and validated, which allows every file stored in the DNA to be recovered independently. The potential of DNA as a long-term storage medium has thus been preliminarily demonstrated [9–12].

DNA coding is a key technology in DNA storage; it aims to store data with as few base sequences as possible and without error. The results of DNA coding directly affect storage performance and the integrity of data reading and writing, so reasonable and efficient coding is very important for the whole DNA storage system. In 2012, Church et al. [2] used DNA synthesis and second-generation sequencing to encode 0.65 MB of non-biological information into DNA sequences, which was the first application of the binary model and achieved an information storage density of 0.83 bit/nt. As research on DNA coding deepened, Ross et al. [13] reported that substitution and deletion errors increase significantly when the homopolymer run length exceeds 6. In addition, DNA strands with too high or too low GC content are more prone to synthesis and sequencing errors. For these reasons, Bornholt et al. [14] adopted an XOR coding principle to improve Church's coding scheme, which realized random access and achieved a storage density of 0.85 bit/nt. In terms of error-correction coding, Blawat et al. [15] introduced forward error correction to achieve a storage density of 1.08 bit/nt. Not long after, Yazdi et al. [4] overcame the need for full sequencing when reading data: they designed a coding method that achieves random access through address-based addressing, together with a platform for efficient sequencing through iterative alignment and deletion error-checking codes, thus achieving high storage density. In 2017, Erlich et al. [6] designed a "fountain code" for information storage, which is highly robust and efficient because it avoids sequences with strongly biased GC content and homopolymers. 
This was the first time that fountain codes from communication coding were used to bring the net information density close to the Shannon limit, and it showed that a dedicated error detection/correction algorithm is not strictly necessary: the same effect can be achieved by screening sequences. Jeong et al. [16] further improved the fountain code and obtained better decoding results by clustering. Wang et al. [17] designed an encoding method consisting of repeat-accumulate (RA) codes and an efficient hybrid mapping scheme to achieve a storage density of 1.67 bit/nt. Zhang et al. used combinatorial constraints to screen DNA storage codes [18–20] and used heuristic algorithms such as CLGBO [21] and NOL-HHO [20] to construct DNA storage code sets of higher quality. Yehezkeally et al. [22] consider noise introduced solely by uniform tandem duplication and exploit the relation to constant-weight integer codes in the Manhattan metric. The existence of full-rate reconstruction codes is proved by bounding the intersection of the cross-polytope with hyperplanes [23], and a construction for a family of reconstruction codes is given. Lenz et al. present a storage model for unordered sets of sequences, deriving Gilbert-Varshamov lower bounds and sphere-packing upper bounds on the achievable size of error-correcting codes [24]. In 2021, Zan et al. [25] proposed a hierarchical error-correction strategy for text DNA storage based on a divide-and-conquer algorithm to achieve lossless storage of text.

Heuristic algorithms provide a feasible solution for each instance of a combinatorial optimization problem at an acceptable cost, although the deviation of that feasible solution from the optimal solution generally cannot be predicted. Classical heuristic algorithms include the genetic algorithm [26] and particle swarm optimization [27]. Newer ones include Monarch Butterfly Optimization (MBO) [28], the Slime Mould Algorithm (SMA) [29], the Moth Search Algorithm (MSA) [30], hunger games search (HGS) [31], the RUNge Kutta method (RUN) [32], the colony predation algorithm (CPA) [33], the weighted mean of vectors (INFO) [34] and Harris Hawks Optimization (HHO) [20]. They are widely used in engineering to optimize traditional complex problems such as distribution systems [35], power flow [36], and power grids [37]. They are also often applied to optimization problems in the biological field [38, 39].

The DNA storage coding problem can be treated as the problem of screening DNA codes that satisfy the combinatorial constraints. However, because evaluating the constraints is computationally complex, traditional algorithms are too inefficient. To address the low base utilization and low coding quality of existing DNA storage coding methods, this work constructs a DNA storage code set that meets the combinatorial constraints using the improved LEO (Levy Equilibrium Optimizer) algorithm, ensuring both coding efficiency and coding quality. The Levy flight improvement reduces the possibility of the original EO algorithm falling into a local optimum and improves its convergence speed, which makes it possible to construct code sets containing more DNA storage codes that satisfy the constraints. The constructed code set satisfies the Hamming distance constraint, the GC content constraint and the No-runlength constraint, and has some error-correction capability. It also offers coding advantages such as high robustness, low coding complexity and short coding time.

2. Encoding constraints

2.1 Hamming distance constraint

The Hamming distance is used in many research areas, such as coding theory, to measure the similarity of two codewords. In DNA storage, a small Hamming distance [40] indicates that two different DNA codewords share many identical bases, which increases the possibility of non-specific hybridization. For two different DNA codewords j and k of length n, HD(j, k) denotes the number of positions i at which j and k differ. The Hamming distance constraint is expressed as HD(j, k) ≥ d. The Hamming distance is calculated as follows:

$$HD(j,k) = \sum_{i=1}^{n} h(j_i, k_i), \qquad h(j_i, k_i) = \begin{cases} 0, & j_i = k_i \\ 1, & j_i \neq k_i \end{cases} \tag{1}$$
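The constraint HD(j, k) ≥ d can be checked directly. The following is a minimal sketch in Python; the function names `hamming_distance` and `satisfies_hamming` are illustrative and are not taken from the paper's repository.

```python
def hamming_distance(j: str, k: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    if len(j) != len(k):
        raise ValueError("codewords must have the same length")
    return sum(a != b for a, b in zip(j, k))


def satisfies_hamming(candidate: str, code_set: list[str], d: int) -> bool:
    """True if the candidate is at distance >= d from every codeword in the set."""
    return all(hamming_distance(candidate, c) >= d for c in code_set)
```

A candidate is admitted to a code set only if `satisfies_hamming` holds against all codewords already accepted.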

2.2 GC content constraint

A, T, C and G are the four bases that make up DNA; A pairs with T, and G pairs with C, to form a double-stranded structure. In actual biological operations, sequences with extreme GC content are unstable, so sequences are generally designed with 40%-60% GC content, which is the GC content constraint [41]. For a codeword l, with |G| and |C| denoting the numbers of G and C bases and |l| the codeword length, the GC content is:

$$GC(l) = \frac{|G| + |C|}{|l|} \tag{2}$$
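A short sketch of the GC content check follows. The 40%-60% window is taken from the text above; the function names and the default bounds as parameters are illustrative assumptions.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_gc(seq: str, low: float = 0.4, high: float = 0.6) -> bool:
    """True if the GC fraction lies inside the window [low, high]."""
    return low <= gc_content(seq) <= high
```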

2.3 No-runlength constraint

Runs of consecutive identical bases destabilize the molecular structure of the whole sequence and make the hybridization reaction difficult to control. Errors are especially likely when reading long homopolymers. Therefore, in the coding process we use the No-runlength constraint [42] to avoid such errors. For example, in TCCCCAC the base C is repeated, so the long run of C is easily read as a shorter run during synthesis and sequencing, which increases the error rate of the stored information and decreases read and write coverage. For a codeword L = (l₁, l₂, l₃, …, l_n) of length n, the constraint requires that for any i ∈ [1, n − 1]:

$$l_i \neq l_{i+1} \tag{3}$$
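The No-runlength constraint amounts to checking that no two adjacent bases are equal. A minimal sketch, with an illustrative function name:

```python
def no_runlength(seq: str) -> bool:
    """True if no two adjacent bases in the sequence are identical."""
    return all(a != b for a, b in zip(seq, seq[1:]))
```

For the example in the text, `no_runlength("TCCCCAC")` is false because of the run of C.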

3. Algorithm description

3.1 Equilibrium optimizer

The equilibrium optimizer (EO) is inspired by a physical phenomenon, the dynamic mass balance on a well-mixed control volume. The mass balance equation describes the dynamic equilibrium process that governs the concentration of a non-reactive substance in the control volume, and it gives a fundamental physical account of the conservation of mass entering, leaving and generated in that volume. More detail on the dynamic mass balance process can be found in the original paper [43].

The steps of the EO algorithm are as follows:

Step 1: Initialization
Initialization is performed over the search space: the initial concentrations are constructed by uniform random initialization, using the number of particles and the dimension, with the following equation:

$$\vec{C}_i = c_{min} + \vec{r}_1 \, (c_{max} - c_{min}), \quad i = 1, 2, \ldots, n \tag{4}$$

Here $\vec{C}_i$ represents the concentration vector of particle i, and c_max and c_min represent the upper and lower bounds of each dimension, respectively. r₁ is a random vector in [0,1], and n is the number of particles.
Step 2: Balance pool and candidate pool
Swarm intelligence algorithms such as the EO algorithm, particle swarm optimization and the ant colony algorithm are population-based. These algorithms divide the search process into two phases, exploration and exploitation, and each algorithm handles the two phases differently. Every heuristic algorithm has an optimization objective based on its own metaphor. For example, the ant colony algorithm searches by mimicking ants foraging for food, whereas the EO algorithm searches for the equilibrium state of the system. However, during the optimization process the concentration level that reaches the equilibrium state is not known, so the equilibrium state is artificially defined by the four best particles found so far and their average. These five particles help the EO algorithm perform better in exploration and exploitation, and together they form the equilibrium pool:

$$\vec{C}_{eq,pool} = \{\vec{C}_{eq(1)}, \vec{C}_{eq(2)}, \vec{C}_{eq(3)}, \vec{C}_{eq(4)}, \vec{C}_{eq(ave)}\} \tag{5}$$
Step 3: Update method of concentration
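The EO algorithm needs a reasonable balance between exploitation and exploration. This is achieved through the exponential term $\vec{F}$, which depends on the turnover rate $\vec{\lambda}$. In a real control volume the turnover rate varies with time, so $\vec{\lambda}$ is assumed to be a random vector between 0 and 1.

$$\vec{F} = e^{-\vec{\lambda}(t - t_0)} \tag{6}$$

where the time t decreases as the iterations increase:

$$t = \left(1 - \frac{iter}{t_{max}}\right)^{a_2 \frac{iter}{t_{max}}} \tag{7}$$

where iter and t_max represent the current iteration number and the maximum iteration number respectively, and a₂ is a constant that controls exploitation capability. In addition, parameter a₁ is designed to enhance the diversity and exploration capacity of the population, as follows:

$$\vec{t}_0 = \frac{1}{\vec{\lambda}} \ln\left(-a_1 \, \mathrm{sign}(\vec{r} - 0.5)\left[1 - e^{-\vec{\lambda} t}\right]\right) + t \tag{8}$$

where $\vec{r}$ is a random vector in [0,1]. Substituting (8) into (6) gives the closed form $\vec{F} = a_1 \, \mathrm{sign}(\vec{r} - 0.5)\left[e^{-\vec{\lambda} t} - 1\right]$.

The generation rate $\vec{R}$ is another term used to improve exploitation, and it is defined as follows:

$$\vec{R} = \vec{R}_0 \, e^{-\vec{\lambda}(t - t_0)} = \vec{R}_0 \vec{F} \tag{9}$$

$$\vec{R}_0 = \overrightarrow{GCP} \, (\vec{C}_{eq} - \vec{\lambda} \vec{C}) \tag{10}$$

$$\overrightarrow{GCP} = \begin{cases} 0.5 \, r_1, & r_2 \geq GP \\ 0, & r_2 < GP \end{cases} \tag{11}$$

where $\vec{\lambda}$ is the random turnover vector in [0,1], r₁ and r₂ are random numbers in [0,1], and GCP is the control parameter for the generation rate; together with the generation probability GP, it determines whether the generation rate is applied in the update.

Finally, the update equation of EO is as follows:

$$\vec{C} = \vec{C}_{eq} + (\vec{C} - \vec{C}_{eq}) \, \vec{F} + \frac{\vec{R}}{\vec{\lambda} V} (1 - \vec{F}) \tag{12}$$

Here V is assigned 1. For a more detailed introduction to the EO algorithm, please refer to Faramarzi [43].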
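The three steps above can be sketched as a small pure-Python routine. This is a simplified illustration under stated assumptions, not the author's implementation: the defaults a1 = 2, a2 = 1 and GP = 0.5 are assumed from the settings commonly reported for EO [43], the exponential term uses the closed form F = a1·sign(r − 0.5)(e^(−λt) − 1), per-particle memory saving is omitted, positions are clamped to the bounds, and the sphere function is used only as a stand-in objective. All names are hypothetical.

```python
import math
import random


def sphere(x):
    """Stand-in objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def eo_minimize(objective, dim, c_min, c_max, n_particles=30, t_max=200,
                a1=2.0, a2=1.0, gp=0.5, v=1.0, seed=0):
    """Simplified EO loop; returns (best_fitness, best_position)."""
    rng = random.Random(seed)
    # Step 1: uniform random initialization inside [c_min, c_max].
    conc = [[c_min + rng.random() * (c_max - c_min) for _ in range(dim)]
            for _ in range(n_particles)]
    best = []  # the four best (fitness, position) pairs found so far
    for it in range(t_max):
        # Step 2: refresh the four best particles and build the pool
        # from them plus their average.
        scored = best + [(objective(c), c[:]) for c in conc]
        scored.sort(key=lambda p: p[0])
        best = scored[:4]
        avg = [sum(p[1][d] for p in best) / len(best) for d in range(dim)]
        pool = [p[1] for p in best] + [avg]
        # Step 3: the time parameter t shrinks as iterations progress.
        t = (1 - it / t_max) ** (a2 * it / t_max)
        for i, c in enumerate(conc):
            c_eq = rng.choice(pool)
            # GCP decides whether the generation term is active.
            r1, r2 = rng.random(), rng.random()
            gcp = 0.5 * r1 if r2 >= gp else 0.0
            new = []
            for d in range(dim):
                lam = rng.random() or 1e-12  # turnover rate, never zero
                sign = 1.0 if rng.random() >= 0.5 else -1.0
                f_exp = a1 * sign * (math.exp(-lam * t) - 1.0)
                rate = gcp * (c_eq[d] - lam * c[d]) * f_exp
                x = (c_eq[d] + (c[d] - c_eq[d]) * f_exp
                     + rate / (lam * v) * (1.0 - f_exp))
                new.append(min(max(x, c_min), c_max))  # clamp to bounds
            conc[i] = new
    return min(best + [(objective(c), c) for c in conc], key=lambda p: p[0])
```

Calling `eo_minimize(sphere, 5, -10.0, 10.0)` returns the best fitness found and the corresponding position. Because the pool keeps the best particles seen so far, the returned fitness never gets worse as `t_max` grows for a fixed seed.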

3.2 Levy Equilibrium Optimizer

Although the EO algorithm uses parameters such as a₁ to enhance the exploration ability of the population, its population diversity still decreases in later iterations. This increases the probability of falling into a local optimum, and the problem may be exacerbated in practice by more complex conditions. Moreover, each individual update depends mainly on the turnover rate and is then made randomly with respect to the current global optimum and the equilibrium pool. Because the global optimum found early in the run is often far from the true optimum, this strategy increases the probability of the algorithm falling into a local optimum and may slow convergence. A study by Reynolds et al. [44] showed that Drosophila explore their environment and search for food through a series of straight flight paths interspersed with abrupt right-angle turns. An intermittent scale-free search model, called Levy flight, was proposed based on this scale-free flight. The model has since been applied to optimization and optimal search, and preliminary results show good search performance [45]. In the LEO algorithm, a Levy flight update strategy replaces the random update based on the current global optimum, which reduces the influence of the minimax pool individuals on the update mechanism. In this paper, Levy flight is applied to the pool in the late iterations of the EO algorithm to accelerate convergence and to jump out of local optima, and the output of the EO algorithm is then processed; this expands the search scope of the EO algorithm and yields a larger code set. Starting from an initialized set S, the codes in S and in the EO output set S_EO are checked one by one against the combinatorial constraints. The flow chart of the LEO algorithm is shown in Fig 1.
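The text does not give the Levy step formula, so the sketch below is an assumption: it draws heavy-tailed steps with Mantegna's algorithm (a common choice, here with β = 1.5) and perturbs a position relative to the current best, in the style of cuckoo-search implementations. The names `levy_step` and `levy_update`, the `scale` factor and the update rule itself are hypothetical and may differ from the repository code.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """One heavy-tailed step drawn with Mantegna's algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta
                  * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    # Guard against a zero denominator in the (very unlikely) case v == 0.
    return u / max(abs(v), 1e-12) ** (1 / beta)


def levy_update(position, best, c_min, c_max, scale=0.01, rng=random):
    """Perturb each coordinate by a Levy step scaled by its offset from best."""
    new = []
    for x, b in zip(position, best):
        moved = x + scale * levy_step(rng=rng) * (x - b)
        new.append(min(max(moved, c_min), c_max))  # keep inside the bounds
    return new
```

Most steps are small, but occasional long jumps let a particle leave the neighbourhood of a local optimum, which is the behaviour the LEO description relies on.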

Fig 1. Flow chart of LEO algorithm.

https://doi.org/10.1371/journal.pone.0277139.g001

4. Result and analysis

4.1 Benchmark function

To verify the performance of the LEO algorithm, benchmark test functions are used in this paper. Benchmarking was carried out on the 13 widely used benchmark functions [46] listed in Tables 1 and 2. Different algorithms target different types of real-world problems, and no single algorithm is certain to achieve the best results on every problem. Since the test functions simulate real problems, different algorithms may suit different test functions. The thirteen benchmark functions comprise seven high-dimensional unimodal functions and six high-dimensional multimodal functions. These 13 functions are intended to reflect most real-world problems, and testing on them provides a useful indication of an algorithm's performance. For fairness, and to improve the reliability of the results and the rigour of the experiments, the domain of definition and the number of iterations of the test functions are fixed. The convergence process of LEO is illustrated in Fig 2: in the initial stage LEO and EO have the same iteration efficiency, but in the later stage LEO converges faster and gets closer to the global optimum. This is because Levy flight helps LEO jump out of local optima and improves the iteration speed.
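For illustration, one unimodal and one multimodal test function can be written as follows. This sketch assumes the usual 13-function suite in which F1 is the sphere function and F9 is the Rastrigin function; the exact definitions used in the paper are those of Tables 1 and 2.

```python
import math


def sphere(x):
    """Unimodal: a single minimum of 0 at the origin."""
    return sum(v * v for v in x)


def rastrigin(x):
    """Multimodal: many local minima, global minimum 0 at the origin."""
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)
```

The cosine term in `rastrigin` creates a regular grid of local minima, which is why multimodal functions of this kind test an algorithm's ability to escape local optima.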

Fig 2. Comparison of convergence curves of LEO and EO algorithms on F7.

https://doi.org/10.1371/journal.pone.0277139.g002

Table 1. Unimodal benchmark functions.

https://doi.org/10.1371/journal.pone.0277139.t001

Table 2. Multi-modal benchmark functions.

https://doi.org/10.1371/journal.pone.0277139.t002

After running each of the 13 test functions 30 times, the mean and standard deviation of the results were compared with those of the original algorithm and other representative algorithms. We selected the EO, PSO, GWO, GA, GSA and SSA algorithms for comparison: EO is the latest work from Mirjalili et al. [29], GA is the earliest evolutionary algorithm and performs well, PSO is a heuristic algorithm that mimics group behavior, and GSA is a physics-based algorithm. The maximum number of iterations for these algorithms is set at 500. The EO, PSO, GWO, GA, GSA and SSA results are taken from Faramarzi's work [43]. Tables 3 and 4 list the results.

Table 3. Average result of benchmark functions.

https://doi.org/10.1371/journal.pone.0277139.t003

Table 4. Standard deviation of benchmark functions.

https://doi.org/10.1371/journal.pone.0277139.t004

F1-F7 are high-dimensional unimodal functions with a single global optimum, so they are usually used for general testing of algorithms. F8-F13 each have one global optimum and several local optima, and the number of local optima increases with the dimension. This makes them harder for heuristic algorithms and better reflects an algorithm's optimization speed and its ability to jump out of local optima. Tables 3 and 4 show LEO's performance on the 13 test functions; in most cases LEO achieved the best results in the table. However, on complex functions such as F12 and F13, LEO's performance is unsatisfactory, possibly because the Levy strategy is limited on these multimodal functions, so the optimal solution is not obtained. On high-dimensional unimodal functions such as F1-F5, by contrast, the LEO algorithm finds the global optimal solution 0. To further assess the statistical significance of the results, we conducted the Wilcoxon rank sum test on the LEO algorithm, and in most cases LEO passed statistical verification. The results are shown in Table 5.

Table 5. P values of Wilcoxon rank sum test over 30 runs.

https://doi.org/10.1371/journal.pone.0277139.t005

4.2 Lower bound of the DNA storage code set

The DNA code set with length n and Hamming distance d that meets the Hamming distance constraint, the GC content constraint and the No-runlength constraint is denoted A^GC,NL(n, d, w). Table 6 reports lower bounds on the size of this set for 4 ≤ n ≤ 10 and 3 ≤ d ≤ n under these constraints. Any optimization algorithm requires a fitness function, so the LEO algorithm uses the sum of Hamming distances, one of the constraints, as the fitness function for the constrained DNA encoding process.

(13)
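To make the notation concrete, the sketch below builds a code set that satisfies all three constraints with a plain greedy scan over every sequence of length n. It is not the LEO search and generally yields a smaller set than an optimized search would. It also assumes that w in A^GC,NL(n, d, w) denotes the exact number of G and C bases per codeword, a convention from the constrained-coding literature that the text does not state. All names are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def is_valid(seq: str, w: int) -> bool:
    """GC count equals w (assumed meaning of w) and no adjacent repeats."""
    gc_ok = sum(base in "GC" for base in seq) == w
    run_ok = all(x != y for x, y in zip(seq, seq[1:]))
    return gc_ok and run_ok


def greedy_code_set(n: int, d: int, w: int) -> list[str]:
    """Scan all 4**n sequences in lexicographic order and keep each valid
    one that is at Hamming distance >= d from every codeword kept so far."""
    codes: list[str] = []
    for bases in product("ACGT", repeat=n):
        seq = "".join(bases)
        if is_valid(seq, w) and all(hamming(seq, c) >= d for c in codes):
            codes.append(seq)
    return codes
```

The size of the returned list is a valid lower bound on the code set size for the given (n, d, w), because every pair of kept codewords satisfies all three constraints. Exhaustive enumeration is only practical for small n.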

Table 6 lists the results of the LEO algorithm and compares them with the best results from Li and Limbachiya [47]. Entries in bold are the best result under the same constraints, A denotes the best result in Limbachiya and Li, and LEO denotes the result of this paper. When n = 9 and d = 4, the DNA storage code set constructed by the LEO algorithm was 13.2% larger than the result in previous representative work. This is because the LEO algorithm uses the generation probability and the equilibrium pool mechanism to balance exploration and exploitation well, and uses the Levy flight strategy in late iterations to jump out of local optima and approach the optimal solution more closely. The results of the LEO algorithm provide a good initialization, and the equilibrium pool strategy further extends the results of the EO algorithm. A larger set of DNA storage codes reduces the cost of a DNA storage system, since the same function can be performed at the same code length. Higher-quality DNA storage codes reduce the error rate in reading and writing and keep the overall DNA storage system operating reliably. DNA as a storage medium is also a low-carbon storage method.

Table 6. Coding lower bound of A^GC,NL(n, d, w).

https://doi.org/10.1371/journal.pone.0277139.t006

Comparing the results with those of Limbachiya and Li [47] shows that the LEO algorithm has a clear advantage over the best of them in terms of code set size. The altruistic algorithm is a greedy-based algorithm that removes the "worst" candidates in each iteration, iteratively discarding potential codewords to obtain a code set that satisfies the requirements. As the algorithm repeats, it greedily removes the codeword with the largest number of other codewords within radius d-1, until the minimum distance of the code set is d. However, altruistic algorithms based on greedy search do not consider global optimality and only construct a solution that is locally optimal in a specific sense. Similarly, the EORS algorithm has a random search phase, which is expected to find more valid DNA codes through a greedy search strategy, but at the cost of increased time complexity. Therefore, in this work we use the heuristic algorithm LEO. The LEO algorithm improves the EO algorithm with Levy flight and has the advantages of fast convergence and high population diversity, which help the EO algorithm converge faster and find an approximately optimal solution.

5. Conclusion

This paper proposes the LEO algorithm for DNA storage coding under combinatorial constraints. By approximating the constrained DNA storage coding problem as a multi-objective optimization problem, the heuristic algorithm LEO is used to find an approximately optimal solution. This fully exploits the native advantages of heuristic algorithms on non-linear multi-objective optimization problems and brings the low complexity of constrained encoding to the field of DNA storage encoding. Encoding that satisfies the constraints reduces the error rate in DNA synthesis and sequencing, as well as the probability of non-specific hybridization of DNA sequences during PCR. To demonstrate the LEO algorithm, it was compared with several well-established algorithms on the benchmark functions; the results show that LEO has significant advantages in average (AVE) and standard deviation (SD), indicating the effectiveness of the improvement. A larger DNA code set was constructed under the same combinatorial constraints. The experiments show that in the majority of cases the proposed coding scheme achieves satisfactory results compared with the best results of Li and Limbachiya, and the lower bound of the code set is significantly improved, which also illustrates the performance of the LEO algorithm from the perspective of practical applications. Under the same constraints, the size of the DNA storage code set constructed by the LEO algorithm is increased by 4–13%. A larger code set can store more valid information in the same DNA length, reducing costs and improving read and write efficiency. This means that the same performance can be achieved with shorter codes, allowing DNA storage systems to be more efficient and competitive at a lower cost.

In future work, we will continue to focus on DNA storage coding and to study the remaining problems of low coding efficiency, low coding quality and insufficient code set size. The aim is to achieve fully automated DNA storage as a powerful alternative to traditional silicon-based storage. In addition, encryption and decryption of image and text information can be considered for the security of DNA storage, with the eventual goal of encrypted carbon-based devices that integrate storage and computing, analogous to silicon-based computers.

References

1. Davis J. Microvenus. Art Journal 1996, 55(1):70–74.
2. Church GM, Gao Y, Kosuri S. Next-generation digital information storage in DNA. Science 2012, 337(6102):1628–1628. pmid:22903519
3. Goldman NM, Bertone P, Chen S, Dessimoz C, Leproust EM, Sipos B, et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 2013, 494(7435):77–80. pmid:23354052
4. Yazdi S, Yuan YB, Ma J, Zhao HM, Milenkovic O. A Rewritable, Random-Access DNA-Based Storage System. Scientific Reports 2015, 5. pmid:26382652
5. Bornholt J, Lopez R, Carmean DM, Ceze L, Seelig G, Strauss K. A DNA-Based Archival Storage System. ACM SIGPLAN Notices 2016, 51(4):637–649.
6. Erlich Y, Zielinski D. DNA Fountain enables a robust and efficient storage architecture. Science 2017, 355(6328):950–953. pmid:28254941
7. Shipman SL, Nivala J, Macklis JD, Church GM. CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 2017, 547(7663):345–349. pmid:28700573
8. Organick L, Ang SD, Chen Y, Lopez R, Yekhanin S, Makarychev K, et al. Random access in large-scale DNA data storage. Nat Biotechnol 2018, 36(3):242–248. pmid:29457795
9. Anavy L, Vaknin I, Atar O, Amit R, Yakhini Z. Data storage in DNA with fewer synthesis cycles using composite DNA letters. Nat Biotechnol 2019, 37(10):1229–1236. pmid:31501560
10. Banal JL, Shepherd TR, Berleant J, Huang H, Reyes M, Ackerman CM, et al. Random access DNA memory using Boolean search in an archival file storage system. Nat Mater 2021, 20(9):1272–1280. pmid:34112975
11. Bee C, Chen YJ, Queen M, Ward D, Liu X, Organick L, et al. Molecular-level similarity search brings computing to DNA data storage. Nat Commun 2021, 12(1):4764. pmid:34362913
12. El-Shaikh A, Welzel M, Heider D, Seeger B. High-scale random access on DNA storage systems. NAR Genomics and Bioinformatics 2022, 4(1). pmid:35156022
13. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol 2013, 14(5):R51. pmid:23718773
14. Bornholt J, Lopez R, Carmean DM, Ceze L, Seelig G, Strauss K. Toward a DNA-Based Archival Storage System. IEEE Micro 2017, 37(3):98–104.
15. Blawat M, Gaedke K, Hutter I, Chen X, Turczyk BM, Inverso SA, et al. Forward Error Correction for DNA Data Storage. international conference on conceptual structures 2016, 80(80):1011–1022.
16. Jeong J, Park S-J, Kim J-W, No J-S, Jeon HH, Lee JW, et al. Cooperative Sequence Clustering and Decoding for DNA Storage System with Fountain Codes. Bioinformatics (Oxford, England) 2021. pmid:33904574
17. Wang YX, Noor-A-Rahim M, Gunawan E, Guan YL, Poh CL. Construction of Bio-Constrained Code for DNA Data Storage. Ieee Communications Letters 2019, 23(6):963–966.
18. Wang SC, Lu ZY, Cao Q, Jiang H, Yao J, Dong YY, et al. Exploration and Exploitation for Buffer-Controlled HDD-Writes for SSD-HDD Hybrid Storage Server. ACM Transactions on Storage 2022, 18(1).
19. Cao B, Zhang X, Wu J, Wang B, Zhang Q. Minimum free energy coding for DNA storage. IEEE Trans NanoBiosci 2021, 20(2):212–222. pmid:33534710
20. Yin Q, Cao B, Li X, Wang B, Zhang Q, Wei X. An Intelligent Optimization Algorithm for Constructing a DNA Storage Code: NOL-HHO. International Journal of Molecular Sciences 2020, 21(6). pmid:32235762
21. Zheng Y, Wu J, Wang B. CLGBO: An Algorithm for Constructing Highly Robust Coding Sets for DNA Storage. Frontiers in Genetics 2021, 12(673). pmid:34017354
22. Yehezkeally Y, Schwartz M. Reconstruction Codes for DNA Sequences With Uniform Tandem-Duplication Errors. IEEE Trans Inf Theory 2020, 66(5):2658–2668.
23. Wang P, Mu Z, Sun L, Si S, Wang B. Hidden Addressing Encoding for DNA Storage. Frontiers in Bioengineering and Biotechnology 2022, 10. pmid:35928958
24. Lenz A, Siegel PH, Wachter-Zeh A, Yaakobi E. Coding Over Sets for DNA Storage. IEEE Trans Inf Theory 2020, 66(4):2331–2351.
25. Zan X, Yao X, Xu P, Chen Z, Xie L, Li S, et al. A hierarchical error correction strategy for text DNA storage. Interdisciplinary Sciences: Computational Life Sciences 2022, 14(1):141–150. pmid:34463928
26. Li X, Wang B, Lv H, Yin Q, Zhang Q, Wei X. Constraining DNA sequences with a triplet-bases unpaired. IEEE Trans NanoBiosci 2020, 19(2):299–307. pmid:32031945
27. Wang B, Zheng X, Zhou S, Zhou C, Wei X, Zhang Q, et al. Constructing DNA Barcode Sets Based on Particle Swarm Optimization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018, 15(3):999–1002. pmid:28287980
28. Wang G-G, Deb S, Cui Z. Monarch butterfly optimization. Neural Computing and Applications 2019, 31(7):1995–2014.
29. Li S, Chen H, Wang M, Heidari AA, Mirjalili S. Slime mould algorithm: A new method for stochastic optimization. Future Generation Computer Systems 2020, 111:300–323.
30. Wang G-G. Moth search algorithm: a bio-inspired metaheuristic algorithm for global optimization problems. Memetic Computing 2018, 10(2):151–164.
31. Yang Y, Chen H, Heidari AA, Gandomi AH. Hunger games search: Visions, conception, implementation, deep analysis, perspectives, and towards performance shifts. Expert Systems with Applications 2021, 177:114864.
32. Ahmadianfar I, Heidari AA, Gandomi AH, Chu X, Chen H. RUN beyond the metaphor: An efficient optimization algorithm based on Runge Kutta method. Expert Systems with Applications 2021, 181:115079.
33. Tu J, Chen H, Wang M, Gandomi AH. The colony predation algorithm. Journal of Bionic Engineering 2021, 18(3):674–710.
34. Ahmadianfar I, Heidari AA, Noshadian S, Chen H, Gandomi AH. INFO: An efficient optimization algorithm based on weighted mean of vectors. Expert Systems with Applications 2022, 195:116516.
35. Hashem M, Abdel-Salam M, El-Mohandes MT, Nayel M, Ebeed M. Optimal Placement and Sizing of Wind Turbine Generators and Superconducting Magnetic Energy Storages in a Distribution System. Journal of Energy Storage 2021, 38:102497.
36. Ahmed D, Ebeed M, Ali A, Alghamdi A, Kamel S. Multi-Objective Energy Management of a Micro-Grid Considering Stochastic Nature of Load and Renewable Energy Resources. Electronics 2021, 10:403.
37. Mostafa A, Ebeed M, Kamel S, Abdel-Moamen MA. Optimal Power Flow Solution Using Levy Spiral Flight Equilibrium Optimizer With Incorporating CUPFC. IEEE Access 2021, 9:69985–69998.
38. Li X, Han P, Wang G, Chen W, Wang S, Song T. SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genomics 2022, 23(1):474. pmid:35761175
39. Li X, Wei Z, Wang B, Song T. Stable DNA Sequence Over Close-Ending and Pairing Sequences Constraint. Frontiers in Genetics 2021, 12(697). pmid:34079580
40. Cao B, Li X, Zhang X, Wang B, Zhang Q, Wei X. Designing Uncorrelated Address Constrain for DNA Storage by DMVO Algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022, 19(2):866–877. pmid:32750895
41. Cao B, Zhao S, Li X, Wang B. K-Means Multi-Verse Optimizer (KMVO) Algorithm to Construct DNA Storage Codes. IEEE Access 2020, 8:29547–29556.
42. Cao B, Zhang X, Cui S, Zhang Q. Adaptive coding for DNA storage with high storage density and low coverage. npj Systems Biology and Applications 2022, 8(1):23. pmid:35788589
43. Faramarzi A, Heidarinejad M, Stephens B, Mirjalili S. Equilibrium optimizer: A novel optimization algorithm. Knowledge-Based Systems 2020, 191:105190.
44. Reynolds AM, Frye MA. Free-flight odor tracking in Drosophila is consistent with an optimal intermittent scale-free search. PLoS ONE 2007, 2(4):e354. pmid:17406678
45. Viswanathan GM, Buldyrev SV, Havlin S, Da Luz M, Raposo E, Stanley HE. Optimizing the success of random searches. Nature 1999, 401(6756):911–914. pmid:10553906
46. Digalakis JG, Margaritis KG. On benchmarking functions for genetic algorithms. International Journal of Computer Mathematics 2001, 77(4):481–506.
47. Limbachiya D, Gupta MK, Aggarwal V. Family of Constrained Codes for Archival DNA Data Storage. IEEE Communications Letters 2018, 22(10):1972–1975.
