DNA Tetrominoes: The Construction of DNA Nanostructures Using Self-Organised Heterogeneous Deoxyribonucleic Acids Shapes

The unique programmability of nucleic acids offers alternative in constructing excitable and functional nanostructures. This work introduces an autonomous protocol to construct DNA Tetris shapes (L-Shape, B-Shape, T-Shape and I-Shape) using modular DNA blocks. The protocol exploits the rich number of sequence combinations available from the nucleic acid alphabets, thus allowing for diversity to be applied in designing various DNA nanostructures. Instead of a deterministic set of sequences corresponding to a particular design, the protocol promotes a large pool of DNA shapes that can assemble to conform to any desired structures. By utilising evolutionary programming in the design stage, DNA blocks are subjected to processes such as sequence insertion, deletion and base shifting in order to enrich the diversity of the resulting shapes based on a set of cascading filters. The optimisation algorithm allows mutation to be exerted indefinitely on the candidate sequences until these sequences complied with all the four fitness criteria. Generated candidates from the protocol are in agreement with the filter cascades and thermodynamic simulation. Further validation using gel electrophoresis indicated the formation of the designed shapes. Thus, supporting the plausibility of constructing DNA nanostructures in a more hierarchical, modular, and interchangeable manner.

Most of the current DNA nanostructures reported were constructed using various approaches, including the DNA origami technique [13,24,25] and the use of DNA sticky ends to connect different molecular tiles [26]. Although successful in providing proof-of-concepts for designing programmable DNA blocks, these conventional approaches were restricted because they required carefully designed and well-defined structures to ensure preordained error-less organisation. These structures can only be generated through a set of distinct and restrictive sequences. Therefore, any base mutations or mispairings would result in incompatible binding and thus, non-conformity of the desired structures.
In resolving this issue, we investigated the plausibility of utilising the self-organising property of DNA molecules to oversee the formation of DNA nanostructures. This minimises the complexity of designing DNA strands and the self-assembly (folding) errors usually encountered during the formation of such structures. This is achieved by allowing competition between various heterogeneous shapes to occur without any pre-specified binding instructions (i.e., orchestration of blocks [27] instead of total programmability). While homogenous blocks (identical and symmetrical DNA shapes) as fundamental units ensure conformity in the structure formation [28,29], our heterogeneous blocks are unique in being able to promote self-organisation and thus better optimisation of structure formation. The assembly using heterogeneous blocks allows the formation of structures to rely purely on natural processes without any interference from predefined sets of instructions. This increases the flexibility of constructing DNA nanostructures as the formation of the structures are achieved through any combination of competing DNA shapes. These DNA shapes are designed with stable free energy to keep the rigidity of the later formed structures intact. The minimum or maximum number of DNA shapes involved in each formation can be optimised during the design stage. The comparison between the conventional approaches (DNA Origami and SST) with the proposed method is presented in Table 1.
The fabrication of DNA nanostructures starts with the sequence design phase. This is commonly conducted using computational tools [33][34][35][36] to generate sequences with minimum free energy (MFE) and minimisation of sequence symmetry [33,37]. Programs such as Tiamat [33] and SEQUIN [38] were developed to prohibit sequence symmetry from occurring throughout the DNA construct (i.e., to avoid undesired pairing within the design sequences). On the contrary, our autonomous protocol allows the structure to have sequence symmetry at tolerable degrees. It allows the structures to form, subjected to the occurrences of some unwanted aggregates. This is necessary to handle the formation of structures under undesirable and uncontrollable physicochemical conditions. This would be applicable for a specific scenario where structures are built in-situ inside living cells, as compared to the conventional method of building the structures externally (thus requiring a complicated delivery mechanism afterwards) [39]. One such scenarios could be where we have DNA nanorobots with acceptable levels of stability inside a cellular environment, however with the drawback of time limitation to resist enzymatic degradation [40,41].
In this study, we propose a new hierarchical schema of assembling supra-molecular structures. As a basis, we use two single stranded DNAs to form our elementary blocks. Then, using these elementary blocks, we constructed four distinct DNA shapes (called DNA Tetrominoes; as the shapes resemble some basic shapes available in the game Tetris). These shapes then would further assemble into the intended supra molecular structures (as illustrated in Fig 1).
Compared to conventional DNA origami approaches, during the final assembly phase, in our proposed schema, the formation of the supra structures would entirely be dependable on the DNA shapes themselves. This is possible because in fabricating a particular DNA structure, our autonomous protocol will generates an N amount of DNA shapes with M amount of sequences for each shape (N and M are dynamic variables that can be customised according to the user). Through in-silico optimisation, the most preferred shape and sequence combinations will take precedence, however, since the shape and sequence combinations are modular, each shape and sequence are interchangeable without affecting the desired structures. The schema promotes a complete outlook of the shapes and sequences landscape necessary in designing any DNA structures. The proposed schema also relief the constraints of identifying and specifying base pairing dependencies mandatory in constructing any DNA structures, therefore allowing for a more flexible DNA structures construction.

Materials and Methods
An autonomous protocol was developed that comprised of (i) a sequence design pipeline to generate sequences for each DNA Tetris shape and (ii) an optimisation algorithm to mutate sequences that violated the fitness criteria. The DNA Sequence Generator program [42] was embedded inside the sequence design pipeline to facilitate the computational speed of generating the desired sequences. The optimisation algorithm was built entirely using Tool Command Language (TCL) [43][44][45] and Perl version 5.12.4 and the complete protocol was tested under the Unix environment in Mac OS X, version 10.7.5.

Protocol (i): Single Stranded Deoxyribonucleic Acid (ssDNA) Sequence Design
The DNA Sequence Generator (DSG) available from the CANADA package [42,46] was used to generate the initial sequences of single stranded deoxyribonucleic acids (ssDNA). This program uses a fully automatic graph based approach to create uniqueness within a pool of sequences. The default parameters were applied (as suggested in [42,46]), with the exception of sequence length, which was set to 25 nucleotides.

Degree of Difficulty in Sequence Design
Restrictive: Meticulous sequences design (each staple strand is targeted to a specific location in the scaffold).
Moderate: Design of sequences for uniform motif is straightforward, however individualised sequences design is required to enable intermolecular binding between domains in every SST motif [28,29,31].
Easy: Each shape is independent of the overall structure design (and only conform to its individual shape), then the sequence generation is context free.
Matching dangling ends are randomised to each shape afterwards.

Programmability
Hard: DNA Origami uses a long scaffold as construction material. Therefore, in order to build a gigadalton nanostructure, it would requires a scaffold (over 1 megabase) which is mechanically fragile and difficult to synthesize [32] Moderate: One standardized length with four domains is used throughout structure. (e.g. 42 nucleotides single stranded DNA was used to build a molecular tube [28], 32 nucleotides single stranded DNA were used to build 3-D shapes [31] Loose: Any Tetris shape can combine together with different combinations to generate larger structures. The users only need to program the sticky ends of each Tetris shape to be compatible. Modification of ssDNA to Block Structures. The ssDNAs generated from the DSG were then modified into a block structure of double stranded DNA (dsDNA). The main block and sticky ends are 15 and 10 nucleotides respectively. There are two basic types of blocks (type-1 and type-2) as depicted. Every two ssDNAs were treated as a block. For each pair, sequences in the main block were modified to complement with each other to form block structures as shown in Fig 2. Stacking and Merging of Blocks. A single crossover between blocks was implemented into the design by stacking one DNA block upon another block. DNA strands were subjected for modification such as the position of block stacking, nucleotide shifting, sequence insertion and deletion were incorporated into the optimisation algorithm to ensure a greater versatility in nucleotide combinations for the resulting structures. Each basic Tetris shape (with an exception of the I-Shape) was built from six ssDNAs or eight ssDNAs, which were then merged to form four long continuous ssDNAs. Meanwhile, the I-Shape was formed using two ssDNAs (Fig 3).

Protocol (ii): Optimisation of DNA sequences
Sequences generated from the pipeline were optimised further to increase the feasibility of the desired Tetris structures formation in the laboratory. The optimisation algorithm, which incorporated four fitness criteria (Table 2), was used to calculate the penalty scores for all the generated sequences in each population. The penalty score increases (i.e., increment by a point) whenever the sequence does not pass any of the evaluation criteria (otherwise, the penalty score will be nil). If the total penalty score of the four fitness criteria exceeds 0, the sequence will undergo a mutation process. The algorithm will randomly select new nucleotides to replace the existing nucleotides (at any random position) in the mutation permissible region. Only a single nucleotide will be mutated at a time; the penalty score will be recalculated and mutations will be conducted repeatedly until the penalty score becomes nil.
Base Pairing at False Binding Sites (FBS). As a general rule during DNA assembly, it is crucial for DNA sequences to form base pairings exactly at the pre-defined positions. At the same time avoiding pairings at unwanted positions (mispairing). This is also known as "binding specificity". Unfortunately, such false-binding sites (FBS) could not be completely avoided; otherwise the sequence diversity would be extremely low. As a consequence, base pairings at false-binding sites were limited to shorter lengths so that the thermodynamic stability [48][49][50] of the false-binding sites is predicted to have low energy and accordingly, low probability of hybridisation. Therefore, this criterion was included as a crucial filter and was intended to detect the longest complementary region that existed between two sequences. In this work, base pairing at a false-binding site is defined as the occurrence of two sequences that form base pairings at unwanted positions.
The detection program was written using Perl version 5.12.4 and was processed using the following three scripts: (i) FindStartPosition.pl, (ii) CleanEmptyPosition.pl and (iii) GetLongest-Complement.pl. The calculations were conducted by aligning a query sequence against the remaining corresponding target sequences. The query was shifted a nucleotide at a time towards the 3' terminal to search for any complementary nucleotides in the target. During each shift, if a nucleotide from a target strand is complementary with the nucleotide from the query strand; the False Binding Sites (FBS) score increases by one, (if and only if the longest complementarity at unwanted position is more than six, otherwise the FBS-score remained unchanged). The final FBS-score represents the longest consecutive stretch of complementary bases that was detected between the two strands.
The first script (FindStartPosition.pl) was employed to find all positions that have a minimum of seven consecutive complementary (Q min ) nucleotides between the query and target sequence. It listed out every start position that matches to the minimum complementary bases.  (3a) Schematic illustration for L-shape formation using 3 blocks or 6 DNA strands (L1-L6). These strands were then subjected for modification (insertion of 10 and 15 nucleotides to L1 and L5, insertion of 5 nucleotides to L6) and block stacking (Block L5-L6 was stacked on position -5 (to the left) relative to Block L1-L2). After modification, these 3 blocks then merged to form 4 long strands (CL1-CL4). (3b) Schematic illustration for T-shape formation using 4 blocks or 8 DNA strands (T1-T8). These strands were then subjected for nucleotides modification (insertion of 10 and 20 nucleotides to form blunt end on T6 and sticky ends on T8. After modification, it was then merged to form 4 newly combined strands (CT1-CT4). (3c) Schematic illustration for B-shape formation using 4 blocks or 8 DNA strands (B1-B8). These strands were then subjected for modification such as deletion (Deletion of 10 nucleotides on strand B2-B4, B6 and B7) and fragment shifting (fragment "TCTAA" shift from strand B7 to B8). Thereafter, blocks were merged to form into 4 long strands (CB1-CB4). (3d) Schematic illustration for I-shape formation using a single block or 2 DNA strands (I1, I2). These strands were subjected for modification, deletion of 10 nucleotides occurred on I2 sticky ends. Following modification, CI1 and CI2 are the new strands. The output of FindStartPosition.pl listed every start position in the query ($QStart) and target ($TStart). The function of CleanEmptyPosition.pl is to remove the query, which does not meet the threshold value of at least seven consecutive matching nucleotides. The GetLongestComplement.pl script was then executed to obtain the longest matching complementary sequence using the start position output from the first script. Parameters are dynamic can be customised to any design specification (e.g. minimum consecutive complementary might be different depending on design of the DNA shapes).
Thermodynamic Energy of inter-molecular and intra-molecular DNA pairings. The thermodynamic energy for a DNA sequence to form self-folding (intra-molecular) and double stranded folding (intermolecular) were calculated using the "AllSub" and "DuplexFold" programs available in the RNAstructure package [47]. The program "AllSub" is selected to generate all possible low free energy structures of a given DNA sequence. The program "DuplexFold" is used to predict the lowest free energy structure for two interacting sequences with a constraint of not allowing any intra-molecular base pairs to occur. Default parameters were selected with the exception of the RNA/DNA option, which was set to only DNA.
This fitness evaluation required the free energy of "AllSub" to be higher (less negative) than the energy of "DuplexFold" (more negative). This was to ensure a relatively more stable structure when bindings occurred between two ssDNAs as compared to the stability of ssDNA selffolding. This is to ensure that correct base-pair formation for inter-molecular assembly occurs.
G4 Pattern. The sequence design was prevented from having a G4 sub-sequence pattern because such sequences are favourable to form an unintended four-stranded G4 DNA structure [51].
Percentage of GC Content. The number of Gs and/or Cs of oligonucleotides is between 40% and 70% inclusive. The GC content was calculated by obtaining the number of GC versus the total nucleotide content.
Mutation. Mutations were exerted on DNA strands if the total score from all four fitness criteria are more than zero. The regions for the mutations to be exercised were based on 2 conditions depending whether a forbidden region exists (Condition 1 if the region exists and condition 2 otherwise). Variable $MutateRegion is a list of nucleotide positions that allow mutations to occur, while variable $ForbidPosition is a list of nucleotide positions that does not allow mutations to occur mainly because these nucleotides are hybridised with the previous strands. The formula for determining the mutation regions is $MutateRegion = $AllPosition-$ForbidPosition. For instance, the calculation of the $MutateRegion if there is forbidden region is depicted in Table 3. In this instance, sequence CB2 has 30 nucleotides, and the nucleotides numbered 16-30 from CB2 are complementary with nucleotides numbered 1-15 from strand CB1. Table 4 depicts an example of the mutating region ($MutateRegion) where the forbidden region is non-existence. In this instance, sequences in CL1 do not have complementary binding with any sequences. The length of the molecule is 35 nucleotides. Table 2. Fitness evaluation criteria for sequence optimisation implemented in Protocol (ii).

Fitness criterion Description
Base pairing at false binding sites The detection program exhaustively looks for maximum consecutive base pairing at unwanted positions (also known as False Binding Site or FBS), FBS max = 6.
Thermodynamics free energy The calculation of thermodynamics incorporated the use of program AllSub and DuplexFold [47] Thermodynamics free energy of intramolecular pairing must be higher compared to inter-molecular pairing.

Percentage of Guanine-Cytosine (GC) content
Percentage of GC must be in the range of 40% to 70% inclusive.
No existence of G4 pattern No existence of GGGG pattern was allowed in the sequence.
Therefore, in order for a mutation to occur, a position will be randomly selected, identified as X in $MutateRegion and X will be replaced with a randomly selected nucleotide, N New .
Protocol for Laboratory Validation of the Constructed DNA Tetris Blocks DNA Annealing. Oligonucleotides were purchased from Integrated DNA Technologies Pte Ltd. Complexes of shapes were formed by mixing stoichiometric quantities of each strand at 0.5 μM concentration in a buffer consisting of 40 mM Tris base, 2.5 mM EDTA, and 13 mM MgCl 2 . Then, the complexes were formed by annealing the reaction mixture for three hours from 90°C to 4°C in an Eppendorf Mastercycler Pro S thermocycler (Eppendorf, Hamburg, Germany). To form individual shapes, 4 oligonucleotides were mixed stoichiometrically in a buffer containing 40 mM Tris base, 2.5 mM EDTA, and 13 mM MgCl 2 . The final concentration of oligonucleotides was set to 0.5 μM. The solution containing DNA sequences were not treated with any DNA polymerases, to ensure that they were held together only by non-covalent interactions (e.g. hydrogen bonds and base stacking).

Results and Discussion
Four Tetris shapes; L-Shape, T-Shape, B-Shape and I-Shape were successfully generated. The B-Shape, and T-Shape were built from four blocks; the L-Shape from three blocks while the I-Shape used only a single block. Two types of blocks were introduced, the Type-1 was used to build the L-Shape, while the Type-2 was used to build the T-Shape, B-Shape and I-Shape. To ensure that molecular optimisation can be approximated accordingly, merged blocks were further subjected to insertion, deletion and the shifting of nucleotides between strands. Two neighbouring blocks were linked using the existing sticky ends while a single crossover was utilised to ensure the linkage between two blocks formed when the two blocks are stacked on top of each other. The merging of short sequences from 3 blocks (L-Shape) and 4 blocks (T-Shape, B-Shape) resulted in four long stretches of DNA sequences (Fig 4).

Analysis of Laboratory Validation
In this work, our autonomous pipeline generates 500 populations for each individual shape. A random sample from each shape was taken for gel electrophoresis study to detect the assembly Table 3. Condition 1 (the existence of the forbidden region).  Table 4. Condition 2 (the non-existence of the forbidden region).  of the ssDNA components into the Tetris structures. Previous study reported that five nucleotides [11] are sufficient to create the possibility of binding, although six [52] or more are more commonly used; anything less than five is regarded as insufficient to form stable binding. Using the sequences from the random sample, in Fig 5 we highlighted mispairings of bases that might influence the result of our laboratory validation.
To fully implement the proposed hierarchical schematic, a less stringent approach was adopted during the sequence design. We allowed mispairing of bases (i.e., complementary binding at incorrect position) to occur in the designed sequences (with subtle limitations) to ensure that the correct bindings would still occur. By referring to the Fig 5 and the resulting gel electrophoresis experiment in Fig 6, we could observed that there are two extra bands appeared below the major bands which are the unwanted aggregates proceeding from the mispairing between CB1-CB4 and CB2-CB3. As for the T-shape, there is an extra band with the same band size observed in both Lane 12 and Lane 13. Similarly, these are the unwanted aggregates derived from the mispairing of CT2-CT3. The complementary binding at the correct position is set at least 10 nucleotides to provide sufficient strength in the structure formation. Supported by the gel electrophoresis results, the formation of the designed DNA Tetris shape is satisfactory except for some minor unwanted aggregates (which is expected due to the allowance of the protocol).

Analysis of the autonomous protocol
The autonomous protocol optimises the following four parameters (i) FBS max = 6 [52], (ii) thermodynamic free energy = Δ DuplexFold < Δ AllSub , (iii) G4 pattern = 0 and (iv) percentage of GC between 40% to 70% inclusive. For each generation, if a sequence does not comply with all four fitness criteria, it will mutate to produce a new sequence and will be re-evaluated. The whole cycle was repeated until the four criteria were satisfied. There are two parts in this algorithm (Fig 7) of fitness evaluation, i.e., the sequence evaluation based on the four parameters and the sequence mutation. The program required two inputs files: (i) Sequence.txt, (A file containing sequences produced from the pipeline to further undergo optimisation) and (ii) DefineSeq.txt (A file that lists all positions of nucleotides that form complementary binding between different DNA strands, listed in Table 5).
Thermodynamics Distribution for the Populations. The thermodynamics free energy for the interaction pairs, Δ DuplexFold was plotted (Fig 8). The distribution of the median (thick horizontal line) showed a relatively uniform distribution between the first and third quartile except for CI1-CI2. This implied that the majority of the populations have relatively similar thermodynamic energy approximations. DNA strands used for gel electrophoresis study  (except for pair CT1-CT2) have a higher energy (less negative) than the median of the populations. The red asterisks show the thermodynamics energy for sequences that were selected thorough random sampling for gel electrophoresis study.
Number of Iterations. The average number of iterations for B-Shape is 9.9±0.46 cycle, L-Shape 8.5±0.53 cycle, T-Shape 22.4±1.13 cycle and I-Shape 3.1±0.15 cycle. The number of iterations increased linearly as the number of nucleotides in mutated regions increases. This is linear with the number of positions that are permitted to mutate. Furthermore, the number of iterations is also dependent on the complexity of the fitness criteria. However, the approach is still effective and does not require complicated heuristics in order to generate candidate sequences for each DNA Tetris shape. The number of iterations required for each shape is relatively small and the computational process is relatively fast.
Each sequence is defined to be dependent or partially dependent on the nucleotide pattern from the previous sequence using a top-down method (e.g. L1!L2!L3!L4). The optimisation process will only proceed when sequence L1 has satisfied all the four criteria, and then continues with the following sequence (L2) until the designs for all sequences are completed. The lack of positions for sequence mutations such as for the I-shape (made up of two strands) caused the resulting structure to be less susceptible to changes. This is because the sequence arrangement in CI2 depends entirely on CI1 (CI2 not having sticky ends that can be mutated).

Conclusions
The problem of constructing any DNA nanostructures has always been associated with strict structural and sequence restrictions to ensure that conformity between sequence and its structural formation. This requires extensive knowledge of the molecule. In this work, we propose a simpler hierarchical schema of conducting the design phase. We design DNA shapes (in the form similar to DNA Tetris) that can be assembled into various DNA nanostructures. Our autonomous protocol is constructed in a manner where the parameters employed are flexible for any alterations and has the potential to be extended for complex DNA shape designs. This from-the-ground-up approach allows users with any level of knowledge on DNA molecule to design DNA shapes for the assembly of larger nanostructures due to its modularity. The proposed schema has the potential to become a platform of constructing a more autonomous, selforganised molecular constructs for advanced molecular information processing tasks.