Eugene – A Domain Specific Language for Specifying and Constraining Synthetic Biological Parts, Devices, and Systems

Background: Synthetic biological systems are currently created by an ad-hoc, iterative process of specification, design, and assembly. These systems would greatly benefit from a more formalized and rigorous specification of the desired system components as well as constraints on their composition. Therefore, the creation of robust and efficient design flows and tools is imperative. We present a human readable language (Eugene) that allows for the specification of synthetic biological designs based on biological parts and provides a very expressive constraint system to drive the automatic creation of composite Parts (Devices) from a collection of individual Parts.
Results: We illustrate Eugene's capabilities in three different areas: Device specification, design space exploration, and assembly and simulation integration. These results highlight Eugene's ability to create combinatorial design spaces and prune these spaces for simulation or physical assembly. Eugene creates functional designs quickly and cost-effectively.
Conclusions: Eugene is intended for forward engineering of DNA-based devices, and through its data types and execution semantics, reflects the desired abstraction hierarchy in synthetic biology. Eugene provides a powerful constraint system which can be used to drive the creation of new devices at runtime. It accomplishes all of this while being part of a larger tool chain which includes support for design, simulation, and physical device assembly.


Introduction
In its development as an engineering field, synthetic biology is at a stage where encapsulation has been identified as a fundamental challenge [1], [2], [3], [4]. Encapsulation will enable design re-use, sharing, and software tool development, all of which greatly increase synthetic biology's ability to grow both in complexity and in community size. Encapsulation has been shown to be very important in other engineering disciplines [5], [6], [7]. We present a domain specific programming language called Eugene, meant to encapsulate biological Parts, Devices, and Rules, paving the way for design space exploration, simulation, and automated assembly.
One popular encapsulation view in synthetic biology is that DNA sequence information can be encapsulated as a Part. Parts are well defined regarding the way in which they can be physically composed to create Devices [8], [9]. Parts and Devices then can be re-used in various designs, thus encouraging the development of new larger constructs for the community (see Figure 1) [10]. The process of developing standardized and well-characterized Parts is a key challenge, and community efforts in this direction have been undertaken through the BioBricks Foundation™ (http://bbf.openwetware.org/), the OpenWetWare initiative (http://openwetware.org/), and the International Genetically Engineered Machine (iGEM) competition (http://www.igem.org) [11].
Eugene (a play on the Greek prefix "eu" meaning "good" and the word "gene") is a human readable, executable specification [12], which reflects the creation of systems by defining, specifying, and combining collections of Parts. Eugene is inspired by the languages of the Electronic Design Automation (EDA) [13], [14] industry (e.g. Verilog [15] and VHDL [16]) in terms of its ability to provide a biological design netlist (a collection of components and their connections). This can be synthesized (automatically transformed) into collections of physical implementations in a design library [17].
Eugene development has focused on:
1. Flexible Part and Device specification and composition (see Methods and Supplemental Information).
2. Combinatorial design space exploration of Devices using an expressive system of Rules [18] (see Results).
3. Interaction with other tools for simulation and automated assembly (see Results and Figure 2).
This paper is organized around these three areas as shown in Figure 3.

Device Specification
Eugene is composed of primitives, constructs, rules, and functions. These elements are outlined in Table 1 along with a brief explanation. For the sake of brevity, we cannot cover this material in depth. For more details, see the Supplemental Information, http://www.eugenecad.org, and [19], which are devoted to covering Eugene's inner workings.
To provide the reader with the required understanding of Eugene, we will step through the creation of a "T4 Lysis Device with Pbad as the inducible Promoter". This is a standardized biological part and can be retrieved as BBa_K112809 in the MIT Registry of Standard Biological Parts (http://partsregistry.org).
Figure 1. Shown are the various layers of abstraction at which Eugene operates. DNA information forms the most basic unit on which everything else is built (e.g. the genetic code, as specified by bases G, A, T, and C). This is followed by Parts. Parts are non-reducible elements of genetic composition (e.g. promoters, ribosome binding sites, open reading frames, etc). Devices, which can contain one or more Parts, are the next level in the hierarchy. Finally, Devices are followed by a System view that contains collections of Devices. The traversal upward in the hierarchy represents an abstraction process while a downward traversal represents the refinement process. Eugene currently operates at the Part and Device levels via explicit Part and Device data types while encapsulating the DNA level as Eugene Properties. doi:10.1371/journal.pone.0018882.g001
Before beginning, it should be pointed out that there are two approaches to design in Eugene:
1. Bottom-Up Design (BUD): BUD begins with low-level Properties, creates individual Parts, and then creates Devices. BUD is how libraries of Parts in Eugene will be created but requires a very detailed understanding of the system being created a priori.
2. Top-Down Design (TDD): TDD begins by specifying the Devices of interest and then instantiating Parts, and finally specifying the Properties that make up the Parts. TDD is a very natural way to design systems, but in the absence of the lower-level elements the design is incomplete.
Our example follows the TDD paradigm in the interest of clarity.
Step 1: Specify the Header Files. These files encapsulate information on libraries of Eugene design elements at your disposal. Eugene comes with pre-created sample Header Files. Users can create their own Header Files manually or automatically (more in the Supplemental Information). Here the Header Files are divided into categories detailing what they contain. This separation is not a requirement.
include PropertyDefinition.h, PartDefinition.h, PartDeclaration.h;
Step 2: Specify the Device(s). Devices are collections of 1) Parts or 2) other Devices. These must be specified in the body of the Eugene code or in a Header File. Here the Device is composed of eight Parts (ordered from 5′ to 3′). This syntax includes the Device type along with the name (for readability) but the type is optional (see Supplemental Information for alternate syntax).
Device );
Step 3: Instantiate the Part(s). This entails specifying the Property values of the Part(s). This can be done in the main body of the code or in the Header File. In this case, it will be inside of PartDeclaration.h. For brevity, we only show the sixth of the eight parts in the Device. All eight Parts will have to be specified. Alternate syntax without explicitly assigning values to Properties exists as well (see Supplemental Information).
Promoter BBa_J23116(.ID("BBa_J23116"), .Sequence("GATCTttgacagctagctcagtcctagggactatgctagcG"), .Orientation("Forward"));
Step 4: Declare the Part(s). Parts are collections of Properties. This again can be captured in Header Files (e.g. in PartDefinition.h) or in the main body. Here we show all four Part types in the design. It is the job of the designer to decide which Properties make up the individual parts. Notice the "Promoter" Part has a Property "Inducible" which can remain unspecified in Step 3.
Part Promoter(ID, Sequence, Orientation, Inducible);
Part ORF(ID, Sequence, Orientation, CDS);
Part RBS(ID, Sequence, Orientation);
Part Terminator(ID, Sequence, Orientation, Strength);
Step 5: Declare the Properties. Properties are text, number, or Boolean values (either arrays or single values). These represent biological characteristics associated with the design. They can be manually specified or pulled from repositories (more in the Supplemental Information).
Property ID(txt);
Property Sequence(txt);
Figure 2. Eugene based synthetic biology design flow. Shown here is the role that Specification, Design, Assembly, and Data can play in synthetic biology. In particular, we illustrate that Eugene is concerned with the activities at the specification level explicitly but at the same time it is designed in such a way that it develops designs that are amenable to other pieces of this design flow. Opportunities for the flow to provide feedback to earlier stages and perform iterative refinement are outlined in red. doi:10.1371/journal.pone.0018882.g002
The final design for BBa_K112809 is shown in Figure 4. Ten experimentally created Devices representative of MIT's Registry of Standard Biological Parts were created to explore the process of specifying Devices using Eugene. Table 1 in file Appendix S1 captures this exploration. Specific information on these Devices and the Eugene code for their designs can be found in the Supplemental Information.
The purpose of this exercise was to demonstrate the significance of separating Part and lower-level Property information, which is hidden in the Header Files, from the Device-level construction in the main Eugene file. As a result of this separation, the main file uses an average of 85% less code. At the same time, the ratio of DNA base pairs to total lines of code (an average of 139:1) suggests that very complex designs can be ported compactly to other tools or systems. Sharing designs becomes much easier, since the underlying data structure and programming interface are created automatically when Eugene designs are interpreted. Design interpretation times are short (average of 95.2 ms). We are confident that as designs move to encompass tens or hundreds of devices, the interpretation time will remain reasonable.
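The layering walked through in Steps 1 to 5 (Properties, Part types, Part instances, Devices) can be mirrored in a few lines of Python. This is only an illustrative sketch: the class names PartType, PartInstance, and Device are hypothetical and do not come from the Eugene implementation, and the check that rejects undeclared Properties is an assumption about how such a hierarchy might be validated. The Part values are the ones shown in Step 3, and "Inducible" is deliberately left unspecified, as the text allows.

```python
from dataclasses import dataclass, field

# Hypothetical Python mirror of Eugene's Property -> Part -> Device layering.
# None of these class names are taken from the Eugene implementation.


@dataclass(frozen=True)
class PartType:
    """A Part declaration: a name plus the Properties it may carry."""
    name: str
    properties: tuple


@dataclass
class PartInstance:
    """A Part instantiation: values for some (not necessarily all) Properties."""
    part_type: PartType
    name: str
    values: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumed validation: a value may only be given for a declared Property.
        unknown = set(self.values) - set(self.part_type.properties)
        if unknown:
            raise ValueError(f"{self.name}: undeclared properties {sorted(unknown)}")


@dataclass
class Device:
    """An ordered collection of Parts, listed 5' to 3'."""
    name: str
    parts: list

    def sequence(self):
        # Concatenate the Sequence Property of each Part in order.
        return "".join(p.values.get("Sequence", "") for p in self.parts)


# Step 4 analogue: declare a Part type.
promoter = PartType("Promoter", ("ID", "Sequence", "Orientation", "Inducible"))

# Step 3 analogue: instantiate a Part; "Inducible" stays unspecified.
bba_j23116 = PartInstance(promoter, "BBa_J23116", {
    "ID": "BBa_J23116",
    "Sequence": "GATCTttgacagctagctcagtcctagggactatgctagcG",
    "Orientation": "Forward",
})

# Step 2 analogue: a (one-Part, purely illustrative) Device.
device = Device("ExampleDevice", [bba_j23116])
print(device.sequence())
```

Keeping the Part type separate from the Part instance is what lets the Device-level code stay short: the Device only names its Parts, while sequences and other Properties live elsewhere, much as they live in Header Files in the Eugene example.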

Design Space Exploration
The Methods section illustrates how to specify Devices with Eugene. This is only one very limited aspect of Eugene. Design Space Exploration (DSE) is Eugene's primary task. DSE in this context consists of two phases: design expansion and design pruning.
A cell surface display system built by the UC Berkeley Wetlab 2009 iGEM team went through the DSE process. This cell surface display system exposes various peptides or proteins to the extracellular environment by anchoring them to the outer membrane of E. coli. The genetic Device for such a system is composed of three categories of protein domains: passenger domains, displayer domains, and structural spacer elements. An example of such a Device is shown in Figure 5. The individual Parts for this Device are explained briefly in Table 2 in the file Appendix S1.

Design Expansion
There are two types of cell surface display Devices (more details in the Supplemental Information):
//A passenger/spacer/displayer/terminator Device
Device DeviceType1 (PassNeedle, SpacerINP, Disp_upaG, T01);
//Permute this device to switch out each Part instance
permute(DeviceType1);
//A passenger/displayer/terminator Device
Device DeviceType2 (PassNeedle, Disp_upaG, T01);
permute(DeviceType2);
These four lines of code generate 540 Devices created from the basic Parts specified initially in Eugene. Figure 6 illustrates both how our initial design space consisted of these two devices created with two lines of code and how it grew to 540 Devices with the addition of two permute functions (four lines total). Figure 7 is a heat map showing the results of assaying cell surface display Devices for functionality depending on the type of passenger used in the Device. The quantitative data sets from these assays were normalized to an appropriate control and can be used to analyze the functionality of each combination of passenger, displayer, and spacer element.
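The expansion step can be read as a Cartesian product over the Part choices available for each position in a Device. The Python sketch below illustrates that reading only; it is not Eugene's permute implementation. The function takes a list of Part types rather than a Device, the library is a toy one, and "SpacerX" is a hypothetical Part name added so that the spacer position has more than one choice. The library sizes are not those of the real 540-Device design space.

```python
from itertools import product

# Toy Part library keyed by Part type. Sizes are illustrative only and
# "SpacerX" is a hypothetical name that does not appear in the paper.
LIBRARY = {
    "Passenger": ["PassNeedle", "PassStrep", "PassMgfp"],
    "Spacer": ["SpacerINP", "SpacerX"],
    "Displayer": ["Disp_upaG", "Disp_CPG6", "Disp_AIDA"],
    "Terminator": ["T01"],
}


def permute(template):
    """Return every Device obtained by swapping each position in the
    template for every library Part of the same type (assumed semantics)."""
    choices = [LIBRARY[part_type] for part_type in template]
    return [tuple(combo) for combo in product(*choices)]


# Analogue of DeviceType1: passenger/spacer/displayer/terminator.
type1 = permute(["Passenger", "Spacer", "Displayer", "Terminator"])
# Analogue of DeviceType2: passenger/displayer/terminator.
type2 = permute(["Passenger", "Displayer", "Terminator"])

# The design space size is the sum over Device types of the product of
# the number of choices at each position: 3*2*3*1 + 3*3*1 here.
print(len(type1), len(type2), len(type1) + len(type2))
```

The same counting rule explains why a handful of Device templates can expand into hundreds of candidates once each Part type has several instances in the library.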

Design Pruning
In order to reduce the design space from the original 540 Devices to the 135 Devices in Figure 7, we added an additional 13 lines of code (Figure 6).
Assert(NoAg4 AND NoLeu AND NoCell AND NeedleSpacers1-4 AND StrepSpacers1-4);
Figure 5. Here are shown the three Part types (passengers, spacers, and displayers) which when combined into a Device made up the systems that we explored. As shown, the displayer interacts with the outer membrane of the bacterial cell to display the passenger protein extracellularly.
We next reduced these Devices to six sets of fifteen Devices (90 total). These sets were combinations of three types of passengers,
Rule Rule4(NOTCONTAINS PassMgfp);
Finally, to reduce the design space to only the 3 most active Devices, we add 3 lines of code.
//Removes all but the last 3 Devices
Rule Rule5((PassStrep WITH Disp_CPG6) OR (PassStrep WITH Disp_AIDA));
Rule Rule6(PassNeedle WITH Disp_upaG);
Assert (Rule5 AND Rule6);
With more data on these Parts, such as molecular weight, shape, efficiency, and data relevant to surface displayers, we could create more informative Properties. This would lead to more detailed, powerful rules in the future. These rules would allow more specific pruning of the combinatorial space, and the ease and specificity of the reduction would be greater still.
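Pruning can be pictured as filtering the expanded design space with predicates. The following Python sketch treats each rule as a function from a Device (a tuple of Part names) to a Boolean. The semantics used here are assumptions made for illustration: WITH is taken to mean that both Parts occur in the same Device, and NOTCONTAINS that the Part is absent. Eugene's actual rule semantics may differ, and the helper names are hypothetical. The rule combination shown is a toy one, not a transcription of the Assert statements above.

```python
from itertools import product

# Assumed rule semantics, for illustration only.


def not_contains(part):
    """NOTCONTAINS analogue: the Device must not include the Part."""
    return lambda device: part not in device


def with_(a, b):
    """WITH analogue: both Parts occur in the same Device."""
    return lambda device: a in device and b in device


def any_of(*rules):
    """OR analogue over rules."""
    return lambda device: any(rule(device) for rule in rules)


def all_of(*rules):
    """AND analogue over rules."""
    return lambda device: all(rule(device) for rule in rules)


def prune(devices, rule):
    """Assert analogue: keep only the Devices that satisfy the rule."""
    return [d for d in devices if rule(d)]


# A small expanded design space: 3 passengers x 3 displayers x 1 terminator.
devices = list(product(
    ["PassNeedle", "PassStrep", "PassMgfp"],
    ["Disp_upaG", "Disp_CPG6", "Disp_AIDA"],
    ["T01"],
))

rule = all_of(
    not_contains("PassMgfp"),
    any_of(with_("PassStrep", "Disp_CPG6"), with_("PassNeedle", "Disp_upaG")),
)
kept = prune(devices, rule)
print(len(devices), "->", len(kept))
```

Because rules are ordinary predicates in this sketch, richer Properties (for example molecular weight) would simply become additional predicates composed with the same AND/OR combinators.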

Assembly and Simulation Integration
As shown, Eugene ultimately produces collections of Devices which both adhere to specific constraints and encapsulate Parts and Properties. There are two natural next steps in the design process:
1. Automated Assembly: 1) Determine an optimal global assembly strategy for all Devices [20]. 2) Create assembly files for a liquid handling robotic platform [21]. Figure 8 illustrates this design flow. This was carried out with the help of Clotho [22], [23] (http://www.clothocad.org).
2. Simulation: Convert the underlying Eugene data structures to an exchange format for external simulation programs. We illustrate this process with the Synthetic Biology Software Suite (SynBioSS) [24].

Automated Assembly
We created a "protein tagging" (PT) system which uses combinatorial tagging of ORFs to optimize protein expression and purification, and test protein-protein interactions, by quickly creating iterations of functional designs. Our PT systems consisted of the component types in Table 3 in file Appendix S1.
Devices were created so that each Device would encode two different ORFs where each was tagged with a different tag, either on the N- or C-terminus of the ORF. Tags were always separated from ORFs by a protease cleavage site (such that tags and ORFs can be physically separated from each other). Thus, each ORF-tag combo is made of three basic parts (one ORF, one tag, and one cleavage site between them). Therefore, a two ORF-tag architecture contains six basic parts. Since proper protein expression of a Device also requires a promoter and a terminator, each Device consists of eight basic parts in total (the six above, plus a promoter, plus a terminator). In all cases, the first Part is always a promoter, and the last Part is always a terminator. The order of the six middle Parts varies according to the desired topology of the ORF-tag combos. These four Device types result in 2304 Devices using Eugene's permute function. We next use Rules to prevent the same antibody type of nTag or cTag from appearing in a Device. These Rules take three forms (where X is the specific tag antibody from the 12 different Part choices):
//These rules prevent specific tag combinations
Rule r1a(ctagX NOTWITH ntagX); //for CN and NC type Devices
Rule r1b(ctagX NOTMORETHAN once); //for CC type Devices
Rule r1c(ntagX NOTMORETHAN once); //for NN type Devices
This reduces the number of Devices to 2112 Devices. We were only interested in Devices with distinct protein-tag set combinations. This is a total of 528 Devices. See the Supplemental Information for the complete Eugene code.
Automated assembly for Eugene based Devices occurs as follows (Figure 8):
1. Create Device specifications in Eugene using Header Files created by a Clotho compatible database.
2. Use a Clotho App (e.g. Spectacles [25] or Eugene Scripter) to read in the Eugene code.
3. Clotho assembly algorithms [20] produce files for a liquid handling robot based on information provided by the Clotho connection to the database (e.g. well location, sample volume, etc).
The assembly was carried out in 3 separate rounds (or stages). In stage 1, we used 31 basic Parts to assemble 56 composite Parts (made of 2 basic Parts each). In stage 2, we used the Parts made in stage 1 to assemble 48 composite Parts (made of 4 basic Parts each). In the final stage, we used the Parts made in stage 2 to assemble 528 composite Parts (made of 8 basic Parts each). The 528 bi-cistronic operons contain a total of 3696 junctions between Parts (528 × 7), of which only 632 are unique (56 + 48 + 528) and therefore had to be made, since a unique junction only needs to be made once. Assuming $3 per Part junction and an amortized time of 10 minutes per Part junction, we estimate that this saved around $9000 ($11,088 − $1,896 = $9,192) and 500 hrs (36,960 min − 6,320 min = 30,640 min ≈ 510 hrs).
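The savings estimate can be reproduced with a short calculation. The Python sketch below uses only figures quoted in the text ($3 and 10 minutes per junction, 528 Devices of 8 Parts, and 56, 48, and 528 composite Parts per stage). The one assumption, labeled in the code, is that each composite Part built in a stage costs exactly one new junction, which is consistent with the 632 unique junctions reported.

```python
# Figures quoted in the text.
DEVICES = 528
PARTS_PER_DEVICE = 8
STAGE_COMPOSITES = [56, 48, 528]   # composite Parts built in stages 1, 2, 3
COST_PER_JUNCTION = 3              # dollars
MINUTES_PER_JUNCTION = 10

# Building every Device independently: 7 junctions per 8-Part Device.
naive_junctions = DEVICES * (PARTS_PER_DEVICE - 1)

# Assumption: each composite Part joins two existing pieces with one new
# junction, so shared intermediates are only built once.
unique_junctions = sum(STAGE_COMPOSITES)

saved_junctions = naive_junctions - unique_junctions
cost_saved = saved_junctions * COST_PER_JUNCTION
hours_saved = saved_junctions * MINUTES_PER_JUNCTION / 60

print(naive_junctions, unique_junctions)
print(f"saved ${cost_saved} and about {hours_saved:.0f} hours")
```

The saving comes entirely from re-using intermediate composite Parts: roughly 83% of the junctions in the naive plan never have to be made.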

Simulation
For simulation, we chose to look at a classic genetic regulatory network, namely a "repressilator" [26]. The example repressilator used here is based on a lac-tet-ara oscillatory network examined by Tuttle et al. [27]. The overall behavior is that LacI represses the expression of TetR, which represses the expression of AraC, which in turn represses expression of LacI. See Figure 9 for an illustration of a repressilator. We decided to examine a repressilator because its behavior is well understood and it can be composed of primitive parts. It also provides a point of comparison with other tools in the literature (e.g. GEC [28]).
SynBioSS is a software suite for the generation, storage, and quantitative simulation of synthetic biological networks. One component of this software suite, called SynBioSS Designer, uses biological rules to create a reaction network given a series of biological parts, such as promoters and ribosome binding sites, and the spatial and temporal connectivity of these parts [29]. This reaction network represents the transcription, translation, and regulation occurring in the system. SynBioSS Designer outputs this reaction network as either a NetCDF or SBML file to be used in simulation software of the user's choice. We use SynBioSS for this investigation but Eugene could be used with a variety of simulation tools (e.g. Tinkercell [30]).
The Eugene code for this design is provided in the Supplemental Information. We provide a small sample here to give the reader a feel for some key elements of the repressilator design.
The following Property definitions form the pool of parameters to be associated with Parts in the repressilator:
Property Sequence(txt); //The DNA sequence for the part
Property Neg35StartEnd(txt); //Promoter information
Property Neg10StartEnd(txt); //Promoter information
Property OperatorSites(txt[]); //An array of promoter information
Property CorrespondingProtein(txt); //Which protein the part produces
Property ProteinBindingInfo(txt); //Protein interaction information
The following Part definitions form the set of Part types in the repressilator and the Properties associated with them:
Part Promoter(Sequence, Neg35StartEnd, Neg10StartEnd, OperatorSites, OperatorSiteLocations);
Part RBS(Sequence);
Part CodingDNA(Sequence, CorrespondingProtein, ProteinBindingInfo);
Part Terminator(Sequence);
The following example Part declarations specify the actual physical Parts in the repressilator:
Promoter araP(); //lacI and tetR promoters created as well
RBS rbs1(); //two other RBS created as well
CodingDNA DNAlac(); //tetR and araC ORFs created as well
Terminator term1();
The following rules constrain Devices to use Parts in such a way to give rise to the repressilator behavior:
Rule promoterToCoding1(araP BEFORE DNAlac);
Rule promoterToCoding2(lacP BEFORE DNAtet);
Rule promoterToCoding3(tetP BEFORE DNAara);
Assert(promoterToCoding1 AND promoterToCoding2 AND promoterToCoding3);
Finally, the repressilator Device is declared with the specific ordering of these Parts:
Device Repressilator(araP, rbs1, DNAlac, term1, lacP, rbs2, DNAtet, term2, tetP, rbs3, DNAara, term3);
SynBioSS Designer loads this Eugene code for simulation. Specifically, Designer uses SimpleXML to load the XML produced as an artifact of Eugene interpretation. SimpleXML is a PHP extension which converts XML to an array with the same structure as the original XML. This array is then manipulated to have a structure compatible with all of Designer's algorithms. A diagram of this design flow is shown in Figure 9.
Figure 8. Illustration of an automated assembly flow beginning with a Eugene file for a protein tagging (PT) Device with nTag and cTag Parts. This shows the eight Parts that make up the Device and the order in which the Parts must be assembled to have a functional Device. In the Eugene import process, the Devices of interest are captured with Eugene and processed by a Clotho App (e.g. Spectacles). Later the Device construction is planned for a specific assembly protocol with the creation of an assembly graph. In the final phase, the files for a liquid handling robot are created and fed to the platform doing the assembly. doi:10.1371/journal.pone.0018882.g008
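The ordering constraints on the repressilator can be checked mechanically once a Device is viewed as an ordered list of Part names. The Python sketch below is an illustration under an assumed reading of BEFORE: both Parts must be present, and the first must occur at an earlier position than the second. Eugene may define BEFORE differently (for example, for absent Parts), and the helper name before is hypothetical.

```python
def before(first, second):
    """BEFORE analogue (assumed semantics): both Parts are present and
    `first` occurs at an earlier position than `second`."""
    def rule(device):
        return (
            first in device
            and second in device
            and device.index(first) < device.index(second)
        )
    return rule


# Part order taken from the Repressilator Device declaration in the text.
repressilator = [
    "araP", "rbs1", "DNAlac", "term1",
    "lacP", "rbs2", "DNAtet", "term2",
    "tetP", "rbs3", "DNAara", "term3",
]

# Analogues of promoterToCoding1..3.
rules = [
    before("araP", "DNAlac"),
    before("lacP", "DNAtet"),
    before("tetP", "DNAara"),
]

# Assert analogue: the Device is valid only if every rule holds.
valid = all(rule(repressilator) for rule in rules)
print(valid)
```

A stricter check could also require that no other promoter sits between a promoter and its coding sequence; the three rules in the text do not state that, so it is left out here.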

Discussion
Eugene is a language in development. We have illustrated a very brief snapshot of its capabilities. Here are future directions for the language:
Control Flow Extensions: It will be important to incorporate other control statements into Eugene. The language will require the ability to systematically iterate through lists, which can be achieved through loops. This will be useful when different combinations of Parts or Devices need to be traversed and some operations on them performed.
Functional Extensibility: The user should have the ability to create custom functions as well. This mechanism could resemble other imperative programming languages. This process would introduce the importance of scope in variables and instances, since functions should only apply to specific scoped instances of variables. Currently, all variable instances in a file can be accessed globally.
Explicit Database Support: Another potential strength in a language like Eugene is the direct access to a database of Parts. By providing an explicit function to connect to a specified database, we would certainly give more expressive power to the language. Currently, database access is performed outside of Eugene by translating XML information from the database to Eugene code.
Abstraction Level: Currently, the highest level in the design hierarchy is the "Device Level". Ideally, we would like to extend Eugene to contain Systems and the ability to operate on such a level by providing built-in functions, which will depend on new assembly standards.
Constraint Scope: Currently, rules are based on Part instances but not Part definitions. For example, a rule will be based on Promoter P1 but not across all Promoters. In many cases, it would be much more appropriate to apply rules to Part definitions to not only save on programming effort but also increase the expressiveness of the constraint system.
Constraint Application: Currently, rules are applied to Device composition. However, if one wanted to make a rule regarding two Devices, this is currently not possible. The introduction of a "System" level of abstraction with System level wide rules could address this.
Figure 9. High-level diagram of a repressilator as well as its Eugene implementation. Here the relationship between LacI, TetR, and AraC and the promoters in the system is shown. This design was chosen since its behavior is well understood and can be easily decomposed into the individual Parts that make up the Device. The SynBioSS design flow with Eugene is also shown. Beginning with the Eugene XML produced by the Eugene interpreter, SimpleXML creates an array which holds the data from Eugene. After a reorganization process the data can now be transformed by SynBioSS into a reaction network in SBML or NetCDF which can then be simulated. A sample of the reaction network generated by SynBioSS Designer is also provided. These reactions describe the unregulated expression of TetR, as well as its dimerization and degradation. All rate laws are elementary and all kinetic data is in SI units unless otherwise noted. Asterisks indicate gamma-distributed reactions. doi:10.1371/journal.pone.0018882.g009
We are also aware that there are a number of existing languages and tools in this domain. In particular, we consider comparisons to Systems Biology Markup Language (SBML) [31], Antimony [32], GenoCAD [33], Genetic Engineering of living Cells (GEC) [28], Proto [34], Tinkercell [30], and CellML [35] particularly relevant. In the Supplemental Information we address these comparisons directly. Broadly speaking, we feel Eugene offers certain advantages in the areas of flexibility, ease of use, interoperability with other tools, reflection of synthetic biology design flows, and extensibility.

Summary
We have introduced the Eugene programming language for synthetic biology. In particular, we have illustrated flexible Part and Device specification and composition, combinatorial design space exploration of Devices using an expressive system of Rules, and interaction with other tools for simulation and automated assembly. We have also provided ample Supplemental Materials with comparisons to other approaches, additional information regarding our results, a complete set of Eugene designs, and more information regarding how to write Eugene programs.

Availability
Eugene is available at http://www.eugenecad.org. This is an open source project covered broadly under a BSD general license. The download includes all the examples provided here along with documentation regarding how to use the tool. In addition, the grammar file used to create Eugene is available. Eugene requires Java 6 (http://java.sun.com/javase/6) to run. We encourage questions and comments.
Eugene is most effectively used with other tools as illustrated in this paper. Clotho is available at http://www.clothocad.org. It too is an open source project under BSD. We highly recommend Notepad++ for the creation of Eugene files and we provide a Notepad++ syntax highlighter with the Eugene download. You can get Notepad++ at http://sourceforge.net/projects/notepadplus/. SynBioSS is available at http://synbioss.sourceforge.net.