Conceived and designed the experiments: JP. Performed the experiments: YC MWL LA. Analyzed the data: YC MWL JP. Wrote the paper: YC MWL JP.
The authors have declared that no competing interests exist.
Recognizing that certain biological functions can be associated with specific DNA sequences has led various fields of biology to adopt the notion of the genetic part. This concept provides a finer level of granularity than the traditional notion of the gene. However, a formal method for relating a set of parts to a function has not yet emerged. Synthetic biology both demands such a formalism and provides an ideal setting for testing hypotheses about relationships between DNA sequences and phenotypes beyond the gene-centric methods used in genetics. Attribute grammars are used in computer science to translate the text of a program's source code into the computational operations it represents. By associating attributes with parts, modifying the values of these attributes using rules that describe the structure of DNA sequences, and using a multi-pass compilation process, it is possible to translate DNA sequences into molecular interaction network models. These capabilities are illustrated by simple example grammars expressing how gene expression rates depend upon single or multiple parts. The translation process is validated by systematically generating, translating, and simulating the phenotype of all the sequences in the design space generated by a small library of genetic parts. Attribute grammars represent a flexible framework connecting parts with models of biological function. They will be instrumental for building mathematical models of libraries of genetic constructs synthesized to characterize the function of genetic parts. This formalism is also expected to provide a solid foundation for the development of computer-assisted design applications for synthetic biology.
Deciphering the genetic code has been one of the major milestones in our understanding of how genetic information is stored in DNA sequences. However, only part of the genetic information is captured by the simple rules describing the correspondence between genes and proteins. The molecular mechanisms of gene expression are now understood well enough to recognize that DNA sequences are rich in functional blocks that do not code for proteins. It has proved difficult to express the function of these genetic parts in a computer-readable format that could be used to predict the emerging behavior of DNA sequences combining multiple interacting parts. We show that methods used by computer scientists to develop programming languages can be applied to DNA sequences. They provide a framework to: 1) express the biological functions of genetic parts, 2) describe how these functions depend on the context in which the parts are placed, and 3) translate DNA sequences composed of multiple parts into a model predicting how the DNA sequence will behave in vivo. Our approach provides a formal representation of the biological function of genetic parts that can assist in the engineering of synthetic DNA sequences by automatically generating models of the design for analysis.
“How much can a bear bear?” This riddle uses two homonyms of the word “bear”. The first instance of the word is a noun referring to an animal, and the second is a verb meaning “endure”. Although the word “bear” has over 50 different meanings in English, its meaning in any given sentence is rarely ambiguous. In a simple case like this riddle, the meaning of each word can be deciphered by looking at other words in the same sentence. In other cases, it is necessary to take into account a broader context to properly interpret the word. For instance, it may be necessary to read several sentences to decide if “bear claw” refers to a body part or a pastry. A reader will progressively derive the meaning of a text by recognizing structures consistent with the language grammar. It is often difficult to understand the meaning of a text by relying exclusively on a dictionary.
It is interesting to compare this bottom-up emergence of meaning with the top-down approach that made genetics so successful. The discipline was built upon a quest to define hereditary units that could be associated with observable traits well before the physical support of heredity was discovered.
Yet, despite its success, the notion of the gene appears insufficient to express the complexity of the relation between an organism's genome and its phenotype.
It is becoming apparent that the genetic code captures only a small fraction of the information content of DNA molecules.
Synthetic biology is likely to be instrumental in refining our understanding of the design of natural biological systems.
One possible approach to this problem is to extend the linguistic metaphor used to formulate the central dogma. The notions of genetic code, transcription, and translation are derived from a linguistic representation of biological sequences. Several authors have modeled the structure of various types of biological sequences using syntactic models.
We recently described a fairly simple syntactic model of synthetic DNA sequences.
Attribute grammar | An attribute grammar is a context free grammar augmented with attributes, semantic rules, and conditions. Attribute grammars were developed as a means of formalizing the semantics of a context free grammar. |
Context free grammar | A context free grammar is a quadruple (V, Σ, P, S) where V is a finite set of non-terminal symbols, Σ (the alphabet) is a finite set of terminal symbols, P is a finite set of rules, and S is a distinguished element of V called the start symbol. A rule in P has the form A→ω, where A is a single non-terminal symbol and ω is a string of terminals and/or non-terminals (possibly empty). The term “context-free” expresses the fact that non-terminals are rewritten without regard to the context in which they occur. |
Cusp bifurcation | A codimension 2 bifurcation formed by the tangential meeting of two loci of saddle-node bifurcations. In other words, a cusp bifurcation traces the path of the points bounding a bistable region as they change with changes in two parameters. Bistability is implied within the cusp bounds. |
Left recursion | A direct left recursion in a context free grammar refers to rules of the form A→Aω. Parsing left recursion can lead the parser down an infinite branch of the search tree in the corresponding logic program. |
Polymerase per second (PoPS) | A measure of the number of RNA polymerases per second transcribing past a defined point of DNA. |
SBML | The Systems Biology Markup Language (SBML) is a machine-readable language, based on XML, for representing models of biochemical reaction networks. |
Semantics | Semantics reveals the meaning of syntactically valid strings in a language. For natural languages, this means correlating sentences and phrases with the objects, thoughts, and feelings of our experiences. For programming languages, semantics describes the behavior that a computer follows when executing a program in the language. |
Syntax | Syntax refers to the ways symbols may be combined to create well-formed sentences (or programs) in a language. Syntax defines the formal relations between the constituents of a language, thereby providing a structural description of the various expressions that make up legal strings in the language. Syntax deals solely with the form and structure of symbols in a language without any consideration given to their meaning. |
The translation of a gene network model from a genetic sequence is very similar to the compilation of the source code of a computer program into an object code that can be executed by a microprocessor (
The input for this process is a DNA sequence that is first broken down into parts by the scanner. The combination of the parts is validated by the parser according to a syntactic model. After validation by the parser, the sequence is translated by applying semantic actions attached to the rules to transform the series of parts into a set of chemical equations. The resulting equations can then be solved using existing simulation engines. Each step takes the output of the previous step as input, so the workflow can start from any step if the appropriate input is provided.
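The scanner, parser, and translation steps described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used for this report: the part names (`pLac`, `rbsA`, `gfp`, `t1`) are hypothetical, the scanner works on a space-separated list of part names rather than on raw nucleotides, the parser only accepts a fixed promoter, RBS, gene, terminator cassette layout, and the reaction strings are simplified placeholders.

```python
# Hypothetical part library: part name -> category (illustrative names only).
PART_CATEGORY = {
    "pLac": "promoter",
    "rbsA": "rbs",
    "gfp": "gene",
    "t1": "terminator",
}

# Assumed cassette layout: promoter, then a cistron (RBS + gene), then terminator.
CASSETTE = ("promoter", "rbs", "gene", "terminator")


def scan(design):
    """Scanner: break a design into (category, name) tokens."""
    tokens = []
    for name in design.split():
        if name not in PART_CATEGORY:
            raise ValueError(f"unknown part: {name}")
        tokens.append((PART_CATEGORY[name], name))
    return tokens


def parse(tokens):
    """Parser: validate that the tokens form one or more cassettes."""
    size = len(CASSETTE)
    if not tokens or len(tokens) % size != 0:
        raise SyntaxError("a construct must contain whole cassettes")
    cassettes = []
    for i in range(0, len(tokens), size):
        chunk = tokens[i:i + size]
        if tuple(category for category, _ in chunk) != CASSETTE:
            raise SyntaxError(f"invalid cassette at token {i}")
        cassettes.append({category: name for category, name in chunk})
    return cassettes


def translate(cassettes):
    """Semantic actions: emit simplified reaction strings per cassette."""
    equations = []
    for c in cassettes:
        mrna = f"mRNA_{c['rbs']}{c['gene']}"  # transcript named rbs + gene
        protein = f"P_{c['gene']}"
        equations.append(f"{c['promoter']} -> {c['promoter']} + {mrna}")  # transcription
        equations.append(f"{mrna} -> {mrna} + {protein}")  # translation
        equations.append(f"{mrna} -> ")  # mRNA degradation
        equations.append(f"{protein} -> ")  # protein degradation
    return equations


# Each step consumes the output of the previous one.
equations = translate(parse(scan("pLac rbsA gfp t1")))
```

Because each function takes the previous step's output as its only input, the chain can be entered at any stage, for example by calling `translate` directly on an already validated list of cassettes.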
In the derivation tree, terms in <> correspond to the non-terminals in the grammar, while terms in [ ] are terminals, and the dashed lines indicate the transformation to terminals. The subscripts are used to distinguish different instances of the same category.
We have developed a simple grammar compact enough to be presented extensively, yet sufficiently complex to represent basic epistatic interactions. The grammar generates constructs composed of one or more gene expression cassettes. The gene expression cassettes are themselves composed of a promoter, cistron, and transcription terminator. Finally, a cistron is composed of a Ribosome Binding Site (RBS) and a coding sequence (gene). The syntax is composed of 12 production rules (P1 to P12) displayed in bold characters in
The attributes of a part include the kinetic rates related to this part and the interaction information. For example, the attributes of a promoter include a transcription rate along with a list of proteins repressing it and the kinetic parameters of the protein-DNA interactions. For non-terminal variables corresponding to combinations of parts such as cistrons, the attributes include a list of proteins, a list of promoters, and a list of chemical equations. The equation list is used to store the model of the system behavior, while the lists of promoters and proteins are recorded for computing the molecular interactions resulting from the DNA sequence. The complete set of attributes used in this simple grammar is listed in
Non-terminals | Inherited Attribute | Synthesized Attributes |
constructs | protein_list | promoter_list, equation_list |
cassette | protein_list | promoter_list, equation_list |
restConstructs | protein_list | promoter_list, equation_list |
cistron | protein_list | transcript, equation_list |
promoter | - | name, transcription_rate, leakiness_rate, repressor_list |
RBS | - | name, translation_rate |
gene | - | name, mRNA_degradation_rate, protein_degradation_rate |
terminator | - | name |
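One way to picture the attribute table is as a set of record types, one per symbol. The following Python sketch is only an illustration: the field names follow the table, but the class names, the use of dataclasses, and all numeric values are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Promoter:
    # Terminal: all attributes are synthesized from the part itself.
    name: str
    transcription_rate: float
    leakiness_rate: float
    repressor_list: List[str] = field(default_factory=list)


@dataclass
class RBS:
    name: str
    translation_rate: float


@dataclass
class Gene:
    name: str
    mRNA_degradation_rate: float
    protein_degradation_rate: float


@dataclass
class Terminator:
    name: str


@dataclass
class Cassette:
    # Non-terminal: protein_list is inherited (passed down from the parent),
    # promoter_list and equation_list are synthesized (passed back up).
    protein_list: List[str]
    promoter_list: List[str] = field(default_factory=list)
    equation_list: List[str] = field(default_factory=list)


# Placeholder values, not measured parameters.
promoter = Promoter("pX", transcription_rate=0.5, leakiness_rate=0.01,
                    repressor_list=["repY"])
cassette = Cassette(protein_list=["repX", "repY"],
                    promoter_list=[promoter.name])
```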
While many attributes can be computed locally by considering only a small fragment of the DNA sequence, other attributes are global properties of the system. For instance, the computation of protein-DNA interactions requires access to a global list of the proteins expressed by the constructs. However, this list is not available until all of the different cassettes have been parsed. The problem is overcome by using a multiple-pass compilation method. In the first pass, the compiler does not do any structural validation but builds the list of proteins in the system and passes the list as an inherited attribute to the second pass. In the second pass, the promoter-protein interactions can be calculated locally at the level of each cassette. Rules P1 to P5 define the structure of a design, while rules P6 to P12 cover the selection of a specific part for each category. In the semantic action, the relation between an attribute and its variable is indicated by a dot and constants are enclosed by brackets. For instance,
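The two-pass scheme described in this paragraph can be sketched in a few lines of Python. This is a simplified illustration, not the compiler used in this report: the cassette dictionaries, the part names, and the `repressors` lookup table are all hypothetical.

```python
def first_pass(cassettes):
    """Pass 1: no validation, only collect every protein the design expresses."""
    return [c["gene"] for c in cassettes]


def second_pass(cassettes, protein_list, repressors):
    """Pass 2: with the global protein list inherited, resolve the
    promoter-protein interactions locally, one cassette at a time."""
    interactions = []
    for c in cassettes:
        for protein in repressors.get(c["promoter"], []):
            if protein in protein_list:  # repressor is actually expressed
                interactions.append((protein, c["promoter"]))
    return interactions


# Hypothetical two-cassette design in which each gene product
# represses the promoter of the other cassette.
cassettes = [
    {"promoter": "pA", "gene": "repB"},
    {"promoter": "pB", "gene": "repA"},
]
# Hypothetical promoter attribute: which proteins can repress which promoter.
repressors = {"pA": ["repA"], "pB": ["repB", "repC"]}

protein_list = first_pass(cassettes)
interactions = second_pass(cassettes, protein_list, repressors)
```

In this sketch `repC` can repress `pB` in principle but is not expressed by the design, so it produces no interaction. That decision cannot be made while parsing the first cassette alone, which is why the protein list has to be built before the interactions are computed.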
The translation of the DNA sequence into a mathematical model is available as the
The semantic model presented in the previous section is completely modular since the parameters of the model describing the construct behavior are attributes of individual parts, not of higher order structures. For instance, in the previous model (
P5. cistron → rbs, gene
{
    cistron.translation_rate = get_translation_rate(rbs, gene)
    cistron.transcript = rbs.name+gene.name
    cistron.equation_list = translation(rbs, gene, cistron.translation_rate)
}
The get_translation_rate function checks for specific cases of interactions between an RBS and coding sequence first. If none is found, then the default RBS translation rate is used.
If exists translation_rate(rbs, gene)
    translation_rate = translation_rate(rbs, gene)
else
    translation_rate = translation_rate(rbs)
endif
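The same lookup with a fallback can be written as a short Python function. This is a sketch under assumptions: the two dictionaries stand in for however the part attributes are actually stored, and the numeric values are illustrative placeholders rather than measured rates.

```python
def get_translation_rate(rbs, gene, pair_rates, default_rates):
    """Return the rate for a specific (RBS, gene) pair if one is recorded,
    otherwise fall back to the default rate of the RBS alone."""
    if (rbs, gene) in pair_rates:
        return pair_rates[(rbs, gene)]
    return default_rates[rbs]


# Placeholder attribute values (arbitrary units).
default_rates = {"RBS WT": 1.0, "RBS1": 1.0}
pair_rates = {("RBS WT", "ORF4"): 0.03}

rate_specific = get_translation_rate("RBS WT", "ORF4", pair_rates, default_rates)
rate_default = get_translation_rate("RBS WT", "ORF2", pair_rates, default_rates)
```

Keeping the pair-specific entries in a separate table means that only the combinations known to interact need extra parameters; every other combination stays fully modular.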
This approach is illustrated in
Mutant | RBS | ORF | Expression | Translation rate function |
1 | RBS WT | ORF WT | 100 | translation_rate(RBS WT) |
6 | RBS WT | ORF2 | 100 | translation_rate(RBS WT) |
7 | RBS WT | ORF3 | 100 | translation_rate(RBS WT) |
17 | RBS WT | ORF4 | 3 | translation_rate(RBS WT, ORF4) |
20 | RBS WT | ORF5 | 6 | translation_rate(RBS WT, ORF5) |
23 | RBS WT | ORF6 | 0.3 | translation_rate(RBS WT, ORF6) |
4 | RBS1 | ORF WT | 100 | translation_rate(RBS1) |
2 | RBS1 | ORF1 | 100 | translation_rate(RBS1) |
3 | RBS1 | ORF2 | 100 | translation_rate(RBS1) |
5 | RBS1 | ORF3 | 4 | translation_rate(RBS1, ORF3) |
14 | RBS1 | ORF4 | <0.003 | translation_rate(RBS1, ORF4) |
9 | RBS2 | ORF WT | 100 | translation_rate(RBS2) |
8 | RBS2 | ORF1 | 100 | translation_rate(RBS2) |
10 | RBS2 | ORF3 | 100 | translation_rate(RBS2) |
12 | RBS3 | ORF WT | 100 | translation_rate(RBS3) |
11 | RBS3 | ORF1 | 20 | translation_rate(RBS3, ORF1) |
13 | RBS3 | ORF3 | 100 | translation_rate(RBS3) |
15 | RBS4 | ORF4 | 0.1 | translation_rate(RBS4) |
16 | RBS5 | ORF4 | 0.05 | translation_rate(RBS5) |
22 | RBS6 | ORF WT | 0.2 | translation_rate(RBS6, ORF WT) |
18 | RBS6 | ORF4 | 80 | translation_rate(RBS6) |
21 | RBS7 | ORF WT | 100 | translation_rate(RBS7) |
19 | RBS7 | ORF4 | 100 | translation_rate(RBS7) |
The semantic model in
Each section A to F indicates a different selection of repressors within a toggle switch: (A)
To demonstrate the potential use of a semantic model to search for a desirable behavior in a large genetic design space, we generated all 41,472 possible DNA sequences (72² × 8 RBS for the reporter gene) having the same structure as previously described switches. All sequences were translated into separate model files and a script was developed to perform a bistability analysis of each model. Parameters of the semantic model were obtained by qualitatively matching the experimental results of the six previously published switches.
Bistability was tested numerically by integrating the differential equations until they converged to a steady state starting from two different initial conditions. The two initial conditions started with one protein level very high and the other very low and vice versa. We characterized the bistability by computing the ratio of reporter concentration for the two steady state values. In order to globally verify the behavior of this large population of models, we focused on the 3,072 constructs potentially capable of bistability, 1,408 of which were found to be bistable. We further reduced the number of constructs used to verify the translation process from 3,072 to 384 by assuming that two constructs differing only in the RBS in 5′ of the reporter gene would produce the same ratio of steady state values.
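The numerical test described above can be illustrated with a short Python sketch. It rests on several assumptions that are not part of this report: it uses the standard dimensionless two-repressor toggle model (du/dt = α1/(1 + v^n) − u, dv/dt = α2/(1 + u^n) − v) instead of the models generated by the grammar, a fixed-step forward Euler integrator instead of a dedicated simulation engine, and the first repressor level u as a stand-in for the reporter concentration.

```python
def steady_state(alpha1, alpha2, u0, v0, n=2.0, dt=0.01, tol=1e-9,
                 max_steps=1_000_000):
    """Integrate the toggle model with forward Euler until both
    derivatives fall below tol (or max_steps is reached)."""
    u, v = u0, v0
    for _ in range(max_steps):
        du = alpha1 / (1.0 + v ** n) - u
        dv = alpha2 / (1.0 + u ** n) - v
        if abs(du) < tol and abs(dv) < tol:
            break
        u += dt * du
        v += dt * dv
    return u, v


def bistability_ratio(alpha1, alpha2, high=10.0, low=0.0):
    """Ratio of the steady-state levels of u reached from two opposite
    initial conditions; a ratio close to 1 means a single steady state."""
    u_a, _ = steady_state(alpha1, alpha2, high, low)  # u high, v low
    u_b, _ = steady_state(alpha1, alpha2, low, high)  # u low, v high
    return max(u_a, u_b) / min(u_a, u_b)


ratio_strong = bistability_ratio(10.0, 10.0)  # strong mutual repression
ratio_weak = bistability_ratio(1.0, 1.0)      # weak mutual repression
```

With strong mutual repression the two runs settle into different states and the ratio is large; with weak repression both runs converge to the same state and the ratio stays at 1, so a threshold on this ratio can be used to classify a model as bistable.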
This example demonstrates the benefit of building a semantic model of synthetic DNA sequences. Even a small library of genetic parts can generate large numbers of artificial gene networks having no more than a few interacting genes. A syntactic model describing how parts can be combined into constructs is a compact representation of the genetic design space generated from the parts library. While it is possible to manually build mathematical models capturing the dynamics of some of these artificial gene networks individually, it becomes desirable to automate the process to ensure model consistency when building large families of related models derived from the same parts library. By considering genetic parts as the terminal symbols of an attribute grammar, it becomes possible to automatically generate models of numerous artificial gene networks derived from this parts library and quickly identify the optimal designs.
The parameter values used in the previous example were selected to match an extremely small set of six experimental data points. Although the under-determination of the model does not make it possible to precisely estimate the value of these parameters, the example illustrates how the framework could provide valuable guidance in selecting specific parts for a design. Since exact parameter values for parts remain a distant prospect, the automatic exploration of the design space presented here will provide useful guidance in construct design. For example, robust constructs from the cusp interior of the
The approach presented in this report will be implemented in GenoCAD.
A function description language called Genetic Engineering of living Cells (GEC) was recently introduced to specify the properties of a design.
Still, the scripts developed to generate our results are of lesser importance than the application of the theory of semantics-based translation using attribute grammars to the translation of DNA sequences into dynamical models representing the molecular interactions they encode. Since this approach is used to develop the compilers of many computer languages
Ultimately, tools capable of automatically generating models of the behavior of synthetic DNA sequences will be important for the advancement of synthetic biology.
Before it is used to build synthetic genetic systems meeting user-defined specifications, the semantic model of DNA sequences presented in this report will be instrumental in the quantitative characterization of structure-function relationships in synthetic DNA sequences. The vision of applying quantitative engineering methods to biological problems has been recognized as a promising avenue to biological discovery.
Ongoing efforts aim to carefully define how parts should fit together syntactically and what attributes are needed to characterize their function. For example, the sequence between the RBS and the start codon has been shown to play an important role in translation rate.
Computation dependence corresponding to the derivation tree in
(0.01 MB PDF)
List of parts used in the “exploration of genetic space” section and values of associated attributes
(0.01 MB PDF)
Zip file containing the scripts and data used in this report.
(0.03 MB ZIP)
The authors would like to thank Drs. Jacques Cohen and Mark Cooper for their critical reading of an early version of this manuscript, Stephan Hoops for helping us use COPASI in batch mode, and Emily Alberts for her editorial skills.