Formal reasoning about systems biology using theorem proving

System biology provides the basis to understand the behavioral properties of complex biological organisms at different levels of abstraction. Traditionally, analysing systems biology based models of various diseases have been carried out by paper-and-pencil based proofs and simulations. However, these methods cannot provide an accurate analysis, which is a serious drawback for the safety-critical domain of human medicine. In order to overcome these limitations, we propose a framework to formally analyze biological networks and pathways. In particular, we formalize the notion of reaction kinetics in higher-order logic and formally verify some of the commonly used reaction based models of biological networks using the HOL Light theorem prover. Furthermore, we have ported our earlier formalization of Zsyntax, i.e., a deductive language for reasoning about biological networks and pathways, from HOL4 to the HOL Light theorem prover to make it compatible with the above-mentioned formalization of reaction kinetics. To illustrate the usefulness of the proposed framework, we present the formal analysis of three case studies, i.e., the pathway leading to TP53 Phosphorylation, the pathway leading to the death of cancer stem cells and the tumor growth based on cancer stem cells, which is used for the prognosis and future drug designs to treat cancer patients.


Introduction
The discovery and design of effective drugs for infectious and chronic biological diseases, like cancer and cerebral malaria, require a deep understanding of behaviorial and structural characteristics of underlying biological entities (e.g., cells, molecules and enzymes). Traditional approaches, which rely on verbal and personal intuitions without concrete logical explanations of biological phenomena, often fail to provide a complete understanding of the behavior of such diseases, mainly due to the complex interactions of molecules connected through a chain of reactions. Systems biology [1] overcomes these limitations by integrating mathematical modeling and high-speed computing machines in the understanding of biological processes and thus provides the ability to predict the effect of potential drugs for the treatment of chronic diseases. System biology is widely used to model the biological processes as pathways or networks. Some of the examples are signaling pathways and protein-protein interaction networks [2]. These biological networks such as gene regulatory networks (GRNs) or biological regulatory networks (BRNs) [3], are analysed using the principles of molecular biology. This analysis, in turn, plays an important role for the investigation of the treatment of various human infectious diseases as well as future drug design targets. For example, the BRNs analysis has been recently used for the prediction of treatment decisions for sepsis patients [4].
Traditionally, biologists analyze biological organisms (or different diseases) using wet-lab experiments [5,6]. These experiments cannot provide reliable analysis due to their inability to accurately characterize the complex biological processes in an experimental setting. Moreover, the experiments take a long execution time and often require an expensive experimental setup. One of the other techniques used for the deduction of molecular reactions is the paper-andpencil proof method (e.g. Boolean modeling [7] or kinetic logic [8]). But the manual proofs in paper-and-pencil proof methods, become quite tedious for large systems, where several hundred proof steps are required in order to calculate the unknown parameters, thus prone to human error. Other alternatives for analyzing system biology problems include computerbased techniques (e.g. Petri nets [9] and model checking [10]). Petri net is a graph based technique [11] for analyzing system properties. In model checking, a system is modeled in the form of state-space or automata and the intended properties of the system are verified in a model checker by a rigorous state exploration of the system model. Theorem proving [12] is another formal methods technique that is widely used for the verification of the physical systems but has been rarely used for analyzing system biology related problems. In theorem proving, a computer-based mathematical model of the given system is constructed and then deductive reasoning is used for the verification of its intended properties. A prerequisite for conducting the formal analysis of a system is to formalize the mathematical or logical foundations that are required to model the system in an appropriate logic.
Zsyntax [13] is a recently proposed formal language that supports the modeling of any biological process and presents an analogy between a biological process and the logical deduction. It has some pre-defined operators and inference rules that are used for the logical deductions about a biological process. These operators and inference rules have been designed in such a way that they are easily understandable by the biologists, making Zsyntax a biologist-centered formalism, which is the main strength of this language. However, Zsyntax does not support specifying the temporal information associated with biological processes. Reaction kinetics [14], on the other hand, caters for this limitation by providing the basis to understand the time evolution of molecular populations involved in a biological network. This approach is based on the set of first-order ordinary differential equations (ODEs) also called reaction rate equations (RREs). Most of these equations are non-linear in nature and difficult to analyze but provide very useful insights for prognosis and drug predictions. Traditionally, the manual paperand-pencil technique is used to reason logically about biological processes, which are expressed in Zsyntax. Similarly, the analysis of RREs is performed by either paper-and-pencil based proofs or numerical simulation. However, both methods suffer from the inherent incompleteness of numerical methods and error-proneness of manual proofs. We believe that these issues cannot be ignored considering the critical nature of this analysis due to the involvement of human lives. Moreover, biological experiments based on erroneous parameters, derived by the above-mentioned approaches may also result in the loss of time and money, due to the slow nature of wet-lab experiments and the cost associated with the chemicals and measurement equipment.
In this paper, we propose to develop a formal reasoning support for system biology to analyze complex biological systems within the sound core of a theorem prover and thus provide accurate analysis results in this safety-critical domain. By formal reasoning support, we mean to develop a set of generic mathematical models and definitions, a process that is usually termed as formalization, of commonly used notions of system biology using an appropriate logic and ascertain their properties as formally verified theorems in a theorem prover, which is a verification tool based on deductive reasoning. These formalized definitions and formally verified theorems can then in turn be used to develop formal models of real-world system biology problems and thus verify their corresponding properties accurately within the sound core of a theorem prover. The use of logic in modeling and a theorem prover in the verification leads to the accuracy of the analysis results, which cannot be ascertained by other computational approaches. In our recent work [15], we developed a formal deduction framework for reasoning about molecular reactions by formalizing the Zsyntax language in the HOL4 theorem prover [16]. In particular, we formalized the logical operators and inference rules of Zsyntax in higher-order logic. We then built upon these formal definitions to verify two key behavioral properties of Zsyntax based molecular pathways [17,18]. However, it was not possible to reason about biological models based on reaction kinetics due to the unavailability of the formal notions of reaction rate equations (a set of coupled differential equations) in higherorder logic. In order to broaden the horizons of formal reasoning about system biology, this paper presents a formalization of reaction kinetics along with the development of formal models of generic biological pathways without the restriction on the number of molecules and corresponding interconnections. Furthermore, we formalize the transformation, which is used to convert biological reactions into a set of coupled differential equations. This step requires multivariate calculus (e.g., vector derivative, matrices, etc.) formalization in higher-order logic, which is not available in HOL4 and therefore we have chosen to leverage upon the rich multivariable libraries of the HOL Light theorem prover [19] to formalize the above mentioned notions and verify the reactions kinetics of some generic molecular reactions. To make the formalization of Zsyntax [15] consistent with the formalization of reaction kinetics in HOL Light, as part of our current work, we ported all of the HOL4 formalization of Zsyntax to HOL Light. In order to illustrate the usefulness and effectiveness of our formalization, we present the formal analysis of a molecular reaction representing the TP53 Phosphorylation [13], a molecular reaction of pathway leading to the death of cancer stem cells (CSC) and the analysis of tumor growth based on the CSC [20].

Related work
In the last few decades, various modeling formalisms of computer science have been widely used in system biology. We briefly outline here the applications of computational modeling and analysis approaches in system biology, where the main idea is to transform a biological model into a computer program.
Process algebra (PA) [21] provides an expressive framework to formally specify the communication and interactions of concurrent processes without ambiguities. Biological systems can be considered as concurrent processes and thus process algebra can be used to model biological entities [22]. Some recent work in this area includes the formalizations of molecular biology based on K-Calculus [23] and π-Calculus [24]. The main tools that support PA in biology are sCCP [25], BioShape [26] and Bio-PEPA [27]. Even though PA based biological modeling provides sound foundations, it may be quite difficult and cumbersome for working biologists to understand these notations [28,29].
Rule-based modeling offers a flexible and simple framework to model various biochemical species in a textual or graphical format. This allows biologists to perform the quantitative analysis [30,31] of complex biological systems and predict important underlying behaviors. Some of the main rule-based modeling tools are BioNetGen [30], Kappa [32] and BIOCHAM [33].
These tools are mainly based on rewriting and model transformation rules along with the integration with model checking tools and numerical solvers. However, these integrations are usually not checked for correctness (for example by an independent proof assistant), which may lead to inconsistencies [34].
Boolean networks [35] are used to characterize the dynamics of gene-regulatory networks by limiting the behavior or genes by either a truth state or false state. Some of the major tools that support the Boolean modeling of biological systems are BoolNet [36], BNS [37] and GINsim [38]. The discrete nature of Boolean networks does not allow us to capture continuous biological evolutions, which are usually represented by differential equations.
Model checking has shown very promising results in many applications of molecular biology [39][40][41][42]. Hybrid systems theory [43] extends the state-based discrete representation of traditional model checking with a continuous dynamics (described in terms ODEs) in each state. Some of the recently developed tools that support the hybrid modeling of biological systems are S-TaLiRo [44], Breach toolbox [45] and dReach [46]. Recently, Petri nets have been widely used to model biological networks [47,48] and some of the important associated tools include Snoopy [49] and GreatSPN [50]. However, the graph or state based nature of the models in these methods only allow the description of some specific areas of molecular biology [13,51]. Moreover, the model checking technique has an inherent state-space explosion problem [52], which makes it only applicable to the biological entities that can acquire a small set of possible levels and thus limits its scope by restricting its usage on larger systems.
In a system analysis based on theorem proving, we need to formalize the mathematical or logical foundations required to model and analyze that system in an appropriate logic. Several attempts have been made to formalize the foundations of molecular biology. The first attempt at some basic axiomatization dates back to 1937 [53]. Zanardo et al. [54] and Rizzotti et al. [55] have also done some efforts towards the formalization of biology. But all these formalizations are paper-and-pencil based and have not been utilized to formally reason about molecular biology problems within a theorem prover. In our recent work [15], we developed a formal deduction framework for reasoning about molecular reactions by formalizing the Zsyntax language in the HOL4 theorem prover [16]. However, a major limitation of this work is that it cannot cater for the temporal information associated with biological processes and, hence, does not support modeling the time evolution of molecular populations involved in a biological network, which is of a dire need when studying the dynamics of a biological system. Reaction kinetics [14] provide the basis to understand the time evolution of molecular populations involved in a biological network. To overcome the limitation of the work presented by Sohaib et al. [15], we provide the formalization of reaction kinetics in higher-order logic and in turn extend the formal reasoning about system biology.

Higher-order-logic theorem proving and HOL Light theorem prover
In this section, we provide a brief introduction to the higher-order-logic theorem proving and HOL Light theorem prover.

Higher-order-logic theorem proving
Theorem proving involves the construction of mathematical proofs by a computer program using axioms and hypothesis. Theorem proving systems (theorem provers) are widely used for the verification of hardware and software systems [56,57] and the formalization (or mathematical modeling) of classical mathematics [58][59][60]. For example, hardware designers can prove different properties of a digital circuit by using some predicates to model the circuits model. Similarly, a mathematician can prove the transitivity property for real numbers using the axioms of real number theory. These mathematical theorems are expressed in logic, which can be a propositional, first-order or higher-order logic based on the expressibility requirement.
Based on the decidability or undecidability of the underlying logic, theorem proving can be done automatically or interactively. Propositional logic is decidable and thus the sentences expressed in this logic can be automatically verified using a computer program whereas higher-order logic is undecidable and thus theorems about sentences, expressed in higherorder logic, have to be verified by providing user guidance in an interactive manner.
A theorem prover is a software for deductive reasoning in a sound environment. For example, a theorem prover does not allow us to conclude that " x x ¼ 1" unless it is first proved or assumed that x 6 ¼ 0. This is achieved by defining a precise syntax of the mathematical sentences that can be input in the software. Moreover, every theorem prover comes with a set of axioms and inference rules which are the only ways to prove a sentence correct. This purely deductive aspect provides the guarantee that every sentence proved in the system is actually true.
HOL Light theorem prover. HOL Light [19] is an interactive theorem prover used for the constructions of proofs in higher-order logic. The logic in HOL Light is represented in meta language (ML), which is a strongly-typed functional programming language [61]. A theorem is a formalized statement that may be an axiom or could be deduced from already verified theorems by an inference rule. Soundness is assured as every new theorem must be verified by applying the basic axioms and primitive inference rules or any other previously verified theorems/inference rules. A HOL Light theory is a collection of valid HOL Light types, axioms, constants, definitions and theorems, and is usually stored as an ML file in computers. Users interacting with HOL Light can reload a theory and utilize the corresponding definitions and theorems right away. Various mathematical foundations have been formalized and stored in HOL Light in the form of theories by the HOL Light users. HOL Light theories are organized in a hierarchical fashion and child theories can inherit the types, constants, definitions and theorems of the parent theories. The HOL Light theorem prover provides an extensive support of theorems regarding Boolean variables, arithmetics, real numbers, transcendental functions, lists and multivariate analysis in the form of theories which are extensively used in our formalization. The proofs in HOL Light are based on the concept of tactics which break proof goals into simple subgoals. There are many automatic proof procedures and proof assistants [62] available in HOL Light, which help the user in concluding a proof more efficiently.

Proposed framework
The proposed theorem proving based formal reasoning framework for system biology, depicted in Fig 1, allows the formal deduction of the complete pathway from any given time instance and model and analyze the ordinary differential equations (ODEs) corresponding to a kinetic model for any molecular reaction. For this purpose, the framework builds upon existing higher-order-logic formalizations of Lists, Pairs, Vectors, and Calculus.
The two main rectangles in the higher-order logic block present the foundational formalizations developed to facilitate the formal reasoning about the Zsyntax based pathway deduction and the reaction kinetics. In order to perform the Zsyntax based molecular pathway deduction, we first formalize the functions representing the logical operators and inference rules of Zsyntax in higher-order logic and verify some supporting theorems from this formalization. This formalization can then be used along with a list of molecules and a list of Empirically Valid Formulae (EVFs) to formally deduce the pathway for the given list of molecules and provide the result as a formally verified theorem using HOL Light. Similarly, we have formalized the flux vectors and stoichiometric matrices in higher-order-logic. These foundations can be used along with a given list of species and the rate of the reactions to develop a corresponding ODEs based kinetic reactions model. The solution to this ODE can then be formally verified as a theorem by building upon existing formalizations of Calculus theories.
The distinguishing characteristics of the proposed framework include the usage of deductive reasoning to derive the deduced pathways and solutions of the reaction kinetic models. Thus, all theorems are guaranteed to be correct and explicitly contain all required assumptions.

Formalization of Zsyntax
Zsyntax [13] is a formal language which exploits the analogy between biological processes and logical deduction. Some of its key features are that: 1) it enables us to represent molecular reactions in a mathematical rigorous way; 2) it is of heuristic nature, i.e., if the initialization data and the conclusion of a reaction is known, then it allows us to deduce the missing data based on the initialization data; and 3) it possesses computer implementable semantics. Zsyntax has three operators namely Z-Interaction, Z-Conjunction and Z-Conditional that are used to represents different phenomenon in a biological process. These are the atomic formulas residing in the core of Zsyntax. Z-Interaction ( Ã ) represents the reaction or interaction of two molecules. In biological reactions, the Z-interaction operation is not associative. i.e., in a reaction having three molecules namely A, B and C, the operation ( is used to form the aggregate of the molecules participating in the biological process. These molecules can be same or different. Unlike the Z-Interaction operator, the Z-Conjunction is fully associative. Z-Conditional (!) is used to represent a path from A to B when condition C becomes true, i.e., A ! B if there is a C allowing it. To apply the above-mentioned operators on a biological process, Zsyntax provides four inference rules that are used for the deduction of the outcomes of the biological reactions. These inference rules are given in Table 1.
Zsyntax also utilizes the EVFs which are the empirical formulas validated in the lab and are basically the non-logical axioms of molecular biology. A biological reaction can be mapped and then these above-mentioned Zsyntax operators and inference rules are used to derive the final outcome of the reaction as shown in [13].
We start our formalization of Zsyntax, by formalizing the molecule as a variable of arbitrary data type (α) [18]. Z-Interaction is represented by a list of molecules (α list), which is a molecular reaction among the elements of the list. This (α list) may contain only a single element or it can have multiple elements. We model the Z-Conjunction operator as a list of list of molecules ((α list) list), which represents a collection of non-reacting molecules. Using this data type, we can apply the Z-Conjunction operator between individual molecules (a list with a single element), or between multiple interacting molecules (a list with multiple elements). Thus, based on our datatype, Z-Conjunction is a list of Z-interactions for both of these cases, i.e., individual molecules or multiple interacting molecules. So, overall, Z-conjunction acts as a set of Z-interaction. When a new set of molecules is generated based on the EVFs available for a reaction, the status of the molecules is updated using the Z-Conditional operator. We model each EVF as a pair of data type (α list # α list list) where the first element of the pair is a list of the molecules represented by data type (α list) and are actually the reacting molecules, whereas, the second element is a list of list of molecules ((α list) list), which represents a set of molecules that are obtained as a result of the reaction between the molecules of the first element of the pair Table 1. Zsyntax inference rules.

Inference Rules Definition
and thus act as a set of Z-Interactions. A collection of EVFs is formalized using the data type ((α list # α list list) list), which is a list of EVFs. Next, we formalize the inference rules using higher-order logic. The inference rule named elimination of the Z-Conditional (!E) is equivalent to the Modus Ponens (the elimination of implication rule) law of propositional logic. Similarly, we can infer introduction of Z-Conditional (!I) rule from the existing rules of the propositional logic present in a theorem prover. Thus, both of these rules can be handled by the simplification and rewriting rules of the theorem prover and we do not need to define new rules for handling these inference rules. To check the presence of a particular molecule in an aggregate of some inferred molecules, the elimination of the Z-Conjunction (& E) rule is used. We apply it at the end of the biological reaction to check whether the product of the reaction is the desired molecule or not. We formalized this rule by a function (Table 2: zsyn_conjun_elimin), which accepts a list l and an element x and checks if x is present in this list. If the condition is true, it returns the given element x as a single element of that list l. Otherwise, it returns the list l as is, as shown in Fig 2a. The Z-Interaction and the introduction of Z-Conjunction (& I) rule jointly enable us to perform a reaction between different molecules during the experiment. This rule is basically the append operation of lists, based on the above data types defined in our formalization. The function zsyn_conjun_intro, given in Table 2, represents this particular rule. It takes a list l and two of its elements x and y, and appends the list of these two elements on its head as shown in Fig 2b. According to laws of stoichiometry [13], we have to delete the initial reacting molecules from the main list, for which the Z-Conjunction operator is applied. Our formalization of this behavior is represented by the function zsyn_delet, given in Table 2 and depicted in Fig 2c. The function zsyn_delet accepts a list l and two numbers x and y and deletes the x th and y th elements of the given list l. The function checks if the index x is greater than the index y, i.e., x > y. If the condition is true, then it deletes the x th element first and then the y th element. Similarly, if the condition x > y is false, then it deletes the y th element first and then the x th element. In this deletion process, to make sure that the deletion of first element will not affect the index of the other element that has to be deleted, we delete the element present at the higher index of list before the deletion of the lower indexed element.
We aim to build a framework that takes the initial molecules of a biological experiment along with the possible EVFs and enables us to deduce its corresponding final outcomes. Towards this, we first write a function zsyn_EVF, given in Table 2 and depicted in Fig 2d, that takes a list of initial molecules and compares its particular combination with the corresponding EVFs and if a match is found then it adds the newly resulted molecule to initial list after deleting the instance that have already been consumed. The function zsyn_EVF takes a list of molecules l and a list of EVFs e and compares the head element of the list l to all of the elements of the list e. Upon finding no match, this function returns a pair having first element as false (F), which acts as a flag and indicates that there is no match between any of the EVFs and the corresponding molecule, whereas the second element of the pair is the tail of the corresponding list l of the initial molecules. If a match is found, then the function will return a pair with its first element as a true (T), which indicates the confirmation of the match that have been found, and the second element of the pair is the modified list l, whose head is removed, and the second element of the corresponding EVF pair is added at the end of the list and the matched elements are deleted as these have already been consumed.
Next, we have to call the function zsyn_EVF recursively, for the deduction of the final outcome of the experiment and for each of the recursive case, we place each of the possible combinations of the given molecules (elements at indices x and y of list l) at the head of l one by

Name Formalized Form Description
Elimination of Z-Conjunction Rule l e x y. zsyn_recurs1 l e x (y + 1) = if FST (zsyn_EVF (zsyn_conjun_intro l x (y + 1)) e (LENGTH e -1) x (y + 1)) , T then zsyn_EVF (zsyn_conjun_intro l x (y + 1)) e (LENGTH e -1) x (y + 1) else zsyn_recurs1 l e x y  one. This whole process can be done using functions zsyn_recurs1 and zsyn_recurs2, given in Table 2. In the function zsyn_recurs1, we first place the combination of molecules indexed by variables x and y at the top of the list l using the introduction of Z-Conjunction rule. Then, this modified list l is passed to the function zsyn_EVF, which is recursively called by the function zsyn_recurs1. Moreover, we instantiate the variable p of the function zsyn_EVF with the length of the EVF list (LENGTH e -1) so that every new combination of the list l is compared with all the elements of the list of EVFs e. The function zsyn_re-curs1 terminates upon finding a match in the list of EVFs and returns true (T) as the first element of its output pair, which acts as a flag for the status of this match. The second function zsyn_recurs2 checks, if a match in the list of EVFs e is found (if the flag returns true (T)) then it terminates and returns the output list of the function zsyn_recurs1. Otherwise, it recursively checks for the match with all of the remaining values of the variable x. In the case of a match, these two functions zsyn_recurs1 and zsyn_recurs2 have to be called all over again with the new updated list. This iterative process continues until no match is found in the execution of these functions. This overall behaviour can be expressed in HOL Light by the recursive function zsyn_deduct_recurs, given in Table 2. In order to guarantee the Formal reasoning about systems biology using theorem proving correct operation of deduction, we instantiate the variable of recursion (q) with a value that is greater than the total number of EVFs so that the application of none of the EVF is missed. Similarly, in order to ensure that all the combinations of the list l are checked against the entries of the EVF list e, the value LENGTH l -1 is assigned to both of the variables x and y. Thus, the final deduction function for Zsyntax can be modeled as the function zsyn_deduct, given in Table 2. The function zsyn_deduct accepts the initial list of molecules l and the list of valid EVFs e and returns a list of final outcomes of the experiment under the given conditions. Next, in order to check, if the desired molecule is present in this list (the output of the function zsyn_deduct), we apply the elimination of the Z-Conjunction rule presented as function zsyn_conjun_elimin, given in Table 2. More detail about the behavior of all of these functions can be found in our proof script [63]. These formal definitions enable us to check recursively all of the possible combinations of the molecules, present in the initial list l, against each of the first element of the list of EVFs e. Upon finding a match, the reacting molecules are replaced by their outcome in the initial list of molecules l by applying the corresponding EVF. This process is repeated on the current updated list of molecules until there are no further molecules reacting with each other. The list l at this point contains the post-reaction molecules. Finally, the elimination of the Z-Conjunction rule zsyn_conjun_elimin, given in Table 2, is applied to obtain the desired outcome of the given biological experiment.
In order to prove the correctness of the formal definitions presented above, we verify a couple of key properties of Zsyntax involving operators depicting the vital behaviour of the molecular reactions. The first verified property captures the scenario when there is no reacting molecule present in the initial list of the experiment. As a result of this scenario, the postexperiment molecules are the same as the pre-experiment molecules. The second property deals with the case when there is only one set of reacting molecules in the given initial list of molecules and in this scenario we verify that after the execution of the Zsyntax based experiment, the list of post-experiment molecules contains the products of the reacting molecules minus its reactant along with the remaining non-reacting molecules provided at the beginning of the experiment. We formally specified both of these properties, representing the no reaction and single reaction scenarios in higher-order logic using the formal definitions presented earlier in this section. The formal verification results about these properties are given in Table 3 and more details can be found in the description of their formalization [18,63]. The formalization presented in this section provides an automated reasoning support for the Zsyntax based molecular biological experiments within the sound core of HOL Light theorem prover.

Formalization of reaction kinetics
Reaction kinetics [64] is the study of rates at which biological processes interact with each other and how the corresponding processes are affected by these reactions. The rate of a reaction provides the information about the evolution of the concentration of the species (e.g., molecules) over time. A process is basically a chain of reactions, called pathway, and the investigation about the rate of a process implies the rate of these pathways. Generally, biological reactions can be either irreversible (unidirectional) or reversible (bidirectional). We formally define this fact by an inductive enumerating data-type reaction_type, given in Table 4.
In order to analyze a biological process, we need to know its kinetic reaction based model, which comprises of a set of m species, X = {X 1 , X 2 , X 3 ,. . ., X m } and a set of n reactions, R = {R 1 , R 2 , R 3 ,. . ., R n }. An irreversible reaction R j , {1 j n} can generally be written as: Similarly, a reversible reaction R j , {1 j n} can be described as: . . þ s m;j X m . The coefficients s 1;j ; s 2;j ; . . . ; s m;j ; s 1;j ; s 2;j ; . . . ; s m;j are the non-negative integers and represent the stoichiometries of the species taking part in the reaction. The non-negative integer k j is the kinetic rate constant of the irreversible reaction. The non-negative integers k j f and k j r are the forward and reverse kinetic rate constants of the reversible reaction, respectively [65]. In a biological reaction, we model a biological entity as a pair (N, R), where the first element represents the stoichiometry and the second element is the concentration of the molecule. We formally model a biological reaction as the type abbreviation bio_reaction [63], given in Table 4. The dynamic behavior of the biological systems is described by a set of ordinary differential equations (ODEs) and the evolution of the system is captured by analyzing the change in the concentration of the species (i.e., time derivatives): d½X i dt ¼ P n j¼1 n i;j v j , where n i, j is the stoichiometric coefficient of the molecular species X i in reaction R j (i.e., n i;j ¼ s i;j À s i;j ). The parameter v j represents the flux of the reaction R j , which can be computed by the law of mass action [14],  Formal reasoning about systems biology using theorem proving Table 4. Definitions of reaction kinetics formalization.

Biological Reaction
Reaction Type define_type "reaction_type" = irreversible | reversible" Reaction type (reversible or irreversible) defined by an inductive enumerating data-type Biological Reaction new_type_abbrev "bio_reaction", Biological reaction is a pair with reaction type as the first element and a 3-tuple as the second element with the following components: • (N × R)list: List of reactants, where N is the stiochiometry and R represents the concentration of a reactant • (N × R)list: List of products, where N is the stiochiometry and R represents the concentration of a product • (R × R): The first element R is the forward kinetic rate constant and the second element R is the reverse kinetic rate constant for reversible reaction. A zero here indicates a irreversible reaction.

Product of the Concentrations
It takes a list of reactants in the form of a pair and returns a real number, which is the product of the concentration raised to the power of the stoichiometry of all the reactants in the reaction.
gen_flux_irreversible reactants_list products_list rate = rate Ã flux_irr reactants_list It takes a list of reactants, a list of products and the first element of the kinetic rate constant pair and returns the flux of an irreversible reaction.

Flux of a Reversible Reaction
' 8 rate_1 reactants_list rate_2 products_list. gen_flux_reversible reactants_list products_list rate_1 rate_2 = rate_1 Ã flux_irr reactants_list -rate_2 Ã flux_irr products_list It takes a list of reactants, a list of products and the forward kinetic rate constant, reverse kinetic rate constant and returns the flux of a reversible reaction.

Flux of a Single Reaction
' 8 t R P k1 k2. flux_sing (t,R,P, k1,k2) = if t = irreversible then gen_flux_irreversible R P k1 else gen_flux_reversible R P k1 k2 The defintions gen_flux_irreversible and gen_flux_reversible are combined into a uniform defintion. The function flux_sing takes a biological reaction, which can be a reversible or irreversible reaction, and returns the corresponding flux of that reaction. It accepts a list of the reactants and a list of products and returns a list containing the corresponding column of the stoichiometric matrix.

(Continued)
Formal reasoning about systems biology using theorem proving i.e., the rate (also called flux) of a reaction is proportional to the concentration of the reactant (c) raised to the power of its stoichiometry (s), i.e., c s . We define the function gen_flux_irreversible,given in Table 4, to obtain the flux of an irreversible reaction [63].
A reversible reaction can be divided into two irreversible reactions with the forward kinetic rate constant and the reverse kinetic rate constant, respectively. The rate/flux of a reversible reaction is obtained by taking the differences of the fluxes of the two irreversible reactions. We formally define the flux of a reversible reaction by the function gen_flux_reversible, given in Table 4. Next, we combine the functions gen_flux_irreversibleand gen_flux_reversible into one uniform function flux_single (Table 4)[63] to obtain the flux of a single reaction.
For all reactions from 1 to n of a biological system, the flux becomes a flux vector as v = (v 1 , v 2 ,. . ., v n ) T and the system of ODEs can be written in the vectorial form as: d½X dt ¼ Nv, where [X] = (X 1 X 2 ,. . ., X n ) T is a vector of the concentration of all of the species participating in the reaction and N is the stoichiometric matrix of order m × n. We can obtain the flux vector v for a chain of reactions of a biological system by the function flux [63], given in Table 4.
Next, we formalize the notion of stoichiometric matrix N by the function st_matrix [63] given in Table 4. Finally, in order to formalize the left-hand side of above vector equation, i.e., d½X dt , we define a function entities_deriv_vec which takes a list containing the concentrations of all species and returns a vector with each element represented in the form of a realvalued derivative.
We can utilize this infrastructure to model arbitrary biological networks consisting of any number of reactions. For example, a biological network consisting of a list of E biological species and M biological reactions can be formally represented by the following general kinetic model: ððentities deriv vec E tÞ : real^mÞ ¼ transpððst matrix MÞ : real^m^nÞ Ã Ã flux M We used the formalization of the reaction kinetics to verify some generic properties of biological reactions, such as irreversible consecutive reactions, reversible and irreversible mixed

Name Formalized Form Description
Vector of the Stoichiometric Matrix Column ' 8 t k1 k2 R P. st_matrix_sing (t,R,P,k1,k2) = vector (stioch_mat_column R P) It takes a single biological reaction (bio_reaction) and returns a vector (R m ), which corresponds to the column of the stoichiometric matrix.
It takes a list of biological reactions and returns a stiochiometric matrix (in transposed form) using the MAP function, which applies the function st_matrix_sing on every element of the list M.

Derivative of a List of Functions
It takes a list containing the concentrations of all the species taking part in the reaction and maps a real derivative over each function of the list using the function real_derivative, which represents the real-valued derivative of a function Derivative of a Vector ' 8 L t. entities_deriv_vec L t = vector (map_real_deriv L t) It accepts a list containing the concentrations of species and returns a vector with each element represented in the form of a real-valued derivative, which is lefthand side of vector equation, i.e., d½X dt . https://doi.org/10.1371/journal.pone.0180179.t004 reactions. The main idea is to express the given biological network as a kinetic model and verify that the given solution (mathematical expression) of each biological entity satisfies the resulting set of coupled differential equations. This verification is quite important as such expressions are used to predict the outcomes of various drugs and to understand the time evolution of different molecules in the reactions of the biological systems.
The irreversible consecutive reactions. We consider a general irreversible consecutive reaction scheme as shown in Fig 3a. In the first reaction, A is the reactant and B is the product whereas k 1 represents the kinetic rate constant of the reaction. Similarly, in the second reaction, B is the reactant, C is the product and k 2 is its kinetic rate constant. We formally model this reaction scheme as a HOL Light function rea_sch_01, given in Table 5, and the formalization details are available as a technical report [66]. We moreover verify the solution of its kinetic model in HOL Light. The formal verification results are given in Table 6 [63].
The consecutive reactions with the second step being reversible. The second reaction scheme consists of the consecutive reactions with the second step being reversible as shown in Fig 3b. In the irreversible reaction, A and B are the reactant and product, respectively, whereas k 1 is the kinetic rate constant of the reaction. Since any reversible reaction can be written as two irreversible reactions, so the first irreversible reaction has B, C and the forward kinetic reaction constant k 2 as the reactant, product and the kinetic rate constant, respectively. Similarly, the parameters C, B and k 3 are the reactant, product and kinetic rate constant of the second irreversible reaction, respectively. We formally model this scheme as a HOL Light function rea_sch_02 [66] (Table 5) and then verified the solution for its ODE model given in Table 6 [63].
The consecutive reactions with the first step as a reversible reaction. In this scheme, the first reaction is reversible and the second reaction is irreversible as shown in Fig 3c. The reversible reaction can be equivalently written as two irreversible reactions with k 1 and k 2 as their kinetic rate constants. In the first irreversible reaction, A and B are the reactant and product, respectively, whereas in the second reaction, B and A are the reactant and product, respectively. For the second step, B, C and k 3 are the reactant, product and kinetic rate constant, respectively. The verified solution of the ODE model corresponding to this reaction scheme (rea_sch_03 [66], given in Table 5) is given in Table 6.
The consecutive reactions with a reversible step. In this reaction scheme, we consider the consecutive reactions with one reversible and one irreversible reaction step as shown in Fig  3d. The ODE model and solution corresponding to this reaction scheme (rea_sch_04 [66], given in Table 5) are given in Table 6.
This completes our formal verification of some commonly used reaction schemes. The verification of these solutions requires user interaction but the strength of these theorems lies in Table 5. Formal models of generic reaction schemes.

Name Formalized Form Description
The Irreversible Consecutive Reactions Formal reasoning about systems biology using theorem proving Table 6. Formal verification of reaction kinetics properties.

Case studies
In this section, we use our proposed framework to formally reason about three case studies: In the first, we formally analyse the reaction involving the phosphorylation of TP53 using our formalization of Zsyntax. In the second, we formally derive the time evolution expressions of different tumor cell types, which are used to predict the tumor population and volume at a given time instant, using our formalization of reaction kinetics. In the third, we take another model for the growth of tumor cells and perform both the Zsyntax and reaction kinetic based formal analysis using our proposed formalizations presented in the Result section of the paper. TP53 phosphorylation. TP53 gene encodes p53 protein, which plays a crucial role in regulating the cell cycle of multicellular organisms and works as a tumour suppressor for preventing cancer [13]. The pathway leading to TP53 phosphorylation (p(TP53)) is shown in Fig 4a. The green-colored circle represents the desired product, whereas, the blued-colored circles describe the chemical interactions in the pathway. Similarly, each rectangle in Fig 4a contains the total number of molecules at a given time. It can be clearly seen from the figure that whenever a biological reaction results into a product, the reactants get consumed, which satisfies the stoichiometry of a reaction. Now, we present the formal verification of pathway deduction from TP53 to p(TP53) using our formalization of Zsyntax, presented in the last section.
In classical Zsyntax format, the reaction of the pathway leading from TP53 to p(TP53) [13] can be represented by a theorem as TP53 & ATP & Kinase ' p(TP53). Based on our formalization, it can be defined as follows: In the above theorem, the first argument of the function zsyn_deduct represents the list of initial aggregate (IA) of molecules that are present at the start of the reaction, whereas the second argument is the list of valid EVFs for this reaction specified in the form of pairs and include the molecules (ATP, Kinase, etc.). These are obtained from wet lab experiments, as reported by Boniolo et al. [13]. We use the HOL Light function DISTINCT to ensure that all molecule variables (from IA and EVFs) used in this theorem represent distinct molecules. Thus, the final list of molecules is deduced under these particular conditions using the function zsyn_deduct. Finally, if the molecule pTP53 is present in the post-reaction list of molecules, it will be obtained after the application of the function zsyn_conjun_elimin, as previously described. Additionally, in order to automate the verification process, we developed a simplifier Z_SYNTAX_SIMP [63], which is based on some derived rules and already available HOL Light tactics that simplified the manual reasoning and thus allowed us to formally verify Theorem 1 automatically. It is important to note that formalization of Zsyntax was quite a tedious effort but it took only 6 lines of code for the verification of the theorem of pathway deduction from TP53 to pTP53 in HOL Light, which clearly illustrates the effectiveness of our foundational work.
We have shown that our formalization is capable of modeling molecular reactions using Zsyntax inference rules, i.e., given an IA Formal analysis of tumor growth based on Cancer Stem Cells (CSC). According to the Cancer Stem Cell (CSC) hypothesis [68], malignant tumors (cancers) are originally initiated by different tumor cells, which have similar physiological characterises as of normal stem cells in the human body. This hypothesis explains that the cancer cell exhibits the ability to selfrenew and can also produce different types of differentiated cells. The mathematical and computational modeling of cancers can provide an in-depth understanding and the prediction of required parameters to shape the clinical research and experiments. This can result in efficient planning and therapeutic strategies for accurate patient prognosis. In this paper, we consider a kinetic model of cancer based on the cancer stem cell (CSC) hypothesis, which was  [20]. In this model, four types of events are considered: 1) CSC self-renewal; 2) maturation of CSCs into P cells; 3) differentiation to D cells; and 4) death of all cell subtypes. All of these types of reactions are driven by different rate constants as shown in Fig 4b. In the following, we provide the possible reactions in the considered model of cancer [20]: 1. Expansion of CSCs can be accomplished through symmetric division, where one CSC can produce two CSCs, i.e., CSC ! k 1 2CSC.

2.
A CSC can undergo asymmetric division (whereby one CSC gives rise to another CSC and a more differentiated progenitor (P) cell). This P cell possesses intermediate properties between CSCs and differentiated (D) cells, i.e., CSC ! k 2 CSC þ P.
3. The CSCs can also differentiate to P cells by symmetric division, i.e., CSC ! k 3 2P.
4. The P cells can either self-renew, with a decreased capacity compared to CSCs, or they can differentiate to D cells, i.e., P ! k 4 2P, P ! k 5 2D.
In order to reduce the complexity of the resulting model, only three subtypes of cells are considered: CSCs, transit amplifying progenitor cells (P), and terminally differentiated cells (D) as shown in Fig 4b. This assumption is consistent with several experimental reports [20]. Our main objective is to derive the mathematical expressions, which characterize the time evolution of CSC, P and D. Concretely, the values of these cells should satisfy the set of differential equations that arise in the kinetic model of the proposed tumor growth. Once the expressions of all cell types are known, the total number of tumor cells (N) in the human body can be computed by the formula N(t) = CSC(t) + P(t) + D(t). Furthermore, the tumor volume (V) can be calculated by the formula V(t) = 4.18 × 10 6 N(t), considering that the effective volume contribution of a spherically shaped cell in a spherical tumor (i.e., 4.18 × 10 −6 mm 3 /cell).
The formal verification of the time-evolution of tumor cell types CSC, P and D in Theorem 2 can be easily used to formally derive the total population and volume of tumor cells. The derived time-evolution expression, verified in Theorem 2, can also be used to understand how the overall tumor growth model works. Moreover, potential drugs are usually designed using the variation of the kinetic rate constants, such as k 1 , k 2 Á Á Ák 8 in Theorem 2, to achieve the desired behavior of the overall tumor growth model and thus Theorem 2 can be utilized to study this behavior formally. On similar lines, the variation of these parameters is used to plan efficient therapeutic strategies for cancer patients and thus the formally verified result of Theorem 2 can aid in accurately performing this task.
Combined Zsyntax and Reaction kinetic based formal analysis of the tumor growth model. In this section, we consider another model for the growth of tumor cells and formally analyze it using both of our Zsyntax and Reaction kinetics formalizations, presented in the Results section of the paper.
Pathway Leading to Death of CSC The pathway leading to death of CSC is shown in Fig 5a. The green-colored circle represents the desired product, whereas, the blued-colored circles describe the chemical interactions in the pathway. We use our formalization of Zsyntax to deduce this pathway. In the classical Zsyntax format, the reaction of the pathway leading from CSC to its death can be represented by a theorem as CSC & P ' M. Based on our formalization, it can be defined as follows:  In the above theorem, the first argument of the function zsyn_deduct represents the list of IA of molecules that are present at the start of the reaction, whereas the second argument is the list of valid EVFs for this reaction specified in the form of pairs and include the molecules (CSC, P, etc.). We use the HOL Light function DISTINCT to ensure that all molecule variables (from IA and EVFs) used in this theorem represent distinct molecules. Thus, the final list of molecules is deduced under these particular conditions using the function zsyn_deduct. Finally, if the molecule M is present in the post-reaction list of molecules, it will be obtained after the application of the function zsyn_conjun_elimin. We use the simplifier Z_SYN-TAX_SIMP [63] to formally verify Theorem 3 automatically.

Reaction Kinetic based Formal Analysis of a Tumor Growth based on CSC
We perform the reaction kinetic based formal analysis of a tumor growth model, which is shown in Fig 5b. In this model, two types of events are considered: 1) maturation of CSCs into P cells; 2) death of all cell subtypes. All of these types of reactions are driven by different rate constants as shown in Fig 5b. In the following, we provide the possible reactions in the considered tumor growth model: 1. A CSC can undergo asymmetric division (whereby one CSC gives rise to another CSC and a more differentiated P cell), i.e., CSC ! k 1 CSC þ P.
In order to reduce the complexity of the resulting model, only two subtypes of cells are considered: CSCs and transit amplifying progenitor cells (P) as shown in Fig 5b. Our main objective is to derive the mathematical expressions, which characterize the time evolution of CSC and P. Concretely, the values of these cells should satisfy the set of differential equations that arise in the kinetic model of the proposed tumor growth. Once the expressions of all cell types are known, the total number of tumor cells (N) in the human body can be computed by the formula N(t) = CSC(t) + P(t). We formalize the reaction kinetic based tumor growth model and verify the time evolution expressions for CSC and P that satisfy the general kinetic model. We formally represent this requirement in the following HOL Light theorem: Theorem 4. Time Evolution Verification of a Tumor Growth Model ' 8 k 1 k 2 k 3 CSC P M t. A1: (k 3 −k 2 6 ¼0)Â 2: 8t. CSC(t) = e −k 2 Ã tÂ 3:8t: PðtÞ ¼ ½ðk 3 À k 2 À k 1 Þe À k 3 t þk 1 e À k 2 t ðk 3 À k 2 ÞÂ 4: real_derivative M(t) = k 2 CSC(t) + k 3 P(t) ) entities_deriv_vec [CSC; P; M] t = transp (st_matrix (tumor_growth_rk_model CSC P M k 1 k 2 k 3 t)) ÃÃ flux (tumor_growth_rk_model CSC P M k 1 k 2 k 3 t)) where the first assumption (A1) ensures that the time evolution expression of P does not contain any singularity. The next two assumptions (A2-A3) provide the time evolution expressions for CSC and P, respectively. The last assumption (A4) is provided to discharge the subgoal characterizing the time-evolution of M (dead cells), which is of no interest and does not impact the overall analysis as confirmed by experimental evidences [20]. Finally, the conclusion of Theorem 4 is the equivalent reaction kinetic (ODE) model of the CSC based tumor growth model. To facilitate the verification process of the above theorem, we use the KINETIC_SIMP simplifier, which sufficiently reduces the manual reasoning interaction with the theorem prover. After the application of this simplifier, it only takes some arithmetic reasoning to conclude the proof of Theorem 4. More details about the verification process can be found at [63].

Discussion
Most of the existing research related to the formal analysis of the biological systems has been focussed on using model checking. However, this technique suffers from the inherent statespace explosion problem, which limits the scope of this success to systems where the biological entities can acquire only a small set of possible levels. Moreover, the underlying differential equations describing the reaction kinetics are solved using numerical approaches [69], which compromises the precision of the analysis. To the best of our knowledge, our work is the first one to leverage the distinguishing features of interactive theorem proving to reason about the solutions to system biology problems. We consider the concentration of the species of the biological systems in reaction kinetic based formal analysis as a continuous variable. Besides formalizing Zsyntax and the reaction kinetics of commonly used biological pathways, we also formally verified their classical properties. This verification guarantees the soundness and the correctness of our formal definitions. It also enables us to conduct formal analysis of realworld case studies. In order to illustrate the practical effectiveness of our formalization, we presented the automatic Zsyntax based formal analysis of pathway leading to TP53 Phosphorylation and a pathway leading to the death of CSCs in the tumor growth model, and reaction kinetics based analysis of the tumor growth model. Our source code is available online [63] and can be used by other biologists and computer scientists for further applications and experimentation.
The distinguishing feature of our framework is the ability to deductively reason about biological systems using both Zsyntax and reaction kinetics. The soundness of interactive theorem proving ensures the correct application of EVFs or the simplification process as there is no risk of human error. The involvement of computers in the formal reasoning process of the proposed approach makes it more scalable than the analysis presented in Boniolo et al.'s and Molina-Pena et al.'s paper [13,20], which is based on traditional paper-and-pencil based analysis technique. Another key benefit of the reported work is the fact that the assumptions of these formally verified theorems are guaranteed to be complete, due to the soundness of the underlying analysis methods, and thus enables us to get a deep understanding about the conditions and constraints under which a Zsyntax and reaction kinetics based analysis is performed. Also, we have verified generic theorems with universally quantified variables and thus the analysis covers all possibilities. Similarly, in the case of reaction kinetics based analysis, the theorems have been verified for arbitrary values of parameters, such as k 1 and k 2 , which is not possible in the case of simulation where these expressions are tested for few samples of such parameters. A major limitation of higher-order logic theorem proving is the manual guidance required in the formal reasoning process. But we have tried to facilitate this process by formally verifying frequently used results, such as, simplification of vector summation manipulation and verification of flux vectors and stoichiometric matrices for each of the reaction schemes, and providing automation where possible. For example, we have developed two simplifiers, namely Z_SYNTAX_SIMP and KINETIC_SIMP, that have been found to be very efficient in automatically simplifying most of the Zsyntax or reaction kinetic related proof goals, respectively. In the first case study, the simplifier Z_SYNTAX_SIMP allowed us to automatically verify the theorem representing the reaction of the pathway leading to TP53 Phosphorylation. Similarly, in the second case study, i.e., time evolution verification of the tumor growth model, the simplifier KINETIC_SIMP significantly reduced the manual interaction and the proof concluded using this simplifier and some straightforward arithmetic reasoning. These simplifiers are also used to automate the verification process of the third case study, i.e., the automatic verification of the theorem representing the reaction of the pathway leading to the death of CSC and a significant simplification of the verification of the theorem representing the time evolution for the growth of the tumor cell.
In future, we plan to conduct the sensitivity and steady state analysis [14] of biological networks that is mainly based on reaction kinetics. We also plan to integrate Laplace [70] and Fourier [71] transforms formalization in our framework that can assist in finding analytical solutions of the complicated ODEs.
Supporting information S1 Report. Technical report.