1 Systems Biology Approach to Model the Life Cycle of Trypanosoma cruzi

Due to recent advances in reprogramming cell phenotypes, many efforts have been dedicated to developing reverse engineering procedures for the identification of gene regulatory networks that emulate dynamical properties associated with the cell fates of a given biological system. In this work, we propose a systems biology approach for the reconstruction of the gene regulatory network underlying the dynamics of the Trypanosoma cruzi ’s life cycle. By means of an optimisation procedure, we embedded the steady state maintenance, and the known phenotypic transitions between these steady states in response to environmental cues, into the dynamics of a gene network model. In the resulting network architecture we identified a small subnetwork, formed by seven interconnected nodes, that controls the parasite’s life cycle. The present approach could be useful for better understanding other single cell organisms with multiple developmental stages.


Introduction
One of the main aims in the post-genome era is to elucidate the complex webs of interacting genes and proteins underlying the establishment and maintenance of cell states. Consequently, many researchers have focused on developing quantitative frameworks to identify modules that govern the transitions between different phenotypes [1][2][3]. The gene regulatory network (GRN) approach is one of the most popular frameworks used today [4,5]. This approach has been used to study key reprogramming genes and cell differentiation processes in stem cells from different points of view [6][7][8]. Mathematically, GRN models are dynamical systems whose states determine the gene-expression levels [4]. The structure of the network is defined as a graph whose nodes are associated with genes (or groups of genes), and whose edges represent the interactions between the nodes. The task of uncovering the GRN architecture from the cell states (gene-expression profiles) represents a very complex inverse problem that has become central in functional genomics [9]. The main drawbacks of this reverse engineering task are not only the large number of genes and the limited amount of data available, but also the nonlinear dynamics of regulations, the inherent experimental errors, the noisy readouts of expression levels, and many other unobserved factors that are part of the challenge [10]. Although emerging technologies offer new prospects for monitoring mRNA concentrations, researchers have focused on determining the architecture of simplified theoretical models [11].
In this work, we have implemented a GRN approach to analyse transcriptional data of the steady states of the flagellated protozoan parasite Trypanosoma cruzi (T. cruzi). This trypanosomatid is the causative agent of Chagas disease, that affects about 7-8 million people worldwide causing about 12,000 deaths per year. Usually, the parasites are transmitted to humans and to other mammalian hosts mainly by contact with the faeces of infected blood-sucking triatomine bugs [12]. T. cruzi has several developmental stages both in insect vectors and in mammalian hosts (Fig 1a). Insects become infected by sucking blood from mammals with circulating parasites (trypomastigotes). In the midgut of the insect, trypomastigotes differentiate After a dimensional reduction analysis of the microarray dataset, we have found that the four steady states can be represented by 339 variables. Each of these variables (cells in the 19 × 18 array) corresponds to the intra-cluster average of the log-transformed relative expression level of the genes that belong to the corresponding cluster. Since gene assignment to the clusters is the same for all states, the arrays can be directly compared with one another. into epimastigotes that replicate by binary fission. Then, epimastigotes differentiate into metacyclic trypomastigotes in the hindgut. This parasite form is released in the insects faeces and enters the mammalian host. Once in the vertebrate host, metacyclic trypomastigotes invade local cells and differentiate into amastigotes that replicate by binary fission. They subsequently transform into trypomastigotes inside the cells. By lysing the cells, trypomastigotes are released into the circulation. Thus, they spread via the bloodstream and infect new cells from distant tissues, mainly muscle and ganglion cells. There, trypomastigotes transform back into intracellular amastigotes which, in turn, undergo further cycles of intracellular multiplication. The cycle of transmission is completed when circulating trypomastigotes are taken up in blood meals by triatomine vectors [13,14].
Even though there are potential vaccine candidates against T. cruzi infection, no vaccine is yet available [15]. Thus, the finding of novel therapeutic targets remains a significant challenge in the control of Chagas disease. Taking this into consideration, we have implemented a GRN approach to analyse transcriptional data of T. cruzi's steady states [16]. The first analyses of networks related to T. cruzi were not truly GRN approaches but extracellular matrix (ECM) interactomes [17,18]. These protein-interaction networks were built using the MiMI Cytoscape plugin. In these works some parasite surface molecules, which are known to modulate or interact with these host proteins, were pointed out.
In the present framework, we have uncovered the underlying architecture network that supports the steady states associated with the four phenotypic stages of T. cruzi and the transitions between the parasite's life cycle stages in response to environmental cues. We believe that this gene network model can clarify the signaling pathways, predict the response of cellular systems to multiple perturbations other than the ones used to derive the model, and determine the perturbation pattern for any desired response.

Microarray data normalisation
In this work we have used the microarray experiments of Minning et al. [16]. These data are publicly available in Gene Expression Omnibus (GEO) database (Accession no.: GSE14641). This series is the result of dye-swap experiments, out of which we selected the probe intensity signals of 12 microarrays (three biological replicates, and the four not-mixed parasite stages). These microarrays comprise 12,288 unique 70-mers designed against open reading frames in the annotated CL Brener reference genome sequence. They also contain 500 control oligonucleotides designed from Arabidopsis sequences. All of these oligonucleotides were printed in duplicate. Further details about probe preparation, microarray hybridisation, and data acquisition can be found in [16], while the description of the microarrays is available at http://pfgrc.tigr.org.
The probe intensity signals from the microarrays were subjected to the following normalisation procedure. (i) The signal intensity of each probe was set at the average of the signal intensities associated with a pair of replicate spots. (ii) The signal intensity of a probe i was normalised against the average signal of control Arabidopsis probes in order to obtain a signal relative intensity within the slide. The average signal of control Arabidopsis probes is the arithmetic mean of a set of control probes with valid signals. Of course, this set is the same in all microarray experiments. This normalisation procedure has allowed us to integrate the expression data of all the microarrays. The signal relative intensity of the probe i recorded in one of the biological replicates, j = 1, 2, 3, at one of the stages, α = 1, 2, 3, 4, was represented by y a j i . (iii) After within-slide replicates processing, we averaged the relative intensity y a j i over all replicates belonging to the same stage, i.e., y a i ¼ 1=3 P j y a j i . Probes without a valid relative signal in all three biological replicates were not considered in the subsequent analyses. As a result of this processing, we obtained the relative intensity of 8904 probes at parasite's four different stages. We then considered the variable that describes the expression level of the probe i at stage α as the quantity x a i ¼ Log e ½ y a i =h y a i i a . S1 Table lists all normalised expression levels, x a i , used in the following analyses, with their corresponding oligo IDs.

Clustering procedure
Instead of using each gene's profile, many researchers have analysed the cell at a higher level of abstraction. One way to do this is by grouping redundant genes, i.e. by clustering co-expressed genes [19,20], and using the average within each cluster as a variable. In order to group the genes by similar expression profiles, we have applied an agglomerative hierarchical clustering method; the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), though a more sophisticated clustering method like self-organizing map could be used for this task. The agglomerative process is stopped at a given number of clusters considered suitable for our dataset. Since the suitable number of clusters, N c , is not known, it has to be computed beforehand. In order to do this, we repeated the clustering procedure for several N c values, and computed the Davies-Bouldin index (DBI) as a measure of the clustering merit [20,21]. The DBI is defined as: where d k ¼ N À1 k P i jjx i À c k jj denotes the centroid intra-cluster distances of cluster k (N k being the number of genes belonging to cluster k); and ||c k − c j || is the distance between the cluster centroids. A low DBI value indicates a good cluster structure. It should be noted that increasing N c without penalty will always reduce the resulting index. Then, the choice of N c will intuitively strike a balance between the data compression and the accuracy of the dimensionality reduction. S1 Fig displays the DBI as a function of N c for the gene-expression profile under study. It can be seen that the DBI does not suffer a significant reduction beyond N c = 339. Thus, N c = 339 was selected as the optimal number of clusters. The profiles of the 8,904 genes were grouped in 339 clusters, and the intra-cluster average of the expression level (i.e. x a j ¼ hx a i i i2clusterj ) was used in all of the subsequent analyses. Fig 1b displays a 2D array of the resulting average levels after the dimension reduction process described above for the four stages of T. cruzi's life cycle. The sets of genes belonging to each cluster are listed in S2 Table, and the intra-cluster averages of the expression levels, x a j , for each cluster (rows) at each of the parasite's stages (columns) are listed in S3 Table. Reverse engineering methods Gene network dynamics. In this work, we have implemented a discrete-time linear model [22][23][24] which has two advantages: it can take into account fluctuations, and its parameter estimation does not involve intensive computational steps [11]. In this model, the system's state at time t is represented by an N-dimensional vector x(t), which represents the activity of the N nodes of the network. The temporal evolution of the gene network is governed by: where w i,j are the elements of the weighted connectivity matrix W, θ i is a constant bias term of gene i, and k m i determines the influence of the environmental cue μ on gene i. We have considered four different cues corresponding to unknown external differentiation signals. Thus, μ = 1, 2, 3, and 4 represent the external signals responsible for the transitions to the amastigote, epimastigote, metacyclic tryp., and trypomastigote stages, respectively. Finally, i (t) is a noise term assumed to be Gaussian with mean equal to 0. In order to simplify the notation for the parameter estimation procedure, we noticed that the bias term and the environmental cues can be included in an extended version of matrix W and of state vector x. Thus, the state of gene i is given by: where μ corresponds to the acting environmental cue. This said, the same parameter estimation method can be applied whether the environmental cues are present or not. Singular value decomposition (SVD). Linear models serve as the basis of all continuous gene-network approaches currently available to model typical time-course gene-expression data sets (see [11] for a review). These data sets consist of M pairs of input-output states, represented by D = {X,Y}. Matrix X is the N × M gene-expression matrix at time t. The columns of matrix X labeled by index ν, x ν , correspond to the experiments, while the rows indicate individual genes. The same is valid for the gene-expression matrix at time t+Δt, Y. For a given D, the linear model must map each gene-expression state to the consecutive state, i.e.: Therefore, in order to find the connectivity matrix, the predicted states from a given input state x ν of the training set must be as close as possible to the output state y ν . An alternative would be to minimise the cost function ∑ ν kWx ν − y ν k. A particular solution with the smallest L 2 norm is given in terms of the SVD of matrix X T (where superscript T denotes the transpose matrix), i.e. X T = U Á S Á V T , where U is a unitary M × N matrix of left eigenvectors, S is a diagonal N × N matrix containing the eigenvalues {s 1 , . . .,s N }, and V is a unitary N × N matrix of right eigenvectors [23,25]. Thus, the solution with the smallest L 2 norm represented by W L 2 is given by: Without loss of generality, all s j elements whose value is different from 0 were listed at the end of diagonal matrix S, and the s À1 j values in Eq (5) were considered to be 0 if s j = 0. SVD is mathematically related to the eigen decomposition (ED) and the principal component analysis (PCA). SVD can be understood as a generalisation of ED, but ED only applies on diagonalizable matrices, and fails if the matrix is not square or is singular as in our case. SVD and ED are related, in particular the column vectors of U are eigenvectors of X Á X T , while the column vectors of V are eigenvectors of X T Á X. The diagonal entries of S are the square roots of the eigenvalues of both X Á X T and X T Á X. The eigenvectors of X T Á X also allow to compute the principal components in the standard PCA. PCA is a common tool for exploratory data analysis. It can provide a lower-dimensional picture by projecting the data onto the subspace with the higher variance [26]. We have exploited PCA analysis for a better visualisation of our results. However, since the focus of this manuscript is finding the connectivity matrix rather than obtaining a better representation of data sets, we have used SVD instead of ED or PCA because it is more suitable for that purpose.
The smallest L 2 norm solution cannot be unique. Assuming that x ν are linearly independent, finding the unique solution requires that M ! N. Unfortunately, the inverse problem in GRN involves M < <N. Thus, the problem tends to be severely underdetermined, and many solutions can then be consistent with data D. Therefore, all the possible connectivity matrices that are consistent with Eq (4) can be written in a closed form as: where C is an N × N matrix whose elements c ij are 0 as long as s j 6 ¼ 0. Otherwise, they are arbitrary scalar coefficients. As it will be seen later, the degrees of freedom due to this arbitrariness can be exploited to our benefit [23]. The solution offered by Eq (5) is implemented to embed the four steady states of T. cruzi into the dynamics of the model, without considering the transitions between the states. Eq (6) is used to uncover the environmental cues by means of using the information provided by the transitions between the different stages of the parasite's life cycle, and the connectivity matrix associated with the steady states inferred in the previous step.
Embedding the steady states. In order to infer the parameter values of Eq (4) that allow the model to display the same set of steady states as the ones seen in the parasite, we have constructed a training set of size M, represented by D ss . Different noise realisations associated with the four stages were added to the steady states. Thus, the columns of matrix X are given by: where α = 1, 2, 3, 4, and the superscript i denotes the noise realisations. In this work, we have used 40 noise realisations for each steady state. Thus, M = 4 × 40 = 160. j is a Gaussian noise with mean equal to 0 and a small standard deviation (set at 1% of the expression data). The columns of matrix Y (y n ¼ f x a j g þ f i 0 j g) are defined in the same way. We have expanded the size of the training set, thereby making the solutions more robust against the fluctuations. This simple concept is similar to adding a Tikhonov regularisation term in the optimisation process, which has been studied in several neural network problems [27,28]. The constructed training set implies that if at a given time the system is very close to one steady state, it will remain close to that steady state in the next time-step as well (Fig 2c).
Overfitting occurs when the model has a good performance on the training set, but it has a poor generalization performance, i.e. a poor ability to correctly predict data beyond the training set. There are several reasons for a model to adjust very specific random features of the training set, with a consequently poor generalisation power [29]. Among them, it can be mentioned models with many parameters and small training set, or when trying to learn training examples that have no causal relationship to the target function. The latter situation is more common in regression problems such as the one considered here. For example, overfitting was reported using SVD when recovering a gene network from a highly noisy training set (20% of the expression level), but it was not present when recovering the network from a clean training set [23]. The noise level used to generate training set D in this work is very low (<1%), and does not cause overfitting.
In order to discriminate if an estimated matrix element should be 0 or another reliable value different from 0, we have constructed not only a training set, D ss , but an ensemble of training sets by means of using different noise realisations. For each training set we have computed the minimal L 2 -norm solution. Different noise realisations give slightly different solutions. Thus, the ensemble of solutions defines a probability distribution for each weight, P i,j (w). We then performed a location test for each distribution P i,j (w), as illustrated in Fig 2d. This step consists of testing the hypothesis stating that the true mean value of P i,j (w) differs from 0 at some magnitude (set at 0.0075). If the p-value associated with this test is greater than 0.01, the hypothesis is rejected, and w i,j is assigned 0 value. Otherwise, w i,j is assigned the mean value of P i,j (w). This procedure allows us to obtain a sparse connectivity matrix, W ss , that is compatible with the steady states.
Embedding the transitions between the steady states. In order to extend our analysis by including the environmental cues, we have used the extended versions of W and x described by Eq (3). To embed the transitions between the steady states, we have considered that these transitions occur gradually and through the shortest possible path between the steady states. Thus, if the system is in the steady state x α and is driven to the steady state x β due to an external cue μ = β, then the system performs a series of small transitions between intermediate states represented by x α, β (t). These intermediate states were constructed by means of a linear combination of the initial and final steady states, i.e. x a;b ðtÞ ¼ ððn i À tÞ x a þ t x b Þ=n i with t = 0, 1, 2, . . ., n i . As it can be seen, x α, β (0) and x α, β (n i ) coincide with the steady states x α and x β , respectively. Thus, using these intermediate states, we constructed a new training set D t , where the columns of matrices X and Y are defined as follows: where t = 0, 1, 2, . . ., n i − 1. We have used n i = 10, which implies 10 small transitions. The pairs (α,β) correspond to the allowed transitions between the steady states; five transitions in the case of T. cruzi. Again, j is a Gaussian noise with mean equal to 0 and a small standard Intermediate states (small circles) between the stages are assumed to exist. It is also considered that an unknown external cue (black arrow) is responsible for the transitions. (f) By means of using SVD, the L 2 -norm solution, W L 2 , is determined. This solution is in turn used to find another solution, W t , which includes information concerning the steady states. This procedure is used to infer the weighted links between genes, w i,j , and to answer two questions: which genes are affected by the external cues, and how they are regulated (up or down) by the environment. doi:10.1371/journal.pone.0146947.g002 deviation (set at 1% of the expression data). In all, four different noise realisations were used, and the size of our training set, D t , was in turn M = 200. We then computed its smallest L 2 norm solution, W L 2 . However, since M < N, this solution was not unique. In order to find a particular solution as close as possible to connectivity matrix W ss , we used Eq (6) and computed the elements of matrix C ss that obey the following equation: where matrix W ss was padded with 0 values because W L 2 includes four additional rows and columns corresponding to the environmental cues that are not present in W ss . Eq (7) is an overdetermined problem that can be solved by applying the interior point method for L 1 regression [23]. The resulting c-values were then used to compute a new connectivity matrix (Fig 2f). This matrix, represented by W t , is not only consistent with the information of the environmental cues and transitions included in D t , but it is also close to W ss .

GRN modeling
Key decisions in modeling a gene network system include the choice of variables and the mathematical framework for representing the system dynamics. In this sense, several regulatory network approaches such as Bayesian networks [30], Boolean networks [31], and linear models [22,24,32] have been suggested. The model must be chosen based on the available data and the ability to infer accurate-enough parameters. The more detailed the model, the more experimental data required to make it work. For instance, when choosing a linear model, in which the expression levels of N genes at time t determine the changes of such expression levels at time t + Δt, the transition matrix must be computed from N pairs of input-output data.
In this work, we have assumed that the system's state is represented by x(t) -the N-dimensional vector corresponding to the expression levels of N gene clusters measured at time t. The GRN dynamics is modeled by a first order Markov model, where the future state depends linearly on the present state and on external perturbations. Mathematically, it is defined by the following equation: where w i,j are the elements of the weighted connectivity matrix W, and indicate the type and strength of the influence of gene j on gene i (w ij > 0 indicates activation, w ij < 0 indicates repression, and 0 indicates no influence). θ i is a constant bias term to capture the activity level of gene i in the absence of regulatory inputs. We have also added a term indicating the influence of unknown external perturbations, or environmental cues; k m i , which is the influence of the environmental cue μ on gene i. Finally, i (t) is a noise term assumed to be Gaussian with mean 0. The next task in our work was to determine which nodes were affected by external cues -even if those cues were unknown-, and how they were affected. To this end, we considered not only the expression-profile data set information ( x a j ), but also some a priori information associated with the following biological facts: (i) the parasite's life cycle has four stages, each of them associated with a measured steady state; (ii) each steady state exhibits some level of noise or fluctuations; and (iii) there are five possible transitions between these four stages. We have assumed that these transitions are the result of different environmental cues acting on certain nodes of the network. Following these facts, we implemented a two-step reverse engineering protocol sketched in Fig 2. First, we focused on embedding the four steady states into the dynamics of the model, regardless of the transitions between these states. Second, we concentrated on uncovering the environmental-cue effectors considering the transitions between the parasite's life cycle stages, while using the same connectivity matrix derived in the previous step.
Modeling the steady states of T. cruzi In order to infer connectivity matrix W, we have considered the linear model (Eq (8)) without external perturbations, and have applied the SVD procedure over a training set, D ss , constructed as indicated in Methods. As a result, the dynamical system, together with the derived matrix, has four basins of attraction which correspond to each of the parasite's stages. This means that whenever the system is in a given basin, it will remain inside that basin as long as there are no external perturbations.
One way to quantify how close the system is of a particular state is by computing the overlap between the vectors that represent the actual and the target stage, where the overlap between vectors x and y is mathematically defined as xÁy/|x||y|. Fig 3a depicts the trajectories (black lines) that illustrate the dynamics of our model in the space spanned by the three principal components. This plot shows that the trajectories fluctuate around corresponding stages represented by colored circles (amastigote in blue, epimastigote in red, metacyclic tryp. in green, and trypomastigote in yellow). For each trajectory we have computed the overlap between the vectors corresponding to the state of the system at time t, and the ones corresponding to each target stage. Fig 3b depicts the time course of such overlaps illustrating, in a more quantitative manner, how the trajectories fluctuate around the the corresponding stages. Fig 3c shows a 2D schematic illustration of the pseudo-potential landscape with the four basins of attraction.
The elements of matrix W are continuous variables and, consequently, they are associated with not-null values. However, the statistical analysis of known regulatory networks has  revealed that such networks have a sparse nature, i.e. the number of actual edges in a network is very small compared to the number of possible edges [33,34]. Such sparsity is difficult to obtain when dealing with continuous weights. Thus, the inferred matrix elements at the end of the reverse engineering process should be either 0 or another reliable value different from 0.
In the spirit of inferring a sparse weight matrix that allows the system to display the four steady states, we have consider a bootstrap method. In this sense, an ensemble of 300 training sets was constructed by means of adding different noise realisations to the steady states, as described in Methods. Using SVD we computed a solution for each training set, obtaining a probability distribution for each weight, P i,j (w). The next step was to assign a value to each element of the connectivity matrix, while carefully assessing the significance of the weight values. We performed a location test to prune the non-significant weights, as illustrated in Fig 2d, and constructed a sparse connectivity matrix, W ss , which supports the data set. At the significance level of 0.01 there are 11,470 links between genes, i.e. around 10% of the elements of W ss are not null. Even with this average node degree, the visualisation of the resulting network poses a challenge. In order to overcome this difficulty, we have displayed only a small fraction of the nodes (around 470 links with p-value less than or equal to 10 −200 ).  Reconstructing the Life Cycle of Trypanosoma cruzi two weakly connected subnetworks seen in the graph reveal a modular organisation of the network at the significance level used. As it will be seen later, one of these subnetworks is linked to the parasite's life cycle. The 11,470 links (all not-null elements of the matrix) between genes are listed in S4 Table. Extracting valuable information from a network made of 10,000 links is a complex task. One way to overcome this problem is by considering only the more important regulators of each steady state. Since the whole regulatory output of a gene depends on the gene's activity level, some genes can be important regulators in one state, while their activity level in the other three states is low (i.e. x i * 0). With this in mind, we have constructed network plots that emphasise the most important links in each steady state; that is to say, those links with |w i,j x j |! 5% of |x i |. The plots in S2 Fig. depict Table. After analysing the data obtained for each of T. cruzi's four stages, we have found 47 clusters with important regulatory activity. These clusters include a total of 68 genes: 25 encoding uncharacterised proteins, and 43 coding for proteins with known functions. Among the latter, the most abundant proteins are trans-sialidase (TS) (encoded by nine different genes), amastin (encoded by five different genes), and mucin TcMUCII (encoded by four different genes).
Besides the main four basins of attraction linked to the known steady states displayed in Fig  3, the system dynamics might include other basins of attraction not-linked to known phenotypes. An exhaustive search for these spurious attractors was performed, and another 20 small attractors, where the system can be trapped were found. These attractors have small basins associated, that disappeared when the phenotypic transitions due to the external cues are included in the model.

Modeling the phenotypic transitions of T. cruzi
After embedding the steady states of the parasite into the GRN dynamics, our analysis was extended to include the transitions that take place between those states as a result of environmental cues. The fact that these transitions in the presence of a given external perturbation occur gradually was taken into account. Since no data about the intermediate states between the steady states are available, we have constructed a training set, represented by D t , considering that the system performs transitions between the initial and final steady states through the shortest possible path. For details about the construction of the training set, see Methods. As this training set has M < N, there exist infinite solutions compatible with D t . We have chosen the closest solution to the connectivity matrix that uses nothing but the steady states information, i.e. the closest to W ss . Thus, our connectivity matrix is represented by: where W L 2 is the corresponding minimal L 2 -norm solution obtained by SVD for D t . Matrix C ss was computed by the interior point method as described in Methods. The new connectivity matrix, W t , is consistent with the information of the environmental cues and transitions included in training set D t . As W t is also very close to W ss , it consequently inherits the ability to support the multi-stability of the parasite's life cycle.
In order to test the ability of the model (Eq (8)) to emulate the observed dynamical behavior, simulations under different external cues were performed. Each of these simulations was performed considering that the system is initially in one of the parasite's steady states, and that an external cue μ is acting. The simulations were performed by running 12 iterations of the model (Eq (8)), and recording the system's state at each of these 12 steps. The temporal evolution of the 339 variables of the system was compiled in movies available as S1-S5 Movies. S1 Movie shows the simulated phenotypic transition from the amastigote stage to the trypomastigote stage when external cue μ = 4 is acting. S2-S5 Movies, on their part, illustrate the modeling results of the remaining phenotypic transitions. In all cases, the final state of the system is in agreement with the expected state regarding the acting external cue. This agreement can be better appreciated when using a principal component analysis procedure for dimension reduction.  Table. A total of 166 externally regulated genes belonging to 86 different clusters were found. While 73 of these genes encode uncharacterised proteins, the other 93 genes code for proteins with known functions. Just as in the steady states, the most abundant proteins are TS (encoded by 21 different genes), amastin (encoded by six different genes), and mucin TcMUCII (encoded by five different genes). The difference, however, is that when considering the transitions, these proteins act no longer as regulators, but they are up-or down-regulated instead. Uncharacterised proteins without GO annotations have been analysed using the InterproScan software [35], and the results are summarised in S7 Table. According to this analysis, 39% of these proteins are membrane proteins. Furthermore, we have found that genes affected by the external cues leading to the parasite's two mammalian stages are inversely regulated, i.e. if they are up-regulated in one of these two stages, so they are down-regulated in the other, and vice versa. Regarding genes affected by the external cues leading to the parasite's two insect stages, we have found that they are equally regulated, i.e. they are up-regulated in both stages or down-regulated in both stages.
The next step in our analysis was to identify the module that controls the parasite's life cycle. This is a difficult task because it involves the isolation of a small subset of nodes and regulatory connections out of a network of 10,000 links. In principle, the number of possible subnetworks within a network of such a size is very large. Consequently, the evaluation of the subnetworks' dynamics is not possible. In order to solve this problem, we reduced the search space. With this in mind, we considered only those circuits that involve nodes with important regulatory roles. To this end, we have used the list of regulatory clusters shown in S5 Table, and have written a script to search for cyclic graphs, i.e. closed loops, containing such nodes in matrix W t . With this set of modules, we then searched for those subnetworks with the ability to emulate the parasite's dynamics. At this point, our model had to be simplified. In order to evaluate the dynamics of the system, we have considered that variable x i is a Boolean variable, and that the system's evolution is given by: where index j 0 in the sum runs only over the nodes belonging to the module under evaluation.
The parameter values are taken from W t , and listed in S8 Table. As a result of the searching process, we were able to identify a seven-node module, containing a total of nine genes. Fig 6a illustrates the architecture of this subnetwork. The seven clusters forming the subnetwork are: 195, 214, 257, 259, 260, 332, and 334. Relevant information about these clusters and their composing genes is shown in Table 1. Three of the nine genes code for uncharacterised proteins (Q4DTV8, Q4DVU8 and Q4E589). According to their GO annotations, Q4DTV8 has hydrolase activity (acting on carbon-nitrogen -but not peptidebonds, in linear amidines), Q4DVU8 has transporter activity, and Q4E589 has catalytic activity. The other six genes code for: an hexokinase (Q4D3P5), a δ-1-pyrroline-5-carboxylate dehydrogenase (Q4DRT8), a quinone oxidoreductase (Q4DHH8), a glutamate dehydrogenase (Q4DWV8), a peptidyl-prolyl cis-trans isomerase (Q4E4L9), and a metaciclina II (Q4E2M3).
The identified subnetwork reproduces many important dynamical features observed in the life cycle of T. cruzi. On the one hand, the phenotypic transitions from epimastigote to metacyclic tryp., from amastigote to trypomastigote, and from trypomastigote to epimastigote are reproduced under the influence of the corresponding external cue. And on the other, the phenotypic stages epimastigote, metacyclic tryp., and trypomastigote correspond to steady states of the subnetwork's dynamics. As an example of the subnetwork's dynamics, Fig 6b illustrates the basin of attraction of the module under the action of environmental cue μ = 4. This external signal leads the network to the trypomastigote state. The figure shows that regardless of the initial state (there are 128 Boolean states), the final stop of the trajectories in the Boolean space is always the trypomastigote stage. Similarly, when the environmental cue is μ = 2 or μ = 3, the obtained basin of attraction is the epimastigote or the metacyclic trypomastigote stage, respectively.
Finally, we performed two perturbation experiments in our modelling, in order to do an in silico validation. In this sense, we first considered the case when genes belonging to cluster 326 were overexpressed and genes belonging to clusters 337 were knocked down. These genes were predicted to have major roles for the trypomastigote stage maintenance (see S2 Fig). S3 Fig illustrates the dynamics of the system (yellow trajectory) under this perturbation. It can be seen that the trypomastigote stage is not associated to a stable attractor anymore. Instead, the system  moves to an alternative stable state. In the second in silico experiment, we simulated the effect of overexpressing genes belonging to clusters 259, 260, and 332; which were predicted to be down-regulated by the external cue leading to the trypomastigote stage (see Fig 6a). The blue trajectory in S3 Fig does not ends at the trypomastigote stage, but at the same alternative state as in the previous experiment. This indicates that these network perturbations could prevent the normal development of the parasite's life cycle.

Discussion
One fundamental open question in systems biology is how cells that share the same genome exhibit notably different gene-expression patterns or distinct phenotypes. This question is closely related to the process of establishing cell fates during development. A widely used picture to describe these phenomena is Waddington's epigenetic landscape, a phenomenological metaphor which corresponds to an energy landscape with many local minimums where the system moves regardless of whether environmental cues are present [36][37][38]. Despite the simplicity and elegance of Waddington's concept, it lacks quantitative mechanistic details. Given the significance of a quantitative understanding of cell phenotypic transitions, many efforts have been made to develop predictive mathematical frameworks [38][39][40][41][42]. Although some advances have been made for low dimensional systems, the application of these mathematical frameworks to higher dimensional models remains a theoretical challenge.
In this work, we have developed a reverse engineering approach to identify the gene network structures responsible for the observed dynamical properties of a high dimensional biological system. These dynamical properties include the steady states associated with the stable phenotypes, and the phenotypic transitions observed in T. cruzi's life cycle. We have assumed that each of the five phenotypic transitions occurs in response to the external cue corresponding to the final state of the transition. For this reason, we have modeled the life cycle of T. cruzi as if it were an open dynamical system. Our methodology for embedding the observed expression patterns into the GRN dynamics adds several new ingredients such as the use of an ensemble of noise-perturbed training sets, and a pruning procedure to identify the significant network links. The information of the transitions between the stable phenotypes was used to develop an optimisation procedure. This reverse engineering procedure has been successfully used to identify one key network module that explains three of the five phenotypic transitions. To incorporate the information about phenotypic transitions we have assumed that all transitions occur at the same rate, and through the shortest path. Until now, there are not experimental facts supporting or not these assumptions. Thus, the present results must be interpreted taking into account the limitations of the available data. However, we believe that using our coarse-grain hypothesis about phenotypic transitions has more predictive power than only using the steady state transcriptome data. We are persuaded that having the transcriptome data between transitions could be valuable to improve our results.
In a previous work an interactome network of the early T. cruzi infection process was presented [18]. It was built using components of the ECM (THBS1, LAMC1, LGALS3, and ERK1/ 2) as initial seed nodes. T. cruzi gp83 ligand, a TS expressed only in invasive trypomastigotes, triggers gp83 receptors in the host cells via activation of ERK1/2. This results in the up-regulation of LAMC1 which, in turn, cross-talks with LGALS3 and THBS1. All these interactions enhance cellular infection using specific parasite surface molecules such as calreticulin (TcCRT), Tc45 mucin, and Tc85 [18]. In particular, infective trypomastigotes use Tc85, a member of the gp85/TS gene family, to interact with laminin [43], Tc45 mucin to interact with LAMC1 through LGALS3 [44], and TcCRT to interact with THBS1 [45]. Taken together, these are clear evidences that T. cruzi regulates and uses the ECM to invade host cells and cause disease. In a certain way, our results are consistent with that, since some of the regulatory genes we have found code for virulence factors such as TSs or mucins. In addition, and regarding transitions, we have found 3 TS-coding genes (Q4CQC9, Q4DWU9 and Q4DLE3) and 1 mucin TcMUCII-coding gene (Q4D2K9) up-regulated in transitions leading to trypomastigote stage. This is another result that is in agreement with the previously mentioned works, in which the importance of certain parasite's molecules in parasite-host interactions and host-cell invasion was demonstrated.
Besides the development of a model with the ability to emulate the parasite's dynamics, the information presented in this work (S5 and S6 Tables) could be useful to assign previously unknown putative functions to some genes. In this sense, our results suggest that amastin genes could act as key regulators. This finding is consistent with a previous study in which it is shown that amastin may increase T. cruzi's differentiation rates both in the insect and in the mammalian hosts [46]. On the other hand, our finding of TS-coding genes acting as regulators in the amastigote stage adds relevant information to the resulting parasite state. Furthermore, we have found that these same TS-coding genes are inhibited in the transitions leading to the amastigote stage, and activated in the transitions leading to the trypomastigote stage. Considering that TS plays a key role in T. cruzi's infectivity and that this enzyme is not present in mammals, TS constitutes a potential target for the development of novel drugs to treat or prevent Chagas disease [47][48][49]. Finally, we have found four mucin genes belonging to the TcMUC family that act as regulators both in the mammalian and in the insect parasite's stages. It is known that this family of mucins is expressed only in the mammalian stages [50].
The present approach could be adapted to and useful for better understanding other single cell parasites with multiple developmental stages such as T. brucei, P. falciparum and Leishmania. Uncovering the core circuit that underlies the dynamics of these parasites' life cycles could open the door to new possibilities: the development of applications to reprogramme the parasites' life cycles, and the finding of new therapeutic targets against the parasites. Yellow trajectory shows the dynamics of the system, in the space spanned by the three principal components, when steady state corresponding to the trypomastigote stage is perturbed by mean of over-expressing and knockingdown genes belonging to cluster 326 and 337, respectively. Blue trajectory shows the dynamics of the system when clusters 259, 260, and 332, predicted to be down-regulated by the external cue leading to the trypomastigote stage, are overexpressed. For comparison, trajectories corresponding to the unperturbed transition from amastigote to trypomastigote stages are also plotted (colored circles). (TIF) S1 Movie. Transition from the amastigote stage to the trypomastigote stage. The animated matrix plots show the evolution of the system from the amastigote stage, under the action of μ = 4, to the trypomastigote stage. The movie is composed of 12 frames; one for each step in the simulation. The simulation shows a clear similarity between the states associated with the last two frames and the corresponding target stage indicated by μ. (AVI) S2 Movie. Transition from the trypomastigote stage to the epimastigote stage. The animated matrix plots show the evolution of the system from the trypomastigote stage, under the action of μ = 2, to the epimastigote stage. The movie is composed of 12 frames; one for each step in the simulation. The simulation shows a clear similarity between the states associated with the last two frames and the corresponding target stage indicated by μ. (AVI) S3 Movie. Transition from the epimastigote stage to the metacyclic tryp. stage. The animated matrix plots show the evolution of the system from the epimastigote stage, under the action of μ = 3, to the metacyclic tryp. stage. The movie is composed of 12 frames; one for each step in the simulation. The simulation shows a clear similarity between the states associated with the last two frames and the corresponding target stage indicated by μ. (AVI) S4 Movie. Transition from the metacyclic tryp. stage to the amastigote stage. The animated matrix plots show the evolution of the system from the metacyclic tryp. stage, under the action of μ = 1, to the amastigote stage. The movie is composed of 12 frames; one for each step in the simulation. The simulation shows a clear similarity between the states associated with the last two frames and the corresponding target stage indicated by μ. (AVI) S5 Movie. Transition from the trypomastigote stage to the amastigote stage. The animated matrix plots show the evolution of the system from the trypomastigote stage, under the action of μ = 1, to the amastigote stage. The movie is composed of 12 frames; one for each step in the simulation. The simulation shows a clear similarity between the states associated with the last two frames and the corresponding target stage indicated by μ. (AVI) S1 Table. Gene-expression profile of T. cruzi's life cycle. Log-norm expression levels corresponding to 8,904 T. cruzi genes, obtained from microarray experiments as indicated in Methods. The gene IDs listed in the first column correspond to our own gene numbering. The second column lists the microarray oligo IDs. The third and the fourth columns list the gene and protein names, respectively. The last four columns correspond to the gene-expression levels in each stage of the parasite's life cycle: amastigote, epimastigote, metacyclic tryp. and trypomastigote, respectively. (XLSX) S2 Table. Cluster composition. Clusters (cluster ID) are listed in the first column, with their corresponding genes (gene ID) listed in the second column. The third and the fourth columns list the gene and protein names, respectively.