Predicting species emergence in simulated complex pre-biotic networks

  • Omer Markovitch,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Current Address: Center for Systems Chemistry, Stratingh Institute, University of Groningen, Groningen, The Netherlands; Blue Marble Space Institute of Science, Seattle, Washington, United States of America

    Affiliation Interdisciplinary Computing and Complex Bio-Systems research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, United-Kingdom

  • Natalio Krasnogor

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Interdisciplinary Computing and Complex Bio-Systems research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, United-Kingdom



An intriguing question in evolution is what would happen if one could “replay” life’s tape. Here, we explore the following hypothesis: when replaying the tape, the details (“decorations”) of the outcomes would vary but certain “invariants” might emerge across different life-tapes sharing similar initial conditions. We use large-scale simulations of an in silico model of pre-biotic evolution called GARD (Graded Autocatalysis Replication Domain) to test this hypothesis. GARD models the temporal evolution of molecular assemblies, governed by a rates matrix (i.e. network) that biases different molecules’ likelihood of joining or leaving a dynamically growing and splitting assembly. Previous studies have shown the emergence of so-called compotypes, i.e., species capable of replication and selection response. Here, we apply network science to ascertain the degree to which invariants emerge across different life-tapes under GARD dynamics and whether one can predict these invariants from the chemistry specification alone (i.e. GARD’s rates network representing initial conditions). We analysed the communities of the (complex) rates network and asked whether, and how, these communities are related to the species emerging under GARD’s dynamics, and found that the communities correspond to the species emerging from the simulations. Importantly, we show how to use the set of communities detected to predict species emergence without performing any simulations. The analysis developed here may impact complex systems simulations in general.


The Origins of Life (OOL) field attempts to understand the transition from a mixture of life-less molecules to life-full entities, with protocells [1–4] as intermediate (potentially viable) milestones along the non-living to living spectrum [5]. A widely accepted definition of minimal life is: a self-sustaining system capable of undergoing Darwinian evolution [6], while other definitions are often similar (e.g. [7]). A minimally living entity need not be a cell as we know it but could be a much simpler protocell [2, 8–15], i.e. a container with some necessary molecular content. Two major schools tackle the problem of transition from non-life to life: the genetic, or replicator-first approach, and the metabolism-first approach. The replicator-first approach focuses on a single self-perpetuating informational biopolymer, e.g., RNA, as the first step, and it is thus often referred to as the “RNA world” [16–20]. In contrast, the metabolism-first approach [2, 9, 11, 21–23] focuses on a network of chemical reactions among simpler chemical components that became endowed with some reproductive characteristics [2, 8, 9, 11–13].

The RNA world, a widely accepted replicator-first scenario, assumes that a molecule similar or analogous to present day RNA played the role of the self-perpetuating biopolymer [17–19, 24]. The mixture of such molecules is assumed to have later evolved both a metabolic network and an encompassing container. The RNA world draws on RNA’s capability to store (sequential) information and on certain catalytic activities typical of metabolism [25–28].

The metabolism-first scenario, on the other hand, suggests that the very first life precursors are likely to have been relatively elaborate molecular networks of much simpler organic molecules, thus trading the complexity of the building blocks (e.g. RNA) for complexity at the ensemble level. One of the first suggestions of a possible chemical pathway for the emergence of life was made by Oparin, who proposed that it could be manifested by the molecular reactions of relatively simple organic molecules in the primordial soup, interacting with each other to spontaneously form colloidal molecular assemblies (coacervates) [8, 29, 30].

The lipid world scenario for OOL is a variant of the metabolism-first scenario, which considers a complex chemical system consisting of a mixture of mutually interacting simple molecule types which spontaneously form noncovalent assemblies [22, 31]. Importantly, these assemblies store information in the form of non-random molecular compositions, termed compositional information (i.e. the specific ratio of different molecule types that make up the assembly), and pass it to progeny via homeostatic growth accompanied by fission. This information transmission is a function similar to what can be done with sequence-based biopolymers such as RNA/DNA/PNA, except that in this case it is compositional information that is preserved and propagated rather than sequential information. Specifically, compositional replication is the transfer (at least partially) of compositional information from parent to progeny, where the process of information transfer is itself a function of the compositional information in the parent entity [32]. The composition encoded in several chemical systems has been shown to affect their physical properties (i.e. phenotypes), supporting the realism of the lipid world. For example, vesicles’ lipid composition has been shown to affect dye encapsulation efficiency [33] or vesicle structure [34], and genetic programming (“evolutionary algorithms”) has been applied to evolve vesicles’ formulation [35, 36]. More recently it has been suggested that vesicles can “osmotically” couple otherwise decoupled chemical reactions [37].

The GARD kinetic model is a physicochemical simulator within the lipid world scenario [31, 38–40]. The model is based on a matrix (named β) that determines the interactions between different molecular types while the system is kept away from thermodynamic equilibrium by assembly fission (Fig 1). GARD dynamics exhibit quasi-stationary states, which appear in the simulation as faithfully replicating molecular assemblies, termed composomes (for compositional genomes) [38]. Clusters of compositionally similar composomes are called compotypes (for composome types) [41]. These compotypes have been shown to respond to selection [40], exhibit ecology-like population dynamics [42] and exhibit quasispecies behavior including an error-catastrophe-like transition [32], and hence have been interpreted as (emergent) species.

Fig 1. Schematics of GARD’s dynamics.

Different molecule types (represented by different colored circles) aggregate to form assemblies. Aggregation is biased by a matrix of chemical rates (β, Eq 1). Once an assembly reaches a size threshold (Nmax) it splits, and the progeny then continues the growth-split cycles (generations). A composome is an assembly that has high average compositional similarity (see section: The GARD model) to its parent and to one of its children.

The next paragraph presents a more elaborate discussion of selection in GARD, which can be summarised as follows: GARD simulations show compotypes (but note that not every composition is a compotype), and these compotypes can respond to external selection (but not always) by changing their frequencies within a population. Under a very small alphabet size and a very small assembly size this change in frequency seems negligible.

As typical GARD simulations take a constant number of alphabet molecule types and a predefined assembly size, the total number of possible compositions is fixed [32] and the system is not permitted to show true open-ended evolution [43]. In 2010, perhaps the first rigorous attempt at studying evolution in GARD was reported, in the sense of a population responding to an external selection pressure [44]. Unfortunately, the study was based on a single instantiation of a random lognormal matrix, which hinders the ability to draw conclusions from it. Moreover, the study employed parameter values very different from those typically used in GARD (i.e. small alphabet size and small assembly size), and the study did not designate compotype species as targets for selection. A later study used a selection methodology similar to that of the 2010 paper, explored a large number of matrix instances and focused on compotypes as selection targets [40], asking whether compotypes change their frequency within a population as an outcome of external selection. The later study found that GARD systems can respond to selection (but not always), and that this selection response is more favourable when the matrix instance is highly mutualistic (i.e. when off-diagonal values are higher than diagonal values). A recent attempt to extend the 2010 paper by mapping GARD into the quasispecies formalism [45] presents an argument on GARD’s putative limited evolvability. That paper, however, failed to designate compotypes as selection targets, even though it was previously shown that only compotypes can be mapped into quasispecies [32], and it used atypical GARD parameters.

Regardless of selection behaviour, the present paper asks whether the biological diversity that surrounds us would be different if the tape of life were to run again from the start [46–49] under similar initial conditions, and whether adaptations that lead to similar phenotypes follow a quantifiably repeatable route [50]. Some evidence for the convergent nature of evolution can be seen when two separated populations of E. coli, evolved independently for many generations in identical environments, achieved similar fitness [51], or when different populations of lizards from different nearby islands developed into similar ecomorphs independently [52]. Computer models have also been used to study this question [53–58].

In this paper we postulate that if evolutionary diversity is dominated by “invariants” rather than “decorations” then it should be possible to predict the outcome of the evolutionary process without actually waiting for it to happen. That is, it should be possible to predict which species will emerge. In the present paper, this translates to investigating the degree to which the emergence of GARD species, i.e. compotypes, can be analysed in terms of β’s inner organization only (i.e. independent of the dynamics in GARD) (Fig 2). In order to do this, we analysed the community structure of β. Typically in a network representation, nodes symbolize entities (molecules, web pages, people, etc.) and edges are relations between the entities (catalysis, hyperlinks, friendship, etc.). Communities are organizational features in many networks, and are generally defined as sets of nodes more densely interconnected between themselves than to other nodes in the network [59–61]. Community detection algorithms reveal essential internal network organization, and typically try to optimise the ratio of intra-community to inter-community edges across all communities simultaneously.

Fig 2. Overview of the algorithm developed in the present work.

(A) A network (β) is employed in GARD simulations and the emerging compotype species are collected. (B) In parallel, the communities of β are analysed and collected. Finally, (A) and (B) are compared by using the ensemble of detected communities to predict compotypes.

Network science is often fruitfully applied to decipher and understand complex systems, including food-webs [62], metabolic networks [63], gene networks [64], protein networks [65] and different social networks [66, 67]. Such applications of network science, together with previous linear algebra analysis of β [39] and of other networks [68, 69], motivate us to apply such analyses to our system, focusing on how the inner organization of a β affects the nature of observed compotype species. Even though differences exist between replicating polymers and replicating catalytic networks [32], in both cases the model can be represented as a network [13], which encourages understanding how a network’s inner organisation affects the nature of observed species. We showed in [70] that one can predict the best simulation algorithms for systems and synthetic biology models by analysing their network structure. Further, the fact that different β’s lead to GARD simulations with different compotype species provides additional motivation for our current study.

In this paper we use large-scale GARD simulations and data analysis to demonstrate that community analysis allows us to “shortcut” expensive dynamical simulations of a (proto) evolutionary process and predict its invariants, namely, the set of species that can be expected to emerge from such a dynamical system.


The GARD model

GARD describes the growth and fission of a molecular assembly, typically assumed to consist of amphiphilic molecules drawn from a repertoire of NG molecular types [38, 40] (Fig 1). Molecules from the environment can join an assembly and molecules within the assembly can leave it. Once the number of molecules in an assembly reaches a pre-defined size threshold (Nmax), a random fission event takes place and produces two daughter assemblies of the same size (Nmax/2), which can then repeat the growth-fission cycle (Fig 1 shows a scheme of the model, adapted from [32]). This dynamic is described by a set of ordinary differential equations:

$$\frac{dn_i}{dt} = \left(k_f \rho_i N - k_b n_i\right)\left(1 + \frac{1}{N}\sum_{j=1}^{N_G} \beta_{ij} n_j\right) \qquad \text{(Eq 1)}$$

where ni is the current count of molecule type i in an assembly (i = 1..NG), and kf and kb are the basal forward and backward rate constants (assembly joining and leaving, respectively). ρi is the buffered environmental concentration and N is the current assembly size (N = ∑ni). βij is the rate enhancement exerted by an assembly molecule of type j on an incoming or outgoing molecule of type i.
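The growth and fission steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: it assumes that Eq 1 has the standard GARD form dn_i/dt = (kf ρi N − kb ni)(1 + (1/N) Σj βij nj), that ρi is the same for every molecule type, and that fission distributes molecules between the two daughters uniformly at random. The function names `gard_rates` and `random_fission` are hypothetical.

```python
import random


def gard_rates(n, beta, kf=1e-2, kb=1e-4, rho=1e-2):
    """Net rate dn_i/dt per molecule type, ASSUMING the standard GARD form of Eq 1.

    n    -- list of molecule counts, one entry per molecule type
    beta -- square matrix (list of lists); beta[i][j] is the rate enhancement
            exerted by type j on type i
    rho  -- buffered environmental concentration, assumed equal for all types
    """
    size = sum(n)  # current assembly size N
    rates = []
    for i in range(len(n)):
        enhancement = 1.0 + sum(beta[i][j] * n[j] for j in range(len(n))) / size
        rates.append((kf * rho * size - kb * n[i]) * enhancement)
    return rates


def random_fission(n, rng):
    """Split an assembly into two daughters of equal size, molecules drawn at random.

    Returns two count vectors; an odd-sized assembly gives daughters that
    differ in size by one molecule.
    """
    molecules = [i for i, count in enumerate(n) for _ in range(count)]
    rng.shuffle(molecules)
    half = len(molecules) // 2
    daughters = []
    for part in (molecules[:half], molecules[half:]):
        counts = [0] * len(n)
        for molecule_type in part:
            counts[molecule_type] += 1
        daughters.append(counts)
    return daughters
```

In a single-lineage run, one of the two returned daughters would be kept and grown again until it reaches Nmax.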

β can be represented as an NG×NG adjacency matrix for a weighted, directed, asymmetric network with NG nodes and NG² edges. Typically, βij values are drawn from a lognormal distribution [39, 71] (that is, the values ln(βij) are normally distributed with mean = -4 and standard deviation = 4), where different β instances represent different potential environmental prebiotic chemistries [40]. Introducing negative βij values, i.e. inhibition, is expected to result in catalysis as well, via inhibition of an inhibitor [40].
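As a minimal sketch of how such a matrix could be drawn, the following Python function samples every βij independently from a lognormal distribution with the parameters quoted above. The function name `lognormal_beta` is hypothetical, and Python's generator differs from MATLAB's, so a given seed will not reproduce the matrices used in this work.

```python
import random


def lognormal_beta(n_types, seed, mu=-4.0, sigma=4.0):
    """Draw an n_types x n_types rate-enhancement matrix.

    ln(beta[i][j]) is normally distributed with mean mu and standard
    deviation sigma, so every entry is strictly positive.
    """
    rng = random.Random(seed)  # one seed per matrix instance ("chemistry")
    return [
        [rng.lognormvariate(mu, sigma) for _ in range(n_types)]
        for _ in range(n_types)
    ]
```

Calling `lognormal_beta(100, seed)` for a range of seeds would give one matrix per simulated chemistry.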

As mentioned previously, composomes are faithfully replicating assemblies; that is, a composome is an assembly with high similarity to its predecessor and successor (typically compared when both assemblies are at size Nmax). It is important to distinguish composome assemblies from non-composomes (i.e. drifting assemblies), because the latter may appear spontaneously yet are incapable of transmitting compositional information (i.e. the specific ratio of different molecule types): once a non-composome assembly reaches the critical size triggering the fission event (Nmax), its compositional information is not preserved in the daughter assemblies and hence is lost. Composomes are grouped into compotypes using the k-means clustering algorithm, based on compositional similarity as a distance measure (see section: Compotype-community assignment), by picking the k which gives the highest silhouette [41]. A compotype is thus represented by a vector constituting the center of mass of all its member assemblies and is interpreted as a GARD species.

GARD simulations

The GARD model was run using a stochastic kinetic Monte Carlo simulation based on Gillespie’s algorithm [72], using parameter values identical to those employed in previous studies [32, 40, 42]: kf = 10⁻², kb = 10⁻⁴, ρi = 10⁻², Nmax = 10² and NG = 10², for 5,000 growth-split cycles (generations). Calculations were executed using MATLAB version R2015a. A large set of 10,000 GARD simulations was generated, all with the above parameters, and each with a different β, created by MATLAB’s pseudorandom number generator with seeds 1–10,000. Each of these β’s represents a different chemistry that might lead to the emergence of one or more compotypes.
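For readers unfamiliar with Gillespie's algorithm, a single step of the direct method can be sketched as follows. This is a generic illustration rather than the simulation code used here: `propensities` stands for the list of current event rates (for example, one join and one leave rate per molecule type), and the function name `gillespie_step` is hypothetical.

```python
import math
import random


def gillespie_step(propensities, rng):
    """One step of Gillespie's direct method.

    Returns (index of the event that fires, waiting time until it fires).
    The event is chosen with probability proportional to its propensity and
    the waiting time is exponentially distributed with rate sum(propensities).
    """
    total = sum(propensities)
    if total <= 0.0:
        raise ValueError("no event can occur: all propensities are zero")
    # rng.random() lies in [0, 1), so the argument of log lies in (0, 1].
    waiting_time = -math.log(1.0 - rng.random()) / total
    threshold = rng.random() * total
    cumulative = 0.0
    for index, propensity in enumerate(propensities):
        cumulative += propensity
        if threshold < cumulative:
            return index, waiting_time
    # Reached only through floating-point rounding; fall back to the last event.
    return len(propensities) - 1, waiting_time
```

A full simulation would recompute the propensities after each event, update the assembly composition, and trigger fission once the assembly size reaches Nmax.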

In the basic form employed for this paper, GARD was run in a single-lineage mode, where at each split event only one progeny (picked at random) is followed and the other one is discarded (Fig 1). For each simulation under a given β, composomes were identified and clustered into compotypes.

Simulations give rise to the emergence of various compotype species as a result of the different chemistries represented by different β‘s. The number of compotypes observed in each simulation typically ranged from 1–6, with a total of 20,235 compotypes observed in 10,000 simulations performed (3 simulations out of those failed and were therefore discarded).

We provide the MATLAB code and datasets used in this work (see S1 File (Supporting Information) and reference [73]).

Community detection algorithms

A community detection algorithm was run on each β, and the list of nodes (molecule types) belonging to each community was recorded for each β. Three different algorithms were used, each with its default parameters: Louvain (MATLAB version) [74], Infomap (version 0.18.2) [75] and OSLOM (Order Statistics Local Optimization Method) (version 2.5) [76].

Louvain is a heuristic method to find communities [74]. This method starts by assigning each node to its own community. Then, a node m is added to the community of node n only if this results in increased modularity value. m and n pairs are picked to give the highest increase. This is continued until no increase in modularity is gained by joining nodes. Next, a new network is created, whose nodes correspond to the previously found communities and whose edges are the respective sum of the previous edges between communities. This entire process is repeated until no further increase in modularity is possible.
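The quantity that Louvain tries to increase, modularity, can be computed directly for a given partition. The sketch below is a simplification and not the implementation cited above: it assumes an undirected weighted network given as a symmetric adjacency matrix, whereas β is directed and asymmetric. The function name `modularity` and the toy graph are illustrative only.

```python
def modularity(adjacency, labels):
    """Newman modularity of a partition of an undirected weighted network.

    adjacency -- symmetric matrix (list of lists) of edge weights
    labels    -- labels[i] is the community label of node i
    """
    n = len(adjacency)
    strength = [sum(row) for row in adjacency]  # weighted degree of each node
    two_m = sum(strength)                       # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adjacency[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Toy example: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    toy[a][b] = toy[b][a] = 1.0

q_triangles = modularity(toy, [0, 0, 0, 1, 1, 1])   # 5/14, about 0.357
q_singletons = modularity(toy, [0, 1, 2, 3, 4, 5])  # negative
```

Grouping each triangle into one community gives a higher modularity than leaving every node on its own, which is the kind of improvement a Louvain pass accepts.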

Infomap is based on flow and encoding [75]. This method first simulates a random walk along the network, biased by the edges' weights. These random walks are then encoded into a binary string in a way that reflects how frequently adjacent nodes are visited, rather than creating a maximally compressible binary string. This is done in a two-level description whereby a community of nodes where the walker has spent long periods of time receives a unique code, but the nodes within a community receive non-unique codes that can be repeated in other communities’ nodes. In other words, the random walk is efficiently encoded in a way in which important structures (communities) retain unique codes.

OSLOM finds clusters which are statistically significant with respect to a random network with characteristics similar to those of the actual network [76]. This method begins by randomly picking a node as the first community, and additional nodes are added to this community if they are considered significant in the statistical sense. This is then repeated with other nodes until all communities are found.

Results and discussion

GARD tapes

In order to understand whether convergent evolution occurs under GARD dynamics, simulation runs were repeated 10 times under a given β, with a different random seed (and hence a different initial assembly) each time. Each repeated run is regarded as a GARD “tape” (analogous to replaying the tape of life under the same chemistry, i.e. the same β). The history of each tape was recorded (i.e. the content of each assembly) and compotypes were identified for each tape (i.e. by k-means clustering). Fig 3 (panels A1-C1) shows individual examples of GARD tapes (further examples are referenced in the caption of Fig 3). These panels show the content of assemblies from the different tapes, where different assemblies are plotted along the X axis and the NG molecule types of each assembly are given along the Y axis, with color representing the count of a molecule type in an assembly. While the detailed histories of various tapes under a given β are different, they generally show similar trends (invariants), represented by the horizontal lines. Further, different tapes from the same β exhibit the same number of compotypes (in 85% of cases studied for this part, see Fig A in S1 File), and, importantly, those compotypes are extremely similar between different tapes (Fig B in S1 File), signifying that GARD dynamics display convergent evolution. In other words, even if different GARD tapes portray different histories (decorations) under the same β, they give rise to very similar compotype species (invariants), and thus it becomes relevant to ascertain to what degree it is possible to predict the emergence of these species from the underlying chemistry alone, i.e., ignoring the dynamical process that generates the species. Because GARD exhibits convergent evolution, in the next sections only a single tape will be simulated for each β, but in return a large number of different β’s will be employed.

Fig 3. Examples of GARD simulations under different β’s.

(A1-C1) Histories of different tapes; for each tape, assemblies from different generations are plotted along the X axis, and color represents the counts of each of the NG molecule types in each assembly (recorded at assembly size Nmax; Fig 1). Tapes are separated by a vertical black line. For each tape, the first 1,000 assemblies are shown. Red color represents counts ≥ 50, and for brevity counts < 5 are colored white. (A2-C2) Density plots; for each assembly shown in panels (A1-C1), its Euclidean distance and angle vs. the eigenvector of the full-β were calculated (normalized by the maximum value between two assemblies for distance, and by 90 degrees for angle). Color is the normalized probability (log10 scale) of an assembly having a certain angle and distance. See section: Compotype-community assignment. (A3-C3) Same as (A2-C2), except that for each assembly the distance and angle are calculated against the one eigenvector of β* which has the lowest angle to this assembly. Number of Infomap communities detected is: 9 (A3), 7 (B3) and 6 (C3). Further examples are available in [73].

Communities detection

This section presents how community detection algorithms were applied to the β network, and the next section presents how the detected communities were related to the emergence of species in the prebiotic evolution model (GARD) (Fig 2). In order to adequately compare a detected community to an observed compotype, one needs to convert a community, which is a set of molecule types (nodes and their links), to a composition, that is, the ratios between those molecule types. This composition can then be directly compared with the composition of a compotype. To detect communities within different β matrices, each of the three algorithms used (Louvain [74], Infomap [75] and OSLOM [76]) was run on each β, and the list of nodes (molecule types) belonging to each community was recorded for each β.

Each of the three algorithms always detected several communities (>1) in each of the 10,000 different β’s studied here (Fig 4A and 4B). The Louvain algorithm detected on average fewer communities than Infomap or OSLOM. Interestingly, both OSLOM and Infomap detected similar numbers of communities, even though OSLOM allows for overlaps (i.e. molecules belonging to more than 1 community). The latter suggests that a detection algorithm may sometimes consider two overlapping communities as one, if overlaps are allowed. In GARD these overlaps are suggested to be the facilitators of species interconverting into each other, a phenomenon best seen in GARD populations [42].

Fig 4. Communities in β’s.

(A) Histogram of the total number of communities detected in each network, for the 3 algorithms. Frequency is given out of the 10⁴ β’s studied here. (B) Average occurrence of community sizes. An occurrence of 10⁻⁴ means that this community size appeared in only 1 β out of the 10,000 studied here, and an occurrence of 1 means that on average each β has one community with this size. Inset shows the occurrence of sizes > 30. (C) Average number of communities detected vs. number of observed compotype species shows no correlation. Vertical bars mark standard deviation. (D) Histogram of the size of assigned communities, when a community is assigned to a compotype based on eigenvector similarity (see section: Compotype-community assignment). Mean and standard deviation are given in Table 1.

Different simulations under different β’s give rise to different compotypes, which calls for a search for a link between the inner structure of a β and the compotypes emerging in a simulation under this β. However, as the average number of communities detected in a β is higher than the average number of compotypes observed in a simulation under this β, and no correlation exists between the number of communities and the number of compotypes (Fig 4C), finding such a link is not trivial. The next section will discuss a methodology for community-to-compotype assignment and prediction.

Compotype-community assignment

In order to perform such a comparison, the compotypes observed in each β-dependent simulation were collected (these will sometimes be referred to as original compotypes) and, separately, the communities detected in each β were collected (respectively corresponding to (A) and (B) in Fig 2). Then, for each detected community in each β, a matrix β* is created with elements βij*:

$$\beta^{*}_{ij} = \begin{cases} \beta_{ij} & \text{if } i \in C \text{ and } j \in C \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq 2)}$$

where C is the set of the indices of all nodes (molecule types) that belong to a community, i and j are nodes’ indices and βij are elements of β (Eq 1). β* has the same dimensions as β. That is, β* is a sparser version of the β matrix in which only pairs of molecule types that belong to a community can interact (all other rates are set to zero). This particular formulation of β* was picked such that its eigenvectors will have the same dimensionality as the original compotypes. Next, linear algebra is used on β*.
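The construction of β* described above amounts to masking β with the community membership. A minimal Python sketch follows; the function name `community_submatrix` is hypothetical and the matrix is represented as a plain list of lists.

```python
def community_submatrix(beta, community):
    """Return beta*, a matrix with the same dimensions as beta.

    Entries beta[i][j] are kept only when both i and j belong to the
    community; all other entries are set to zero.
    """
    members = set(community)
    n = len(beta)
    return [
        [beta[i][j] if (i in members and j in members) else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Because the dimensions are preserved, an eigenvector of the returned matrix has one entry per molecule type and can be compared directly with a compotype vector.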

According to the Perron-Frobenius theorem, a matrix such as β* or β has a nondegenerate largest real eigenvalue with a corresponding eigenvector whose elements are all non-negative [77, 78]. Indeed, an eigenvector analysis of all the β*’s and β’s studied here showed that only a single non-negative eigenvector exists for each. It is tempting to consider an eigenvector with all non-negative elements as representing a molecular composition (as is sometimes done [39, 77, 79]), analogous to a compotype. A vector with some negative elements, representing negative molecular counts or concentrations, by definition cannot represent a molecular composition. Because GARD simulations can exhibit more than 1 compotype (Fig 4C), it is unclear how the single eigenvector of β relates to the observed compotypes, and the same can be said about the communities. What follows presents a method to successfully predict the content (i.e. composition) of all compotypes observed in a simulation under a given β, given only the ensemble of communities of that β.

The Perron-Frobenius theorem was applied to all β* and the eigenvectors were recorded. Exploring the role of communities in the actual GARD dynamics, the angle and distance between each assembly during a simulation to the eigenvector of the full-β and to the eigenvector of β* were calculated (Fig 3 panels A2-C2 and A3-C3, respectively). Indeed, the assemblies show a lower angle to β* than to β (see also Fig C in S1 File), symbolizing the significance of communities in analysing GARD’s dynamics.
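One simple way to obtain the non-negative (Perron) eigenvector of a non-negative matrix is power iteration, sketched below. This is an assumption about a possible numerical route, not a description of the eigen-solver actually used in this work; the function name `dominant_eigenvector` and its defaults are illustrative.

```python
import math


def dominant_eigenvector(matrix, iterations=1000, tol=1e-12):
    """Approximate the dominant right eigenvector by power iteration.

    Starting from a strictly positive vector keeps every iterate non-negative
    when the matrix itself is non-negative. The result has unit Euclidean
    norm. Returns the last iterate if the tolerance is not reached.
    """
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        product = [x / norm for x in product]
        if max(abs(a - b) for a, b in zip(product, vector)) < tol:
            return product
        vector = product
    return vector
```

For a β* built from one community, the entries of the returned vector that correspond to molecule types outside the community are exactly zero, since their rows in β* are zero.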

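Each such eigenvector of β* is compared with each compotype, using the cosine of the angle between the two vectors, as typically applied in GARD studies [38, 41, 44, 80]:

$$H = \frac{\mathbf{v} \cdot \mathbf{c}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{c} \rVert} \qquad \text{(Eq 3)}$$

where v denotes the eigenvector of β* and c the compotype vector.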

H measures how well an eigenvector matches a compotype’s content (i.e. composition), where a value of 1.0 means identical compositions (i.e. one vector is the other vector multiplied by a positive scalar). Each compotype is then assigned the community that gives rise to the highest H.
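The assignment rule can be sketched as follows, assuming that H is the ordinary cosine similarity between two vectors (normalized dot product). The function names `cosine_similarity` and `assign_communities` are hypothetical, and the input vectors are plain lists with one entry per molecule type.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed form of H)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_communities(compotypes, eigenvectors):
    """For each compotype, pick the community eigenvector with the highest H.

    compotypes   -- list of compotype vectors
    eigenvectors -- list of eigenvectors, one per detected community
    Returns a list of (community index, H) pairs, one per compotype.
    """
    assignments = []
    for compotype in compotypes:
        scores = [cosine_similarity(compotype, vec) for vec in eigenvectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        assignments.append((best, scores[best]))
    return assignments
```

Note that under this rule two compotypes may be assigned to the same community, and communities that match no compotype best remain unassigned.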

Fig 5 shows, out of all the H values between the communities’ eigenvectors and the original compotypes, the percentage of particularly high values (H>0.8). Full histograms are given in Fig D in S1 File. When multiple compotypes were observed in a simulation, the eigenvectors of β* showed a high degree of similarity to all compotypes, whereas the eigenvector of the full-β showed much lower similarity values (Table 1). Only in the limiting case, when only a single compotype is produced by the simulation, did the eigenvector of the full-β show high similarity to that compotype. Two-sample Kolmogorov-Smirnov tests were performed, with the null hypothesis that the similarities with respect to the full-β are from the same continuous distribution as the similarities with respect to β*, against the alternative hypothesis that they are from different continuous distributions. The Kolmogorov-Smirnov tests were repeated for the cases of single and multiple compotypes, for each of the three community detection algorithms (that is, 6 tests in total). All the tests rejected the null hypothesis at an alpha level that is essentially zero. Further, when taking into account the overall dataset (that is, without distinguishing between cases with single or multiple compotypes), the majority of β* showed substantial similarity to their original compotypes, with more than 60% of cases showing H>0.8 (Fig 5). The overall high degree of similarity achieved across all three community detection algorithms indicates that the communities are able to successfully predict the composition of compotypes, while the eigenvector of β may represent something else (see S1 File, section: On the eigenvector of the full-β). Thus, compotype species can be successfully predicted based only on the complex chemistry that is in a β. A test was performed to ascertain whether a better community-to-compotype assignment and prediction could be achieved at random. The test measured (for each community-to-compotype assignment) the probability of achieving higher H values with a random community, i.e. a community with the same size as the assigned community but with different molecule types. The test was repeated 10³ times for each assigned community. The test showed that it is highly unlikely to achieve better H values by random community assignment (Table 1, and Fig E in S1 File).
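The random-community test can be sketched as a simple resampling loop. The sketch below makes two assumptions that are not stated in the text: random communities are drawn uniformly from all NG molecule types, and "different molecule types" is interpreted as "not the identical set of types". The callable `score` is a hypothetical stand-in for the full pipeline (build β*, take its eigenvector, compute H against the compotype), and the function name `fraction_random_better` is illustrative.

```python
import random


def fraction_random_better(score, assigned, n_types, trials, rng):
    """Estimate how often a random community of equal size scores higher.

    score    -- callable mapping a list of node indices to a similarity value
    assigned -- node indices of the community assigned to the compotype
    n_types  -- total number of molecule types (nodes) in the network
    trials   -- number of random communities to draw
    """
    size = len(assigned)
    if size >= n_types:
        raise ValueError("no different community of this size exists")
    assigned_set = frozenset(assigned)
    assigned_score = score(assigned)
    better = 0
    for _ in range(trials):
        candidate = rng.sample(range(n_types), size)
        # ASSUMPTION: redraw if the random community equals the assigned one.
        while frozenset(candidate) == assigned_set:
            candidate = rng.sample(range(n_types), size)
        if score(candidate) > assigned_score:
            better += 1
    return better / trials
```

A returned fraction close to zero would indicate that the assigned community is unlikely to be outperformed by chance.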

Fig 5. Bar plot of the percentage of high compositional similarity (H, Eq 3) when predicting compotypes using the eigenvectors of β* (Eq 2) vs. full-β.

Percentage is given out of the total number of compotypes observed under all β networks. Mean and standard deviation are given in Table 1 and full histograms are given in Fig D in S1 File.

Finally, it is important to verify whether β* indeed represents a meaningful chemistry that can give rise to a compotype species under GARD’s stochastic dynamics (Eq 1). To this end, GARD simulations were repeated with exactly the same parameters (see section: GARD simulations), and with β* for each assigned community rather than with the full β. Compotype identification in the new simulations was performed exactly as before (i.e., k-means clustering) and the compositional similarity to the original compotype was calculated (Fig 6 ‘Original’). A high similarity to the original compotype was always obtained, corroborating the community detection algorithms’ ability to detect the communities which serve as the ‘invariant content’ of GARD’s compotypes. In [81], the authors analysed stochastic Kauffman-like dynamics via the introduction of a temporal window in order to determine which part of their reaction network is currently active. The novelty of the present paper, however, lies in making such a determination a priori, based on the network topology.

Fig 6. Box plots of compositional similarity, for the three community detection algorithms (Louvain, top; Infomap, middle; OSLOM, bottom).

Similarity was measured in three cases: ‘Original’, when comparing the original compotype observed vs. the one in GARD under β* of its assigned community; ‘Assigned’, when comparing the compotype observed in a GARD simulation with β* of its assigned community to the eigenvector of β*; ‘Rest’, analogous to ‘Assigned’, only with the β* of communities that were not assigned to original compotypes. Means and standard deviations for ‘Original’ are, for the three algorithms respectively: 0.849±0.165, 0.856±0.145 and 0.813±0.216.

As it is presently impossible to determine a priori the number of compotype species that will be observed, the algorithm for compotype-community assignment presented above is required to address all compotypes. However, it was previously shown that an excess of mutual-interactions over self-interactions in β (i.e. βij over βii) is a necessary but insufficient condition for a high number of compotypes [40].

On the nature of non-assigned communities

Lastly, we ask why some communities successfully predict compotypes while others do not, and whether there are differences between those communities. It was previously suggested that compotype dynamics are somehow related to the compartments formed by high βij values [44]. The morphology of the communities assigned to compotypes seems to be different from that of those which were not assigned (Fig 7), which may begin to point to the nature of the differences between the assigned and non-assigned detected communities. Additionally, the similarity between the eigenvector of β* and the compotype from GARD under β* was calculated, both for the assigned and non-assigned communities. It was found that this similarity is much higher for the assigned communities than for the non-assigned ones (Fig 6 ‘Assigned’ and ‘Rest’). This last result suggests that the dynamics of the non-assigned communities are fundamentally different from those of the assigned ones, in the sense that the former are less likely to exhibit faithful replication. An investigation is under way to further understand those differences, which may prove critical for reverse-engineering, i.e. the design of a β network that gives rise to specific and desired compotype dynamics.

Fig 7. Network topology for assigned communities and rest, for the three community-detection algorithms (Louvain, top; Infomap, middle; OSLOM, bottom).

(left) Node-betweenness-centrality [82], normalized by dividing by (n-1)*(n-2), where n is the number of nodes in a community. (right) Clustering-coefficient [83]. Parameters were calculated using [84].
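The normalization used for node betweenness centrality can be made concrete with a small sketch. The code below implements Brandes' algorithm [82] for an unweighted directed graph and divides by (n-1)*(n-2); it is a simplified illustration, not the toolbox routine of [84], and it ignores edge weights, which the original analysis may have used. The adjacency-dictionary input format and the function names are assumptions made for this example.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted directed graph.
    `adj` maps every node to a list of its successors; every successor
    must itself appear as a key."""
    nodes = list(adj)
    bc = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0.0)
        sigma[s] = 1.0
        dist = dict.fromkeys(nodes, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = dict.fromkeys(nodes, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


def normalized_betweenness(adj):
    """Divide raw betweenness by (n-1)*(n-2), n being the number of nodes."""
    n = len(adj)
    raw = betweenness_centrality(adj)
    if n < 3:
        return dict.fromkeys(raw, 0.0)
    scale = (n - 1) * (n - 2)
    return {v: b / scale for v, b in raw.items()}


# Directed path a -> b -> c: only b lies between another pair of nodes.
path = {"a": ["b"], "b": ["c"], "c": []}
centrality = normalized_betweenness(path)
```

The factor (n-1)*(n-2) is the number of ordered pairs of other nodes in a directed graph, so the normalized value lies between 0 and 1 and can be compared across communities of different sizes.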


The GARD model performs biased and far from equilibrium random walks on a network that has previously been linked to pre-biotic evolutionary dynamics. Via community analyses, we were able to bypass the dynamic trajectories of the stochastic simulator and use the ensemble of detected communities to predict the emergence of (proto) species of this system as well as their invariant content. Interestingly, the morphology of assigned communities is different from that of non-assigned ones, which deserves further scrutiny in order to understand the nature of this difference, how the various topological characteristics affect dynamics, and the precise role of those un-assigned communities.

We have used the eigenvector of β* to predict compotypes and corroborated this by performing GARD dynamics under β*, finding that the GARD-dynamics approach gives rise to a compotype more similar to the original one (the original compotype observed under the full-β). In other words: using β*, GARD-dynamics are ‘closer to the truth’. This is both non-intuitive and interesting, because the eigenvector approach does not employ GARD’s stochastic dynamics, and the latter are expected to introduce some variation in the compotype content. If we treat the observation of species in GARD’s dynamics as the ground truth, analogous to how species are observed in nature, then this indicates that the theoretical prediction using the eigenvector is imperfect (but still very good), probably because the eigenvector method takes into account only β and not the full physico-chemical details of the GARD model, such as the reversibility of assembly-joining.

For tractability, the present manuscript kept to the definition and identification of compotype species as they have traditionally been used in the GARD and lipid world literature [32, 40–42, 80]. We would like to argue in favour of rethinking species identification, as follows. We speculate that the un-assigned communities represent either assemblies that are unable to faithfully replicate or compotype species that are very rare. The latter may require an even larger-scale simulation analysis than the one we have done here, involving more runs and longer simulation times, before these rare species could be observed. Any species identification algorithm developed must, critically, acknowledge faithful replication. As it is presently impossible to determine a priori the number of compotype species that will be observed in a simulation under a given β network, we are in the process of extending the current paper in order to precisely predict the expected number of compotype species under a given β without running simulations. The community count provides an upper limit for the species count, and the community eigenvectors, even if somewhat numerous, still strongly narrow the search for compotypes.

Our heuristic approach gave very similar results across all three community-detection algorithms we used, which lends robustness to our findings. Future extensions of this work will apply the species-prediction algorithm developed herein to multiple dynamical models and their emergent species (or species equivalents), as well as to larger, more realistic networks, in order to assess the generality of the algorithm presented here.

Supporting information

S1 File. Supplementary data and figures associated with this article.



We thank Harold Fellermann and Pawel Widera for discussions, and Jean-Loup Guillaume for making available Louvain’s MATLAB code.


  1. Rasmussen S. Protocells: Mit Press; 2009.
  2. Dyson F. Origins of life: Cambridge University Press; 1999.
  3. Powner MW, Sutherland JD. Prebiotic chemistry: a new modus operandi. Philos T R Soc B. 2011;366(1580):2870–7. WOS:000294993100003. pmid:21930577
  4. Walker SI, Davies PCW. The algorithmic origins of life. J R Soc Interface. 2013;10(79). WOS:000331118500006. pmid:23235265
  5. Cronin L, Krasnogor N, Davis BG, Alexander C, Robertson N, Steinke JH, et al. The imitation game—a computational chemical approach to recognizing life. Nature biotechnology. 2006;24(10):1203–6. pmid:17033651
  6. Benner SA. Defining life. Astrobiology. 2010;10(10):1021–30. pmid:21162682
  7. Trifonov EN. Vocabulary of Definitions of Life Suggests a Definition. J Biomol Struct Dyn. 2011;29(2):259–66. ISI:000294884000004. pmid:21875147
  8. Oparin AI. Origin and evolution of metabolism. Comparative biochemistry and physiology. 1962;4:371–7. Epub 1962/10/01. pmid:13940205
  9. Anet FA. The place of metabolism in the origin of life. Current Opinion in Chemical Biology. 2004;8(6):654–9. ISI:000225782300014. pmid:15556411
  10. Szathmary E, Santos M, Fernando C. Evolutionary potential and requirements for minimal protocells. Top Curr Chem. 2005;259:167–211. WOS:000234567400005.
  11. Shapiro R. Small molecule interactions were central to the origin of life. Quarterly Review of Biology. 2006;81(2):105–25. ISI:000237887600001. pmid:16776061
  12. Cleaves H. Prebiotic Chemistry: Geochemical Context and Reaction Screening. Life. 2013;3(2):331. pmid:25369745
  13. Kauffman SA. The origins of order: Self organization and selection in evolution: Oxford university press; 1993.
  14. Szathmary E, Smith JM. From replicators to reproducers: the first major transitions leading to life. J Theor Biol. 1997;187(4):555–71. WOS:A1997XR80400010. pmid:9299299
  15. Chen IA, Walde P. From Self-Assembled Vesicles to Protocells. Csh Perspect Biol. 2010;2(7). WOS:000279883100009. pmid:20519344
  16. Bernhardt HS. The RNA world hypothesis: the worst theory of the early evolution of life (except for all the others). Biology direct. 2012;7:23. Epub 2012/07/17. PubMed Central PMCID: PMC3495036. pmid:22793875
  17. Gesteland FR, Cech RT, Atkins FJ. The RNA world. Cold Spring: Cold Spring Harbor Laboratory; 1999. 709 p.
  18. Gilbert W. Origin of Life—the RNA World. Nature. 1986;319(6055):618–. ISI:A1986A079600021.
  19. Joyce GF. The antiquity of RNA-based evolution. Nature. 2002;418(6894):214–21. ISI:000176710400049. pmid:12110897
  20. Orgel LE. Prebiotic chemistry and the origin of the RNA world. Critical Reviews in Biochemistry and Molecular Biology. 2004;39(2):99–123. ISI:000222588800002. pmid:15217990
  21. Luisi PL, Walde P, Oberholzer T. Lipid vesicles as possible intermediates in the origin of life. Current Opinion in Colloid & Interface Science. 1999;4(1):33–9. ISI:000080941800005.
  22. Segre D, Ben-Eli D, Deamer DW, Lancet D. The lipid world. Origins Life Evol B. 2001;31(1–2):119–45. ISI:000167737900008.
  23. Aono M, Kitadai N, Oono Y. A Principled Approach to the Origin Problem. Origins of life and evolution of the biosphere: the journal of the International Society for the Study of the Origin of Life. 2015;45(3):327–38. pmid:26177711; PubMed Central PMCID: PMC4510921.
  24. Hanczyc MM, Fujikawa SM, Szostak JW. Experimental models of primitive cellular compartments: encapsulation, growth, and division. Science. 2003;302(5645):618–22. Epub 2003/10/25. pmid:14576428.
  25. Kruger K, Grabowski PJ, Zaug AJ, Sands J, Gottschling DE, Cech TR. Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell. 1982;31(1):147–57. Epub 1982/11/01. pmid:6297745.
  26. Lincoln TA, Joyce GF. Self-Sustained Replication of an RNA Enzyme. Science. 2009;323(5918):1229–32. ISI:000263687600043. pmid:19131595
  27. Hayden EJ, Lehman N. Self-assembly of a group I intron from inactive oligonucleotide fragments. Chem Biol. 2006;13(8):909–18. ISI:000240329800015. pmid:16931340
  28. Vaidya N, Manapat ML, Chen IA, Xulvi-Brunet R, Hayden EJ, Lehman N. Spontaneous network formation among cooperative RNA replicators. Nature. 2012;491(7422):72–7. ISI:000310434500032. pmid:23075853
  29. Lazcano A. Historical Development of Origins Research. Cold Spring Harb Perspect Biol. 2010;2(11). ISI:000283646900001. pmid:20534710
  30. Oparin AI. Evolution of Concepts of Origin of Life, 1924–1974. Origins Life Evol B. 1976;7(1):3–8. WOS:A1976CE36200001.
  31. Segre D, Lancet D. Composing life. Embo Rep. 2000;1(3):217–22. ISI:000165766200009. pmid:11256602
  32. Gross R, Fouxon I, Lancet D, Markovitch O. Quasispecies in population of compositional assemblies. BMC evolutionary biology. 2014;14:265. pmid:25547629; PubMed Central PMCID: PMC4357159.
  33. Maurer SE, Deamer DW, Boncella JM, Monnard PA. Chemical Evolution of Amphiphiles: Glycerol Monoacyl Derivatives Stabilize Plausible Prebiotic Membranes. Astrobiology. 2009;9(10):979–87. WOS:000273181200006. pmid:20041750
  34. Vequi-Suplicy CC, Riske KA, Knorr RL, Dimova R. Vesicles with charged domains. Bba-Biomembranes. 2010;1798(7):1338–47. WOS:000279365000008. pmid:20044978
  35. Theis M, Gazzola G, Forlin M, Poli I, Hanczyc MM, Bedau M. Optimal formulation of complex chemical systems with a genetic algorithm. ECCS06 online Proceedings (P193). Oxford; 2009.
  36. Gutierrez JMP, Hinkley T, Taylor JW, Yanev K, Cronin L. Evolution of oil droplets in a chemorobotic platform. Nat Commun. 2014;5. WOS:000347223500001. pmid:25482304
  37. Shirt-Ediss B, Sole RV, Ruiz-Mirazo K. Emergent chemical behavior in variable-volume protocells. Life (Basel). 2015;5(1):181–211. pmid:25590570; PubMed Central PMCID: PMC4390847.
  38. Segre D, Ben-Eli D, Lancet D. Compositional genomes: prebiotic information transfer in mutually catalytic noncovalent assemblies. Proceedings of the National Academy of Sciences of the United States of America. 2000;97(8):4112–7. pmid:10760281; PubMed Central PMCID: PMC18166.
  39. Segre D, Shenhav B, Kafri R, Lancet D. The molecular roots of compositional inheritance. J Theor Biol. 2001;213(3):481–91. pmid:11735293.
  40. Markovitch O, Lancet D. Excess mutual catalysis is required for effective evolvability. Artificial life. 2012;18(3):243–66. pmid:22662913.
  41. Shenhav B, Oz A, Lancet D. Coevolution of compositional protocells and their environment. Philos T R Soc B. 2007;362(1486):1813–9. ISI:000249516700009. pmid:17510019
  42. Markovitch O, Lancet D. Multispecies population dynamics of prebiotic compositional assemblies. J Theor Biol. 2014;357:26–34. pmid:24831416.
  43. Markovitch O, Sorek D, Lui LT, Lancet D, Krasnogor N. Is there an optimal level of open-endedness in prebiotic evolution? Origins of life and evolution of the biosphere: the journal of the International Society for the Study of the Origin of Life. 2012;42(5):469–73; discussion 74. pmid:23114973.
  44. Vasas V, Szathmáry E, Santos M. Lack of evolvability in self-sustaining autocatalytic networks constraints metabolism-first scenarios for the origin of life. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(4):1470–5. ISI:000273974600045. pmid:20080693
  45. Vasas V, Fernando C, Szilagyi A, Zachar I, Santos M, Szathmary E. Primordial evolvability: Impasses and challenges. J Theor Biol. 2015;381:29–38. pmid:26165453.
  46. Gould SJ. Wonderful Life: The Burgess Shale and the Nature of History (Book). Library Journal. 1989;114(14):214–.
  47. Kauffman S. At home in the universe: The search for the laws of self-organization and complexity: Oxford university press; 1996.
  48. Beatty J. Replaying life's tape. The Journal of philosophy. 2006;103(7):336–62.
  49. Orgogozo V. Replaying the tape of life in the twenty-first century. Interface focus. 2015;5(6):20150057. pmid:26640652; PubMed Central PMCID: PMC4633862.
  50. Lobkovsky AE, Koonin EV. Replaying the tape of life: quantification of the predictability of evolution. Frontiers in genetics. 2012;3:246. pmid:23226153; PubMed Central PMCID: PMC3509945.
  51. Travisano M, Mongold JA, Bennett AF, Lenski RE. Experimental tests of the roles of adaptation, chance, and history in evolution. Science. 1995;267(5194):87–90. pmid:7809610.
  52. Losos JB, Jackman TR, Larson A, Queiroz K, Rodriguez-Schettino L. Contingency and determinism in replicated adaptive radiations of island lizards. Science. 1998;279(5359):2115–8. pmid:9516114.
  53. Fontana W, Buss LW. What would be conserved if "the tape were played twice"? Proceedings of the National Academy of Sciences of the United States of America. 1994;91(2):757–61. pmid:8290596; PubMed Central PMCID: PMC43028.
  54. Szathmary E. A classification of replicators and lambda-calculus models of biological organization. Proceedings Biological sciences / The Royal Society. 1995;260(1359):279–86. pmid:7630896.
  55. Taylor T, Hallam J. Replaying the tape: An investigation into the role of contingency in evolution. From Anim Animat. 1998:256–65. WOS:000075924900029.
  56. Bedau M. The scientific and philosophical scope of artificial life. Leonardo. 2002;35(4):395–400. WOS:000177435800010.
  57. Wagenaar DA, Adami C. Influence of chance, history, and adaptation on digital evolution. Artificial life. 2004;10(2):181–90. WOS:000220379900007. pmid:15107230
  58. Missa O, Dytham C, Morlon H. Understanding how biodiversity unfolds through time under neutral theory. Philosophical transactions of the Royal Society of London Series B, Biological sciences. 2016;371(1691). pmid:26977066; PubMed Central PMCID: PMC4810819.
  59. Fortunato S. Community detection in graphs. Physics Reports. 2010;486(3):75–174.
  60. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Physical review E. 2008;78(4):046110.
  61. Kim Y, Son SW, Jeong H. Finding communities in directed networks. Physical review E, Statistical, nonlinear, and soft matter physics. 2010;81(1 Pt 2):016103. pmid:20365428.
  62. Sole RV, Montoya JM. Complexity and fragility in ecological networks. Proceedings Biological sciences / The Royal Society. 2001;268(1480):2039–45. pmid:11571051; PubMed Central PMCID: PMC1088846.
  63. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000;407(6804):651–4. pmid:11034217.
  64. Bassel GW, Lan H, Glaab E, Gibbs DJ, Gerjets T, Krasnogor N, et al. Genome-wide network model capturing seed germination reveals coordinated regulation of plant cellular phase transitions. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(23):9709–14. pmid:21593420; PubMed Central PMCID: PMC3111290.
  65. Glaab E, Baudot A, Krasnogor N, Valencia A. Extending pathways and processes using molecular interaction networks to analyse cancer genome data. BMC bioinformatics. 2010;11:597. pmid:21144022; PubMed Central PMCID: PMC3017081.
  66. Ferrer ICR, Sole RV. The small world of human language. Proceedings Biological sciences / The Royal Society. 2001;268(1482):2261–5. pmid:11674874; PubMed Central PMCID: PMC1088874.
  67. Barabasi AL, Jeong H, Neda Z, Ravasz E, Schubert A, Vicsek T. Evolution of the social network of scientific collaborations. Physica A. 2002;311(3–4):590–614. WOS:000177271100023.
  68. Schilling CH, Palsson BO. The underlying pathway structure of biochemical reaction networks. Proceedings of the National Academy of Sciences. 1998;95(8):4193–8.
  69. Jain S, Krishna S. Autocatalytic Sets and the Growth of Complexity in an Evolutionary Model. Physical Review Letters. 1998;81(25):5684–7.
  70. Sanassy D, Widera P, Krasnogor N. Meta-stochastic simulation of biochemical models for systems and synthetic biology. ACS synthetic biology. 2015;4(1):39–47. pmid:25152014.
  71. Limpert E, Stahel WA, Abbt M. Log-normal Distributions across the Sciences: Keys and Clues On the charms of statistics, and how mechanical models resembling gambling machines offer a link to a handy way to characterize log-normal distributions, which can provide deeper insight into variability and probability—normal or log-normal: That is the question. BioScience. 2001;51(5):341–52.
  72. Gillespie DT. General Method for Numerically Simulating Stochastic Time Evolution of Coupled Chemical-Reactions. J Comput Phys. 1976;22(4):403–34. WOS:A1976CQ87900001.
  73. Markovitch O, Krasnogor N. Accompanying dataset for: Predicting Species Emergence in Simulated Complex Pre-Biotic Networks. Zenodo. 2016.
  74. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008;2008(10):P10008.
  75. Rosvall M, Axelsson D, Bergstrom CT. The map equation. The European Physical Journal Special Topics. 2010;178(1):13–23.
  76. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S. Finding statistically significant communities in networks. PloS one. 2011;6(4):e18961. pmid:21559480; PubMed Central PMCID: PMC3084717.
  77. Eigen M, Mccaskill J, Schuster P. Molecular Quasi-Species. J Phys Chem-Us. 1988;92(24):6881–91. ISI:A1988R227300010.
  78. Meyer CD. Matrix analysis and applied linear algebra. Philadelphia: Society for Industrial and Applied Mathematics; 2000. 718 p.
  79. Virgo N, Ikegami T, McGregor S. Complex Autocatalysis in Simple Chemistries. Artificial life. 2016:1–15.
  80. Armstrong DL, Markovitch O, Zidovetzki R, Lancet D. Replication of simulated prebiotic amphiphile vesicles controlled by experimental lipid physicochemical properties. Phys Biol. 2011;8(6). WOS:000298181900003. pmid:21946049
  81. Filisetti A, Graudenzi A, Serra R, Villani M, Füchslin RM, Packard N, et al. A stochastic model of autocatalytic reaction networks. Theory in Biosciences. 2012;131(2):85–93. pmid:21979857
  82. Brandes U. A faster algorithm for betweenness centrality*. Journal of Mathematical Sociology. 2001;25(2):163–77.
  83. Fagiolo G. Clustering in complex directed networks. Physical review E, Statistical, nonlinear, and soft matter physics. 2007;76(2 Pt 2):026107. pmid:17930104.
  84. Rubinov M, Sporns O. Complex network measures of brain connectivity: uses and interpretations. NeuroImage. 2010;52(3):1059–69. pmid:19819337.