Using Likelihood-Free Inference to Compare Evolutionary Dynamics of the Protein Networks of H. pylori and P. falciparum

Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses an MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data of Helicobacter pylori and Plasmodium falciparum. The 80% credible intervals for the duplication–divergence component are [0.64, 0.98] for H. pylori and [0.87, 0.99] for P. falciparum. The remaining parameter estimates are not inconsistent with sequence data. An extensive sensitivity analysis showed that incompleteness of PIN data does not greatly affect the analysis of models of protein network evolution, and that the degree sequence alone barely captures the evolutionary footprints of protein networks relative to other statistics. Our likelihood-free inference approach enables a fully Bayesian analysis of a complex and highly stochastic system that is otherwise intractable at present. In modelling the evolutionary history of PIN data, it transpires that only the simultaneous analysis of several global aspects of protein networks enables credible and consistent inference to be made from available datasets.
Our results indicate that gene duplication has played a larger part in the network evolution of the eukaryote than in the prokaryote, and suggest that single gene duplications with immediate divergence alone may explain more than 60% of biological network data in both domains.


Corollary S.3
The transition probability from G_t to G_{t+1} given the set θ of growth parameters is given by (S.11). If not stated otherwise, the (standard) mixture model in this article is the evolution graph with transition probabilities (S.11), initial graph G_0 that connects two nodes with one link, and stopping time T < ∞. For H. pylori and P. falciparum, we take T = 1500 and T = 5300, respectively.
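As an illustration only, a single growth step in the spirit of a duplication–divergence mechanism can be sketched as follows. This is a hypothetical sketch, not the kernel (S.11): the function name, the exact divergence rule, and the degree-one guard are assumptions made for the example, with δ_Att and δ_Div playing the roles suggested by their names.

```python
import random

def duplication_divergence_step(adj, delta_div, delta_att, rng):
    """Hypothetical growth step: duplicate a random node, let each
    inherited link be lost with probability delta_div, and attach the
    copy to its parent with probability delta_att."""
    parent = rng.choice(sorted(adj))
    child = max(adj) + 1
    adj[child] = set()
    for neighbour in list(adj[parent]):
        if rng.random() > delta_div:         # link survives divergence
            adj[child].add(neighbour)
            adj[neighbour].add(child)
    if rng.random() < delta_att or not adj[child]:
        adj[child].add(parent)               # parent-child attachment;
        adj[parent].add(child)               # also guarantees degree >= 1

# grow the initial two-node, one-link graph G_0 to T = 50 nodes
rng = random.Random(0)
adj = {0: {1}, 1: {0}}
while len(adj) < 50:
    duplication_divergence_step(adj, delta_div=0.4, delta_att=0.3, rng=rng)
print(len(adj))  # 50
```

The degree-one guard mirrors the property, used in the proof of Lemma S.5, that the kernel never produces nodes of degree 0.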
Definition S.4 For a graph G_t and a node v in G_t, denote by G_t(−v) the graph G_t with node v removed. v is said to be removable if P(G_t(−v) | G_t, θ) > 0 for some θ. If G_t contains removable nodes, then it is called reducible, otherwise irreducible. An evolution graph is reducible if P(G_t | G_1) > 0 for some θ and any graph G_t, where G_1 is the graph consisting of one node.
Lemma S.5 Consider the transition kernel P of the mixture model, where we set δ_Att and δ_Div different from 0 and 1, and α < 0. Then, for any nontrivial graph G_t, any node v of G_t is removable.
Proof: Since P does not produce a graph with nodes of degree 0, we may choose a node w ≠ v in G_t. If G_t contains only v and w, then there must be an edge between them and clearly p(G_1 | G_2, v, θ) = 1 − δ_Att > 0. Let us now assume that G_t contains more than two nodes. Consider N := N(v) ∪ N(w) \ {v, w}. Then we may construct G_t(−v) as the graph that results after replacing v, w and N(v), N(w) with their common ancestor u and neighbourhood N.
Applying Lemma S.5 repeatedly gives the following corollary.
Corollary S.6 The mixture model is reducible.
Corollary S.6 guarantees that the mixture model can explain any graph G topologically. In particular, Lemma S.5 is based on properties of DDa, and hence a fortiori DDa can explain any graph topologically. It also shows that every node is removable, and hence there are T! ways to reduce any graph G_T with DDa. This number is significantly smaller for the transition kernel presented in [19], which makes the inference scheme therein computationally feasible.

[Figure caption] PIN test data sets mimicking H. pylori (grown to 1500 nodes and subsampled to 675) are generated with the parameter θ = (0.32, 0.02, 0.15), and the squared errors between each summary and the mean summary are recorded. The frequency of cases in which the squared error exceeds the values on the abscissa is plotted for WR, ND, PL and TRIA. In 20% of all cases, the squared error in TRIA is greater than 1000, while in all cases the squared error in ND is not larger than […].

[Figure caption] The green dashed line indicates proportional decay relative to the sampling fraction and is computed as follows. Denote by ρ_D = 0.24 the observed sampling fraction of the P. falciparum PIN data set, and by S_{θ,ρ} the mean summary of an ensemble of simulated PINs at sampling fraction ρ. Given another sampling fraction ρ′, the slope of the green dashed line is (S_{θ,ρ_D} − (ρ′/ρ_D) S_{θ,ρ_D}) / (ρ_D − ρ′); we considered ρ′ = 0.22, and the intercept is computed analogously. Apart from CC, all other summaries were distorted under decreasing ρ. While ND decreased linearly with the sampling fraction, PL decreased sublinearly and TRIA superlinearly. This indicates that at low sampling fractions, global aspects of PINs are much less distorted than motif counts in biological network data.
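To make the role of such network summaries concrete, here is a minimal sketch of computing a degree-based and a triangle-based statistic from an adjacency-set representation. The implementations are illustrative assumptions; the article's exact definitions of ND and TRIA may differ.

```python
def mean_degree(adj):
    """ND-style summary: average node degree."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def triangle_count(adj):
    """TRIA-style summary: number of triangles, each counted once by
    enumerating ordered node triples v < w < u."""
    tri = 0
    for v, nbrs in adj.items():
        for w in nbrs:
            if w > v:
                tri += sum(1 for u in nbrs & adj[w] if u > w)
    return tri

# a 4-cycle 0-1-2-3-0 with one chord 0-2: two triangles
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(mean_degree(adj))     # 2.5
print(triangle_count(adj))  # 2
```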

Convergence analysis of likelihood-free inference.
MCMC is guaranteed to converge in distribution to the posterior under some regularity conditions. The MCMC output may be used directly to determine the burn-in period, the initial phase during which the Markov chain has not yet converged.
After burn-in, MCMC continues and is taken to sample from the posterior. Such an analysis is particularly important for likelihood-free inference within MCMC because of (i) the nontrivial choice of ε in combination with the set of summaries S, and (ii) the nature of the ABC algorithm.
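The core likelihood-free acceptance step can be sketched generically as follows. This is a simplified ABC-MCMC sketch, not the article's exact scheme (which, e.g., uses tempering and network summaries); all names, the Gaussian random-walk proposal, and the toy model are placeholders.

```python
import math, random

def abc_mcmc(theta0, n_iter, eps, simulate, summarize, s_obs,
             distance, prior_logpdf, proposal_sd, rng):
    """Likelihood-free MCMC: a proposal is accepted only if summaries
    of data simulated under it fall within eps of the observed
    summaries, subject to the prior ratio (symmetric proposal)."""
    chain, theta = [], list(theta0)
    for _ in range(n_iter):
        prop = [t + rng.gauss(0.0, proposal_sd) for t in theta]
        s_sim = summarize(simulate(prop, rng))
        accept = math.exp(min(0.0, prior_logpdf(prop) - prior_logpdf(theta)))
        if distance(s_sim, s_obs) <= eps and rng.random() < accept:
            theta = prop
        chain.append(list(theta))
    return chain

# toy usage: recover the mean of a Normal(mu, 1) from its sample mean;
# for brevity the chain is started near the truth
rng = random.Random(1)
chain = abc_mcmc(
    theta0=[2.0], n_iter=2000, eps=0.2,
    simulate=lambda th, r: [r.gauss(th[0], 1.0) for _ in range(100)],
    summarize=lambda xs: sum(xs) / len(xs),
    s_obs=2.0,
    distance=lambda a, b: abs(a - b),
    prior_logpdf=lambda th: 0.0 if -10.0 < th[0] < 10.0 else -math.inf,
    proposal_sd=0.05, rng=rng,
)
```

The point of the sketch is that no likelihood is evaluated anywhere: the simulate-summarize-compare loop replaces it, which is exactly why the choice of ε and of the summary set S matters.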
Suppose we start many Markov chains at overdispersed initial parameter values. A standard technique to detect that the chains have not yet converged is to compare the variance within each chain against the variance between all chains [41]: if the resulting ratio exceeds 1.2, then convergence is usually rejected. In this case we argue that the independent chains have not explored the state space sufficiently well, and must be run longer.
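The within/between-chain comparison of [41] can be sketched as the Gelman–Rubin potential scale reduction factor; this is a textbook version for a scalar parameter, not necessarily the exact variant used in the article:

```python
import random

def potential_scale_reduction(chains):
    """Gelman-Rubin R-hat for m equal-length chains of one scalar
    parameter: compares between-chain to within-chain variance."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)   # between
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

rng = random.Random(0)
# four chains sampling the same distribution -> R-hat near 1
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# four chains stuck in different regions -> R-hat far above 1.2
stuck = [[rng.gauss(5 * i, 1) for _ in range(1000)] for i in range(4)]
print(potential_scale_reduction(mixed))
print(potential_scale_reduction(stuck))
```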

Preliminary Tests of LFI on simulated protein networks
Accuracy. PIN test data sets were generated based on model (1) with the parameter θ = (0.4, 0.3, 0.3). To simulate incompleteness, we grew the PIN test data sets to 120 proteins and retained a subgraph of 100 proteins.
For each test, 4 independent PIN test data sets were generated in this way. We believe that the performed tests were demanding because the protein test networks are (1) very small and (2) highly variable, so that characteristics genuine to the evolution mechanism and yet common to all networks are highly diluted. We then applied the LFI scheme detailed in Materials and Methods, with each Markov chain using a different PIN test data set. In summary, our method re-estimated the true model parameters accurately, converged quickly, and clearly distinguished between different evolution mechanisms only when 4 or more different summaries with non-zero gradient and moderate variation were used. Notably, ND alone could not re-estimate the true model parameters confidently, while the set of summaries WR+DIA+CC+ND+FRAG performed very well relative to other combinations. We could not reproduce similar results without either tempering or smoothing.
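The incompleteness step above (growing to 120 proteins, retaining 100) amounts to node subsampling. A minimal sketch, assuming the retained subgraph keeps only interactions among sampled nodes (the function name and the toy ring network are placeholders):

```python
import random

def subsample_pin(adj, n_keep, rng):
    """Mimic PIN incompleteness: keep a uniform random subset of
    n_keep proteins and only the interactions among them."""
    kept = set(rng.sample(sorted(adj), n_keep))
    return {v: adj[v] & kept for v in kept}

# full 'true' network: a ring of 120 proteins
full = {v: {(v - 1) % 120, (v + 1) % 120} for v in range(120)}
observed = subsample_pin(full, 100, random.Random(0))
print(len(observed))  # 100
```

Note that subsampling can leave some retained proteins without observed interactions, which is one reason incompleteness distorts motif-based summaries more than global ones.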