
OR conceived and designed the experiments. OR, OJ, and TH performed the experiments. OR, SR, and CW analyzed the data. OR, MS and CW wrote the paper.

The authors have declared that no competing interests exist.

Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses an MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data from a prokaryote and a eukaryote.

Genomic sequence data provides substantial evidence for the abundance of duplicated genes in all organisms surveyed: at least 40% of genes in two prokaryotes [

In theory, the evolutionary fate of gene duplicates can differ: (D1) one copy may become silenced (nonfunctionalization); (D2) both copies are very similar in sequence, and one is functionally redundant to the other [

The structure of protein interaction networks (PINs) derives from multiple stochastic processes over evolutionary time scales, and a number of mechanisms have been proposed to capture aspects of network growth [

The analysis of PINs is notoriously difficult because measurements of PINs are subject to considerable levels of noise [

In this work, we develop an approximate, likelihood-free Monte-Carlo inference technique based on approximate Bayesian computation (ABC) [

The degree sequence [

To study the relative importance of aspects of duplication divergence in network evolution between different domains, we simulated the evolutionary history of PINs with a mixture of duplication divergence with parent–child attachment (DDa) and preferential attachment (PA). Let G_t denote the network after t growth steps, p_Div the divergence probability, p_Att the parent–child attachment probability, and α the probability that a growth step follows DDa rather than PA; the parameters θ = (p_Div, p_Att, α) determine the growth of G_{t+1} conditional on G_t.
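As an illustration, one growth step of such a mixture model can be sketched as follows. This is a minimal sketch under our own conventions: the adjacency representation, function name, and seed graph are illustrative, not the implementation used in the study.

```python
import random

def grow_step(adj, p_div, p_att, alpha):
    """One growth step: with probability alpha use duplication divergence
    with parent-child attachment (DDa), otherwise preferential attachment (PA).
    `adj` maps each node to its set of neighbours."""
    new = max(adj) + 1
    if random.random() < alpha:
        parent = random.choice(list(adj))
        # duplication: the new node copies the parent's interactions
        adj[new] = set(adj[parent])
        # divergence: each inherited link is lost independently with prob p_div
        for nb in list(adj[new]):
            if random.random() < p_div:
                adj[new].discard(nb)
            else:
                adj[nb].add(new)
        # parent-child attachment with probability p_att
        if random.random() < p_att:
            adj[new].add(parent)
            adj[parent].add(new)
    else:
        # preferential attachment: link to a node chosen proportionally to degree
        nodes = list(adj)
        weights = [len(adj[v]) + 1 for v in nodes]  # +1 keeps isolated nodes reachable
        target = random.choices(nodes, weights=weights)[0]
        adj[new] = {target}
        adj[target].add(new)
    return adj
```

Starting from a small seed graph and iterating `grow_step` yields one simulated evolutionary trajectory of a PIN.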

To account for incomplete data, random subnetworks of order
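Sampling a random subnetwork of a given order can be sketched like this (an illustrative helper of our own; `adj` maps each node to its neighbour set):

```python
import random

def random_subnet(adj, m, rng=random):
    """Induced subgraph on m nodes chosen uniformly at random:
    keep the sampled nodes and only the edges among them."""
    keep = set(rng.sample(sorted(adj), m))
    return {v: adj[v] & keep for v in keep}
```

Retaining only the edges among the sampled nodes mimics the incomplete sampling of interactions in real PIN datasets.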

The Bayesian paradigm is a powerful probabilistic framework for making inference on complex stochastic systems and allows all sources of uncertainty to be accounted for [

ABC confers computational tractability by circumventing the problem of evaluating the likelihood directly [: instead, summaries of simulated data, for example ND and DIA (see below), are compared with those of the observed dataset 𝒟. Because little prior mass may fall in regions where simulated summaries match 𝒟, we anticipate that generating candidate parameters from the prior will be highly inefficient.

LFI compares summaries of the observed dataset with mean summaries 𝒮 of an ensemble of simulated PINs at each iteration of the algorithm during the burn-in phase, i.e., the first 800 iterations in this study. Since mean summaries over larger ensembles have reduced variance (
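The variance-reduction effect of averaging over an ensemble can be illustrated with a toy computation; the Gaussian stand-in for a network summary and all numbers below are ours, chosen only to show the effect.

```python
import random
import statistics

random.seed(1)

def summary_of_one_network():
    # stand-in for a noisy summary (e.g., DIA) of a single simulated PIN
    return random.gauss(10.0, 2.0)

def mean_summary(ensemble_size):
    # mean summary over an ensemble of simulated networks
    return sum(summary_of_one_network() for _ in range(ensemble_size)) / ensemble_size

single = [mean_summary(1) for _ in range(2000)]
ensemble25 = [mean_summary(25) for _ in range(2000)]
# the spread of ensemble means shrinks roughly as 1/sqrt(ensemble size)
```

With an ensemble of 25 networks, the standard deviation of the mean summary is about a fifth of that of a single simulated network, which stabilizes the acceptance test.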

LFI within MCMC is often prone to get stuck or to sit in the tails of the distribution [. To avoid getting stuck, we temper the tolerance thresholds during burn-in until ɛ_min is reached. To avoid the chain sitting in the tails, it is in our case sufficient to temper the proposal variance Σ; see below.

Choosing appropriate summary statistics is central to any method approximating the true likelihood. This choice is governed by the principle that useful summaries should be sensitive to genuine changes in real PINs. Briefly, we characterized genuine changes by the standardized mean derivative of a summary, smd(θ).

The standardized mean derivative smd is plotted as a function of

LFI is sensitive to the particular type of distance function, such as ρ_∩.

To compare different distance functions on sets of summaries, we analyzed the two-dimensional posterior support of

(A) p_Div and (B) p_Att. Using LFI with the set of summaries WR + DIA + CC +, we recorded accepted parameters when each summary difference fell below its final threshold ɛ_{k,min} (ρ_∩, red), and when the sum of these differences did not exceed the sum Σ_k ɛ_{k,min} of these thresholds (ρ_Σ, blue). In both cases, we used an average of shifted histograms to estimate the two-dimensional posterior support. When using ρ_∩, the posterior support was more restricted, prompting us to use ρ_∩ in LFI.
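The two acceptance rules differ only in how the K summary differences are combined; a minimal sketch (function names are ours):

```python
def accept_cap(diffs, eps):
    """rho_cap: accept only if every summary difference is below its threshold."""
    return all(d <= e for d, e in zip(diffs, eps))

def accept_sum(diffs, eps):
    """rho_sum: accept if the total difference is below the total threshold."""
    return sum(diffs) <= sum(eps)

# a single large discrepancy can be offset under rho_sum but not under rho_cap:
# diffs = [0.9, 0.1] with eps = [0.5, 0.5] fails accept_cap yet passes accept_sum
```

Because `accept_sum` lets one large discrepancy be compensated by several small ones, it admits more parameters and yields a broader posterior support.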

In summary, for inference on protein networks our results suggest that a summary S_k is useful when 𝒮_{k,θ} has non-zero smd(θ) and moderate cv(𝒮_{k,θ}).

Our evolutionary analysis of real PIN datasets centres on a comparison of two representatives from the prokaryotic and eukaryotic domains. We obtained descriptions of the PINs of

We successfully applied LFI to this PIN dataset with a uniform prior p_Div ∈ [0,1], and obtained the estimated posterior π(p_Div | 𝒟). Similar good convergence was obtained for the attachment probability p_Att and the mixture parameter α at the final thresholds ɛ_min. We could not reproduce our results without averaging over an ensemble of simulated PINs.

For the

(A) The four chains for the parameter p_Div ∈ [0,1] over the first 30,000 iterations. During burn-in, the chains moved quickly from overdispersed starting values and converged toward the same narrow support. Before iteration 800 (vertical line), the thresholds and the proposal variance were still being tempered.

(B) Accepted parameters after convergence were pooled over the four chains and used to estimate the posterior density. For p_Div, the marginal posterior is displayed (black line); in addition, posteriors were calculated for each chain and are overlaid, showing that the four sets of posterior samples overlapped well.

Comparison of the Evolutionary Dynamics Inferred from

We repeated the LFI analysis on the

We found that the lower 80% quantile of 1 − α is larger than 0.6 in both investigated species. Genomic and expression data indicate that repeated single gene duplications with immediate subfunctionalization are a driving force in the evolution of higher organisms [

The role of duplication divergence in the evolution of protein networks across domains that we promote here must be considered within the limits of our model and the data. However, we note that our analysis is based on several global features of the network data, which are more reliable than local aspects (p_Div and p_Att for the

The complexity of PIN data suggests that LFI on biological network data may be highly influenced by the choice of summaries. Basing LFI on ND alone kept ɛ_min small, ɛ_min ≤ 0.35; but the pairwise two-dimensional posteriors of p_Div, p_Att, and α were diffuse. ND alone thus permitted a stringent ɛ_min, but did not lead to a reliable and consistent estimation of

Sensitivity of LFI Based on Different Sets of Summaries

(A–C) The 2D histograms of the posterior parameters to the

(D–F) For comparison, we ran LFI based on ND alone, adjusted to yield a similar empirical acceptance probability. Although ɛ_min could be chosen stringently, the 2D histograms are diffuse. The regions of highest posterior density of LFI using ND are inconsistent with those of LFI using WR + DIA + CC +

PA alone generates tree-like networks, whereas DDa occasionally produces triangles. Surprisingly, including TRIA in the set of summary statistics did not aid inference: convergence took longer and fewer samples were accepted, without tightening the credible intervals. Taken together with the fact that other motif counts show similarly high variation over the evolutionary history (unpublished data), this suggests that the extreme variability of motif counts in simulated data reduces their usefulness for inference on biological network data.

Aspects of the complete, unobserved PINs are easily predicted from the observed networks, once MCMC output is available. Here, we discuss the true network size

(Left) posterior modes (5,636 and 43,835, dashed line and dot-dashed line, respectively) were consistent with the estimator presented in [

(Right) for the

The fact that current PINs are largely incomplete hampers inference [

We found large variability associated with predictions of the true network size, even at the final thresholds ɛ_min.

Instead, the credible intervals of all evolution parameters

We further analysed how the degree of incompleteness affects LFI by randomly withholding more network data of the

Sensitivity of LFI for PIN Data of Increasing Incompleteness

For increasingly incomplete PIN datasets of

PINs from different species have attracted much attention in molecular systems biology. Apart from their suspected role in modulating and underpinning the molecular machinery of complex phenotypes, their evolutionary properties are increasingly being investigated using a range of evolutionary and statistical approaches. We showed that it is possible to draw evolutionary inferences from large-scale, incomplete network data when models of randomly growing graphs are conditioned on many, carefully chosen aspects of the networks. Using a likelihood-free approach that relies on comparing summaries of real network data to simulated PINs, we were able to study more complex models of network evolution with greater confidence than had previously been possible [

Our results have important implications for the analysis of protein network topology. Due to its elusive complexity, the topology of a PIN is commonly summarized by the degree sequence [

We used our computational inference scheme to estimate the potential role of aspects of duplication divergence in different domains from large-scale biological network data of

LFI opens up substantial opportunities for computational statistics on complex systems. Our results emphasize that choosing a set of appropriate summaries is central to maintaining the approximate character of LFI. We proposed the standardized mean derivative and measures of scaled variation to compare the power of summaries one by one. Although ABC-MCMC failed on network data, algorithm LFI enabled efficient and consistent inference. LFI might prove useful in other biological contexts when prior information is relatively vague, and when the underlying model is complex and highly stochastic.

For clarity of exposition, we first outline algorithm ABC-MCMC [. Let {S_1, …, S_k, …, S_K} be the chosen set of summary statistics, let 𝒟 denote the observed dataset, and let π_θ denote the prior density at θ.

ABC1: If at θ, propose a move to θ′ according to a transition kernel q(θ → θ′).

ABC2: Simulate a dataset under the model with parameter θ′ and compute its summaries 𝒮_{θ′}.

ABC3: If 𝒮_{θ′} is close enough to the summaries of 𝒟, go to ABC4; otherwise stay at θ.

ABC4: Accept θ′ with probability min{1, (π_{θ′} q(θ′ → θ)) / (π_θ q(θ → θ′))}; otherwise stay at θ.

Here, closeness of the summaries of 𝒟 and the simulated summaries 𝒮_{θ′} is assessed with a distance function ρ and tolerance thresholds ɛ.

ABC-MCMC is guaranteed to eventually sample from the approximate posterior π(θ | ρ(𝒮_θ, 𝒮_𝒟) ≤ ɛ). Algorithm LFI augments this scheme with ensembles of simulated networks and with tempering. At iteration t, denote the current thresholds by ɛ_t = (ɛ_{1,t}, …, ɛ_{k,t}, …, ɛ_{K,t}); let ɛ_{k,min} be the final, preset threshold value for the kth summary, and let Σ_min be the final, preset proposal variance after cooling.

LFI1: If ɛ_{k,t} ≥ ɛ_{k,min}, update ɛ_{k,t} according to the cooling scheme; if Σ_t ≥ Σ_min, update Σ_t likewise.

LFI2: If at θ_t, propose θ′ according to a Gaussian transition kernel with variance Σ_t, truncated to [0,1] and appropriately normalized.

LFI3: Simulate an ensemble of networks under θ′ and compute the mean summaries 𝒮_{k,θ′} and the distances ρ_k(𝒮_{k,θ′}, S_{k,𝒟}).

LFI4: If ρ_k(𝒮_{k,θ′}, S_{k,𝒟}) ≤ ɛ_{k,t} for all k, accept θ′ as in ABC4; otherwise stay at θ_t and go to LFI1.

In our case, the prior is uniform, and the distance functions ρ_k are described below.

LFI fulfils the detailed balance equations for the same reasons as [
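To make the scheme concrete, here is a toy likelihood-free MCMC on a one-dimensional problem. The Gaussian "simulator", tolerance, proposal variance, and chain length are all illustrative stand-ins, not the settings used in this study; with a uniform prior and a (near-)symmetric kernel, the Metropolis–Hastings ratio reduces to the closeness test.

```python
import random

random.seed(0)

def simulated_summary(theta):
    # stand-in for simulating a network under theta and summarizing it
    n = 25
    return sum(random.gauss(theta, 0.3) for _ in range(n)) / n

observed = 0.5                  # observed summary of the "data"
eps, sigma = 0.1, 0.2           # tolerance and proposal standard deviation
theta = random.random()         # start from the uniform prior on [0, 1]
chain = []
for _ in range(3000):
    # clamping at the boundaries is a simplification of the truncated kernel
    proposal = min(1.0, max(0.0, random.gauss(theta, sigma)))
    # likelihood-free acceptance: keep the proposal only if its simulated
    # summary is close enough to the observed one
    if abs(simulated_summary(proposal) - observed) <= eps:
        theta = proposal
    chain.append(theta)
posterior = chain[1000:]        # discard burn-in
```

The retained samples concentrate around the parameter value whose simulated summaries match the observed one.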

Tempering starts from ɛ_0, cooling at the next iteration to ɛ_{t+1} = c·ɛ_t for a fixed cooling factor c < 1, until ɛ_min is reached. In all cases, the minimal temperature is reached in about 750 iterations. Tempering reduces the number of accepted parameters as the number of iterations increases. We employ a similar exponential cooling scheme on Σ_t until Σ_min is reached.
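An exponential cooling schedule of this kind can be sketched as follows; the cooling factor 0.996 and the starting value are our own choices, picked only so that the floor is reached near iteration 750, as in the text.

```python
def tempered_thresholds(eps0, eps_min, rate=0.996, iterations=800):
    """Exponential cooling: eps_{t+1} = rate * eps_t, floored at eps_min."""
    eps, schedule = eps0, []
    for _ in range(iterations):
        schedule.append(eps)
        eps = max(eps_min, rate * eps)
    return schedule

schedule = tempered_thresholds(2.0, 0.1)
# the threshold decays geometrically, then sits at its floor eps_min
```

The same generator can be reused for the proposal variance Σ_t.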

For ND and WR, to compute ρ_k(𝒮_{k,θ′}, S_{k,𝒟}), we compute the common node degrees (or distances), and for these values, sum the absolute differences of the associated frequencies, cutting off the tails of these distributions.
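A sketch of this histogram comparison, with frequency tables as dicts; the function name and tail cutoff value are illustrative.

```python
def freq_distance(sim, obs, tail_cut=None):
    """Sum absolute differences of frequencies over the degrees (or
    distances) common to both tables, optionally cutting off the tail."""
    common = set(sim) & set(obs)
    if tail_cut is not None:
        common = {v for v in common if v <= tail_cut}
    return sum(abs(sim[v] - obs[v]) for v in common)

sim = {1: 0.50, 2: 0.30, 3: 0.20}
obs = {1: 0.45, 2: 0.35, 4: 0.20}
# degrees 1 and 2 are common: |0.50 - 0.45| + |0.30 - 0.35| ≈ 0.10
```

Restricting the sum to common values and trimming the sparse tail keeps the distance robust to rarely observed degrees.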

For the remaining summaries, we normalize distances by the observed value, ρ_𝒟(S^k) = |S^k − S^k_𝒟| / S^k_𝒟, so that the thresholds ɛ_k are comparable across summaries. Here, S_𝒟 denotes the observed value of the respective summary.

Note that the average cluster coefficient is an observed probability, which is already normalized, and we utilize the absolute difference directly.

For each simulated replicate b, we computed the relative error of its summary S^b. These values yield a relative error histogram for fixed 𝒮 and θ.

Aspects or quantities of PINs can be predicted within the Bayesian framework. The posterior predictive distribution of such a quantity, e.g., the network size
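The posterior predictive computation can be sketched as follows; the toy "posterior draws" and the simulator are stand-ins of our own for the MCMC output and the network model.

```python
import random

random.seed(2)

# stand-in for posterior samples of theta pooled from the MCMC chains
posterior_draws = [random.uniform(0.4, 0.6) for _ in range(500)]

def simulate_network_size(theta):
    # stand-in for growing a full network under theta and recording its size
    return random.gauss(100 * theta, 5)

predictive = sorted(simulate_network_size(t) for t in posterior_draws)
lo, hi = predictive[25], predictive[-26]  # central ~90% predictive interval
```

Each posterior draw is pushed through the simulator once, so the predictive interval reflects both parameter uncertainty and the stochasticity of network growth.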

We have chosen

Out of 1,271 proteins in the

One thousand networks to

(55 KB PDF)

We compared WR and ND for

(A) The interquantile ranges of WR for PINs generated by different parameters were clearly distinct, and the mixture model with

(B) On the same scale, the interquantile ranges of ND largely overlapped, indicating that ND might have significantly less power than WR to distinguish between different parameters.

(C) On the log scale for

(1.4 MB TIF)

Mean summaries over larger ensembles of simulated PIN datasets have reduced variance, as exemplified here with DIA. We computed the mean summary (red points) from

(47 KB PDF)

To compare the variability of the mean posterior summaries of

(76 KB PDF)

(1.1 MB PDF)

We thank Mikael Hvidtfeldt Christensen, René Thomsen, and Thomas Bataillon for stimulating discussions. We also thank David Balding, David Welch, and John Molitor for critical review of the manuscript. Computations were performed at the Imperial College High Performance Computing Centre [

ABC, approximate Bayesian computation

cv, coefficient of variation

DDa, duplication divergence with parent–child attachment

LFI, likelihood-free inference

MCMC, Markov chain Monte Carlo

ND, node degree density

PA, preferential attachment

PIN, protein interaction network

smd, standardized mean derivative

WR, within-reach distribution