GASOLINE: a Greedy And Stochastic algorithm for Optimal Local multiple alignment of Interaction NEtworks

The analysis of structure and dynamics of biological networks plays a central role in understanding the intrinsic complexity of biological systems. Biological networks have been considered a suitable formalism to extend evolutionary and comparative biology. In this paper we present GASOLINE, an algorithm for multiple local network alignment based on statistical iterative sampling in connection to a greedy strategy. GASOLINE overcomes the limits of current approaches by producing biologically significant alignments within a feasible running time, even for very large input instances. The method has been extensively tested on a database of real and synthetic biological networks. A comprehensive comparison with state-of-the art algorithms clearly shows that GASOLINE yields the best results in terms of both reliability of alignments and running time on real biological networks and results comparable in terms of quality of alignments on synthetic networks. GASOLINE has been developed in Java, and is available, along with all the computed alignments, at the following URL: http://ferrolab.dmi.unict.it/gasoline/gasoline.html.


Computational complexity of GASOLINE
In this section we analyze the time complexity of GASOLINE. To simplify the analysis, we suppose that the size of the complexes returned at the end of each execution of GASOLINE is W , we suppose also that we have N input networks with the same number of nodes, n, and edges m. We define also the following variables: • γ s : the number of Gibbs sampling iterations of in the bootstrap phase; • γ e : the number of Gibbs sampling iterations in each extension step of the iterative phase; • k: the average degree of a node (k = m n ); • γ i : the number of iterations of the iterative phase; • γ x : the number of executions of GASOLINE.
Time complexity will be expressed as a function of n, N and W . Through the analysis, we will assume that the generation of random numbers and the computation of the orthology score between two proteins are done in constant time O(1).
First, let's analyze the bootstrap initial phase, whose goal is to find an optimal alignment of protein seeds. The generation of the initial alignment requires time O(N ). The computation of transition probabilities for all the proteins of the selected network at each iteration of Gibbs sampling costs O(nN ), while the computation of the alignment score requires O(N ). Therefore, the time complexity of the initial phase is: Since, in practice, N << n we can write: Now, let's consider the iterative phase, which removes and adds nodes to the current local alignment iteratively.
Suppose, for simplicity, that the alignment grows in the following way: at the beginning the size of aligned complexes increases from 1 up to W, then in the following extension and removal steps it switches from W to W-1 and viceversa.
The last assumption fits quite well the behavior of GASOLINE in the context of real biological networks, since our algorithm yields an alignment of complexes of a certain size and then tries to adjust it by replacing bad parts according to a goodness score.
Let's start with the extension step, which can be divided into three phases: 1. The computation of seeds' adjacent nodes; 2. The execution of Gibbs sampling; 3. The extension of seeds.
Let's suppose that networks are represented through adjacency lists. Under this assumption, the adjacent of a node can be found in O(k) time. As regards Gibbs sampling, the generation of the initial alignment requires O(N ) time.
The computation of the transition probabilities and the alignment score depends on the size of seeds. Let L the current size of the aligned complexes. The transition probability of a protein is computed as the product of two components: orthology similarity score and topology similarity score.
Computing the orthology score for a protein of the selected network at each iteration of Gibbs sampling costs O(N ) as in the bootstrap phase.
In order to compute topology scores efficiently, topology vectors are built before starting Gibbs sampling, for all seed's adjacent nodes of all aligning networks. The construction of topology vector of a single protein can be done in O(L) time, assuming that adjacent lists are implemented by using hash tables with buckets, thus providing constant-time access (in average) to an element of the list. So, the overall cost of building topology vectors is N × O(kL 2 ), supposing that the total number of a seed's adjacent nodes is O(kL). Under this assumptions, the orthology score of a protein can be computed in O(N L) time and the transition probability for all the proteins of the selected network requires O(nN L) time. Finally, the computation of the alignment score requires O(N L) time.
Summing up, the overall cost of the extension step we obtain: Assume N << n and k << n we can rewrite the equation as: Since in the worst case L = O(n) we can deduce that: The removal step simply consists in computing the minimum value of a function (Goodness score) overall the L sets of aligned proteins in the current alignment. For each set, the Goodness score can be evaluated in O(N L), which is the time required to compute the internal degree of all the proteins within the set. So, the cost of removal step is: Assuming that the extension step is performed W − 1 times at the beginning and γ i − 1 times later on and the removal step is executed γ i times, the overall cost of the iterative phase is: We can assume, without loss of generality, that W = O(γ i ) and W << n, so: Finally, all preprocessing steps can be performed in linear time, by considering the degree and the number of orthologous proteins for the proteins of all networks. Post-processing phase consists in filtering highly overlapping complexes and can be done in constant time.
By combining equations (1) and (4) and considering preprocessing operations, the overall cost of γ e executions of GASOLINE is: From the results of the analysis, it follows that the running time of GASO-LINE is polynomial in n. In fact, γ x is at most equal to n since at each execution of the algorithm different protein seeds are considered. Moreover, in all applications, γ s and γ e are in the range 200-400 and can be considered constant. Therefore, the final complexity is O(n 2 W ). We can distinguish three cases: 1. If networks are very similar, then the average size W of complexes found in each execution is high, so W = O(n) and the algorithm requires O(n 3 ) time. This is the worst case; 2. If networks are very distantly related, then W = O( √ n) and the running time is O(n 2.5 ) this is the average case.
3. If W is independent to the size of networks we can suppose its size constant W = O(1). Therefore, the running time will be O(n 2 ) that is, the best case of our algorithm.