What are the Evolutionary Origins of Mitochondria? A Complex Network Approach

Mitochondria originated endosymbiotically from an Alphaproteobacteria-like ancestor. However, it is still uncertain which extant group of Alphaproteobacteria is phylogenetically closer to the mitochondrial ancestor. The proposed groups comprise the order Rickettsiales, the family Rhodospirillaceae, and the genus Rickettsia. In this study, we apply a new complex network approach to investigate the evolutionary origins of mitochondria, analyzing protein sequences modules in a critical network obtained through a critical similarity threshold between the studied sequences. The dataset included three ATP synthase subunits (4, 6, and 9) and its alphaproteobacterial homologs (b, a, and c). In all the subunits, the results gave no support to the hypothesis that Rickettsiales are closely related to the mitochondrial ancestor. Our findings support the hypothesis that mitochondria share a common ancestor with a clade containing all Alphaproteobacteria orders, except Rickettsiales.

Schwartz & Dayhoff [1] suggested that Rhodospirillaceae are the closest extant relatives to the mitochondria. However, with the exponential increase of DNA, RNA, and protein sequences datasets, new studies using diverse methods have been conducted to elucidate the origins of mitochondria, leading to conflicting results. Esser et al. [2] proposed that Rhodospirillum rubrum was as close to mitochondria as any alphaproteobacterium investigated by conducting pairwise amino acid sequence identity comparisons between 6,214 nuclear proteincoding genes from Saccharomyces cerevisiae and 177,117 proteins encoded in sequenced genomes from 45 eubacteria and 15 archaebacteria. Fitzpatrick et al. [3] concluded that the ATP synthase subunits 9 and c This dataset was composed of 174 protein sequences: 47% (n = 82) mitochondrial sequences, 53% (n = 92) alphaproteobacterial sequences. Fig 1 shows the dependence of network distance δ(σ,σ+Δσ) as a function of σ, with Δσ = 1, which illustrates the process of selecting the most appropriate values of the similarity thresholds σ to construct critical networks and analyze their modularity properties. High peaks in this graph indicate that the network, which is built based on the information of the protein similarity indices S ij stored in the similarity matrix S, suffers important changes in its topological structure, which reflects the way the organisms (represented by nodes) are related among themselves in distinct communities. In the current case, the largest peak occurs at σ = σ max = 56% (Fig 1). Nevertheless, the presence of a second large peak at σ = 62% indicates that an important separation event also occurs at this value. Therefore, we discuss results for the network at these two σ values.
For each one of them, we perform a community identification analysis based on the neighborhood matricesM with elementsm ij . The structure of the neighborhood matrix (NM) can be visualized with the help of color code plots, as shown in Fig 2. The different colors indicate how similar two protein sequences are. If they are very similar, the nodes representing them are directly connected in the network, leading tom ij ¼ 1, which is represented by the blue color. If not, they can be indirectly connected, i.e., one can go from one node to the other by visiting a sequence of connected nodes. In this case,m ij equals the number of intermediate nodes plus 1. Larger values ofm ij indicate that the protein sequences represented by the nodes i and j are less similar. The color used to representm ij approaches red when its value increases.  (1) in Materials and Methods section) between two networks, constructed from the same similarity matrix S at nearby values of σ, as a function of σ for ATP synthase subunits 9 and c. σ is the threshold of sequence similarity used to construct a network, ranging from 0% to 100% with interval Δσ = 1%. On the vertical axis, δ is the distance value between two networks constructed at subsequent values σ and σ+Δσ. Large peaks (higher distance values), at σ = σ max = 56% and σ = 62%, indicate values of σ at which the networks undergo large structural changes. Modular components can be better revealed when modularity analysis is carried out for networks at these values of σ.
Finally, the grey color (corresponding to the value 0) represents pairs of very different protein sequences: the corresponding nodes are isolated from each other in the network constructed at that value of σ. The community identification procedure changes the labels of the nodes in the network, in such a way that the nodes in a given community are preferentially numbered in sequence. This allows to identify the sequences by the presence of square patches, mostly in blue or other colors, corresponding to low values ofm ij , which are placed along the matrix diagonal. Each such patch, which corresponds to a different community, is labeled by Cx.
At σ = 56%, it was possible to identify two community clusters and an additional community disconnected from those two clusters. In the first cluster, we find communities C1-mitochondrial sequences, Rhodospirillales, and clusters SAR11 and SAR116; and C2-Rhodobacterales and Rhizobiales. In the second cluster, we find C3-Rickettsiales-Rickettsiaceae; and C4-Rickettsiales-Anaplasmataceae. C5 appears as an isolated community, containing Rhodobacterales and Rhizobiales sequences disconnected from the rest of the network (see Fig 2a).
The first cluster (communities C1 and C2) is composed of 96% (n = 79) of the mitochondrial sequences and 63% (n = 57) of the alphaproteobacterial sequences in the dataset, including those from Rhizobiales, Rhodobacterales, Rhodospirillales, and clusters SAR11 and SAR116. The information in Fig 2a can also be depicted by drawing the network through a direct node to node connection diagram, as shown in Fig 3. The second cluster is composed of two Rickettsiales families, namely Rickettsiaceae (C3) and Anaplasmataceae (C4). This cluster is disconnected from the other alphaproteobacterial as well as from the mitochondrial sequences. Among the alphaproteobacterial sequences, 13% are disconnected from the entire dataset (C5 and the disconnected nodes in Fig 3).
The used framework provides further quantitative useful information about interrelation between the different organisms through the Newman-Girvan algorithm that leads to network community identification (for more details, see last section, Materials and Methods). This procedure, based on the successive elimination of connections with largest values of centrality betweenness, can be illustrated in the form of the dendrogram exhibited in Fig 4, for the σ = σ max = 56% network. Since the horizontal axis corresponds to the number of eliminated links from the original network, it indicates the order at which branching events (equivalent to community separations) occur. For instance, we learn that C1 and C2 remained connected by, approximately, 25 links. This quantitative information reveals, in a precise way, that mitochondrial sequences have strong connections to the alphaproteobacterial sequences from Rhodospirillales, SAR11 and SAR116 cluster, Rhodobacterales and Rhizobiales. According to the method, the organisms gathered in the same group show more similar protein sequences as compared to organisms in the detached groups. The same picture shows that the community cluster composed of Rickettsiales (C3 and C4) is disconnected from the other communities, and that these two communities, which share a common node (in white), split after the elimination of a few more connections after C1 separates from C2. C5, which includes other Alphaproteobacteria (Rhodospirillales and Rhodobacterales), is also disconnected from the other sequences.
As previously anticipated, we also analyzed the network at σ = 62% (Fig 1), where a second relevant change in network structure occurs. This network shows three isolated communities and one cluster including two communities. As shown by Figs 2b and 5 (which correspond to Figs 2a and 3 of the first network) the presence of several communities is clearly seen. The identification of the organisms in each one of them leads to the following results: C1 -mitochondrial sequences, Rhodospirillales, and the SAR116 cluster; C2 -sequences from Rhodobacterales and Rhizobiales. C3 (Rickettsiales-Rickettsiaceae) and C4 (Rickettsiales-Anaplasmataceae) form a community cluster. C5 contains Rhodobacterales and Rhizobiales sequences disconnected from the rest of the network. The first community (C1) is composed of Two network nodes can be: i) directly connected (shown in blue) when they are linked by an edge; ii) indirectly connected, when it is possible to go from one node to the other by a path formed by a finite number of edges (steps) between directly connected intermediate nodes; in the current case, the maximal number of steps is four, as indicated by the color code bar; iii) disconnected nodes (shown in color gray), when they cannot be linked by any path going through directly connected nodes. Squares placed along the 95% (n = 78) of all mitochondrial sequences and five sequences from Rhodospirillales, corresponding to 50% of all Rhodospirillales sequences in the entire dataset. The five orange nodes within C1 correspond to sequences from Rhodospirillales. The second community (C2) is composed of 57% of all alphaproteobacterial sequences (n = 52). The cluster composed of C3 and C4 is formed by two families of Rickettsiales, Rickettsiaceae and Anaplasmataceae, respectively. The last community (C5) is composed of 8% of the alphaproteobacterial sequences (n = 7), from the orders Rhodobacterales and Rhizobiales, disconnected from the rest of the diagonal, labeled as Ci, i = 1-5, reveal communities formed by subsets of proteins (organisms). Top panel: NM at σ = σ max = 56%. Bottom panel: NM at σ = 62%. Communities C1 and C2, which are still connected at σ = 56%, become disconnected at σ = 62%. In the bottom panel it is possible to recognize the emergence of a small sub-community (small blue square at the lower right corner) of community C1, which was not yet revealed in the top panel. C1 (composed of mitochondrial sequences) and C2 (composed of alphaproteobacterial sequences) are connected. C3 and C4 (sequences from Rickettsiales) are disconnected from the other communities, including C1.
doi:10.1371/journal.pone.0134988.g002 Finally, the dendrogram in Fig 6 shows that, at this larger value of σ, communities C1 and C2 have already split at the start of the edge elimination process. The sub-community formed by five Rhodospirillales sequences within the C1 community becomes disconnected from the rest of C1 after only the elimination of 10 edges, approximately. At that σ value, the other communities are already disconnected from each other, except for C3 (Rickettsiaceae) and C4 (Anaplasmataceae), linked to each other by one single edge (see Fig 5).

ATP synthase subunits 4 and b
In a similar way to the discussion of the results for ATP synthase subunits 9 and c, this discussion is supported by a sequence of graphs in Figs 7-11. The dataset for the subunits 4 and b includes 122 protein sequences: 27% (n = 33) mitochondrial, 73% (n = 89) alphaproteobacterial. The critical network occurs at σ = σ max = 36% (Fig 7). The network at this value displays six communities, which are divided into two larger clusters and one additional disconnected community (Fig 8a). In the first cluster there is one community containing mitochondrial sequences, while the remaining mitochondrial sequences are found in the second cluster. In the first cluster, we find the following communities: C1-Rhodobacterales, Rhizobiales and clusters SAR11 and SAR116; C2 -mitochondrial sequences from plants; and C3- and c at σ = σ max = 56%. The betweeness centrality of a link is proportional to the number of paths linking any pair of nodes that pass through it. In Fig 3, the edges that connect node 83 with nodes between communities C3 and C4, as well as the edges that directly connect nodes in C1 to nodes in C2, have large betweenness centrality. They are the first edges eliminated in the NGA. The dendrogram is in agreement with results in Figs 2 and 3, indicating that, at σ = σ max , the network is constituted by the three following clusters: C1+C2, C3+C4, C5. Two branching events, in which C1 detaches from C2 and C3 detaches from C4, require the elimination of relatively small number of edges. Rhodobacterales, Rhodospirillales, Sphingomonadales, and Caulobacterales. In the second cluster, we find two further communities: C4-mitochondrial sequences from fungi; C5mitochondrial sequences from metazoans. Finally, C6 is an isolated community containing sequences from Rickettsia.   The first cluster (containing communities C1-C3) is composed of 92% (n = 82) of all alphaproteobacterial sequences, belonging to the orders Rhizobiales, Rhodobacterales, Rhodospirillales, Caulobacterales, and Sphingomonadales, and to the clusters SAR11 and SAR116. In community C2, eight mitochondrial sequences from plants are also found. The gray nodes are disconnected from the network, not belonging to any community (n = 8) (Fig 9).
The last similarity threshold in which all mitochondrial sequences are connected occurs at σ = 35%. At that σ value, the network in Fig 10 shows 87% (n = 29) of the mitochondrial sequences gathered in three communities (C2, C4 and C5) (Fig 8b). In the cluster, we find the following communities: C1-Rhodobacterales, Rhizobiales, and clusters SAR11 and SAR116; C2-mitochondrial sequences from plants; C3-Rhodobacterales, Rhodospirillales, Sphingomonadales, and Caulobacterales; C4-mitochondrial sequences from fungi; and C5-mitochondrial sequences from metazoans. The isolated community C6 contains sequences from Rickettsia. Only the most representative sequences from each community are mentioned above. In the large cluster containing communities C1-C5, all the sequences in the dataset corresponding to Rhizobiales, Rhodobacterales, Rhodospirillales, Caulobacterales, Sphingomonadales, and clusters SAR11 and SAR116 are included together with the mitochondrial   [34]). The six communities in Figs 8 and 9 can also be identified. Note that the connection between the two community groups that become separated at σ = 35% is also provided between nodes in communities C2 and C4. sequences. The disconnected community C6 contains all the Rickettsiales sequences present in the dataset (n = 3). Only 3% (n = 3) of the alphaproteobacterial sequences and 15% (n = 5) of the mitochondrial sequences are completely isolated, not belonging to any community. The n = 8 gray nodes are disconnected from the network, not belonging to any community.
The dendrogram in Fig 11 shows that the number of links necessary to separate plant mitochondrial sequences (C2) from alphaproteobacterial sequences (C1 and C3) is higher than the number of links that connect fungal (C4) and metazoan (C5) mitochondrial sequences to alphaproteobacterial sequences. It also shows that the Rickettsia sequences (C6) are already disconnected from all the other sequences when the link elimination process starts.
At this σ value, it was possible to identify 12 communities containing, at least, 14 sequences, as shown by the neighborhood matrix in Fig 13. We do not consider here small communities containing less than 14 sequences because they do not provide relevant information to our purposes. Four communities (C3-C6) are gathered in a large cluster, while the remaining eight communities are disconnected from each other. The large cluster includes 10 sequences from Sphingomonadales together with metazoan mitochondrial sequences: In communities C3, we find mitochondrial sequences from birds; in C4, mitochondrial sequences from fishes; in C5, mitochondrial sequences from mammals; in C6, mitochondrial sequences from reptiles. Sequences from Sphingomonadales are found in a small community placed between C5 and C6. The isolated communities include C1, containing mitochondrial sequences from Basidiomycota (Fungi); C2, with mitochondrial sequences from Ascomycota (Fungi); C7, with sequences from Rhizobiales; C8, with mitochondrial sequences from plants; and C9, with sequences from Rhodobacterales. The sequences from other Alphaproteobacteria, including Rickettsiales, are disconnected from the mitochondrial sequences. In C10, we find sequences from family Anaplasmataceae, order Rickettsiales; in C11, from family Rickettsiaceae, order Rickettsiales; and in C12, mitochondrial sequences from insects.
Since not all mitochondrial sequences are grouped together at σ = 57%, we analyzed the network obtained at σ = 56%. Here another relevant change in network structure could be discerned, so it becomes important to check if all mitochondrial sequences would be connected. We found one large community cluster with six communities (C1-C6) and six communities disconnected from each other (Fig 14b). The cluster was composed of 89% (n = 1607) of the mitochondrial sequences and by almost all the sequences from Sphingomonadales in the dataset (10 out of 11 sequences). In the cluster we find the following communities: C1-containing mitochondrial sequences from Basidiomycota (Fungi); C2-with mitochondrial sequences from Ascomycota (Fungi); C3-with mitochondrial sequences from birds; C4-with mitochondrial sequences from fishes; C5-with mitochondrial sequences from mammals; and C6with mitochondrial sequences from reptiles. Sequences from Sphingomonadales are found in a small community placed between C5 and C6. Isolated communities include C7-with sequences from Rhizobiales; C8-with mitochondrial sequences from plants; and C9-with sequences from Rhodobacterales. Three smaller communities comprise C10-with sequences from Anaplasmataceae, Rickettsiales; C11-with sequences from Rickettsiaceae, Rickettsiales; and C12-with mitochondrial sequences from insects. Due to the size of this dataset (1964   sequences) and its complexity, it was not possible to generate either dendrograms or networks visualizations based on drawings of connected nodes.

Discussion
Our study shows that three F 0 subunits from the mitochondrial ATP synthase complex are far less related to the bacterial homologs in the Rickettsiales as compared to the homologs from the orders Rhizobiales, Rhodobacterales, Rhodospirillales, Sphingomonadales, and clusters SAR11 and SAR116. These results do not support the hypothesis that the order Rickettsiales is the closest living relative to the mitochondria [3,5,6,8,9]. They show more agreement with the initial proposal that Rhodospirillaceae includes the closest extant relatives to the mitochondria [1] and also with the previous results pointing to Alphaproteobacteria other than those in the Ricketsialles as showing closer relationships to mitochondria [2,4,7].
ATP synthase subunit 4 is no longer encoded by the mitochondrial genome and is less conserved than the other two subunits [22]. Its genes can accumulate mutations if the role of connecting F 0 and F 1 in ATP synthase is not affected. The network analyzed at the critical σ = 36% for subunit 4 shows that plant mitochondrial sequences are closer to those from Alphaproteobacteria than metazoan and fungal mitochondrial sequences. This can result from the fact that in plants nuclear interferences, for instance, caused by transposable elements, seem to have less effect than in metazoans and fungi, because plant nuclear genomes have higher average silentsite nucleotide diversity than metazoan genomes [23]. Despite the lower protein conservation, the σ = 35% network shows also in this case that the mitochondrial sequences are more closely related to sequences from Rhizobiales, Rhodobacterales, Rhodospirillales, Caulobacterales, Sphingomonadales, and clusters SAR11 and SAR116 than to Rickettsiales sequences. Although there are only three Rickettsiales sequences in this dataset, the result is consistent with those obtained in the other networks. Subunits 6 and 9 are still encoded by the mitochondrial genome, play a key role in the functioning of the ATP synthase complex, and are, thus, more conserved than subunit 4 [22]. This explains why the critical networks for them were obtained at higher σ values (57% and 56%, respectively). In the network analyses, the mitochondrial sequences were grouped with sequences from the orders Sphingomonadales (subunits 6 and a), and Rhizobiales, Rhodobacterales, Rhodospirillales, and clusters SAR11 and SAR116 (subunits 9 and c), while the Rickettsiales sequences were found in a diversity of communities. Given the high peak found at σ = 62% for ATP 9 and c, we analyzed the network constructed at this distance value, in which the mitochondrial sequences were grouped with sequences from Rhodospirillales, pointed by Schwartz & Dayhoff [1] as the closest relative to the mitochondria. Another recent study [2] also concluded that a Rhodospirillales member is as close to the mitochondria as any other Alphaproteobacteria studied to date.
Luo et al. [24] showed that the trees obtained for the alphaproteobacterial orders differ from one another if different methods or models are used to infer their phylogenetic placement. Nevertheless, Rickettsiales is shown to be an order that diverged earlier than the other orders in most of the phylogenies for the Alphaproteobacteria [24,25,26,27,28,29,30]. If we combine the results provided by our network analyses with the results obtained by these phylogenetic analyses, our findings support the hypothesis that the mitochondria share a common ancestor with a clade containing all the alphaproteobacterial orders other than Rickettsiales. Furthermore, in many studies Rickettsiales is recurrently retrieved as a distinct group, separate from all other Alphaproteobacteria [9,24,25], a result corroborated by the present study. It seems, thus, that the mitochondria may have evolved after the Rickettsiales order separated from the Alphaproteobacteria lineage, sharing a common ancestor with a clade containing all other extant alphaproteobacterial orders. However, considering that the order Rickettsiales is the most divergent within Alphaproteobacteria, and the genus Rickettsia suffered reduction in its genome [28], it will be important to conduct further network analyses with other mitochondrial proteins, as additional tests of this hypothesis.

Dataset and comparative analysis
In order to build the dataset used in this study, we searched for peptide sequences in NCBI (http://www.ncbi.nlm.nih.gov/). Our targets were ATP synthase subunits 4, 6, and 9 and their homologs in prokaryotes, respectively, subunits b, a, and c. The data were downloaded in August 4 th 2011. The sequences were filtered by size and based on whether they were from mitochondria or Alphaproteobacteria. Sequences containing less than 70% of the size reported by Devenish et al. [22] were considered incomplete. The datasets include, thus, sequences larger than 167 amino acids for subunits 4 and b, larger than 199 amino acids for subunits 6 and a, and larger than 60 amino acids for subunits 9 and c. Sequences with wrong annotations were also removed. These sequences had more than 255 amino acids for ATPs 4 and b, 281 for ATPs 6 and a, and 201 for ATPs 9 and c. After filtering the datasets, we used BLAST [31] version 2.2.21 StandAlone, with E-value < 1.0, in order to perform pairwise alignments to make quantitative comparisons among the protein sequences pertaining to each set. For each dataset, we constructed a similarity matrix S of elements S ij , where S ij indicates one of the available BLAST outputs, namely the similarity index between two sequences i and j [19]. The possibly asymmetric matrices were submitted to a symmetrization operation ((S ij +S ji )/2 !S ij ) before being used to construct undirected networks.

Network Construction
After generating the symmetrized similarity matrix, we constructed a set of networks, depending on a parameter σ called similarity threshold, where each node represents a sequence in the dataset while the edges are based on the similarity index between all sequences pairs. For a given value of σ, two nodes in the network (say i and j) were connected by an edge only when the corresponding sequences have a similarity index S ij higher than σ. These connections were represented by an adjacency matrix M(σ) such that the matrix elements M ij are 1 or 0 according to whether the nodes i and j are connected by an edge or not. The networks were built for σ values varying from 0 to 100% similarity. Then, neighborhood matrices (NM), indicated by the symbolMðsÞ, were built for each M(σ) [32,33]. The elementsm ij indicate the number of steps in the shortest path connecting nodes i and j. When i and j belong to distinct isolated clusters, thenm ij ¼ 0. The matrices M(σ) were used to calculate the network distance δ(σ,σ+Δσ) between the 101 pairs of successive networks (i.e.,Δσ = 1). Here, the function δ is defined, for any two networks α and β with the same number N of nodes, respectively represented by NM'ŝ MðaÞ andMðbÞ, by the expression where D indicates the network diameter, the largest minimum distance between any two nodes in the network. This procedure reveals, by the presence of peaks, the values of σ at which the network undergoes a relevant structural change. At such values, a putative modular structure can be identified with the largest amount of retrievable information concerning the evolutionary relationships between the organisms represented in the dataset. For this study, the results were primarily based on the network at the value σ max , corresponding to the value of σ for which δ(σ,σ +Δσ) attains its largest value [18,19]. However, in order to get a better understanding of the modular structure, it is also relevant to investigate the networks generated for σ values just before σ max, as well as networks associated with other significant peaks of σ. The networks were visualized with the help of the GePhi package [34].

Network Analysis
We applied the Newman-Girvan Algorithm (NGA) [35] to reveal the modular network structure at selected values of σ. NGA is based on the successive elimination of edges with the highest betweenness centrality score. Betweenness centrality is a measure that calculates the relevance of an edge based on its weight in the network and is proportional to the number of paths linking any pair of edges that pass through that edge. It also reveals how strong the connections between the nodes in the network are, and allows for the construction of a dendrogram indicating the formation of modular structure as a function of the number of eliminated edges.
The modular structure was analyzed by the color representation of the resulting NM [18,19]. This was possible because the NGA was coupled to a node re-enumeration, placing the most connected nodes together. The identification of the positions occupied by the network nodes, which have been renumbered during the NGA used to find the community structure, reveals the composition of the different groups. After applying this method to each dataset, the modular structure of the networks was analyzed, making it possible to identify the organisms belonging to each community. We also analyzed other relevant networks, either due to their significant modularity change or to find the last similarity threshold that connects mitochondrial sequences to one another and with alphaproteobacterial sequences. The visualization of the modular structure of the network was done by the representation of nodes and links using the GePhi package [34], by the dendrogram, and by the color representation of the NM. In the latter case, the nodes no longer show their original positions, since their positions are changed in order to obtain a dendrogram without any crossing lines. The color code reveals the modular structures: blue indicates pairs of nodes that are directly connected, while red represents pairs of nodes that are connected by a large number of steps. Disconnected pairs are shown in gray. A color code bar is provided for each figure. We highlight that the NM representation makes it clear how far a given community Cx is from another community Cy. Moreover, the dendrogram representation demonstrates which edges, how many of them, and in which order the edges have to be removed to separate groups of communities or a given community Cx from Cy.
Supporting Information S1 File. Accession numbers from all data obtained from NCBI. (XLS)