Support Measures to Estimate the Reliability of Evolutionary Events Predicted by Reconciliation Methods

The genome content of extant species is derived from that of ancestral genomes, distorted by evolutionary events such as gene duplications, transfers and losses. Reconciliation methods aim at recovering such events and at localizing them in the species history, by comparing gene family trees to species trees. These methods play an important role in studying genome evolution as well as in inferring orthology relationships. A major issue with reconciliation methods is that the reliability of predicted evolutionary events may be questioned for several reasons. First, there may be multiple equally optimal reconciliations for a given species tree–gene tree pair. Second, reconciliation methods can be misled by inaccurate gene or species trees. Third, predicted events may fluctuate with method parameters such as the cost or rate of elementary events. For all of these reasons, confidence values for predicted evolutionary events are sorely needed. It was recently suggested that the frequency of each event in the set of all optimal reconciliations could be used as a support measure. We put this proposition to the test here and also consider a variant where the support measure is obtained by additionally accounting for suboptimal reconciliations. Experiments on simulated data show that the event supports computed by both methods are relevant, and that additionally sampling suboptimal reconciliations is the more effective of the two. Unfortunately, we also show that, unlike the majority-rule consensus tree for phylogenies, there is no guarantee that a single reconciliation can contain all events having above 50% support. In this paper, we detail how to rely on the reconciliation graph to efficiently identify the median reconciliation. Such a median reconciliation can be found in polynomial time within the potentially exponential set of most parsimonious reconciliations.


Appendix S1
Formal definition of a reconciliation [5]
Definition 1. Consider a gene tree $G$, a dated species tree $S$ such that $\mathcal{S}(G) \subseteq L(S)$, and its subdivision $S'$. Let $\alpha$ be a function that maps each node $u$ of $G$ onto an ordered sequence of nodes of $S'$, denoted $\alpha(u) = (\alpha_1(u), \alpha_2(u), \ldots, \alpha_\ell(u))$. Function $\alpha$ is said to be a reconciliation between $G$ and $S$ if and only if exactly one of the following events occurs for each pair of nodes $u$ of $G$ and $\alpha_i(u)$ of $S'$ (denoting $\alpha_i(u)$ by $x'$ below):
a) if $x'$ is the last node of $\alpha(u)$, one of the cases below is true:
4. $\alpha_1(u_l) = x'$, and $\alpha_1(u_r)$ is any node other than $x'$ having height $h(x')$, or $\alpha_1(u_r) = x'$, and $\alpha_1(u_l)$ is any node other than $x'$ having height $h(x')$; (T event)
b) otherwise, one of the cases below is true:
5.

Proof of Lemma 1
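Given a reconciliation $R$ and an event $e$, let $\mathrm{ind}(R, e)$ be the indicator function for $e$ in $R$, i.e. $\mathrm{ind}(R, e) = 1$ if $e \in E(R)$ and $\mathrm{ind}(R, e) = 0$ otherwise, so that the frequency of $e$ in a set $\mathcal{R}$ of reconciliations is $f(e) = \frac{1}{|\mathcal{R}|} \sum_{R \in \mathcal{R}} \mathrm{ind}(R, e)$. Let $R_A$ be the reconciliation of $\mathcal{R}$ minimizing

$$d_A(R_A, \mathcal{R}) = \sum_{R \in \mathcal{R}} \Big( |R| - \sum_{e \in E(R_A)} \mathrm{ind}(R, e) \Big) = \sum_{R \in \mathcal{R}} |R| - |\mathcal{R}| \sum_{e \in E(R_A)} f(e) \qquad (1)$$

where $|\mathcal{R}|$ and $|R|$ respectively denote the number of reconciliations in $\mathcal{R}$ and the number of events in a reconciliation $R$. The claim for the asymmetric case then follows from the fact that the first sum and the $|\mathcal{R}|$ factor in (1) are independent of the choice of $R_A$.

Now for the symmetric distance, suppose $R_S$ is a candidate reconciliation for being the symmetric median of $\mathcal{R}$. Then, for every event $e \in E(\mathcal{R})$, each $R \in \mathcal{R}$ containing the event contributes by adding one to $d_S(R_S, \mathcal{R})$ if $e \notin E(R_S)$, and each $R \in \mathcal{R}$ not containing the event contributes by adding one if $e \in E(R_S)$. More precisely, we have

$$d_S(R_S, \mathcal{R}) = \sum_{e \in E(\mathcal{R}) \setminus E(R_S)} |\mathcal{R}|\, f(e) + \sum_{e \in E(R_S)} |\mathcal{R}|\, \big(1 - f(e)\big) = \sum_{R \in \mathcal{R}} |R| - 2|\mathcal{R}| \sum_{e \in E(R_S)} \big( f(e) - 0.5 \big).$$

This holds because $R_S$ is in $\mathcal{R}$, so that $E(R_S) \subseteq E(\mathcal{R})$ and $\sum_{e \in E(\mathcal{R})} |\mathcal{R}|\, f(e) = \sum_{R \in \mathcal{R}} |R|$. The first summation term and the $2|\mathcal{R}|$ factor do not depend on the choice of $R_S$, hence the reconciliation minimizing $d_S(R_S, \mathcal{R})$ is that maximizing $\sum_{e \in E(R_S)} \big( f(e) - 0.5 \big)$.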
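The statement of Lemma 1 can be checked by brute force on a small example. The sketch below is only an illustration, not the authors' implementation: it assumes each reconciliation is represented by its set of events, and the event labels (`"S1"`, `"D2"`, ...) and the function names are hypothetical. It selects the medians by maximizing the support sums from the lemma, so that they can be compared with a direct minimization of the two distances.

```python
from collections import Counter
from fractions import Fraction


def event_frequencies(recs):
    """f(e): fraction of the reconciliations in recs that contain event e."""
    counts = Counter(e for r in recs for e in r)
    return {e: Fraction(c, len(recs)) for e, c in counts.items()}


def d_asym(cand, recs):
    """Asymmetric distance: events of each reconciliation absent from cand."""
    return sum(len(r - cand) for r in recs)


def d_sym(cand, recs):
    """Symmetric distance: size of the symmetric difference with each r."""
    return sum(len(r ^ cand) for r in recs)


def medians_by_support(recs):
    """Pick the medians as in Lemma 1, by maximizing summed supports."""
    f = event_frequencies(recs)
    half = Fraction(1, 2)
    asym = max(recs, key=lambda r: sum(f[e] for e in r))
    sym = max(recs, key=lambda r: sum(f[e] - half for e in r))
    return asym, sym


# Hypothetical toy set: each reconciliation is a set of event labels.
RECS = [
    frozenset({"S1", "D2", "L3"}),
    frozenset({"S1", "D2", "T4"}),
    frozenset({"S1", "T5", "L3"}),
    frozenset({"S1", "D2", "T4", "L6"}),
]

R_A, R_S = medians_by_support(RECS)
print("asymmetric median:", sorted(R_A), "distance:", d_asym(R_A, RECS))
print("symmetric median:", sorted(R_S), "distance:", d_sym(R_S, RECS))
```

Exact fractions are used for the frequencies so that ties between candidates are not affected by floating-point rounding. On this toy set the two medians differ, which illustrates why the asymmetric and symmetric cases are treated separately.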

Proof of Theorem 1
Proof: For each node $v$ of the reconciliation graph $\mathcal{G}$, we introduce the notion of best local reconciliation support for $v$, denoted $BLS(v)$, which corresponds to the maximum support achievable for event nodes of a subtree rooted at $v$ and belonging to a reconciliation tree:

$$BLS(v) = \max_{T \in \mathcal{T},\ v \in V(T)} \ \sum_{w \in V_e(T_v)} f_{\mathcal{G}}(w),$$

where $T_v$ denotes the subtree of $T$ rooted at $v$. We will now show that $SCORE(v) = BLS(v)$ for each node $v \in V(\mathcal{G})$, which will prove the theorem as i) each root of $\mathcal{G}$ corresponds to the root of a reconciliation tree; ii) there is a bijection between $E(R)$ and $V_e(T_R)$; i.e. line 11 will then be shown to return a suitable reconciliation tree.

The proof that $SCORE(v) = BLS(v)$ for each node $v \in V(\mathcal{G})$ proceeds by induction on the height of $v$. If $h(v) = 0$, by construction of $\mathcal{G}$, $v$ is an event node such that $e(v) = C$ [18] and, by line 8 of Algorithm 1, $SCORE(v) = f_{\mathcal{G}}(v) = BLS(v)$, as $v$ has no child here. Let us now suppose that $SCORE(u) = BLS(u)$ for each node $u \in V(\mathcal{G})$ with $h(u) < h_i$, and let $v$ be a node in $\mathcal{G}$ such that $h(v) = h_i$. Note that, if $v$ is an event node, from Condition C4 of Definition 5 of [18], each reconciliation tree in $\mathcal{T}$ containing $v$ also contains all child nodes of $v$ (that have a height strictly smaller than $h_i$). Thus, denoting by $ch(v)$ the set of children of $v$:

$$BLS(v) = f_{\mathcal{G}}(v) + \sum_{u \in ch(v)} BLS(u) = f_{\mathcal{G}}(v) + \sum_{u \in ch(v)} SCORE(u) = SCORE(v),$$

where these equalities hold by definition of $BLS(v)$, by induction and by line 8 of Algorithm 1. On the contrary, if $v$ is a mapping node, from Condition C5 of Definition 5 in [18], each reconciliation tree from $\mathcal{T}$ containing $v$ also contains exactly one child node of $v$. Hence,

$$BLS(v) = \max_{u \in ch(v)} BLS(u) = \max_{u \in ch(v)} SCORE(u) = SCORE(v),$$

which holds by definition of $BLS(v)$, by induction and by line 10 of Algorithm 1. This concludes the proof that $SCORE(v) = BLS(v)$ for each node $v \in V(\mathcal{G})$ and thus ensures that the node $r$ selected on line 11 of Algorithm 1 maximizes $BLS(\cdot)$ among all roots of $\mathcal{G}$.

Algorithm 2 simply traverses $\mathcal{G}$ starting from the root node $r(T_A)$ of an optimal reconciliation tree $T_A$ and identifies all other nodes of $T_A$. Indeed, the subset of nodes selected by Algorithm 2 satisfies all conditions of Definition 5 of [18], and can thus be proved to be a valid reconciliation tree $T_A$ using a proof similar to that of Theorem 1 of [18]. Moreover, it is straightforward to see that $BLS(r(T_A)) = \sum_{w \in V_e(T_A)} f_{\mathcal{G}}(w)$ and, since all reconciliation trees in $\mathcal{T}$ are rooted at roots of $\mathcal{G}$ [18], this concludes the proof.
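The two recurrences used in this proof (an event node adds its own support to the scores of all its children, a mapping node takes the best of its children) and the subsequent top-down traversal can be sketched as follows. Algorithms 1 and 2 are not reproduced in this appendix, so this is a minimal illustration under stated assumptions rather than the authors' code: the dictionary encoding of the graph, the node names and the function name `best_reconciliation_tree` are all hypothetical, and every mapping node is assumed to have at least one child.

```python
def best_reconciliation_tree(graph, roots):
    """Return (root, score, nodes) of a maximum-support reconciliation tree.

    graph maps a node name to a dict with keys:
      "kind":     "event" or "mapping"
      "children": list of child node names
      "f":        support of the node (event nodes only)
    """
    score = {}

    def compute(v):
        # Bottom-up scoring with memoization (nodes can be shared in the graph).
        if v not in score:
            node = graph[v]
            child_scores = [compute(c) for c in node["children"]]
            if node["kind"] == "event":
                # An event node comes with all of its children.
                score[v] = node["f"] + sum(child_scores)
            else:
                # A mapping node comes with exactly one child: keep the best.
                score[v] = max(child_scores)
        return score[v]

    best_root = max(roots, key=compute)

    # Top-down traversal collecting the nodes of one optimal tree.
    selected, stack = [], [best_root]
    while stack:
        v = stack.pop()
        selected.append(v)
        node = graph[v]
        if node["kind"] == "event":
            stack.extend(node["children"])
        else:
            stack.append(max(node["children"], key=score.__getitem__))
    return best_root, score[best_root], selected


# Hypothetical toy graph with two alternative events below the root.
GRAPH = {
    "m_root": {"kind": "mapping", "children": ["e_S", "e_T"]},
    "e_S": {"kind": "event", "f": 0.75, "children": ["m_a", "m_b"]},
    "e_T": {"kind": "event", "f": 0.25, "children": ["m_a", "m_c"]},
    "m_a": {"kind": "mapping", "children": ["c_a"]},
    "m_b": {"kind": "mapping", "children": ["c_b"]},
    "m_c": {"kind": "mapping", "children": ["e_L"]},
    "e_L": {"kind": "event", "f": 0.25, "children": ["m_b"]},
    "c_a": {"kind": "event", "f": 1.0, "children": []},
    "c_b": {"kind": "event", "f": 1.0, "children": []},
}

ROOT, TOTAL, NODES = best_reconciliation_tree(GRAPH, ["m_root"])
print(ROOT, TOTAL, sorted(NODES))
```

Each node is scored once thanks to memoization, so the running time is linear in the size of the graph, which is in line with the polynomial-time claim. The recursive scoring is chosen for brevity; a large graph would call for an explicit bottom-up order to avoid Python's recursion limit.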