A polynomial invariant for a new class of phylogenetic networks

  • Joan Carles Pons,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

    joancarles.pons@uib.es

    Affiliation Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, Spain

  • Tomás M. Coronado,

    Roles Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – original draft

    Affiliation Department of Mathematics and Computer Science, University of the Balearic Islands, Palma, Spain

  • Michael Hendriksen,

    Roles Formal analysis, Investigation, Methodology, Validation, Writing – original draft

    Affiliation School of Mathematics and Statistics, University of Melbourne, Melbourne, Australia

  • Andrew Francis

    Roles Conceptualization, Investigation, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing

    Affiliation Centre for Research in Mathematics and Data Science, Western Sydney University, Parramatta, Australia

Abstract

Invariants for complicated objects such as those arising in phylogenetics, whether they are invariants as matrices, polynomials, or other mathematical structures, are important tools for distinguishing and working with such objects. In this paper, we generalize a complete polynomial invariant on trees to a class of phylogenetic networks called separable networks, which includes orchard networks. Networks are becoming increasingly important for their ability to represent reticulation events, such as hybridization, in evolutionary history. We provide a function from the space of internally multi-labelled phylogenetic networks, a more generic graph structure than phylogenetic networks where the reticulations are also labelled, to a polynomial ring. We prove that the separability condition allows us to characterize, via the polynomial, the phylogenetic networks with the same number of leaves and same number of reticulations by considering their internally labelled versions. While the invariant for trees is a polynomial in ℤ[x1, …, xn, y], where n is the number of leaves, the invariant for internally multi-labelled phylogenetic networks is an element of ℤ[x1, …, xn, λ1, …, λr, y], where r is the number of reticulations in the network. When the networks are considered without leaf labels the number of variables reduces to r + 2.

Introduction

A complete polynomial invariant able to uniquely distinguish between rooted trees has been recently introduced in [1]. Motivated to analyze and compare tree shapes in a phylogenetic context, this polynomial (to which we will refer as the Liu polynomial) has been used both to define a similarity measure on rooted tree shapes and to estimate parameters and models via its coefficients [2]. Moreover, its generalization from trees to networks (by analyzing the set of embedded spanning trees in the network) has also been used to study the properties of randomly generated networks [3].

We note that the word “invariant” is used here in its traditional sense, and not the one used in algebraic geometry approaches to phylogenetics, in which phylogenetic invariants for an evolutionary model along a tree are the polynomials which vanish on the expected frequencies of base patterns at the leaves [4]. Throughout this article, a (complete) invariant of a set A is a function f: A → B with the property that x ∼A y if and only if f(x) ∼B f(y), where B is some other set (such as the set of polynomials), and ∼A and ∼B are equivalence relations in the respective sets.

A multitude of (non-polynomial) invariants have been defined for specific subclasses of phylogenetic networks. To name just a few, the μ-vectors which store the number of paths from nodes to leaves characterize (among others) tree-child networks [5] and orchard networks (without stacks) [6]; the set of displayed trees that characterizes regular networks [7]; and the induced trinets (minimal subnetworks induced by triples of leaves) that characterize (among others) level-2 networks [8] and orchard networks [9].

In this paper we show how a polynomial invariant can be defined for rooted phylogenetic networks, generalizing the Liu polynomial invariant for trees. In order to do so, we consider phylogenetic networks and a labelled version of them, called internally labelled phylogenetic networks, where we keep the labels on leaves and also (bijectively) label the reticulations. In fact, internally labelled phylogenetic networks are a subset of a more general set of networks, which we call internally multi-labelled phylogenetic networks, or IMLN’s. On these networks the presence of elementary nodes is allowed, and leaves, reticulations and elementary nodes are all labelled. Then, if we denote by PN the set of all phylogenetic networks (up to isomorphism) and by ILPN the set of all internally labelled phylogenetic networks (up to isomorphism), the map Φ: ILPN → PN that sends each internally labelled phylogenetic network to the phylogenetic network obtained by “forgetting” all the internal labels (on reticulations) is well defined. Therefore, for each N ∈ PN, Φ⁻¹(N) is the set of all the internally labelled phylogenetic networks that share the topology of N; that is, the fiber of N under Φ.

The aim of this paper is to define a polynomial p that uniquely characterizes these fibers and, in so doing, also characterizes the phylogenetic networks beneath them. See the diagram below. Since Φ is not injective, the dashed arrows denote maps that are not unique. We will see that, in general, p is not injective, but that it will be so under a suitable topological condition.

This paper is organized as follows. In the Methods section we include the three main graph structures of study: phylogenetic networks, internally labelled phylogenetic networks and internally multi-labelled phylogenetic networks (or IMLN’s). We also define the concept of isomorphism on these structures. The Results section is divided into two main subsections. The first one studies a process that unfolds an IMLN into a tree (an IMLT) and its reverse, folding, that recovers the initial IMLN. The key result of this section is the characterization of an IMLN by an IMLT (Corollary 10). The second subsection is dedicated to the definition and study of an extension of the Liu polynomial on IMLN’s. If N is an IMLN on a set of leaves labelled by X, the assigned polynomial p(N) has |X| + r + 1 variables, where r is the number of reticulations in the network. This subsection is further divided into multiple parts. The first part studies a special type of path (composed only of reticulations or elementary nodes) in IMLN’s, called strong paths. Roughly speaking, these allow us to define an equivalence relation between IMLN’s, and we prove that two IMLN’s share the polynomial if, and only if, they are equivalent (Theorem 15). The second part gives a sufficient condition on the space of phylogenetic networks (which we call separability) for the derived internally labelled phylogenetic networks to be completely characterized by the polynomial. The multiple lemmas proved in this part allow us to prove the main result (Theorem 22) in the third part; that is, the polynomial is a complete invariant in the set of internally labelled separable phylogenetic networks up to isomorphism. The fourth part of this subsection proves that orchard networks are separable, and so are characterized by the polynomial introduced in this paper (Theorem 23). 
Finally, in the last part, we present how the obtained results can be applied for an unlabelled version of networks, in the sense that we forget the labelling of the leaves, reducing the polynomial to r + 2 variables (Proposition 24). This paper finishes with a section of Discussion and Conclusion.

Methods

In this section we introduce the mathematical notation that will be used in the rest of the paper.

Throughout this paper, X will denote a non-empty finite set (of taxa). Commonly, we will use X = {x1, …, xn}, and we will allow ourselves to see each member of X as an irreducible polynomial in ℤ[x1, …, xn, y]; i.e., we will consider the labels of the leaves in our networks to be polynomials of the form xi for i ∈ {1, …, n}.

Definition 1. A rooted binary phylogenetic network N = (V, E) on X, or simply a phylogenetic network on X, is a rooted directed acyclic graph with no parallel arcs satisfying the following conditions:

  1. any node with out-degree zero (a leaf) has in-degree one, and the set of nodes with out-degree zero, denoted by L(N), is identified with X via a bijection φ: L(N) → X;
  2. the root is the only node with in-degree zero, and has out-degree two;
  3. any other node has either in-degree one and out-degree two (a tree node), or in-degree two and out-degree one (called a reticulation node).

We shall consider the leaves and root to be tree nodes.
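The conditions of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not part of the paper: a network is a dictionary mapping each node name to the list of its children, and leaf names double as their labels in X (so the bijection φ is implicit). The function names and the example network are illustrative only.

```python
def in_degrees(children):
    """In-degree of every node of a digraph given as {node: [children]}."""
    indeg = {u: 0 for u in children}
    for kids in children.values():
        for v in kids:
            indeg[v] = indeg.get(v, 0) + 1
    return indeg


def classify_nodes(children):
    """Name each node by its (in-degree, out-degree) pair, as in Definition 1."""
    indeg = in_degrees(children)
    names = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "reticulation"}
    return {u: names.get((indeg[u], len(children.get(u, []))), "invalid") for u in indeg}


def is_acyclic(children):
    """Kahn's algorithm: peel off nodes whose in-degree has dropped to zero."""
    indeg = in_degrees(children)
    queue = [u for u, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in children.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(indeg)


def is_phylogenetic_network(children):
    """True if the digraph has one root, only admissible degrees, no parallel arcs, no cycles."""
    kinds = classify_nodes(children)
    no_parallel = all(len(set(kids)) == len(kids) for kids in children.values())
    one_root = list(kinds.values()).count("root") == 1
    return no_parallel and one_root and "invalid" not in kinds.values() and is_acyclic(children)


# Hypothetical example: one reticulation h with parents a and b.
N = {"r": ["a", "b"], "a": ["x1", "h"], "b": ["h", "x3"], "h": ["x2"],
     "x1": [], "x2": [], "x3": []}
kinds = classify_nodes(N)
```

Here `kinds["h"]` is `"reticulation"`, and a graph with a doubled arc such as `{"r": ["a", "a"], "a": ["x1"], "x1": []}` is rejected by the parallel-arc test.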

Definition 2. A rooted binary internally multi-labelled phylogenetic network N = (V, E) on X, or simply an IMLN on X, is a rooted directed acyclic graph with no parallel arcs satisfying the following conditions:

  1. any node with out-degree zero (a leaf) has in-degree one, and the set of nodes with out-degree zero, denoted by L(N), is identified with X via a surjection φ: L(N) → X;
  2. the root is the only node with in-degree zero, and it can have out-degree one (in which case we shall say it is an elementary node) or two (a tree node);
  3. any other node has either in-degree one and out-degree two (again, a tree node), or in-degree two and out-degree one (called a reticulation node), or in-degree one and out-degree one (again, an elementary node);
  4. if R(N) denotes the set of reticulation nodes and E(N) the set of elementary nodes of N, then there exists a labelling function ℓ: R(N) ∪ E(N) → {λ1, …, λr} such that its restriction to R(N) is injective and, if u ∈ R(N) and v ∈ E(N), then ℓ(u) ≠ ℓ(v).
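Condition 4 can be illustrated with a short check. This is a sketch under assumptions of our own: the graph is a dictionary from every node (leaves included) to its list of children, internal labels are given by a dictionary `labels`, and the strings `"l1"`, `"l2"` stand in for λ1, λ2. Neither the encoding nor the function name comes from the paper.

```python
def labelling_is_valid(children, labels):
    """Check condition 4: labels are injective on reticulations and
    no reticulation shares its label with an elementary node."""
    indeg = {u: 0 for u in children}
    for kids in children.values():
        for v in kids:
            indeg[v] += 1
    # Reticulations: in-degree 2, out-degree 1.
    retic = [u for u in children if indeg[u] == 2 and len(children[u]) == 1]
    # Elementary nodes: out-degree 1 and in-degree 0 (root) or 1.
    elem = [u for u in children if indeg[u] <= 1 and len(children[u]) == 1]
    retic_labels = [labels[u] for u in retic]
    injective = len(set(retic_labels)) == len(retic_labels)
    disjoint = set(retic_labels).isdisjoint(labels[u] for u in elem)
    return injective and disjoint


# Two elementary nodes may share a label (here both carry "l1") ...
good = {"r": ["a", "b"], "a": ["x1", "e1"], "b": ["e2", "x3"],
        "e1": ["x2"], "e2": ["x2'"], "x1": [], "x2": [], "x2'": [], "x3": []}
# ... but a reticulation (h) and an elementary node (e) may not.
bad = {"r": ["a", "b"], "a": ["e", "h"], "b": ["h", "x3"],
       "e": ["x1"], "h": ["x2"], "x1": [], "x2": [], "x3": []}
ok_good = labelling_is_valid(good, {"e1": "l1", "e2": "l1"})
ok_bad = labelling_is_valid(bad, {"e": "l1", "h": "l1"})
```

The first call returns `True` and the second `False`; relabelling `e` as `"l2"` in the second example makes it valid.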

Definition 3. A rooted binary internally multi-labelled phylogenetic tree T = (V, E) on X, or simply IMLT on X, is an IMLN without reticulation nodes.

We will consider the labels λ1, …, λr to be irreducible polynomials in ℤ[x1, …, xn, λ1, …, λr, y]. Notice that Definition 2 implies that IMLN’s are a recursive structure in the following sense: given any IMLN N, for any u ∈ V(N), the subgraph rooted at u is still an IMLN. This is not the case in general for phylogenetic networks.

In the case that an IMLN (with the root of out-degree two) does not have elementary nodes and the labelling on the leaves is a bijection, by definition, it becomes a phylogenetic network if the labelling on reticulations is suppressed. Also, if we consider a phylogenetic network and we add a labelling bijection ℓ: R(N) → {λ1, …, λr}, it becomes an IMLN. In order to reflect this possibility, we introduce the following definition.

Definition 4. An internally labelled phylogenetic network N on X is an IMLN on X without elementary nodes and where the maps φ: L(N) → X and ℓ: R(N) → {λ1, …, λr} are bijections.

In order to formally define the concept of isomorphism between a pair of phylogenetic networks or between a pair of IMLN’s, we consider the alternative notations (V, E, φ) and (V, E, φ, ℓ), respectively, to reflect the labelling functions.

Definition 5. Two phylogenetic networks N1 = (V1, E1, φ1) and N2 = (V2, E2, φ2) on X are isomorphic if there exists a bijection f: V1 → V2 such that φ1(x) = φ2(f(x)) for all x ∈ L(N1), and (u, v) ∈ E1 if and only if (f(u), f(v)) ∈ E2.

Definition 6. Two IMLN’s N1 = (V1, E1, φ1, ℓ1) and N2 = (V2, E2, φ2, ℓ2) on X are isomorphic if there exists a bijection f: V1 → V2 such that φ1(x) = φ2(f(x)) for all x ∈ L(N1), ℓ1(x) = ℓ2(f(x)) for all x ∈ R(N1) ∪ E(N1), and (u, v) ∈ E1 if and only if (f(u), f(v)) ∈ E2.

That is, a graph isomorphism that preserves the labels of both the reticulation and elementary nodes.

Results

Folding and unfolding

Following [10], a phylogenetic network can be “unfolded” in a specific manner to obtain a multi-labelled tree, that is a particular IMLT without elementary nodes in terms of the previous definitions. Moreover, in some cases, this process can be reverted, and the multi-labelled tree can be “folded” recovering the initial network. A phylogenetic network cannot in general be characterized by a multi-labelled tree, and this correspondence is valid only for the subclass of FU-stable phylogenetic networks [10].

In this subsection, however, we prove that an internally labelled phylogenetic network can be uniquely characterized by an IMLT obtained by a sequence of “unfoldings” on its reticulation nodes. Roughly speaking, considering the reticulations of an IMLN in a specific order, it is possible to sequentially duplicate the subnetwork descending from these nodes until an IMLT is obtained.

Let N be a (generic) IMLN, and R(N) the set of its reticulation nodes. The relation of being a descendant of another node induces a partial order over R(N), which we will denote by ≤R. That is, for any two nodes u, v ∈ R(N), u ≤R v if, and only if, there exists a directed path from v to u. Let Rmin(N) be the set of the minimal elements of R(N) under this order, i.e. reticulation nodes such that none of their descendants are also reticulation nodes.
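As an illustration of the order ≤R and of Rmin(N), the sketch below computes both on a small network. The dictionary encoding (node to list of children, with leaves mapped to empty lists), the function names and the example network are assumptions made for this illustration, not taken from the paper.

```python
def reticulations(children):
    """Nodes of in-degree two."""
    indeg = {}
    for kids in children.values():
        for v in kids:
            indeg[v] = indeg.get(v, 0) + 1
    return {u for u, d in indeg.items() if d == 2}


def descendants(children, u):
    """All nodes reachable from u by a directed path of length at least one."""
    seen, stack = set(), list(children[u])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen


def leq_R(children, u, v):
    """u <=_R v: u equals v, or there is a directed path from v to u."""
    return u == v or u in descendants(children, v)


def r_min(children):
    """Reticulations none of whose descendants is a reticulation."""
    R = reticulations(children)
    return {u for u in R if not (descendants(children, u) & R)}


# Hypothetical network with two reticulations, h2 lying below h1.
N = {"r": ["a", "b"], "a": ["x1", "h1"], "b": ["h1", "d"], "h1": ["e"],
     "e": ["x2", "h2"], "d": ["h2", "x4"], "h2": ["x3"],
     "x1": [], "x2": [], "x3": [], "x4": []}
minimal = r_min(N)
```

In this example `minimal` is `{"h2"}`, since h1 has the reticulation h2 among its descendants, and `leq_R(N, "h2", "h1")` holds while `leq_R(N, "h1", "h2")` does not.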

Lemma 1. Let N be an IMLN and u ∈ Rmin(N). Then the graph rooted at u is an IMLT.

Proof. If u ∈ Rmin(N), then there is no path in N from u to another reticulation. This means that there are no reticulations in the graph rooted at u, and therefore it is an IMLT.

Let N be an IMLN, and consider u ∈ Rmin(N) (so that u is labelled by an element in {λ1, …, λr}). Let v1, v2 be its parents, noting that v1 ≠ v2 because parallel arcs are excluded. Define U(N, u) to be the unfolded IMLN of N at u, obtained by the following algorithm:

  1. delete edges (v1, u) and (v2, u);
  2. duplicate N(u), the IMLT rooted at u, including all its labels;
  3. add an edge from v1 to one of the resulting copies of u, and an edge from v2 to the remaining copy of u.
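The three steps above can be sketched in code. This is an illustrative implementation under assumptions of our own: the network is a dictionary from each node to its list of children, `labels` stores the labels of leaves and internal labelled nodes, and copied nodes receive fresh names of the form `name#i` (a naming scheme invented here). The function assumes, without checking, that u is a minimal reticulation.

```python
def unfold(children, labels, u):
    """Return (children, labels) of U(N, u); assumes u is in Rmin(N)."""
    # The two parents v1 != v2 of the reticulation u.
    v1, v2 = [v for v, kids in children.items() if u in kids]
    new_children = {w: list(kids) for w, kids in children.items()}
    new_labels = dict(labels)

    def fresh(name):
        # Smallest suffix that does not clash with an existing node name.
        i = 1
        while f"{name}#{i}" in new_children:
            i += 1
        return f"{name}#{i}"

    def copy_subtree(w):
        # Step 2: duplicate the tree rooted at w, labels included.
        c = fresh(w)
        new_children[c] = []  # reserve the name before recursing
        if w in labels:
            new_labels[c] = labels[w]
        new_children[c] = [copy_subtree(k) for k in children[w]]
        return c

    u_copy = copy_subtree(u)
    # Steps 1 and 3 combined: v1 keeps its arc to the original u,
    # while the arc (v2, u) is replaced by an arc from v2 to the copy.
    new_children[v2] = [u_copy if k == u else k for k in new_children[v2]]
    return new_children, new_labels


# Hypothetical example: one reticulation h labelled l1 (standing for lambda_1).
N = {"r": ["a", "b"], "a": ["x1", "h"], "b": ["h", "x3"], "h": ["x2"],
     "x1": [], "x2": [], "x3": []}
labels = {"x1": "x1", "x2": "x2", "x3": "x3", "h": "l1"}
U, UL = unfold(N, labels, "h")
```

After the call, `h` and `h#1` are elementary nodes sharing the label `l1`, each with its own child labelled `x2`, and no node of `U` has in-degree two.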

Remark 1. Notice that the process of unfolding preserves paths in the following sense: if N′ is obtained from N by unfolding N at some node u, then any path between two nodes in N′ comes from an existing path in N; and vice versa, any path between two nodes in N corresponds to a path in N′. Notice, however, that a path in N might very well correspond to two different paths in N′, and so this assignation is not injective.

Corollary 2. Let N be an IMLN, and u ∈ Rmin(N). Then U(N, u) is an IMLN.

Let N be an IMLN. We say that a sequence (u1, …, uk) of nodes in R(N) is compatible if the associated sequence of IMLN’s $(N_1, \ldots, N_k)$ is such that $u_i \in R_{\min}(N_{i-1})$ for every i ∈ {2, …, k} and u1 ∈ Rmin(N), where $N_1 = U(N, u_1)$ and $N_i = U(N_{i-1}, u_i)$. Then, if (u1, …, uk) is compatible, for each i ∈ {1, …, k − 1} there is no path from ui to uj when j > i; i.e., it is non-decreasing under the partial order ≤R induced by the network over R(N).

Lemma 3. Let N be an IMLN and u1, u2 ∈ Rmin(N). Then,

$$U(U(N, u_1), u_2) \cong U(U(N, u_2), u_1).$$

Proof. It is straightforward by Lemma 1 and the steps of the unfolding algorithm. If u1 ∈ Rmin(N), then u2 ∈ Rmin(U(N, u1)); otherwise there would be a reticulation node u′ in R(U(N, u1)) and a path from u2 to u′ in U(N, u1), and so in N, which is a contradiction. Then, by Lemma 1, the graph rooted at u2 in U(N, u1) is an IMLT. Since u2 is not a node in any of the copies of the IMLT rooted at u1 in the construction of U(N, u1), there is no intersection between the copies from u1 and the copies from u2. Since the same argument holds if we start with u2, the result follows.

Lemma 3 can be extended following the same arguments for any set of reticulations {u1, …, uk} if all of them are in Rmin(N), since there will be no intersection between the created copies of IMLT’s.

Let N be an IMLN. We define an equivalence relation ≡ in the set of compatible sequences of elements of R(N) as follows:

$$(u_1, \ldots, u_k) \equiv (v_1, \ldots, v_k) \iff \{u_1, \ldots, u_k\} = \{v_1, \ldots, v_k\}.$$

That is, we say that two compatible sequences are equivalent if they are composed of the same set of nodes.

An ≤R-chain in an IMLN N is a chain under the ≤R order defined on R(N) (or a subset of it); that is, a subset of reticulations such that u1 ≤R ⋯ ≤R us. Similarly, an ≤R-antichain in an IMLN N is an antichain under the ≤R order; i.e., a subset of reticulations of N which are pairwise incomparable (ui ≰R uj and uj ≰R ui if ui ≠ uj) under the ≤R order.

In the next lemma we prove that if we consider an ≤R-chain in an IMLN N then there is a single way to traverse these nodes in a compatible sequence, from bottom to top. On the other hand, if we consider an ≤R-antichain, then every way to traverse these nodes is valid to form a compatible sequence.

Lemma 4. Let N be an IMLN and S = {v1, …, vr} ⊆ R(N). Then

  (a) If v1 ≤R v2 ≤R ⋯ ≤R vr is an ≤R-chain, then vi must precede vj in every compatible sequence containing S if i < j.
  (b) If S is an ≤R-antichain, then every possible ordering of its nodes produces a compatible sequence composed of S.

Proof. We first prove (a). If v1 ≤R v2 ≤R ⋯ ≤R vr is an ≤R-chain, then there is a path from vj to vi if i < j. Therefore, if there existed a path from vi to vj, it would produce a cycle in N; but this is not possible because N is an IMLN, and so in particular it is acyclic. This means that there is no path from vi to vj when i < j. Consequently, if i < j, vi must precede vj in every compatible sequence containing S.

Now we prove (b). Let v and v′ be two nodes in S. If v precedes v′ in a sequence there cannot be a path from v to v′; otherwise v′ ≤R v. If v′ precedes v in a sequence there cannot be a path from v′ to v; otherwise v ≤R v′. Since S is an ≤R-antichain, both orderings yield compatible sequences.
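Lemma 4 can be explored on a small example by enumerating orderings of the reticulations. The sketch below assumes that a sequence containing all of R(N) is compatible exactly when no node is followed by one of its own descendants; this is the condition stated after the definition of compatibility, treated here as also sufficient, which is our reading rather than a statement from the paper. The encoding, function names and example network are hypothetical.

```python
from itertools import permutations


def reticulations(children):
    """Nodes of in-degree two."""
    indeg = {}
    for kids in children.values():
        for v in kids:
            indeg[v] = indeg.get(v, 0) + 1
    return {u for u, d in indeg.items() if d == 2}


def descendants(children, u):
    """All nodes reachable from u by a directed path of length at least one."""
    seen, stack = set(), list(children[u])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen


def compatible_sequences(children):
    """Orderings of R(N) in which no node is followed by one of its descendants."""
    R = sorted(reticulations(children))
    below = {u: descendants(children, u) for u in R}
    return [s for s in permutations(R)
            if all(s[j] not in below[s[i]]
                   for i in range(len(s)) for j in range(i + 1, len(s)))]


# Hypothetical network: h2 lies below h1 (a chain), h3 is unrelated to both.
N = {"r": ["a", "b"], "a": ["c", "h1"], "b": ["h1", "d"], "c": ["x1", "h2"],
     "h1": ["e"], "e": ["h2", "x2"], "h2": ["x3"], "d": ["f", "g"],
     "f": ["x4", "h3"], "g": ["h3", "x5"], "h3": ["x6"],
     "x1": [], "x2": [], "x3": [], "x4": [], "x5": [], "x6": []}
sequences = compatible_sequences(N)
```

In every sequence found, h2 precedes h1, as part (a) requires for the chain h2 ≤R h1, while h3 may appear in any position relative to them.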

Corollary 5. Let N be an IMLN and (u1, u2, …, uk) ≡ (v1, v2, …, vk) a pair of equivalent compatible sequences of elements of R(N). Let $(N_1, \ldots, N_k)$ and $(N'_1, \ldots, N'_k)$ be the sequences of IMLN’s associated to them, respectively. Then $N_k$ and $N'_k$ are isomorphic.

Proof. For k = 1 there is nothing to prove, since u1 = v1. Let k = 2. If u1, u2 ∈ Rmin(N), there is nothing to prove, because (u1, u2) and (u2, u1) are compatible sequences and Lemma 3 applies. If (u1, u2) is a compatible sequence and u1 ≤R u2, then we must have (v1, v2) = (u1, u2) (and not (v1, v2) = (u2, u1)), since u2 ∉ Rmin(N).

The general situation for k ≥ 3 demands a different approach. Let s1 = (u1, u2, …, uk) and s2 = (v1, v2, …, vk). Since s1 ≡ s2, we have {u1, u2, …, uk} = {v1, v2, …, vk} ⊆ R(N). Let A = {u1, u2, …, uk}. We iteratively apply the following process to prove the result. Let A′ = {u ∈ A: u ∈ Rmin(N)}. Note that A′ is not empty because u1 and v1 (which could be equal) are in Rmin(N). Then, let s1′ be the sequence obtained from s1 by moving all the nodes in A′ to the first positions (in such a way that if ui, uj ∈ A′ with i < j, then the node ui appears before uj in s1′) and leaving the relative order of the remaining nodes unchanged. Note that s1′ is compatible by construction and s1′ ≡ s1. A similar process can be repeated to obtain s2′ from s2. Note that the set of nodes of A′ occupying the first |A′| positions in both s1′ and s2′ is exactly the same, and it is an ≤R-antichain; but these nodes may not appear in the same order in both sequences.

Let u* be the last (rightmost) node in s1′ such that u* ∈ A′. Now let s2″ be the compatible sequence equivalent to s2′ obtained by keeping all positions fixed except for the node u*, which becomes the last node in s2″ with u* ∈ A′. This ensures that the last node of the first |A′| positions in both s1′ and s2″ is the same, u*. Note that we could have u* = uk = vk (when A = A′). By Lemma 4(b) and Lemma 3, the IMLN Nu* obtained by sequentially unfolding at the nodes in s1′ until u* is reached is isomorphic to the IMLN obtained by sequentially unfolding at the (same) nodes in s2″ until u* is reached. Then, the same process can be repeated by considering the new equivalent compatible sequences obtained from s1′ and s2″ by suppressing the first |A′| positions and starting with the IMLN Nu*.

Therefore, given a compatible sequence (u1, u2, …, ur) of all the elements of R(N), and its associated sequence $(N_1, \ldots, N_r)$, we define the unfolding of an IMLN N, denoted by U(N), by means of the equation $U(N) = N_r$. We may refer to such a sequence as a sequence of unfoldings. See Fig 1 for an example of a sequence of unfoldings for an IMLN; in fact for an internally labelled phylogenetic network.

Fig 1. The unfolding of an IMLN.

Top two figures: A phylogenetic network N on {x1, x2, x3, x4}, and the IMLN obtained by considering the labelling function over R(N) given by ℓ(ui) = λi for i ∈ {1, 2, 3}. Notice that N is an internally labelled phylogenetic network. The three figures below are the sequence of unfoldings associated to the compatible sequence of reticulations (u2, u3, u1). Following the introduced terminology, $N_1 = U(N, u_2)$, $N_2 = U(N_1, u_3)$ and $N_3 = U(N_2, u_1)$. Note that u2, u3 ∈ Rmin(N) and u1 ∉ Rmin(N), since there is a path from u1 to u2 in N.

https://doi.org/10.1371/journal.pone.0268181.g001

Now, we are interested in the “reverse” process to unfolding. Roughly speaking, we are interested in formally defining a way to “fold” an IMLT to recover the IMLN from which it comes. Given an IMLN N, we can also define a partial order over the set of elementary nodes E(N) by saying that for any two u, v ∈ E(N), u ≤E v if and only if there exist u′, v′ ∈ E(N) with ℓ(u) = ℓ(u′) and ℓ(v) = ℓ(v′) and a directed path from v to u. We call the set of elementary nodes that are maximal under this order Emax(N).

Lemma 6. Let $(N_1, \ldots, N_r)$ be a sequence of unfoldings of an internally labelled phylogenetic network N. For any $N_i$ in it and for every $u \in E_{\max}(N_i)$, there exists exactly one other $v \in E_{\max}(N_i)$ such that ℓ(u) = ℓ(v) and the IMLT’s $N_i(u)$ and $N_i(v)$ are isomorphic.

Proof. Let $N_i$ be one of the IMLN’s in the sequence of unfoldings. Let $N' = N_{i-1}$, with N′ = N when i = 1. By construction, ui ∈ Rmin(N′).

Since $N_i = U(N', u_i)$, the IMLT N′(ui) is duplicated; if u and v are the two resulting copies of ui in $N_i$, we have ℓ(u) = ℓ(v) and $N_i(u) \cong N_i(v)$. Moreover, $u, v \in E_{\max}(N_i)$; otherwise, if u (or v) is not maximal under the order ≤E in $N_i$, it means that there are $w, w' \in E(N_i)$ with ℓ(w) = ℓ(w′) such that there is a path from w to u. By Remark 1 this path is preserved in every $N_j$ with j < i. Since the labelling function is injective over reticulation nodes and N has no elementary nodes, this means that the pair w, w′ corresponds to a reticulation node in some $N_j$ with j < i; equivalently, this is a reticulation node equal to some uj with j < i. This contradicts the fact that the sequence (u1, u2, …, ur) is compatible. If we consider a maximal element in $N_i$ different from the two coming from the duplication of ui in N′, the previous argument can be reproduced similarly. This pair of maximal elements is preserved as maximal in every $N_j$ with j < i right up until the unfolding on this reticulation is produced. This proves that the IMLT’s rooted on the corresponding copies of it are also preserved until $N_i$ is reached.

In particular, in the proof of Lemma 6, and following the same notation, we show that the node ui is maximal under the ≤E order in $N_i$. Notice also that this could be false if elementary nodes are allowed in the initial IMLN N.

Proposition 7. Let $(N_1, \ldots, N_r)$ be a sequence of unfoldings of an internally labelled phylogenetic network N. For any $N_i$ in it, let $v, w \in E(N_i)$. Then, the order ≤E is such that v ≤E w if and only if v ≤R w in R(N).

Proof. We begin with the “if” direction. If v, w are such that v ≤R w when seen as reticulation nodes in N, there exists at least one path from w to v. Now, since $w \in E(N_i)$, by Lemma 6, there exists $w' \in E(N_i)$ such that ℓ(w) = ℓ(w′) and $N_i(w) \cong N_i(w')$, via an isomorphism f. Then, since $v \in E(N_i)$ by hypothesis and, by Remark 1, the path from w to v in N is preserved in $N_i$, there exist paths from w to v and from w′ to f(v) in $N_i$, such that ℓ(v) = ℓ(f(v)) and therefore v ≤E w in $N_i$.

In the opposite direction, suppose that v, w are such that v ≤E w. Again by Lemma 6, in $N_i$ there exists w′ such that ℓ(w) = ℓ(w′) and $N_i(w) \cong N_i(w')$ via an isomorphism f. Since v ≤E w, there exists a path from w to v and a path from w′ to f(v), and ℓ(v) = ℓ(f(v)). Now, since there are no elementary nodes in N, there must exist j < i such that in $N_j$ (it could be that $N_j = N$), the nodes v and w are reticulations. By Remark 1, this implies that there would exist a path from w to v in $N_j$, and therefore v ≤R w in $N_j$, and so in N. This concludes the proof.

Given N an IMLN, u ∈ Rmin(N) and U(N, u), we would like to consider N to be the result of a folding operation over U(N, u): N = F(U(N, u), u), for some suitable F. For any unfolding sequence $(N_1, \ldots, N_r)$, we say that each of its members is a (phylogenetic) pseudo-network; in particular, they are IMLN’s. Equivalently, we can define a pseudo-network recursively as follows: let N be an IMLN; it is a pseudo-network if it satisfies the following three conditions:

  (i) no reticulation node descends from an elementary node;
  (ii) for any u ∈ Emax(N) there exists v ∈ Emax(N) such that ℓ(u) = ℓ(v) and N(u) = N(v) as IMLT’s;
  (iii) for any u ∈ Emax(N), the IMLN obtained by the process of
    1. considering the node v ∈ Emax(N) such that ℓ(v) = ℓ(u) and N(u) = N(v), and the parent of v, say v(1);
    2. deleting N(v), as well as the edge (v(1), v);
    3. adding the arc (v(1), u),
    is also a pseudo-network.

The IMLN obtained by the process described in (iii) is denoted by F(N, u), and called the folded IMLN of N at u. Notice that if u, v ∈ Emax(N) are such that ℓ(u) = ℓ(v), then F(N, u) = F(N, v).
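The folding step in (iii) can be sketched as follows. This is an illustration under our own assumptions: the network is a dictionary from each node to its list of children, `labels` holds node labels, and the caller supplies both twin nodes u and v. The function checks that their labels agree but, for brevity, does not verify that the two subtrees are isomorphic or that the nodes are maximal.

```python
def subtree_nodes(children, v):
    """The node v together with everything below it."""
    seen, stack = set(), [v]
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(children[w])
    return seen


def fold(children, labels, u, v):
    """Return (children, labels) of F(N, u), where v is the twin of u.

    Assumes u and v are distinct maximal elementary nodes with the same
    label and isomorphic subtrees (only the label condition is checked)."""
    assert u != v and labels[u] == labels[v]
    doomed = subtree_nodes(children, v)  # N(v) is deleted
    # The arc (parent(v), v) is replaced by the arc (parent(v), u).
    new_children = {w: [u if k == v else k for k in kids]
                    for w, kids in children.items() if w not in doomed}
    new_labels = {w: lab for w, lab in labels.items() if w not in doomed}
    return new_children, new_labels


# Hypothetical pseudo-network: h and h2 are twin elementary nodes labelled l1.
T = {"r": ["a", "b"], "a": ["x1", "h"], "b": ["h2", "x3"],
     "h": ["x2a"], "h2": ["x2b"], "x1": [], "x2a": [], "x2b": [], "x3": []}
TL = {"x1": "x1", "x2a": "x2", "x2b": "x2", "x3": "x3", "h": "l1", "h2": "l1"}
F, FL = fold(T, TL, "h", "h2")
```

In the result, `h` has the two parents `a` and `b`, so it is again a reticulation, and the copy rooted at `h2` is gone.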

Lemma 8. Let N be a pseudo-network and u ∈ Rmin(N). Then,

$$F(U(N, u), u) = N.$$

Proof. Let N′ = U(N, u). Since u ∈ Rmin(N), then N(u) (the tree rooted at u) is an IMLT. Let v1, v2 be the parents of u in N. When N(u) is duplicated in the unfolding process, u and a new copy of it, say v, are elementary nodes and the roots of N′(u) and N′(v) respectively, such that N′(u) = N′(v). Moreover, (v1, u), (v2, v) are arcs in N′. Since u ∈ Emax(N′) (because u ∈ Rmin(N)), by Lemma 6, v is the other node in Emax(N′) such that ℓ(u) = ℓ(v) and N′(u) = N′(v). By definition of the folding process of N′ at u, the IMLT N′(v) and also the arc (v2, v) are deleted and a new arc (v2, u) is created. This results in a reticulation node u with parents v1 and v2 which is the root of N′(u). Since N(u) = N′(u), then F(N′, u) = N.

Given N an IMLN and $(N_1, \ldots, N_r)$ a sequence of unfoldings, by Lemma 8 we have that $F(N_1, u_1) = N$ and that $F(N_i, u_i) = N_{i-1}$ for i ∈ {2, …, r}. Therefore, we derive the following result.

Corollary 9. Let N be an internally labelled phylogenetic network and $(N_1, \ldots, N_r)$ any sequence of unfoldings. Then

$$N = F(\cdots F(F(N_r, u_r), u_{r-1}) \cdots, u_1).$$

Note that, as with equivalent compatible sequences, there is not a unique way to recover the IMLN N by applying a set of foldings.

If N is a pseudo-network we know that it is the product of a sequence of unfoldings performed over an IMLN, N′. We can then rewrite Corollary 9, by defining a function F from the set of pseudo-networks to the set of IMLN’s by F(N) ≔ N′. Hence,

Corollary 10. Let N be an internally labelled phylogenetic network. Then

$$F(U(N)) = N.$$

This result is the analogue of the concept of stable networks in Section 4 of [10]. The key difference here is that we allow elementary nodes.

A polynomial for internally multi-labelled phylogenetic networks

Given a phylogenetic network N on X, one can obtain a rooted tree by removing one incident arc to each reticulation node. These (sub)trees could contain elementary nodes, and its leaves might be labelled in X (the leaves from N) and other sets different from it (for instance when the single outgoing arc to a reticulation is removed). Those trees become unrooted if the direction of the arcs is suppressed (particularly, the root becomes a degree two node) and are called embedded spanning trees if its set of leaves is exactly X. Tree-child phylogenetic networks are characterized by their set of embedded spanning trees [11], but not general phylogenetic networks.

In [3], the Liu polynomial is generalized to phylogenetic networks by their sets of embedded spanning trees. Roughly speaking, the polynomial of the network is the product of the polynomials of the embedded spanning trees (considering trees with multiplicity). Consequently, this extension is a complete invariant for tree-child networks.

There are some natural extensions of the Liu polynomial to IMLN’s that come to mind. The first one, for internally labelled phylogenetic networks, is to completely unfold such a network and, from any elementary node u labelled λi, for some i ∈ {1, …, r} and labels λi distinguishable from labels xi, grow an arc to a new node v, label v as λi, and finally forget the labelling of u. Thus, the unfolded IMLT becomes a multi-labelled tree over leaves {x1, …, xn, λ1, …, λr}. See an example of that decomposition in Fig 2 from the internally labelled phylogenetic network N depicted in Fig 1. By means of Corollary 3.5 in [1], this extension of the polynomial is immediately seen to uniquely characterize an internally labelled phylogenetic network.

Fig 2. A multi-labelled tree derived from an internally labelled phylogenetic network.

Let N be the network depicted in Fig 1. This figure depicts a decomposition of N resulting in a multi-labelled tree.

https://doi.org/10.1371/journal.pone.0268181.g002

We will here deal with a natural extension that reflects the reticulation process in the sheer morphology of the polynomial, rather than in the name of the variables.

Let N be an IMLN. Then, consider

$$p\colon V(N) \to \mathbb{Z}[x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_r, y]$$

to be defined recursively as follows. Let u ∈ V(N), then:

  • if u is a leaf, p(u) = φ(u);
  • if u is an internal tree node whose two children are v1, v2, p(u) = y + p(v1)p(v2);
  • otherwise, i.e. if u has only one child v and its associated label is λi = ℓ(u), then p(u) = λi p(v).

Then, let ρN be the root of N; we define p(N) to be p(ρN). Notice that this definition of the polynomial p is given over generic IMLN’s.
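The recursive definition translates directly into code. The sketch below is not the authors' implementation: it uses a small hand-rolled polynomial representation (a dictionary from monomials to integer coefficients, each monomial being a sorted tuple of (variable, exponent) pairs) to stay free of external libraries, and the names `var`, `add`, `mul`, `poly`, the example network and the string `"l1"` for λ1 are all assumptions made for illustration.

```python
from collections import Counter


def var(name):
    """The polynomial consisting of the single variable `name`."""
    return {((name, 1),): 1}


def add(p, q):
    """Sum of two polynomials stored as {monomial: coefficient}."""
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return out


def mul(p, q):
    """Product of two polynomials in the same representation."""
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            exps = Counter(dict(m1))
            exps.update(dict(m2))  # adds the exponents of shared variables
            m = tuple(sorted(exps.items()))
            out[m] = out.get(m, 0) + c1 * c2
    return out


def poly(children, labels, u):
    """p(u), following the three recursive rules."""
    kids = children[u]
    if not kids:  # leaf: p(u) = phi(u)
        return var(labels[u])
    if len(kids) == 2:  # tree node: p(u) = y + p(v1) p(v2)
        return add(var("y"), mul(poly(children, labels, kids[0]),
                                 poly(children, labels, kids[1])))
    # reticulation or elementary node: p(u) = lambda_i * p(v)
    return mul(var(labels[u]), poly(children, labels, kids[0]))


# Hypothetical network: reticulation h (label l1) with child x2, parents a and b.
N = {"r": ["a", "b"], "a": ["x1", "h"], "b": ["h", "x3"], "h": ["x2"],
     "x1": [], "x2": [], "x3": []}
labels = {"x1": "x1", "x2": "x2", "x3": "x3", "h": "l1"}
p = poly(N, labels, "r")
```

For this network the result corresponds to y + (y + λ1 x1 x2)(y + λ1 x2 x3), which expands into five monomials. The recursion revisits the subgraph below a reticulation once per parent; memoizing on `u` would avoid that for larger networks.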

For example, the polynomial associated to the IMLN represented in Fig 1 is

Proposition 11. Let N be an IMLN. Then, for any u ∈ V(N), p(u) is an irreducible polynomial if and only if u is a tree node.

Proof. If u is not a tree node the polynomial will not be irreducible, since then there would exist v ∈ V(N) as the only child of u, and p(u) = ℓ(u)p(v).

It then remains only to see that if u is a tree node, p(u) is irreducible. In this case, either u is a leaf and then p(u) = φ(u) = xi for some i ∈ {1, …, n} and so irreducible, or u has two children and p(u) = y + Λp(w1)p(w2), where Λ is a product of λi from λ1, …, λr, and w1, w2 are the first descendants from u at each side that are tree nodes (they are possibly equal). Now consider the polynomial p′(u) obtained from p(u) by changing every variable x1, …, xn, λ1, …, λr for, say, x1. Then, it can be seen that p′(u) satisfies Eisenstein’s irreducibility criterion in (which is an unique factorization domain, UFD) applied to the ideal 〈y〉, and so p(u) is irreducible when seen as a polynomial in . But, since y does not divide p(u), then p(u) is also irreducible in .

The next proposition will show that the polynomial is conserved throughout a sequence of unfoldings, and therefore will allow us to compute it over any of its members without distinction. In particular, it can be computed on the unfolding of the network.

Proposition 12. Let N be an IMLN, and $(N_1, \ldots, N_r)$ be a sequence of unfoldings. Then, $p(N) = p(N_1)$ and, for any i ∈ {1, …, r − 1}, $p(N_i) = p(N_{i+1})$.

Proof. Let N′ be an IMLN, and u ∈ Rmin(N′). If we are able to show that p(N′) = p(U(N′, u)), then the proposition will hold. Let v(1), v(2) be the parents of u. In U(N′, u), each of them will be the parent of at least one elementary node ux, x ∈ {1, 2}, which will be the root of a copy of the IMLT N′(u), and by construction p(u1) = p(u2) = p(u) = p(N′(u)). Now, by the definition of the polynomial, p(v(x)) will be the same in N′ and in U(N′, u). Therefore, p(N′) = p(U(N′, u)).
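As a small worked instance of this argument, consider a hypothetical network N (chosen for illustration, not taken from the paper) whose root has children a and b, where a has children x1 and u, b has children u and x3, and u is a reticulation labelled λ1 with the single child x2. Writing u1 and u2 for the two copies of u created by the unfolding, the computation might read:

```latex
% Hypothetical example: root with children a, b; a -> x_1, u; b -> u, x_3; u -> x_2.
\begin{align*}
\text{In } N:\quad
  & p(u) = \lambda_1 x_2, \qquad p(a) = y + x_1\,p(u), \qquad p(b) = y + p(u)\,x_3,\\
  & p(N) = y + p(a)\,p(b) = y + (y + \lambda_1 x_1 x_2)(y + \lambda_1 x_2 x_3).\\[4pt]
\text{In } U(N,u):\quad
  & p(u_1) = p(u_2) = \lambda_1 x_2, \qquad p(a) = y + x_1\,p(u_1), \qquad p(b) = y + p(u_2)\,x_3,\\
  & p(U(N,u)) = y + (y + \lambda_1 x_1 x_2)(y + \lambda_1 x_2 x_3) = p(N).
\end{align*}
```

The values at a and b are unchanged because each parent only sees the polynomial of its own copy of u, and both copies carry the same label and the same subtree.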

We now introduce two remarks, the first concerning the interpretation of the coefficients and, the second, about the reconstruction of the unfolding of an IMLN from the polynomial if it characterizes the IMLN.

Remark 2. The interpretation of the coefficients of the polynomial p(N) can be extended from Lemma 2.4 in [1] by slightly modifying the definition of primary subtrees to the IMLT T = U(N). Let a primary subtree S of T be a rooted subtree of T such that S shares the same root node with T and any leaf node in T is either a leaf node in S or a descendant of a leaf node in S which does not come from an elementary node.

Then, if we write p(N) as a sum of monomials of the form c · λ1^γ1 ⋯ λr^γr · x1^α1 ⋯ xn^αn · y^β, each coefficient c counts the number of primary subtrees of U(N) satisfying that:

  • γi (for i ∈ {1, …, r}) is the number of nodes labelled by λi of these subtrees;
  • αi (for i ∈ {1, …, n}) is the number of leaf nodes labelled by xi of these subtrees which are also leaves in U(N);
  • β is the number of leaf nodes of these subtrees which are internal nodes in U(N).

See Fig 3 for the interpretation of some of the terms of the polynomial p(N) of the IMLN N depicted in Fig 1. Notice that these primary subtrees can then be folded into a sort of “sub-primary networks”.

Fig 3. Two primary subtrees of U(N).

Let N be the IMLN depicted in Fig 1. The figure depicts two primary subtrees of U(N) corresponding to the terms λ1λ2 x2 y3 (left), and (right), of the polynomial p(N).

https://doi.org/10.1371/journal.pone.0268181.g003
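To make the reading of the exponents in Remark 2 concrete, the sketch below expands the polynomial of a small IMLT and splits each monomial into its λ-part, x-part and y-part. It is only an illustration under stated assumptions: the tree TREE is hypothetical (it is not the network of Fig 1), variables named "l…" stand for the labels λi, and the helper names are not from the paper.

```python
from collections import Counter

def var(name):
    """Polynomial consisting of the single variable `name`."""
    return Counter({(name,): 1})

def mul(p, q):
    """Product of polynomials stored as {sorted tuple of variables: coefficient}."""
    out = Counter()
    for a, ca in p.items():
        for b, cb in q.items():
            out[tuple(sorted(a + b))] += ca * cb
    return out

def p_tree(t):
    """Polynomial of an IMLT given as nested tuples."""
    if t[0] == "leaf":
        return var(t[1])
    if t[0] == "elem":
        return mul(var(t[1]), p_tree(t[2]))
    return var("y") + mul(p_tree(t[1]), p_tree(t[2]))

def split_exponents(monomial):
    """Return (gamma, alpha, beta): exponents of the l-variables, the x-variables and y."""
    counts = Counter(monomial)
    gamma = {v: e for v, e in counts.items() if v.startswith("l")}
    alpha = {v: e for v, e in counts.items() if v.startswith("x")}
    beta = counts.get("y", 0)
    return gamma, alpha, beta

# Hypothetical IMLT: root -> (elementary node l1 -> cherry (x1, x2)) and leaf x3.
TREE = ("tree",
        ("elem", "l1", ("tree", ("leaf", "x1"), ("leaf", "x2"))),
        ("leaf", "x3"))

P = p_tree(TREE)
TERMS = {mono: (coef, split_exponents(mono)) for mono, coef in P.items()}
```

Here P has three terms, each with coefficient 1, matching three primary subtrees: the root alone (term y), the tree cut at the cherry node (term λ1 x3 y), and the whole tree (term λ1 x1 x2 x3).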

Remark 3. In this remark we shall give a first approximation to the problem of reconstructing the Newick string of an IMLT U(N) from p(N), in the case where the polynomial characterizes N. Roughly speaking, we proceed as follows: start by subtracting y from p(N) and then factor p(N) − y = q1q2. Then the Newick string to consider is (q1, q2). From now on, whenever it is possible to subtract y from a polynomial q, do so. If the factorization involves only two members, q = q1q2, then proceed as before and replace q by (q1, q2). Otherwise, there could be conflicts in terms of deciding how to group members in a factorization of type where qk are polynomials. But in the queue of factorizations pending to be grouped there will always be a pair in which a “minimum” monomial of type λiqs is common to both; this allows one to determine that there is an arc from an elementary node labelled by λi to the subtree determined by the polynomial qs. In terms of the Newick string, it could be replaced by (λi(qs)).
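As an illustration of the simplest case of this procedure (every factorization has exactly two members and no labels λi occur), consider a hypothetical three-leaf tree, chosen here only as an example, whose polynomial is y + x3(y + x1x2). The steps of Remark 3 can then be followed by hand:

```latex
\begin{align*}
p(N) &= y + x_3\,(y + x_1 x_2) \\
p(N) - y &= \underbrace{(y + x_1 x_2)}_{q_1}\cdot\underbrace{x_3}_{q_2}
  && \Rightarrow\ (q_1, q_2) \\
q_1 - y &= x_1 \cdot x_2
  && \Rightarrow\ q_1 \mapsto (x_1, x_2) \\
q_2 &= x_3 \ \text{(no $y$ to subtract, so $q_2$ is a leaf)} \\
\text{Newick string:}\quad & ((x_1, x_2), x_3)
\end{align*}
```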

We are now especially interested in determining under which conditions the polynomial associated to an IMLN uniquely characterizes it. Note that this is not always the case, not even for IMLT’s. See for instance the three representations of IMLT’s in Fig 4. The polynomial fails to distinguish between them. Roughly speaking, looking at the polynomials of the elementary vertices we could readily distinguish between the three possibilities, but we cannot do so by only looking at p(u), since p(u) = y + λ1λ2 p(w1)p(w2).

Fig 4. Non-isomorphic IMLT’s.

Three non-isomorphic IMLT’s presenting the same polynomial at u.

https://doi.org/10.1371/journal.pone.0268181.g004
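The failure to distinguish such IMLT's is easy to reproduce. The sketch below builds three hypothetical trees in the spirit of Fig 4 (they are not claimed to be the exact trees of the figure): the elementary nodes labelled λ1 and λ2 are placed both above w1, one above each of w1 and w2, or both above w2. The encoding and helper names are assumptions made for illustration.

```python
from collections import Counter

def var(name):
    """Polynomial consisting of the single variable `name`."""
    return Counter({(name,): 1})

def mul(p, q):
    """Product of polynomials stored as {sorted tuple of variables: coefficient}."""
    out = Counter()
    for a, ca in p.items():
        for b, cb in q.items():
            out[tuple(sorted(a + b))] += ca * cb
    return out

def p_tree(t):
    """Polynomial of an IMLT given as nested tuples."""
    if t[0] == "leaf":
        return var(t[1])
    if t[0] == "elem":
        return mul(var(t[1]), p_tree(t[2]))
    return var("y") + mul(p_tree(t[1]), p_tree(t[2]))

# Hypothetical subtrees hanging from w1 and w2 (a leaf and a cherry).
W1 = ("leaf", "x1")
W2 = ("tree", ("leaf", "x2"), ("leaf", "x3"))

# Three ways of distributing the elementary nodes l1, l2 below the tree node u.
T1 = ("tree", ("elem", "l1", ("elem", "l2", W1)), W2)
T2 = ("tree", ("elem", "l1", W1), ("elem", "l2", W2))
T3 = ("tree", W1, ("elem", "l1", ("elem", "l2", W2)))

POLYS = [p_tree(t) for t in (T1, T2, T3)]
```

All three entries of POLYS equal y + λ1λ2 p(w1)p(w2), although the trees differ in where the elementary nodes sit.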

Strong paths.

We shall now present a series of definitions. Let N be an IMLN, and u, v ∈ V(N). If there exists a path from u to v consisting only of elementary or reticulation nodes, we say that u is a strong ancestor of v, and that v is a strong descendant of u. Such a path is called a strong path. For example, by considering the situation in Fig 4, we can see that in all three cases w1, w2 strongly descend from u.
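Strong ancestry is a reachability question and can be sketched as a graph search. The code below assumes one reading of the definition: only the nodes strictly between u and v must be elementary or reticulation nodes, since the examples in the text take u, w1 and w2 to be tree nodes, and the path must contain at least one arc. The toy network and the function name are hypothetical.

```python
def is_strong_ancestor(children, is_tree, u, v):
    """True if some path u -> ... -> v has only non-tree nodes strictly between u and v."""
    stack = list(children[u])
    seen = set()
    while stack:
        w = stack.pop()
        if w == v:
            return True
        if w in seen or is_tree[w]:
            continue  # a tree node other than v blocks the strong path
        seen.add(w)
        stack.extend(children[w])
    return False

# Hypothetical toy network: reticulation r has parents t1 and t2 and child b.
CHILDREN = {
    "rho": ["t1", "t2"],
    "t1": ["a", "r"],
    "t2": ["r", "c"],
    "r": ["b"],
    "a": [], "b": [], "c": [],
}
# Leaves count as tree nodes, as in the text.
IS_TREE = {"rho": True, "t1": True, "t2": True, "r": False,
           "a": True, "b": True, "c": True}
```

With this encoding, t1 and t2 are both strong ancestors of b (through r), while rho is not, because every path from rho to b passes through a tree node.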

Lemma 13. Let N be an internally labelled phylogenetic network, and v1, v2 two reticulation nodes. If p(v1) = p(v2), then v1 = v2.

Proof. Let w1 be the child of v1; by the definition of the polynomial, p(v1)/p(w1) = λi for some λi ∈ {λ1, …, λr}. Since p(v1) = p(v2), it also means that p(v2)/p(w1) = λi, but since N is an internally labelled phylogenetic network this implies that v2 is a parent of w1 and that v2 is labelled by λi. Thus, v1 and v2 are the same node.

Lemma 14. Let N be an internally labelled phylogenetic network, and v a reticulation node in it. A node u is a strong ancestor of v if, and only if, one of the following two conditions holds:

  • p(v) | p(u), that is p(v) divides p(u), and then u is a reticulation node, or
  • p(v) | (p(u) − y), and then u is a tree node.

Proof. By the definition of the polynomial and Lemma 13.

Now, if we want to compare two IMLN’s on the same sets of labels {x1, …, xn} and {λ1, …, λr}, we should take into account the possibility that two of them are isomorphic up to a permutation of the labels. In order to express this possibility, let σ: {x1, …, xn, λ1, …, λr} → {x1, …, xn, λ1, …, λr} be a permutation such that σ(X) = X (i.e., that fixes the sets of labels of the leaves and of the elementary or reticulation nodes). Given an IMLN N, we denote by σN the network isomorphic to N that has all its labels permuted according to σ, and by σp(N) we mean p(σN) or, equivalently, the polynomial that has all its variables changed according to σ.

Definition 7. Let N1, N2 be two IMLN’s, and σ a permutation of their labels such that σ(X) = X. We say that N1 and N2 are equivalent modulo strong paths if the following three conditions are satisfied:

  1. p(N1) = σp(N2);
  2. there exists a bijection f between the sets of tree nodes of N1 and N2 such that, if u, v are tree nodes and v is a strong descendant of u, then f(v) is a strong descendant of f(u);
  3. for any tree node u in N1, p(u) = σp(f(u)).

Being equivalent modulo strong paths is an equivalence relation.

Remark 4. The above definition can also be easily stated exclusively in terms of strong paths, which are intrinsic to the IMLN. However, the definition in terms of the polynomial is more tractable and concise.

Notice that all the IMLT’s in Fig 4 are equivalent modulo strong paths. Indeed, we present the following theorem:

Theorem 15. Let N1, N2 be two IMLN’s, and σ a permutation of their labels such that σ(X) = X. Then, p(N1) = σp(N2) if, and only if, N1 and N2 are equivalent modulo strong paths.

Proof. The “if” part of the implication is direct by the first condition of the definition of equivalence modulo strong paths.

Suppose now that p(N1) = σp(N2), and let us show that N1 and N2 must be equivalent. We first see that there exists a bijection f between the sets of tree nodes of N1 and N2 such that for any tree node u in N1, p(u) = σp(f(u)). We will use the following inductive schema: we shall prove that, if u is a tree node in N1 and f(u) is a tree node in N2 such that p(u) = σp(f(u)), and w1, w2 in N1 are the two tree nodes that strongly descend from u, then the two tree nodes w′1, w′2 that strongly descend from f(u) in N2 are such that p(w1) = σp(w′1) and p(w2) = σp(w′2). Then, we will provide tree nodes u1, u2 in N1 and N2, respectively, from which all other tree nodes will descend and such that p(u1) = σp(u2).

Let u be a tree node in N1, and w1, w2 be the two tree nodes that strongly descend from it. Then, p(u) = y + μ1 ⋅ … ⋅ μr p(w1)p(w2), for μ1, …, μr ∈ {λ1, …, λr}. Then, if p(u) = σp(f(u)), the product μ1 ⋅ … ⋅ μr p(w1)p(w2) equals σ(p(f(u)) − y), which is a product of some of λ1, …, λr with σp(w′1)σp(w′2), where w′1, w′2 are the tree nodes that strongly descend from f(u) in N2; but since p(w1), p(w2) are both irreducible and different from any λi, then it must happen that (without loss of generality) p(w1) = σp(w′1) and p(w2) = σp(w′2). Thus, set f(w1) = w′1 and f(w2) = w′2.

We will now show that there is a tree node in both N1 and N2 such that any other tree node descends from it. Suppose that the root of N1, say ρ1, is a tree node; if so, since p(N1) = σp(N2) and by Proposition 11, the root of N2, say ρ2, must also be a tree node. Therefore, any other tree node in their respective IMLN’s must descend from them, and furthermore p(ρ1) = σp(ρ2). Set f(ρ1) = ρ2.

Finally, suppose that ρ1 is not a tree node; then, p(ρ1) is not an irreducible polynomial, and therefore neither is σp(ρ2). Let w1 be the only tree node strongly descending from ρ1 in N1. It is straightforward to see that, if w′1 is the only tree node strongly descending from ρ2 in N2, then p(w1) = σp(w′1). In both cases, any other tree node in the network will descend from them. Therefore, set f(w1) = w′1.

Now, the question arises: under which conditions can we say that two internally labelled phylogenetic networks that are equivalent modulo strong paths are actually isomorphic?

Separability: A sufficient condition.

In this part we shall give a sufficient condition for two internally labelled phylogenetic networks to be completely characterized by the polynomial. In order to do so, we will work with the immediate neighbourhood of any tree node.

Let N be a phylogenetic network, and let u be a tree node in N. Let w1, w2 be the two (possibly equal) tree nodes that strongly descend from it. Let be the reticulation nodes in the strong paths from u to w1 and w2, and suppose that there are r1 such nodes in the path from u to w1 and r2 in the other. See Fig 5. Let U(u) = {u1, …, uk} be the set of all the tree nodes that are strong ancestors of w1 or w2 different from u. Note that the node ui in Fig 5 (left) is a node in U(u). In what follows, we will allow ourselves to write U if the context is sufficiently clear. We will present now the following lemma.

Fig 5. Strong paths from a tree node.

A tree node u and its strong descendants w1 and w2 (left) or w1 (right). The curly paths represent strong paths. The nodes v and ui are used in the proof of Lemma 16.

https://doi.org/10.1371/journal.pone.0268181.g005

Lemma 16. Consider the situation above. Let v be a reticulation node from the collection . Then, there are two possibilities:

  • both its parents are nodes from , or
  • there exists at least one tree node ui ∈ U such that there is a strong path from ui to v not containing any other reticulation node .

Furthermore, the first possibility can only happen for one reticulation node in , and it will hold if, and only if, w1 = w2.

Proof. Suppose that v is the first reticulation node (counting by proximity to u) that satisfies the first condition (this makes sense, since our networks are binary). In this situation, from it emerges only one path up to the next tree node. But since N is binary, the two paths that emerged from u are now confounded in the only path from v to the next tree node, w1 = w2. See Fig 5, right. Therefore, since there is now only one path of reticulation nodes, no other node in it can satisfy the first condition.

If v does not satisfy the first condition, one of its parents must not be from . Let ui be a tree node strong ancestor of such a parent of v. The pair v, ui satisfies the second condition. See Fig 5, left.

We say that a tree node uiU(u) enters the neighbourhood of u at v if the pair v, ui satisfies the second condition of Lemma 16. If the context is sufficiently clear, we shall only say that it enters at v. Likewise, we say that v is the entry of ui to the neighbourhood of u (or that it is just its entry).

We can then divide the set U into five sets: let v(x), x ∈ {1, 2}, be the two children of u, then we define

Notice that, if w1 ≠ w2, then

The above division is a partition of U. In Fig 6 three tree nodes u1, u2 and u3 from the set U = U(u) are represented. Note that , and u3 ∈ U3.

Fig 6. Division of U(u).

Three tree nodes illustrating the types of sets in the division of U(u). In this case, , and u3 ∈ U3.

https://doi.org/10.1371/journal.pone.0268181.g006

In general, given all the polynomials evaluated at each tree node of U, we cannot deduce the exact configuration of the vi’s. Recall, for instance, the case where r1 + r2 = 2, that is, the three situations presented in Fig 4. There, we had no a priori information on which vi were strong ancestors of w1 and which of w2. This fact motivates the following definition.

Definition 8. Let N be a phylogenetic network and u a tree node in it. Let v(x), x ∈ {1, 2}, be the two children of u. We say that u is separable if either v(1) and v(2) are tree nodes, or if there exists a tree node u1 different from u such that it satisfies one of the following conditions:

  • u1 is a strong ancestor of v(1) (or v(2)) but not of any other strong descendant of u, or
  • u1 is a strong ancestor of v(1) (or v(2)) and of one of its strong descendants.

Remark 5. In this case, the negative definition might be more intuitive. Let u be a tree node with w1 and w2 the tree nodes strongly descended from u. Then u is not separable if neither of its two children v(1) and v(2) is a tree node, and

  • if w1 ≠ w2, all the strong ancestors of v(1), v(2) that are not u are in U3(u), or
  • if w1 = w2 and v is the first reticulation node that is a strong descendant of both v(1) and v(2), then any strong ancestor of v(1) that is not u will be a strong ancestor of a reticulation node in the strong path from v(2) to v, and vice versa.

A phylogenetic network is called separable if all its tree nodes are so.

Remark 6. Notice that separability is a purely topological condition. Thus, we will use it interchangeably for phylogenetic networks and internally labelled phylogenetic networks.

The key point in separability is that given u a separable tree node and all the polynomials of the tree nodes that are strong ancestors of w1 and w2, we can actually identify the polynomial p(u1) of the tree node that satisfies the conditions of the definition, and thus we can identify which reticulation nodes descend from v(1) and which from v(2). Indeed: if w1 ≠ w2, p(u1) will be such that p(w1) divides p(u1) − y but p(w2) does not, and contains the largest number of λ1, …, λr dividing p(u) − y. If w1 = w2, the argument is analogous using . As a result, we are able to deduce that , x ∈ {1, 2}, for dividing p(u) − y. Thus, we are able to “separate” p(u) into the contributions from p(v(1)) and p(v(2)).

Fig 7 depicts two sub-networks which can be part of internally labelled phylogenetic networks (and then part of the underlying phylogenetic networks) that are not separable. Notice that they are not separable at any of the nodes u1, u2, u3. The filled triangle and non-filled triangle pendant at w1 and w2 represent non-isomorphic sub-networks (for example a leaf and a cherry). Note that in both cases we have the same polynomials at ui, namely p(u1) = y + λ1λ2λ3 p(w1)p(w2), p(u2) = y + λ1λ2λ3λ4 p(w1)p(w2) and p(u3) = y + λ1λ2λ4 p(w1)p(w2). Thus, we cannot distinguish between the sub-networks when looking at p(u1), p(u2), p(u3).

Fig 7. Non separable internally labelled phylogenetic networks.

None of the nodes u1, u2, u3 are separable. The filled and non-filled triangles pending from w1 and w2 represent non-isomorphic sub-networks.

https://doi.org/10.1371/journal.pone.0268181.g007

Lemma 17. Let N be an internally labelled phylogenetic network, and u1 a tree node in it such that it is one of the deepest tree nodes (i.e., one for which there exists a path of maximal length from the root to it) satisfying the following condition: there exists another tree node u2 such that p(u1) = p(u2). Then, u1 and u2 must have the same set of children.

Proof. If u1 is a leaf, there is nothing to prove, because all the leaves have different labels: if p(u1) = p(u2) and p(u1) = φ(u1), we must have u2 = u1. In the other case, let v(1), v(2) be the two children of u1; since p(v(1)) and p(v(2)) both divide p(u2) − y and are unique (because u1 is one of the deepest nodes satisfying the condition in the statement of the lemma), u2 is a strong ancestor of both of them. Therefore, v(1), v(2) must be reticulation nodes.

We write where w1, w2 are the tree nodes that strongly descend from u1, for x ∈ {1, 2}, and . From v(x) to wx there is only one strong path of length rx, and since u2 is a strong ancestor of both v(1) and v(2) there are r1 + r2 polynomials among λ1, …, λr that divide p(u2) − y. But this is exactly the number of polynomials among λ1, …, λr that must divide p(u2) − y, since p(u1) = p(u2).

Lemma 18. Let N be an internally labelled separable phylogenetic network, and u1, u2 two internal nodes in it. Then, p(u1) = p(u2) if, and only if, u1 = u2.

Proof. The “if” part is trivial by the definition of the polynomial. By Lemma 13, if either u1 or u2 is a reticulation node, the result is proven. Therefore, assume that u1, u2 are both tree nodes, and suppose, for the sake of contradiction, that u1 ≠ u2. Furthermore, assume that u1 is one of the deepest nodes satisfying that p(u1) = p(u2).

By Lemma 17, their sets of children are the same. Let v1, v2 be the two children of u1 and u2. Then u1 and u2 are the only strong ancestors of both v1 and v2. Moreover, u2 is in U3(u1). This means that u1 is not separable and, therefore, neither is N.

Corollary 19. If N is a separable phylogenetic network, then there is no pair of tree vertices with the same set of children.

Note that the other direction of the implication in the above Corollary is false. See for instance the (internally labelled) phylogenetic subnetworks depicted in Fig 7. These are non separable and they have different sets of children for every pair of tree nodes.

Isomorphism of internally labelled phylogenetic networks.

In this part we prove the main theorem of this paper. It roughly says that the polynomial is a complete invariant for the class of internally labelled separable phylogenetic networks up to equivalence modulo strong paths.

Lemma 20. Let N1, N2 be two internally labelled phylogenetic networks such that, for any u1, u2 ∈ Nx, x ∈ {1, 2}, p(u1) = p(u2) implies that u1 = u2. Suppose that, for any u, v ∈ V(N2), p(u) ≠ p(v) if u ≠ v, and let f: V(N1) → V(N2) be a bijection. If there exists a permutation σ of their labels with σ(X) = X such that p(u) = σp(f(u)) for any u ∈ V(N1), then f is an isomorphism of internally labelled phylogenetic networks.

Proof. In order to ease the notation, and without loss of generality, let us assume that σ is the identity. The fact that f is a bijection is already required in the statement of the Lemma. Then, we must prove that if (u, v) ∈ E(N1), then (f(u), f(v)) ∈ E(N2) and that f preserves the labels.

Suppose that u is a reticulation node; if (u, v) ∈ E(N1), then p(u) = λi p(v) for some λi ∈ {λ1, …, λr}. Therefore, p(f(u)) = λi p(f(v)) which, since p(f(v)) is unique for f(v), implies that f(v) is the only child of f(u) (which is a reticulation node since p(f(u)) is not irreducible).

Suppose now that u is a tree node, and let v1, v2 be its two children. Then, we know that p(vx) = p(f(vx)) for x ∈ {1, 2}, and that p(f(u)) = y + p(f(v1))p(f(v2)). Since each node is uniquely characterized by its polynomial, it means that both f(v1) and f(v2) are strong descendants of f(u). By an argument analogous to that in the proof of Lemma 17, we can deduce that f(v1) and f(v2) are actually the children of f(u).

Now, we prove that f preserves the labels on the leaves and on the reticulations. If u ∈ L(N1), then f(u) ∈ L(N2). Since u ∈ L(N1), by definition, p(u) = φ1(u). Moreover, p(u) = p(f(u)) because leaves are tree nodes. Since f(u) ∈ L(N2), p(f(u)) = φ2(f(u)). Then, φ1(u) = φ2(f(u)). Now, let u ∈ R(N1) (a reticulation of N1). By definition, p(u) is the label of u in N1 multiplied by p(v), where v is the single child of u. We have seen above that p(f(u)) is the label of u in N1 multiplied by p(f(v)); but, since f(u) is a reticulation in N2 and f(v) is its single child, by definition, p(f(u)) is the label of f(u) in N2 multiplied by p(f(v)). Then, the label of u in N1 equals the label of f(u) in N2.

Theorem 21. Let N1, N2 be two internally labelled separable phylogenetic networks. If they are equivalent modulo strong paths, then they are isomorphic.

Proof. By Lemma 18, if N1 and N2 are separable, then p(u1) = p(u2) implies u1 = u2 for any internal node in either N1 or N2. Then, if we are able to find a bijection f between the sets of nodes satisfying the premises of Lemma 20, we will be able to apply it and show the result.

Now, N1 and N2 are equivalent modulo strong paths, and that means that there exists a bijection f between the sets of tree nodes such that, for a fixed permutation σ between the sets of labels with σ(X) = X, p(u) = σp(f(u)) for any tree node u, and if u, v are tree nodes and v is a strong descendant of u, then f(v) is a strong descendant of f(u). We shall show that this f induces our bijection if we generalize it to any internal node (i.e., if we define it correctly over the reticulation nodes in N1). In order to ease the notation, and without loss of generality, let σ be the identity.

Let v be a reticulation node in N1, and u a tree node that is a strong ancestor of it. Let v(1), v(2) be the children of u, and suppose that v strongly descends from v(1). Let w1, w2 be the two (possibly equal) tree nodes that strongly descend from u.

Since N1 is separable, in particular u is separable, and we know that we can write and . Now, by Lemma 16, either (1) there exists a tree node u′ that enters the neighbourhood of u at v, or (2) it does not and both parents of v are strong descendants of u.

Thus, we distinguish the following cases:

  1. (1). There exists a tree node u′ that enters the neighbourhood of u at v, and
    • if v is the only reticulation node at which u′ enters the neighbourhood of u (that is ), then , where are the only polynomials in λ1, …, λr that divide both p(u) − y and p(u′) − y.
    • if u′ also enters the neighbourhood of u at some v′ and there is no strong path between v and v′ (that is u′ ∈ U3(u)), then , where are the only polynomials in λ1, …, λr that divide both p(u) − y and p(v(1)).
    • if u′ also enters the neighbourhood of u at some v′ that is a strong ancestor of v (that is a case where ), then , where are the only polynomials in λ1, …, λr such that they divide p(u) − y and, for every j ∈ {i1, …, r1}, .
    • if u′ also enters the neighbourhood of u at some v′ that is a strong descendant of v (that is a case where ), then , where are the only polynomials in λ1, …, λr that divide both p(u) − y and p(u′) − y.
    Notice that the above arguments are independent of whether w1 = w2 or not.
  2. (2). Both parents of v are strong descendants of u (and so w1 = w2). Let the label of the reticulation v and let the labels of reticulations in the strong path from v to w1. Then , where μj for j ∈ {i1, …, r3} are the only polynomials in λ1, …, λr such that (μj)^2 | p(u) − y.

Since N2 is also separable, in particular f(u) is separable, and since p(f(u)) = p(u) (because N1 and N2 are equivalent modulo strong paths), some of its children cannot be a tree node. Therefore, if are its children, there must exist a tree node u1 that is either a strong ancestor of but not of any other strong descendant of f(u) or a strong ancestor of, say, and of one of its strong descendants. This node will allow us to characterize . But since N1 and N2 are equivalent modulo strong paths, there exists f−1(u1) in N1 that satisfies the same condition with regard to the pair u, v(1) in N1, and so and . Thus, we set and .

Now, for any v* reticulation node strongly descending from either or , any of its strong ancestors that are tree nodes are such that there exists a tree node in N1 with its same polynomial (and thus, is a strong ancestor of some v strongly descending from u). Therefore, we will have that p(v) = p(v*), and we can then set f(v) = v*.

Theorem 15 and Theorem 21 together imply the following main result.

Theorem 22. Let N1, N2 be two internally labelled separable phylogenetic networks, and σ a permutation of their labels such that σ(X) = X. If p(N1) = σp(N2), then N1 and N2 are isomorphic.

Orchard networks.

In this subsection we prove that the phylogenetic networks in the class of orchard networks [12] are separable. These (strictly) include tree-child networks.

Before we recall the definition of orchard networks, we need to introduce some definitions. Let N be a phylogenetic network on X. Let {a, b} ⊆ X. The set {a, b} is a cherry of N if a and b share a parent. Let pa and pb be the parents of a and b, respectively. If pb is a reticulation and (pa, pb) is an arc in N, then {a, b} is a reticulated cherry of N.

Let N be a phylogenetic network and let {a, b} be a cherry of N. Then “reduce b” is the operation of deleting b and suppressing the resulting elementary node. If pa = pb is the root of N, then delete b and the root. If {a, b} is a reticulated cherry of N in which pb is the reticulation, “cut {a, b}” is the operation of deleting (pa, pb), and suppressing the two resulting elementary nodes. For both operations, we say that a cherry-reduction is performed on N.

Let N be a phylogenetic network. The sequence N = N0, N1, …, Nk of phylogenetic networks is a cherry-reduction sequence of N if, for all i ∈ {1, …, k}, the phylogenetic network Ni is obtained from Ni−1 by a (single) cherry-reduction. Then, a phylogenetic network N is orchard if there exists a cherry-reduction sequence N = N0, N1, …, Nk of N such that Nk consists of a single vertex.

Theorem 23. Orchard networks are separable.

Proof. Let N be an orchard network and let N = N1, …, Nk be a sequence of cherry-reductions of N. We prove that, for any i ∈ {1, …, k − 1}, if Ni is not separable, then neither is Ni+1. This means that if N is not separable, the last network in any cherry-reduction sequence cannot be a single vertex, which contradicts the fact that N is orchard.

If a reduction of a leaf in a cherry is performed, there is nothing to prove because it does not involve reticulation nodes. Then suppose that a cut of a reticulated cherry {a, b} is performed in Ni. Let pa and pb be the parents of a and b, respectively, and let pb be the reticulation node. Then pa is a tree node. Moreover, pa is a separable node in Ni because the single strong descendant of pa that is a reticulation node is pb. Hence, if Ni is not separable, this is due to some other tree node.

Notice that the cut of the reticulated cherry {a, b} does not change the relation of strong descendance in the remaining nodes; i.e., v strongly descended from u in Ni if, and only if, the corresponding nodes in Ni+1 satisfy this condition too. More precisely, let u be a non separable tree node, v(1), v(2) its children and w1, w2 the tree nodes that strongly descend from it. By Remark 5 this means that, to begin with, neither v(1) nor v(2) is a tree node and, if w1 ≠ w2, all the strong ancestors of v(1), v(2) that are not u are in U3(u). Now, pa can never be in U3(u) because one of its children is a leaf, a. Therefore, the cut of the reticulated cherry {a, b} would not affect the non separability of u. Suppose now that w1 = w2. By Remark 5, if v is the first reticulation node that is a strong descendant of both v(1), v(2), the reticulation node pb cannot be in the strong paths from v(1) to v and from v(2) to v (note also that pb ≠ v must hold). Then, both strong paths are unaffected by the cut of the reticulated cherry, and so is the set of strong ancestors of v(1) and v(2) that cause the non separability of u. Therefore, any non separable tree node in Ni continues to be so in Ni+1.

Unlabelled version.

Throughout this paper we have not made any use of the different labels of the leaves of an IMLN, and so the arguments could be translated, mutatis mutandis, to IMLN’s whose leaves are not labelled (although internal labels would still be necessary), modelled by labelling all leaves using a single variable x, to give a polynomial in the r + 2 variables x, λ1, …, λr, y. Again, for the case of phylogenetic networks, this would require that given two unlabelled phylogenetic networks we consider internally labelled phylogenetic networks with the same topology. This leads to the following proposition:

Proposition 24. Let N1, N2 be two internally labelled separable phylogenetic networks whose leaves are all labelled by x. Then, p(N1) = p(N2) implies that N1 and N2 are isomorphic.

Discussion and conclusion

In this paper a new complete polynomial invariant for a class of (binary) phylogenetic networks, that of separable networks, is introduced. It generalizes results both in [2] for phylogenetic trees and in [3] for phylogenetic networks that are characterized by their set of embedded spanning trees (such as tree-child networks). The introduced polynomial p is a generalization of the Liu polynomial and it is defined in a more generic structure of networks, called IMLN’s, where the reticulations are also labelled with labels other than those on the leaves. In contrast to [3], we compute the polynomial directly over the IMLN, and we avoid having to compute its set of spanning trees first. We prove that for the case of separable phylogenetic networks, the internally labelled structure derived from those is completely characterized by the polynomial. This induces a complete polynomial invariant for separable phylogenetic networks. That is, given two separable phylogenetic networks N1 and N2 on X, we could fix an internally labelled phylogenetic network obtained from N1 by bijectively labelling its reticulations. Then, if we consider all possible internally labelled phylogenetic networks obtained from N2 by the permutation of all its variables, X and the reticulations, we can compare the polynomial of the fixed network with the polynomials of all the networks obtained from N2. Note that, due to Proposition 24, we could avoid the permutation of the labels on X, reducing the cost of this computation.
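The comparison over relabelled reticulations can be sketched as a brute-force search over permutations of the labels λ1, …, λr. The code below is a hypothetical illustration and not an optimized procedure: polynomials are stored as dictionaries from sorted tuples of variable names to coefficients, the two polynomials P1 and P2 are hand-written examples, and the function names are assumptions. The search costs up to r! renamings.

```python
from collections import Counter
from itertools import permutations

def rename(p, mapping):
    """Apply a renaming of variables (a permutation sigma) to every monomial of p."""
    out = Counter()
    for mono, coef in p.items():
        out[tuple(sorted(mapping.get(v, v) for v in mono))] += coef
    return out

def match_up_to_reticulation_labels(p1, p2, lambdas):
    """Return a renaming of `lambdas` turning p2 into p1, or None if there is none."""
    for perm in permutations(lambdas):
        mapping = dict(zip(lambdas, perm))
        if rename(p2, mapping) == p1:
            return mapping
    return None

# Hand-written example polynomials (hypothetical):
# P1 = y + l2*x3*y + l1*l2*x1*x2*x3, and P2 is P1 with l1 and l2 swapped.
P1 = Counter({("y",): 1, ("l2", "x3", "y"): 1, ("l1", "l2", "x1", "x2", "x3"): 1})
P2 = Counter({("y",): 1, ("l1", "x3", "y"): 1, ("l1", "l2", "x1", "x2", "x3"): 1})
# P3 cannot be matched with P1 by renaming l1 and l2 only.
P3 = Counter({("y",): 1, ("x3", "y"): 1, ("l1", "l2", "x1", "x2", "x3"): 1})

MATCH = match_up_to_reticulation_labels(P1, P2, ["l1", "l2"])
NO_MATCH = match_up_to_reticulation_labels(P1, P3, ["l1", "l2"])
```

A successful match only implies isomorphism under the hypotheses of Theorem 22, that is, when both networks are separable.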

Establishing a complete polynomial invariant for phylogenetic networks opens the door to several interesting opportunities for exploration, such as new ways to define metrics on networks, fast methods to distinguish networks, and possibly ways to extract important features of a network by examining this polynomial. To this end, it may be helpful to understand whether a particular polynomial is derived from a network or not (for clearly not all irreducible polynomials give networks).

Furthermore, the computation of p(N) here may be performed reticulation-by-reticulation for some network classes, e.g., orchard networks [12]. That is, suppose that N is an internally labelled phylogenetic network derived from an orchard network and N = N0, N1, …, Nk is a complete cherry-reduction sequence of N (that is, Nk is a single node). We can perform an assignment of polynomials to all leaves in every intermediate IMLN Nj. Finally, p(N) is the polynomial assigned to the single node in Nk. Start by assigning p(u) = φ(u), for every leaf u in N0. Then, let {v1, v2} be the two leaves involved in the cherry-reduction to move from Nj to Nj+1 and let p(vi) be the polynomial assigned to vi in Nj for i ∈ {1, 2}. Then,

  • if {v1, v2} is a cherry, assign to the resulting leaf in Nj+1 the polynomial y + p(v1)p(v2).
  • if {v1, v2} is a reticulated cherry (with v2 the child of the reticulation labelled by λi), assign to the resulting leaf in Nj+1 coming from the parent of v1 the polynomial y + λi p(v1)p(v2), and to the resulting leaf in Nj+1 coming from the parent of v2, the polynomial λi p(v2).
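The two rules above can be replayed mechanically. The sketch below is a minimal illustration under stated assumptions: the reduction sequence is supplied by hand as a list of steps (finding such a sequence is not addressed), the toy network, node names and step encoding are hypothetical, and the result is compared with a direct recursive computation on the same network.

```python
from collections import Counter

def var(name):
    """Polynomial consisting of the single variable `name`."""
    return Counter({(name,): 1})

def mul(p, q):
    """Product of polynomials stored as {sorted tuple of variables: coefficient}."""
    out = Counter()
    for a, ca in p.items():
        for b, cb in q.items():
            out[tuple(sorted(a + b))] += ca * cb
    return out

def replay(leaf_polys, steps):
    """Replay a cherry-reduction sequence, updating the polynomials held by the leaves."""
    polys = dict(leaf_polys)
    for step in steps:
        if step[0] == "cherry":
            # ("cherry", keep, drop): reduce `drop`; `keep` receives y + p(keep) p(drop).
            _, keep, drop = step
            polys[keep] = var("y") + mul(polys[keep], polys.pop(drop))
        else:
            # ("cut", v1, v2, lam): v2 is the child of the reticulation labelled lam.
            _, v1, v2, lam = step
            below = mul(var(lam), polys[v2])
            polys[v1] = var("y") + mul(polys[v1], below)
            polys[v2] = below
    (result,) = polys.values()  # exactly one leaf must remain
    return result

def p_net(net, u):
    """Direct recursive computation of the polynomial at node u."""
    kind = net[u][0]
    if kind == "leaf":
        return var(net[u][1])
    if kind == "ret":
        return mul(var(net[u][1]), p_net(net, net[u][2]))
    return var("y") + mul(p_net(net, net[u][1]), p_net(net, net[u][2]))

# Hypothetical toy network: reticulation r (label l1) has parents t1 and t2.
NET = {
    "rho": ("tree", "t1", "t2"),
    "t1": ("tree", "a", "r"),
    "t2": ("tree", "r", "c"),
    "r": ("ret", "l1", "b"),
    "a": ("leaf", "x1"),
    "b": ("leaf", "x2"),
    "c": ("leaf", "x3"),
}
LEAVES = {"a": var("x1"), "b": var("x2"), "c": var("x3")}
# Cut the reticulated cherry {a, b}, then reduce the cherries {b, c} and {a, b}.
STEPS = [("cut", "a", "b", "l1"), ("cherry", "b", "c"), ("cherry", "a", "b")]

P_REPLAY = replay(LEAVES, STEPS)
P_DIRECT = p_net(NET, "rho")
```

For this toy network both computations give y + (y + λ1x1x2)(y + λ1x2x3); the replay needs only the polynomials currently held by the leaves, not the whole network.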

It would be interesting to investigate more optimisations for general or for specific subclasses of phylogenetic networks.

It would also be interesting to think about ways to reduce the complexity of the polynomial assigned to a network, even at the expense of losing the uniqueness of this assignment. One possibility would be, for instance, to define a polynomial for a phylogenetic network over an IMLT into which the network is transformed, following an approach similar to the one that allows the computation of its extended Newick format [13]. Consider, for example, this: for every reticulation, split it (also copying its label) into two copies, the first copy with one of its parents and its child, and the other copy with the other parent and no children. See two examples of this decomposition in Fig 8 from the internally labelled phylogenetic network N depicted in Fig 1. Clearly, this transformation process is not unique, and different IMLT’s can be obtained from the same network; but different networks result in disjoint sets of IMLT’s. Notice that this process can be understood as a way to prune irrelevant subtrees of the IMLT U(N) defined in the Subsection Folding and unfolding, with the goal of keeping enough information to encode the network. Roughly speaking, to recover the network from these IMLT’s one should only merge every pair of nodes labelled by the same λi. Applying the definition of the polynomial p to these IMLT’s, we obtain, for the example depicted in Fig 8(a), the polynomial where (some of) the terms are notably simpler than in the original.

Fig 8. Subtrees of U(N).

Let N be the internally labelled phylogenetic network depicted in Fig 1. The figure depicts two (IMLT) subtrees of U(N).

https://doi.org/10.1371/journal.pone.0268181.g008

There are potentially many further questions arising that relate to phylogenetic networks more broadly. For instance, do embedded spanning trees characterize general internally labelled phylogenetic networks? That is, if we keep the labels on elementary nodes (which come from reticulation nodes) of the embedded spanning trees, can we extend the results in [11] from tree-child networks to more general networks? Which classes of phylogenetic networks are separable? Do FU-stable networks require all the labels of the polynomials λ1, …, λr or can these be replaced by a single variable λ? And, overall, is there a complete characterization in topological terms of the phylogenetic networks that are characterized by the polynomial introduced in this article?

With all this, we hope that the results here will stimulate these and many other investigations.

Acknowledgments

The authors thank Francesc Rosselló for his helpful comments and suggestions. All the authors thank the anonymous reviewers for their detailed comments on an earlier version of this manuscript.

References

  1. Liu P. A tree distinguishing polynomial. Discrete Applied Mathematics. 2021;288:1–8.
  2. Liu P., Biller P., Gould M., Colijn C. Analyzing Phylogenetic Trees with a Tree Lattice Coordinate System and a Graph Polynomial. Systematic Biology. 2022;syac008.
  3. Janssen R., Liu P. Comparing the topology of phylogenetic network generators. Journal of Bioinformatics and Computational Biology. 2021;19(6):2140012. pmid:34895114
  4. Cavender J., Felsenstein J. Invariants of phylogenies in a simple case with discrete states. Journal of Classification. 1987;4(1):57–71.
  5. Cardona G., Rosselló F., Valiente G. Comparison of Tree-Child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009;6(4):552–569. pmid:19875855
  6. Bai A., Erdős P., Semple C., Steel M. Defining phylogenetic networks using ancestral profiles. Mathematical Biosciences. 2021;332:108537. pmid:33453221
  7. Willson S. Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010;8(3):785–796.
  8. Van Iersel L., Moulton V. Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology. 2014;68(7):1707–1729. pmid:23680992
  9. Semple C., Toft G. Trinets encode orchard phylogenetic networks. Journal of Mathematical Biology. 2021;83(3):1–20. pmid:34420100
  10. Huber K., Moulton V., Steel M., Wu T. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology. 2016;73(6-7):1761–1780. pmid:27107869
  11. Francis A., Moulton V. Identifiability of tree-child phylogenetic networks under a probabilistic recombination-mutation model of evolution. Journal of Theoretical Biology. 2018;446:160–167. pmid:29548737
  12. Erdős P., Semple C., Steel M. A class of phylogenetic networks reconstructable from ancestral profiles. Mathematical Biosciences. 2019;313:33–40. pmid:31077680
  13. Cardona G., Rosselló F., Valiente G. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics. 2008;9(1):1–8. pmid:19077301