A polynomial invariant for a new class of phylogenetic networks

Joan Carles Pons; Tomás M. Coronado; Michael Hendriksen; Andrew Francis

doi:10.1371/journal.pone.0268181

Abstract

Invariants for complicated objects such as those arising in phylogenetics, whether they are invariants as matrices, polynomials, or other mathematical structures, are important tools for distinguishing and working with such objects. In this paper, we generalize a complete polynomial invariant on trees to a class of phylogenetic networks called separable networks, which includes orchard networks. Networks are becoming increasingly important for their ability to represent reticulation events, such as hybridization, in evolutionary history. We provide a function from the space of internally multi-labelled phylogenetic networks, a more generic graph structure than phylogenetic networks where the reticulations are also labelled, to a polynomial ring. We prove that the separability condition allows us to characterize, via the polynomial, the phylogenetic networks with the same number of leaves and same number of reticulations by considering their internally labelled versions. While the invariant for trees is a polynomial in ℤ[x₁, …, x_n, y], where n is the number of leaves, the invariant for internally multi-labelled phylogenetic networks is an element of ℤ[x₁, …, x_n, λ₁, …, λ_r, y], where r is the number of reticulations in the network. When the networks are considered without leaf labels the number of variables reduces to r + 2.

Citation: Pons JC, Coronado TM, Hendriksen M, Francis A (2022) A polynomial invariant for a new class of phylogenetic networks. PLoS ONE 17(5): e0268181. https://doi.org/10.1371/journal.pone.0268181

Editor: Akbar Ali, University of Hail, SAUDI ARABIA

Received: January 1, 2022; Accepted: April 24, 2022; Published: May 20, 2022

Copyright: © 2022 Pons et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript.

Funding: JCP and TMC were supported by the Ministerio de Ciencia e Innovación (MCI), the Agencia Estatal de Investigación (AEI) and the European Regional Development Funds (ERDF); through project PGC2018-096956-B-C43 (FEDER/MICINN/AEI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

A complete polynomial invariant able to uniquely distinguish between rooted trees has been recently introduced in [1]. Motivated to analyze and compare tree shapes in a phylogenetic context, this polynomial (to which we will refer as the Liu polynomial) has been used both to define a similarity measure on rooted tree shapes and to estimate parameters and models via its coefficients [2]. Moreover, its generalization from trees to networks (by analyzing the set of embedded spanning trees in the network) has also been used to study the properties of randomly generated networks [3].

We note that the word “invariant” is used here in its traditional sense, and not the one used in algebraic geometry approaches to phylogenetics, in which phylogenetic invariants for an evolutionary model along a tree are the polynomials which vanish on the expected frequencies of base patterns at the leaves [4]. Throughout this article, a (complete) invariant of a set A is a function f: A → B with the property that x ∼_A y if and only if f(x) ∼_B f(y), where B is some other set (such as the set of polynomials), and ∼_A and ∼_B are equivalence relations in the respective sets.

A multitude of (non-polynomial) invariants have been defined for specific subclasses of phylogenetic networks. To name just a few, the μ-vectors which store the number of paths from nodes to leaves characterize (among others) tree-child networks [5] and orchard networks (without stacks) [6]; the set of displayed trees that characterizes regular networks [7]; and the induced trinets (minimal subnetworks induced by triples of leaves) that characterize (among others) level-2 networks [8] and orchard networks [9].

In this paper we show how a polynomial invariant can be defined for rooted phylogenetic networks, generalizing the Liu polynomial invariant for trees. In order to do so, we consider phylogenetic networks and a labelled version of them, called internally labelled phylogenetic networks, where we keep the labels on leaves and also (bijectively) label the reticulations. In fact, internally labelled phylogenetic networks are a subset of a more general set of networks, which we call internally multi-labelled phylogenetic networks, or IMLN’s. On these networks the presence of elementary nodes is allowed, and leaves, reticulation and elementary nodes are all labelled. Then, if we denote by PN the set of all phylogenetic networks (up to isomorphism) and by ILPN the set of all internally labelled phylogenetic networks (up to isomorphism), the map Φ: ILPN → PN that sends each internally labelled phylogenetic network to the phylogenetic network obtained by “forgetting” all the internal labels (on reticulations) is obviously well defined; therefore for each N ∈ PN, Φ⁻¹(N) is the set of all the internally labelled phylogenetic networks that have its same topology; its fiber, in mathematical terms.

The aim of this paper is to define a polynomial p that uniquely characterizes these fibers and, in so doing, also characterizes the phylogenetic networks beneath them. See the diagram below. Since Φ is not injective, the dashed arrows denote maps that are not unique. We will see that, in general, p is not injective, but that it will be so under a suitable topological condition.

This paper is organized as follows. In the Methods section we include the three main graph structures of study: phylogenetic networks, internally labelled phylogenetic networks and internally multi-labelled phylogenetic networks (or IMLN’s). We also define the concept of isomorphism on these structures. The Results section is divided into two main subsections. The first one studies a process that unfolds an IMLN into a tree (an IMLT) and its reverse, folding, that recovers the initial IMLN. The key result of this section is the characterization of an IMLN by an IMLT (Corollary 10). The second subsection is dedicated to the definition and study of an extension of the Liu polynomial on IMLN’s. If N is an IMLN on a set of leaves labelled by X, the assigned polynomial p(N) has |X| + r + 1 variables, where r is the number of reticulations in the network. This subsection is further divided into multiple parts. The first part studies a special type of path (composed only of reticulations or elementary nodes) in IMLN’s, called strong paths. Roughly speaking, these allow us to define an equivalence relation between IMLN’s, and we prove that two IMLN’s share the polynomial if, and only if, they are equivalent (Theorem 15). The second part gives a sufficient condition on the space of phylogenetic networks (which we call separability) for the derived internally labelled phylogenetic networks to be completely characterized by the polynomial. The multiple lemmas proved in this part allow us to prove the main result (Theorem 22) in the third part; that is, the polynomial is a complete invariant in the set of internally labelled separable phylogenetic networks up to isomorphism. The fourth part of this subsection proves that orchard networks are separable, and so are characterized by the polynomial introduced in this paper (Theorem 23). 
Finally, in the last part, we show how the results obtained can be applied to an unlabelled version of networks, in which the labelling of the leaves is forgotten, reducing the polynomial to r + 2 variables (Proposition 24). The paper finishes with a Discussion and Conclusion section.

Methods

In this section we introduce the mathematical notation that will be used in the rest of the paper.

Throughout this paper, X will denote a non-empty finite set (of taxa). Commonly, we will use X = {x₁, …, x_n}, and we will allow ourselves to see each member of X as an irreducible polynomial in ; i.e., we will consider the labels of the leaves in our networks to be polynomials of the form x_i for i ∈ {1, …, n}.

Definition 1. A rooted binary phylogenetic network N = (V, E) on X, or simply a phylogenetic network on X, is a rooted directed acyclic graph with no parallel arcs satisfying the following conditions:

any node with out-degree zero (a leaf) has in-degree one, and the set of nodes with out-degree zero, denoted by L(N), is identified with X via a bijection φ: L(N) → X;
the root is the only node with in-degree zero, and has out-degree two;
any other node has either in-degree one and out-degree two (a tree node), or in-degree two and out-degree one (called a reticulation node).

We shall consider the leaves and root to be tree nodes.

Definition 2. A rooted binary internally multi-labelled phylogenetic network N = (V, E) on X, or simply an IMLN on X, is a rooted directed acyclic graph with no parallel arcs satisfying the following conditions:

any node with out-degree zero (a leaf) has in-degree one, and the set of nodes with out-degree zero, denoted by L(N), is identified with X via a surjection φ: L(N) → X;
the root is the only node with in-degree zero, and it can have out-degree one (in which case we shall say it is an elementary node) or two (a tree node);
any other node has either in-degree one and out-degree two (again, a tree node), or in-degree two and out-degree one (called a reticulation node), or in-degree one and out-degree one (again, an elementary node);
if R(N) denotes the set of reticulation nodes and E(N) the set of elementary nodes of N, then there exists ℓ: R(N) ∪ E(N) → {λ₁, …, λ_r} a labelling function such that its restriction to R(N) is injective and if u ∈ R(N) and v ∈ E(N), ℓ(u) ≠ ℓ(v).
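The degree conditions in Definition 2 translate directly into a small check. The following Python sketch classifies the nodes of a rooted directed graph given as a list of arcs. The function name `classify_nodes` and the example network are illustrative assumptions, not part of the paper; leaves are reported separately here, and neither acyclicity nor the labelling conditions are verified.

```python
from collections import defaultdict

def classify_nodes(arcs):
    """Classify every node by (in-degree, out-degree), following Definition 2."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
        nodes.update((u, v))
    kinds = {}
    for w in nodes:
        degrees = (indeg[w], outdeg[w])
        if degrees == (1, 0):
            kinds[w] = "leaf"
        elif degrees in ((0, 2), (1, 2)):
            kinds[w] = "tree"            # root of out-degree two, or tree node
        elif degrees == (2, 1):
            kinds[w] = "reticulation"
        elif degrees in ((0, 1), (1, 1)):
            kinds[w] = "elementary"      # root of out-degree one, or elementary node
        else:
            raise ValueError(f"node {w!r} has degrees {degrees}, not allowed in an IMLN")
    return kinds

# Hypothetical network with a single reticulation h above the leaf x3.
arcs = [("root", "a"), ("root", "b"), ("a", "h"), ("a", "x1"),
        ("b", "h"), ("b", "x2"), ("h", "x3")]
kinds = classify_nodes(arcs)
```

In this example `h` is the only node with in-degree two, so it is the only reticulation.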

Definition 3. A rooted binary internally multi-labelled phylogenetic tree T = (V, E) on X, or simply IMLT on X, is an IMLN without reticulation nodes.

We will consider the labels λ₁, …, λ_r to be irreducible polynomials in . Notice that Definition 2 implies that IMLN’s are a recursive structure in the following sense: given any IMLN N, for any u ∈ V(N), the subgraph rooted at u is still an IMLN. This is not the case in general for phylogenetic networks.

In the case that an IMLN (with the root of out-degree two) does not have elementary nodes and the labelling on the leaves is a bijection, by definition, it becomes a phylogenetic network if the labelling ℓ on reticulations is suppressed. Also, if we consider a phylogenetic network and we add a labelling bijection ℓ: R(N) → {λ₁, …, λ_r}, it becomes an IMLN. In order to reflect this possibility, we introduce the following definition.

Definition 4. An internally labelled phylogenetic network N on X is an IMLN on X without elementary nodes and where the maps φ: L(N) → X and ℓ: R(N) → {λ₁, …, λ_r} are bijections.

In order to formally define the concept of isomorphism between a pair of phylogenetic networks or between a pair of IMLN’s, we consider the alternative notation, (V, E, φ) and (V, E, φ, ℓ), to reflect the labelling functions, respectively.

Definition 5. Two phylogenetic networks N₁ = (V₁, E₁, φ₁) and N₂ = (V₂, E₂, φ₂) on X are isomorphic if there exists a bijection f: V₁ → V₂ such that φ₁(x) = φ₂(f(x)) for all x ∈ L(N₁), and (u, v) ∈ E₁ if and only if (f(u), f(v)) ∈ E₂.

Definition 6. Two IMLN’s N₁ = (V₁, E₁, φ₁, ℓ₁) and N₂ = (V₂, E₂, φ₂, ℓ₂) on X are isomorphic if there exists a bijection f: V₁ → V₂ such that φ₁(x) = φ₂(f(x)) for all x ∈ L(N₁), ℓ₁(x) = ℓ₂(f(x)) for all x ∈ R(N₁) ∪ E(N₁), and (u, v) ∈ E₁ if and only if (f(u), f(v)) ∈ E₂.

That is, a graph isomorphism that preserves the labels of both the reticulation and elementary nodes.
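For very small networks, Definition 6 can be tested by brute force. The sketch below tries every bijection between the node sets and keeps one that preserves labels and arcs. The tuple layout `(nodes, arcs, labels)` and the name `isomorphic` are assumptions made for illustration; leaf labels and internal labels share one dictionary, and the search is factorial in the number of nodes, so it is only meant for toy examples.

```python
from itertools import permutations

def isomorphic(net1, net2):
    """Brute-force test of Definition 6 on networks given as (nodes, arcs, labels)."""
    nodes1, arcs1, labels1 = net1
    nodes2, arcs2, labels2 = net2
    if len(nodes1) != len(nodes2) or len(arcs1) != len(arcs2):
        return False
    arcs2 = set(arcs2)
    nodes1 = list(nodes1)
    for image in permutations(nodes2):
        f = dict(zip(nodes1, image))
        # Labels must agree on every node (unlabelled nodes compare as None).
        labels_ok = all(labels1.get(v) == labels2.get(f[v]) for v in nodes1)
        # With equally many arcs and f injective, this is an "if and only if".
        arcs_ok = all((f[u], f[v]) in arcs2 for u, v in arcs1)
        if labels_ok and arcs_ok:
            return True
    return False

nodes = ["r", "a", "b", "h", "l1", "l2", "l3"]
arcs = [("r", "a"), ("r", "b"), ("a", "h"), ("a", "l1"),
        ("b", "h"), ("b", "l2"), ("h", "l3")]
# "lam1" stands for the reticulation label λ1.
n1 = (nodes, arcs, {"l1": "x1", "l2": "x2", "l3": "x3", "h": "lam1"})
n2 = (nodes, arcs, {"l1": "x2", "l2": "x1", "l3": "x3", "h": "lam1"})
n3 = (nodes, arcs, {"l1": "x3", "l2": "x2", "l3": "x1", "h": "lam1"})
same_12 = isomorphic(n1, n2)   # swapping the two sides of the root
same_13 = isomorphic(n1, n3)   # different leaf below the reticulation
```

Here `n1` and `n2` differ only by exchanging the two sides of the root, whereas `n3` places a different leaf below the reticulation.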

Results

Folding and unfolding

Following [10], a phylogenetic network can be “unfolded” in a specific manner to obtain a multi-labelled tree, that is a particular IMLT without elementary nodes in terms of the previous definitions. Moreover, in some cases, this process can be reverted, and the multi-labelled tree can be “folded” recovering the initial network. A phylogenetic network cannot in general be characterized by a multi-labelled tree, and this correspondence is valid only for the subclass of FU-stable phylogenetic networks [10].

In this subsection, however, we prove that an internally labelled phylogenetic network can be uniquely characterized by an IMLT obtained by a sequence of “unfoldings” on its reticulation nodes. Roughly speaking, considering the reticulations of an IMLN in a specific order, it is possible to sequentially duplicate the subnetwork descending from these nodes until an IMLT is obtained.

Let N be a (generic) IMLN, and R(N) the set of its reticulation nodes. The relation of being a descendant of another node induces a partial order over R(N), which we will denote by ≤_R. That is, for any two nodes u, v ∈ R(N), u ≤_R v if, and only if, there exists a directed path from v to u. Let R_min(N) be the set of the minimal elements of R(N) under this order, i.e. reticulation nodes such that none of their descendants are also reticulation nodes.
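As a minimal sketch of this definition, R_min(N) can be computed by collecting the reticulations and discarding those that have another reticulation among their descendants. The adjacency-dictionary representation and the helper names below are assumptions made for illustration.

```python
def reticulations(children):
    """Nodes with in-degree two in a network given as {node: [children]}."""
    indeg = {}
    for kids in children.values():
        for v in kids:
            indeg[v] = indeg.get(v, 0) + 1
    return {v for v, d in indeg.items() if d == 2}

def descendants(children, u):
    """All nodes reachable from u by a non-empty directed path."""
    seen, stack = set(), list(children.get(u, []))
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(children.get(w, []))
    return seen

def r_min(children):
    """Reticulations that are minimal under the descendant order."""
    ret = reticulations(children)
    return {u for u in ret if not (descendants(children, u) & ret)}

# Hypothetical network with two reticulations, h2 being a descendant of h1.
children = {
    "root": ["a", "b"], "a": ["c", "h1"], "b": ["h1", "x4"],
    "c": ["x1", "h2"], "h1": ["e"], "e": ["h2", "x2"], "h2": ["x3"],
}
minimal = r_min(children)
```

In this example only `h2` is minimal, since there is a directed path from `h1` to `h2`.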

Lemma 1. Let N be an IMLN and u ∈ R_min(N). Then the graph rooted at u is an IMLT.

Proof. If u ∈ R_min(N), then there is no path in N from u to another reticulation. This means that there are no reticulations in the graph rooted at u; and therefore it is an IMLT.

Let N be an IMLN, and consider u ∈ R_min(N) (so that u is labelled by an element in {λ₁, …, λ_r}). Let v₁, v₂ be its parents, noting that v₁ ≠ v₂ due to the fact that parallel arcs are excluded. Define U(N, u) to be the unfolded IMLN of N at u, obtained by the following algorithm:

delete edges (v₁, u) and (v₂, u);
duplicate N(u), the IMLT rooted at u, including all its labels;
add an edge from v₁ to one of the resulting copies of u, and an edge from v₂ to the remaining copy of u.
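The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the network is an adjacency dictionary, `labels` holds both leaf and internal labels, copies are named by pairing the original node with a fresh counter value, and u is assumed to lie in R_min(N) so that the subgraph below it is a tree.

```python
import itertools

_fresh = itertools.count(1)  # source of fresh names for copied nodes

def unfold(children, labels, u):
    """Return (children, labels) of U(N, u) for a minimal reticulation u."""
    parents = [p for p, kids in children.items() if u in kids]
    if len(parents) != 2:
        raise ValueError("u must be a reticulation node")
    _, v2 = parents
    new_children = {p: list(kids) for p, kids in children.items()}
    new_labels = dict(labels)

    def copy_subtree(w):
        # Step 2: duplicate the IMLT rooted at w, including all labels.
        w_copy = (w, next(_fresh))
        if w in labels:
            new_labels[w_copy] = labels[w]
        kids = children.get(w, [])
        if kids:
            new_children[w_copy] = [copy_subtree(k) for k in kids]
        return w_copy

    u_copy = copy_subtree(u)
    # Steps 1 and 3: v1 keeps the original u, v2 is redirected to the copy.
    new_children[v2] = [u_copy if k == u else k for k in new_children[v2]]
    return new_children, new_labels

# Hypothetical network with one reticulation h, labelled "lam1" (for λ1).
children = {"root": ["a", "b"], "a": ["h", "x1"], "b": ["h", "x2"], "h": ["x3"]}
labels = {"x1": "x1", "x2": "x2", "x3": "x3", "h": "lam1"}
new_children, new_labels = unfold(children, labels, "h")
```

After the call, `h` and its copy are both elementary nodes carrying the same label, and the leaf labelled x3 appears twice.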

Remark 1. Notice that the process of unfolding preserves paths in the following sense: if N′ is obtained from N by unfolding N at some node u, then any path between two nodes in N′ comes from an existing path in N; and vice versa, any path between two nodes in N corresponds to a path in N′. Notice, however, that a path in N might very well correspond to two different paths in N′, and so this correspondence is not injective.

Corollary 2. Let N be an IMLN, and u ∈ R_min(N). Then U(N, u) is an IMLN.

Let N be an IMLN. We say that a sequence (u₁, …, u_k) of nodes in R(N) is compatible if the associated sequence of IMLN’s (N_{u₁}, …, N_{u_k}) is such that u_{i+1} ∈ R_min(N_{u_i}) for each i ∈ {1, …, k − 1} and u₁ ∈ R_min(N), where N_{u₁} = U(N, u₁) and N_{u_{i+1}} = U(N_{u_i}, u_{i+1}). Then, if (u₁, …, u_k) is compatible, for each i ∈ {1, …, k − 1} there is no path from u_i to u_j when j > i; i.e., the sequence is non-decreasing under the partial order ≤_R induced by the network over R(N).
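One way to obtain a sequence containing every reticulation in which no node has a path to a later one is a post-order depth-first traversal, which lists each reticulation only after all reticulations below it. The sketch below uses this standard technique as a stand-in; it is not taken from the paper, and the name `compatible_sequence` and the example network are assumptions.

```python
def compatible_sequence(children):
    """List all reticulations so that descendants come before ancestors."""
    all_kids = [k for kids in children.values() for k in kids]
    ret = {k for k in all_kids if all_kids.count(k) == 2}
    (root,) = set(children) - set(all_kids)   # the unique node without parents
    order, seen = [], set()

    def visit(w):
        if w in seen:
            return
        seen.add(w)
        for k in children.get(w, []):
            visit(k)
        if w in ret:
            order.append(w)   # appended only after everything below w

    visit(root)
    return order

# Hypothetical network in which the reticulation h2 descends from h1.
children = {
    "root": ["a", "b"], "a": ["c", "h1"], "b": ["h1", "x4"],
    "c": ["x1", "h2"], "h1": ["e"], "e": ["h2", "x2"], "h2": ["x3"],
}
sequence = compatible_sequence(children)
```

For this network the only admissible order is `h2` followed by `h1`, because `h1` is not minimal until `h2` has been unfolded.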

Lemma 3. Let N be an IMLN and u₁, u₂ ∈ R_min(N). Then, U(U(N, u₁), u₂) and U(U(N, u₂), u₁) are isomorphic.

Proof. It is straightforward by Lemma 1 and the steps of the unfolding algorithm. If u₁ ∈ R_min(N), then u₂ ∈ R_min(U(N, u₁)); otherwise there would be a reticulation node u′ in R(U(N, u₁)) and a path from u₂ to u′ in U(N, u₁), and so in N, which is a contradiction. Then, by Lemma 1, the graph rooted at u₂ in U(N, u₁) is an IMLT. Since u₂ is not a node in any of the copies of the IMLT rooted at u₁ in the construction of U(N, u₁), there is no intersection between the copies from u₁ and the copies from u₂. Since the same argument holds if we start by u₂, the result is achieved.

Lemma 3 can be extended following the same arguments for any set of reticulations {u₁, …, u_k} if all of them are in R_min(N), since there will be no intersection between the created copies of IMLT’s.

Let N be an IMLN. We define an equivalence relation ≡ in the set of compatible sequences of elements of R(N) as follows:

(u₁, …, u_k) ≡ (v₁, …, v_k) if and only if {u₁, …, u_k} = {v₁, …, v_k}.

That is, we say that two compatible sequences are equivalent if they are composed of the same set of nodes.

An ≤_R-chain in an IMLN N is a chain under the ≤_R order defined on R(N) (or a subset of it); that is, a subset of reticulations {u₁, …, u_s} such that u₁ ≤_R ⋯ ≤_R u_s. An ≤_R-antichain in an IMLN N is an antichain under the ≤_R order; i.e., a subset of reticulations of N which are pairwise incomparable (u_i ≰_R u_j and u_j ≰_R u_i if u_i ≠ u_j) under the ≤_R order.

In the next lemma we prove that if we consider an ≤_R-chain in an IMLN N then there is a single way to traverse these nodes in a compatible sequence, from bottom to top. On the other hand, if we consider an ≤_R-antichain, then every way to traverse these nodes is valid to form a compatible sequence.

Lemma 4. Let N be an IMLN and S = {v₁, …, v_r} ⊆ R(N). Then

(a). If v₁ ≤_R v₂ ≤_R ⋯ ≤_R v_r is an ≤_R-chain, then v_i must precede v_j in every compatible sequence containing S if i < j.
(b). If S is an ≤_R-antichain, then every possible ordering of its nodes produces a compatible sequence composed by S.

Proof. We first prove (a). If v₁ ≤_R v₂ ≤_R ⋯ ≤_R v_r is an ≤_R-chain, then there is a path from v_j to v_i if i < j. Therefore if there exists a path from v_i to v_j, it produces a cycle in N; but this is not possible because N is an IMLN, and so in particular it is acyclic. This means that there is no path from v_i to v_j when i < j. Consequently, if i < j, v_i must precede v_j in every compatible sequence containing S.

Now we prove (b). Let v and v′ be two nodes in S. If v precedes v′ in a sequence there cannot be a path from v to v′; otherwise v′ ≤_R v. If v′ precedes v in a sequence there cannot be a path from v′ to v; otherwise v ≤_R v′. Since S is an ≤_R-antichain, then both cases derive compatible sequences.

Corollary 5. Let N be an IMLN and (u₁, u₂, …, u_k) ≡ (v₁, v₂, …, v_k) a pair of equivalent compatible sequences of elements of R(N). Let (N_{u₁}, …, N_{u_k}) and (N_{v₁}, …, N_{v_k}) be the sequences of IMLN’s associated to these compatible sequences. Then N_{u_k} and N_{v_k} are isomorphic.

Proof. For k = 1 there is nothing to prove, since u₁ = v₁. Let k = 2. If u₁, u₂ ∈ R_min(N), then (u₁, u₂) and (u₂, u₁) are both compatible sequences and the result follows from Lemma 3. If (u₁, u₂) is a compatible sequence and u₁ ≤_R u₂, then necessarily (v₁, v₂) = (u₁, u₂) (and not (v₁, v₂) = (u₂, u₁)), since u₂ ∉ R_min(N).

The general situation for k ≥ 3 demands a different approach. Let s₁ = (u₁, u₂, …, u_k) and s₂ = (v₁, v₂, …, v_k). Since s₁ ≡ s₂, we have {u₁, u₂, …, u_k} = {v₁, v₂, …, v_k} ⊆ R(N). Let A = {u₁, u₂, …, u_k}. We iteratively apply the following process to prove the result. Let A′ = {u ∈ A: u ∈ R_min(N)}. Note that A′ is not empty because u₁ and v₁ (which could be equal) are in R_min(N). Then, let s₁′ be the sequence obtained from s₁ by moving all the nodes in A′ to the first positions (in such a way that if u_i, u_j ∈ A′ with i < j, then the node u_i appears before u_j in s₁′) and keeping the relative order of the remaining nodes. Note that s₁′ is compatible by construction and s₁′ ≡ s₁. The same process applied to s₂ yields s₂′. Note that the nodes of A′ occupying the first |A′| positions in both s₁′ and s₂′ are exactly the same, and they form an ≤_R-antichain; but these nodes may not appear in the same order in both sequences.

Let u* be the last (rightmost) node in s₁′ such that u* ∈ A′. Now let s₂″ be the compatible sequence equivalent to s₂′ obtained by leaving all positions unchanged except for the node u*, which is moved to become the last node of s₂″ belonging to A′. This ensures that the last node of the first |A′| positions in both s₁′ and s₂″ is the same, u*. Note that it could be that u* = u_k = v_k (when A = A′). By Lemma 4(b) and Lemma 3, the IMLN N_{u*} obtained by sequentially unfolding at the nodes in s₁′ until u* is reached is isomorphic to the IMLN obtained by sequentially unfolding at the (same) nodes in s₂″ until u* is reached. Then, the same process can be repeated by considering the new equivalent compatible sequences obtained from s₁′ and s₂″ by suppressing the first |A′| positions and starting with the IMLN N_{u*}.

Therefore, given a compatible sequence (u₁, u₂, …, u_r) of all the elements of R(N), and its associated sequence (N_{u₁}, …, N_{u_r}), we define the unfolding of an IMLN N, denoted by U(N), by means of the equation U(N) = N_{u_r}. We may refer to such a sequence as a sequence of unfoldings. See Fig 1 for an example of a sequence of unfoldings for an IMLN; in fact for an internally labelled phylogenetic network.

Fig 1. The unfolding of an IMLN.

Top two figures: A phylogenetic network N on {x₁, x₂, x₃, x₄}, and the IMLN obtained by considering the labelling function over R(N) given by ℓ(u_i) = λ_i for i ∈ {1, 2, 3}. Notice that N is an internally labelled phylogenetic network. The three figures below are the sequence of unfoldings associated to the compatible sequence of reticulations (u₂, u₃, u₁). Following the introduced terminology, N_{u₂} = U(N, u₂), N_{u₃} = U(N_{u₂}, u₃) and N_{u₁} = U(N_{u₃}, u₁) = U(N). Note that u₂, u₃ ∈ R_min(N) and u₁ ∉ R_min(N), since there is a path from u₁ to u₂ in N.

https://doi.org/10.1371/journal.pone.0268181.g001

Now, we are interested in the “reverse” process to unfolding. Roughly speaking, we are interested in formally defining a way to “fold” an IMLT to recover the IMLN from which it comes. We can, given an IMLN N, also define a partial order over the set of elementary nodes E(N) by saying that for any two u, v ∈ E(N), u ≤_E v if and only if there exist u′, v′ ∈ E(N) with ℓ(u) = ℓ(u′) and ℓ(v) = ℓ(v′) and a directed path from v to u. We call the set of elementary nodes that are maximal under this order E_max(N).

Lemma 6. Let (N_{u₁}, …, N_{u_r}) be a sequence of unfoldings of an internally labelled phylogenetic network N. For any N_{u_i} in it and for every u ∈ E_max(N_{u_i}), there exists exactly one other v ∈ E_max(N_{u_i}) such that ℓ(u) = ℓ(v) and the IMLT’s N_{u_i}(u) and N_{u_i}(v) are isomorphic.

Proof. Let N_{u_i} be one of the IMLN’s in the sequence of unfoldings. Let N′ = N_{u_{i−1}}, with N′ = N when i = 1. By construction, u_i ∈ R_min(N′).

Since N_{u_i} = U(N′, u_i), the IMLT N′(u_i) is duplicated; if u and v denote the two resulting copies of u_i in N_{u_i}, we have ℓ(u) = ℓ(v) and N_{u_i}(u) ≅ N_{u_i}(v). Moreover, u, v ∈ E_max(N_{u_i}); otherwise, if u (or v) is not maximal under the order ≤_E in N_{u_i}, it means that there are w, w′ ∈ E(N_{u_i}) with ℓ(w) = ℓ(w′) such that there is a path from w to u. By Remark 1 this path is preserved in every N_{u_j} with j < i. Since the labelling function ℓ is injective over reticulation nodes and N has no elementary nodes, this means that the pair w, w′ corresponds to a reticulation node in some N_{u_j} with j < i; equivalently, this is a reticulation node equal to some u_j with j < i. This contradicts the fact that the sequence (u₁, u₂, …, u_r) is compatible. If we consider a maximal element in N_{u_i} different from the two coming from the duplication of u_i in N′, the previous argument can be reproduced similarly. These pairs of maximal elements are preserved as maximal in every N_{u_j} with j < i right up until the unfolding on this reticulation is produced. This proves that the IMLT’s rooted on the corresponding copies of it are also preserved until N_{u_i} is reached.

In particular, in the proof of Lemma 6, and following the same notation, we show that the node u_i is maximal under the ≤_E order in N_{u_i}. Notice also that this could be false if elementary nodes are allowed in the initial IMLN N.

Proposition 7. Let (N_{u₁}, …, N_{u_r}) be a sequence of unfoldings of an internally labelled phylogenetic network N. For any N_{u_i} in it, let v, w ∈ E(N_{u_i}). Then, ≤_E is such that v ≤_E w if and only if v ≤_R w in R(N).

Proof. We begin with the “if” direction. If v, w are such that v ≤_R w when seen as reticulation nodes in N, there exists at least one path from w to v. Now, since w ∈ E(N_{u_i}), by Lemma 6, there exists w′ ∈ E(N_{u_i}) such that ℓ(w) = ℓ(w′) and N_{u_i}(w) ≅ N_{u_i}(w′), via an isomorphism f. Then, since v ∈ E(N_{u_i}) by hypothesis and, by Remark 1, the path from w to v in N is preserved in N_{u_i}, there exist paths from w to v and from w′ to f(v) in N_{u_i}, such that ℓ(v) = ℓ(f(v)) and therefore v ≤_E w in N_{u_i}.

In the opposite direction, suppose that v, w are such that v ≤_E w. Again by Lemma 6, in N_{u_i} there exists w′ such that ℓ(w) = ℓ(w′) and N_{u_i}(w) ≅ N_{u_i}(w′) via an isomorphism f. Since v ≤_E w, there exists a path from w to v and a path from w′ to f(v) and ℓ(v) = ℓ(f(v)). Now, since there are no elementary nodes in N, there must exist j < i such that in N_{u_j} (it could be that N_{u_j} = N), the nodes v and w are reticulations. By Remark 1, this implies that there would exist a path from w to v in N_{u_j}, and therefore v ≤_R w in N_{u_j}, and so in N. This concludes the proof.

Given N an IMLN, u ∈ R_min(N) and U(N, u), we would like to consider N to be the result of a folding operation over U(N, u): N = F(U(N, u), u), for some suitable F. For any unfolding sequence (N_{u₁}, …, N_{u_r}), we say that each of its members is a (phylogenetic) pseudo-network (in particular, they are IMLN’s). Equivalently, we can define a pseudo-network recursively as follows: let N be an IMLN; it is a pseudo-network if it satisfies the following three conditions:

(i). no reticulation node descends from an elementary node;
(ii). for any u ∈ E_max(N) there exists v ∈ E_max(N) such that ℓ(u) = ℓ(v) and N(u) = N(v) as IMLT’s;
(iii). for any u ∈ E_max(N), the IMLN obtained by the process of
1. considering the node v ∈ E_max(N) such that ℓ(v) = ℓ(u) and N(u) = N(v), and the parent of v, say v⁽¹⁾;
2. deleting N(v), as well as the edge (v⁽¹⁾, v);
3. adding the arc (v⁽¹⁾, u),
is also a pseudo-network.

The IMLN obtained by the process described in (iii) is denoted by F(N, u), and called the folded IMLN of N at u. Notice that if u, v ∈ E_max(N) are such that ℓ(u) = ℓ(v), then F(N, u) = F(N, v).
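The folding step described in (iii) can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes that u is a maximal elementary node, finds its twin v by comparing a canonical form of the labelled subtrees, and does not check maximality or the remaining pseudo-network conditions. The names `canon` and `fold` are hypothetical.

```python
def canon(children, labels, w):
    """Canonical form of the labelled tree rooted at w (child order ignored)."""
    kids = sorted(canon(children, labels, k) for k in children.get(w, []))
    return (labels.get(w, ""), tuple(kids))

def fold(children, labels, u):
    """Return (children, labels) of F(N, u) for an elementary node u."""
    target = canon(children, labels, u)
    twins = [w for w in labels
             if w != u and len(children.get(w, [])) == 1
             and canon(children, labels, w) == target]
    if len(twins) != 1:
        raise ValueError("expected exactly one twin of u")
    v = twins[0]
    # Collect the nodes of N(v), which are deleted together with (parent, v).
    doomed, stack = set(), [v]
    while stack:
        w = stack.pop()
        doomed.add(w)
        stack.extend(children.get(w, []))
    # The parent of v now points to u, which becomes a reticulation again.
    new_children = {p: [u if k == v else k for k in kids]
                    for p, kids in children.items() if p not in doomed}
    new_labels = {w: lab for w, lab in labels.items() if w not in doomed}
    return new_children, new_labels

# Hypothetical pseudo-network: h and h2 are twin elementary nodes labelled "lam1".
children = {"root": ["a", "b"], "a": ["h", "x1"], "b": ["h2", "x2"],
            "h": ["x3"], "h2": ["x3b"]}
labels = {"x1": "x1", "x2": "x2", "x3": "x3", "x3b": "x3",
          "h": "lam1", "h2": "lam1"}
folded_children, folded_labels = fold(children, labels, "h")
```

Folding at `h` removes the twin subtree rooted at `h2` and leaves `h` with the two parents `a` and `b`.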

Lemma 8. Let N be a pseudo-network and u ∈ R_min(N). Then, F(U(N, u), u) = N.

Proof. Let N′ = U(N, u). Since u ∈ R_min(N), then N(u) (the tree rooted at u) is an IMLT. Let v₁, v₂ be the parents of u in N. When N(u) is duplicated in the unfolding process, u and a new copy of it, say v, are elementary nodes and the roots of N′(u) and N′(v) respectively, such that N′(u) = N′(v). Moreover, (v₁, u), (v₂, v) are arcs in N′. Since u ∈ E_max(N′) (because u ∈ R_min(N)), by Lemma 6, v is the other node in E_max(N′), such that ℓ(u) = ℓ(v) and N′(u) = N′(v). By definition of the folding process of N′ at u, the IMLT N′(v) and also the arc (v₂, v) are deleted and a new arc (v₂, u) is created. This results in a reticulation node u with parents v₁ and v₂ which is the root of N′(u). Since N(u) = N′(u), then F(N′, u) = N.

Given N an IMLN and (N_{u₁}, …, N_{u_r}) a sequence of unfoldings, by Lemma 8 we have that F(N_{u₁}, u₁) = N and that F(N_{u_{i+1}}, u_{i+1}) = N_{u_i} for each i ∈ {1, …, r − 1}. Therefore, we derive the following result.

Corollary 9. Let N be an internally labelled phylogenetic network and (N_{u₁}, …, N_{u_r}) any sequence of unfoldings. Then F(⋯F(F(N_{u_r}, u_r), u_{r−1})⋯, u₁) = N.

Note that, as with equivalent compatible sequences, there is not a unique way to recover the IMLN N by applying a sequence of foldings.

If N is a pseudo-network we know that it is the product of a sequence of unfoldings performed over an IMLN, N′. We can then rewrite Corollary 9, by defining a function F from the set of pseudo-networks to the set of IMLN’s by F(N) ≔ N′. Hence,

Corollary 10. Let N be an internally labelled phylogenetic network. Then F(U(N)) = N.

This result is the analogue of the concept of stable networks in Section 4 of [10]. The key difference here is that we allow elementary nodes.

A polynomial for internally multi-labelled phylogenetic networks

Given a phylogenetic network N on X, one can obtain a rooted tree by removing one incident arc to each reticulation node. These (sub)trees could contain elementary nodes, and their leaves might be labelled in X (the leaves from N) and other sets different from it (for instance when the single outgoing arc of a reticulation is removed). Those trees become unrooted if the direction of the arcs is suppressed (in particular, the root becomes a degree-two node) and are called embedded spanning trees if their set of leaves is exactly X. Tree-child phylogenetic networks are characterized by their set of embedded spanning trees [11], but general phylogenetic networks are not.

In [3], the Liu polynomial is generalized to phylogenetic networks by their sets of embedded spanning trees. Roughly speaking, the polynomial of the network is the product of the polynomials of the embedded spanning trees (considering trees with multiplicity). Consequently, this extension is a complete invariant for tree-child networks.

There are some natural extensions of the Liu polynomial to IMLN’s that come to mind. The first one, for internally labelled phylogenetic networks, is to completely unfold such a network and, from any elementary node u labelled λ_i, for some i ∈ {1, …, r} and labels λ_i distinguishable from labels x_i, grow an arc to a new node v, label v as λ_i, and finally forget the labelling of u. Thus, the unfolded IMLT becomes a multi-labelled tree over leaves {x₁, …, x_n, λ₁, …, λ_r}. See an example of that decomposition in Fig 2 from the internally labelled phylogenetic network N depicted in Fig 1. By means of Corollary 3.5 in [1], this extension of the polynomial is immediately seen to uniquely characterize an internally labelled phylogenetic network.

Fig 2. A multi-labelled tree derived from an internally labelled phylogenetic network.

Let N be the network depicted in Fig 1. This figure depicts a decomposition of N resulting in a multi-labelled tree.

https://doi.org/10.1371/journal.pone.0268181.g002

We will here deal with a natural extension that reflects the reticulation process in the sheer morphology of the polynomial, rather than in the name of the variables.

Let N be an IMLN. Then, consider

p: V(N) → ℤ[x₁, …, x_n, λ₁, …, λ_r, y]

to be defined recursively as follows. Let u ∈ V(N), then:

if u is a leaf, p(u) = φ(u);
if u is an internal tree node whose two children are v₁, v₂, p(u) = y + p(v₁)p(v₂);
otherwise, i.e. if u has only one child v and its associated label is λ_i = ℓ(u), then p(u) = λ_i p(v).

Then, let ρ_N be the root of N; we define p(N) to be p(ρ_N). Notice that this definition of the polynomial p is given over generic IMLN’s.
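The recursion can be carried out symbolically with a small hand-rolled polynomial type. In the sketch below, a polynomial is a dictionary from monomials (sorted tuples of (variable, exponent) pairs) to integer coefficients. This representation, the helper names, the example network, and the ASCII label `lam1` standing for λ₁ are all assumptions made for illustration.

```python
from collections import Counter

def var(name):
    """The polynomial consisting of a single variable."""
    return {((name, 1),): 1}

def add(p, q):
    out = Counter(p)
    for mono, coeff in q.items():
        out[mono] += coeff
    return {m: c for m, c in out.items() if c != 0}

def mul(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            exps = Counter(dict(m1))
            for v, e in m2:
                exps[v] += e
            out[tuple(sorted(exps.items()))] += c1 * c2
    return dict(out)

def poly(children, labels, u):
    """p(u), following the three cases of the recursive definition."""
    kids = children.get(u, [])
    if not kids:                                    # leaf: p(u) = phi(u)
        return var(labels[u])
    if len(kids) == 2:                              # tree node: y + p(v1) p(v2)
        left = poly(children, labels, kids[0])
        right = poly(children, labels, kids[1])
        return add(var("y"), mul(left, right))
    # reticulation or elementary node: lambda_i * p(v)
    return mul(var(labels[u]), poly(children, labels, kids[0]))

# Hypothetical network with one reticulation h labelled lam1.
children = {"root": ["a", "b"], "a": ["h", "x1"], "b": ["h", "x2"], "h": ["x3"]}
labels = {"x1": "x1", "x2": "x2", "x3": "x3", "h": "lam1"}
p = poly(children, labels, "root")
```

For this network the recursion gives p(N) = y + (y + λ₁x₁x₃)(y + λ₁x₂x₃), a polynomial with five terms in the n + r + 1 = 5 variables x₁, x₂, x₃, λ₁, y. Shared nodes are recomputed on every visit, which is acceptable for small examples; a cache keyed by node would avoid it.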

For example, the polynomial associated to the IMLN represented in Fig 1 is

Proposition 11. Let N be an IMLN. Then, for any u ∈ V(N), p(u) is an irreducible polynomial if and only if u is a tree node.

Proof. If u is not a tree node the polynomial will not be irreducible, since then there would exist v ∈ V(N) as the only descendant of u, and p(u) = ℓ(u)p(v).

It then remains only to see that if u is a tree node, p(u) is irreducible. In this case, either u is a leaf and then p(u) = φ(u) = x_i for some i ∈ {1, …, n} and so irreducible, or u has two children and p(u) = y + Λp(w₁)p(w₂), where Λ is a product of λ_i from λ₁, …, λ_r, and w₁, w₂ are the first descendants from u at each side that are tree nodes (they are possibly equal). Now consider the polynomial p′(u) obtained from p(u) by changing every variable x₁, …, x_n, λ₁, …, λ_r for, say, x₁. Then, it can be seen that p′(u) satisfies Eisenstein’s irreducibility criterion in (which is a unique factorization domain, UFD) applied to the ideal 〈y〉, and so p(u) is irreducible when seen as a polynomial in . But, since y does not divide p(u), then p(u) is also irreducible in .

The next proposition will show that the polynomial is conserved throughout a sequence of unfoldings, and therefore will allow us to compute it over any of its members without distinction. In particular, it can be computed on the unfolding of the network.

Proposition 12. Let N be an IMLN, and (N_{u₁}, …, N_{u_r}) be a sequence of unfoldings. Then, p(N) = p(N_{u₁}) and, for any i ∈ {1, …, r − 1}, p(N_{u_i}) = p(N_{u_{i+1}}).

Proof. Let N′ be an IMLN, and u ∈ R_min(N′). If we are able to show that p(N′) = p(U(N′, u)), then the proposition will hold. Let v⁽¹⁾, v⁽²⁾ be the parents of u. In U(N′, u), each of them will be the parent of at least one elementary node u_x, x ∈ {1, 2}, which will be the root of a copy of the IMLT N′(u), and by construction p(u₁) = p(u₂) = p(u) = p(N′(u)). Now, by the definition of the polynomial, p(v⁽ˣ⁾) will be the same in N′ and in U(N′, u). Therefore, p(N′) = p(U(N′, u)).
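The statement can be spot-checked numerically: evaluating the recursion at integer values on a network and on its unfolding must give the same number. The sketch below does this for a hypothetical one-reticulation network whose unfolding is written out by hand; the evaluation point is arbitrary, `lam1` stands for λ₁, and agreement at a single point is of course only a sanity check, not a proof.

```python
def evaluate(children, labels, u, values):
    """Evaluate p(u) at the point given by `values` (variable name -> integer)."""
    kids = children.get(u, [])
    if not kids:
        return values[labels[u]]
    if len(kids) == 2:
        return (values["y"]
                + evaluate(children, labels, kids[0], values)
                * evaluate(children, labels, kids[1], values))
    return values[labels[u]] * evaluate(children, labels, kids[0], values)

# N: one reticulation h with parents a and b.
net = {"root": ["a", "b"], "a": ["h", "x1"], "b": ["h", "x2"], "h": ["x3"]}
net_labels = {"x1": "x1", "x2": "x2", "x3": "x3", "h": "lam1"}

# U(N, h): the subtree below h is duplicated, one copy per parent.
unf = {"root": ["a", "b"], "a": ["h", "x1"], "b": ["h2", "x2"],
       "h": ["x3"], "h2": ["x3b"]}
unf_labels = {"x1": "x1", "x2": "x2", "x3": "x3", "x3b": "x3",
              "h": "lam1", "h2": "lam1"}

point = {"x1": 2, "x2": 3, "x3": 5, "lam1": 7, "y": 11}
value_net = evaluate(net, net_labels, "root", point)
value_unf = evaluate(unf, unf_labels, "root", point)
```

At this point both evaluations equal 11 + (11 + 70)(11 + 105) = 9407.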

We now introduce two remarks, the first concerning the interpretation of the coefficients and, the second, about the reconstruction of the unfolding of an IMLN from the polynomial if it characterizes the IMLN.

Remark 2. The interpretation of the coefficients of the polynomial p(N) can be extended from Lemma 2.4 in [1] by slightly modifying the definition of primary subtrees to the IMLT T = U(N). Let a primary subtree S of T be a rooted subtree of T such that S shares the same root node with T and any leaf node in T is either a leaf node in S or a descendant of a leaf node in S which does not come from an elementary node.

Then, if we represent p(N) as p(N) = ∑ c(γ, α, β) λ₁^γ₁ ⋯ λ_r^γ_r x₁^α₁ ⋯ x_n^α_n y^β, each one of its coefficients c(γ, α, β) counts the number of primary subtrees of U(N) satisfying that:

γ_i (for i ∈ {1, …, r}) is the number of nodes labelled by λ_i of these subtrees;
α_i (for i ∈ {1, …, n}) is the number of leaf nodes labelled by x_i of these subtrees which are also leaves in U(N);
β is the number of leaf nodes of these subtrees which are internal nodes in U(N).

See Fig 3 for the interpretation of some of the terms of the polynomial p(N) of the IMLN N depicted in Fig 1. Notice that these primary subtrees can then be folded into a sort of “sub-primary networks”.
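As a small worked illustration (a hypothetical IMLT, not the one in Fig 1), let T be the tree whose root has as children the leaf x₃ and an elementary node labelled λ₁, and let the child t of that elementary node be a tree node with leaf children x₁ and x₂. Assuming the rules of the polynomial used in this section, each monomial of p(T) corresponds to exactly one primary subtree:

```latex
\begin{align*}
p(T) &= y + \lambda_1\,(y + x_1 x_2)\,x_3 \\
     &= \underbrace{y}_{\text{root only}}
      + \underbrace{\lambda_1 x_3\, y}_{\text{subtree cut at } t}
      + \underbrace{\lambda_1 x_1 x_2 x_3}_{\text{whole tree}} .
\end{align*}
```

The subtree that would stop at the elementary node itself is excluded by the definition of primary subtree, which is why no monomial x₃y appears.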

Fig 3. Two primary subtrees of U(N).

Let N be the IMLN depicted in Fig 1. The figure depicts two primary subtrees of U(N) corresponding to the terms λ₁λ₂ x₂ y³ (left), and (right), of the polynomial p(N).

https://doi.org/10.1371/journal.pone.0268181.g003

Remark 3. In this remark we give a first approximation to the problem of reconstructing the Newick string of an IMLT U(N) from p(N), in the case where the polynomial characterizes N. Roughly speaking, we proceed as follows: start by subtracting y from p(N) and then factor p(N) − y = q₁ ⋅ q₂. Then the Newick string to consider is (q₁, q₂). From now on, whenever it is possible to subtract y from a polynomial q, do so. If the factorization involves only two members, q = q₁ ⋅ q₂, then proceed as before and replace q by (q₁, q₂). Otherwise, there could be conflicts in terms of deciding how to group members in a factorization of type where q_k are polynomials. But among the factorizations still pending to be grouped there will always be a pair in which a “minimum” monomial of type λ_i ⋅ q_s is common to both; this allows one to determine that there is an arc from an elementary node labelled by λ_i to the subtree determined by the polynomial q_s. In terms of the Newick string, it could be replaced by (λ_i(q_s)).
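To illustrate the first steps on the simplest case, consider a hypothetical tree without elementary nodes, so that every factorization has exactly two members. The polynomial below is an assumed input chosen only for illustration:

```latex
\begin{align*}
p(N) &= y + x_3\,(y + x_1 x_2), \\
p(N) - y &= \underbrace{x_3}_{q_1}\cdot\underbrace{(y + x_1 x_2)}_{q_2}
  &&\Rightarrow\ (q_1, q_2) = (x_3, q_2), \\
q_2 - y &= x_1 \cdot x_2
  &&\Rightarrow\ q_2 \mapsto (x_1, x_2).
\end{align*}
```

The recovered Newick string is therefore (x₃,(x₁,x₂)). The factor x₃ contains no y, so the procedure stops there with a leaf.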

We are now especially interested in determining under which conditions the polynomial associated to an IMLN uniquely characterizes it. Note that this is not always the case, not even for IMLT’s. See for instance the three IMLT’s in Fig 4: the polynomial fails to distinguish between them. Roughly speaking, looking at the polynomials of the elementary vertices we could readily distinguish between the three possibilities, but we cannot do so by only looking at p(u), since p(u) = y + λ₁λ₂ p(w₁)p(w₂) in all three cases.
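The coincidence can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than a transcription of Fig 4: it takes three hypothetical placements of two elementary nodes labelled λ₁, λ₂ between a tree node u and two leaves w₁, w₂ (both above w₁, one above each, both above w₂), encodes each IMLT as a nested tuple, and represents polynomials as dictionaries from monomials to coefficients. All names are illustrative.

```python
from collections import Counter

def var(name):
    return {((name, 1),): 1}

def add(p, q):
    out = Counter(p)
    for mono, coeff in q.items():
        out[mono] += coeff
    return dict(out)

def mul(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            exps = Counter(dict(m1))
            for v, k in m2:
                exps[v] += k
            out[tuple(sorted(exps.items()))] += c1 * c2
    return dict(out)

def p_tree(t):
    """Polynomial of an IMLT: ('leaf', x), ('elem', lam, sub) or ('tree', l, r)."""
    if t[0] == "leaf":
        return var(t[1])
    if t[0] == "elem":
        return mul(var(t[1]), p_tree(t[2]))
    return add(var("y"), mul(p_tree(t[1]), p_tree(t[2])))

w1, w2 = ("leaf", "x1"), ("leaf", "x2")
both_left = ("tree", ("elem", "l1", ("elem", "l2", w1)), w2)
one_each = ("tree", ("elem", "l1", w1), ("elem", "l2", w2))
both_right = ("tree", w1, ("elem", "l1", ("elem", "l2", w2)))

# Same polynomial at u in all three cases ...
print(p_tree(both_left) == p_tree(one_each) == p_tree(both_right))
# ... although the polynomials at the children of u differ.
print(p_tree(both_left[1]) == p_tree(one_each[1]))
```

The first comparison holds because multiplication is commutative, so p(u) records only which labels occur on the strong paths below u, not on which side they occur.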

Fig 4. Non-isomorphic IMLT’s.

Three non-isomorphic IMLT’s presenting the same polynomial at u.

https://doi.org/10.1371/journal.pone.0268181.g004

Strong paths.

We shall now present a series of definitions. Let N be an IMLN, and u, v ∈ V(N). If there exists a path from u to v consisting only of elementary or reticulation nodes, we say that u is a strong ancestor of v, and that v is a strong descendant of u. Such a path is called a strong path. For example, by considering the situation in Fig 4, we can see that in all three cases w₁, w₂ strongly descend from u.

Lemma 13. Let N be an internally labelled phylogenetic network, and v₁, v₂ two reticulation nodes. If p(v₁) = p(v₂), then v₁ = v₂.

Proof. Let w₁ be the child of v₁; by the definition of the polynomial, p(v₁)/p(w₁) = λ_i for some λ_i ∈ {λ₁, …, λ_r}. Since p(v₁) = p(v₂), it also means that p(v₂)/p(w₁) = λ_i, but since N is an internally labelled phylogenetic network this implies that v₂ is a parent of w₁ and that ℓ(v₂) = λ_i. Thus, they are the same node.

Lemma 14. Let N be an internally labelled phylogenetic network, and v a reticulation node in it. A node u is a strong ancestor of v if, and only if, one of the two following conditions happens:

p(v) | p(u), that is p(v) divides p(u), and then u is a reticulation node, or
p(v) | (p(u) − y), and then u is a tree node.

Proof. By the definition of the polynomial and Lemma 13.

Now, if we want to compare two IMLN’s on the same sets of labels {x₁, …, x_n} and {λ₁, …, λ_r}, we should take into account the possibility that two of them are isomorphic up to a permutation of the labels. In order to express this possibility, let σ: {x₁, …, x_n, λ₁, …, λ_r} → {x₁, …, x_n, λ₁, …, λ_r} be a permutation such that σ(X) = X (i.e., that fixes the sets of labels of the leaves and of the elementary or reticulation nodes). Given an IMLN N, we denote by ^σN the network isomorphic to N that has all its labels permuted according to σ, and by ^σp(N) we mean p(^σN) or, equivalently, the polynomial that has all its variables changed according to σ.
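For instance, assume a hypothetical IMLN N with n = 3 and r = 2 whose polynomial is the one shown below. The permutation σ that swaps λ₁ with λ₂ and x₁ with x₂, and fixes x₃, acts as follows:

```latex
\begin{align*}
p(N) &= y + \lambda_1 x_1\,(y + \lambda_2 x_2 x_3), \\
{}^{\sigma}p(N) &= y + \lambda_2 x_2\,(y + \lambda_1 x_1 x_3).
\end{align*}
```

The variable y is never permuted, and σ sends leaf labels to leaf labels and reticulation labels to reticulation labels.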

Definition 7. Let N₁, N₂ be two IMLN’s, and σ a permutation of their labels such that σ(X) = X. We say that N₁ and N₂ are equivalent modulo strong paths if the following three conditions are satisfied:

p(N₁) = ^σp(N₂);
there exists a bijection f between the sets of tree nodes of N₁ and N₂ such that, if u, v are tree nodes and v is a strong descendant of u, then f(v) is a strong descendant of f(u);
for any tree node u in N₁, p(u) = ^σp(f(u)).

Being equivalent modulo strong paths is an equivalence relation.

Remark 4. The above definition can also be easily stated exclusively in terms of strong paths, which are intrinsic to the IMLN. However, the definition in terms of the polynomial is more tractable and concise.

Notice that all the IMLT’s in Fig 4 are equivalent modulo strong paths. Indeed, we present the following theorem:

Theorem 15. Let N₁, N₂ be two IMLN’s, and σ a permutation of their labels such that σ(X) = X. Then, p(N₁) = ^σp(N₂) if, and only if, N₁ and N₂ are equivalent modulo strong paths.

Proof. The “if” part of the implication is direct by the first condition of the definition of equivalence modulo strong paths.

Suppose now that p(N₁) = ^σp(N₂), and let us show that N₁ and N₂ must be equivalent. We first see that there exists a bijection f between the sets of tree nodes of N₁ and N₂ such that for any tree node u in N₁, p(u) = ^σp(f(u)). We will use the following inductive schema: we shall prove that, if u is a tree node in N₁ and f(u) is a tree node in N₂ such that p(u) = ^σp(f(u)), and w₁, w₂ in N₁ are the two tree nodes that strongly descend from u, then the two tree nodes w₁′, w₂′ that strongly descend from f(u) in N₂ are such that p(w₁) = ^σp(w₁′) and p(w₂) = ^σp(w₂′). Then, we will provide tree nodes u₁, u₂ in N₁ and N₂, respectively, from which all other tree nodes will descend and such that p(u₁) = ^σp(u₂).

Let u be a tree node in N₁, and w₁, w₂ be the two tree nodes that strongly descend from it. Then, p(u) = y + μ₁ ⋅ … ⋅ μ_r′ p(w₁)p(w₂), for μ₁, …, μ_r′ ∈ {λ₁, …, λ_r}. Then, if p(u) = ^σp(f(u)), we have p(u) = y + ν₁ ⋅ … ⋅ ν_r″ ^σp(w₁′) ^σp(w₂′) for some ν₁, …, ν_r″ ∈ {λ₁, …, λ_r}, where w₁′, w₂′ are the tree nodes that strongly descend from f(u) in N₂; but since p(w₁), p(w₂) are both irreducible and different from any λ_i, then it must happen that (without loss of generality) p(w₁) = ^σp(w₁′) and p(w₂) = ^σp(w₂′). Thus, set f(w₁) = w₁′ and f(w₂) = w₂′.

We will now show that there is a tree node in both N₁ and N₂ such that any other tree node descends from it. Suppose that the root of N₁, say ρ₁, is a tree node; if so, since p(N₁) = ^σp(N₂) and by Proposition 11, the root of N₂, say ρ₂, must also be a tree node. Therefore, any other tree node in their respective IMLN’s must descend from them, and furthermore p(ρ₁) = ^σp(ρ₂). Set f(ρ₁) = ρ₂.

Finally, suppose that ρ₁ is not a tree node; then, p(ρ₁) is not an irreducible polynomial, and therefore neither is ^σp(ρ₂). Let w₁ be the only tree node strongly descending from ρ₁ in N₁. It is straightforward to see that, if w₁′ is the only tree node strongly descending from ρ₂ in N₂, then p(w₁) = ^σp(w₁′). In both cases, any other tree node in the network will descend from them. Therefore, set f(w₁) = w₁′.

Now, the question arises: under which conditions can we say that two internally labelled phylogenetic networks that are equivalent modulo strong paths are actually isomorphic?

Separability: A sufficient condition.

In this part we shall give a sufficient condition for two internally labelled phylogenetic networks to be completely characterized by the polynomial. In order to do so, we will work with the immediate neighbourhood of any tree node.

Let N be a phylogenetic network, and let u be a tree node in N. Let w₁, w₂ be the two (possibly equal) tree nodes that strongly descend from it. Let be the reticulation nodes in the strong paths from u to w₁ and w₂, and suppose that there are r₁ such nodes in the path from u to w₁ and r₂ in the other. See Fig 5. Let U(u) = {u₁, …, u_k} be the set of all the tree nodes that are strong ancestors of w₁ or w₂ different from u. Note that the node u_i in Fig 5 (left) is a node in U(u). In what follows, we will allow ourselves to write U if the context is sufficiently clear. We will present now the following lemma.

Fig 5. Strong paths from a tree node.

A tree node u and its strong descendants w₁ and w₂ (left) or w₁ (right). The curly paths represent strong paths. The nodes v and u_i are used in the proof of Lemma 16.

https://doi.org/10.1371/journal.pone.0268181.g005

Lemma 16. Consider the situation above. Let v be a reticulation node from the collection . Then, there are two possibilities:

both its parents are nodes from , or
there exists at least one tree node u_i ∈ U such that there is a strong path from u_i to v not containing any other reticulation node .

Furthermore, the first possibility can only happen for one reticulation node in , and it will hold if, and only if, w₁ = w₂.

Proof. Suppose that v is the first reticulation node (counting by proximity to u) that satisfies the first condition (this makes sense, since our networks are binary). In this situation, only one path emerges from it up to the next tree node. But since N is binary, the two paths that emerged from u are now merged into the single path from v to the next tree node, so w₁ = w₂. See Fig 5, right. Therefore, since there is now only one path of reticulation nodes, no other node in it can satisfy the first condition.

If v does not satisfy the first condition, one of its parents must not be from . Let u_i be a tree node strong ancestor of such a parent of v. The pair v, u_i satisfies the second condition. See Fig 5, left.

We say that a tree node u_i ∈ U(u) enters the neighbourhood of u at v if the pair v, u_i satisfies the second condition of Lemma 16. If the context is sufficiently clear, we shall only say that it enters at v. Likewise, we say that v is the entry of u_i to the neighbourhood of u (or that it is just its entry).

We can then divide the set U into five sets: let v^(x), x ∈ {1, 2}, be the two children of u, then we define

Notice that, if w₁ ≠ w₂, then

The above division is a partition of U. In Fig 6 three tree nodes u₁, u₂ and u₃ from the set U = U(u) are represented. Note that , and u₃ ∈ U₃.

Fig 6. Division of U(u).

Three tree nodes illustrating the types of sets in the division of U(u). In this case, , and u₃ ∈ U₃.

https://doi.org/10.1371/journal.pone.0268181.g006

In general, given all the polynomials evaluated at each tree node of U, we cannot deduce the exact configuration of the v_i’s. Recall, for instance, the three situations presented in Fig 4 for the case where r₁ + r₂ = 2: there, we had no a priori information on which v_i were strong ancestors of w₁ and which of w₂. This fact motivates the following definition.

Definition 8. Let N be a phylogenetic network and u a tree node in it. Let v^(x), x ∈ {1, 2}, be the two children of u. We say that u is separable if either v⁽¹⁾ and v⁽²⁾ are tree nodes, or if there exists a tree node u₁ different from u such that it satisfies one of the following conditions:

u₁ is a strong ancestor of v⁽¹⁾ (or v⁽²⁾) but not of any other strong descendant of u, or
u₁ is a strong ancestor of v⁽¹⁾ (or v⁽²⁾) and of one of its strong descendants.

Remark 5. In this case, the negative definition might be more intuitive. Let u be a tree node with w₁ and w₂ the tree nodes strongly descended from u. Then u is not separable if none of its two children v⁽¹⁾ and v⁽²⁾ are tree nodes, and

if w₁ ≠ w₂, all the strong ancestors of v⁽¹⁾, v⁽²⁾ that are not u are in U₃(u), or
if w₁ = w₂ and v is the first reticulation node that is a strong descendant of both v⁽¹⁾ and v⁽²⁾, then any strong ancestor of v⁽¹⁾ that is not u will be a strong ancestor of a reticulation node in the strong path from v⁽²⁾ to v, and vice versa.

A phylogenetic network is called separable if all its tree nodes are so.

Remark 6. Notice that separability is a purely topological condition. Thus, we will use it interchangeably for phylogenetic networks and internally labelled phylogenetic networks.

The key point in separability is that given u a separable tree node and all the polynomials of the tree nodes that are strong ancestors of w₁ and w₂, we can actually identify the polynomial p(u₁) of the tree node that satisfies the conditions of the definition, and thus we can identify which reticulation nodes descend from v⁽¹⁾ and which from v⁽²⁾. Indeed: if w₁ ≠ w₂, p(u₁) will be such that p(w₁) divides p(u₁) − y but p(w₂) does not, and contains the largest number of λ₁, …, λ_r dividing p(u) − y. If w₁ = w₂, the argument is analogous using . As a result, we are able to deduce that , x ∈ {1, 2}, for dividing p(u) − y. Thus, we are able to “separate” p(u) into the contributions from p(v⁽¹⁾) and p(v⁽²⁾).

Fig 7 depicts two sub-networks which can be part of internally labelled phylogenetic networks (and then part of the underlying phylogenetic networks) that are not separable. Notice that they are not separable at any of the nodes u₁, u₂, u₃. The filled triangle and non-filled triangle pendant at w₁ and w₂ represent non-isomorphic sub-networks (for example a leaf and a cherry). Note that in both cases we have the same polynomials at u_i, namely p(u₁) = y + λ₁λ₂λ₃ p(w₁)p(w₂), p(u₂) = y + λ₁λ₂λ₃λ₄ p(w₁)p(w₂) and p(u₃) = y + λ₁λ₂λ₄ p(w₁)p(w₂). Thus, we cannot distinguish between the sub-networks when looking at p(u₁), p(u₂), p(u₃).

Fig 7. Non separable internally labelled phylogenetic networks.

None of the nodes u₁, u₂, u₃ are separable. The filled and non-filled triangles pending from w₁ and w₂ represent non-isomorphic sub-networks.

https://doi.org/10.1371/journal.pone.0268181.g007

Lemma 17. Let N be an internally labelled phylogenetic network, and u₁ a tree node in it such that it is one of the deepest tree nodes (i.e., one for which there exists a path of maximal length from the root to it) satisfying the following condition: there exists another tree node u₂ such that p(u₁) = p(u₂). Then, u₁ and u₂ must have the same set of children.

Proof. If u₁ is a leaf, there is nothing to prove, because all the leaves have different labels: if p(u₁) = p(u₂) and p(u₁) = φ(u₁), we must have u₂ = u₁. Otherwise, let v⁽¹⁾, v⁽²⁾ be the two children of u₁; since p(v⁽¹⁾) and p(v⁽²⁾) both divide p(u₂) − y and are unique (because u₁ is one of the deepest nodes satisfying the condition in the statement of the lemma), u₂ is a strong ancestor of both of them. Therefore, v⁽¹⁾, v⁽²⁾ must be reticulation nodes.

We write where w₁, w₂ are the tree nodes that strongly descend from u₁, for x ∈ {1, 2}, and . From v^(x) to w_x there is only one strong path of length r_x, and since u₂ is a strong ancestor of both v⁽¹⁾ and v⁽²⁾ there are r₁ + r₂ polynomials λ₁, …, λ_r that divide p(u₂) − y. But these are exactly the number of polynomials in λ₁, …, λ_r that must divide p(u₂) − y, since p(u₁) = p(u₂).

Lemma 18. Let N be an internally labelled separable phylogenetic network, and u₁, u₂ two internal nodes in it. Then, p(u₁) = p(u₂) if, and only if, u₁ = u₂.

Proof. The “if” part is trivial by the definition of the polynomial. By Lemma 13, if either u₁, or u₂ is a reticulation node, the result is proven. Therefore, assume that u₁, u₂ are both tree nodes, and suppose, for the sake of contradiction, that u₁ ≠ u₂. Furthermore, assume that u₁ is one of the deepest nodes satisfying that p(u₁) = p(u₂).

By Lemma 17, their sets of children are the same. Let v₁, v₂ be the two children of u₁ and u₂. Then u₁ and u₂ are the only strong ancestors of both v₁ and v₂. Moreover, u₂ is in U₃(u₁). This means that u₁ is not separable and, therefore, neither is N.

Corollary 19. If N is a separable phylogenetic network, then there is no pair of tree vertices with the same set of children.

Note that the other direction of the implication in the above Corollary is false. See for instance the (internally labelled) phylogenetic subnetworks depicted in Fig 7. These are non separable, yet every pair of their tree nodes has different sets of children.

Isomorphism of internally labelled phylogenetic networks.

In this part we prove the main theorem of this paper. It roughly says that the polynomial is a complete invariant for the class of internally labelled separable phylogenetic networks up to equivalence modulo strong paths.

Lemma 20. Let N₁, N₂ be two internally labelled phylogenetic networks such that, for any u₁, u₂ ∈ N_x, x ∈ {1, 2}, p(u₁) = p(u₂) implies that u₁ = u₂. Suppose that, for any u, v ∈ V(N₂), p(u) ≠ p(v) if u ≠ v, and let f: V(N₁) → V(N₂) be a bijection. If there exists a permutation σ of their labels with σ(X) = X such that p(u) = ^σp(f(u)) for any u ∈ V(N₁), then f is an isomorphism of internally labelled phylogenetic networks.

Proof. In order to ease the notation, and without loss of generality, let us assume that σ is the identity. The fact that f is a bijection is already required in the statement of the Lemma. Then, we must prove that if (u, v) ∈ E(N₁), then (f(u), f(v)) ∈ E(N₂) and that f preserves the labels.

Suppose that u is a reticulation node; if (u, v) ∈ E(N₁), then p(u) = λ_i p(v) for some λ_i ∈ {λ₁, …, λ_r}. Therefore, p(f(u)) = λ_i p(f(v)) which, since p(f(v)) is unique for f(v), implies that f(v) is the only child of f(u) (which is a reticulation node since p(f(u)) is not irreducible).

Suppose now that u is a tree node, and let v₁, v₂ be its two children. Then, we know that p(v_x) = p(f(v_x)) for x ∈ {1, 2}, and that p(f(u)) = y + p(f(v₁))p(f(v₂)). Since each node is uniquely characterized by its polynomial, it means that both f(v₁) and f(v₂) are strong descendants of f(u). By an argument analogous to that in the proof of Lemma 17, we can deduce that f(v₁) and f(v₂) are actually the children of f(u).

Now, we prove that f preserves the labels on the leaves and on the reticulations. If u ∈ L(N₁), then f(u)∈L(N₂). Since u ∈ L(N₁), by definition, p(u) = φ₁(u). Moreover, p(u) = p(f(u)) because leaves are tree nodes. Since f(u) ∈ L(N₂), p(f(u)) = φ₂(f(u)). Then, φ₁(u) = φ₂(f(u)). Now, let u ∈ R(N₁) (a reticulation on N₁). By definition, p(u) = ℓ₁(u)p(v), where v is the single child of u. We have seen above that p(f(u)) = ℓ₁(u)p(f(v)); but, since f(u) is a reticulation in N₂ and f(v) is its single child, by definition, p(f(u)) = ℓ₂(f(u))p(f(v)). Then, ℓ₁(u) = ℓ₂(f(u)).

Theorem 21. Let N₁, N₂ be two internally labelled separable phylogenetic networks. If they are equivalent modulo strong paths, then they are isomorphic.

Proof. By Lemma 18, if N₁ and N₂ are separable, then p(u₁) = p(u₂) implies u₁ = u₂ for any internal node in either N₁ or N₂. Then, if we are able to find a bijection f between the sets of nodes satisfying the premises of Lemma 20, we will be able to apply it and show the result.

Now, N₁ and N₂ are equivalent modulo strong paths, and that means that there exists a bijection f between the sets of tree nodes such that, for a fixed permutation σ between the sets of labels with σ(X) = X, p(u) = ^σp(f(u)) for any tree node u, and if u, v are tree nodes and v is a strong descendant of u, then f(v) is a strong descendant of f(u). We shall show that this f induces our bijection if we generalize it to any internal node (i.e., if we define it correctly over the reticulation nodes in N₁). In order to ease the notation, and without loss of generality, let σ be the identity.

Let v be a reticulation node in N₁, and u a tree node that is a strong ancestor of it. Let v⁽¹⁾, v⁽²⁾ be the children of u, and suppose that v strongly descends from v⁽¹⁾. Let w₁, w₂ be the two (possibly equal) tree nodes that strongly descend from u.

Since N₁ is separable, in particular u is separable, and we know that we can write and . Now, by Lemma 16, either (1) there exists a tree node u′ that enters the neighbourhood of u at v, or (2) it does not and both parents of v are strong descendants of u.

Thus, we distinguish the following cases:

(1). There exists a tree node u′ that enters the neighbourhood of u at v, and
- if v is the only reticulation node at which u′ enters the neighbourhood of u (that is ), then , where are the only polynomials in λ₁, …, λ_r that divide both p(u) − y and p(u′) − y.
- if u′ also enters the neighbourhood of u at some v′ and there is no strong path between v and v′ (that is u′ ∈ U₃(u)), then , where are the only polynomials in λ₁, …, λ_r that divide both p(u) − y and p(v⁽¹⁾).
- if u′ also enters the neighbourhood of u at some v′ that is a strong ancestor of v (that is a case where ), then , where are the only polynomials in λ₁, …, λ_r such that they divide p(u) − y and, for every j ∈ {i₁, …, r₁}, .
- if u′ also enters the neighbourhood of u at some v′ that is a strong descendant of v (that is a case where ), then , where are the only polynomials in λ₁, …, λ_r that divide both p(u) − y and p(u′) − y.
Notice that the above arguments are independent of whether w₁ = w₂ or not.
(2). Both parents of v are strong descendants of u (and so w₁ = w₂). Let the label of the reticulation v and let the labels of reticulations in the strong path from v to w₁. Then , where μ_j for j ∈ {i₁, …, r₃} are the only polynomials in λ₁, …, λ_r such that (μ_j)²∣p(u) − y.

Since N₂ is also separable, in particular f(u) is separable, and since p(f(u)) = p(u) (because N₁ and N₂ are equivalent modulo strong paths), some of its children cannot be a tree node. Therefore, if are its children, there must exist a tree node u₁ that is either a strong ancestor of but not of any other strong descendant of f(u) or a strong ancestor of, say, and of one of its strong descendants. This node will allow us to characterize . But since N₁ and N₂ are equivalent modulo strong paths, there exists f⁻¹(u₁) in N₁ that satisfies the same condition with regard to the pair u, v⁽¹⁾ in N₁, and so and . Thus, we set and .

Now, for any v_* reticulation node strongly descending from either or , any of its strong ancestors that are tree nodes are such that there exists a tree node in N₁ with its same polynomial (and thus, is a strong ancestor of some v strongly descending from u). Therefore, we will have that p(v) = p(v_*), and we can then set f(v) = v_*.

Theorem 15 and Theorem 21 together imply the following main result.

Theorem 22. Let N₁, N₂ be two internally labelled separable phylogenetic networks, and σ a permutation of their labels such that σ(X) = X. If p(N₁) = ^σp(N₂), then N₁ and N₂ are isomorphic.

Orchard networks.

In this subsection we prove that the phylogenetic networks in the class of orchard networks [12] are separable. These (strictly) include tree-child networks.

Before we recall the definition of orchard networks, we need to introduce some definitions. Let N be a phylogenetic network on X. Let {a, b} ⊆ X. The set {a, b} is a cherry of N if a and b share a parent. Let p_a and p_b be the parents of a and b, respectively. If p_b is a reticulation and (p_a, p_b) is an arc in N, then {a, b} is a reticulated cherry of N.

Let N be a phylogenetic network and let {a, b} be a cherry of N. Then “reduce b” is the operation of deleting b and suppressing the resulting elementary node. If p_a = p_b is the root of N, then delete b and the root. If {a, b} is a reticulated cherry of N in which p_b is the reticulation, “cut {a, b}” is the operation of deleting (p_a, p_b), and suppressing the two resulting elementary nodes. For both operations, we say that a cherry-reduction is performed on N.

Let N be a phylogenetic network. The sequence N = N₀, N₁, …, N_k of phylogenetic networks is a cherry-reduction sequence of N if, for all i ∈ {1, …, k}, the phylogenetic network N_i is obtained from N_i−1 by a (single) cherry-reduction. Then, a phylogenetic network N is orchard if there exists a cherry-reduction sequence N = N₀, N₁, …, N_k of N such that N_k consists of a single vertex.
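The definitions above translate into a simple reduction loop. The following Python sketch is one possible rendering under stated assumptions, not a reference implementation: a binary network is given as a set of arcs (parent, child) over string node names, the function names are hypothetical, and reducible pairs are picked greedily. A successful run exhibits a cherry-reduction sequence and therefore shows that the network is orchard. Reading a failed run as a proof of non-orchardness relies on the order-independence of cherry reductions reported in the orchard-network literature, which is assumed here and not proved in this paper.

```python
def suppress(arcs, n):
    """Suppress node n if it has exactly one child and at most one parent."""
    ins = [u for (u, v) in arcs if v == n]
    outs = [v for (u, v) in arcs if u == n]
    if len(outs) != 1 or len(ins) > 1:
        return
    arcs.discard((n, outs[0]))
    if ins:                                # elementary node: bypass it
        arcs.discard((ins[0], n))
        arcs.add((ins[0], outs[0]))
    # otherwise n was a root with a single child, and it is simply deleted

def find_step(arcs):
    """Return ('reduce' | 'cut', a, b) for some reducible pair, or None."""
    parents, has_child = {}, set()
    for u, v in arcs:
        parents.setdefault(v, []).append(u)
        has_child.add(u)
    leaves = sorted(v for v in parents if v not in has_child)
    for a in leaves:
        for b in leaves:
            if a == b:
                continue
            pa, pb = parents[a][0], parents[b][0]
            if pa == pb:                   # cherry {a, b}
                return ("reduce", a, b)
            if len(parents.get(pb, [])) == 2 and pa in parents[pb]:
                return ("cut", a, b)       # reticulated cherry, p_b reticulation
    return None

def is_orchard(arcs):
    arcs = set(arcs)
    while arcs:                            # no arcs left: a single vertex
        step = find_step(arcs)
        if step is None:
            return False
        kind, a, b = step
        pa = next(u for (u, v) in arcs if v == a)
        pb = next(u for (u, v) in arcs if v == b)
        if kind == "reduce":               # delete leaf b, then tidy up
            arcs.discard((pb, b))
            suppress(arcs, pa)
        else:                              # delete the arc (p_a, p_b)
            arcs.discard((pa, pb))
            suppress(arcs, pa)
            suppress(arcs, pb)
    return True

# Hypothetical examples: one reticulation h with parents t1 and t2 ...
n_ok = [("r", "t1"), ("r", "t2"), ("t1", "a"), ("t1", "h"),
        ("t2", "h"), ("t2", "c"), ("h", "b")]
# ... and two reticulations h1, h2 that share both parents.
n_bad = [("r", "t1"), ("r", "t2"), ("t1", "h1"), ("t1", "h2"),
         ("t2", "h1"), ("t2", "h2"), ("h1", "a"), ("h2", "b")]
print(is_orchard(n_ok), is_orchard(n_bad))
```

The second example has neither a cherry nor a reticulated cherry to begin with, so no cherry-reduction sequence can even start.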

Theorem 23. Orchard networks are separable.

Proof. Let N be an orchard network and let N = N₀, N₁, …, N_k be a cherry-reduction sequence of N. We prove that, for any i ∈ {0, …, k − 1}, if N_i is not separable, then N_i+1 is not separable either. This means that, if N were not separable, the last network in every cherry-reduction sequence could not be a single vertex, contradicting the fact that N is orchard.

If a leaf of a cherry is reduced, there is nothing to prove, because the operation does not involve reticulation nodes. Suppose then that a reticulated cherry {a, b} is cut in N_i. Let p_a and p_b be the parents of a and b, respectively, with p_b the reticulation node. Then p_a is a tree node. Moreover, p_a is a separable node in N_i, because the only reticulation node that is a strong descendant of p_a is p_b. Hence, if N_i is not separable, this is due to some other tree node.

Notice that the cut of the reticulated cherry {a, b} does not change the relation of strong descendance among the remaining nodes; i.e., v strongly descends from u in N_i if, and only if, the corresponding nodes in N_i+1 satisfy this condition too. More precisely, let u be a non separable tree node, v⁽¹⁾, v⁽²⁾ its children and w₁, w₂ the tree nodes that strongly descend from it. By Remark 5 this means that, to begin with, neither v⁽¹⁾ nor v⁽²⁾ is a tree node and, if w₁ ≠ w₂, all the strong ancestors of v⁽¹⁾, v⁽²⁾ that are not u are in U₃(u). Now, p_a can never be in U₃(u) because one of its children is a leaf, a. Therefore, the cut of the reticulated cherry {a, b} does not affect the non separability of u. Suppose now that w₁ = w₂. By Remark 5, if v is the first reticulation node that is a strong descendant of both v⁽¹⁾ and v⁽²⁾, the reticulation node p_b cannot be in the strong paths from v⁽¹⁾ to v and from v⁽²⁾ to v (note also that necessarily p_b ≠ v). Then, both strong paths remain untouched by the cut of the reticulated cherry, and so does the set of strong ancestors of v⁽¹⁾ and v⁽²⁾ that cause the non separability of u. Therefore, any non separable tree node in N_i continues to be so in N_i+1.

Unlabelled version.

Throughout this paper we have not made any use of the different labels of the leaves of an IMLN, and so the arguments could be translated, mutatis mutandis, to IMLN’s whose leaves are not labelled (although internal labels would still be necessary), modelled by labelling all leaves using a single variable x, to give a polynomial in ℤ[x, λ₁, …, λ_r, y]. Again, for the case of phylogenetic networks, this would require that given two unlabelled phylogenetic networks we consider internally labelled phylogenetic networks with the same topology. This leads to the following proposition:

Proposition 24. Let N₁, N₂ be two internally labelled separable phylogenetic networks whose leaves are all labelled by x. Then, p(N₁) = p(N₂) implies that N₁ and N₂ are isomorphic.

Discussion and conclusion

In this paper a new complete polynomial invariant for a class of (binary) phylogenetic networks, that of separable networks, is introduced. It generalizes results in both [2] for phylogenetic trees and in [3] for phylogenetic networks that are characterized by their set of embedded spanning trees (like tree-child networks). The introduced polynomial p is a generalization of the Liu polynomial and it is defined in a more generic structure of networks, called IMLN’s, where the reticulations are also labelled with labels other than those on the leaves. In contrast to [3], we compute the polynomial directly over the IMLN, and we avoid computing its set of spanning trees beforehand. We prove that for the case of separable phylogenetic networks, the internally labelled structure derived from those is completely characterized by the polynomial. This induces a complete polynomial invariant for separable phylogenetic networks. That is, given two separable phylogenetic networks N₁ and N₂ on X, we could fix an internally labelled phylogenetic network from N₁, say , by bijectively labelling the reticulations. Then, if we consider all possible internally labelled phylogenetic networks obtained from N₂ by the permutation of all its variables, X and the reticulations, we can compare with the polynomial of all the networks obtained from N₂. Note that, due to Proposition 24, we could avoid the permutation of the labels on X, reducing the cost of this computation.
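The comparison described above can be sketched as a brute-force search over relabellings. The Python fragment below is only an illustration under stated assumptions: polynomials are given already expanded as dictionaries from monomials (sorted tuples of (variable, exponent) pairs) to coefficients, only the reticulation labels are permuted while leaf labels are held fixed, and the function names and the three example polynomials are hypothetical. The search visits up to r! permutations, so it is practical only for small r.

```python
from itertools import permutations

def rename(poly, sigma):
    """Apply a variable renaming sigma (a dict) to a polynomial."""
    out = {}
    for mono, coeff in poly.items():
        new = tuple(sorted((sigma.get(v, v), k) for v, k in mono))
        out[new] = out.get(new, 0) + coeff
    return out

def match_up_to_relabelling(p1, p2, lambdas):
    """Return a permutation sigma of `lambdas` with p1 == sigma(p2), or None."""
    for image in permutations(lambdas):
        sigma = dict(zip(lambdas, image))
        if rename(p2, sigma) == p1:
            return sigma
    return None

# p1 = y + l1*x1*y + l1*l2*x1*x2*x3; p2 is p1 with l1 and l2 exchanged.
p1 = {(("y", 1),): 1,
      (("l1", 1), ("x1", 1), ("y", 1)): 1,
      (("l1", 1), ("l2", 1), ("x1", 1), ("x2", 1), ("x3", 1)): 1}
p2 = {(("y", 1),): 1,
      (("l2", 1), ("x1", 1), ("y", 1)): 1,
      (("l1", 1), ("l2", 1), ("x1", 1), ("x2", 1), ("x3", 1)): 1}
p3 = {(("y", 1),): 1, (("x1", 1), ("x2", 1)): 1}

print(match_up_to_relabelling(p1, p2, ["l1", "l2"]))  # {'l1': 'l2', 'l2': 'l1'}
print(match_up_to_relabelling(p1, p3, ["l1", "l2"]))  # None
```

By Theorem 22, a successful match between the polynomials of two internally labelled separable networks certifies that they are isomorphic.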

Establishing a complete polynomial invariant for phylogenetic networks opens the door to several interesting opportunities for exploration, such as new ways to define metrics on networks, fast methods to distinguish networks, and possibly ways to extract important features of a network by examining this polynomial. To this end, it may be helpful to understand whether a particular polynomial is derived from a network or not (for clearly not all irreducible polynomials give networks).

Furthermore, the computation of p(N) here may be performed reticulation-by-reticulation for some network classes, e.g., orchard networks [12]. That is, suppose that N is an internally labelled phylogenetic network derived from an orchard network and N = N₀, N₁, …, N_k is a complete cherry reduction sequence of N (that is N_k is a single node). We can perform an assignment of polynomials to all leaves in every intermediate IMLN N_j. Finally, p(N) is the polynomial assigned to the single node in N_k. Start by assigning p(u) = φ(u), for every leaf u in N₀. Then, let {v₁, v₂} be the two leaves involved in the cherry-reduction to move from N_j to N_j+1 and let p(v_i) be the polynomial assigned to v_i in N_j for i ∈ {1, 2}. Then,

if {v₁, v₂} is a cherry, assign to the resulting leaf in N_j+1 the polynomial y + p(v₁)p(v₂).
if {v₁, v₂} is a reticulated cherry (with v₂ being the child of the reticulation labelled by λ_i), assign to the resulting leaf in N_j+1 coming from the parent of v₁ the polynomial y + λ_i p(v₁)p(v₂), and to the resulting leaf in N_j+1 coming from the parent of v₂ the polynomial λ_i p(v₂).
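The two assignment rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: polynomials are dictionaries from monomials to integer coefficients, a cherry-reduction sequence is supplied by the caller as a list of steps (the tuple format and all names are hypothetical), and the example network (root with tree-node children t1 and t2, a reticulation labelled λ₁ with parents t1 and t2 and child b, and leaves a below t1 and c below t2) is invented for the purpose.

```python
from collections import Counter

def var(name):
    return {((name, 1),): 1}

def add(p, q):
    out = Counter(p)
    for mono, coeff in q.items():
        out[mono] += coeff
    return dict(out)

def mul(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            exps = Counter(dict(m1))
            for v, k in m2:
                exps[v] += k
            out[tuple(sorted(exps.items()))] += c1 * c2
    return dict(out)

def polynomial_from_sequence(leaf_label, steps):
    """Follow a complete cherry-reduction sequence and return p(N).

    ("reduce", v1, v2) deletes leaf v2 of the cherry {v1, v2};
    ("cut", v1, v2, lam) cuts the reticulated cherry {v1, v2}, where v2 is
    the child of the reticulation labelled lam.
    """
    y = var("y")
    poly = {leaf: var(x) for leaf, x in leaf_label.items()}
    for step in steps:
        if step[0] == "reduce":
            _, v1, v2 = step
            poly[v1] = add(y, mul(poly[v1], poly[v2]))
            del poly[v2]
        else:
            _, v1, v2, lam = step
            below_ret = mul(var(lam), poly[v2])
            poly[v1] = add(y, mul(poly[v1], below_ret))
            poly[v2] = below_ret
    (result,) = poly.values()              # exactly one leaf must remain
    return result

leaf_label = {"a": "x1", "b": "x2", "c": "x3"}
steps = [("cut", "a", "b", "l1"), ("reduce", "b", "c"), ("reduce", "a", "b")]
p_seq = polynomial_from_sequence(leaf_label, steps)

# Direct evaluation of y + (y + l1*x1*x2) * (y + l1*x2*x3) for comparison.
t1 = add(var("y"), mul(var("x1"), mul(var("l1"), var("x2"))))
t2 = add(var("y"), mul(mul(var("l1"), var("x2")), var("x3")))
p_direct = add(var("y"), mul(t1, t2))
print(p_seq == p_direct)  # True
```

Each step touches only the two leaves involved, so the polynomial is built without ever constructing the unfolding.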

It would be interesting to investigate more optimisations for general or for specific subclasses of phylogenetic networks.

It would also be interesting to think about ways to reduce the complexity of the polynomial assigned to a network, even at the expense of losing the uniqueness of this assignment. One possibility would be, for instance, to define a polynomial for a phylogenetic network over an IMLT into which the network is transformed, following an approach similar to the one that allows the computation of its extended Newick format [13]. Consider, for example, this: for every reticulation, split it (also copying its label) into two copies, the first copy with one of its parents and its child, and the other copy with the other parent and no children. See two examples of this decomposition in Fig 8 from the internally labelled phylogenetic network N depicted in Fig 1. Clearly, this transformation process is not unique, and different IMLT’s can be obtained from the same network; but different networks result in disjoint sets of IMLT’s. Notice that this process can be understood as a way to prune irrelevant subtrees of the IMLT U(N) defined in the Subsection Folding and unfolding, with the goal of keeping enough information to encode the network. Roughly speaking, to recover the network from these IMLT’s one should only merge every pair of nodes labelled by the same λ_i. Applying the definition of the polynomial p to these IMLT’s, we obtain, for the example depicted in Fig 8(a), the polynomial where (some of) the terms are notably simpler than in the original.

Fig 8. Subtrees of U(N).

Let N be the internally labelled phylogenetic network depicted in Fig 1. The figure depicts two (IMLT) subtrees of U(N).

https://doi.org/10.1371/journal.pone.0268181.g008

There are potentially many further questions arising that relate to phylogenetic networks more broadly. For instance, do embedded spanning trees characterize general internally labelled phylogenetic networks? That is, if we keep the labels on elementary nodes (which come from reticulation nodes) of the embedded spanning trees, can we extend the results in [11] from tree-child networks to more general networks? Which classes of phylogenetic networks are separable? Do FU-stable networks require all the labels of the polynomials λ₁, …, λ_r or can these be replaced by a single variable λ? And, above all, is there a complete characterization in topological terms of the phylogenetic networks that are characterized by the polynomial introduced in this article?

With all this, we hope that the results here will stimulate these and many other investigations.

Acknowledgments

The authors thank Francesc Rosselló for his helpful comments and suggestions, and the anonymous reviewers for their detailed comments on an earlier version of this manuscript.

References

1. Liu P. A tree distinguishing polynomial. Discrete Applied Mathematics. 2021;288:1–8.
2. Liu P., Biller P., Gould M., Colijn C. Analyzing Phylogenetic Trees with a Tree Lattice Coordinate System and a Graph Polynomial. Systematic Biology. 2022; syac008.
3. Janssen R., Liu P. Comparing the topology of phylogenetic network generators. Journal of Bioinformatics and Computational Biology. 2021;19(6):2140012. pmid:34895114
4. Cavender J., Felsenstein J. Invariants of phylogenies in a simple case with discrete states. Journal of Classification. 1987;4(1):57–71.
5. Cardona G., Rosselló F., Valiente G. Comparison of Tree-Child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009;6(4):552–569. pmid:19875855
6. Bai A., Erdős P., Semple C., Steel M. Defining phylogenetic networks using ancestral profiles. Mathematical Biosciences. 2021;332:108537. pmid:33453221
7. Willson S. Regular networks can be uniquely constructed from their trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010;8(3):785–796.
8. Van Iersel L., Moulton V. Trinets encode tree-child and level-2 phylogenetic networks. Journal of Mathematical Biology. 2014;68(7):1707–1729. pmid:23680992
9. Semple C., Toft G. Trinets encode orchard phylogenetic networks. Journal of Mathematical Biology. 2021;83(3):1–20. pmid:34420100
10. Huber K., Moulton V., Steel M., Wu T. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology. 2016;73(6-7):1761–1780. pmid:27107869
11. Francis A., Moulton V. Identifiability of tree-child phylogenetic networks under a probabilistic recombination-mutation model of evolution. Journal of Theoretical Biology. 2018;446:160–167. pmid:29548737
12. Erdős P., Semple C., Steel M. A class of phylogenetic networks reconstructable from ancestral profiles. Mathematical Biosciences. 2019;313:33–40. pmid:31077680
13. Cardona G., Rosselló F., Valiente G. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics. 2008;9(1):1–8. pmid:19077301
