Degree sums and dense spanning trees

Finding dense spanning trees (DST) in unweighted graphs is a variation of the well-studied minimum spanning tree problem (MST). We utilize established mathematical properties of extremal structures with the minimum sum of distances between vertices to formulate some general conditions on the sum of vertex degrees. We analyze the performance of various combinations of these degree sum conditions in finding dense spanning subtrees and apply our approach to practical examples. After briefly describing our algorithm, we also show how it can be used on variations of DST, motivated by variations of MST. Our work provides some insights on the role of various degree sums in forming dense spanning trees and hopefully lays the foundation for finding fast algorithms or heuristics for related problems.


Background information
A spanning tree T of a graph G is a connected acyclic subgraph that contains all vertices of G. In the case that G is weighted, the classic problem of finding the minimum spanning tree (MST) seeks the spanning tree with the minimum weight (sum of edge weights on the spanning tree). Because of its extensive applications, such as network design and cluster analysis, numerous studies have been published on the algorithms (see, for instance, [1] and the references therein) and on related topics, including a number of variations of MST such as the k-MST (finding the minimum subtree containing exactly k vertices), the Steiner tree problem [2], the degree constrained minimum spanning tree problem [3], the capacitated minimum spanning tree problem [4], and MST with conflict pairs [5].
A variation of the tree with minimum weight (in weighted graphs) is the tree with minimum sum of pairwise distances between vertices (in unweighted graphs). This "sum of distances" has been a simple but interesting mathematical concept since the early 20th century, and has received tremendous attention in the last couple of decades as the so-called Wiener index [6,7] for its applications in biochemistry:
$$W(G) = \sum_{\{u,v\} \subseteq V(G)} d(u, v).$$
Here d(u, v) is the distance between u and v. Thus a natural variation of the MST is to find the spanning tree with the minimum Wiener index. Extremal trees and graphs that minimize the Wiener index in various classes of graphs have been extensively studied; see [8] for an earlier informative survey and part of [9] for some recent results. One interesting observation was that the extremal structures that minimize the Wiener index usually maximize the number of subtrees (see for instance [10]). This correlation was further analyzed in [11]. The number of subtrees relates to the complexity of phylogeny reconstruction algorithms [12] and the "density" of graphs [13,14].
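As an illustration of the Wiener index as the sum of pairwise distances, the following minimal sketch computes it by breadth-first search from every vertex. The function name `wiener_index` and the adjacency-dictionary representation are choices made for this example only, not part of the original work.

```python
from collections import deque


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs.

    adj maps each vertex to a list of its neighbours; the graph is
    assumed to be connected and unweighted.
    """
    total = 0
    for source in adj:
        # Breadth-first search gives distances from `source` to all vertices.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    # Every unordered pair was counted once from each endpoint.
    return total // 2


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(path))  # 10
print(wiener_index(star))  # 9
```

On four vertices the star has the smaller Wiener index, matching the intuition that it is the "denser" tree.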
Intuitively, a "dense" structure with many subtrees tends to minimize the sum of distances. Consequently, the analogue of the MST problem in unweighted graphs is to find the densest spanning tree (DST), that is, the spanning tree with the minimum Wiener index. We explore the known mathematical properties of dense trees that lead to useful methods for solving DST.

Degree sequence and the greedy tree
In the study of dense trees, trees with a given degree sequence (non-increasing sequence of vertex degrees) are often considered. It has been established that the greedy tree (Definition 1) minimizes the Wiener index and maximizes the number of subtrees among all trees with a given degree sequence. Here we use deg(v) to denote the degree of a vertex v.
Definition 1 (Greedy Tree). With a given degree sequence, the greedy tree is achieved through the following "greedy algorithm":
i. Label the vertex with the largest degree as v (the root);
ii. Label the neighbors of v as v_1, v_2, . . ., and assign the largest degrees available to them such that $\deg(v_1) \ge \deg(v_2) \ge \cdots$;
iii. Label the neighbors of v_1 (except v) as v_11, v_12, . . ., such that they take all the largest degrees available and that $\deg(v_{11}) \ge \deg(v_{12}) \ge \cdots$, then do the same for v_2, v_3, . . .;
iv. Repeat (iii) for all the newly labeled vertices. Always start with the neighbors of the labeled vertex with largest degree whose neighbors are not labeled yet.
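The greedy algorithm of Definition 1 can be sketched as a breadth-first assignment of the sorted degrees. This is a minimal illustration under the assumption that the input is a valid tree degree sequence (positive integers summing to 2(n − 1)); the function name `greedy_tree` and the integer vertex labels are hypothetical.

```python
from collections import deque


def greedy_tree(degrees):
    """Build the greedy tree for a tree degree sequence.

    Vertices are labelled 0, 1, 2, ... in breadth-first order, and
    vertex k receives the k-th largest degree. Returns an adjacency dict.
    """
    seq = sorted(degrees, reverse=True)
    n = len(seq)
    if n < 2 or seq[-1] < 1 or sum(seq) != 2 * (n - 1):
        raise ValueError("not the degree sequence of a tree")

    adj = {v: [] for v in range(n)}
    queue = deque([0])      # the root takes the largest degree
    next_label = 1          # next unused (largest remaining) degree
    while queue:
        u = queue.popleft()
        # The root has deg(u) children; every other vertex has deg(u) - 1.
        children = seq[u] if u == 0 else seq[u] - 1
        for _ in range(children):
            w = next_label
            next_label += 1
            adj[u].append(w)
            adj[w].append(u)
            queue.append(w)
    return adj


tree = greedy_tree([3, 3, 2, 1, 1, 1, 1])
print(tree[0])  # [1, 2, 3]
print(tree[1])  # [0, 4, 5]
```

Processing vertices in queue order mirrors step (iv): since degrees are handed out in non-increasing order, the next vertex in the queue is always a labeled vertex of largest degree whose neighbors are not yet labeled.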
Furthermore, greedy trees with different degree sequences can be compared according to their Wiener indices or numbers of subtrees. Without going into details, it is easy to see that the degree sequences (6, 5, 4, 3, 2, 2, 1, . . ., 1) and (5, 4, 4, 3, 3, 3, 1, . . ., 1) correspond to trees with the same number of vertices, and it is easy to verify that the greedy tree with the first degree sequence is "denser". Based on the simple idea of putting larger degrees closer together and obtaining "better" degree sequences, an edge-swapping heuristic for the DST was presented in [15].
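The claim about the number of vertices can be checked with the handshake identity for trees. The short computation below is added for illustration; k denotes the number of entries equal to 1 in either sequence.

```latex
\[
\sum_{v \in V(T)} \deg(v) = 2(n-1), \qquad
6+5+4+3+2+2 \;=\; 22 \;=\; 5+4+4+3+3+3 .
\]
With $k$ entries equal to $1$ we have $n = 6 + k$, hence
\[
22 + k = 2(6 + k - 1) \;\Longrightarrow\; k = 12, \qquad n = 18 .
\]
```

Both sequences therefore describe trees on 18 vertices with 12 leaves.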
In order to further explore the potential of using degrees as a criterion for measuring the denseness of a spanning tree, we explore a number of conditions on the sum of vertex degrees.

Methodology
In a recent study [16], the sum of vertex degrees is used as a possible condition in an effort to find dense spanning trees. It is pointed out that finding a spanning tree T of a given graph G that maximizes
$$\sum_{uv \in E(T)} \left(\deg(u) + \deg(v)\right) \qquad (1)$$
can be handled through simple integer linear programming. Note that the condition
$$\sum_{uv \in E(T)} \left(\deg(u) + \deg(v)\right) + \sum_{d(u,v) = 2} \left(\deg(u) + \deg(v)\right) \qquad (3)$$
can also be easily realized through integer linear programming. This second condition takes into consideration the sum of degrees at distance 2 apart, and hence further selects from spanning trees with the same degree sequence. Due to the limitations of integer linear programming, further variations of such conditions cannot be tested. In this note, for a vector (of real numbers) $\mathbf{j} = (j_1, j_2, \ldots, j_i)$, we let $C_{\mathbf{j}}$ be the condition
$$C_{\mathbf{j}} = \sum_{k=1}^{i} C_{k, j_k},$$
where the condition $C_{i,j}$ is a generalization of (1), with
$$C_{i,j} = \sum_{d(u,v) = i} \left(\deg(u)^j + \deg(v)^j\right).$$
It is obvious that $C_{1,1}$ is exactly (1) and $C_{(1,1)}$ is exactly (3). We seek solutions to DST through maximizing $C_{\mathbf{j}}$. As an intuitive explanation, we note that maximizing such expressions finds "superior" degree sequences as discussed earlier, and that among spanning trees with the same degree sequence these conditions put vertices with larger degrees closer to each other.
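To make the degree sum conditions concrete, the sketch below evaluates such a condition on a given tree. It rests on assumptions that are not spelled out in the surviving text: the term for distance i sums deg(u)^j + deg(v)^j over unordered pairs at distance i, degrees and distances are measured in the tree itself, and a zero entry in the vector means that distance is ignored. The function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def all_distances(adj):
    """Distances between all vertex pairs of a connected graph, by BFS."""
    table = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        table[source] = dist
    return table


def degree_sum_condition(adj, powers):
    """Evaluate the degree sum condition for powers = (j_1, ..., j_i).

    Assumption: a pair at distance k contributes deg(u)**j_k + deg(v)**j_k,
    and distances with j_k == 0 (or beyond len(powers)) contribute nothing.
    """
    dist = all_distances(adj)
    deg = {v: len(adj[v]) for v in adj}
    total = 0
    for u, v in combinations(adj, 2):
        k = dist[u][v]
        if k <= len(powers) and powers[k - 1] != 0:
            j = powers[k - 1]
            total += deg[u] ** j + deg[v] ** j
    return total


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_sum_condition(path, (1,)), degree_sum_condition(star, (1,)))      # 10 12
print(degree_sum_condition(path, (1, 1)), degree_sum_condition(star, (1, 1)))  # 16 18
```

In both cases the star scores higher than the path, consistent with the star being the denser tree.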

Results and discussion
We will first explore the performance of the proposed methodology with various choices of $\mathbf{j}$. First we provide a comprehensive examination, followed by some concluding remarks on possible optimal vectors $\mathbf{j}$. We then apply our optimized parameters in some practical examples. We also briefly describe our algorithm and mention the application of our method to variations of DST, motivated by variations of MST in the literature. In the end we summarize our results and propose some future work.

Performance analysis
In this section we apply various degree sum conditions and evaluate the Wiener index of the resulting spanning tree.

Sum of degrees at distance i
First consider the case when $\mathbf{j}$ has only one nonzero entry. In what follows we let $\mathbf{j} = (1, 0, 0)$, $(0, 1, 0)$, $(0, 0, 1)$ and apply $C_{\mathbf{j}}$ to 1200 random graphs with 6, 7, or 8 vertices. The Wiener index of the selected spanning trees is evaluated and the distribution is plotted in Fig 1. Clearly $\mathbf{j} = (1, 0, 0)$ performs much better than the other two. We conclude that, at least for small graphs, the adjacent degree sum condition (corresponding to $\mathbf{j} = (1, 0, 0)$) outperforms the degree sum at any other single distance.
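For very small graphs, an experiment of this kind can be imitated by brute force: enumerate all spanning trees, pick one maximizing a degree sum condition, and compare its Wiener index with the true minimum. The sketch below is only an illustration and is not the algorithm or the integer linear program used in the study. It assumes that degrees and distances are taken in the candidate tree and that zero entries of the vector are ignored; all function names and the example graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def spanning_trees(n, edges):
    """Yield every spanning tree of a graph on vertices 0..n-1."""
    for subset in combinations(edges, n - 1):
        parent = list(range(n))
        acyclic = True
        for u, v in subset:
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:          # this edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:               # n - 1 acyclic edges form a spanning tree
            adj = {v: [] for v in range(n)}
            for u, v in subset:
                adj[u].append(v)
                adj[v].append(u)
            yield adj


def pair_distances(adj):
    """Yield (u, v, distance) for every unordered pair u < v."""
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, d in dist.items():
            if s < t:
                yield s, t, d


def wiener_index(adj):
    return sum(d for _, _, d in pair_distances(adj))


def score(adj, powers):
    """Degree sum condition; zero powers are skipped (an assumption)."""
    deg = {v: len(adj[v]) for v in adj}
    total = 0
    for u, v, d in pair_distances(adj):
        if d <= len(powers) and powers[d - 1] != 0:
            total += deg[u] ** powers[d - 1] + deg[v] ** powers[d - 1]
    return total


def select_tree(n, edges, powers):
    """A spanning tree maximizing the condition (ties: first one found)."""
    return max(spanning_trees(n, edges), key=lambda t: score(t, powers))


# Hypothetical example: vertex 0 is adjacent to every other vertex.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
chosen = select_tree(5, edges, (1, 0, 0))
best = min(wiener_index(t) for t in spanning_trees(5, edges))
print(wiener_index(chosen), best)  # 16 16
```

Enumerating all (n − 1)-subsets of edges is exponential and only feasible for graphs of roughly the size used here (6 to 8 vertices).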

Sum of degrees at different distances
In the case of a sparse graph, it is likely that the adjacent degree sum condition is no longer sufficient to find the best solution. For instance, the conditions with $\mathbf{j} = (2, 2, 0, 0)$ and $(2, 2, 1, 0)$ outperform that with $\mathbf{j} = (2, 0, 0, 0)$ when applied to the set of all (labeled) random graphs on 7 vertices and 10 edges, as plotted in Fig 3. At least for small graphs it does not seem to be beneficial to include the sum of a nonzero power of vertex degrees at distance 4 or more, as shown in Fig

Discussion
We have systematically analyzed the performance of various degree sum conditions to find dense spanning trees. When only one degree sum has nonzero power, the adjacent degree sum condition greatly outperforms the others (Fig 1), as one would expect. Furthermore, using larger exponents generally results in better performance (Fig 2).
On the other hand, since the adjacent degree sum condition is equivalent to simply the sum of squares of degrees, it is natural to expect that including multiple degree sums in the condition leads to better results. This is verified in Fig 3. However, when a star (generally considered the densest tree) or "the second densest" structure exists as a spanning subgraph, the adjacent degree sum condition always finds the densest spanning tree. In the case of graphs with 7 (labeled) vertices, all 9555 cases of the spanning star and 110691 cases of a spanning T_1 (a tree with degree sequence (5, 2, 1, 1, 1, 1, 1)) are found through the adjacent degree sum condition. This is because these dense spanning trees are uniquely determined by their degree sequences, and hence can be identified with the adjacent degree sum condition alone. This is formally stated below.
If a graph on n vertices contains as a spanning subgraph a star (a tree with degree sequence (n − 1, 1, 1, . . ., 1)) or a tree with degree sequence (n − 2, 2, 1, . . ., 1), using the adjacent degree sum condition will always find these spanning trees.
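The equivalence between the adjacent degree sum and the sum of squared degrees follows from a short counting argument, written out here for illustration with degrees taken in the tree T. The values for the two densest degree sequences are included as a worked check.

```latex
Each vertex $v$ lies on exactly $\deg(v)$ edges of $T$ and contributes
$\deg(v)$ to each of them, so
\[
\sum_{uv \in E(T)} \bigl(\deg(u) + \deg(v)\bigr)
  = \sum_{v \in V(T)} \deg(v)\cdot\deg(v)
  = \sum_{v \in V(T)} \deg(v)^2 .
\]
For the star, with degree sequence $(n-1, 1, \ldots, 1)$,
\[
(n-1)^2 + (n-1)\cdot 1 = n(n-1),
\]
and for the degree sequence $(n-2, 2, 1, \ldots, 1)$,
\[
(n-2)^2 + 2^2 + (n-2)\cdot 1 = n^2 - 3n + 6 .
\]
For $n = 7$ these values are $42$ and $34$, respectively.
```

The gap between the two values is 2n − 6, which is positive whenever n > 3.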
While in theory we believe that conditions involving five or more degree sums could be useful in very large graphs, it seems from our collected data that in practice (when all graphs are of "reasonable size") the conditions $C_{\mathbf{j}}$ with $\mathbf{j} = (4, 2, 0, 0)$, $(4, 2, 2, 0)$ or $(4, 2, 2, 2)$ result in the densest spanning trees.
First, Fig 6 shows two models of molecular circuits of cell cycle control, established using QIAGEN's Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, www.qiagen.com/ingenuity). The originally generated network contained many more proteins, including key proteins in regulating the cell cycle with extremely high relevance in human cancers. IPA analysis was used to represent the protein-interaction network by fewer proteins, which serve as molecular hubs for the circuits of cell cycle control.
Conditions with $\mathbf{j} = (4, 2, 0, 0)$, $(4, 2, 2, 0)$ or $(4, 2, 2, 2)$ lead to the same results, shown in Fig 7. As one can see from the result, our dense spanning trees identify the key proteins, evidently TP53 in the 8-gene model and SKP2 in the 10-gene model. This is consistent with the biological findings that confirm the importance of these two genes in cell cycle control.
Next, Fig 8 shows the eight regions of the mainland United States. Applying our "optimal conditions" to its graph representation, we find the same densest spanning tree, which "centers" at the Southeast (Fig 9).
The "center position" of the Southeast region on this map and in the corresponding dense spanning tree is rather obvious from the fact that it is adjacent or close to the largest number of regions. This trivial observation, however, does lend support to many observations where the Southeast stands out from the rest of the country. For instance, Table 1   and Prevention [17]. It is easy to see that the Southeast region has the largest infected population.