Data Mechanics and Coupling Geometry on Binary Bipartite Networks

We quantify the notion of pattern and formalize the process of pattern discovery under the framework of binary bipartite networks. Patterns of particular focus are interrelated global interactions between clusters on the row and column axes of the network. A binary bipartite network is built into a thermodynamic system embracing all up-and-down spin configurations defined by product-permutations on rows and columns. This system is equipped with its ferromagnetic energy ground state under an Ising model potential. Such a ground state, also called a macrostate, is postulated to congregate all patterns of interest embedded within the network data in a multiscale fashion. A new computing paradigm for indirectly searching for such a macrostate, called Data Mechanics, is devised by iteratively building a surrogate geometric system with a pair of nearly optimal marginal ultrametrics on the row and column spaces. The coupling measure minimizing the Gromov-Wasserstein distance between these two marginal geometries is also seen to be in the vicinity of the macrostate. The resultant coupling geometry reveals multiscale block pattern information that characterizes multiple layers of interacting relationships between clusters on the row and column axes. It is the nonparametric information content of a binary bipartite network. This coupling geometry is then demonstrated to shed new light on, and bring resolution to, interaction issues in community ecology and in gene-content-based phylogenetics. Its implied global inferences are expected to have high potential in many scientific areas.


$\frac{16!}{k!(16-k)!}$ many vector-nodes for all $k = 1, \ldots, 16$. Therefore the $2^{16}$ vector-nodes form a homogeneously connected graph with constant degree 16 for all nodes. This graph is denoted as $G_H[2^{16}]$. The homogeneity indicates that each vector has the same neighboring relations as all the others, so there is no clear geometric structure on the space $\{0, 1\}^{16}$ under Hamming distance. On the other hand, it is easy to imagine that if the majority of vector-nodes were removed completely at random from $G_H[2^{16}]$, along with the edges connecting to them, the remaining graph would fall into many isolated pieces.
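The counting behind this homogeneity can be checked with a minimal sketch. It assumes each 16-dim binary vector is encoded as a Python integer (an encoding chosen here for illustration, not taken from the paper); the helper name `hamming_neighbors` is likewise hypothetical.

```python
from math import comb

N_BITS = 16  # dimension of the binary vectors discussed in the text


def hamming_neighbors(v: int, n_bits: int = N_BITS) -> list[int]:
    """Return all vectors at Hamming distance one from v (flip each bit once)."""
    return [v ^ (1 << i) for i in range(n_bits)]


# Number of vectors at Hamming distance k from any fixed vector: 16! / (k! (16-k)!)
shell_sizes = [comb(N_BITS, k) for k in range(1, N_BITS + 1)]

# The fixed vector plus all of its distance shells exhaust the whole space.
total_nodes = 1 + sum(shell_sizes)
```

Every vector has exactly 16 neighbors regardless of which vector is chosen, which is the constant-degree property stated above.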
Returning to the phylogeny data set, among the 8581 genes there are 481 distinct 16-dim binary vectors. Consider the subgraph of $G_H[2^{16}]$ consisting of these 481 vector-nodes and the edges among them. We term this subgraph $G_H[481]$. Suppose that $(2^{16} - 481)$ vector-nodes were removed completely at random from $G_H[2^{16}]$, or in other words, that 481 vector-nodes were randomly selected together with their connectivity. Then $G_H[481]$ would be expected to be highly disconnected, in the sense of having only cliques of rather small size. This is because the probability of randomly selecting a 16-dim binary vector at Hamming distance exactly one from a given vector is $16/2^{16}$, which is about 0.00024. However, contrary to this expectation, the real network $G_H[481]$ retains a giant clique of size 379. A hierarchical clustering tree of these 481 vector-nodes with Hamming distance via single linkage is shown in Fig. S1(a).
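The arithmetic behind this expectation can be written out as a short sketch. The expected-degree figure is an added illustration, not a number from the paper; it assumes the 481 vector-nodes are drawn uniformly without replacement.

```python
N_BITS = 16
N_SELECTED = 481

total = 2 ** N_BITS  # 65536 possible 16-dim binary vectors

# Probability that a uniformly drawn vector is at Hamming distance one
# from a given vector, as stated in the text: 16 / 2^16.
p_neighbor = N_BITS / total

# Assumption: the other 480 selected nodes are drawn without replacement
# from the remaining 2^16 - 1 vectors. Each of the 16 neighbors of a given
# node is then selected with probability 480 / (2^16 - 1).
expected_degree = (N_SELECTED - 1) * N_BITS / (total - 1)

print(f"p_neighbor = {p_neighbor:.6f}")
print(f"expected_degree = {expected_degree:.4f}")
```

Under this assumption a randomly selected node has on average far fewer than one neighbor among the other selected nodes, which is why large connected groups are not expected by chance.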
This simple analysis reveals that the selection of the 8581 genes must have involved a highly structured mechanism for such a giant clique to be present in $G_H[481]$. To support this statement, we simulate, as an illustration, a random network with 481 vector-nodes drawn with uniform probability on $\{0, 1\}^{16}$, denoted as $\hat{G}^S_H[481]$. A hierarchical clustering tree of $\hat{G}^S_H[481]$ reveals a striking contrast by having no cliques of size more than 3, as shown in Fig. S1(b).
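A simulation of this kind can be sketched as follows. Several choices here are assumptions rather than the authors' procedure: vectors are encoded as integers, the 481 vectors are sampled without replacement so that they are distinct, the seed is arbitrary, and a "clique" is read as a group of vector-nodes connected through Hamming-distance-one links (the groups that single linkage merges at distance 1). The helper name `component_sizes` is hypothetical.

```python
import random

N_BITS = 16


def component_sizes(vectors, n_bits=N_BITS):
    """Sizes of connected groups when nodes at Hamming distance one are linked."""
    remaining = set(vectors)
    sizes = []
    while remaining:
        stack = [remaining.pop()]  # start a new group from any unvisited node
        size = 0
        while stack:
            v = stack.pop()
            size += 1
            for i in range(n_bits):
                w = v ^ (1 << i)  # neighbor obtained by flipping bit i
                if w in remaining:
                    remaining.remove(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
simulated = rng.sample(range(2 ** N_BITS), 481)
sizes = component_sizes(simulated)
print("largest groups:", sizes[:5])
```

Applying the same function to the 481 observed gene-content vectors would give the group sizes to compare against the simulated ones.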
Given that the network $G_H[481]$ does not afford clear structural information among the 481 distinct vectors, the giant clique seems to offer an interesting base for further investigation. We propose breaking the giant clique into pieces and then looking into each sizeable piece. To do so, we further remove, on empirical grounds, all vector-nodes among the 481 whose total number of presences is less than 4 or larger than 13, since they are not informative for differentiating species. This removal scheme results in 312 vector-nodes, 286 of which come from the giant clique. The resultant hierarchical clustering tree is shown in Fig. S1(c). The tree level at distance 1 (the bottom level) reveals three large cliques (marked A through C) and three smaller ones. It is not surprising that each of these six cliques reveals rather strong and distinctly visible characteristics that can differentiate between the three species groups: HL, LL, and Syn.
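The removal rule can be expressed as a small filter. This is a minimal sketch assuming integer-encoded vectors; the function name `is_informative` and the example vectors are hypothetical, while the thresholds 4 and 13 come from the text.

```python
def is_informative(v: int, low: int = 4, high: int = 13) -> bool:
    """Keep a vector only if its number of presences (1-bits) lies in [low, high]."""
    presences = bin(v).count("1")
    return low <= presences <= high


# Hypothetical example vectors, not taken from the data set.
examples = [0b0000000000000111,   # 3 presences: removed
            0b0000000000001111,   # 4 presences: kept
            0b0011111111111111]   # 14 presences: removed
kept = [v for v in examples if is_informative(v)]
```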
These gene cliques, which differentiate the three species groups HL, LL, and Syn, bring out why constructing a realistic geometry on the gene dimension is crucial for phylogeny. To make the gene-species relational coupling geometry more explicit, we elaborate as follows. With respect to all genes, each species' gene content is an 8581-dim vector (or a 312-dim vector if only genes with distinct and informative gene content are counted). Using Hamming distance as the initial distance measure, the 16 species vectors give rise to a distance matrix, from which an initial phylogenetic tree is derived by the DCG algorithm. A modified version of Hamming distance is then derived for the genes' 16-dim vectors, and a DCG-based ultrametric tree on genes is derived. In this fashion, the iterative procedure is implemented to derive data-driven distance measures for the species and the genes. The resultant trees for species and genes are then used to rearrange the gene-content matrix (i.e. presence/absence of genes) to reveal multiscale block structural information. Therefore a phylogenetic tree is not the only goal in phylogeny. The ultimate goal is the coupling geometry constituted by the gene vs. species relational patterns. Ideally we obtain information on which gene clusters play which roles in differentiating among the species groups HL, LL, and Syn, and which core gene clusters can differentiate species within each of the three species groups. Therefore it is extremely important that we can systematically derive such a gene vs. species relational coupling geometry. Besides avoiding unrealistic assumptions on speciation, our coupling-geometry-based phylogeny offers an advantageous platform for identifying functional roles of gene clusters in phylogeny.
Figure S1: Exploratory hierarchical clustering tree with Hamming distance and single linkage.
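The first step of the iterative procedure, computing Hamming distances between species' gene-content vectors, can be sketched as below. The toy presence/absence matrix is hypothetical, and the sketch stops at the distance matrix: the DCG algorithm and the modified Hamming distance on genes are not specified in this section, so they are not reproduced here.

```python
def hamming_distance_matrix(rows):
    """Pairwise Hamming distances between equal-length 0/1 vectors."""
    n = len(rows)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(a != b for a, b in zip(rows[i], rows[j]))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


# Hypothetical toy matrix: 3 species (rows) by 5 genes (columns).
toy_gene_content = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
species_dist = hamming_distance_matrix(toy_gene_content)
```

In the setting described above, the rows would be the 16 species and the columns the genes; the resulting 16-by-16 matrix is what a tree-building step would take as input.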