The authors have declared that no competing interests exist.

Biological networks entail important topological features and patterns critical to understanding interactions within complicated biological systems. Despite great progress in understanding their structure, much more can be done to improve inference and network analysis. Spectral methods play a key role in many network-based applications. Fundamental to spectral methods is the Laplacian, a matrix that captures the global structure of the network. Unfortunately, the Laplacian does not take into account the intricacies of the network’s local structure and is sensitive to noise in the network. These two properties are fundamental to biological networks and cannot be ignored. We propose an alternative matrix

Networks are a representation of choice for many problems in biology and medicine, including protein interactions, metabolic pathways, evolutionary biology, cancer subtyping and disease modeling, to name a few. The key to much of network analysis lies in the spectral decomposition represented by the eigenvectors of the network Laplacian. While possessing many desirable algebraic properties, the Laplacian lacks the power to capture the fine-grained structure of the underlying network. Our novel matrix, Vicus, introduced in this work, takes advantage of the local structure of the network while preserving the algebraic properties of the Laplacian. We show that using Vicus in spectral methods leads to superior performance across fundamental biological tasks such as dimensionality reduction in single cell analysis, identifying genes for cancer subtyping and identifying protein modules in a PPI network. We postulate that in tasks where it is important to take into account local network information, spectral-based methods should use the Vicus matrix in place of the Laplacian.

This is a

Networks are a powerful paradigm for representing relations among objects from micro to macro level. It is no surprise that networks became a representation of choice for many problems in biology and medicine including gene-gene and protein-protein interaction networks [

The traditional formulation of the Laplacian captures the global structure of the matrix, which is often insufficient in biology where local topologies are what needs to be sought and exploited. Moreover, recently algorithms designed to capture the local structure of the data have been shown to significantly outperform global methods [

In this paper we introduce Vicus and compare its performance to the Laplacian across a wide range of tasks. Our experiments include single cell dimensionality reduction, protein module discovery, feature ranking and large scale network clustering. Since we consider such a diverse set of biological questions, in each case we also compare to the state-of-the-art methods appropriate to that question. Spectral clustering using Vicus outperforms competing approaches in all of these tasks. Our experiments show that Vicus is a more robust alternative to the traditional Laplacian matrix for network analysis.

In this section we consider predetermined 2D and 3D structures, represent them as graphs, and analyze the performance of the local Vicus compared to the traditional Laplacian in graph-based dimensionality reduction.

First, let us consider a particular type of protein fold that has a complex structure in which four pairs of antiparallel beta sheets, only one of which is adjacent in sequence, are wrapped in three dimensions to form a barrel shape. This structure known as jelly roll or Swiss roll is particularly common in viral proteins and is schematically depicted in

A shows an example of the structure known as jelly roll or Swiss roll, which is particularly common in viral proteins. B shows five random non-overlapping clusters in 3D space connected by sparsely measured channels. C shows a toroidal helix containing a circle as its basic geometric shape. D is an example of sampling in 3D space where we sample points from a solid bowl-shaped figure non-uniformly: the top of the bowl is more densely sampled, and the sampling density gradually decreases towards the bottom of the bowl. In all cases, Vicus recovers the underlying distribution of the input data more robustly.

Another simulation that we considered is a typical example in bioinformatic imaging, structured 3D data. A schematic of clustered signal within brain regions and connecting channels between them is captured in

A very common structure in protein folding is a helix. Among such foldings are toroidal helices, where the helix is wrapped around a toroid. These structures have a pore in the middle that allows unfolded DNA to pass through. The toroidal helix in

Our final example is the task of sampling in 3D space, such as sampling an image of a cell shape in a cell morphology study. We sampled points from a solid bowl-shaped figure (

These examples show the benefits of capturing local structure in a network (graph) decomposition, which gives a better understanding of patterns and neighborhoods hidden in complex networks.
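For orientation, the baseline pipeline these examples rely on (sample a structure, represent it as a graph, embed it with the traditional Laplacian) can be sketched as follows. This is a minimal illustration, not the authors' code: the Swiss roll parameters, the unweighted k-nearest-neighbour graph, the value of `k`, and all function names are assumptions made for the sketch.

```python
import numpy as np

def swiss_roll(n, seed=0):
    """Sample n points on a 3D Swiss roll; t is the position along the roll."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))   # angle along the spiral
    h = 20.0 * rng.random(n)                        # height across the roll
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
    return X, t

def knn_graph(X, k):
    """Symmetric, unweighted k-nearest-neighbour adjacency matrix (assumed graph type)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                    # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros_like(d2)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, idx.ravel()] = 1.0
    return np.maximum(W, W.T)                       # symmetrise

def laplacian_embedding(W, dim):
    """Eigen-decomposition of L_sym = I - D^{-1/2} W D^{-1/2}."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    # Skip the first eigenvector, which is trivial when the graph is connected.
    return vals, vecs[:, 1:dim + 1]

X, t = swiss_roll(300)
W = knn_graph(X, k=10)
vals, Y = laplacian_embedding(W, dim=2)             # Y is the 2D embedding
```

Plotting `Y` coloured by `t` gives a quick visual check of whether the roll has been unfolded; swapping the matrix passed to the eigen-decomposition is where a local alternative would plug in.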

Single-cell RNA sequencing (scRNA-seq) technologies have recently emerged as a powerful means to measure gene expression levels of individual cells [

We benchmark our method on four recently published single-cell RNA-seq datasets with validated cell populations:

Pollen data set [

Usoskin data set [

Buettner data set [

Kolodziejczyk data set [

The main reason we chose these four single-cell datasets is that their ground-truth labels have been validated either experimentally or computationally in their original studies. We formulate the problem of clustering cells from RNA-seq data in terms of networks. First, cell-to-cell similarity networks (

To demonstrate the representative power of the low-dimensional representations by Vicus, we ran t-SNE [

The four columns show the embedding results for the Buettner, Kolodziejczyk, Pollen, and Usoskin data, respectively. In each dataset, cells are color-coded by their ground-truth labels. Larger separation between different clusters usually indicates better performance of the low-dimensional embedding.

We compare spectral decomposition using Vicus with spectral methods using the traditional global Laplacian, along with eight other popular dimensionality reduction methods. These include linear methods such as Principal Component Analysis (PCA), Factor Analysis (FA), and Probabilistic PCA (PPCA), and nonlinear methods such as multidimensional scaling (MDS), Kernel PCA, Maximum Variance Unfolding (MVU), Locality Preserving Projection (LPP) and Sammon mapping. We use a widely-used toolbox [

Results in

Method | Buettner | Kolodziejczyk | Pollen | Usoskin |
---|---|---|---|---|
PCA | 0.429/0.394 | 0.553/0.539 | 0.946/0.941 | 0.468/0.395 |
FA | 0.337/0.278 | 0.686/0.679 | 0.700/0.558 | 0.135/0.104 |
PPCA | 0.182/0.174 | 0.770/0.727 | 0.922/0.890 | 0.694/0.731 |
MDS | 0.429/0.395 | 0.557/0.543 | 0.931/0.885 | 0.468/0.438 |
Sammon | 0.247/0.232 | 0.434/0.423 | 0.903/0.825 | 0.582/0.551 |
KPCA | 0.286/0.204 | 0.413/0.339 | 0.701/0.692 | 0.268/0.189 |
LPP | 0.314/0.229 | 0.720/0.699 | 0.890/0.801 | 0.632/0.654 |
MVU | 0.247/0.154 | 0.610/0.652 | 0.839/0.690 | 0.226/0.175 |
InfoMap | 0.584/0.274 | 0.688/0.443 | 0.930/0.884 | 0.580/0.257 |
Louvain | 0.731/0.654 | 0.728/0.599 | 0.770/0.643 | 0.603/0.561 |
AP | 0.214/0.135 | 0.712/0.699 | 0.816/0.671 | 0.256/0.172 |
Global Laplacian | 0.271/0.166 | 0.600/0.495 | 0.855/0.790 | 0.592/0.555 |
Vicus | 0.778/0.742 | 0.780/0.719 | 0.934/0.880 | 0.695/0.701 |

One of the major challenges in single-cell analysis is to detect rare populations of cells from noisy single-cell RNA-seq data. The signals of rare populations can easily be missed because of the various sources of noise. Our approach based on the Vicus matrix is able to discover weak signals of rare populations by exploiting local structure, whereas the global Laplacian fails. We applied our method to an scRNA-seq dataset consisting of 2700 peripheral blood mononuclear cells (PBMC). It was generated by the 10x Genomics GemCode platform, a droplet-based high-throughput technique, and 2700 cells with UMI counts were identified by their customized computational pipeline [

A: A 3-D mapping of the low-dimensional representation learned by Vicus. Each cell is colored according to its ground-truth label. The rare population of Megakaryocytes is shown in yellow. B: The top 5 differential genes for each cell type detected by Vicus.

Identification of functional modules in protein-protein interaction (PPI) networks is an important challenge in bioinformatics. Network module detection algorithms can be employed to extract functionally homogeneous proteins. In this application, submodules are first detected and then investigated for enrichment of proteins with a particular biological function. Stability is one of the essential goals of the multi-scale module detection problem [

To analyze the stability of our method we partition a protein-protein interaction (PPI) network, which consists of 7,613 interactions between 2,283 Escherichia coli proteins [

Stability in Panel A indicates the robustness of the community detection algorithms (Vicus vs Laplacian) while Variations in Panel B show how the corresponding algorithm exploits the community membership information in the network.

One of the holy grails of computational medicine is identification of robust biomarkers associated with the phenotype of interest. Here we consider the question of identifying genes associated with cancer subtyping in 5 cancers from 6 microarray datasets. These are benchmark datasets for feature selection in computational biology from

Data Set | # Instances | # Features | # Classes | Attributes |
---|---|---|---|---|
ALLAML | 72 | 7129 | 2 | continuous, binary |
Carcinom | 174 | 9182 | 11 | continuous, multi-class |
GLIOMA | 50 | 4434 | 4 | continuous, multi-class |
leukemia | 72 | 7070 | 2 | discrete, binary |
lung | 203 | 3312 | 5 | continuous, multi-class |
lung-discrete | 73 | 325 | 7 | discrete, binary |

In the standard formulation of spectral clustering, the ranking of features (in this case, genes) is done using the Laplacian Score (LS), a score derived from the network spectrum that is commonly used to rank features in order of their importance and relevance to the clusters. Given a feature

Unfortunately, LS has difficulty identifying features that are relevant to only one of the clusters (a certain local subnetwork) rather than to the whole network. Traditional LS will prefer features that are globally relevant to all the clusters, even if they are not as strongly indicative of any cluster in particular. We thus propose to substitute the Laplacian matrix
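As a point of reference, the classical Laplacian Score can be sketched as below. The sketch follows the widely used textbook definition (degree-weighted centring of each feature, then the ratio of its graph smoothness to its degree-weighted variance), which is assumed to be the one meant here; the function name and the toy data are hypothetical, and the Vicus-based variant described in the text is not reproduced.

```python
import numpy as np

def laplacian_score(X, W):
    """Classical Laplacian Score for every column (feature) of X.

    X : (n_samples, n_features) data matrix
    W : (n_samples, n_samples) symmetric similarity matrix
    Lower scores mark features that vary smoothly over the graph.
    """
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                           # unnormalised Laplacian D - W
    mean = (deg @ X) / deg.sum()                   # degree-weighted feature means
    Xc = X - mean                                  # centred features
    num = np.einsum("ij,ij->j", Xc, L @ Xc)        # f^T L f per feature
    den = np.einsum("ij,ij->j", Xc, deg[:, None] * Xc)  # f^T D f per feature
    return num / den                               # undefined for constant features

# Toy graph: two fully connected groups of three nodes, no links between them.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

# Feature 0 separates the two groups; feature 1 varies inside each group.
X = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]], dtype=float)
scores = laplacian_score(X, W)
ranking = np.argsort(scores)                       # best (smoothest) feature first
```

On this toy graph the group-indicator feature receives the lower score and is ranked first, which is the behaviour the score is designed to have.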

For each data set presented in

Experiments are performed on 6 cancer datasets. On each dataset, we vary the number of selected features (genes) and use k-means to report the clustering accuracy. NMI and ARI are used to measure the goodness of selected features. It is consistently observed across six datasets that Vicus can select better features than Laplacian.

The proposed Vicus matrix for weighted networks exhibits greater power to represent the underlying cluster structure of a network than the traditional global Laplacian. The key observation is that our local spectrum makes the top eigenvectors more robust to noise and to the hyperparameters used in constructing such weighted networks. The proposed Vicus-based local spectrum can supplant Laplacian-based spectral methods for weighted networks in tasks such as clustering, community detection, feature ranking and dimensionality reduction. Sharing similar algebraic properties with the global Laplacian, our local spectrum helps to understand the underlying structure of noisy weighted networks. As demonstrated, the local spectrum is robust with respect to noise and outliers. Finally, we have parallelized Vicus to achieve scalability. While the discussed applications contained at most a few thousand nodes, we have performed experiments on networks with up to 500,000 nodes. On this very large network, Laplacian-based spectral clustering took 7.5 min while Vicus took 12.9 min with better performance (higher NMI). Thus, Vicus is not only more accurate but can also scale to very large networks, a property that will become important as we start constructing, for example, DNA co-methylation probe-based networks with hundreds of thousands of probes.

The power of local network neighborhoods has become abundantly clear in many fields where networks are used. Principled methods are needed to take advantage of the local network structure. In this work we have proposed the Vicus matrix, a new formulation that shares algebraic properties with the traditional Laplacian and yet improves the power of spectral methods across a wide range of tasks necessary to gain a deeper understanding of biological data and the behavior of the cell. Taking advantage of the local network structure, we showed improved performance in single cell RNA-seq clustering, feature ranking for identifying biomarkers associated with cancer subtyping and dimensionality reduction in single cell RNA-seq data. Further, we have shown that our method is amenable to parallelization, which allows it to run in time comparable to the traditional methods.

Suppose we have a network _{ij} represents the weight of the edge between the

Traditional state-of-the-art spectral clustering [^{1}, ^{2}, …, ^{C}] is the set of eigenvectors, capturing the structure of the graph. Eigenvectors associated with the Laplacian matrix of the weighted network are used in many tasks (e.g., face clustering, dimensionality reduction, image retrieval, feature ranking, etc). These eigenvectors suffer from some limitations. For example, the top eigenvectors, in spite of their ability to map the data to a low-dimensional space, are sensitive to noisy measurements and outliers encoded by pairwise similarities (
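For reference, the textbook relaxation of spectral clustering can be written as below. The symbols $W$ (edge weights), $D$ (diagonal degree matrix), $L$, $Q$ and $q^{k}$ are assumed names, and the symmetric normalization is one common choice that may differ from the one used in the original equation.

```latex
\min_{Q \in \mathbb{R}^{n \times C}} \operatorname{Tr}\!\left(Q^{\top} L\, Q\right)
\quad \text{s.t.} \quad Q^{\top} Q = I,
\qquad L = I - D^{-1/2}\, W\, D^{-1/2},
\qquad Q = \left[\, q^{1}, q^{2}, \ldots, q^{C} \,\right]
```

Under this formulation the minimizer is given by the $C$ eigenvectors of $L$ with the smallest eigenvalues, which is why the quality of those eigenvectors drives every downstream task.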

Our Vicus Matrix (

Let our data be a set of points {_{1}, _{2}, …, _{n}}. Then, each vertex _{i}, in the weighted network _{i} and _{i}’s neighbours, _{i}. We constrain the neighbourhood size to be held constant across nodes (i.e.,

Our main assumption is that the labels (such as cluster assignments 1 … ^{th} datapoint (_{i}) can be inferred from the labels of its direct neighbors (_{i}. The similarity matrix associated with the subgraph _{i}. Using the label diffusion algorithm [_{i} represents the normalized transition matrix of ^{i}, i.e., _{k} to estimate _{ik}. _{i})^{−1}, representing label propagation at its final state. Here, _{i} represents the convergence of the label propagation for the datapoint _{i}[1: _{i} and _{i}[_{i}, corresponding to the ^{th} datapoint.
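The "final state" of label propagation mentioned above has a standard closed form, which the following sketch illustrates on a small random network. It assumes the usual diffusion update F ← αPF + (1 − α)Y with a row-normalised transition matrix P; the parameter name `alpha`, its value, and the function names are hypothetical and not taken from the source.

```python
import numpy as np

def diffusion_kernel(W, alpha=0.9):
    """Closed-form limit (1 - alpha) * (I - alpha * P)^{-1} of label propagation.

    P is the row-normalised transition matrix of the similarity matrix W.
    """
    P = W / W.sum(axis=1, keepdims=True)
    n = W.shape[0]
    return (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)

def propagate(W, Y, alpha=0.9, n_iter=500):
    """Iterative propagation F <- alpha * P @ F + (1 - alpha) * Y."""
    P = W / W.sum(axis=1, keepdims=True)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * P @ F + (1.0 - alpha) * Y
    return F

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = (A + A.T) / 2                 # symmetric toy similarity matrix
np.fill_diagonal(W, 0.0)          # no self-similarity
Y = np.eye(6)[:, :2]              # two nodes carry initial labels

beta = diffusion_kernel(W)        # kernel at convergence
F_iter = propagate(W, Y)          # same result obtained by iterating
```

Two properties are worth noting: applying the kernel to the initial labels matches the iterated propagation, and every row of the kernel sums to one, which is the fact used later when arguing that the constant vector is an eigenvector.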

We can construct a matrix

Our objective is to minimize the difference between ^{k}:

Similarly to the original spectral clustering formulation (

Our Vicus Matrix

1. Both matrices are symmetric and positive semi-definite.

2. The smallest eigenvalue of both matrices is 0, the corresponding eigenvector is the constant vector

3. Both matrices have eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$

4. The multiplicity of the eigenvalue 0 of both

Here

For the second property, we first prove that _{i} is the last row of the transition kernel _{i} is all one. Thus we prove

It is then easy to verify that

Hence we proved that matrix
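The first two properties can be checked numerically with a small sketch. The construction below is one reading of the description above and is not the authors' reference implementation: each node's label is rebuilt from its K most similar nodes using the last row of a local diffusion kernel, and the resulting reconstruction matrix `B` is turned into a quadratic form (I − B)ᵀ(I − B). The renormalisation by 1 − β[K, K], the parameter `alpha`, and all names are assumptions.

```python
import numpy as np

def local_reconstruction_matrix(W, K, alpha=0.9):
    """Row i holds weights that rebuild node i's label from its K nearest neighbours."""
    n = W.shape[0]
    B = np.zeros((n, n))
    for i in range(n):
        sims = W[i].copy()
        sims[i] = -np.inf                        # a node is not its own neighbour
        nbrs = np.argsort(sims)[-K:]             # K most similar nodes
        local = np.append(nbrs, i)               # node i is placed last
        Wi = W[np.ix_(local, local)]             # similarity of the local subgraph
        Pi = Wi / Wi.sum(axis=1, keepdims=True)  # local transition matrix
        beta = (1 - alpha) * np.linalg.inv(np.eye(K + 1) - alpha * Pi)
        # Last row: contribution of each neighbour to node i at convergence.
        # Drop the self term and renormalise so the weights sum to one (assumed).
        B[i, nbrs] = beta[K, :K] / (1.0 - beta[K, K])
    return B

def vicus_like(W, K, alpha=0.9):
    """Quadratic form (I - B)^T (I - B) built from the local reconstructions."""
    B = local_reconstruction_matrix(W, K, alpha)
    I = np.eye(W.shape[0])
    return (I - B).T @ (I - B)

rng = np.random.default_rng(1)
A = rng.random((12, 12))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

B = local_reconstruction_matrix(W, K=4)
V = vicus_like(W, K=4)
eigvals = np.linalg.eigvalsh(V)                  # ascending order
ones = np.ones(W.shape[0])
```

Because every row of `B` sums to one, the constant vector is mapped to zero, and because the matrix is a product of the form MᵀM it is symmetric and positive semi-definite, mirroring properties 1 and 2.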

Given a feature set that describes a collection of objects, denoted as _{1}, _{2}, …, _{n}}, we want to construct a similarity network

Throughout the paper, we used Normalized Mutual Information (NMI) [_{p}| and |_{q}| denote the cardinality of the

Details can be found in [
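A compact way to compute NMI from two label vectors is sketched below. The normalisation by the geometric mean of the two entropies is one common convention and is an assumption here, since the exact formula is not legible in the text above; the function name is hypothetical.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalised mutual information with geometric-mean normalisation (assumed)."""
    a_vals, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    b_vals, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    n = a_idx.size
    # Contingency table: counts[p, q] = number of points in cluster p of A and q of B.
    counts = np.zeros((a_vals.size, b_vals.size))
    np.add.at(counts, (a_idx, b_idx), 1)
    pab = counts / n
    pa = pab.sum(axis=1)
    pb = pab.sum(axis=0)
    nz = pab > 0                                   # skip empty cells (0 * log 0 = 0)
    mi = (pab[nz] * np.log(pab[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    if ha == 0 or hb == 0:                         # at least one trivial clustering
        return 1.0 if ha == hb else 0.0
    return mi / np.sqrt(ha * hb)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to how clusters are named, so a relabelled copy of the ground truth scores 1, while a clustering that carries no information about the ground truth scores 0.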

The Adjusted Rand Index (ARI) is another widely-used metric for measuring the concordance between two clustering results. Given two clustering

The (normal) Rand Index (RI) is simply
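Both indices can be computed from pair counts in the contingency table, as in the following sketch. It uses the standard pair-counting definitions (RI as the fraction of point pairs on which the two clusterings agree, ARI as RI corrected for chance); the function names are hypothetical.

```python
import numpy as np
from math import comb

def contingency(labels_a, labels_b):
    """Table of co-occurrence counts between two clusterings."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)
    return table

def rand_indices(labels_a, labels_b):
    """Return (Rand Index, Adjusted Rand Index)."""
    table = contingency(labels_a, labels_b)
    n = int(table.sum())
    total = comb(n, 2)                                          # all point pairs
    same_both = sum(comb(int(x), 2) for x in table.ravel())     # together in A and B
    same_a = sum(comb(int(x), 2) for x in table.sum(axis=1))    # together in A
    same_b = sum(comb(int(x), 2) for x in table.sum(axis=0))    # together in B
    # Agreements: pairs together in both, plus pairs apart in both.
    agree = same_both + (total - same_a - same_b + same_both)
    ri = agree / total
    expected = same_a * same_b / total                          # chance level
    max_index = 0.5 * (same_a + same_b)
    # Undefined (division by zero) when both clusterings are trivial.
    ari = (same_both - expected) / (max_index - expected)
    return ri, ari

ri_same, ari_same = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
ri_diff, ari_diff = rand_indices([0, 0, 1, 1], [0, 1, 0, 1])
```

The second example shows why the adjustment matters: two clusterings that disagree as much as possible still obtain a positive RI, whereas their ARI is negative.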

Given a network on a set of

_{i}: the degree of node _{i} = ∑_{j} _{ij}

Σ: the normalized degree matrix with nonzero values only on the diagonal

_{ij} = 1 if the node _{ij} = 0 otherwise.

Then the stability measure on time _{t} over such a time span.

The variation is defined in terms of the asymptotic stability induced by going from the ‘finest’ to the ‘next finest’ partitions is:
_{2} is the normalized Fiedler eigenvector with its corresponding eigenvalue _{2}. We refer the mathematical details in deriving these two definitions to [

There are mainly three hyper-parameters in Vicus: first the number of neighbors

The proposed Vicus is very robust to the choice of

We also want to emphasize that, when performing clustering tasks, Vicus does not specify the number of clusters, since Vicus only provides a new form of Laplacian that captures local structures in the network. In our single-cell experiments, we supply the true number of clusters only to the clustering algorithm (i.e., K-means).

Panel A shows the underlying ground-truth network heatmap consisting of 3 connected components. Given this perfect network, we manually add random noise. The random noise is generated from uniform distribution between [0,
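The setup of this simulation can be mimicked with a short sketch that also shows why noise is a problem for the global spectrum. The block sizes and the noise bound `eps` are hypothetical choices for illustration, since the actual noise range is not legible above; the check uses the standard fact that the number of zero eigenvalues of the unnormalised Laplacian equals the number of connected components.

```python
import numpy as np

def block_network(sizes):
    """Ideal network: fully connected blocks with no edges between them."""
    n = sum(sizes)
    W = np.zeros((n, n))
    start = 0
    for s in sizes:
        W[start:start + s, start:start + s] = 1.0
        start += s
    np.fill_diagonal(W, 0.0)
    return W

def add_uniform_noise(W, eps, seed=0):
    """Add symmetric noise drawn uniformly from [0, eps] to every edge weight."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(0.0, eps, size=W.shape)
    noise = (noise + noise.T) / 2                 # keep the network symmetric
    np.fill_diagonal(noise, 0.0)
    return W + noise

def zero_eigenvalue_count(W, tol=1e-8):
    """Multiplicity of eigenvalue 0 of the unnormalised Laplacian D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return int((np.linalg.eigvalsh(L) < tol).sum())

clean = block_network([5, 5, 5])                  # 3 connected components
noisy = add_uniform_noise(clean, eps=0.1)         # eps is a hypothetical noise level
n_clean = zero_eigenvalue_count(clean)
n_noisy = zero_eigenvalue_count(noisy)
```

In the clean network the three components show up directly as three zero eigenvalues; once every pair of nodes receives a small positive weight the network becomes connected and only one zero eigenvalue remains, so the cluster structure has to be recovered from the eigenvectors instead.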


The first column shows the ground truth of the data distribution. Panel A is the 3D scatter plot of the data points used in the experiment. Panel E shows the corresponding 2D ground-truth distribution that generated the data. This is also the desired output of the low-dimensional embedding that we want to recover. Panels B-D show the results of low-dimensional embedding by the Laplacian, while Panels F-H show those of Vicus, using different values of the hyper-parameters.


We apply Vicus on the Buettner data set of single-cell RNA-seq. Panel A shows both NMI and ARI with different choices of number of neighbors
