Fig 1.
Variance can be used as a proxy for unsupervised feature selection but does not take into account the underlying metric structure of the data.
In the figure a point cloud is labeled according to two binary features with equal variance. However, only the feature on the right shows a large degree of consistency with the metric structure of the data. Manifold learning approaches to feature selection prioritize features according to the degree of consistency with the underlying metric structure of the data.
Fig 2.
Summary of various related approaches to unsupervised feature selection and extraction, highlighting the concepts introduced in this paper.
Fig 3.
Simplicial complexes provide topological representations of a space.
They are generalizations of graphs containing higher dimensional elements such as triangles, tetrahedra, etc. In some cases, higher dimensional elements in a simplicial complex can convey information of a point cloud that is not captured by the underlying graph. For example, in (a) a Čech complex is constructed from intersections of fixed-radius balls centered at the points of a point cloud, which in this example are ordered according to the horizontal coordinate. Simplicial complexes enable the application of co-homological techniques to point cloud features (that is, to functions defined over the elements of the point cloud). Features can be defined over individual points (b, top), pairs of points (b, bottom), triplets, etc. Co-homological techniques, like those discussed in this work, can rank and classify point cloud features according to their amount of localization along topological structures (disconnected components, loops, cavities, etc.) of the underlying simplicial complex. In (c), two examples of point cloud features localized along a topologically trivial region (left) or a non-contractible loop (right) are shown.
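The construction described in this caption can be illustrated with a minimal sketch. It builds a Vietoris-Rips complex (the construction used in later figures) rather than the Čech complex of panel (a), because the Vietoris-Rips condition only requires pairwise distances. The function name `vietoris_rips`, the parameter `epsilon` (taken here as a threshold on pairwise distance), and the toy point cloud are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations

import numpy as np


def vietoris_rips(points, epsilon, max_dim=2):
    """Return {dimension: list of simplices} for a small point cloud.

    A set of k+1 points spans a k-simplex when every pair of its points
    lies within distance `epsilon` (assumed convention: threshold on the
    pairwise distance, not on the ball radius).
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Dense pairwise Euclidean distances; adequate for small examples only.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    simplices = {0: [(i,) for i in range(n)]}
    for k in range(1, max_dim + 1):
        simplices[k] = [
            s
            for s in combinations(range(n), k + 1)
            if all(dist[a, b] <= epsilon for a, b in combinations(s, 2))
        ]
    return simplices


# Four corners of a unit square (hypothetical toy data).
square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
loop = vietoris_rips(square, epsilon=1.1)    # sides only: a hollow loop
filled = vietoris_rips(square, epsilon=1.5)  # diagonals added: loop is filled
```

At the smaller scale the complex contains four edges and no triangles, so it carries a non-contractible loop that the triangles at the larger scale fill in. This brute-force enumeration is exponential in the simplex dimension and is meant only to make the definition concrete.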
Fig 4.
Combinatorial Laplacian score for a set of binary 1-point (a) and 2-point (b) features on a Čech complex. Features in this example can take values 0 (blue) or 1 (red). The 0-dimensional combinatorial Laplacian score was computed for each of the features shown in (a). These features assign values to the nodes in the complex. The 0-dimensional score can be expressed as a sum over edges (S1 Note), where edges that connect vertices on which the feature takes a high value contribute little. Features that take high values on highly connected regions of the simplicial complex therefore have lower values of the 0-dimensional score than features that take high values on disconnected nodes. Analogously, the 1-dimensional combinatorial Laplacian score was computed for each of the features shown in (b). In this case, features are evaluated on the edges of the simplicial complex. Features that take high values on edges that form non-contractible loops in the simplicial complex have lower values of the 1-dimensional score than features that take high values on disconnected or contractible paths.
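The edge-sum behavior of the 0-dimensional score can be sketched on the 1-skeleton of a complex. The code below is a simplified stand-in, not the paper's definition: it uses a degree-normalized graph Laplacian quadratic form in the style of the classical Laplacian score, with unit edge weights, and the exact normalization given in S1 Note may differ. The name `laplacian_score_0d` and the toy data are hypothetical.

```python
import numpy as np


def laplacian_score_0d(points, feature, radius):
    """Graph Laplacian score of a node feature (illustrative sketch).

    Nodes are joined by an edge when their distance is at most `radius`.
    The score is sum over edges (f_i - f_j)^2 divided by
    sum over nodes deg_i * f_i^2, after removing the degree-weighted mean.
    """
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]
    f = np.asarray(feature, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    adj = (dist <= radius).astype(float)
    np.fill_diagonal(adj, 0.0)
    deg = adj.sum(axis=1)
    if deg.sum() == 0:
        raise ValueError("graph has no edges at this radius")
    # Center the feature with respect to the degree-weighted mean.
    f_c = f - (f @ deg) / deg.sum()
    denominator = np.sum(deg * f_c ** 2)
    if denominator == 0:
        raise ValueError("feature is constant on the connected nodes")
    i, j = np.nonzero(np.triu(adj, k=1))  # each undirected edge once
    numerator = np.sum((f_c[i] - f_c[j]) ** 2)
    return numerator / denominator


# Two well-separated groups of four points on a line (toy data).
cloud = np.array([0, 1, 2, 3, 10, 11, 12, 13], dtype=float)
localized = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # high on one connected group
scattered = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # same variance, no structure

score_localized = laplacian_score_0d(cloud, localized, radius=1.5)
score_scattered = laplacian_score_0d(cloud, scattered, radius=1.5)
```

Both toy features have the same variance, yet the feature concentrated on one connected group receives the lower score, because no edge joins nodes with different values.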
Fig 5.
Feature selection on the MNIST dataset using the combinatorial Laplacian score.
Each sample consists of a grey-scale image of a hand-written digit from 0 to 9 and each pixel represents a feature. The degree of localization of each feature in the simplicial complex is assessed using the 0- and 1-dimensional combinatorial Laplacian scores. (a) Number of rejected null hypotheses at an FDR of 0.05 as a function of the radius ε of the balls in the Vietoris-Rips complex. In this example, the statistical power of the 0- and 1-dimensional scores is maximized for ε ≈ 0.7. The significance of the scores was determined through a permutation test where the pixels were randomized 5,000 times. (b) Vietoris-Rips complex colored by the intensity of a pixel that is significant under the 0-dimensional score (q-value < 0.005) but not under the 1-dimensional score (q-value = 0.5). The intensity of the pixel is high in a densely connected, topologically trivial region of the complex containing images of the digit `4`. Images associated with several nodes are shown for reference, with the pixel highlighted in red. (c) Vietoris-Rips complex colored by the intensity of a pixel that is significant under both the 0-dimensional score (q-value < 0.005) and the 1-dimensional score (q-value < 0.005). The intensity of the pixel is high in a densely connected region that surrounds a large non-contractible cycle of the simplicial complex. The cycle is generated by images that belong to the sequence of digits `7`, `3`, `1`, `9`. Images associated with several nodes along the cycle are shown for reference, with the pixel highlighted in red.
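The significance assessment mentioned in this caption (a permutation test followed by FDR control) can be sketched as follows. This is a minimal illustration under stated assumptions: feature values are shuffled across samples, low scores are treated as significant, and q-values are obtained with the Benjamini-Hochberg procedure. The caption does not specify these details, and the helper names `permutation_pvalue`, `bh_qvalues`, and `path_score` are hypothetical.

```python
import numpy as np


def permutation_pvalue(score_fn, feature, n_perm=1000, seed=0):
    """One-sided permutation p-value for a score where lower is better."""
    rng = np.random.default_rng(seed)
    f = np.asarray(feature, dtype=float)
    observed = score_fn(f)
    # Null distribution: the same values assigned to shuffled samples.
    null = np.array([score_fn(rng.permutation(f)) for _ in range(n_perm)])
    # The +1 terms keep the estimate strictly positive.
    return (1 + np.sum(null <= observed)) / (1 + n_perm)


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


def path_score(f):
    """Toy stand-in score: edge sum on a path graph joining neighbors."""
    return float(np.sum(np.diff(f) ** 2))


localized = np.array([1] * 10 + [0] * 10)  # one block of high values
alternating = np.array([1, 0] * 10)        # no spatial coherence

p_localized = permutation_pvalue(path_score, localized, n_perm=500)
p_alternating = permutation_pvalue(path_score, alternating, n_perm=500)
q_values = bh_qvalues([p_localized, p_alternating])
```

In practice `path_score` would be replaced by the score computed on the simplicial complex, and one p-value would be computed per pixel before the q-values are derived jointly.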
Table 1.
Running times of the 0- and 1-dimensional combinatorial Laplacian scores for several simplicial complexes of different sizes on a standard 8-core desktop computer.
Fig 6.
Differential expression analysis using the combinatorial Laplacian score.
(a) ROC curves for three simulated scRNA-seq datasets with 10% of the genes being differentially expressed between two populations of cells. The 0-dimensional combinatorial Laplacian score, DESeq2, edgeR, MAST, and variance were used to identify differentially expressed genes. Conventional methods (DESeq2, edgeR, and MAST) additionally took as input a list of labels assigning each cell to the corresponding population. (b) Analysis of scRNA-seq data of 24,911 T-cells infiltrating lung tumors and adjacent tissues. Genes were ranked according to their 0-dimensional combinatorial Laplacian score. t-SNE plots color-coded for expression (grey to red) are shown for some of the top differentially expressed genes identified by this method. For reference, the expression of a gene with a high combinatorial Laplacian score (TMEM14C) is also displayed. To facilitate interpretation, we use the same t-SNE embedding as in [23].
Fig 7.
Differential expression analysis of alternative paths of differentiation using the 1-dimensional combinatorial Laplacian score.
(a) The 0- and 1-dimensional combinatorial Laplacian scores were run over the scRNA-seq expression data of 3,582 cells from the differentiation of mESCs into MNs using the standard (SP) and direct programming (DP) protocols. The scatter plot represents the 0- and 1-dimensional combinatorial Laplacian scores of 10,063 genes. A k-nearest neighbor graph (k = 4) color-coded for expression (grey to red) is shown for some of the top differentially expressed genes identified by this method. Genes with low values of the 1-dimensional score have upregulated expression along the cycle spanned by the two alternative paths of differentiation. For reference, the expression of a gene with high 0- and 1-dimensional combinatorial Laplacian scores (Pole3) is also displayed. (b) The 0- and 1-dimensional combinatorial Laplacian scores were run over the scRNA-seq expression data of 24,930 epithelial cells from the developing mouse (development stages E9.5-E13.5). Apical ectodermal ridge cells form a transient state that branches out and in from epithelial cells. The expression of genes with low values of the 1-dimensional score localizes along the cycle spanned by the two alternative paths of differentiation. Examples of the expression of genes with low 0- and 1-dimensional combinatorial Laplacian scores, low 0- and high 1-dimensional combinatorial Laplacian scores, and high 0- and 1-dimensional combinatorial Laplacian scores are shown (left). Gene set enrichment analysis (right) shows that genes with a low 1-dimensional combinatorial Laplacian score are strongly enriched for genes that are upregulated in the alternative path of differentiation spanned by apical ectodermal ridge cells.
Fig 8.
Analysis of multi-modal genomic data using the univariate and bivariate combinatorial Laplacian scores.
(a) The 0-dimensional Laplacian score was used to identify genes that are differentially expressed across spatial directions in a section of the murine somatosensory cortex using single-molecule FISH data. Gene expression levels were evaluated using the combinatorial Laplacian score based on the spatial dimensions. Genes with a low 0-dimensional score have significant spatial patterns of expression. For reference, the expression patterns of two genes with a low 0-dimensional score (Lamp5 and Kcnip2) and two genes with a high 0-dimensional score (Ttr and Pdgfra) are shown. Cells are represented by means of a Voronoi tessellation and are color-coded according to the expression level of the gene (blue: low; red: high). (b) Analysis of spatial relations among cell populations using the bivariate combinatorial Laplacian score. The 0-dimensional bivariate combinatorial Laplacian score was computed for each pair of cell populations in the murine somatosensory cortex dataset using the spatial dimensions to build the Vietoris-Rips simplicial complex. The significance of the score was estimated for each pair of cell populations by randomization and is shown in the heatmap. Cells from pairs of populations that have a significant score often appear adjacent to each other in the spatial dimensions. (c) Analysis of whole exome and mRNA sequencing data of 667 low-grade glioma and glioblastoma tumors of The Cancer Genome Atlas (TCGA) using the 0-dimensional Laplacian score. A Vietoris-Rips complex was built based on the gene expression data. Binary vectors indicating whether a gene is non-synonymously mutated or not were taken as features. The 0-dimensional Laplacian score identifies genes whose mutation is associated with consistent global expression patterns. A UMAP representation based on the expression data is shown for some representative genes with low (IDH1, EGFR, and RB1) or high (ACAN) values of the 0-dimensional score, color-coded according to the somatic mutation status of the gene (orange: non-synonymously mutated; grey: non-mutated or synonymously mutated).