
Sparse representation learning derives biological features with explicit gene weights from the Allen Mouse Brain Atlas

Abstract

Unsupervised learning methods are commonly used to detect features within transcriptomic data and ultimately derive meaningful representations of biology. The contribution of individual genes to any feature, however, becomes convolved with each learning step, requiring follow-up analysis and validation to understand what biology might be represented by a cluster on a low-dimensional plot. We sought learning methods that could preserve the gene information of detected features, using the spatial transcriptomic data and anatomical labels of the Allen Mouse Brain Atlas as a test dataset with verifiable ground truth. We established metrics for accurate representation of molecular anatomy and found that sparse learning approaches were uniquely capable of generating anatomical representations and gene weights in a single learning step. Fit to labeled anatomy was highly correlated with intrinsic properties of the data, offering a means to optimize parameters without established ground truth. Once representations were derived, complementary gene lists could be further compressed to generate a low complexity dataset, or to probe for individual features with >95% accuracy. We demonstrate the utility of sparse learning as a means to derive biologically meaningful representations from transcriptomic data and reduce the complexity of large datasets while preserving intelligible gene information throughout the analysis.

Introduction

Dimensionality reduction, manifold learning, and clustering are essential methods for processing transcriptomic data into “representations” of biology interpretable to the human mind [1]. A typical workflow can include Principal Component Analysis (PCA) [2], graph embedding with t-Distributed Stochastic Neighbor Embedding (t-SNE) [3] or Uniform Manifold Approximation and Projection (UMAP) [4], and separation by Leiden clustering [5]. This approach is robust, but presents several weaknesses: initial dimensionality reduction filters for high-variance global trends over localized features, offering low sensitivity to rare or low-expressing genes [6]. Manifold learning methods such as UMAP and t-SNE are, conversely, more sensitive to local structures in the data, but do not report the weights of input genes [7]. Clustering convolves noise within informative groupings [8]. Overall, each learning step further abstracts the representation from the data, such that the contribution of any one gene to a cluster is not explicit.

Well-described and unique markers solve most of these issues, offering the essential ground truth to connect points on a graph to biological states and a starting point for Gene Ontology (GO) analysis with experimental validation. Many representations lack a unique marker, however, and can only be described by the relative expression levels of common genes. Transcriptomic data lacking cellular resolution cannot be fully described by a singular marker even if one were present [9]. In these cases, each step of an analysis pipeline further obscures what a cluster might represent, requiring supervised learning approaches and careful validation to parse gene information.

Given the incredible advances in applied information theory [10], we asked if an unsupervised learning method could derive localized features within a dataset, while retaining actionable information about the inputs. Essentially, we asked if it was possible to generate representations of biology from transcriptomic data, in a single step, without implicit priors. For a more advanced approach to dimensionality reduction and clustering, we explored “sparse” learning methods, optimized for building localized features from a minimal number of input elements sparsely distributed within a large dataset [11].

To test whether derived features were representations of biology and not just the isolation of a unique marker gene, we used a dataset with well-described biological phenomena as a testable ground truth. The Allen Mouse Brain Atlas (AMBA) was the most comprehensive spatial transcriptomic dataset of the whole brain available at the time of this analysis. What makes AMBA the optimal test data for our methods is that every voxel of transcriptomic data has been registered to the Common Coordinate Framework (CCF) and labeled as an element of anatomy based on thousands of person-hours of labor [12]. The low resolution (200 μm isotropic) of AMBA convolves multiple cell types, making single gene markers uncommon and an incomplete descriptor of any element of neuroanatomy [13]. While the developmental genes involved in forming different brain structures are mostly inactive in the adult mouse brain, gene expression can still functionally define anatomy [13]. These data therefore offer a means to test the ability of a learning method to describe established elements of anatomy with derived signatures of gene expression [13, 14].

We surveyed previously established methods and new advances to find that learning methods with sparsity constraints offered the unique potential to return biologically meaningful representations without the need for secondary clustering steps. With some parameter tuning, all sparse representation learning methods tested for dimensionality reduction could generate anatomical features with a corresponding ranked list of genes. Moreover, gene lists could be compressed to a minimal ensemble of markers for each element of anatomy while retaining high fidelity to ground truth. Overall, sparse learning methods offer a means to derive biologically representative features and descriptive minimal gene lists from transcriptomic data.

Results

To survey representation learning methods and match previous analyses, we first established metrics for comparing representations derived from the high-fidelity coronal In Situ Hybridization (ISH) data for 2,941 transcripts in AMBA against the ground truth anatomical information of CCF, testing the principles behind the algorithms that generated them [12]. Because anatomy is presented as labels, we initially applied secondary clustering using K-means for all methods so that our results were directly comparable to previously published studies [13, 15]. All analyses were done on the whole brain volume, with 100 representations derived using each dimensionality reduction method surveyed here to match previous studies, then compared against CCF volumetrically. CCF labels 574 unique brain structures at 200 μm isotropic resolution. Because many smaller brain regions are only a few voxels large, even if every anatomical region were perfectly defined by transcription, biological variance, resolution, and overfitting artifacts would introduce significant error; this error was unavoidable using unsupervised methods, but common to all approaches.

Mutual information based feature learning with clustering derives low resolution representations of anatomy from the AMBA dataset

First postulated in information theory [16], the maximum information principle presents a model of information transfer that has since been used to filter complex data [17]. InfoMax-based algorithms build representations by separating low Mutual Information (MI), or independent, components of a set, and grouping high-MI, or mutually interdependent, ones. InfoMax is central to independent component analysis (ICA) and a commonly used objective in many other machine learning methods, such as variational autoencoders [18, 19]. ICA was recently used on similarly low resolution brain-wide spatial transcriptomic data, so we asked if ICA components would recapitulate anatomy in AMBA in a manner similar to sequencing-based approaches [14].

ICA generated components that were broadly active across the brain, but with areas of high intensity that appeared anatomical (Fig 1A). To filter for highly active areas and parcellate the data into labels, we applied K-means clustering, resulting in representations that were more visually similar to anatomy, though less detailed (Fig 1B). To quantify how well clustered representations fit anatomy, we used Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) to test whether a region labeled by clustering was co-labeled by anatomy, then generated scores from 1 to 550 K-means clusters to look for an optimal fit to ground truth. Peak fitting occurred at 300 clusters by AMI (Fig 1C) and 150 by ARI (Fig 1D), well below the 574 labeled elements of anatomy in CCF, but more than the 181 clusters derived using ICA/K-means on a less comprehensive whole brain transcriptome [14].
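For illustration, a minimal sketch of this pipeline with scikit-learn, assuming the preprocessed voxel × gene matrix E and per-voxel CCF labels ccf_labels are already loaded as NumPy arrays (both names are ours, not part of the AMBA tooling):

```python
# Minimal sketch (not the exact analysis code): ICA components, K-means
# parcellation, and fit to CCF labels. `E` is the z-scored voxel x gene matrix
# and `ccf_labels` holds one anatomical label per voxel (assumed names).
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

components = FastICA(n_components=100, random_state=0).fit_transform(E)

scores = {}
# K = 1..50, then 100..550 in steps of 50 (K = 50 is already covered above).
for k in list(range(1, 51)) + list(range(100, 551, 50)):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(components)
    scores[k] = (adjusted_mutual_info_score(ccf_labels, labels),
                 adjusted_rand_score(ccf_labels, labels))
best_k_by_ami = max(scores, key=lambda k: scores[k][0])
```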

Fig 1. Representation learning on AMBA with ICA.

In a whole-brain analysis of 100 ICA-derived representations: (a) A selected representation shows activity across the brain with a bright, anatomical-looking region. (b) K-means clustering of the ICA representations appears more selectively anatomical (K = 200). (c) AMI and (d) ARI of clustered components, compared to neuroanatomy, using K-means clustering with K ranging from 1 to 50 and from 50 to 550 with a step of 50. (e) 3D representation of K-means clusters with selected top overlapping clusters (in blue) with brain regions (Olfactory Tubercle, Reticular Nucleus of the Thalamus, and Pontine Gray) in red, and low overlapping clusters (in green) with brain regions (Basolateral Amygdalar Nucleus, posterior part, and Main Olfactory Bulb) in red, using the Dice similarity coefficient.

https://doi.org/10.1371/journal.pone.0282171.g001

Visual inspection showed that the clustered representations with the poorest fits to anatomy, by Sørensen–Dice similarity coefficient, appeared at interfaces like the olfactory bulb and nucleus, or within a single coronal slice, the orientation of the high-fidelity ISH experiments of AMBA used in this analysis (Fig 1E). Similar artifacts were previously reported with ICA of spatial transcriptomic data, requiring manual curation before subsequent clustering and parcellation by supervised learning methods [14]. Without applying these additional steps, ICA/K-means derived representations offered the lowest similarity to AMBA labels of any method tested.

Variance based feature learning with clustering generates improved representations of anatomy, with no advantage to nonlinear models

The earliest analysis of the AMBA data used Singular Value Decomposition (SVD) and K-means clustering to generate representations that resembled anatomy [13]. SVD is the basis for PCA, which similarly projects the data to a subspace of maximal variance for evaluation of a subset of components. We again derived 100 representations for uniform comparison with the other methods. PCA of AMBA generated components with high intensity areas that visually recapitulate anatomy, but are active across most of the brain, similar to ICA (Fig 2A). K-means clustering on the full ensemble of PCA components parcellated brain voxels into finer and more anatomical-looking representations of anatomy than ICA (Fig 2B).

Fig 2. Linear and nonlinear PCA generate equivalent representations of anatomy from AMBA.

(a) First PCA component representation in a sagittal slice. (b) Representation of K-means clustering (K = 200) of PCA components in a coronal slice of the left hemisphere. (c) First KPCA component with Quadratic kernel function in a sagittal slice. (d) Representation of K-means clustering (K = 200) of KPCA components with Quadratic kernel function in a coronal slice of the left hemisphere. (e, f) Adjusted Mutual Information and Adjusted Rand Index of clustered components using K-means clustering with K ranging from 1 to 50 and from 50 to 550 with a step of 50.

https://doi.org/10.1371/journal.pone.0282171.g002

Biological relationships are not always linear, so we asked if a nonlinear approach could more accurately reflect underlying biology. Unlike the eigendecomposition of PCA, Kernel PCA (KPCA) components are fit to a preselected kernel function. Parameter tuning across multiple nonlinear transformations, however, did not produce visually different components from those generated by PCA (Fig 2C). Applying K-means to KPCA components produced similar parcellations to PCA/K-means (Fig 2D). Quantitative comparison of the linear and nonlinear methods demonstrated no advantage to KPCA, with some kernels performing worse at every value of K.
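A brief sketch of this comparison on the same assumed matrix E; scikit-learn's polynomial kernel with degree 2 corresponds to the quadratic kernel, and other kernels can be swapped in:

```python
# Sketch of the linear vs. nonlinear comparison; `E` is the preprocessed
# voxel x gene matrix (assumed name).
from sklearn.decomposition import PCA, KernelPCA

pca_components = PCA(n_components=100).fit_transform(E)
kpca_components = KernelPCA(n_components=100, kernel="poly",
                            degree=2).fit_transform(E)   # quadratic kernel
```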

Sparsity constrained feature learning with clustering generates the most accurate anatomical representations

Sparse representation learning algorithms generally satisfy InfoMax principles but apply sparsity constraints to one or more aspects of their outputs: the number of representations that are informative of the input data (feature or population sparsity), the samples of data that comprise any given representation (sample or lifetime sparsity), and the distribution of informative samples across all representations (dispersal) [20]. For Dictionary Learning and Sparse Coding (DLSC) and Sparse PCA (SPCA), the sparsity variable α is a regularization parameter, which imposes sparsity on components in DLSC and loadings in SPCA. Sparse Filtering (SFt) does not have a sparsity parameter, and instead uses a nonlinear transformation, ridge (l2) regression for normalization across features, a second l2 normalization across samples, and least absolute shrinkage and selection operator (l1) regularization for global minimization [20, 21]. Ideally, sparse representations would then be non-overlapping, with each one presenting similar levels of informativeness regarding the input data. Previous work claimed that AMBA representations generated with DLSC outperformed PCA when combined with clustering [15].

We generated representations starting with previously described optimal hyperparameters, including a low sparsity hyperparameter (α = 0.1) [15]. Sparsely derived features covered similar anatomy as other tested methods, with similarly broad activity (Fig 3A, 3C and 3E). As previously reported, clustering of DLSC resulted in similar but visually finer representations than those generated by PCA (Fig 3B). SPCA performed similarly with the same hyperparameters, generating larger representations with comparable anatomical features to DLSC (Fig 3C and 3D). We then tried SFt, a method that derives its own sparsity levels. SFt generated uniquely compact features, with activation in single anatomical-like regions (Fig 3E). SFt + K-means clustering yielded fine features, including the clearest depiction of cortical layers of any method tested (Fig 3F).

Fig 3. Sparse representations of AMBA.

(a) DLSC representation, active in the cortex. (b) K-means clustering (K = 200) of DLSC representations in a coronal slice. (c) SPCA representation, chosen for similarity. (d) K-means clustering of SPCA representations. (e) SFt representation, active in the cortex. (f) K-means clustering of SFt components. (g, h) Adjusted Mutual Information and Adjusted Rand Index of clustered components using K-means clustering with K ranging from 1 to 50 and from 50 to 550 with a step of 50.

https://doi.org/10.1371/journal.pone.0282171.g003

We quantified similarity to labeled anatomy across all sparse methods with fixed parameters across a range of clusters, showing greater accuracy for SFt than any other method, with a peak score of 0.61 at 200 clusters based on AMI scores (Fig 3G). ARI scores showed SFt performing substantially better, peaking at 0.28 with 50 clusters. In all methods, AMI scores peaked well below the 574 labeled brain regions, with most fitting optimally at ~200 clusters. The most accurate representations were predominantly subcortical, with resolution preventing an even more optimal fit to anatomy. Some visually obvious artifacts appeared as single slices or matched artifacts in the original histology data (Fig 1G). Visual inspection of each feature suggested improved fits were in part due to fewer coronal plane artifacts, but there is no way to determine whether the remaining mismatch between SFt and AMBA labels is due to the limits of resolution.

Sparse learning methods without clustering can represent anatomy when optimized for minimal, spatially contiguous features

The uniquely sparse and compact features from SFt suggested that no secondary clustering was required to describe anatomy, and we asked if this could be achieved with other sparse methods after tuning the sparsity parameter over α values of {0.1, 1, 10, 20}. To optimize DLSC and SPCA for better representations, we established descriptive metrics of spatial properties, similarity to ground truth, and underlying gene information: Shannon entropy measures the total information contained in all features. Sørensen–Dice similarity coefficient is a measure of overlap between a representation and the best fitting anatomical ground truth. The number of connected components describes the contiguity of a representation. Feature sparsity measures the exclusivity of activation to a given representation, with a single activated point being maximally sparse. Spatial entropy measures the total homogeneity of representation space. Finally, weight sparsity measures how few genes carry above-average weight in each representation.

Across all conditions, Shannon entropy remained the same, indicating that no information was lost using a given method, only distributed differently depending on the method and parameters (Fig 4A–4C). As a baseline, the metrics were applied to the expression of individual genes (Fig 4A). Without a clustering step, PCA made poor representations of anatomy. For individual PCA components, spatial entropy remained high, with a lower connected components score and poor fits to anatomy by Dice coefficient (Fig 4B).

Fig 4. Relative performance metrics of DLSC, SPCA, and SFt.

Representations derived from each method without additional clustering were evaluated along 5 metrics with increasing sparsity values (α) when possible. (a) Metrics, minus weight sparsity, applied to individual genes. (b) PCA had the poorest metrics, while SFt had the best Dice coefficients of any method, correlating with feature sparsity and connected component score. (c) With increasing α, DLSC representations lose spatial entropy and gain feature sparsity, with fewer connected components and improvements in individual fit to ground truth as measured by Dice coefficient. (d) For SPCA, higher α increases weight sparsity while Dice and connected components peak at α = 10. (e) Labeling all representations by the anatomy they best represent showed some brain regions were redundantly represented by all methods, with the caudate putamen (CP) overrepresented across all methods and α values, and representation diversity peaking with Dice.

https://doi.org/10.1371/journal.pone.0282171.g004

SFt offered the highest Dice coefficient of any method, even after parameter tuning of the others, demonstrating that individual representations fit directly to ground truth anatomy (Fig 4B). This correlated with a high score for connected components, ideally one anatomical region per representation. A high feature sparsity score indicated spatially restricted regions of high activation, while the spatial entropy score further described low activity across most of anatomical space. A weight sparsity score similar to PCA indicated that a similar number of genes were used to generate each representation. Overall, SFt derived sparse features from the data, but did not derive sparse weightings to generate those features.

For DLSC, previous work reported a peak AMI score with clustering when the sparsity parameter α was low. Increasing α, however, improved the Dice coefficient for direct representations, peaking at 100× the previously reported optimal value for AMI (Fig 4C). Peak Dice coefficient again correlated with peak connected components score and with spatial entropy comparable to SFt, indicating that information consolidated within features accurately until becoming incoherently sparse. Feature sparsity did not change with parameter tuning, however, nor did weight sparsity, suggesting that the contents of DLSC features did not change substantially with tuning.

For SPCA, the average Dice coefficient remained lower than DLSC or SFt even with optimal parameter tuning (Fig 4D). Like DLSC and SFt, the Dice score correlated with the number of connected components, with no relationship to spatial entropy or feature sparsity. At high α, optimization was limited to weight sparsity, meaning the number of genes used to build each representation was compressed without changing representation accuracy.

Comparing each feature across all methods and conditions, Dice coefficients were highly correlated with the connected components score (r = 0.97, p<0.01). Individual measures of sparsity across features and weights did not reflect general properties of a biologically relevant feature.

Sparse learning without clustering more optimally represents anatomy when the number of redundant features is minimized

Dice coefficient and connected components scores described how well a representation fit a single element of anatomy; however, upon labeling which element of anatomy returned the optimal Dice score for each of the 100 features generated, we found some brain regions were redundantly represented. For SFt, half of the features represented just 3 regions, for a total of 33 unique representations.

In DLSC and SPCA, overlap was more extreme at the lowest α tested, with ~95% of features being representations of the CP (Fig 4E). Secondary clustering with K-means consolidated this overwhelming association into one label, with the remaining information describing the rest of anatomy and, surprisingly, giving an optimal fit to ground truth by AMI score [15]. Increasing α improved the distribution of informativeness, generating more diverse features while improving the Dice coefficient. This was optimal in DLSC, with 64 unique representations at the same α that returned the optimal Dice coefficient. Overshooting optimal sparsity caused a collapse back to redundant representations with an overall poorer fit. For SPCA, just as increasing α beyond minimal levels did not improve the Dice coefficient, the number of unique representations did not increase. Across all methods and conditions, the number of unique features correlated with the Dice coefficient (r = 0.91, p<0.01).

Sparse learning derived features retain intelligible information as a weighted list of genes

While information is distributed across every feature derived from the data, the genes that comprise each one have explicit weightings. Using InfoMax and variance-based methods alone, weighting is spread across much of the data, but added sparsity constraints can return a gene list that is itself sparse, with fewer genes of greater weight (Fig 4A–4D). SPCA, with hyperparameter tuning, returned maximal sparsity of gene weights; however, this did not correlate with optimal representation accuracy by Dice coefficient (Fig 4D). Just as secondary clustering can filter less relevant information, we asked if the gene weights from one-step representation learning could be compressed to the most relevant markers while retaining representation fidelity. We chose SFt representations, for their higher accuracy by Dice coefficient, to see the effects of reducing the number of input genes (Fig 4B and 4D).

We increasingly compressed the AMBA dataset by generating SFt representations, establishing a weighted gene list for each representation, eliminating low-ranked genes from the weighted gene lists, then repeating SFt on the new data subset until only the single highest-ranked gene for each original representation was left. Representation fidelity degraded linearly with elimination of low-rank genes when compared against SFt labels from the full dataset (Fig 5A). Compared to anatomical ground truth, however, representations from the compressed data did not see a performance drop until the input was reduced to under 1,000 genes, and then fell with a shallower slope, retaining 95% fidelity to anatomy at over 80% compression, or 584 genes (Fig 5B). Inaccuracies at this stage were limited to the edges of anatomical regions, suggesting the initial loss of fidelity was masked by resolution and systemic variance (Fig 5C).
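A schematic sketch of this compression loop is shown below. Here sparse_filtering is a stand-in for the SFt implementation described in the Methods, assumed to return a gene × feature weight matrix, and the 20% drop per iteration is illustrative rather than the exact schedule used in this study:

```python
# Schematic sketch: repeatedly learn SFt representations, then drop the
# lowest-ranked genes until only `n_features` genes remain.
import numpy as np

def compress(E, gene_names, n_features=100, drop_fraction=0.2):
    """E: voxel x gene matrix; gene_names: labels for the gene columns."""
    data, genes = E.copy(), np.asarray(gene_names)
    while data.shape[1] > n_features:
        weights = sparse_filtering(data, n_features)   # genes x features (assumed shape)
        rank = np.abs(weights).max(axis=1)             # best weight of each gene
        n_drop = min(int(drop_fraction * len(rank)), len(rank) - n_features)
        keep = np.argsort(rank)[n_drop:]               # drop the lowest-ranked genes
        data, genes = data[:, keep], genes[keep]
    return data, genes
```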

Fig 5. Performance metrics of compressed gene sets.

Lower-ranked genes across all SFt representations were iteratively removed from the AMBA data, followed by SFt/clustering on the compressed dataset. (a) Adjusted Mutual Information of compressed representations vs. the full dataset and vs. anatomy. (b) Adjusted Rand Index of compressed representations vs. the full representations and vs. anatomy. (c) Visualization of a selected feature derived from SFt of the 2,941-gene dataset (in red) with highest Pearson correlation to the 584-gene dataset (in blue), where 95% of the AMI score with reference is met. (d) Ten genes with the largest associated weights were used to train classifiers for their corresponding brain region. Probability of positive classification is shown by the colorbar, and the corresponding anatomy is shown in blue. The ROC score for five-fold cross-validation is shown for each method.

https://doi.org/10.1371/journal.pone.0282171.g005

We then asked if the gene information from a single SFt representation could be compressed using a supervised approach. We took the highest-fidelity SFt representations by Dice coefficient, hippocampal cornu ammonis (CA) 1, CA3, and the CP, as labels, then used the top 10 most heavily weighted genes from each to train a logistic regression-based classifier (Fig 5D). Of the most highly weighted genes, SFt picked several known markers for each region of the hippocampus: CA1 [22], CA2 [23], CA3 [22], and Dentate Gyrus (DG) [24] (Table 1). CP has high molecular diversity and uniquely expressed genes, several of which were selected by SFt. Highly weighted genes included known markers of each anatomical region, but also genes expressed across multiple elements of anatomy, or even markers of adjacent anatomy that could define a border (Table 1). Data were randomized over 5 iterations, taking 80% for training and the remaining 20% for testing, with accuracy measured by Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Classification was highly accurate, with AUROC scores above 95% for every permutation. To account for class imbalance, we also measured the AUPRC scores for CA1 (0.80), CA3 (0.78), and CP (0.74). These scores are high compared to the base chance of classifying voxels in the representing feature for CA1 (0.02), CA3 (0.01), and CP (0.05). This demonstrated that an additional supervised learning step on a representation's weight matrix can compress the list of genes that describe it with nominal loss of representation fidelity.

Table 1. 12 genes with highest weights for the overlapping SFt representation with the brain regions, with selectively expressed genes of a given brain region in bold.

https://doi.org/10.1371/journal.pone.0282171.t001

Discussion

In a broad comparison of representation learning methods, sparse algorithms generated representations that more accurately described anatomical ground truth. SFt in particular generated direct anatomical representations without the need for clustering. More comprehensive metrics demonstrated that, with hyperparameter tuning, DLSC and SPCA could produce nearly comparable representations of anatomy. Each method optimized for different aspects of sparsity in the dataset, with SFt deriving the highest feature sparsity, SPCA deriving more weight sparsity, and DLSC distributing informativeness across the most features. Sparsely derived features without clustering presented a weighted gene list that could be compressed down to the most relevant genes for analysis, offering a minimal probe list for deriving specific elements of anatomy.

It is important to note that none of the learning methods tested considered spatial aspects of the data, meaning that adjacent voxels would only be included in a representation if they shared a molecular feature. This is the only basis for comparing learning methods and suggests that transcriptional signatures are sparse, or only found in the few voxels occupied by anatomy, rather than the result of broad variation across the brain such as a gradient. Further, this transcriptional variation is adequately described by linear combinations of individual gene weights. This would fit with biology where a few localized transcriptional programs determine tissue function on a background of commonly expressed genes.

The detailed extrinsic ground truth of anatomy made AMBA an ideal dataset for this comparison, but during analysis we found inherent properties of the data that could be used for optimization when anatomy is not available. Dice coefficient, or goodness of fit to ground truth, correlated with connected components score, and the number of unique features. These could themselves be a basis for learning in unlabeled datasets where assumptions can be made about the spatial properties of the biology being represented.

Previous use of DLSC reported low sparsity as optimal for anatomical representations by global analysis of fit to ground truth as measured by AMI, but deeper analysis showed a different relationship with the sparsity parameter. Low sparsity derives redundant, broadly active, and overlapping features that only represent anatomy after secondary clustering. Increasing sparsity derives more accurate representations of specific elements of anatomy, directly resolving more unique regions. DLSC was the least accurate method, but at optimal sparsity, generated nearly twice the number of unique representations.

By our metrics, SFt representations were more accurate than DLSC or SPCA even after parameter tuning, for the same number of features. This is likely due to how each method applies sparsity to feature detection. The only parameter for SFt beyond the input data is the number of representations to optimize over, which is common to all methods tested [25]. The simple implementation of SFt makes it an attractive method for future applications, but raises the question of whether this means of applying sparsity is truly self-optimizing or just particularly compatible with this dataset. In future implementations, SFt could be tuned for further sparsity by using a smoothed fractional norm in place of the l1 norm [26].

Sparse representation learning methods were unique in not requiring secondary clustering to represent anatomy, making the weight matrices that comprise each feature intelligible. Representations were not simply correlatively expressed genes. Instead, genes that defined borders of anatomy from the inside or the outside were both highly weighted. Within the hippocampus, Dock10, an established marker of CA2, was highly weighted in representations of every anatomical region that bordered it (CA1, CA3, DG, CP), despite all methods being blind to spatial information (Table 1) [27]. This example demonstrates how feature detection is different from differential expression and should not be used as an analog for expression analysis when dissecting the possible biology of features derived without ground truth. Highly correlated genes within a feature can have informative relationships to genes expressed in other features that would lead to higher or lower weighting in the gene list. The stochastic initialization of sparse methods can also change weights or entire features, as they are built independently. Direct representations only covered a fraction of total identified brain regions, with redundancy in a few elements of anatomy. We were able to define isolated elements of anatomy with relative ease, but this was limited to what was testable with defined marker genes. The limits of which brain regions can be directly represented, and which large areas with multiple signatures should be excluded to improve analysis, would have to be determined and validated for more obscure elements of anatomy.

SPCA delivered the sparsest gene lists, but we found that SFt representations could be compressed even further without significant loss of fidelity. In this pilot study, after training the model on the 3k-gene dataset we reduced the input data to 580 genes and still generated a better fit to anatomical ground truth than other methods using the full dataset. We could further minimize our gene lists for individual representations by applying supervised learning to the highest ranked elements to optimize their weighting. This two-step approach allowed for high accuracy representations with a list of elements that can be curated and optimized for downstream applications, such as limiting the list to proteins with commercially available antibodies. Together, this offers a basis for deriving representations of biology with descriptive molecular signatures from a spatial transcriptomic dataset. This study benefited from having testable ground truth for biological validation, but the resulting weight matrix of each feature offers an unbiased transcriptional signature that is generalizable to other morphologies. The anatomy of the wild type (WT) C57BL/6J mouse is well described, but much less so for knockout or transgenic mouse strains or for related species of interest. AMBA-scale data acquisition is unlikely to be repeated, but analysis with sparse learning has reduced the number of genes needed to describe anatomy to a scale tractable by Multiplexed error-robust fluorescence in situ hybridization (MERFISH) or similar multiplexed spatial transcriptomic methods. What once took years of work could now be largely recapitulated in a single session [28].

One obvious improvement that could be made to any sparse methods on spatial datasets would be consideration of neighboring voxels for unsupervised learning. In this study, spatial aspects of the dataset were implicit, with representations appearing compact and contiguous only by virtue of their fit to ground truth. Finer representations could potentially be achieved by adding a spatial convolution step to sparse learning, or using spatial metrics as the basis for a loss function. For nonspatial datasets, other inherent biology such as gene ontology could serve a similar role to leverage this versatile tool for representation learning with preserved gene information.

Methods

Allen Mouse Brain Atlas (AMBA)

The Allen Brain Atlas project (http://mouse.brain-map.org) has provided a comprehensive set of ~20,000 gene expression profiles with cellular resolution in the male, 56-day-old C57BL/6J mouse brain [29]. Image data were collected using in situ hybridization in sagittally oriented slices with 200 μm inter-slice resolution and 25 μm thickness. The expression patterns were replicated in coronal sections for ~4,000 genes of high neurobiological interest [12]. The expression patterns were reconstructed in 3D and registered to a Nissl stain-based reference atlas (Allen Reference Atlas; ARA). The expression of a gene within a 200 μm isotropic voxel is the average intensity of pixels in the pre-processed image, called smoothed expression energy. Each 3D gene expression volume consists of 67 × 41 × 58 (rostral-caudal, dorsal-ventral, left-right) spatially matched voxels [29].

Data pre-processing

We followed the quality control measure implemented by Bohland et al. [13]. The Pearson correlation coefficient was measured for each gene between the coronal and sagittal experiments to find the higher-consistency dataset. The 25% of genes with the lowest correlations were then removed, resulting in a selection of 3,041 genes in coronal slices, of which 2,941 still exist in the current version of the AMBA dataset, which we adopted for this study. The genes were selected from the experiments done on coronal slices because of their higher fidelity.

We extracted and concatenated the expression energy values in the brain for all dimensionality reduction methods to form a large 63,113 voxel × 2,941 gene matrix, E(v,g), where v and g denote voxels and genes. We then z-transformed the data for each gene so that the mean of each gene’s expression across the voxels was zero, and the standard deviation was one.
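A minimal sketch of this step, assuming the reconstructed expression volumes and a boolean brain mask have already been loaded (expression_volumes and brain_mask are illustrative names, not part of the released data files):

```python
# Sketch of the preprocessing: flatten in-brain voxels into a voxel x gene
# matrix and z-transform each gene. `expression_volumes` is assumed to be a
# genes x 67 x 41 x 58 array of smoothed expression energy and `brain_mask`
# a boolean 67 x 41 x 58 mask of in-brain voxels.
import numpy as np

voxels = expression_volumes[:, brain_mask]      # genes x 63,113 in-brain voxels
E = voxels.T.astype(float)                      # 63,113 voxels x 2,941 genes
E = (E - E.mean(axis=0)) / E.std(axis=0)        # zero mean, unit variance per gene
```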

Representation learning methods

For PCA, KPCA, ICA, SPCA, and DLSC we used their implementations in the publicly available SciKit-Learn package in Python [30]. We selected 100 components for all methods. In the sparse methods, SPCA and DLSC, the amount of sparseness is adjustable by the coefficient of the l1 penalty, given by the parameter α, for which we tested values of α = {0.1, 1, 10, 20}. The kernels used in KPCA were the quadratic and cubic polynomial, rbf, and sigmoid functions.
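A sketch of the corresponding estimator setup is given below. The matrix E is the preprocessed voxel × gene matrix from the previous section (assumed name), solver settings are left at their defaults, and for a matrix of this size scikit-learn's MiniBatchDictionaryLearning may be a more practical choice than the batch estimator shown:

```python
# Sketch of the 100-component decompositions surveyed here.
from sklearn.decomposition import (PCA, KernelPCA, FastICA, SparsePCA,
                                   DictionaryLearning)

n, alphas = 100, [0.1, 1, 10, 20]
models = {
    "PCA": PCA(n_components=n),
    "KPCA-quadratic": KernelPCA(n_components=n, kernel="poly", degree=2),
    "ICA": FastICA(n_components=n),
    **{f"SPCA-a{a}": SparsePCA(n_components=n, alpha=a) for a in alphas},
    **{f"DLSC-a{a}": DictionaryLearning(n_components=n, alpha=a) for a in alphas},
}
representations = {name: m.fit_transform(E) for name, m in models.items()}
```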

Sparse Filtering (SFt)

Sparse filtering is an unsupervised feature learning method that efficiently scales to handle large input dimensions. The single hyperparameter to tune is the number of features. SFt optimizes for population sparsity (a few non-zero features represent each sample), lifetime sparsity (each feature is active for a few samples), and high dispersal (features share similar statistics) [31].

The math proposed for SFt [20] considers a feature distribution matrix indexed by j for features (rows) and by i for examples (columns), where f_j^(i) = w_j^T x^(i) represents the jth feature value for the ith example. The first step is to normalize each feature by dividing it by its l2 norm across all examples: f̃_j = f_j / ||f_j||_2. The features are then l2 normalized across each example: f̂^(i) = f̃^(i) / ||f̃^(i)||_2. Finally, the l1 penalty is used to optimize the normalized features for sparsity, so that for a dataset of M examples the objective becomes:

minimize_W Σ_{i=1}^{M} ||f̂^(i)||_1 = Σ_{i=1}^{M} || f̃^(i) / ||f̃^(i)||_2 ||_1

We used the open-source GitHub repository of Sparse Filtering in Python (https://github.com/jmetzen/sparse-filtering) in this study. This software package transposes the voxel × gene matrix, E(v,g), to be consistent with the Matlab code provided by Ngiam et al. [20]. It also incorporates the soft-absolute activation function and the Limited-memory Broyden, Fletcher, Goldfarb, and Shanno (L-BFGS) minimizer.
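For reference, a minimal sparse filtering sketch written from the equations above rather than copied from that package. It uses the soft-absolute activation and L-BFGS, but relies on scipy's finite-difference gradients for brevity, so it is only practical for small matrices; the reference implementation supplies an analytic gradient:

```python
# Minimal sparse filtering sketch (illustrative, not the package's code).
import numpy as np
from scipy.optimize import minimize

def sparse_filtering(X, n_features, eps=1e-8, maxiter=200):
    """X: samples x inputs. Returns a weight matrix of shape inputs x features."""
    n_samples, n_inputs = X.shape

    def objective(w_flat):
        W = w_flat.reshape(n_inputs, n_features)
        F = np.sqrt((X @ W) ** 2 + eps)                   # soft-absolute features
        F = F / np.sqrt((F ** 2).sum(axis=0) + eps)       # l2 normalize each feature
        F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True) + eps)  # then each sample
        return F.sum()                                    # l1 penalty (entries are >= 0)

    w0 = np.random.default_rng(0).standard_normal(n_inputs * n_features)
    result = minimize(objective, w0, method="L-BFGS-B", options={"maxiter": maxiter})
    return result.x.reshape(n_inputs, n_features)
```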

K-means clustering

After data representation using the previously mentioned methods, we applied unsupervised K-means clustering. We used the publicly available K-means clustering implementation in SciKit-Learn [30], with the number of clusters (K) ranging from 1 to 50 with a step of 1, and from 50 to 550 with a step of 50.

Similarity measures

We chose Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) from the SciKit-Learn package [30] to measure the similarity of clustered representations with anatomy, where the adjusted versions of MI and RI account for chance.

We developed a function to compute the Dice similarity coefficient (DSC) to measure the spatial overlap of brain regions and representations (code example available on GitHub). It ranges from 0, for no overlap, to 1, for perfect overlap. The formula to calculate DSC for two regions A and B is:

DSC(A, B) = 2|A ∩ B| / (|A| + |B|)

where ∩ is the intersection [32].
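A minimal version of such a function, assuming the region and thresholded feature are boolean voxel masks of equal shape:

```python
# Sketch of the Dice similarity coefficient for two boolean voxel masks.
import numpy as np

def dice_coefficient(region_mask, feature_mask):
    """DSC = 2|A n B| / (|A| + |B|); returns 0.0 when both masks are empty."""
    intersection = np.logical_and(region_mask, feature_mask).sum()
    denom = region_mask.sum() + feature_mask.sum()
    return 2.0 * intersection / denom if denom else 0.0
```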

Logistic regression

For Sparse Filtering, we found the candidate pairs of brain regions and features by measuring the Dice coefficient of a given brain region against a thresholded feature. We thresholded the features using K-means clustering, with the single feature of interest as input, and choosing the number of clusters at two. We compared both labels against the region of interest using the Dice coefficient, and kept the better of the two. We repeated this procedure for every brain region against all features, and kept the best region-feature combination.

We chose the brain regions CA1, CA3, and CP for logistic regressions, as they were among the top overlapping combinations for SFt and other methods. We determined the gene inputs for logistic regressions by identifying the largest ten weights by absolute value associated with a given feature produced by a transformation, and then finding the genes associated with these weights (Table 1). We then mean-centered and standard-scaled these ten genes, and prepared a shuffled five-fold cross-validation set. We trained a logistic regression from the normalized genes onto the region of interest, and reported the ROC-AUC score on the test set. Normalization, logistic regression, shuffling, cross-validation, and ROC-AUC score were all performed using their respective implementations in the scikit-learn library [30].
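A sketch of this procedure with scikit-learn, assuming E is the voxel × gene matrix, feature_weights holds the gene weights of the chosen representation, and region_mask is the per-voxel boolean CCF label (all assumed names):

```python
# Sketch of the classification step: standardize the ten highest-|weight|
# genes and score a logistic regression with five-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

top10 = np.argsort(np.abs(feature_weights))[-10:]   # ten largest weights by |value|
X, y = E[:, top10], region_mask.astype(int).ravel()

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
roc_auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
pr_auc = cross_val_score(clf, X, y, cv=cv, scoring="average_precision")
print(roc_auc.mean(), pr_auc.mean())
```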

Sparsity metrics

Here we describe the metrics used to measure sparsity; code examples are provided on GitHub.

Feature sparsity

To measure the feature sparsity, first, we found the number of active voxels in each feature by measuring the mean value of voxels in each feature and counting the number of voxels with values above that. Then we used the ratio below to measure sparsity:

Feature Sparsity = Total number of voxels in the feature / Number of active voxels

Weight sparsity

To measure the weight (gene) sparsity, first, we found the number of active genes in each feature transformation by measuring the mean value of genes in each feature and counting the number of genes with values above that. Then we used the ratio below to measure sparsity:

Weight (gene) Sparsity = Total number of genes / Number of active genes in the feature
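Both ratios can be computed with the same helper. A small sketch, assuming feature and weights are 1D arrays of voxel activations and gene weights for one representation (assumed names), with "active" meaning above the mean value:

```python
# Sketch covering both ratio metrics defined above.
import numpy as np

def sparsity_ratio(values):
    """Total number of elements divided by the number of above-mean elements."""
    active = int((values > values.mean()).sum())
    return values.size / active if active else float("inf")

feature_sparsity = sparsity_ratio(feature)   # voxels / active voxels
weight_sparsity = sparsity_ratio(weights)    # genes / active genes
```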

Connected components

We determined the number of connected components within a feature using the implementation of marching cubes in the scikit-image library [34] to develop an adjacency matrix, and an algorithm in the NetworkX library [33] that determines the number of connected components in a graph from its adjacency matrix.

We reconstructed the representations generated by each method into the anatomical space, then applied the scikit-image implementation of marching cubes [34] to the resulting 3D array, using the mean value of the feature as the threshold level for the method. The output of this method is a set of vertices and faces that comprise a set of 3D surfaces. We constructed an adjacency matrix for the set of vertices using the set of faces, and found the number of connected components by passing this adjacency matrix to the number_connected_components algorithm in the NetworkX library [33].
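A condensed sketch of this procedure, assuming feature_volume is the representation mapped back onto the 67 × 41 × 58 grid (an assumed name):

```python
# Sketch: mesh the feature with marching cubes at its mean value, link the
# vertices of each triangular face, and count connected components.
import networkx as nx
from skimage.measure import marching_cubes

verts, faces, _, _ = marching_cubes(feature_volume, level=feature_volume.mean())

graph = nx.Graph()
graph.add_nodes_from(range(len(verts)))
for a, b, c in faces.tolist():               # each face contributes three edges
    graph.add_edges_from([(a, b), (b, c), (a, c)])

n_components = nx.number_connected_components(graph)
```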

Shannon entropy

We measured the Shannon entropy of a feature by binning the continuous values of the feature into discrete levels and treating each discrete value as though it were a unique symbol in the Shannon entropy formula:

H = −Σ_{i=1}^{n} p(x_i) log2 p(x_i)

where each x_i, for i = 1, …, n, is a discrete value of the binned feature.
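A small sketch of this measurement, with the bin count as an assumed parameter:

```python
# Sketch of the Shannon entropy of a binned feature, in bits.
import numpy as np

def shannon_entropy(feature, n_bins=256):
    counts, _ = np.histogram(feature, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```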

Spatial entropy

We developed a measurement of the spatial entropy of a feature using an extension of the gray-level co-occurrence matrix (GLCM) method used on 2D images. Pixel intensities present in an image are binned into discrete levels, and instances of co-occurrence of these levels are counted at specified spatial relationships as the elements of a matrix. We extended this method to 3D images by including a three-dimensional spatial relationship [35]. The spatial entropy of an image can then be measured by finding the Shannon entropy of this matrix, in which each (i,j) index of the matrix is treated as though it were a unique symbol in the Shannon entropy formula described previously.
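A simplified sketch of this measurement for a single voxel offset; the number of gray levels and the offset are assumed parameters rather than the exact settings used here:

```python
# Sketch of a 3D gray-level co-occurrence entropy for one offset.
import numpy as np

def spatial_entropy(volume, n_levels=16, offset=(1, 0, 0)):
    edges = np.linspace(volume.min(), volume.max(), n_levels)
    levels = np.digitize(volume, edges)               # bin intensities into levels
    dz, dy, dx = offset
    a = levels[:-dz if dz else None, :-dy if dy else None, :-dx if dx else None]
    b = levels[dz:, dy:, dx:]
    glcm = np.zeros((n_levels + 1, n_levels + 1))     # co-occurrence counts
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)
    p = glcm[glcm > 0] / glcm.sum()
    return float(-(p * np.log2(p)).sum())
```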

Acknowledgments

Thanks to Dr. Chris Plaisier for consultation on our metrics, Jan Hendrik Metzen for sharing his implementation of SFt, and Research Computing at Arizona State University for providing the high-performance computing and data storage used to generate the analyses reported within this paper.

References

  1. Xiang R, Wang W, Yang L, Wang S, Xu C, Chen X. A Comparison for Dimensionality Reduction Methods of Single-Cell RNA-seq Data. Front Genet. 2021;12: 646936. pmid:33833778
  2. Jiang J, Wang C, Qi R, Fu H, Ma Q. scREAD: A Single-Cell RNA-Seq Database for Alzheimer’s Disease. iScience. 2020;23: 101769. pmid:33241205
  3. Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nat Commun. 2019;10: 5416. pmid:31780648
  4. Ma S-X, Lim SB. Single-Cell RNA Sequencing in Parkinson’s Disease. Biomedicines. 2021;9: 368. pmid:33916045
  5. Ciortan M, Defrance M. Contrastive self-supervised clustering of scRNA-seq data. BMC Bioinformatics. 2021;22: 280. pmid:34044773
  6. Torgerson WS. Multidimensional scaling: I. Theory and method. Psychometrika. 1952;17: 401–419.
  7. Liu Z. Visualizing Single-Cell RNA-seq Data with Semisupervised Principal Component Analysis. Int J Mol Sci. 2020;21: E5797. pmid:32806757
  8. Ben-David S, Haghtalab N. Clustering in the Presence of Background Noise. Proceedings of the 31st International Conference on Machine Learning. PMLR; 2014. pp. 280–288. Available: https://proceedings.mlr.press/v32/ben-david14.html
  9. Vahid MR, Brown EL, Steen CB, Kang M, Gentles AJ, Newman AM. Robust alignment of single-cell and spatial transcriptomes with CytoSPACE. bioRxiv; 2022. p. 2022.05.20.488356.
  10. Nanga S, Bawah AT, Acquaye BA, Billa M-I, Baeta FD, Odai NA, et al. Review of Dimension Reduction Methods. Journal of Data Analysis and Information Processing. 2021;9: 189–231.
  11. Kolali Khormuji M, Bazrafkan M. A novel sparse coding algorithm for classification of tumors based on gene expression data. Med Biol Eng Comput. 2016;54: 869–876. pmid:26337064
  12. Ng L, Bernard A, Lau C, Overly CC, Dong H-W, Kuan C, et al. An anatomic gene expression atlas of the adult mouse brain. Nat Neurosci. 2009;12: 356–362. pmid:19219037
  13. Bohland JW, Bokil H, Pathak SD, Lee C-K, Ng L, Lau C, et al. Clustering of spatial gene expression patterns in the mouse brain and comparison with classical neuroanatomy. Methods. 2010;50: 105–112. pmid:19733241
  14. Ortiz C, Navarro JF, Jurek A, Märtin A, Lundeberg J, Meletis K. Molecular atlas of the adult mouse brain. Science Advances. 2020. pmid:32637622
  15. Li Y, Chen H, Jiang X, Li X, Lv J, Li M, et al. Transcriptome Architecture of Adult Mouse Brain Revealed by Sparse Coding of Genome-Wide In Situ Hybridization Images. Neuroinformatics. 2017;15: 285–295. pmid:28608010
  16. Shannon CE. A mathematical theory of communication. The Bell System Technical Journal. 1948;27: 379–423.
  17. Bell AJ, Sejnowski TJ. An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation. 1995;7: 1129–1159. pmid:7584893
  18. Crescimanna V, Graham B. The Variational InfoMax AutoEncoder. 2020 International Joint Conference on Neural Networks (IJCNN). 2020. pp. 1–8.
  19. Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, et al. Learning deep representations by mutual information estimation and maximization. arXiv:180806670 [cs, stat]. 2019. Available: http://arxiv.org/abs/1808.06670
  20. Ngiam J, Chen Z, Bhaskar S, Koh P, Ng A. Sparse Filtering. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2011. Available: https://papers.nips.cc/paper/2011/hash/192fc044e74dffea144f9ac5dc9f3395-Abstract.html
  21. James G, Witten D, Hastie T, Tibshirani R. Linear Model Selection and Regularization. In: James G, Witten D, Hastie T, Tibshirani R, editors. An Introduction to Statistical Learning: with Applications in R. New York, NY: Springer; 2013. pp. 203–264. https://doi.org/10.1007/978-1-4614-7138-7_6
  22. Hamilton D, White C, Rees C, Wheeler D, Ascoli G. Molecular fingerprinting of principal neurons in the rodent hippocampus: a neuroinformatics approach. J Pharm Biomed Anal. 2017;144: 269–278. pmid:28549853
  23. Dudek SM, Alexander GM, Farris S. Rediscovering area CA2: unique properties and functions. Nat Rev Neurosci. 2016;17: 89–102. pmid:26806628
  24. Radic T, Frieß L, Vijikumar A, Jungenitz T, Deller T, Schwarzacher SW. Differential Postnatal Expression of Neuronal Maturation Markers in the Dentate Gyrus of Mice and Rats. Frontiers in Neuroanatomy. 2017;11. Available: https://www.frontiersin.org/article/10.3389/fnana.2017.00104
  25. Gultepe E, Makrehchi M. Improving clustering performance using independent component analysis and unsupervised feature learning. Human-centric Computing and Information Sciences. 2018;8: 25.
  26. Buccini A, Pasha M, Reichel L. Modulus-based iterative methods for constrained ℓp-ℓq minimization. Inverse Problems. 2020;36: 084001.
  27. Kohara K, Pignatelli M, Rivest AJ, Jung H-Y, Kitamura T, Suh J, et al. Cell type–specific genetic and optogenetic tools reveal hippocampal CA2 circuits. Nat Neurosci. 2014;17: 269–279. pmid:24336151
  28. Lewis SM, Asselin-Labat M-L, Nguyen Q, Berthelet J, Tan X, Wimmer VC, et al. Spatial omics and multiplexed imaging to explore cancer biology. Nat Methods. 2021;18: 997–1012. pmid:34341583
  29. Lein ES, Hawrylycz MJ, Ao N, Ayres M, Bensinger A, Bernard A, et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445: 168–176. pmid:17151600
  30. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12: 2825–2830.
  31. Zennaro FM, Chen K. Towards understanding sparse filtering: A theoretical perspective. Neural Netw. 2018;98: 154–177. pmid:29232616
  32. Zou KH, Warfield SK, Bharatha A, Tempany CMC, Kaus MR, Haker SJ, et al. Statistical Validation of Image Segmentation Quality Based on a Spatial Overlap Index. Acad Radiol. 2004;11: 178–189. pmid:14974593
  33. Hagberg AA, Schult DA, Swart PJ. Exploring Network Structure, Dynamics, and Function using NetworkX. 2008; 5.
  34. van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, et al. scikit-image: image processing in Python. PeerJ. 2014;2: e453. pmid:25024921
  35. Tsai F, Chang C-K, Rau J-Y, Lin T-H, Liu G-R. 3D Computation of Gray Level Co-occurrence in Hyperspectral Image Cubes. In: Yuille AL, Zhu S-C, Cremers D, Wang Y, editors. Energy Minimization Methods in Computer Vision and Pattern Recognition. Berlin, Heidelberg: Springer; 2007. pp. 429–440. https://doi.org/10.1007/978-3-540-74198-5_33