Fig 1.
PAC recursively partitions the data space to obtain a rational initialization structure.
Partition-based methods estimate data density by recursively cutting the data space into smaller rectangles. Parts (a) and (b) show three clusters of points, the marginal densities of the data, and several partition scenarios. The data space (box) is partitioned in sequential steps, denoted by the numbers on the cut lines; only the first three cuts are shown in parts (a) and (b). (a) Bayesian Sequential Partition (BSP) is a Bayesian procedure that maximizes the posterior of the density estimate by dividing the data space via binary partitions; these partitions occur at the midpoint of the bounded region. In contrast, Discrepancy Sequential Partition (DSP) can divide at locations other than the midpoint; here, the division is guided by a discrepancy score obtained through a series of tests of uniformity of the point distribution, and the procedure stops when the discrepancies fall below a set threshold. (b) In the (one-step) look-ahead version of BSP, the algorithm evaluates all potential cuts plus one further step (steps 2 and 3) and finds the optimal future scenario (after step 3). Compared with the (sub-optimal) BSP scenario (one of many scenarios) illustrated in part (a), the scenario in (b) segregates the gold cluster much better, and it is the preferred cut to make when the partitioning procedure continues. In theory, BSP can produce sequential partitions for a pre-set number of steps ahead; however, to maintain computational feasibility, we implemented the one-step look-ahead BSP for this work. (c) Partitioning of a simulated data space containing five subpopulations; the hyper-rectangles surround high-density areas, approximating the underlying distribution.
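The midpoint partitioning idea can be illustrated with a minimal sketch. This is not the authors' BSP or DSP implementation: the Bayesian posterior is replaced here by a plain log-likelihood gain for a piecewise-constant density, and the names `cut_gain` and `midpoint_partition` are hypothetical.

```python
import math

def cut_gain(n_left, n_right):
    """Log-likelihood gain of halving a region that holds n_left + n_right points.

    For a piecewise-constant density, a midpoint cut changes the log-likelihood
    by sum_k n_k * log(2 * n_k / n), which is zero for a perfectly even split.
    """
    n = n_left + n_right
    gain = 0.0
    for k in (n_left, n_right):
        if k > 0:
            gain += k * math.log(2.0 * k / n)
    return gain

def midpoint_partition(points, lower, upper, n_cuts):
    """Greedy binary partition: each step halves one leaf along one dimension."""
    leaves = [(list(lower), list(upper), list(points))]
    for _ in range(n_cuts):
        best = None  # (gain, leaf index, dimension)
        for i, (lo, hi, pts) in enumerate(leaves):
            if len(pts) < 2:
                continue  # nothing to gain from cutting a nearly empty leaf
            for d in range(len(lo)):
                mid = 0.5 * (lo[d] + hi[d])
                n_left = sum(1 for p in pts if p[d] < mid)
                gain = cut_gain(n_left, len(pts) - n_left)
                if best is None or gain > best[0]:
                    best = (gain, i, d)
        if best is None:
            break
        _, i, d = best
        lo, hi, pts = leaves.pop(i)
        mid = 0.5 * (lo[d] + hi[d])
        left_hi, right_lo = hi[:], lo[:]
        left_hi[d] = mid
        right_lo[d] = mid
        leaves.append((lo, left_hi, [p for p in pts if p[d] < mid]))
        leaves.append((right_lo, hi, [p for p in pts if p[d] >= mid]))
    return leaves

# Toy data: two groups of points in the unit square.
toy = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.2), (0.9, 0.9), (0.85, 0.8)]
leaves = midpoint_partition(toy, [0.0, 0.0], [1.0, 1.0], n_cuts=3)
```

Each leaf is a `(lower, upper, points)` triple, so leaves holding many points in a small volume mark the high-density rectangles described in part (c).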
Fig 2.
Rational initialization is better than random initialization.
The hand-gated CyTOF data (see S1 Fig) are used for illustration. Commonly, kmeans is initialized via Lloyd’s algorithm or the kmeans++ algorithm. In comparison, (a) the overall sum of squares error is lower and (b) the F-measure is higher for DSP with kmeans than for the classic kmeans initialization algorithms. The rational initialization helps anchor the cluster starting points and becomes very important for the fast convergence of PAC (Fig 3).
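As a rough illustration of rational initialization, the sketch below seeds kmeans with the mean of the points in each partition cell and then runs a fixed number of Lloyd iterations. It assumes the partition cells are already available as lists of points; the helper names (`mean`, `lloyd`, `sse`) are hypothetical and this is not the PAC code itself.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))

def lloyd(points, centers, n_iter):
    """Run n_iter Lloyd updates starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center when a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels

def sse(points, centers, labels):
    """Overall sum of squared errors of points to their assigned centers."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

# Hypothetical partition cells (for example, leaves of a DSP or BSP partition).
cells = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
points = [p for cell in cells for p in cell]
seeds = [mean(cell) for cell in cells if cell]
centers, labels = lloyd(points, seeds, n_iter=1)
error = sse(points, centers, labels)
```

Because the seeds already sit in dense regions, few Lloyd iterations are needed before the assignments stop changing, which is the behavior the caption attributes to rational initialization.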
Fig 3.
Rational initialization, minimal kmeans post-processing iterations, and merging give fast convergence.
We use the hand-gated CyTOF data for illustration. The data space is first partitioned into 50 hyperrectangles, about twice (the recommended setting) the expected number of subpopulations (24). Next, the number of kmeans iterations is varied, followed by flowMeans-style merging. The convergence of PAC toward the hand-gated results, or ground truth, is fast because the rational initialization anchors the cluster centers around high-density regions. PAC takes fewer than 50 post-processing kmeans iterations to converge. This efficiency allows the PAC method to scale to the clustering of large samples.
Fig 4.
t-SNE visualization of clustering methods.
We compare the clustering results of hand-gating, (Lloyd’s) kmeans, SPADE, flowMeans, bPAC, and dPAC. Each t-SNE plot contains all gated cell events from the hand-gated CyTOF data, with a different set of colored labels. The colored labels denote different subpopulations within each plot; however, the colors have no cross-plot meaning. The number of subpopulations for all methods was set to that of the hand-gated results (24 subpopulations). The PAC methods match the hand-gated labels much more closely than the alternative methods.
Table 1.
F-measure comparisons of methods on simulated and hand-gated cytometry datasets.
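For readers unfamiliar with the metric, the sketch below computes one common clustering F-measure: for each true population, take the best F1 score over all predicted clusters, then average weighted by population size. That this exact definition matches the one used in Table 1 is an assumption, and the function name `f_measure` is hypothetical.

```python
from collections import Counter

def f_measure(truth, predicted):
    """Size-weighted best-match F1 between true and predicted labels."""
    n = len(truth)
    true_sizes = Counter(truth)
    pred_sizes = Counter(predicted)
    overlap = Counter(zip(truth, predicted))
    total = 0.0
    for t, n_t in true_sizes.items():
        best = 0.0
        for c, n_c in pred_sizes.items():
            n_tc = overlap[(t, c)]
            if n_tc:
                precision = n_tc / n_c
                recall = n_tc / n_t
                best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_t / n) * best
    return total

# Label names do not matter, only the grouping does.
perfect = f_measure([0, 0, 1, 1], ["b", "b", "a", "a"])
lumped = f_measure([0, 0, 1, 1], [0, 0, 0, 0])
```

A score of 1 means every true population is recovered exactly by some cluster; lumping everything into one cluster lowers precision and therefore the score.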
Fig 5.
Consider a deck of networks (in analogy to cards), with each “suit” representing a sample and each “rank” representing a unique network structure. The networks are aligned by similarity and organized on a dendrogram. The tree is cut (red line) at the optimal level (by elbow point analysis, see S8 Fig) to output k clades. Within each clade, the network structures are similar or the same. If the same sample has multiple networks in the same clade, then these networks are merged (black box around same cards).
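The merging step in the card analogy can be sketched as a simple grouping, assuming each network has already been assigned to a clade by cutting the dendrogram. The function name `merge_within_clades` and the toy labels are hypothetical.

```python
from collections import defaultdict

def merge_within_clades(networks, clade_of):
    """Group networks that share both a clade and a sample.

    networks: list of (sample, subpopulation) pairs.
    clade_of: dict mapping each (sample, subpopulation) pair to its clade.
    Returns a dict keyed by (clade, sample) with the merged subpopulations.
    """
    grouped = defaultdict(list)
    for sample, subpop in networks:
        clade = clade_of[(sample, subpop)]
        grouped[(clade, sample)].append(subpop)
    return dict(grouped)

# Toy deck: sample s1 contributes two networks to clade 1, s2 contributes one.
networks = [("s1", "a"), ("s1", "b"), ("s2", "a")]
clade_of = {("s1", "a"): 1, ("s1", "b"): 1, ("s2", "a"): 1}
merged = merge_within_clades(networks, clade_of)
```

Any `(clade, sample)` key holding more than one subpopulation corresponds to a black box around same-suit cards in the figure.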
Fig 6.
A simple batch effect dataset was simulated and visualized. The data have 5 dimensions, 2 of which are informative and used for visualization. (a) Two simulated data samples with the same subpopulations. The means are shifted (up in sample 2) due to a measurement batch effect. (b) When the samples are combined, as when all samples are pooled for analysis, two different subpopulations overlap (left panel). The overlapped subpopulations cannot be distinguished by clustering (right panel). (c) PAC could be used to discover more subpopulations; however, hints of the presence of another subpopulation do not help to resolve the batch effect. Thus, in this case, it is necessary to analyze the samples separately and then find relationships between the subpopulations across the samples.
Fig 7.
Calculation of sample clusters and their underlying network structures.
(a) In the batch effect simulation data, PAC was used to discover several subpopulations per sample without advance knowledge of the exact number of subpopulations. Here, the colors denote the different clusters within each sample. Panels (b) and (c) show the networks of the subpopulations discovered in (a) for samples 1 and 2, respectively. In these networks, the nodes denote the markers (or genes) measured (in this simulation data, the dimensions are named V1, V2,…, V5). The edges denote correlative relationships in terms of mutual information. These networks can be grouped by similarity to organize the subpopulations across samples. In the PAC-MAN implementation, the alignment is based on the Jaccard dissimilarity between network structures, and we organize the networks by hierarchical clustering of the Jaccard scores.
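The alignment idea can be sketched as follows, treating each network as a set of undirected edges. The Jaccard dissimilarity is standard; the average-linkage agglomeration is an assumption, since the caption does not state the linkage, and the helper names are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two markers."""
    return frozenset((u, v))

def jaccard_dissimilarity(edges_a, edges_b):
    """1 - |A intersect B| / |A union B| for two edge sets."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def agglomerate(edge_sets, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(edge_sets))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        total = sum(jaccard_dissimilarity(edge_sets[i], edge_sets[j])
                    for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        candidates = [(x, y) for x in range(len(clusters))
                      for y in range(x + 1, len(clusters))]
        a, b = min(candidates,
                   key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy networks over markers V1..V5: the first two share their structure.
nets = [
    {edge("V1", "V2"), edge("V2", "V3")},
    {edge("V1", "V2"), edge("V2", "V3")},
    {edge("V4", "V5")},
]
clades = agglomerate(nets, k=2)
```

Networks with identical edge sets have dissimilarity 0 and merge first, regardless of any mean shift in the marker levels of the underlying subpopulations.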
Fig 8.
Resolution of batch effects for the simple batch effect scenario.
Network alignment allows resolution of the mean-shift batch effect. (a) Resolution of the batch effect by the networks of all discovered subpopulations. In the left panel, the colors denote subpopulations that are aligned by network structure. The overlapped subpopulations are correctly labeled. The right panel shows the hierarchical clustering of the subpopulations’ networks via Jaccard dissimilarities. These subpopulations are the same as those in Fig 7. (b) Resolution of the batch effect by marker levels. As an alternative to alignment by network, marker levels (subpopulation centroids) can be used. However, the overlap of the different subpopulations from the two samples makes it impossible to resolve the mean shift in this simulated data. The hierarchical clustering of the centroids organizes the subpopulations differently from that in part (a).
Fig 9.
Dynamic batch effect scenario.
Two subpopulations, shown in blue, migrate in a time-series fashion (beginning in sample 1 and progressing through samples 2, 3, 4, and 5). In this simulation data, the dimensions are named V1, V2,…, V5, and V1 and V2 are the informative dimensions. The two subpopulations almost converge through mean shifts over the time series. The bottom right panel shows the subpopulations pooled into one figure; the colors denote subpopulations.
Fig 10.
PAC clustering on dynamic batch effect scenario samples.
We used PAC to discover several subpopulations per sample without advance knowledge of the number of subpopulations present. Each color within a sample denotes a distinct PAC subpopulation, but the colors have no meaning across samples.
Fig 11.
Resolution of the dynamic batch effect scenario.
Comparison of PAC-MAN results between representative clades (number of clades set to 2). Using network structures (left panel) or expression information (middle panel) alone does not resolve the dynamic information. On the other hand, the dynamic information is resolved first by alignments of networks of larger subpopulations and then by merging smaller subpopulations (with unstable network structures) by expression into the aligned clades (right panel).
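The second stage, merging smaller subpopulations into network-aligned clades by expression, might look like the sketch below. It assumes each clade is summarized by a centroid of marker levels and that "by expression" means nearest centroid in Euclidean distance; both are illustrative assumptions, and the function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_small_to_clades(clade_centroids, small_centroids):
    """Map each small subpopulation to the clade with the nearest centroid.

    clade_centroids: dict clade -> expression centroid of the aligned clade.
    small_centroids: dict subpopulation -> expression centroid.
    """
    assignment = {}
    for name, centroid in small_centroids.items():
        assignment[name] = min(
            clade_centroids,
            key=lambda clade: sq_dist(centroid, clade_centroids[clade]),
        )
    return assignment

# Toy example with two clades in a two-marker space.
clade_centroids = {"clade1": (0.0, 0.0), "clade2": (5.0, 5.0)}
small_centroids = {"s3_small": (0.5, 0.2), "s4_small": (4.0, 4.5)}
assignment = assign_small_to_clades(clade_centroids, small_centroids)
```

Reserving expression-based assignment for the small subpopulations keeps the clade definitions anchored on the larger subpopulations, whose network structures are more stable.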
Fig 12.
Visualization of PAC vs. PAC-MAN results for blood, bone marrow, colon, inguinal lymph node, and liver samples.
The PAC (explorative clustering) and PAC-MAN (data-level cellular states) results are presented for each sample in column-wise fashion. Each tissue sample’s t-SNE plots were generated using 100,000 randomly drawn cell events for that sample. The results from the PAC (top panel) and PAC-MAN (bottom panel) steps are presented in pairs. Initial PAC discovery was set to 50 subpopulations without advance knowledge of the number of subpopulations in each sample. In MAN, 130 network clades (the optimal number from elbow point analysis) were output, and the cellular states are defined by expression (marker signal), network structure, and dataset-level variation. This composite definition of cellular state naturally aggregates the PAC clusters to yield a smaller number of subpopulations in less variable samples. S11 Fig is a higher resolution version of Fig 12 with subpopulation and clade labels.
Fig 13.
Visualization of PAC vs. PAC-MAN results for lung, mesenteric lymph node, spleen, thymus, and small intestine samples.
The settings and descriptions are the same as those in Fig 12. Continuation of visualization of PAC-MAN results for the mouse tissue data. S12 Fig is a higher resolution version of Fig 13 with subpopulation and clade labels.
Fig 14.
Heatmap of clade proportions across the tissue samples.
Sample-specific clades have a value of 1, while shared clades have proportions spread across different samples. Physiologically similar samples share more clades. S13 Fig is a higher resolution version of Fig 14 with clade labels.
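One way to obtain such a matrix is to normalize each clade's counts across samples so that every row sums to 1. The sketch assumes the counts are numbers of cell events per clade and sample, which the caption does not state; the function name and the toy numbers are hypothetical.

```python
def clade_proportions(counts):
    """Normalize per-clade counts so each clade's proportions sum to 1.

    counts: dict clade -> dict sample -> count (assumed to be cell events).
    """
    result = {}
    for clade, per_sample in counts.items():
        total = sum(per_sample.values())
        if total:
            result[clade] = {s: n / total for s, n in per_sample.items()}
        else:
            result[clade] = {}
    return result

# Toy counts: c1 occurs only in blood, c2 is shared by blood and spleen.
counts = {"c1": {"blood": 10}, "c2": {"blood": 30, "spleen": 10}}
proportions = clade_proportions(counts)
```

Under this normalization a sample-specific clade has a single entry equal to 1, while a shared clade spreads its mass across the samples that contain it.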
Fig 15.
Heatmap of average subpopulation expression levels in all tissue samples.
The expression heatmap illustrates the average expression of PAC-MAN-discovered subpopulations. The subpopulations are grouped by hierarchical clustering, and subpopulations close in expression space are organized into blocks. S14 Fig is a higher resolution version of Fig 15 with clade labels.
Fig 16.
The constellation plot is designed to visualize both the expression and clade information of discovered subpopulations. Here, the centroids (average expression) of PAC-MAN-discovered subpopulations in the example tissue dataset are projected onto a t-SNE 2D space. Clades that only occur in one sample are colored grey. The non-grey clades occur in at least two samples, with unique colors and clade identification denoting each clade. On the constellation plot, the closest multi-sample clade subpopulations are connected by a straight line.
Fig 17.
Network structures of Clade 2: B cells.
In each network figure, the markers are denoted by nodes of different colors. These networks show the top edges that define the network structures of the subpopulations in clade 2. The network structures show that certain markers, such as B220 and MHCII, are hubs across most, if not all, networks in this clade. This hub combination is consistent and unique to clade 2 (see Fig 18).
Fig 18.
Network structures of Clade 8: CD4+ T cells.
The setup is the same as in Fig 17. The network structures of the subpopulations in clade 8 show that certain markers, such as CD3 and CD4, are hubs across most, if not all, networks in this clade. This hub combination is consistent and unique to clade 8.