Analyzed the data: SR. Contributed reagents/materials/analysis tools: SR. Wrote the paper: SR SP. Conceived the project: SP. Designed the software used in analysis: SR.

The authors have declared that no competing interests exist.

In recent years, intense research efforts have focused on developing methods for automated flow cytometric data analysis. However, while designing such applications, little or no attention has been paid to the human perspective that is absolutely central to the manual gating process of identifying and characterizing cell populations. In particular, the assumption of many common techniques that cell populations could be modeled reliably with pre-specified distributions may not hold true in real-life samples, which can have populations of arbitrary shapes and considerable inter-sample variation.

To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape creates a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. In the second step, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrated different applications of our framework to flow data analysis and showed its superiority over other analytical methods.

The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.

Flow cytometry is one of the most commonly used platforms in clinical and research labs worldwide. It is used to identify and characterize types and functions of cell populations in a sample by measuring the expression of specific proteins on the surface and within each cell. In recent years, intense research efforts have focused on automated analysis of flow cytometric data, especially for cell population identification

Flow cytometric data consists of per cell measurements (or

While a variety of new algorithms have been proposed to automate gating, in general they have some important limitations. Often these algorithms use statistical clustering approaches that model cell populations as distributions of points which are assumed to have a certain pre-specified form, e.g. Gaussian kernels

A serious drawback of most of the current automated gating methods is that they almost entirely ignore the key aspects of human perspective and intuition that guide the manual gating process. Clearly the task of gating relies on expert understanding of the underlying biology of the experiment - in terms of both design and outcome - as well as the different factors involved such as markers, dyes as well as the instrument under consideration. While machine learning techniques have traditionally been employed for understanding tasks that involve human faculties such as visual perception that guides the gating process, we believe a mathematically intuitive and syncretic approach may be better suited to address both of the above limitations and thereby improve automated gating. To emulate the subjective yet often quite reliable gating steps as executed by a trained human expert, an algorithmic framework must first be able to mathematically represent the flow data in terms of a “global” perspective, and then identify the more complex and inter-connected population structures therein. With the right mix of precision and intuitive flexibility, such a framework can best serve the needs of a number of problems in cytometric data analysis.

We present flowScape, a new computational framework to automate gating by emulating the human perspective. To achieve this, flowScape follows four steps: (a) mapping the data landscape with modal clusters, (b) building a hierarchical structure connecting the modal clusters, (c) performing ridgeline analysis to isolate the populations, and (d) constructing flexible, sample-specific templates to automate gating. Thus flowScape is designed to capture the best of two worlds: the inferential properties of model-based clustering and the flexibility of non-parametric techniques. Below we describe these steps in further detail. It begins with (a) a novel mapping of the multi-dimensional data landscape of a given flow cytometric sample, which creates a global overview of the data. However, this overview is created with precision and rigor by characterizing regions in the landscape in terms of varying densities of points. These regions could be of arbitrary shapes but each of these is concentrated around a

Notably, the modal clusters of flowScape are high-dimensional and unrestricted in shape. This offers flowScape a unique opportunity to improve the automation of the gating process. The modal clusters are used by flowScape to (d) construct dynamic, sample-specific templates for detecting populations not by their absolute coordinates but the corresponding congregation of events. Taking a semi-supervised approach, flowScape enables the user to construct templates of target populations in a training set of samples. Then those templates can be applied to new batches of samples to automatically identify the analogous features – in terms of their densities and not rigid locations – in a flexible, sample-specific manner. This capacity of flowScape generalizes gating and supports automated analysis of large cytometric cohorts. Similarly, flowScape may be useful for many common applications such as determining the optimal data transformation per flow channel, gating of live cells and lymphocytes, etc. For demonstration, we applied flowScape to multiple flow cytometric data sets, both published and newly generated, and also illustrated its advantages over other existing methods.

We describe the methodology used in flowScape both as a general algorithm as well as in terms of particular applications in flow data analysis.

Our formal approach to map the landscape in a multi-dimensional space of flow events utilizes two statistical concepts: a

In flowScape, we begin with the construction of the density of the data landscape for a given sample, and then determine the modes of that function. Although mode-counting or mode hunting has been extensively used as a clustering technique (see

For convenience, we outline the steps of the clustering algorithm using a multivariate Gaussian kernel with covariance

Let the set of data to be clustered be

As the clusters are formed by associating the observations with their corresponding modes, we call this procedure Mode Association Clustering (MAC). Any covariance structure for the kernel, be it Gaussian or otherwise, can be used to construct the modal cluster. The use of a Gaussian kernel in our algorithm is motivated primarily by the computational simplicity it provides (for details see Li et al.
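The idea of associating each observation with a mode can be illustrated with a minimal one-dimensional sketch. It assumes a Gaussian kernel with a single scalar bandwidth and uses a plain mean-shift style ascent as a stand-in for the ascent procedure referenced above; the function names, the tolerances and the toy data are hypothetical and are not taken from flowScape.

```python
import math

def gaussian_weights(x, data, bandwidth):
    """Unnormalised Gaussian kernel weight of every data point as seen from x."""
    return [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]

def ascend_to_mode(x, data, bandwidth, tol=1e-6, max_iter=500):
    """Move x uphill on the kernel density estimate until it stops moving.

    Assumption: a mean-shift update (kernel-weighted average of the data)
    is used here as an illustrative ascent rule.
    """
    for _ in range(max_iter):
        w = gaussian_weights(x, data, bandwidth)
        new_x = sum(wi * d for wi, d in zip(w, data)) / sum(w)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def mode_association_clustering(data, bandwidth, merge_tol=1e-2):
    """Label each point by the mode that its ascent path converges to."""
    modes, labels = [], []
    for x in data:
        m = ascend_to_mode(x, data, bandwidth)
        for k, known in enumerate(modes):
            if abs(known - m) < merge_tol:   # same hilltop as a known mode
                labels.append(k)
                break
        else:                                # a hilltop not seen before
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Toy data: two well-separated groups of events (illustrative values only).
events = [-0.1, 0.0, 0.1, 0.2, 4.9, 5.0, 5.1, 5.2]
modes, labels = mode_association_clustering(events, bandwidth=0.5)
```

No parametric shape is imposed on the clusters in this sketch: a cluster is simply the set of points whose ascent paths end on the same hilltop of the estimated density.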

Without assuming any specific parametric form for the cluster densities, our MAC approach is more robust to unusual shapes and features (such as non-Gaussian tails) than robust parametric clustering methods such as multivariate skew normal/t mixture models proposed recently by Lin

The notion of a “meaningful” population in a human expert’s understanding is often more complex than a simple isolated cluster of events. In flowScape, we address this complexity by enhancing the MAC procedure with a hierarchical framework to enable multiscale or multi-level resolution that we believe is better suited to emulate the nuanced human perspective. The hierarchical MAC procedure (called HMAC), and indeed any multiscale data analysis technique, presents an exciting new research area in statistics

We note that when the bandwidth

Let the clustering of samples obtained at bandwidth level

Importantly, while the number of objects being clustered reduces as we move up the hierarchy, the density estimator is always formed using all the original data samples, which has distinct advantages. Notably, HMAC differs from the traditional linkage-based hierarchical clustering, which also builds a hierarchy of clusters, in an important manner. In the linkage-based methods, only the two clusters with the minimum pairwise distance are merged, and the hierarchy is constructed as a sequence of such pairwise greedy merges, which are based on local comparisons. The lack of global analysis can result in skewed clusters (or “chain” sequences). In contrast, the merging of clusters in every level of HMAC is determined by a global criterion such that the contribution of every original data point on the overall clustering is retained through the density function
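A minimal one-dimensional sketch of this hierarchical scheme is given below, under the assumption that the bandwidths are supplied in increasing order and that a mean-shift style ascent is an acceptable stand-in for the mode-finding step. At each level only the modes of the previous level are moved, while the density is always built from all the original data. The names `ascend` and `hmac_levels` and the toy values are hypothetical.

```python
import math

def ascend(x, data, h, tol=1e-6, max_iter=1000):
    """Mean-shift style ascent on a Gaussian kernel density built from ALL data."""
    for _ in range(max_iter):
        w = [math.exp(-0.5 * ((x - d) / h) ** 2) for d in data]
        nx = sum(wi * d for wi, d in zip(w, data)) / sum(w)
        if abs(nx - x) < tol:
            return nx
        x = nx
    return x

def hmac_levels(data, bandwidths, merge_tol=1e-2):
    """Return one list of cluster labels per bandwidth level.

    Assumption: bandwidths are given in increasing order, so clusters can
    only merge (never split) when moving up the hierarchy.
    """
    reps = list(data)        # current representative (mode) of every point
    levels = []
    for h in bandwidths:
        modes, labels, new_reps = [], [], []
        for r in reps:
            m = ascend(r, data, h)           # density uses the original data
            for k, known in enumerate(modes):
                if abs(known - m) < merge_tol:
                    break
            else:
                modes.append(m)
                k = len(modes) - 1
            labels.append(k)
            new_reps.append(modes[k])        # the mode becomes the new object
        reps = new_reps
        levels.append(labels)
    return levels

# Toy data with structure at three scales (illustrative values only).
events = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4, 20.0, 20.2]
levels = hmac_levels(events, bandwidths=[0.3, 2.5, 15.0])
```

Because points that share a representative at one level are moved together at the next, the clusterings in this sketch are nested by construction.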

After preparing the above methodology for mapping a generic multi-dimensional data landscape, we adapted it for specific applications in flow data analysis. One such application is the use of

In low dimensions, flowScape uses ridgelines to provide an insightful representation of the overall landscape of flow data as fitted by the hierarchically structured modal clusters. Notably, by setting thresholds in the altitude (or “dip”) of a ridgeline at a particular level (or scale in the hierarchy), the user can separate and extract complex features easily and objectively. This, in fact, extends the human capacity since the user can now specify the level at which the population separation sufficiently matches her intuition. We can generalize this capability even further by allowing such thresholds to be user-specified “knobs”, so that flowScape can construct flexible templates to identify a collection of robust and suitable features in a semi-automated manner. By the nature of construction, such features can very effectively capture populations with unusual shapes or tails that may vary from sample to sample. Notably, they can be defined in relative terms, as opposed to only absolute population parameters (like physical location). The entire procedure can be regulated using visual feedback from density-based coloring at each point of the ridgelines. Interestingly, our ridgeline-based feature extraction procedure can also be performed in high dimensions.

In summary, modal clustering and its corresponding ridgeline analysis allow flowScape to exploit the geometry of a probability density function in a nontrivial manner. The steps of clustering can be conducted in accordance with our geometric heuristics, as described below. In particular, every modal cluster should be associated with a “hill”, and every point in a cluster can be moved to the corresponding hilltop along an ascending path without crossing the “valley” that separates two hills. Finally, by tracking the ridgeline between two peaks, the way in which two hills separate from each other can be measured and charted out, enabling diagnostics of our clustering results and also any adjustment of our clustering output as might be required for a particular flow data analysis.
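The role of the “dip” between two hilltops can be illustrated with a simplified sketch. It evaluates a Gaussian kernel density along the straight segment joining two modes, which is only a stand-in for a true ridgeline (a ridgeline need not be straight), and compares the lowest altitude on that path with the lower of the two peaks. The 0.5 threshold, the function names and the toy data are assumptions made for illustration.

```python
import math

def kde(point, data, h):
    """Gaussian kernel density estimate at a 2-D point (isotropic bandwidth h)."""
    total = 0.0
    for d in data:
        sq = (point[0] - d[0]) ** 2 + (point[1] - d[1]) ** 2
        total += math.exp(-0.5 * sq / h ** 2)
    return total / (len(data) * 2.0 * math.pi * h ** 2)

def segment_profile(mode_a, mode_b, data, h, steps=50):
    """Density at evenly spaced points on the segment from mode_a to mode_b.

    Assumption: a straight segment is used as a simplified path between peaks.
    """
    profile = []
    for i in range(steps + 1):
        t = i / steps
        p = (mode_a[0] + t * (mode_b[0] - mode_a[0]),
             mode_a[1] + t * (mode_b[1] - mode_a[1]))
        profile.append(kde(p, data, h))
    return profile

def dip_ratio(profile):
    """Lowest altitude on the path relative to the lower of the two end peaks."""
    return min(profile) / min(profile[0], profile[-1])

def separable(profile, threshold=0.5):
    """Hypothetical user 'knob': call two hills separate if the dip is deep enough."""
    return dip_ratio(profile) < threshold

# Two small blobs of events, four units apart (illustrative values only).
blob = [(0.0, 0.0), (0.2, 0.0), (-0.2, 0.0), (0.0, 0.2), (0.0, -0.2)]
events = blob + [(x + 4.0, y) for x, y in blob]
narrow = segment_profile((0.0, 0.0), (4.0, 0.0), events, h=0.5)
wide = segment_profile((0.0, 0.0), (4.0, 0.0), events, h=3.0)
```

With the narrow bandwidth the path drops into a deep valley and the two blobs are reported as separate. With the wide bandwidth there is no valley, which mirrors how populations merge at coarser levels of the hierarchy.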

One of the key practical problems in flow data analysis, especially in the context of manual gating, is to ensure an optimal display of fluorescence intensities for different markers. Typically such marker distributions are log-normal, and thus log transformation is used for normalizing the data for visualization. While log_{10} transformation has been the norm in flow analysis, more recently other options have been considered for addressing several important issues on this topic

Logarithmic transformation, however, is not defined on non-positive points, and therefore flow data displays quite often show a “log artifact” in which there is an artificial pile-up of points on the baseline. To address this, alternatives to log-scale displays, which nevertheless preserve many of the desired characteristics of log transformation, have been proposed. In general, a linear scaling is applied to the low end populations for spreading those events away from 0 at a rate faster than log-transformation. For points already farther away from 0, log-transformation is used. Such linear-log type transformations are usually symmetric around 0, applicable to negative values, and they smoothly transition from the faster linear spread to the gentler logarithmic for higher intensities (see Novo and Wood
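As a concrete illustration of such linear-log behaviour, the sketch below uses the inverse hyperbolic sine with a cofactor, one of the transformations considered later in this paper. It is defined for negative values, is symmetric around 0, behaves like `x / cofactor` near 0 and like `log(2x / cofactor)` for large intensities. The cofactor of 150 and the helper names are illustrative assumptions, not recommended settings.

```python
import math

def arcsinh_transform(value, cofactor):
    """Inverse hyperbolic sine scaling of a raw intensity."""
    return math.asinh(value / cofactor)

def linear_part(value, cofactor):
    """Approximation that holds close to zero."""
    return value / cofactor

def log_part(value, cofactor):
    """Approximation that holds far from zero (positive values only)."""
    return math.log(2.0 * value / cofactor)

cofactor = 150.0  # illustrative value only
for raw in (-300.0, -1.5, 0.0, 1.5, 300.0, 150000.0):
    print(raw, round(arcsinh_transform(raw, cofactor), 4))
```

A smaller cofactor narrows the region around 0 that is treated linearly, which is why its choice matters most for clusters located near the baseline.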

While transformations such as bi-exponential (e.g. logicle by Parks et al.

To systematically address the problem of optimizing data transformation (ODT), we applied landscape mapping based on a new procedure flowScape.ODT. Unlike many flow analysis methods that rely on Gaussian densities and kernels for identifying populations, flowScape.ODT uses the more robust HMAC algorithm. There are two major advantages of this approach. First, untransformed data may not originally have Gaussian-like populations and thus may not conform to Gaussian models. Being free of the normality assumptions, flowScape.ODT can still identify these populations in the form of dense regions in the mapped landscape with precision. Further, it actually allows flowScape to utilize normality properties of the modified populations as statistical criteria for determining when a transformation has reached optimality. Indeed we combined multiple such criteria to test different aspects of what may be considered a “well-rounded cluster” such as unimodality, skewness and kurtosis. Clearly such determination would be either infeasible or redundant had we used Gaussian distributions in the first place for identifying the intermediate, not-yet-normal populations during the transformation process.
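To make the idea of a normality-based stopping criterion concrete, the following sketch scores a cluster after an Arcsinh transformation by its sample skewness and excess kurtosis and keeps the candidate cofactor with the lowest score. This is only an illustration under stated assumptions: the additive score, the candidate grid and all names are hypothetical, the unimodality test is omitted, and the synthetic cluster is constructed so that a cofactor of 100 makes it approximately normal.

```python
import math
from statistics import NormalDist

def skewness(values):
    """Sample skewness (third standardised moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Sample kurtosis minus 3, so a normal sample scores near 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

def normality_score(values):
    """Hypothetical combined criterion: smaller means closer to normal."""
    return abs(skewness(values)) + abs(excess_kurtosis(values))

def best_cofactor(cluster, candidates):
    """Pick the candidate cofactor whose transformed cluster looks most normal."""
    scored = []
    for c in candidates:
        transformed = [math.asinh(v / c) for v in cluster]
        scored.append((normality_score(transformed), c))
    return min(scored)[1]

# Synthetic cluster around 0: normal quantiles pushed through 100 * sinh(.)
z = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
cluster = [100.0 * math.sinh(v) for v in z]
chosen = best_cofactor(cluster, [10.0, 100.0, 1000.0])
```

In this toy case a cofactor that is too small splits the cluster around 0 into two lobes, while one that is too large leaves it heavy-tailed and peaked. Both failures show up in the kurtosis term.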

Our approach minimizes the redundancy in modifying the populations by observing that the rate of dispersion of points due to a log-like transformation gradually slows down away from 0. In other words, the choice of cofactor becomes increasingly less important for populations with high mean, i.e. the ones further away from the baseline. Hence, the criterion for an optimal transformation should primarily be concerned with any cluster that is located around 0 (besides the additional aim of removing the negative clusters, if any). Further, as noted in Parks et al.

The flowScape.ODT procedure is based on the following steps:

Based on the sample’s landscape map for a given marker, flowScape identifies whether there is any cluster with a significant proportion (

The data are iteratively transformed with different values of the relevant parameter (such as increasing the Arcsinh cofactor) until there is no negative cluster – in other words, negative clusters are removed via transformation.

The data are transformed with new values of the relevant parameter (e.g. cofactor) until the

Once the optimal argument for the cofactor

Based on the above steps the algorithm is given by:

The above rules or guidelines can be easily fine-tuned according to one’s domain knowledge (for instance, the stains can influence one’s choice of cofactors) and understanding of the generated data (such as the effect of compensation for a specific dye). Thus, for instance, the baseline

We now describe an algorithm that is suitable for automating manual gating. The map of the data landscape, as done by flowScape, can be a natural representation to capture the intuition behind manual gating since the populations could be viewed as dense regions of arbitrary shapes, sizes and locations spread over this landscape. For the purpose of applying our landscape-based approach in batch mode, we designed a new procedure flowScape.DTG. In the first step of the procedure, we construct a flexible template for one or more target populations in a sample. To support the flexibility, flowScape.DTG allows template specifications based on a mix of relative and absolute population characteristics. The templates are constructed by running our hierarchical modal clustering framework on some representative or training samples as supplied by the user. Subsequently, that learnt template is used to guide the identification and extraction of the corresponding target populations from a large batch of samples in a fully automated manner. Below we demonstrate the application by gating populations of live cells and lymphocytes.

Our new procedure offers several novel features to tackle the problem of automated gating. First, we designed flowScape.DTG as a generic pattern-recognition procedure which can be used for extracting any subset of points – not just live cells or lymphocytes – that is identifiable in terms of either relative or absolute (or a mix of both) characteristics on the data landscape. Second, the unique advantage of flowScape lies in its use of hierarchical ordering of modal clusters, which allows it to isolate even complex populations with overlapping features that are otherwise much harder to demarcate automatically. The resulting template could thus be robust yet free of modeling constraints. Third, the templates could be specified in relative terms such as, “the

For the live gating example, we constructed our template based on the assumption that the live cell population is distributed in the

Although here we have described the application using only two dimensions (e.g.

The Treg data were originally generated and described in Maier et al.

| Variable | CD4 | HLADR | CD25 | Foxp3 |
| Cofactor | 3.5 | 3 | 3 | 3 |

Compensation control data were generated by staining a 1∶1 mixture of positive (anti-Mouse Ig

The LCL data were originally generated and described in Choy et al.

We show the distribution of Treg events after applying logicle transformation based on marker-specific optimal parameters computed with flowScape. The optimal arguments are shown in

The GvHD data were originally generated and described in Brinkman et al.

The distributions of CompControl events after Arcsinh transformation based on different values of the cofactor are shown. The cofactor values that satisfied our tests were 2500 and 1000. For these values, we see that there is no spurious splitting of the 0-cluster, which produces distinctive negative clusters for cofactors less than 1000. On the other hand, for cofactors greater than 2500, the 0-clusters are clearly spiky. In contrast, the 0-cluster for the cofactor values optimized according to flowScape normality criterion is neither too peaked nor too flat. Thus flowScape addressed both problems of over- and under-transformation of data.

Flow cytometry has been among the most popular platforms in research and clinical labs around the world for several decades, yet only recently has computational cytomics started to receive major attention from analytical scientists

We demonstrate live gating on a representative LCL sample using flowScape. (a) The sample is shown as a scatterplot in terms of forward and side scatters. (b) Using flowScape, we map the data landscape and determine the ridgeline (red curve) for the sample, as shown in 3-D. The ridgeline connects every modal cluster in the multi-dimensional data by traversing the terrain from peak to peak across slopes and valleys in terms of data density, thus providing a systematic hierarchical description of the sample using the landscape map. (c) The ridgeline (here shown as blue/yellow curve for dense/sparse regions) can therefore be used for objective extraction of relatively denser concentrations of events. A dip in the ridgeline (red asterisks) can guide the demarcation of cell subpopulations that are otherwise hard to isolate with automated clustering. Thus flowScape can offer the unique advantages of human intuition without paying the cost of associated subjectivity. (d) The final live gating results of flowScape are shown as 2 major populations in blue (live cells) and red (dead), after removing points at the extremity (around bin 1000). Clearly these clusters have non-elliptical shapes that could not be captured by many of the common clustering methods.

Through mapping of flow data landscape with hierarchical modal clustering and using algorithmic devices like ridgeline analysis and flexible templates, flowScape emulates the congregation-oriented view of data densities, which is free of pre-specified constraints on population shape. Based on the hierarchical representation, it also reflects the “zoom-in/zoom-out” approach of the human perspective. In future work, we want to create a semi-automated tool to implement the same approach with extensive interactive features.

In the left panel we show the scatterplots of two LCL samples in terms of forward- and side-scatters. Owing to the inter-connected nature of the distributions, extraction of the live cell population is difficult via automation. Using modal clustering and ridgeline analysis, flowScape provides algorithmic means to separate and extract the populations based on locations where the altitude of the ridgeline dips while moving from one peak to another, as marked with red asterisks. The ridgeline is colored according to its altitude at each coordinate.

In this section, we demonstrate the use of flowScape.ODT to determine the optimal transformations of two datasets: Treg and CC.

We compared the results of lymphocyte gating for two representative samples (s6a06, s6a07 – the last two time points for Patient 6 in GvHD data). For both samples, we ran 2 well-known methods, flowCore and SamSpectral, and flowScape to automatically identify the lymphocyte populations (as defined in Ellis et al.

When we applied the default logicle transformation to Treg data, for each of the markers Foxp3, CD25 and HLADR, we observed a significant “negative cluster” (

The distributions resulting from the optimally transformed Treg data are shown in

Next we applied the flowScape.ODT procedure to the CC dataset. Here we have just one variable corresponding to two artificial populations of a 1∶1 mixture of positive and negative compensation control beads stained with PerCP-Cy5.5. The cytometer settings placed the center of the distribution near zero, making this an excellent example for issues with events near and below 0. The data were transformed according to the Arcsinh (i.e. inverse hyperbolic sine) function over a range of values of cofactor

We applied the automated gating procedure flowScape.DTG for live gating of LCL data. Here we demonstrate the results using a representative sample. (The full set of results is available from the authors upon request.) The data dimensions that we used for live gating are forward and side scatters. The

First, we mapped the

As noted, unlike other live gating approaches

We applied flowScape.DTG’s dynamic, sample-customized gating methodology to GvHD data. In principle, the clustering and the ridgeline analysis steps of lymphgating are similar to livegating except for the different definitions of the dynamic template in each case. For instance, for lymphgating, we defined the lymphocyte template as the population whose mode is the second farthest from the origin in terms of Euclidean distance in
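A template expressed in relative terms, such as “the population whose mode is the second farthest from the origin”, can be sketched as a simple ranking over the modes found in each sample. The helper below is a hypothetical illustration: it assumes the modes have already been obtained by modal clustering, and the coordinates are invented for the example.

```python
import math

def distance_from_origin(mode):
    """Euclidean distance of a mode (any number of dimensions) from the origin."""
    return math.sqrt(sum(c * c for c in mode))

def select_by_rank(modes, rank):
    """Index of the mode with the given distance rank (rank 1 = farthest).

    Assumption: the template is specified only by this relative rank,
    not by absolute coordinates.
    """
    order = sorted(range(len(modes)),
                   key=lambda i: distance_from_origin(modes[i]),
                   reverse=True)
    return order[rank - 1]

# Modes of two hypothetical samples in (forward scatter, side scatter);
# the target population sits at different absolute locations in each.
sample_a = [(50.0, 30.0), (400.0, 350.0), (200.0, 150.0)]
sample_b = [(260.0, 180.0), (70.0, 20.0), (430.0, 410.0)]
target_a = select_by_rank(sample_a, rank=2)
target_b = select_by_rank(sample_b, rank=2)
```

The same rule picks a different cluster index and a different absolute location in each sample, which is the sense in which such a template is sample-specific.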

The flexibility of the flowScape.DTG templates allows highly robust automated detection of cell populations, even in the presence of platform noise, high inter-sample variation, sparse or diffuse populations, etc. To illustrate this point, we selected two consecutive time-points measured in the same patient from the GvHD dataset (s6a04, s6a05), and applied flowScape.DTG as well as other methods (

Understanding the human perspective in thinking about and making sense of visual information, as in the steps of manual gating, is a complex problem. When a flow cytometry analyst visualizes the data, a complex interplay between human intuition and technical understanding (both biological and mechanical) is brought into action. While such insight may be difficult, if not impossible, to reproduce outside the human mind, we can try to emulate certain aspects of it via automation. For instance, the zooming in/out approach could be captured with a data representation that has multi-level resolution. Toward this, flowScape uses the notion of a modal cluster to offer a congregation-oriented view of the data landscape. The resulting map of the data landscape uniquely emulates the global overview of a human analyst, but it does so with a mathematically rigorous density function. Then we use a bottom-up hierarchical representation of the modal clusters to mimic the manual construction of complex structures at multi-level resolution. Thus we try to capture a certain amount of the subjectivity of the human perspective, and the strength it brings to manual flow data analysis, via our objective means. Finally, we extended the manual gating capacity with our novel flexible, sample-specific templates for extracting features of interest which may have unusual shapes and distributions and are possibly difficult to isolate using other computational methods.

The authors thank J. M. Irish for helpful discussions and for providing the compensation control data.