Analysis of Flow Cytometry Data by Matrix Relevance Learning Vector Quantization

Flow cytometry is a widely used technique for the analysis of cell populations in the study and diagnosis of human diseases. It yields large amounts of high-dimensional data, the analysis of which would clearly benefit from efficient computational approaches aiming at automated diagnosis and decision support. This article presents our analysis of flow cytometry data in the framework of the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukemia (AML) Challenge, 2011. In the challenge, example data was provided for a set of 179 subjects, comprising healthy donors and 23 cases of AML. The participants were asked to provide predictions with respect to the condition of 180 patients in a test set. We extracted feature vectors from the data in terms of single marker statistics, including characteristic moments, median and interquartile range of the observed values. Subsequently, we applied Generalized Matrix Relevance Learning Vector Quantization (GMLVQ), a machine learning technique which extends standard LVQ by an adaptive distance measure. Our method achieved the best possible performance with respect to the diagnoses of test set patients. The extraction of features from the flow cytometry data is outlined in detail, the machine learning approach is discussed and classification results are presented. In addition, we illustrate how GMLVQ can provide deeper insight into the problem by allowing to infer the relevance of specific markers and features for the diagnosis.


Introduction
We present in this article our main results obtained in the context of the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukemia Challenge 2011 [1][2][3]. This challenge was organized in a joint effort by the Dialogue for Reverse Engineering Assessments and Methods (DREAM) project [3][4][5][6] and the Flow Cytometry: Critical Assessment of Population Identification Methods (Flow-CAP) initiative [2].
Flow cytometry constitutes a powerful technique which is widely used in medical research and clinical practice for the study and diagnosis of various diseases [7]. Flow cytometry measurements typically yield a quantitative description of several tens or even hundreds of thousands of cells in a given sample. Light scatter and fluorescence properties are used to identify deviations from normal cell size or structure and to quantify functional properties in terms of, e.g., protein marker expressions [7,8]. The amount of available data, its high dimension, and the complexity of the diagnosis tasks trigger a significant interest in systems for automated analysis and decision support.
Along these lines, the DREAM6/FlowCAP2 challenge addressed the analysis of given flow cytometry data, representing peripheral blood and bone marrow samples of, in total, 359 subjects. Some of these corresponded to cases of Acute Myeloid Leukemia (AML) and the ultimate goal was to predict the condition of a number of patients whose diagnosis was unknown to the participants. Hence, the goal of the challenge could be formulated as a machine learning problem: From the given example data with known diagnoses, criteria were to be inferred which then allowed for the classification of the test samples.
We extracted feature vectors from the data in terms of a few characteristic quantities, summarizing the statistics of the observed marker values. Predictions were obtained by means of a specific machine learning technique termed Generalized Matrix Relevance Learning Vector Quantization (GMLVQ) [9][10][11]. This prototype based method extends standard Learning Vector Quantization [12,13] by using Adaptive Distance Measures in Relevance LVQ, which motivated the acronym and team name Admire-LVQ. In the challenge, our team achieved the best possible performance with respect to the required test set prediction.
In the following section a description of the data set and our analysis is given. Thereafter we present and discuss our main results and the obtained prediction. We conclude with a brief outlook on possible extensions and future work.

Data Set and Analysis
In this section we first describe the extraction of features from the given data. The specific machine learning analysis based on Generalized Matrix Relevance Learning Vector Quantization is outlined. Furthermore, its validation in terms of the given data set is discussed.
The data set provided in the challenge comprised 359 subjects. For each of these, a varying number of cells, on the order of a few thousands, had been analysed by means of flow cytometry, see [2,3] for details. The first 179 subjects served as the training data; label information S m [ f1,2g, m~1,2, . . . ,179 was provided, specifying 23 subjects as AML patients (S m~2 ). The remaining 156 subjects are referred to as healthy donors (S m~1 ) throughout this contribution. Note that the latter group of subjects includes a number of patients with a diagnose different from AML [14].
The task was to predict the diagnosis with respect to a test set of 180 subjects for which no label information was provided. The total number of AML cases in the test set, 20, was also disclosed to the participants. However, this information was not exploited in our approach. We have analysed the transformed and compensated flow cytometry data as provided by the organizers of the challenge [2,3]. In our analysis we omitted the non-specific isotope control data representing non-human binding antibodies, which corresponds to tube 8 in the data set [3].
In clinical practice, a possible workflow is to sort cells according to a small number of gating variables in a first step, identifying potentially degenerate or immature cells. Subsequently, the selected cells are analysed according to the remaining markers, aiming at a reliable diagnosis and potential identification of the AML subtype [7,8]. In our approach we follow a simpler, more direct strategy in which we omit cell specific information. After visual inspection in terms of histograms we decided to represent the data by a limited number of statistical characteristics per patient and marker. Moreover, we took into account all markers at once in order to assign each subject to one of the two classes in a single processing step.

Feature Extraction and Normalization
A key step in the design of a classifier in this challenge was the extraction of appropriate features from the provided data. The data corresponding to tubes 1-7 represents 31 characteristic quantities per cell: the so-called Forward Scatter on linear scale (FS Lin), the Sideward Scatter on logarithmic scale (SS Log), and 29 fluorescence intensities on logarithmic scale quantifying the expression of various surface proteins. All of these quantities are referred to as markers in the following. Table 1 lists the considered markers and the index j~1,2 . . . 31 which we refer to in the analysis.
Note that the potential gating markers FS Lin, SS Log, and CD45-EDC were provided for all cells in the data set. The other 28 markers were measured in one tube only, representing a subpopulation of cells per subject. We rescaled all markers by the respective largest possible value as to limit all observations to the interval ½0,1.
FS Lin can be interpreted as a measure of cell size, while SS Log roughly quantifies intracellular granularity [7]. Note furthermore that the expression of IgG1 was measured by means of four different binding antigens. In our analysis, however, the corresponding values were treated as four independent markers (j~4, 11,18,25), formally.
For the purpose of a first, visual inspection, we computed histograms corresponding to the frequency of marker values in the training set. Figures 1 and 2 display histograms of 4 example markers: FS Lin (j~1), SS Log (j~2), CD45-EDC (j~3), and CD10-PC7 (j~31) for one patient per class (m~100 and103). The main purpose of Figures 1 and 2 is to illustrate the extraction of feature vectors from the sample data which is described in the following.
For each patient and marker a varying number M of cell measurements, typically a few thousands, were made available. In our analysis, we did not make use of cell specific information, as it is done frequently in terms of a so-called gating procedure in clinical practice [7,8]. We extracted information only on the level of single marker statistics over the entire population of cells. A direct classification of histograms using, for instance, entropic distance measures or statistical divergences would be feasible here [15,16]. We resorted, however, to reducing the information to only six quantities per marker which summarize the characteristics of the corresponding histogram. We denote by v m j (m) the value measured for marker j (1ƒjƒ31) in individual cell m (1ƒm ƒM m ) of patient m. From the available data we determined the following quantities:  is a measure of the asymmetry, with positive values indicating that more weight is contained in the left side of the histogram. The kurtosis quantifies how sharply peaked a histogram is. Note that in the above defined kurt, sometimes termed excess kurtosis in the literature, a constant 3 is subtracted yielding kurt~0 in case of normal densities. Hence we obtained, for each patient m, a set of 6 quantities per marker. A particular subject was subsequently represented by the concatenated vector x m [ R N of N~186 characteristic features. As one example, the skewness of marker 17 (CD19-PE, see Table 1) observed for patient 42 corresponds to component In the training processes described in the following, we applied an additional z-score transformation: Given a (sub-)set of P training examples we computed for i~1,2, . . . ,186 the quantities and rescaled all features in training, validation or test data by subtracting the mean x x i and subsequently dividing by s i . Consequently, the transformed features display zero mean and unit variance in the actual training set. While the transformation did not affect the classification performance, it enhances the interpretability of the results, in particular with respect to the relevance matrix, see below.

Matrix Relevance Learning Vector Quantization
We employed Generalized Matrix Relevance Learning Vector Quantization (GMLVQ) for the analysis of the obtained feature vectors. This highly flexible and powerful variant of LVQ is described in detail in [9][10][11]. Here we employed the algorithm in its simplest setting with one prototype per class and a single, global relevance matrix as defined below.
The two classes, i.e. healthy donors (class 1) and AML patients (class 2), are represented by the prototype vectors w 1 ,w 2 [ R N , respectively. Given a particular z-score-transformed feature vector x [ R N representing one of the patients, its distance from the prototypes is determined as Figure 1. Example histograms and extracted features: FS Lin and SS Log. Histograms and extracted features correspond to one healthy donor (subject m~100, upper panels) and one AML patient (subject m~103, lower panels), respectively. Histograms display the frequency of a particular marker value for visual inspection. Six features are extracted per patient and marker, corresponding to mean, standard deviation, skewness, kurtosis, median, and interquartile range of the observed frequency of marker values, cf. Eq. (1). Here the first 12 components of the 186dim. feature vectors are displayed before z-score transformation. doi:10.1371/journal.pone.0059401.g001 Here L and V are N|N matrices and the specific parameterization of the distance guarantees non-negativity of the measure: In a simple Nearest Prototype Classification (NPC) scheme, a feature vector x is assigned to class 1 if d(x,w 1 )vd(x,w 2 ) and to class 2, else. While w 1,2 serve as typical representatives of the classes, elements L ij of the symmetric matrix L can be interpreted as to quantify the relevance of a pair of feature dimensions i,j in the classification scheme.
Both, prototypes and relevances, are determined in the same supervised training process. Given a set of P examples x m ,S m f g P m~1 with class labels S m [ 1,2 f g, training is guided by the minimization of the cost function [9,17,18] where the index J corresponds to the correct prototype with J~S m while K=S m identifies the wrong prototype. In general, the objective of training can be further specified by introducing a function W(e m ) in the cost function, e.g. a sigmoidal [17]. Here, we resorted to the simple case W(x)~x. Note that the contribution e m of a single example to the cost function satisfies {1ve m v1. It is negative if x m is classified correctly and its absolute value relates to the margin of the classification. Alternatively we refer to the closely related score s which is computed as A value s(x)&0 indicates that feature vector x is assigned to class 1, healthy donors, with high certainty. Large values close to s~1 signal confident classification as an AML patient (class 2). The NPC scheme can be reformulated as assigning vector x to class 1 if s(x)ƒ1=2 and to class 2 else. While the score may serve as a relative measure of certainty in GMLVQ, it should not be interpreted directly as a probability for AML. Note that any monotonically increasing function f : ½0,1?½0,1 could be used to transform s without modifying the actual ordering of patients according to s.
We implemented the iterative optimization of E, cf. Eq. (4), by means of a gradient descent procedure with respect to the adaptive quantities w 1 ,w 2 , and V. At iteration step t, updates along the normalized gradients and subsequent normalization of V were performed: with the time-dependent step sizes a w (t) and a V (t). The full form of the gradient terms is given in [9,11,19], for instance. We employed gradient descent with waypoint averaging and step size control, which has been introduced and described in greater detail in [19]: After a gradient step, Eq. (6), the achieved value of the cost corresponds to the position in search space, on average over the last T updates. The observation of Ê (tz1)vE(tz1) signals oscillatory behavior of the iteration. In this case, we set w i (tz1)~ w w i (t) and V(tz1)~ V V(t) and reduced the step sizes by a factor av1 : a w,V (tz1)~a : a w,V (t). All results presented here were obtained with parameters T~3 and a~2=3 in the waypoint averaging scheme. Initial step sizes were a w (0)~10=N for prototypes and a V (0)~100=N for matrix updates, respectively.
In the problem at hand, the obtained classification scheme and error rates turned out very robust with respect to the choice of these parameters. For a given training set, we initialized prototypes w 1,2 close to the corresponding class conditional means with small random deviations; similarly we chose the initial V close to the (N|N) identity matrix: where the Kronecker-Delta d mn~1 if m~n and d mn~0 if m=n.
The components of r [ R N and all elements of C [ R N|N were generated independently according to a uniform density U({0:1,0:1). Results were found to depend only very weakly on details of the initialization.

Validation
In order to evaluate the performance of the GMLVQ classifier before applying it to the test set, we employed a validation scheme based on randomized subsets of the available training data. In every run we selected ca. 3=4 of the data from each class randomly, i.e. 17 of the 23 AML examples and 117 of the 156 healthy donors. These P~134 example data were used for training the GMLVQ system while the remaining 45 served as a validation set. The random split of the data was repeated 50 times and, if not stated otherwise, results presented in this section were obtained on average over the validation runs. Figure 3 displays the averaged error rates of naïve Nearest Prototype Classification in the course of gradient based training. Note that an over-fitting effect was observed: Performing more than ca. 60 training steps decreased the error rates with respect to training examples to very low values. At the same time, however, validation set performance deteriorated. Closer inspection revealed that this effect was essentially due to patient m~116, listed as a case of AML in the training set. If contained in the validation set, this patient was consistently misclassified by the NPC scheme. On the contrary -if employed for training -the system achieved agreement with the label, eventually, but at the expense of an increased error rate in the validation set.
Based on this observation, we employed an early stopping strategy, terminating the training process after 40 gradient steps. When omitting patient m~116 from the training set or re-labeling the subject as healthy donor, the learning curves converged smoothly and overfitting was not observed anymore. Moreover, we obtained virtually the same classification, i.e. the same order of scores with respect to the test set patients in all these scenarios. The precise numerical results reported in the following section were obtained by means of the early stopping strategy including subject m~116 labelled as an AML case (S 116~2 ).
In addition to the error rates of the naïve NPC scheme we also evaluated the validation set performance in terms of the Receiver Operating Characteristics (ROC) [20]. By introducing a threshold H, the GMLVQ scheme can be biased with respect to one of the two classes: with the score s defined in Eq. (5) For thresholds in the range {1=2ƒHƒ1=2 we computed the corresponding class-wise error rates with respect to the validation set on average over the 50 training runs, yielding the threshold-averaged ROC curves [20] displayed in Figure 4. The ROC analysis revealed very high sensitivity (true positive rate) and specificity (1 -false positive rate) with respect to the validation set performance, the corresponding Area Under Curve being AUC~0:9935 [20]. In addition, removal of patient m~116 from the data set resulted in an almost perfect ROC with AUC~0:9996. Given the close to error-free classification we refrained from employing complementary performance measures such as precision/recall or other characteristics [20]. For the same reason, we did not compare the validation performance of the simple GMLVQ scheme with more sophisticated settings or alternative classifiers.

Results and Discussion
Final results, including the test set scores, were obtained using all 179 training samples for training. In addition, we performed an average over 50 randomized intializations in order to rule out an influence of the initial configuration of the GMLVQ system. In each run, 40 gradient steps were performed with waypoint averaging and step size control as described above. The final test set scores were obtained on average over the 50 randomized training runs.
Before discussing the outcome of the GMLVQ training in terms of prototypes and relevances we present the actual test set predictions. Figure 5 displays the GMLVQ based scores s, cf. Eq. 5, with respect to the 180 test set patients. Values close to s~1 correspond to patients that are identified as AML patients with high certainty, while small s&0 correspond to a classification as healthy donor. It would be very interesting to study potential correlations of the scores s with additional information about the patients, e.g. measures of the severity of the AML cases. Unfortunately such information was not disclosed and is not available for the given data set.

Test Set Prediction
Although it was known to the participants that the test set contained 20 AML cases, we did not make explicit use of this information. In the GMLVQ training, a threshold value does not have to be specified. The result in terms of scores s and the corresponding ranking of test set patients is independent of the actual number of AML cases. In a practical context, and if a crisp classification is the goal, the actual value of H should be set according to domain expert (user) preferences concerning the compromise between sensitivity and specificity. The example threshold value marked in the right panel of Figure 5 was chosen a posteriori for illustration purposes only and is neither a result nor a parameter of the training process. With respect to performance in the challenge it is irrelevant.
The comparison with the unknown test set labels after submission of the predictions [3] revealed that the 20 patients with highest GMLVQ score s corresponded precisely to the 20  AML patients in the test set. Hence, we achieved the best possible prediction according to Receiver Operator Characteristics or other evaluation methods like Precision/Recall, which only depend on the order of scores and the corresponding ranking of patients.
The obtained classifier can be illustrated in terms of a twodimensional visualization: Figure 6 displays the training and test data in terms of projections on the leading eigenvectors of the relevance matrix L [21]. Two rather well separated clusters can be identified which reflect the assignment of classes. Note that the training set subject (patient 116) that was consistently misclassified by the NPC scheme is, indeed, located in the cluster representing healthy donors. This relates to the overfitting behavior discussed in greater detail in the previous section.
It is remarkable that error-free classification of the test set data was obtained by a number of teams who extracted different features from the data and used a variety of classification approaches [1]. For example, Vilar et al. employed a histogram based classifier in connection with the Kullback-Leibler divergence used as an entropic distance measure [16]. Amar et al. also extracted statistical moments from the data, but applied Support Vector Machine Regression, subsequently [22]. Logistic Regression was applied successfully by Manninen et al. [23]. Strickert and Seifert based their predictions on a method termed Correlative Matrix Mapping [24]. Using their software library Jstacs [25], Keilwagen and Grau built a weighted ensemble of classifiers which also achieved perfect classification.
An additional ranking of the best performing teams was suggested by the organizers in retrospect [1,3,26]. It hinges on interpreting the submitted scores as probabilistic assignments and on the reliability of the test set labels. In our opinion, the suggested posterior ranking according to, e.g., the Pearson correlation between scores s and the test set class labels is questionable, see also the DREAM6 discussion forum at [3].

Characteristics of the GMLVQ Classifier
Apart from yielding the actual classification scheme, the GMLVQ analysis provides insights into the structure of the data which become available by inspection of the prototypes and relevance matrix. The interpretability of the classifier has proven useful in several applications and facilitates discussions with the respective domain experts [27,28].   Figure 7 visualizes the difference vector (w 2 {w 1 ) of prototypes representing healthy donors (1) and AML patients (2), respectively. For the sake of clarity, we have shown only the 31 components which correspond to the features mean j , cf. Eq. ( a). A positive difference corresponds to markers which display a greater value in the AML prototype compared to the typical healthy donor in the data set, examples being HLA-DR-FITC (j~9), CD117-PE (j~16), and CD34-PC5 (j~23). Example markers which display reduced values mean j in AML patients are CD15-FITC (j~7), CD16-PC5 (j~21), and CD10-PC7 (j~31).
In addition we analysed the resulting relevance matrix L~V T V [ R 186|186 . We focused on the diagonal elements L ii~P j V 2 ij which formally accumulate the importance of feature i for the resulting classification.
The direct interpretation of L is simplified if all features assume values on the same order of magnitude. This condition was realized here by the explicit z-score transformation mentioned above. Moreover, it is important to note that, given a particular set of feature vectors x m and prototypes w j , a continuum of matrices V may exist which yield the same distances d(x m ,w j ), cf. Eq. (3) and, hence, the same classification scheme with respect to the training data. This ambiguity problem is particularly pronounced for interdependent or highly correlated features in high dimension. Resulting difficulties concerning the interpretation of L in terms of feature relevances are discussed in depth in [28]. There, schemes are suggested for posterior regularization which provide unique, interpretable V and L which we also applied here: Note that arbitrary vectors from the null-space or kernel ker(C) of the matrix.
can be added to the rows of a given V raw without changing the GMVLQ cost function (4) and the actual classification of training data. In [28] a column space projection is suggested in order to remove contributions from ker(C) [28]: is constructed from the eigenvectors u m of C with eigenvalues zero. Zero eigenvalues of C reflect the presence of linear dependent or strongly correlated features and the corresponding eigenvectors mark directions in input space in which training samples and prototypes do not vary. In the data considered here, one clearly expects dependencies between related markers, the four versions of IgG1 being an obvious example. In addition, extracted features like std j and iqr j or mean j and med j should be strongly correlated.
For the following discussion we determined V by means of a posterior column space projection (9) with Y retaining only the leading eigendirections of C with eigenvalues §10 {12 . Thereafter, the matrix was normalized again to satisfy P ij V 2 ij~1 and we computed the regularized L~V T V.
It is remarkable that, in the given problem, this posterior regularization has very little influence on the test set classification. In particular, the ordering of test set scores obtained with the regularized system is the same as the one presented in the previous section. This suggests that the correlations and dependencies observed in the training set are already representative for the entire data. In [28] example problems are presented where the posterior regularization has a non-trivial effect also on test set performance. Figure 8 displays the diagonal entries of L for all 186 features. After regularization, the heuristic interpretation of L ii as the relevance or significance of feature x j in the classification is justified [28]. The figure displays the features in groups of six, corresponding to the 31 markers, cf. Table 1.
A relatively small number of markers appears to contribute the most significant features: FS-Lin (1), SS-Log (2), CD15-FITC (7), CD117-PE (16), CD16-PC5 (21), CD34-PC5 (23), and CD10-PC7 (31). A more detailed discussion of the obtained L ii provides further, valuable information: For instance, the histogram shape as measured by skewness and kurtosis appears to be of minor importance with respect to marker 16 (CD117-PE), while measures of the corresponding histogram width (std, iqr) seem to 1 represent significant differences between AML patients and healthy donors. On the contrary, for CD10-PC7 (marker 31) skewness and kurtosis carry most discriminative power. While several of the above mentioned markers have been discussed as relevant in the context of AML in the literature, see e.g. [7,28,29,30], their expression characteristics can vary a lot with the actual AML variant. For instance, both, HLA-DR positive and HLA-DR negative types of AML exist [31], the same applies to several other markers.
Due to the limited size of the data set and because information about AML subtypes was not disclosed, one should not overinterpret the results presented above. It is very likely that our findings in terms of relevances and prototypes are highly specific for the provided data set which seems to represent particular types of AML only. Nevertheless, our results demonstrate the interpretability of the GMLVQ approach and illustrate how the method could be used for efficient biomarker selection in collaboration with domain experts.
Obviously, the outcome and interpretation of relevance parameters depends on the precise form of the distance measure, Eq. (3), or more generally, on the parameterization of the classifier. For instance, systems with diagonal matrix L could only take into account single features and would disregard the discriminative power of particular pairs of features. Accordingly, featurs which display low relevance in our scheme might become significant in more complex classifiers. Nevertheless we believe that our method provides valuable insight into the discriminative power of features and pairs of features. The following simple experiment further illustrates this claim: We ranked features according to the corresponding L ii and restricted the obtained GMLVQ classifier to the use of only 18 features for classification. All other features were omitted when evaluating distances and scores, cf. Eqs. (3,5), no re-training of the system was performed. The restricted classifier was evaluated in terms of its test set ROC. Close to perfect test set classification with an AUC&1 was retained when using only the leading 18 features which all are derived from the above mentioned 7 markers. It is interesting to note that also the following two subsets of 18 features, i.e. with relevance ranks 19-36 and 37-54, yielded excellent test set performance. Figure 9 shows how the resulting AUC decreases for subsequent subsets of 18 features with decreasing relevance. Performance deteriorated when subsets of features with very low relevance were used, resulting in essentially random class assignments with AUC&0:5.
A more reliable determination of discriminative markers, and even more so, the selection of a minimal set of features for correct classification would require systematic validation studies including the re-training of the GMLVQ system on the respective feature sets. Due to the limitations of the data provided in the challenge we postponed this line of research to forthcoming studies.

Outlook
More challenging data sets will have to be inspected to further demonstrate the usefulness of the approach in the analysis of flow cytometry data. This should, of course, include the systematic comparison with other methods. A comparison of various classifiers in the context of the FlowCAP2/DREAM6 challenge can be found in [1].
The identification of leukemia subtypes in a larger study population requires the introduction of several prototypes representing the class of AML patients. The extension of GMLVQ  in terms of localized distance measures [9,11] appears also promising in this context.
The reliable identification of feature relevances for marker selection should also be based on larger, more representative data sets. For a successful application of GMLVQ for bio-marker selection in the context of tumor classification see [27]. The application of multi-class, potentially localized, GMLVQ will open new routes to the identification of discriminative markers in the differential diagnosis of AML subtypes. In forthcoming studies, the consideration of histogram specific distance measures will also be studied along the lines of [15].
The analysis presented here was based on the entire cell population of a given subject. More general problems, including the above mentioned identification of AML subtypes, might require an analysis on the level of individual cells. We intend to consider the development of prototype based automated gating procedures in forthcoming projects.

Available Software
The specific Matlab code used to generate our contribution to the DREAM6/FlowCAP2 Molecular Classification of Acute Myeloid Leukemia Challenge 2011 is publicly available at http://www.thedream-project.org/story/code [3].
A Matlab toolbox Relevance and Matrix adaptation in Learning Vector Quantization, including GMLVQ and important variants, is made available at http://matlabserver.cs.rug.nl/gmlvqweb/web/ [32].