Adaptive Dimensionality Reduction with Semi-Supervision (AdDReSS): Classifying Multi-Attribute Biomedical Data

Medical diagnostics is often a multi-attribute problem, necessitating sophisticated tools for analyzing high-dimensional biomedical data. Mining this data often encounters two crucial bottlenecks: 1) high dimensionality of the features used to represent rich biological data and 2) small amounts of labeled training data, owing to the expense of consulting the highly specific medical expertise necessary to assess each study. To our knowledge, no existing approach has attempted to use active learning in the context of dimensionality reduction for improving the construction of low dimensional representations. We present our novel methodology, AdDReSS (Adaptive Dimensionality Reduction with Semi-Supervision), to demonstrate that fewer labeled instances identified via AL in embedding space are needed to create a more discriminative embedding representation compared to randomly selected instances. We tested our methodology on a wide variety of domains spanning prostate gene expression, ovarian proteomic spectra, brain magnetic resonance imaging, and breast histopathology. Across these high dimensional biomedical datasets, each with 100+ observations, and across all parameters considered, the median classification accuracy over all experiments showed AdDReSS (88.7%) to outperform SSAGE, an SSDR method using random sampling (85.5%), and Graph Embedding (81.5%). Furthermore, we found that embeddings generated via AdDReSS achieved a mean 35.95% improvement in Raghavan efficiency, a measure of learning rate, over SSAGE. Our results demonstrate the value of AdDReSS in providing low dimensional representations of high dimensional biomedical data while achieving higher classification rates with fewer labeled examples than without active learning.


Introduction
The ability to mine disease patterns from large biomedical datasets could enable the identification of prognostic disease markers, which, in turn, could save lives, reduce morbidity, and alleviate the overall cost of healthcare today. Generally speaking, biomedical data may be regarded as a collection of diagnostic attributes, which can be obtained from a variety of sources, ranging from medical imagery to DNA microarrays to protein expression data obtained via mass spectrometry techniques [1][2][3][4][5][6]. The most popular approaches for identifying disease patterns are variants of supervised classification strategies. In these approaches, classifiers are taught to distinguish between disease classes via a collection of these attributes and labeled training instances [1].
One of the primary challenges in building predictors for biomedical data is that it is typically high dimensional (large K) with relatively few samples (small N) [7]. Particularly in the case of DNA microarray data, the number of features can number in the tens of thousands [8]. Machine learning classifiers are often used to leverage the predictive power of a multitude of features and discriminate between patients with different underlying pathologies [2][3][4][5][6]. However, given the 'curse of dimensionality' problem [9], where K > N, it can be difficult to build a generalizable classifier from biomedical data. The Hughes effect [10] states that given a fixed number of training samples, increasing the dimensionality eventually reduces the predictive power of a classifier. This is because training a discriminating classifier in a high dimensional feature space results in many potential class separation boundaries for distinguishing the instances to be classified [11]. This implies that before these measurements can be incorporated within a classifier to generate predictions, the original measurements need to first be reduced to a smaller number of variables, k ≪ N, in order to build an accurate and generalizable classifier.
In the case of very high dimensional data, it is desirable to represent the data in a low dimensional representation that allows the classes to be separable [3,12,13,14]. Feature selection is one method to reduce dimensionality by identifying the best k ≪ K features to represent the data [14][15][16][17][18]. While more readily interpretable than dimensionality reduction methods, feature selection methods may not provide the most compact or efficient low dimensional representation due to the curse of dimensionality and the possible presence of redundant and correlated features.
Dimensionality reduction (DR) methods, such as Principal Component Analysis (PCA) [19], have been used for analyzing high dimensional biomedical data [3] by mapping high dimensional data into a low dimensional embedded representation (or embedding). DR methods help to mitigate the 'curse of dimensionality' problem by learning a low dimensional representation which aims to approximate the original high dimensional features with fewer variables. DR methods can be grouped into two broad classes: linear and nonlinear methods.
Linear DR methods such as PCA, Linear Discriminant Analysis (LDA) [11], and Multidimensional Scaling (MDS) [20,21], generally preserve Euclidean distances when mapping data into the embedding space. For example, PCA [19] determines the optimal projections of the data by a rotation of the high dimensional space to the axis of greatest variance. Alternatively, in MDS [20], Euclidean distances between each pair of data points are collected into a pairwise affinity matrix, which is then mapped into a low dimensional embedding which best preserves these distances.
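To make the two linear schemes concrete, here is a short NumPy sketch on synthetic data (sample and feature counts are purely illustrative); on Euclidean distances, classical MDS recovers the same coordinates as PCA up to a per-axis sign flip:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                # toy data: N=50 samples, K=20 features

# PCA: rotate onto the axes of greatest variance
Xc = X - X.mean(axis=0)                      # center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)            # sample covariance matrix
evals, evecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]              # sort by descending variance
Y_pca = Xc @ evecs[:, order[:2]]             # project onto the top-2 principal axes

# Classical MDS: embed the pairwise Euclidean distance matrix
D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
J = np.eye(len(X)) - np.ones((len(X), len(X))) / len(X)   # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
w, V = np.linalg.eigh(B)
top = np.argsort(w)[::-1][:2]
Y_mds = V[:, top] * np.sqrt(np.maximum(w[top], 0))        # 2-D MDS coordinates
```

The agreement between the two embeddings illustrates why both are considered distance-preserving linear methods.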
In contrast, nonlinear dimensionality reduction (NLDR) methods [22][23][24][25] are founded on the premise that Euclidean distance does not represent true object similarity. In fact, researchers have found that NLDR methods may be better suited towards classification of high dimensional biomedical data compared to linear DR methods [12,13,26]. Graph-based DR methods such as Graph Embedding [23] are similar to the idea of manifold learning [24], where a graph dictates similarity between data points via a set of weighted edges. The graph itself is representative of an abstract low dimensional manifold which encompasses all data points and is embedded in the high dimensional space [24]. In order to extract the manifold from the high dimensional space, similarity between data points is re-defined as distances along the graph and the graph distance information can be projected into a low dimensional embedding. Specifically, NLDR methods such as Isomap [24], Locally Linear Embedding (LLE) [25], and Graph Embedding (GE) [23] have been shown to provide low dimensional data representations for improving classification performance and overall data interpretation [12,13].
While unsupervised methods, such as NLDR schemes, have been utilized for preliminary analysis of data, for classification tasks it is desirable to incorporate all available object class labels to optimize the embedding for class separation, as opposed to basing the affinities solely on the pre-defined similarity criterion [27][28][29]. Recently, there has been a great deal of interest in semi-supervised dimensionality reduction (SSDR) methods, which utilize labeled instances to improve separation of object classes in the low dimensional embedding [30][31][32][33][34][35][36]. This is typically done by extending the pairwise affinity matrix of previous DR methods to incorporate class label information, such that if a pair of objects belong to the same class, they are weighted to be more similar and will be mapped closer together in the low dimensional embedding. Similarly, if a pair of objects are of different classes, they are weighted to be less similar and will be mapped farther apart in the embedding. Sugiyama et al. [33] applied semi-supervised learning (SSL) to Fisher's discriminant analysis in order to find the linear projection that maximized object class separation. Verbeek et al. [37] utilized a method for semi-supervised learning using Gaussian fields with locally linear embedding for object pose recognition. Yang et al. [34] similarly applied SSL toward manifold learning methods. Zhao [35] presented a semi-supervised method for graph embedding which utilizes weights to simultaneously attract samples of the same class labels and repel samples of different class labels given a neighborhood constraint. Zhang [36] employed a similar approach to SSDR as Zhao, but without utilizing neighborhood constraints.
In addition to the large-K/small-N problem, a second challenge with building predictors for biomedical data is that very often, biomedical datasets are not adequately labeled or annotated [38]. This is due to the significant overhead involved in procuring well-annotated biomedical datasets and also due to the fact that invariably an expert is required to perform this task [5]. Hence, if one is attempting to build a predictor to identify disease aggressiveness or predict long term outcome in a patient, one would need a well curated and annotated dataset to provide training instances for the predictor. Active learning (AL) can reduce the number of samples needed to train an accurate predictor.
AL is a specific instance of semi-supervised learning, where the learning algorithm may interactively query the desired labels from a user or other source [39]. AL differs from random sampling, which queries training instances randomly from an unlabeled pool [40]. The objective of AL is to find an optimal training set. The benefits of using AL are twofold as 1) classifier accuracy can be improved, and 2) the number of training labels necessary to achieve a classification goal is reduced.
While AL has been used for providing fewer, optimal instances for training a classifier, its extension toward learning the best training instances for improving the quality of low dimensional embedding representations has not been heavily investigated [37,41]. Zhang et al. [42] have suggested that searching in a locally linear or manifold space could provide more representative points for active learning. Thus, an extension of AL to SSDR would be important for prediction and representation of biomedical data.
In this paper, we present a novel dimensionality reduction (DR) method, AdDReSS (Adaptive Dimensionality Reduction with Semi-Supervision), which aims to seamlessly integrate semi-supervised dimensionality reduction and active learning. This allows AdDReSS to construct low dimensional data representations to improve classification of high dimensional biomedical data while using fewer labels compared to previous SSDR methods.
The major contributions and implications of this work are: First, a novel NLDR method which seamlessly incorporates active learning and semi-supervised learning to guide embedding construction. Second, a demonstration showing the effects of active learning towards improving embeddings generated via SSDR compared to random sampling. Third, a simple framework that could be extensible for other SSDR methods to create more discriminatory low dimensional representations.
We evaluated our methodology on different tasks for four relevant medical datasets: (a) Discrimination of tumoral and non-tumoral prostate samples in a gene expression dataset [8], (b) Discrimination of neoplastic and non-neoplastic disease within the ovary in a protein expression dataset [4], (c) Mitosis detection in breast cancer images [43], and (d) Identifying white matter and grey matter in a Brain MR Imaging dataset [44]. These datasets were chosen to represent varied types of imaging and non-imaging biomedical data-radiologic medical imaging, histologic imaging, DNA microarray, and proteomic spectra.
The rest of this paper is organized as follows. In Section 2, we formalize notation and provide an overview of an unsupervised dimensionality reduction method (Graph Embedding) and a semi-supervised dimensionality reduction method (Semi-Supervised Agglomerative Graph Embedding). In Section 3, we introduce an active learning strategy (uncertainty sampling) and describe our method, AdDReSS (Adaptive Dimensionality Reduction with Semi-Supervision), thereby providing its theoretical background. In Section 4, we outline the datasets, training parameters, and performance measures used to evaluate the methodologies described in this work. In Section 5, we demonstrate the performance of the comparative methodologies on the basis of learning rate, classification accuracy, clustering performance, and variability, followed by concluding remarks in Section 6.

Notation
We denote a set E of samples c i , c j ∈ E, i, j ∈ {1, 2, . . ., N}, where N is the number of samples in set E. Each sample c i is represented by a 1 × K feature vector x i ∈ X. We can formalize a dataset X as an N × K matrix containing K feature values for each of N samples. The goal of dimensionality reduction is to reduce this N × K matrix, in which each sample is defined by a 1 × K feature vector x i ∈ X, to an N × k matrix, k < K, in which each sample c i is defined by a 1 × k eigen-feature vector y i ∈ Y. Label information may be introduced such that ℓ(c i ) denotes the object class label of sample c i as belonging to the positive class (+1) or negative class (−1). The label ℓ(c i ) = 0 denotes that sample c i is unlabeled.

Graph Embedding
NLDR methods, such as Graph Embedding [23], can be used to reduce samples c i originally represented as K-dimensional vectors x i to k-dimensional vectors y i . To perform this transformation, data X is first represented as an affinity matrix W, which describes the similarity between all pairs of objects c i and c j as a graph G = {V, E}, where V represents all objects c i and c j as vertices, and E represents the edges which connect them.
Similarity is computed via the Gaussian diffusion kernel g = exp(−||x i − x j || 2 /σ), which affects the weighting of the components in W. The kernel allows for a flexible local neighborhood constraint induced based on σ. A small σ narrows the size of the local neighborhood such that fewer points are deemed similar, whereas a large σ increases the size of the local neighborhood such that more points are similar. We set σ = max i,j ||x i − x j || 2 .
Alternatively, E, the edges in the graph G, expressed via the affinity matrix W, can be pruned to further constrain local neighborhoods for NLDR. E can be defined based on a local neighborhood size determined by the number of nearest neighbors κ. For each c i , if c j is one of the κ-nearest neighbors of c i , then we may include c j in the set K i and express the edge as E(c i , c j ) = 1. The weight matrix W represents a non-binary extension of the graph G, which takes into account the explicit similarity between objects c i and c j in terms of x i and x j such that

W(x i , x j ) = g, if E(c i , c j ) = 1, and W(x i , x j ) = 0, otherwise. (1)

As performed in the normalized cuts algorithm [23], the affinity matrix is normalized such that

W̃(x i , x j ) = W(x i , x j ) / sqrt(D(i, i) D(j, j)), (2)

and W̃(x i , x j ) is used to solve the eigenvalue problem

(D − W̃) e = λ D e, (3)

where D is a diagonal matrix containing the row sums of W̃, and e are the eigenvectors. The embedding Y GE is formed by taking the most dominant eigenvectors e β , β ∈ {1, 2, . . ., k}, corresponding to the k smallest eigenvalues λ β , where k corresponds to the dimensionality of Y GE .
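The pipeline above can be sketched in a few lines of NumPy on toy data; the symmetrized κ-NN pruning and the symmetric normalized Laplacian used here are one standard way to realize the normalized-cut eigenproblem, assumed for illustration rather than taken verbatim from [23]:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))                 # toy data: N=60 samples, K=10 features

# Gaussian diffusion kernel with sigma = max_{i,j} ||x_i - x_j||^2
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigma = d2.max()
W = np.exp(-d2 / sigma)

# Prune edges to the kappa nearest neighbors (symmetrized)
kappa = 7
keep = np.zeros_like(W, dtype=bool)
nn = np.argsort(d2, axis=1)[:, 1:kappa + 1]   # skip self at position 0
keep[np.repeat(np.arange(len(X)), kappa), nn.ravel()] = True
keep |= keep.T                                # keep an edge if either end selects it
W = np.where(keep, W, 0.0)

# Normalized-cut style eigenproblem via the symmetric normalized Laplacian
d = W.sum(axis=1)
inv_sqrt = 1.0 / np.sqrt(d)
L_sym = np.eye(len(X)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
evals, U = np.linalg.eigh(L_sym)              # ascending eigenvalues

k = 2
Y_ge = (inv_sqrt[:, None] * U)[:, 1:k + 1]    # skip the trivial constant eigenvector
```

The smallest eigenvalue is (numerically) zero and belongs to the trivial eigenvector, which is why the embedding starts from the second column.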

Semi-Supervised Agglomerative Graph Embedding
Adding semi-supervised learning to DR is performed by modifying the Graph Embedding algorithm to introduce the label information ℓ(c i ). A typical strategy for introducing label information into the Graph Embedding framework is to apply an additional set of weighting constraints to describe pairs of c i and c j with either the same (ℓ(c i ) = ℓ(c j )) or different (ℓ(c i ) ≠ ℓ(c j )) labels. We utilize the methodology of Zhao et al. [35], SSAGE, which applies a multiplier to the Gaussian diffusion kernel g = exp(−||x i − x j || 2 /σ), where σ = max i,j ||x i − x j || 2 , such that the affinity matrix is now defined as

Ŵ(x i , x j ) = g(1 + g), if ℓ(c i ) = ℓ(c j ), ℓ(c i ) ≠ 0, and c j ∈ K i ,
Ŵ(x i , x j ) = g(1 − g), if ℓ(c i ) ≠ ℓ(c j ), ℓ(c i ), ℓ(c j ) ≠ 0, and c j ∈ K i ,
Ŵ(x i , x j ) = g, if ℓ(c i ) = 0 or ℓ(c j ) = 0, and c j ∈ K i ,
Ŵ(x i , x j ) = 0, otherwise. (4)

Ŵ contains the weighted similarities between c i and c j based on (a) their positions in K-dimensional space via the Gaussian diffusion kernel, (b) their proximity to their κ nearest neighbors, and (c) whether each neighbor is of the same label class or not.
Ŵ is subsequently normalized via Eq (2), and the resulting normalized affinity matrix undergoes eigenvalue decomposition as performed in Eq (3). As with GE, the embedding Y SS for SSAGE is formed by taking the most dominant eigenvectors e β , β ∈ {1, 2, . . ., k}, corresponding to the k smallest eigenvalues λ β , where k is the dimensionality of Y SS .
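The label-dependent re-weighting can be sketched as follows; the attract/repel multipliers (1 + α) and 1/(1 + α) are hypothetical stand-ins for the paper's Eq (4), chosen only to show the mechanism:

```python
import numpy as np

def ssage_affinity(X, labels, kappa=5, alpha=1.0):
    """Sketch of a semi-supervised affinity matrix in the spirit of SSAGE:
    same-label neighbor pairs are attracted (weight boosted) and different-label
    pairs are repelled (weight shrunk). labels: +1/-1 if labeled, 0 if unlabeled.
    The (1 + alpha) multipliers are our assumption, not the paper's exact form."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / d2.max())                      # Gaussian diffusion kernel
    # kappa-nearest-neighbor pruning (symmetrized)
    keep = np.zeros_like(W, dtype=bool)
    nn = np.argsort(d2, axis=1)[:, 1:kappa + 1]
    keep[np.repeat(np.arange(len(X)), kappa), nn.ravel()] = True
    keep |= keep.T
    W = np.where(keep, W, 0.0)
    # semi-supervised multipliers on labeled pairs
    same = (labels[:, None] == labels[None, :]) & (labels[:, None] != 0)
    diff = (labels[:, None] == -labels[None, :]) & (labels[:, None] != 0)
    W[same] *= (1.0 + alpha)        # attract same-class pairs
    W[diff] *= 1.0 / (1.0 + alpha)  # repel different-class pairs
    return W

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
labels = np.zeros(30, dtype=int)
labels[:5], labels[5:10] = 1, -1    # a few revealed labels, rest unlabeled
W_hat = ssage_affinity(X, labels)
```

Because the label masks are symmetric, the modified affinity matrix remains symmetric and can be fed directly into the normalization and eigendecomposition of Eqs (2) and (3).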

Brief Overview
The spirit of AdDReSS is embodied in Fig 1. Given an initial low-dimensional representation, a Support Vector Machine (SVM) [45] classifier is used to identify instances of the classes that are difficult to classify. The goal, then, is to separate the two classes in a lower dimensional embedding representation such that each class occupies a distinct region of the low dimensional embedding space. AdDReSS invokes AL to identify difficult to classify samples from within the embedding representation. These samples are subsequently used by the semi-supervised agglomerative graph embedding (SSAGE) strategy to produce a more separable representation of the data. This process can be iterated to further refine the embedding representation.

Active Learning by Uncertainty Sampling for Identifying Ambiguous Samples
One can identify samples for AL by querying difficult to classify samples [5,38,40,46,47]. While many strategies have been investigated for AL using different classifiers, ultimately these differences were found not to be heavily correlated with classification performance [5]. For uncertainty sampling, a labeled set S tr is first used to train a classifier. For each S tr , the γ and c parameters are optimized by the grid search methodology proposed in Hsu et al. [48], and the classifier is subsequently used to predict on the unlabeled set S ts . For each sample in the unlabeled set S ts , the classifier predicts the object class label ℓ(c i ) with a certain probability that c i belongs to that particular object class (i.e. P(ℓ(c i ) = 1)). We can define the most ambiguous samples as those with a probability closest to P(ℓ(c i ) = 1) = 0.5. We aim to find the samples c i nearest to this point via the objective function

argmin c i ∈ S ts |P(ℓ(c i ) = 1) − 0.5|. (5)

These samples c i are assigned to the set S a . Labels ℓ(c i ), c i ∈ S a , are queried, and these ambiguous samples are added to the training set S tr . Learning via the updated labels ℓ(c i ), c i ∈ S tr , we endeavor to improve classification performance compared to a training set S tr that excludes S a .
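With scikit-learn, uncertainty sampling per Eq (5) might look like the sketch below; the dataset, pool sizes, and query batch size are illustrative assumptions, and the grid search over γ and c is omitted for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy stand-in for samples in embedding space
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Initial labeled set S_tr (stratified) and unlabeled pool S_ts
labeled = np.concatenate([np.where(y == 0)[0][:10], np.where(y == 1)[0][:10]])
unlabeled = np.setdiff1d(np.arange(len(y)), labeled)

clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X[labeled], y[labeled])

# Posterior probability P(l(c_i) = 1) for each sample in S_ts
p1 = clf.predict_proba(X[unlabeled])[:, list(clf.classes_).index(1)]

# Eq (5): the most ambiguous samples are those with posterior closest to 0.5
n_query = 5
S_a = unlabeled[np.argsort(np.abs(p1 - 0.5))[:n_query]]
```

The indices in S_a are then sent to the oracle for labeling and moved into the training set.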

Algorithm
The iterative AdDReSS algorithm is presented below. Additionally, we employ the synthetic Swiss Roll example [24] presented in Fig 2 to guide the explanation of the algorithm. Fig 2(a) shows the 3-dimensional representation of the Swiss Roll dataset [24] with its two classes. The goal is to separate these two classes in a lower dimensional embedding representation such that each class is in a distinct region of the low dimensional embedding space. Difficult to classify examples are identified by the SVM classifier in embedding space and are shown in blue in Fig 2(e). The newly identified objects discovered via AL attract toward similarly labeled samples already available to SSAGE and the classifier while repelling from dissimilarly labeled samples, thus creating the separation shown in Fig 2(g). Thus, it is clear that the discovery of difficult to classify labels can produce greater separation in the embedding as these samples are leveraged by SSAGE. Random sampling would probabilistically provide a uniform sampling of points in the dataset, such that SSAGE could not leverage samples at the classification boundary, resulting in a smaller degree of separation between object classes.
Line 0 of the algorithm refers to Model Initialization, the construction of the initial embedding Y Ad , and is illustrated in Fig 2(c) which shows the application of AdDReSS on the Swiss Roll dataset. The initialized embedding Y Ad is created using data X via GE. In Fig 2(d), the revealed labels used for active learning are mapped onto Y Ad .
The subsequent illustrations, Fig 2(e) and 2(g), represent successive runs of Active Learning and Model Refinement via SSDR, respectively, which are contained within the while loop of the algorithm (lines 2-7).

Algorithm AdDReSS
0. Initialize embedding Y Ad from data X via GE
1. while unlabeled samples remain in S ts do
2. Train classifier using labeled samples c i ∈ S tr
3. Predict ℓ(c i ) in S ts using classifier model in Step 2
4. Identify ambiguous samples c i ∈ S ts via Eq (5)
5. Query labels ℓ(c i ), c i ∈ S a
6. Add queried samples S a to the training set S tr
7. Update embedding Y Ad using updated ℓ(S tr ) via Eq (4)
8. end
9. return Y Ad

Lines 2-6 of the algorithm represent the Active Learning component described earlier in Section 3.2, where ambiguous samples are identified based on the results of a trained classifier. Although Doyle et al. [5] have suggested that the particular choice of active learner is not significantly correlated with classifier performance, we have chosen the Support Vector Machine (SVM) classifier to identify the ambiguous samples for the following reasons. Firstly, SVMs have been shown to be highly generalizable to new unseen testing data, suggesting that the algorithm can consistently identify ambiguous samples [45,46]. Secondly, SVMs have been heavily investigated and employed for active learning [49,50]. Finally, SVMs, like GE, operate on a kernel representation of the data, allowing for seamless identification of ambiguous samples derived from the kernel space in construction of the embeddings. A linear kernel was used based on the assumption that the NLDR method GE provides a linearly separable embedding, as GE is able to account for non-linear data. We have previously shown the ability of a linear kernel SVM to separate biomedical data using low dimensional representations from NLDR methods [13]. The identified ambiguous samples are then queried for their labels (Fig 2(f)). New labels are obtained for these samples and added to the training set, completing the active learning phase (lines 2-6). Line 7 of the algorithm represents the Model Refinement component, where the updated label set ℓ(S tr ) found via active learning is used to create an improved embedding representation via SSDR (Fig 2(g)). This representation demonstrates an improvement upon the previous embedding (Fig 2(c)). These steps of identifying samples (Fig 2(e)) and generating an optimized representation (Fig 2(g)) may be repeated until there are no additional unlabeled samples available for querying or until there is a lack of ambiguous samples to be queried.
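Putting the pieces together, a compact end-to-end sketch of the AdDReSS loop on synthetic two-class data; the spectral embedding, the (1 + α)-style label multipliers, the query batch size, and the fixed iteration count are all simplifying assumptions rather than the paper's exact settings:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def spectral_embedding(W, k=2):
    """Normalized-cut style embedding from an affinity matrix (GE sketch)."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, U = np.linalg.eigh(L_sym)
    return (inv_sqrt[:, None] * U)[:, 1:k + 1]   # skip the trivial eigenvector

def affinity(X, labels=None, alpha=1.0):
    """Gaussian affinity; labeled pairs attracted/repelled (SSAGE-style sketch;
    the (1 + alpha) multipliers are assumptions). labels: +1/-1, 0 = unlabeled."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / d2.max())
    if labels is not None:
        same = (labels[:, None] == labels[None, :]) & (labels[:, None] != 0)
        diff = (labels[:, None] == -labels[None, :]) & (labels[:, None] != 0)
        W = W * np.where(same, 1.0 + alpha, 1.0)
        W = W * np.where(diff, 1.0 / (1.0 + alpha), 1.0)
    return W

# Toy two-class data standing in for a biomedical dataset
X, y = make_moons(n_samples=100, noise=0.05, random_state=0)
y = np.where(y == 1, 1, -1)

labels = np.zeros(len(X), dtype=int)
seed = np.concatenate([np.where(y == 1)[0][:8], np.where(y == -1)[0][:8]])
labels[seed] = y[seed]                              # initial labeled set S_tr

Y_ad = spectral_embedding(affinity(X))              # line 0: initialize via GE
for _ in range(3):                                  # lines 1-8
    tr, ts = np.where(labels != 0)[0], np.where(labels == 0)[0]
    clf = SVC(kernel="linear", probability=True, random_state=0)
    clf.fit(Y_ad[tr], labels[tr])                   # lines 2-3: classify in embedding space
    p1 = clf.predict_proba(Y_ad[ts])[:, list(clf.classes_).index(1)]
    query = ts[np.argsort(np.abs(p1 - 0.5))[:5]]    # lines 4-5: Eq (5) query
    labels[query] = y[query]                        # line 6: oracle reveals labels
    Y_ad = spectral_embedding(affinity(X, labels))  # line 7: refine via SSDR
```

Each pass labels the five most ambiguous samples and rebuilds the embedding, mirroring the alternation between active learning (lines 2-6) and model refinement (line 7).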

Dataset Description
A total of 4 datasets (D 1 -D 4 ) were used in this study. These datasets include: D 1 : gene-expression of prostate cancer, D 2 : protein expression of ovarian cancer, D 3 : breast histology image data, and D 4 : synthetic brain image data. The datasets are summarized in Table 1.
4.1.1 D 1 : Gene Expression of Prostate Cancer. Feature Extraction: No additional feature extraction was performed and all embeddings were calculated directly from the provided data. For all results, the K = 12,600 dimensional dataset was reduced down to dimensionality k ∈ {2, 3}.
4.1.2 D 2 : Protein Expression of Ovarian Cancer. Preprocessing: The study [4], obtained from the Biomedical Kent-Ridge Repositories (http://datam.i2r.a-star.edu.sg/datasets/krbd/), uses proteomic spectra extracted from serum to distinguish 91 samples of neoplastic disease from 162 samples of non-neoplastic disease within the ovary. The proteomic spectra generated by SELDI mass spectrometry contain, for each sample, the relative amplitudes of 15,154 intensities at each molecular mass/charge (M/Z) identity.
Feature Extraction: No additional feature extraction was performed and all embeddings were calculated directly from the provided data. For all results, the K = 15,154 dimensional protein spectra was reduced down to dimensionality k ∈ {2, 3}.
4.1.3 D 3 : Mitotic Detection in Breast Cancer Histological Images. Preprocessing: This dataset was obtained from the 2012 ICPR mitosis detection contest [43]. The task is mitotic nuclei identification (http://www.ipal.cnrs.fr/event/icpr-2012). Five breast cancer biopsy slides were stained with hematoxylin and eosin (H&E). In each slide, pathologists selected 10 high power fields (HPF) at 40X magnification. An HPF has a size of 512 × 512 μm 2 . Each HPF was scanned by an Aperio XT scanner at 0.2456 μm per pixel to create a 2084 × 2084 image. These 50 HPFs contain 316 annotated mitotic nuclei in total, and an automated nuclear detection algorithm was used to select an additional 8592 non-mitotic nuclei for a total of 8908 nuclei. The automated nuclei detection algorithm involves the application of a blue ratio transformation [51] to each HPF followed by global thresholding via Otsu's method [52] to obtain a binary image. Following a morphologic opening operation applied to the binary image, we assign the centroid of each connected component as a nucleus. Patches containing each nucleus at their centers are illustrated in Fig 3. Feature Extraction: The 8908 nuclei are processed using the centroid of each nucleus as the center of an 8 × 8 image. In this manner, 8 × 8 images are generated at 4 resolutions (20X, 10X, 5X, and 2.5X). The RGB intensities for all pixels across all 4 patch resolutions are subsequently vectorized into an 8 × 8 × 4 × 3 = 768 dimensional vector of RGB intensities [53]. The K = 768 dimensional feature vector was reduced down to dimensionality k ∈ {2, 3}.
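A hedged sketch of this multi-resolution patch vectorization; the stride-based downsampling and the scale factors (1, 2, 4, 8 standing in for 20X down to 2.5X) are our assumptions, and the contest pipeline may differ:

```python
import numpy as np

def nucleus_feature_vector(image, cy, cx, patch=8, scales=(1, 2, 4, 8)):
    """Extract an 8 x 8 RGB patch centered on a nucleus at four downsampled
    resolutions and concatenate the intensities into one
    8 * 8 * 4 * 3 = 768-dimensional vector (illustrative sketch)."""
    feats = []
    half = patch // 2
    for s in scales:
        small = image[::s, ::s, :]               # naive downsampling by striding
        # clamp the window so the patch stays inside the downsampled image
        y = min(max(cy // s, half), small.shape[0] - half)
        x = min(max(cx // s, half), small.shape[1] - half)
        feats.append(small[y - half:y + half, x - half:x + half, :].ravel())
    return np.concatenate(feats)

img = np.random.default_rng(2).random((256, 256, 3))   # stand-in for an HPF image
v = nucleus_feature_vector(img, 128, 128)
```

Each scale contributes 8 × 8 × 3 = 192 intensities, giving the 768-dimensional vector described above.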
4.1.4 D 4 : BrainWeb Images. Preprocessing: Synthetic brain data [44] was acquired from the Montreal Neurological Institute (http://www.bic.mni.mcgill.ca/brainweb/), consisting of simulated proton density (PD) MRI brain volumes at various noise and bias field inhomogeneity levels. Gaussian noise artifacts have been added to each pixel in the image, while inhomogeneity artifacts were added via pixel-wise multiplication of the image with an intensity nonuniformity field. Parameters for Gaussian noise artifacts (NO) ranged from 1% to 9% noise. Similarly, intensity non-uniformity (RF) ranged from 0 to 40%. Images were acquired at a slice thickness of 1mm. The dataset provides corresponding labels for each of the separate regions within the brain, including white matter (WM) and grey matter (GM). A single slice is used in this study comprising WM and GM alone (ignoring other brain tissue classes).
Feature Extraction: 14 texture features [54] were extracted from each image on a per-pixel basis: angular second moment, contrast, correlation, sum of squares variance, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, two information measures of correlation, and maximal correlation coefficient. These features are based on statistics calculated from a gray level intensity co-occurrence matrix constructed from the image, and were chosen due to previously demonstrated discriminability between cancerous and non-cancerous regions in the prostate [55] and between different types of brain matter [56] for MRI data. For all results, the K = 14 dimensional texture feature space is reduced to dimensionality k ∈ {2, 3}.
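To make the feature family concrete, here is a small NumPy sketch computing four of the Haralick-style statistics from a gray-level co-occurrence matrix at a single horizontal offset; the per-pixel windowing, additional offsets, and the remaining ten features are omitted for brevity:

```python
import numpy as np

def glcm_features(img, levels=8):
    """Sketch of four Haralick-style statistics from a gray-level co-occurrence
    matrix at offset (0, 1); img is a grayscale image with values in [0, 1)."""
    q = np.minimum((img * levels).astype(int), levels - 1)  # quantize gray levels
    P = np.zeros((levels, levels))
    np.add.at(P, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)  # horizontal neighbor pairs
    P /= P.sum()                                            # joint probabilities
    i, j = np.indices(P.shape)
    asm = (P ** 2).sum()                            # angular second moment
    contrast = ((i - j) ** 2 * P).sum()             # contrast
    idm = (P / (1.0 + (i - j) ** 2)).sum()          # inverse difference moment
    entropy = -(P[P > 0] * np.log(P[P > 0])).sum()  # entropy
    return np.array([asm, contrast, idm, entropy])

img = np.random.default_rng(3).random((32, 32))     # toy grayscale patch
f = glcm_features(img)
```

In practice these statistics would be computed in a small window around each pixel to yield the per-pixel 14-dimensional feature vector described above.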

Comparative Strategies
Our experimental design was constructed to highlight the differences between embeddings generated via three schemes: (1) Graph Embedding (GE), (2) Semi-Supervised Agglomerative Graph Embedding (SSAGE), and (3) AdDReSS, an SSDR method using active learning. A summary of the methods is presented in Table 2. An empirical maximum ("Empirical Max") is also shown in some of the plots to demonstrate a ceiling for classification performance. The empirical maximum is calculated as the highest ϕ Acc obtained for any single iteration of Y.

Embedding Parameters
Embeddings Y Ad and Y SS for AdDReSS and SSAGE, respectively (refer to Sections 2.3 and 3.3 for more details), were generated with 20 different randomly selected training sets S tr of training samples. Measures designed to evaluate each embedding were calculated across multiple iterations, Y Ad l% , each corresponding to an embedding for a percentage l of revealed labels ℓ(c i ). These trials were repeated across a range of parameters for each dataset D 1 -D 3 (as described in Section 4). Embeddings Y GE were also generated for unsupervised GE (refer to Section 2.2 for more details) for comparison, but since no label information is used, only one embedding is obtained across all label iterations for each parameter set. Optimal κ parameters, κ ∈ {2, . . ., n − 1}, were selected for all experiments, where n is the number of samples in the dataset.

Training Parameters
Each dataset is divided equally into training and testing pools, E tr and E ts , respectively, for the purpose of an unbiased evaluation of the resulting Y. Random stratified sampling was performed such that samples for E tr and E ts are randomly chosen while keeping the number of positive and negative class labels ℓ(c) the same in both E tr and E ts . Note that E tr and E ts are distinct from the training and testing sets S tr and S ts used for querying samples for active learning. S tr and S ts are used solely for construction of the embedding and together make up the entirety of the training pool E tr described in this section, such that E tr = S tr ∪ S ts . Meanwhile, the labels ℓ(E ts ) in the testing pool are used only for analysis and are not used for constructing Y.
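The stratified pool construction can be done directly with scikit-learn; the sample counts and class balance below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                 # toy feature matrix
y = np.array([1] * 80 + [-1] * 120)           # imbalanced class labels

# Equal pools E_tr and E_ts with matched class proportions (stratified)
idx_tr, idx_ts = train_test_split(np.arange(len(y)), test_size=0.5,
                                  stratify=y, random_state=0)
```

With `stratify=y` and a 50/50 split, both pools receive exactly the same number of positive and negative labels.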

Performance Evaluation
We evaluate AdDReSS on the basis of 7 evaluation measures summarized in Table 3. Five measures have been previously explored, and we refer the reader to the provided citations and Appendix for additional details on Random Forest classification accuracy [57], Silhouette Index [58], and Raghavan Efficiency [59]. Additionally, we present 2 new measures (Maximum Query Efficiency and Maximum Information Gain) to illustrate the learning rates provided via an active learning approach as compared to a random sampling approach. These two measures are described below.

4.5.1 Evaluation of Maximum Query Efficiency (ϕ MQE ). While Raghavan Efficiency is useful as an overall measure, there remain important insights that cannot be surmised from a global measure. One example is the cost savings associated with using active learning based dimensionality reduction compared with traditional SSDR using random sampling. Maximum Query Efficiency is the ratio between the maximum difference in the number of labels necessary to achieve the same classification performance and the number of potential queries, such that

ϕ MQE = max ϕ Acc (l SS − l Ad ) / N, (7)

where l SS and l Ad refer to the mean number of labels queried by SSAGE and AdDReSS, respectively, to achieve a classification performance ϕ Acc . N refers to the total number of samples c i ∈ E. A larger ϕ MQE is indicative of greater savings in terms of labels queried.

4.5.2 Evaluation of Maximum Information Gain (ϕ MIG ). Another useful measure of active learning performance is the maximum information gain from using a particular algorithm of choice. We define maximum information gain as the maximum difference in classification performance ϕ Acc at a given label query amount l, such that

ϕ MIG = max l (ϕ Acc (Y Ad l ) − ϕ Acc (Y SS l )). (8)

Table 3. Summary of Evaluation Measures.

Evaluation Measure: Description

Classification Accuracy (ϕ Acc ) [57]: Classifier accuracy (Acc) is calculated to evaluate class separability within the embedding.

Silhouette Index (ϕ SI ) [58]: Silhouette Index (SI) offers an independent measure to quantify the separation of multiple classes in the embedding. SI can detect more subtle changes in the embedding with regards to overall class separation compared to classification accuracy.

Embedding Variance via Classification Accuracy (ρ Acc ) [57]: It is anticipated that active learning will be able to consistently identify training instances, S a , which will lead to improved classification, whereas random sampling will show more varied improvement due to the variance in the specific training instances chosen.

Embedding Variance via Silhouette Index (ρ SI ) [58]: Similar to ρ Acc , we also aim to quantify the variance of the embedding with regards to the Silhouette Index, which reflects the separability of the two object classes in terms of the Euclidean distance between data points in the embedding Y.

Raghavan Efficiency (ϕ Eff ) [59]: Raghavan Efficiency describes the rate of learning among active learning algorithms. We use ϕ Eff to compare the overall learning rate between 1) AdDReSS vs. GE, 2) SSAGE vs. GE, and 3) AdDReSS vs. SSAGE.
Maximum Query Efficiency (ϕ MQE ) Maximum Query Efficiency is the ratio between the maximum difference in the number of labels necessary to achieve the same classification performance and the number of potential queries.
Maximum Information Gain (ϕ MIG ) W e d e fine maximum information gain as the maximum difference in classification performance ϕ Acc at a given label query amount l doi:10.1371/journal.pone.0159088.t003 A larger ϕ MIG refers to a larger difference between the classification performance between embeddings constructed by AdDReSS and embeddings generated by SSAGE.  D 4 ), where different amounts of labeled data l are revealed to the classifier. We notice greater ϕ Acc for AdDReSS across all amounts of revealed labels l. The accuracy curve corresponding to AdDReSS also approaches the empirical maximum ϕ Acc at a faster rate compared to SSAGE. GE is also shown for each case as a comparison. The use of sufficient labeled instances suggests a clear advantage in employing semi-supervision for DR. Furthermore, the improved performance of AdDReSS over SSAGE across all labeled instances reveals a measurable difference in ϕ Acc at a point between the minimum l = 10% and the maximum number of revealed labels l = 50%. This is due to the fact that for small training size, l = 10%, there is a significant overlap in S tr for AdDReSS and SSAGE due to the identical initialization S tr . Similarly at l = 50%, training samples are exhausted from the pool E tr , such that S tr ¼ E tr for both AdDReSS and SSAGE. Therefore, the greatest measurable difference between Acc ðY Ad l Þ and Acc ðY SS l Þ can be seen where 10%<l < 50%, reflecting the difference in the active learning and random sampling strategies towards the composition of c i 2 S tr , and subsequently, towards the resulting embeddings Y Ad l and Y SS l .
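Both label-efficiency measures reduce to simple comparisons of two learning curves. A minimal sketch (hypothetical helper names; it assumes the two accuracy curves were evaluated on a shared grid of label budgets, and that the label counts are matched to a common set of accuracy targets):

```python
import numpy as np

def max_information_gain(acc_ad, acc_ss):
    """phi_MIG: largest gap Acc(Y^Ad_l) - Acc(Y^SS_l) over label budgets l."""
    return float(np.max(np.asarray(acc_ad) - np.asarray(acc_ss)))

def max_query_efficiency(labels_ad, labels_ss, n_total):
    """phi_MQE: largest label savings (l_SS - l_Ad) at a matched accuracy
    level, normalized by the total number of samples N."""
    savings = np.asarray(labels_ss, dtype=float) - np.asarray(labels_ad, dtype=float)
    return float(np.max(savings)) / n_total

# Two accuracy curves sampled at the same label budgets
phi_mig = max_information_gain([0.80, 0.90, 0.92], [0.70, 0.85, 0.91])
# Mean labels needed by each method to reach two shared accuracy targets
phi_mqe = max_query_efficiency([25, 74], [43, 126], n_total=300)
```

The same helpers apply per dataset; the illustrative numbers above are not the paper's results.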

Evaluation via Silhouette Index (ϕ SI )
In Fig 5, we compare AdDReSS against SSAGE and GE in terms of ϕ SI on datasets (D 1 -D 4 ) by revealing different amounts of labeled data l. Relative to ϕ Acc , there is greater separation in ϕ SI between the semi-supervised methods and GE, suggesting that the separation of the object classes in the embedding space is more pronounced. Furthermore, ϕ SI (Y Ad ) outperforms ϕ SI (Y SS ) across all l. In contrast to ϕ Acc , the improvement in ϕ SI tends to continue with increasing amounts of revealed labeled information l; only when nearly l = 50% of labels are revealed does ϕ SI approach its empirical maximum.

Evaluation of Variance (ρ Acc , ρ SI )
In Fig 6, we compare the variance of ϕ Acc , ρ Acc , across varied amounts of revealed labels l for Y Ad , Y SS and Y GE . In D 4 , we notice very small differences in ϕ Acc , as ρ Acc is on average less than 0.0003 for all values of l. Nevertheless, we can observe significant differences between the ρ Acc of AdDReSS and SSAGE, with AdDReSS showing ρ Acc < 0.0001 in all but one instance and most instances of SSAGE showing ρ Acc > 0.0001. We notice greater differences in ρ Acc for D 1 and D 2 in Fig 6(a) and 6(b), respectively, as both AdDReSS and SSAGE are more sensitive to the composition of the initial training set c i ∈ S tr , reflected in the higher ρ Acc when l < 10%. ρ Acc subsequently decreases with increasing l as more training samples are queried by the active learner. For all experiments in D 1 , AdDReSS shows more consistency in ϕ Acc , as demonstrated by lower ρ Acc compared to SSAGE. Furthermore, AdDReSS shows ρ Acc values similar to those of the unsupervised GE method, which is reflective of the precision of the classifier. The same trends can be seen in D 2 for l > 28% (Fig 6(b)), where over 29 revealed labeled instances were used and AdDReSS shows lower ρ Acc compared to SSAGE. Similar to D 4 , there are very small differences in ϕ Acc (less than 0.00005) across all l; however, AdDReSS is shown to have a lower ρ Acc than SSAGE in all but 1 case, where the difference between AdDReSS and SSAGE is extremely small. In Fig 7, we demonstrate more consistent embeddings Y Ad compared to Y SS , as demonstrated by a lower ρ SI . However, unlike ρ Acc , ρ SI tends to increase with increasing l. In three of the four datasets D 1 -D 4 , we notice SSAGE to have greater ρ SI than AdDReSS, up to 3 or 4 times greater for D 2 and D 4 . In the remaining dataset, D 3 , ρ SI of AdDReSS steadily decreases with increasing l, whereas ρ SI of SSAGE experiences a slight increase with ascending l.
These trends in Figs 5 and 7 reflect the ability of the embedding to converge more quickly with increasing l for AdDReSS compared to SSAGE. The embedding for GE does not change with respect to l; therefore, there is no change in ϕ SI , and ρ SI = 0 in all cases. These results suggest that the embedding representation Y Ad is more stable than Y SS and is robust to the specific c i ∈ S tr used to initialize AdDReSS.

Evaluation via Raghavan Efficiency (ϕ Eff )
In Fig 8, we show the overall differences in efficiency, via ϕ Eff , between each pair of methods employed in this study: 1) AdDReSS vs SSAGE, 2) AdDReSS vs GE, and 3) SSAGE vs GE. In all cases, AdDReSS outperforms SSAGE in terms of ϕ Eff . Furthermore, the large positive ϕ Eff (Y Ad |Y SS ) values are consistent with what is seen in Fig 4, where AdDReSS shows greater ϕ Acc for varying proportions of revealed labels l.
In investigating dimensionality, ϕ Eff (Y Ad |Y SS ) is slightly higher overall when k = 2 for D 1 and D 2 compared to when k = 3, but is similar across dimensionalities for the imaging datasets D 3 and D 4 . While the imaging datasets do not show much of a difference in ϕ Eff (Y Ad |Y SS ) between dimensionalities, AdDReSS shows a pronounced difference in efficiency with fewer dimensions on the gene and protein expression datasets. While it is unclear why this difference is pronounced in the gene expression and proteomic datasets, overall, the results suggest that active learning could be used to represent the data with fewer dimensions compared to random sampling.
The improvement in efficiency afforded by AdDReSS compared to SSAGE is summarized in Table 4 using GE as the baseline. Table 4 shows the percentage increase between ϕ Eff (Y Ad |Y GE ) and ϕ Eff (Y SS |Y GE ) for all datasets D 1 -D 4 . Overall, the mean percentage improvement in ϕ Eff across all datasets from using AdDReSS instead of SSAGE was +10.52% for k = 2 and +60.05% for k = 3, suggesting that the advantage of AdDReSS over SSAGE grows as the number of dimensions increases.

Evaluation via Maximum Information Gain (ϕ MIG )
In Fig 9, we show the maximum information gain that can be achieved via AdDReSS compared to SSAGE for each dataset. For D 4 , ϕ MIG = 0.0208, meaning a maximum improvement in ϕ Acc of over 2% (from 0.8340 to 0.8548) due to AdDReSS compared to SSAGE. This improvement in ϕ Acc via Y Ad is equivalent to 60 additional correctly classified samples for D 4 compared to Y SS . In Fig 9(a), when l = 46% (47 labels revealed), D 1 shows ϕ MIG = 0.0608, an improvement of over 6% in ϕ Acc when using AdDReSS compared to SSAGE. For D 2 , ϕ MIG = 0.0465, an improvement from 0.8764 to 0.9228 in ϕ Acc , with the best improvement found when l < 30% (fewer than 72 labels revealed). Lastly, for D 3 , ϕ MIG = 0.0079, which is notable given the high overall classification performance on this dataset. The maximum information gain also occurs when l < 30% for D 3 . The results for ϕ MIG suggest a faster rate of learning when using AdDReSS compared to SSAGE.

Evaluation via Maximum Query Efficiency (ϕ MQE )

AdDReSS requires an average of 417 fewer labels than SSAGE to achieve ϕ Acc = 0.8462. Stated another way, SSAGE required the use of an additional 6.98% of the labels ℓ(c i ), c i ∈ E, to achieve the same performance as AdDReSS. For D 1 , ϕ MQE = 0.1748: while AdDReSS used an average of 25 revealed labeled instances to achieve ϕ Acc = 0.74, SSAGE required an average of 43 revealed labeled instances to achieve the same ϕ Acc . Similarly, for D 2 , ϕ MQE = 0.1730, such that AdDReSS required, on average, 74 labels to achieve ϕ Acc = 0.9244 while SSAGE required nearly the entire training pool, E tr , of 126 labels, as shown in Fig 10(c).
Although D 3 showed a relatively small ϕ MIG , its ϕ MQE = 0.2817, corresponding to 2509 fewer training cases for Y Ad to achieve the same classification accuracy as Y SS at l = 4178. Put another way, Y SS required 2.503 times as many training samples to achieve the classification performance of Y Ad at l = 1669. Overall, for D 1 -D 4 , AdDReSS achieved the same classification accuracy as SSAGE while using a mean of 48.8% (and up to 60%) fewer labels.

Concluding Remarks
In this work, we presented a novel nonlinear dimensionality reduction methodology, Adaptive Dimensionality Reduction with Semi-Supervision (AdDReSS), which attempts to seamlessly integrate active learning into semi-supervised dimensionality reduction (SSDR) to yield low dimensional representations of high dimensional data. To date, no methods that we are aware of have demonstrated the utility of active learning for improving low dimensional data representations. These representations yield greater classification accuracy and class separability while using fewer class labels. AdDReSS attempts to address the problem of classifying 'big data' and the very real problem of often lacking the class labels or annotations with which to train a classifier. Our scheme employs active learning to query the few labels that contribute most towards building low dimensional embeddings with high object class separability and classification performance. We quantified the differences between AdDReSS and SSAGE on problems involving imaging and non-imaging channels from 4 distinct biomedical datasets (MR brain imaging, prostate gene expression, ovarian proteomic spectra, and breast histology images). Based on the results assessed over 8000 experiments, we make the following observations:
• AdDReSS has greater predictive potential compared to SSAGE and GE based on classification accuracy when different numbers of instances have their labels revealed.
• AdDReSS achieved a higher Silhouette Index compared to SSAGE and GE, suggestive of an embedding with greater separation between the object classes.
• In comparisons of overall efficiency, AdDReSS converges to the maximum possible accuracy at a faster rate than SSAGE and GE, as measured by a mean 35.95% increase in Raghavan efficiency.
• The potential savings in terms of the number of labels to be queried to achieve the same classification accuracy was shown to be up to 56% for AdDReSS compared to SSAGE across the datasets considered.
• AdDReSS was also found to be more robust to randomized training set initialization, in that it appeared to have a lower variance in terms of classification accuracy and Silhouette Index compared to SSAGE in the datasets considered.
Our findings suggest that active learning has a measurable effect compared to random sampling on SSAGE for embedding construction and that AdDReSS could be a powerful data analysis and classification tool for high dimensional biomedical data, especially in scenarios where partial or incomplete annotations and class labels are available. Future work will involve further evaluation of the effects of AL on SSDR methods beyond the ones considered in this paper.
The formulation for ϕ SI is as follows:

\phi_{SI} = \frac{1}{|E_{tr}|} \sum_{c_i \in E_{tr}} \frac{b(c_i) - a(c_i)}{\max\left[ a(c_i), b(c_i) \right]},

where a(c i ) is the mean distance from c i to all other points of its own class in Y, and b(c i ) is the mean distance from c i to the points of the nearest other class. ϕ SI ranges from -1 to 1, where -1 corresponds to the worst and 1 to the best possible embedding. For each experiment, ϕ SI is calculated using all labels ℓ(c i ), c i ∈ E tr in Y.
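As a sketch, ϕ SI can be computed with scikit-learn's silhouette_score; the toy embedding Y and labels below are illustrative, not drawn from the paper's datasets:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy 2-D embedding Y with two well-separated object classes
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # class 0
               rng.normal(2.0, 0.3, (50, 2))])  # class 1
labels = np.array([0] * 50 + [1] * 50)

# Mean silhouette over all samples using Euclidean distance; lies in [-1, 1]
phi_si = silhouette_score(Y, labels, metric="euclidean")
```

Well-separated classes, as here, drive phi_si toward 1; heavily overlapping classes push it toward 0 or below.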

Evaluation of Embedding Variance via Classification Accuracy (ρ Acc )
The rate of learning is affected by the initial training examples S tr provided to the algorithm. It is anticipated that active learning will consistently identify training instances, S a , that lead to improved classification, whereas random sampling will show more varied improvement due to the variance in the specific training instances chosen. We test the variance in ϕ Acc of our algorithm (AdDReSS) compared to SSAGE across all runs, each with a unique random initialization S tr . Classification variance is computed as

\rho_{Acc} = \frac{1}{n} \sum_{i=1}^{n} \left( Acc_i - \overline{Acc} \right)^2,

where n = 20 is the number of random initializations and \overline{Acc} = \frac{1}{n} \sum_{i=1}^{n} Acc_i is the mean across the n values of Acc_i. A lower ρ Acc suggests greater robustness to initialization via a more consistent ϕ Acc .

Evaluation of Embedding Variance via Silhouette Index (ρ SI )
Similar to ρ Acc , we also aim to quantify the variance of the embedding with regards to the Silhouette Index, which reflects the separability of the two object classes in terms of the Euclidean distance between data points in the embedding Y. ρ SI captures the variance in the embedding Y across all runs, each with a unique random initialization. Silhouette variance is computed as

\rho_{SI} = \frac{1}{n} \sum_{i=1}^{n} \left( SI_i - \overline{SI} \right)^2,

where n = 20 is the number of random initializations and \overline{SI} = \frac{1}{n} \sum_{i=1}^{n} SI_i is the mean across the n values of SI_i. A lower ρ SI suggests greater robustness to initialization in terms of a more consistent ϕ SI .
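Both variance measures are the same population variance applied to a different per-run score; a minimal sketch (illustrative values, not the paper's results):

```python
import numpy as np

def embedding_variance(scores):
    """Population variance of a per-run score (Acc_i or SI_i) across
    n runs with random training-set initializations (rho_Acc / rho_SI)."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean((scores - scores.mean()) ** 2))

# e.g., classification accuracies from repeated runs
rho_acc = embedding_variance([0.88, 0.90, 0.89, 0.91])
```

A lower value indicates that the runs agree, i.e., the method is robust to its random initialization.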

Evaluation of Overall Embedding Learning Rate via Raghavan Efficiency (ϕ Eff )
Raghavan Efficiency [59] describes the rate of learning among active learning algorithms. Fig 11 [46] provides a visual interpretation of Raghavan Efficiency, where the region identified by A represents the area between the Active Learning curve and the maximum achievable performance, and the region defined by B represents the area between the Active Learning curve and the Random Sampling curve. Raghavan Efficiency is defined as 1 - A/(A + B), such that ϕ Eff ranges between 0 and 1, with larger values of ϕ Eff indicative of a faster learning rate. We use ϕ Eff to compare the overall learning rate between 1) AdDReSS vs GE, 2) SSAGE vs GE, and 3) AdDReSS vs SSAGE.
To compare the efficiency of an active learner Y^Ac against random sampling Y^Rd, ϕ Eff may be expressed as

\phi_{Eff}(Y^{Ac} \,|\, Y^{Rd}) = 1 - \frac{A}{A+B} = 1 - \frac{\sum_{t=t_0}^{t_f} \left[ Acc(Y^{Rd}_{l=t_f}) - Acc(Y^{Ac}_{l=t}) \right]}{\sum_{t=t_0}^{t_f} \left[ Acc(Y^{Rd}_{l=t_f}) - Acc(Y^{Rd}_{l=t}) \right]},

where t 0 and t f represent the initial and final numbers of training samples used to learn Y, respectively. The empirical maximum accuracy refers to the highest ϕ Acc obtained for any single iteration of Y such that EM = \max_{i,l} \left[ Acc_i(Y^{Ac}_{l}) \right], where i ∈ {1, 2, . . ., n} denotes a specific run of Y^Ac with a unique initial training set S tr .
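Under this formulation, with the final random-sampling accuracy standing in for the maximum achievable performance, ϕ Eff can be sketched as follows (the curves below are illustrative):

```python
import numpy as np

def raghavan_efficiency(acc_active, acc_random):
    """phi_Eff = 1 - A/(A+B): compares an active-learning accuracy curve
    against a random-sampling curve sampled at the same label counts t_0..t_f."""
    acc_active = np.asarray(acc_active, dtype=float)
    acc_random = np.asarray(acc_random, dtype=float)
    target = acc_random[-1]                 # Acc(Y^Rd at l = t_f)
    A = np.sum(target - acc_active)         # area: AL curve up to final RS accuracy
    A_plus_B = np.sum(target - acc_random)  # area: RS curve up to final RS accuracy
    return float(1.0 - A / A_plus_B)

# An active learner that reaches the plateau earlier scores higher
phi_eff = raghavan_efficiency([0.5, 0.7, 0.8, 0.8], [0.5, 0.6, 0.7, 0.8])
```

When the two curves coincide, A equals A + B and ϕ Eff is 0; the faster the active-learning curve climbs, the closer ϕ Eff gets to 1.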
Additionally, to compare AdDReSS and SSAGE against the same baseline, GE, we summarized these results using the percentage difference between 1) ϕ Eff (Y Ad |Y GE ) and 2) ϕ Eff (Y SS |Y GE ). The percentage change in ϕ Eff for AdDReSS relative to SSAGE can be expressed as

\Delta\phi_{Eff} = \frac{\phi_{Eff}(Y^{Ad} | Y^{GE}) - \phi_{Eff}(Y^{SS} | Y^{GE})}{\phi_{Eff}(Y^{SS} | Y^{GE})} \times 100\%.

Author Contributions
Conceived and designed the experiments: GL. Performed the experiments: GL DR. Analyzed the data: GL. Contributed reagents/materials/analysis tools: GL DR. Wrote the paper: GL DR AM.