
Hit screening with multivariate robust outlier detection

  • Hui Sun Leong ,

    Contributed equally to this work with: Hui Sun Leong, Steven Novick

    Roles Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    huisun.leong@astrazeneca.com

    Affiliation Data Sciences and Quantitative Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Tianhui Zhang,

    Roles Methodology, Software, Writing – review & editing

    Affiliation Data Sciences and Quantitative Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Gaithersburg, Maryland, United States of America

  • Adam Corrigan,

    Roles Formal analysis, Methodology, Software, Writing – review & editing

    Affiliation Data Sciences and Quantitative Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Alessia Serrano,

    Roles Resources

    Affiliation Functional Genomics, Discovery Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Ulrike Künzel,

    Roles Resources

    Affiliation Functional Genomics, Discovery Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Niamh Mullooly,

    Roles Resources

    Affiliation Functional Genomics, Discovery Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Ceri Wiggins,

    Roles Resources

    Affiliation Functional Genomics, Discovery Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Yinhai Wang,

    Roles Writing – review & editing

    Affiliation Data Sciences and Quantitative Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Cambridge, United Kingdom

  • Steven Novick

    Contributed equally to this work with: Hui Sun Leong, Steven Novick

    Roles Conceptualization, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Data Sciences and Quantitative Biology, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Gaithersburg, Maryland, United States of America

Abstract

Hit screening, which involves the identification of compounds or targets capable of modulating disease-relevant processes, is an important step in drug discovery. Some assays, such as image-based high-content screens, produce complex multivariate readouts. To fully exploit the richness of such data, advanced analytical methods that go beyond conventional univariate approaches should be employed. In this work, we tackle the problem of hit identification in multivariate assays. As with univariate assays, a hit from a multivariate assay can be defined as a candidate that yields an assay value sufficiently far in distance from the mean or central value of inactives. Viewed another way, a hit is an outlier from the distribution of inactives. A method was developed for identifying multivariate hits in high-dimensional data sets based on principal components and the robust Mahalanobis distance (the multivariate analogue of the Z- or T-statistic). The proposed method, termed mROUT (multivariate robust outlier detection), demonstrates superior performance over other techniques in the literature in terms of maintaining Type I error, false discovery rate and true discovery rate in simulation studies. The performance of mROUT is also illustrated on a CRISPR knockout data set from an in-house phenotypic screening programme.

Introduction

Hit screening is an integral component of the preclinical workflow, from identification and validation of druggable targets through to the identification of lead compounds. Single-shot hit identification is traditionally performed with a univariate assay by comparing the assay signal of a compound or other external perturbation (hereafter, compound) to that of a set of inactive compounds or to one or more controls [1]. While there are many procedures to detect activity in a hit screening assay, this work focuses on techniques in which the activity of a compound is determined by calculating the distance between the assay signal of a single compound and the mean signal of inactives. If the distribution of inactives is assumed to be normal (a.k.a. Gaussian) and the assay output is univariate, then the distance may be calculated as the number of standard deviations between the two. Some hit screening assays, such as Cell Painting [2] or other morphological profiling, produce multivariate output that cannot be reduced to a single dimension. In such a case, a multivariate metric called the Mahalanobis distance [3] is calculated to measure the gap between a compound and the mean of inactives. Whether the data are univariate or multivariate, hit screening may be recast as an outlier detection exercise, i.e., the compound of interest is declared active if it falls outside of the normal population distribution of the inactive compounds. For the ith compound, the two competing hypotheses of interest are

$$H_0\!: X_i \text{ belongs to the population distribution of inactives} \quad \text{versus} \quad H_1\!: X_i \text{ does not}, \tag{1}$$

where $X_i$ denotes the (possibly multivariate) assay value of the ith compound.

In this work, a method for evaluating the hypotheses in (1) by detecting outliers in multivariate normal data is proposed and evaluated for its Type I error, false discovery rate, and true discovery rate. The method is demonstrated via simulation as well as on a real 96-dimensional CRISPR [4] gene knockout hit screening bioassay. To orient the reader, we first review univariate systems for hit screening and outlier detection.

Univariate outlier detection

For notation purposes, let $X_i$ denote the assay value of the ith compound (i = 1, …, N), let $\bar{X}$ denote the sample mean, and let σ denote the population standard deviation of inactives. One of the simplest systems to detect activity is given by Makarenkov et al. (Method 1 in [5]), who identify the ith compound as active if $X_i - \bar{X} > c\,\sigma$, where c is an appropriately pre-selected constant. Although σ cannot be known exactly, with enough experience with an assay, the value of σ may be determined with excellent precision. The Makarenkov rule with c = 3 would correctly identify the outliers in Fig 1, which displays computer-generated data from a normal distribution with true population mean μ = 10 and standard deviation σ = 3 for a set of 25 inactive compounds (circles), together with three outliers (triangles) that rise more than three standard deviations above the distribution. Through simple algebra, the value c may also be seen as the lower threshold for a Z-score, so that a compound is declared active (i.e., H1 is declared in (1)) if $Z_i = (X_i - \bar{X})/\sigma > c$.
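To make the rule concrete, here is a minimal R sketch; the data are simulated in the spirit of Fig 1, and the specific outlier values are ours for illustration only.

```r
# 25 inactives from N(10, 3^2) plus three illustrative outliers
set.seed(123)
x <- c(rnorm(25, mean = 10, sd = 3), 22, 24, 26)

sigma <- 3                        # population SD, treated as known from assay experience
z     <- (x - mean(x)) / sigma    # Z-scores relative to the sample mean
which(z > 3)                      # Makarenkov Method 1 with c = 3
```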

Fig 1. Sample mean and standard deviation estimates are influenced by the presence of outliers.

The graph shows observations from a normal distribution (circles) with mean = 10 and standard deviation = 3 and outliers (triangles). Solid, dashed, and dotted lines show the true population distribution, the normal distribution based on sample mean and standard deviation, and the normal distribution based on the median and median absolute deviation (MAD), respectively.

https://doi.org/10.1371/journal.pone.0310433.g001

Another procedure for selecting actives is given by the Minimum Discriminatory Difference (MDD) [6,7], which discriminates the signal between two compounds if the distance between their respective measurements is larger than $\sqrt{2}\,\sigma\,\Phi^{-1}(1-\alpha)$, where $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cumulative distribution function (CDF). The MDD statistic can be modified to compare $X_i$ to $\bar{X}$, declaring a compound as active if $X_i - \bar{X} > \sigma\sqrt{(N-1)/N}\;\Phi^{-1}(1-\alpha)$ or, with some algebraic manipulation, if $Z_{i,\mathrm{MDD}} = \frac{X_i - \bar{X}}{\sigma\sqrt{(N-1)/N}} > \Phi^{-1}(1-\alpha)$. In many situations, the value of σ is not known and is estimated by the sample standard deviation s with λ degrees of freedom. In that case, a T-score and rule to declare H1 may be created by $T_{i,\mathrm{MDD}} = \frac{X_i - \bar{X}}{s\sqrt{(N-1)/N}} > t^{-1}(1-\alpha, \lambda)$, where $t^{-1}(\cdot, \lambda)$ is the inverse central t CDF with λ degrees of freedom.

The formulation of MDD assumes that the calculation of the sample mean includes the ith compound assay value $X_i$. In the situation for which $X_i$ is not included in $\bar{X}$, the standard error of $X_i - \bar{X}$ is $\sigma\sqrt{1 + 1/N}$. In this case, the T-score version of MDD is equivalent to a prediction-interval outlier detection method in the pharmaceutical manufacturing literature [8,9], which declares $X_i$ an outlier if $X_i - \bar{X} > s\sqrt{1 + 1/N}\;t^{-1}(1-\alpha, \lambda)$ or, equivalently, if $T_{i,\mathrm{pi}} = \frac{X_i - \bar{X}}{s\sqrt{1 + 1/N}} > t^{-1}(1-\alpha, \lambda)$. For large values of N, the difference between $T_{i,\mathrm{MDD}}$ and $T_{i,\mathrm{pi}}$ is negligible. By removing N completely, $T_i = (X_i - \bar{X})/s$ becomes an outlier rule for standardized residuals.

In the presence of outliers, it is well known that the sample mean and standard deviation estimates can be biased, which can lead to false negatives (outliers are not detected) as well as false positives (inliers are falsely flagged). The problem of bias is well demonstrated by the data shown in Fig 1, where the population mean is 10 and the standard deviation is 3 for inactives. The sample mean and standard deviation of the inactives (circles) are very close to their true values, but the overall sample mean (including outliers) is 11.5 (biased by 1.5 units) and the overall sample standard deviation is 5.3 (1.8 times inflation). Bias in the center and spread can result in falsely declaring H0 (false negatives) and/or falsely declaring H1 (false positives) in (1). These effects are sometimes respectively called masking and swamping [10]. To mitigate these problems, so long as the percentage of outliers is small enough, various authors [11–13] propose robust measures of center and spread that are little influenced by the outliers in the system, such as the median and MAD. Applying robust statistics to the data in Fig 1, the median and MAD of the full data are 10.0 and 3.5, which are closer to the population values. Hampel's outlier detection rule [14], which is a robust version of Method 1 from Makarenkov et al. [5], substitutes the median and MAD for the mean and standard deviation. Brideau et al. [11] propose the B ("better") score, which is similar to Hampel's rule but also robustly models the row and column effects from an experimental plate using median polishing. Other robust methods to calculate the center and spread of a normal distribution with outliers include trimming, winsorizing, and weighting. Sondag et al. [12] propose a robust version of the prediction-interval method ($T_{i,\mathrm{pi}}$) that uses iterative weighting. For references on robust statistical methods, see [13,15,16].
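As an illustration, Hampel's rule amounts to only a few lines of R; note that R's mad() already applies the 1.4826 consistency factor, so the MAD estimates a normal standard deviation.

```r
# Hampel's rule: substitute the median and (scaled) MAD for the mean and SD
hampel_hits <- function(x, c = 3) {
  which(abs(x - median(x)) / mad(x) > c)
}
```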

Another robust method for outlier-contaminated normal data uses a symmetric, wide-tailed model, such as the Cauchy distribution, to locate the center of the inlier distribution. Motulsky and Brown [17] follow this track and propose robust outlier detection (ROUT), estimating the mean of the normal distribution with the Cauchy location statistic $\hat{\eta}$. The authors estimate the standard deviation with the robust standard deviation of residuals (RSDR), where RSDR is the 68.27th percentile of the absolute residuals $|X_i - \hat{\eta}|$. Sondag et al. [12] demonstrate that ROUT controls the false positive rate while providing an acceptable true positive rate. An outlier is declared by ROUT if the Benjamini-Hochberg [18] (hereafter, BH) multiplicity-adjusted p-value is less than a pre-determined false discovery rate Q (Motulsky and Brown suggest using Q = 0.01), where, under H0 in (1), the p-value is $2\Pr\!\left(T > |X_i - \hat{\eta}|/\mathrm{RSDR}\right)$ and T is central t-distributed with N − 1 degrees of freedom.
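A compact R sketch of this univariate ROUT logic is given below. It reflects our reading of [17] (maximum-likelihood Cauchy location, the 68.27% residual quantile, and BH adjustment) and is illustrative rather than the authors' reference implementation.

```r
rout_univariate <- function(x, Q = 0.01) {
  # Negative Cauchy log-likelihood; scale is parameterized on the log scale
  nll <- function(par) {
    -sum(dcauchy(x, location = par[1], scale = exp(par[2]), log = TRUE))
  }
  fit  <- optim(c(median(x), log(mad(x))), nll)   # Nelder-Mead by default
  eta  <- fit$par[1]                              # robust centre
  rsdr <- quantile(abs(x - eta), 0.6827)          # robust SD of residuals
  pval <- 2 * pt(abs(x - eta) / rsdr, df = length(x) - 1, lower.tail = FALSE)
  which(p.adjust(pval, method = "BH") < Q)        # declared outliers
}
```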

Multivariate outlier detection

Much of the literature for multivariate hit screening follows a similar logic by identifying outlier compounds that are a detectable distance from a robust estimate of the multidimensional mean of inactives. Statistical methods of multivariate outlier detection compute the distance of each observation from the center of the data and classify those above a certain threshold as outliers. Let $W \sim \mathrm{MVN}(\mu, \Sigma)$ denote a p-dimensional multivariate normal vector. The conventional measure of distance between W and μ is the Mahalanobis distance, given by $\mathrm{MD} = \sqrt{(W - \mu)'\,\Sigma^{-1}(W - \mu)}$. When μ and Σ are known, it is straightforward to show that the squared Mahalanobis distance follows a chi-square (χ²) distribution with p degrees of freedom, $\mathrm{MD}^2 \sim \chi^2(p)$.

For hit screening, let $X_i$ and $\bar{X}$ denote the p-dimensional vectors for the ith compound and the mean across compounds. The Mahalanobis distance analogue of the univariate MDD T-score is

$$T_i^2 = (X_i - \bar{X})'\left[(1 - 1/N)\,\hat{\Sigma}\right]^{-1}(X_i - \bar{X}), \tag{2}$$

where $\hat{\Sigma}$ is the sample variance-covariance estimator for Σ. Note that $T_i^2$ from Eq (2) is also called a Hotelling's T-squared statistic [19], which tests if the mean of $X_i$ is equal (or not) to the mean of $\bar{X}$. Under H0 in (1), $\frac{N-p}{p(N-1)}\,T_i^2$ is F(p, N − p) distributed. It follows that the p-value to test the hypotheses in (1) is given by $\Pr\!\left(F^* > \frac{N-p}{p(N-1)}\,T_i^2\right)$, where F* ∼ F(p, N − p). As in the univariate case, for the situation in which $X_i$ is not part of the calculation of $\bar{X}$, the variance-covariance term in Eq (2) is $(1 + 1/N)\,\hat{\Sigma}$.
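For reference, the following short R sketch computes the (non-robust) statistic of Eq (2) and its p-value for every compound, assuming each $X_i$ is included in the sample mean; base R's mahalanobis() returns squared distances.

```r
hotelling_pvalues <- function(X) {
  N  <- nrow(X); p <- ncol(X)
  d2 <- mahalanobis(X, colMeans(X), cov(X))  # (X_i - Xbar)' S^{-1} (X_i - Xbar)
  T2 <- d2 / (1 - 1/N)                       # Eq (2), X_i included in the mean
  pf((N - p) / (p * (N - 1)) * T2, p, N - p, lower.tail = FALSE)
}
```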

Because masking and swamping can occur in the multivariate space, it is often desirable to compute a robust version of the Mahalanobis distance using robust replacements for $\bar{X}$ and $\hat{\Sigma}$ (say, $\tilde{\mu}$ and $\tilde{\Sigma}$) that are less affected by outlying observations in Eq (2). Many robust procedures to calculate $\tilde{\mu}$ and $\tilde{\Sigma}$ have been proposed in the literature, including the minimum covariance determinant (MCD) estimator [20–22], S-estimators [23], MM-estimators [24], the orthogonalized Gnanadesikan-Kettenring (OGK) estimator [25], and shrinkage procedures [26]. When used to calculate Eq (2), these methods are called robust Mahalanobis distance (rMD) estimators.

Although the MCD estimator of Rousseeuw is widely used for rMD, Adrover and Yohai [27] report that MCD becomes unreliable as p becomes large. Other rMD methods may fare better, but because the number of parameters increases quadratically with p, it is likely that other robust procedures for estimating μ and Σ show a similar reliability issue. For large p, a dimensionality-reduction step via principal components analysis (PCA) can be performed. Hubert et al. [28] claim that standard PCA may be unreliable if the data contain outliers and suggest several robust PCA (rPCA) algorithms. Various authors [28–32] evaluate (1) by implementing an rPCA step, followed by an rMD testing procedure on the reduced-dimension scores matrix. See Todorov and Filzmoser [32] for an overview of this procedure. Unlike classic PCA (cPCA), the rPCA scores columns are not necessarily orthogonal, so the variance-covariance matrix in Eq (2) must be estimated; however, because of the reduced dimensionality in the rPCA scores space, the rMD reliability may be less of an issue. More modern dimensionality-reduction techniques, such as t-distributed Stochastic Neighbor Embedding (t-SNE) [33], Uniform Manifold Approximation and Projection (UMAP) [34], Independent Component Analysis (ICA) [35], kernel PCA [36], and Local Linear Embedding with Adaptive Neighbors (LLEAN) [37], might also be considered and may provide better separation in the data. However, it is unlikely that these methods would produce scores that are multivariate normally distributed, meaning that alternative approaches to the F-statistic would need to be considered.

The aforementioned authors who base outlier detection on Mahalanobis distance in a robust PCA space do not calculate a p-value or other probability metric, but instead provide a flag to indicate whether an observation is declared an outlier. This operation can result in an uncontrolled false discovery rate of outliers, which is a major drawback. By contrast, the method proposed in this work, which performs cPCA followed by rMD, tests the hypotheses in (1) with a BH FDR-adjusted p-value. By way of explanation for the cPCA step, our method requires orthogonality of the scores matrix. To enable a direct comparison of our method with the rPCA-rMD approaches, we adjusted the latter from returning an outlier flag to returning a p-value based on the distance metrics and distributional assumption (chi-square) proposed by the authors, and further corrected the p-value for multiple testing using the BH FDR correction method. That is, the authors provide an outlier flag if $\mathrm{MD}_i^2 > \chi^2_{0.975}(r)$, where $\mathrm{MD}_i^2$ is their test statistic and r is their calculated degrees of freedom. Even after we modified the rPCA-rMD methods to also provide a BH FDR-adjusted p-value, and despite the warning of Hubert et al. [28] about using cPCA, we found that our method had better Type I error control, better FDR control, and/or higher statistical power to declare H1 in (1) compared with the other methods. The performance of the proposed approach was evaluated through simulation studies with p = 2, 3, and 96, where the case p = 96 mimics a real data situation. The proposed method was also applied to a real CRISPR data set from an in-house hit screening programme.

While beyond the scope of this work, it is worth considering other advanced multivariate outlier detection methods that are capable of handling nonlinear relationships, non-stationary data and large-scale datasets, such as [38–40]. These methods typically involve the conceptualization of at least two groups, namely inactives versus others, through classification or clustering. While these techniques may offer advantages in some scenarios, our proposed method focuses solely on determining whether an observation belongs to the inactives group, eliminating the need for the assumption that the data are separable into more than one group, which simplifies the analysis and enhances efficiency.

Materials and methods

The mROUT algorithm

Motivated by the robust outlier detection method (ROUT) of Motulsky and Brown [17], we propose a method called multivariate ROUT (mROUT) for identifying outliers in high-dimensional data using a cPCA-rMD approach. The procedure mROUT borrows from Filzmoser, Maronna, and Werner [31] and Motulsky and Brown [17], and consists of the following broad steps.

  1. Dimensionality reduction
  2. Robust estimation of location and scatter
  3. Outlier detection via Hotelling’s T-squared testing

It is assumed that the majority of compounds (rows of the N × p matrix X) will show no activity. Further, assay data for inactive compounds are assumed to follow a multivariate normal distribution, i.e., $X_i \sim \mathrm{MVN}(\mu, \Sigma)$. No distributional assumption is placed on active compounds, which, under H1 of (1), are not part of the multivariate normal distribution.

The assumption that inactive compounds follow a multivariate normal distribution is not likely to be universally valid. Later, we describe processing of images using a multi-task convolutional neural network (CNN) approach. For our real in-house data sets, the CNN appears to produce Gaussian-like data columns. For profiles generated with other image analysis software, the data may exhibit skewed or other non-normal characteristics. The practice recommended by Caicedo et al. [41] is to transform feature values with mathematical operations such that they approximate a normal distribution, are mean-centered, and have comparable standard deviations, to facilitate downstream analysis. We therefore expect our proposed approach to be generally applicable after appropriate transformation or preprocessing of image features.

The essence of our work is that we create a robust distribution (after an appropriate dimensionality reduction) for the inactives and then ask whether each observation belongs to that distribution or not. Large deviations from the Gaussian distribution would require a different robust distribution for the inactives and a different test statistic; such a situation could be a subject for future research.

Step 1: Dimensionality reduction.

cPCA is performed on the mean-centered and standard deviation-scaled N × p data matrix X, with the aim of finding a smaller set of linear combinations of the original p variables that explains most of the variability in the data. These new variables, referred to as principal component scores, are orthogonal and uncorrelated with each other.

Let X denote the full data matrix. PCA can be constructed from a singular value decomposition X = RDV′, where R and V are orthonormal matrices and D is a diagonal matrix of singular values. For an overview of PCA, see Jolliffe and Cadima [42]. The PCA scores are generated as $S \equiv RD = XV$, with the property that S has orthogonal columns. It follows that, if $X_i \sim \mathrm{MVN}(\mu, \Sigma)$, then the ith row of the scores matrix is also multivariate normal with $S_i = X_i V \sim \mathrm{MVN}(\mu V, V'\Sigma V)$. Let $Y_i = S_{i,[1:k]}$ denote the first k (k ≤ p) columns of the ith row of S. Then $Y_i$ is multivariate normal with mean $\delta = (\delta_1, \dots, \delta_k) = (\mu V)_{[1:k]}$ and variance-covariance matrix $\Omega = (V'\Sigma V)_{[1:k,\,1:k]}$.

The contribution (proportion of variance) of the jth PCA score column is calculated as $d_j^2 \big/ \sum_{l=1}^{p} d_l^2$, and the importance of the first k columns is $\sum_{j=1}^{k} d_j^2 \big/ \sum_{l=1}^{p} d_l^2$, where $d_1, \dots, d_p$ are the diagonal elements of D. The contribution of each subsequent column decreases, such that the first principal component explains the largest variance in the data, the second principal component explains the maximum variance in the data not explained by the first principal component, and so on. As in Filzmoser et al. [31], we choose k so that the importance is at least 99% of the total variance of X. We did explore lowering the percentage for importance, such as to 95%, but found that lower values did not result in good statistical properties (Type I error was not well-controlled). We also explored using all principal component columns (k = p) and observed that when p is large (e.g., around 100), the Type I error was controlled but the power dropped precipitously because of the inclusion of all the extraneous noise columns. Let Y denote the reduced N × k scores matrix constructed from the first k columns of S, and let $Y_i = (Y_{i1}, \dots, Y_{ik})'$ denote the ith row of Y. For an inactive compound, $Y_i$ follows a multivariate normal distribution with orthogonal columns, so that $Y_i \sim \mathrm{MVN}(\delta, \Omega)$, where δ and Ω were defined earlier. The jth element of $Y_i$ follows a univariate normal distribution, $Y_{ij} \sim N(\delta_j, \omega_j^2)$ with $\omega_j^2 = \Omega_{jj}$, and $\mathrm{Cov}(Y_{ij}, Y_{il}) = 0$ for $j \neq l$.

If both δ and Ω are known, then the Mahalanobis distance from the ith observation to the mean of inactives in the scores space may be calculated as

$$\mathrm{MD}_i = \sqrt{(Y_i - \delta)'\,\Omega^{-1}\,(Y_i - \delta)} = \sqrt{\sum_{j=1}^{k} \frac{(Y_{ij} - \delta_j)^2}{\omega_j^2}}.$$
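A minimal R sketch of this dimensionality-reduction step, assuming X is the N × p data matrix, is shown below.

```r
# Step 1: cPCA by singular value decomposition on the scaled data
Xs  <- scale(X)                      # mean-centre and SD-scale each column
sv  <- svd(Xs)                       # Xs = R D V'
S   <- Xs %*% sv$v                   # scores; identical to sv$u %*% diag(sv$d)
imp <- cumsum(sv$d^2) / sum(sv$d^2)  # importance of the first k columns
k   <- which(imp >= 0.99)[1]         # smallest k with importance >= 99%
Y   <- S[, 1:k, drop = FALSE]        # reduced N x k scores matrix
```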

Step 2: Robust estimation of location and scale.

For the jth variable (i.e., the jth column of Y) in the selected k-dimensional principal component space, a likelihood model is fitted to the data to estimate the Cauchy location parameter, which serves as a robust measure of δj. The Cauchy distribution is symmetric about its median and has wider tails than the normal distribution, which allows for better accommodation of outlier-contaminated normal data. The Cauchy likelihood with location parameter η and scale parameter ν for independent and identically distributed Cauchy random variables $z_1, \dots, z_n$ is given by

$$L(\eta, \nu) = \prod_{i=1}^{n} \frac{1}{\pi\nu\left[1 + \left(\frac{z_i - \eta}{\nu}\right)^2\right]}.$$

The estimated location parameter $\hat{\eta}_j$ is a robust estimator for δj. Because 68.27% of values lie within one standard deviation of the mean in a normal distribution, a robust measure for $\omega_j$ is calculated using Motulsky and Brown's robust standard deviation of residuals [17]: $\mathrm{rsdr}_j$ = the 68.27% quantile of $\{|Y_{ij} - \hat{\eta}_j| : i = 1, \dots, N\}$, for j = 1, …, k.

The step to robustly estimate location and scale is applied independently to each of the principal component columns. We use a Nelder-Mead [43] optimization of the log-likelihood and found that the algorithm reached convergence in all of our computer simulations. For other nonlinear minimization algorithms, see [44]. Practitioners of our proposed method are encouraged to check that convergence was reached for all principal component columns.
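Continuing the Step 1 sketch above, the per-column Cauchy fit and RSDR calculation might be coded as follows; the warning mirrors the convergence check recommended above.

```r
# Step 2: robust location (Cauchy MLE) and scale (RSDR) for each scores column
robust_col <- function(y) {
  nll <- function(par) -sum(dcauchy(y, par[1], exp(par[2]), log = TRUE))
  fit <- optim(c(median(y), log(mad(y))), nll, method = "Nelder-Mead")
  if (fit$convergence != 0) warning("Cauchy fit did not converge")
  eta <- fit$par[1]
  c(eta = eta, rsdr = unname(quantile(abs(y - eta), 0.6827)))
}
est <- apply(Y, 2, robust_col)   # 2 x k matrix with rows "eta" and "rsdr"
```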

Step 3: Outlier detection via Hotelling’s T-squared testing.

Given $\hat{\delta} = (\hat{\eta}_1, \dots, \hat{\eta}_k)'$ and $\hat{\Omega} = \mathrm{diag}(\mathrm{rsdr}_1^2, \dots, \mathrm{rsdr}_k^2)$ as robust estimators for δ and Ω, it follows under H0 of (1) that $\frac{N-k}{k(N-1)}\,T_i^2 \sim F(k, N-k)$, where

$$T_i^2 = \frac{\mathrm{MD}_i^2}{1 + 1/N} \quad \text{and} \quad \mathrm{MD}_i^2 = (Y_i - \hat{\delta})'\,\hat{\Omega}^{-1}\,(Y_i - \hat{\delta}) = \sum_{j=1}^{k}\frac{(Y_{ij} - \hat{\eta}_j)^2}{\mathrm{rsdr}_j^2},$$

so that $T_i^2$ is a robust version of the Hotelling's T-squared statistic and $\mathrm{MD}_i$ is a robust Mahalanobis distance. The testing p-value is

$$p_i = \Pr\!\left(F^* > \frac{N-k}{k(N-1)}\,T_i^2\right), \tag{3}$$

where F* ∼ F(k, N − k).

As in the ROUT procedure, the p-values from the N compounds are multiplicity-corrected via the BH FDR procedure. A compound is declared active if the FDR-adjusted p-value is less than a pre-determined value Q; as in the ROUT method, we adopted Q = 0.01. Recall that if $Y_i$ is not part of the calculation of the mean or variance matrix, then $\mathrm{Var}(Y_i - \hat{\delta}) = (1 + 1/N)\,\Omega$ and $T_i^2 = \mathrm{MD}_i^2/(1 + 1/N)$. Regardless of whether $Y_i$ is part of the calculation, the factor $(1 + 1/N)$ makes the mROUT method slightly more conservative, and so we decided to keep $(1 + 1/N)$ in the denominator for all calculations. For large values of N, the term 1/N can be dropped.
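Completing the sketch, Step 3 combines the per-column estimates into the robust T-squared statistic and BH-adjusted p-values.

```r
# Step 3: robust Hotelling T^2 in the scores space, keeping the (1 + 1/N) factor
N    <- nrow(Y); k <- ncol(Y)
MD2  <- colSums(((t(Y) - est["eta", ]) / est["rsdr", ])^2)  # robust squared MD
T2   <- MD2 / (1 + 1/N)
pval <- pf((N - k) / (k * (N - 1)) * T2, k, N - k, lower.tail = FALSE)
hits <- which(p.adjust(pval, method = "BH") < 0.01)         # Q = 0.01
```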

Simulations

Simulation experiments were performed to study the performance of mROUT as a function of dimension (p), level of contamination (% of outliers), and distance of the outliers from the main body of data. The performance of mROUT was compared with other robust methods that have been proposed in the literature. For all simulations, the total number of observations was set to N = 200, which is similar to the data from our laboratories. The bulk of the observations were generated as inliers, drawn from a multivariate normal distribution. Outliers were randomly generated but located at specific Mahalanobis distances from the center of the distribution. For a given level of contamination ε (0 ≤ ε ≤ 0.2), N(1 − ε) inlier observations were generated from a p-dimensional multivariate normal distribution with mean μ and covariance matrix Σ = diag(s) C diag(s), for correlation matrix C and vector s of marginal standard deviations. Without loss of generality, μ = 0 and all elements of s were set to 1, so that Σ = C. The outlying points were generated at Mahalanobis distance d from the population mean 0. See S1 Appendix for details of the outlier generation procedure. Example computer code that demonstrates the simulation and mROUT procedures, written in the R language [45], is provided as S1 File.

By varying the values of p, ε, C and d, different simulation scenarios were created. Simulations were run with p = 2, 3, and 96, with p = 96 representing the dimension of the CRISPR assay data shown in a subsequent section. The contamination levels of outliers ranged from 0–20% of total observations. For p = 2, the correlation was set to ρ12 = 0, 0.5, or 0.9, where ρjl denotes the (j, l) element of C. For p = 3, three combinations of correlations were examined: (ρ12, ρ13, ρ23) = (0, 0.1, 0.3), (0, 0.3, 0.7), and (0, 0.5, 0.7). For p = 96, a single correlation matrix constructed to reflect our real data set was used; the 96 × 96 correlation matrix is available as S2 File. Except for the case ε = 0, the Mahalanobis distance d ranged from 4 to 6 for p = 2 and 3, and from 12 to 30 for p = 96. For each scenario, 10,000 Monte Carlo runs were performed. An overview of the values assigned to these parameters in the different simulated datasets is shown in Table 1.
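For readers who wish to reproduce a scenario, the sketch below generates one contaminated data set. The paper's exact outlier construction is given in S1 Appendix; the placement used here (a random direction scaled to Mahalanobis distance d from the origin) is our own illustrative version.

```r
library(MASS)  # for mvrnorm

simulate_scenario <- function(N = 200, C = diag(3), eps = 0.05, d = 5) {
  p     <- ncol(C)
  n_out <- round(N * eps)                      # assumes at least one outlier
  inliers <- mvrnorm(N - n_out, mu = rep(0, p), Sigma = C)
  L <- chol(C)                                 # C = t(L) %*% L
  outliers <- t(replicate(n_out, {
    v <- rnorm(p)                              # random direction
    d * drop(v %*% L) / sqrt(sum(v^2))         # exact Mahalanobis distance d
  }))
  rbind(inliers, outliers)
}
```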

Other outlier detection procedures with available public software were evaluated and compared to mROUT. These methods come from the rrcov [32] and mvoutlier [31] libraries in R. While the work of Cabana et al. [26] looked promising, we were unable to evaluate it given the lack of software. From the rrcov library, the functions PcaCov(), PcaGrid(), PcaProj(), and PcaHubert() are rPCA-rMD methods that respectively use MCD [22], a grid-search projection pursuit [30], the projection pursuit algorithm of Croux and Ruiz-Gazen [29], and the Hubert et al. algorithm that blends MCD and projection pursuit [28]. Henceforth these methods will be referred to as PcaCov, PcaGrid, PcaProj, and PcaHubert. All of the rrcov methods declare the ith observation an outlier if the squared robust Mahalanobis distance $\mathrm{MD}_i^2$ is greater than $\chi^2_{0.975}(k)$, where k is the reduced dimensionality from the rPCA step. A p-value is easily created from this procedure as $\Pr(\chi^2(k) > \mathrm{MD}_i^2)$. Based on this calculation, we tested the hypotheses in (1) with BH FDR-adjusted p-values. From the mvoutlier library, the function pcout() provides the method of Filzmoser et al. [31], henceforth referred to as PCOut. Regrettably, the outlier-flagging procedure of PCOut could not be retooled into a p-value or any other probability metric; thus, we could not adjust it to avoid a high false positive rate.
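Our conversion of the rrcov output amounts to a few lines of R. The function and accessor names below are from the rrcov package; the chi-square p-value and BH adjustment are the modifications described above.

```r
library(rrcov)

fit  <- PcaHubert(X)                    # or PcaProj(X), PcaGrid(X), PcaCov(X)
md2  <- getSdist(fit)^2                 # squared robust (score) distances
k    <- fit@k                           # reduced dimensionality from the rPCA step
pval <- pchisq(md2, df = k, lower.tail = FALSE)
hits <- which(p.adjust(pval, method = "BH") < 0.01)
```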

For each simulated scenario and outlier detection procedure, Type I error, false discovery rate, and power were estimated. Type I error was estimated for inlier observations with ε = 0 by the proportion of times (out of 10,000) the unadjusted p-value was less than or equal to Q = 0.01. For mROUT, the unadjusted p-value is given by Eq (3), and for the rrcov procedures, the unadjusted p-value is given by $\Pr(\chi^2(k) > \mathrm{MD}_i^2)$. For PCOut, because there are no p-values, the proportion of outlier declarations was used to estimate Type I error. For the evaluation of the hypotheses in (1), H1 was declared (the observation is an outlier) whenever the BH FDR-corrected p-value was less than or equal to Q = 0.01. For PCOut, H1 was declared whenever the software flagged an observation as an outlier. A false positive occurs whenever an inlier is misclassified as an outlier; a true positive occurs whenever an outlier is correctly identified. For scenarios with ε > 0 and d > 0, the false discovery rate FDR = FP/(FP + TP) was calculated for each Monte Carlo run, where FP = number of false positives and TP = number of true positives out of N = 200 observations. The reported FDR is the Monte Carlo average over 10,000 simulated runs. Adjusting for Monte Carlo error, both the Type I error and FDR can be as high as 0.013; scenarios with Type I error and FDR less than 0.013 were considered to be in control. Finally, the statistical power of mROUT and the rrcov procedures was calculated as the proportion of times (out of 10,000) the BH FDR-adjusted p-value was less than or equal to Q = 0.01 for a single outlier observation in scenarios with ε > 0 and d > 0. For PCOut, unadjusted statistical power was calculated as the proportion of contaminated observations correctly flagged as outliers.

Results

Low-dimensional simulation results

Although mROUT was developed primarily for hit screening in high-dimensional data, we also evaluated its performance against the outlier detection algorithms mentioned in previous sections in 2- and 3-dimensional settings, to observe the relationship between the simulated outcomes and correlation.

The behaviours of mROUT, PcaCov, PcaHubert, PcaGrid, PcaProj and PCOut were first examined in the absence of outliers for p = 2 and 3. As shown in Fig 2A and 2B, mROUT, PcaCov and PcaHubert exhibited Type I errors within the acceptable range in both dimensions, independent of the correlation structure of the data. For the two projection pursuit-based approaches, PcaProj and PcaGrid, the Type I error appears to be under control in most cases but becomes elevated as the correlation between variables increases. PCOut, however, exhibited an uncontrolled Type I error of roughly 0.12 across all simulation settings, more than ten times the nominal threshold Q = 0.01 (see S1 Table).

Fig 2. Performance of mROUT and selected rPCA outlier detection procedures on 2- and 3-dimensional simulated data.

(A) Type I error for p = 2, ε = 0, (B) Type I error for p = 3, ε = 0, (C) FDR for p = 2, (D) FDR for p = 3, (E) power for p = 2, and (F) power for p = 3. The performance measures were estimated through 10,000 simulations with N = 200. Dashed line in (A), (B), (C) and (D) represents Monte Carlo error at 0.013. Dashed line in (E) and (F) represents 80% power.

https://doi.org/10.1371/journal.pone.0310433.g002

Fig 2C and 2D display the FDR of mROUT against the rrcov procedures for p = 2 and 3 and Q = 0.01 at different contamination levels. FDR is well-controlled for mROUT, reasonably controlled for PcaCov and PcaHubert (very slightly elevated in some cases), and sometimes out of control for the projection-pursuit methods PcaGrid and PcaProj. The FDR of the rrcov procedures appears to be influenced by the amount of contamination and the correlation structure, with the worst behaviour observed for PcaGrid and PcaProj in scenarios with 1% contamination that involve variables with larger correlations. Because PCOut could not be adjusted for multiple testing, the FDR for PCOut is a decreasing function of the contamination level (ε) and ranged between 0.12 and 0.92 in our simulations (see S2 Table).

Lastly, the statistical power to detect outliers as a function of contamination level ε and Mahalanobis distance d was examined. The results are presented in Fig 2E and 2F, which display power as a function of ε in different correlation settings with d ranging from 4 to 6, p = 2 and 3, and Q = 0.01. Two features are apparent from these plots. First, for a specific contamination level, the Mahalanobis distance required to confidently declare an outlier increases as the dimension rises. Second, as the amount of contamination rises beyond 10%, all methods suffer a drop in power, and only outliers with d ≥ 5.5 were almost always detected. Of all the methods, the best overall performance is observed for mROUT, which attains reasonably high power (99% for p = 2, 94% for p = 3) in identifying outliers with d ≥ 5.5 without being affected by the correlation structure of the data. When the level of contamination rises to 20%, PcaCov and PcaHubert seem to outperform mROUT in cases with smaller d, but their respective statistical power drops markedly in highly correlated data. Similarly, PcaGrid and PcaProj show comparable power to mROUT in most cases but perform less well in scenarios with highly correlated variables; however, the increase in power for PcaGrid and PcaProj may be partially attributed to their inflated Type I error and FDR. PCOut has exceptionally high power (~100%) in all scenarios; however, this comes at the cost of an uncontrolled FDR, meaning that the user cannot be sure whether a detected outlier is a true or false positive (S2 and S3 Tables).

High-dimensional simulation results

Based on the results obtained from the low-dimensional simulation runs, we selected three methods that showed a good balance of performance and computational speed for evaluation in a higher-dimensional setting: mROUT, PcaHubert and PcaProj. While the MCD-based PcaCov approach showed good overall performance in the lower-dimensional space, its runtime (for us) was prohibitively high for p = 96. We did not apply PcaGrid because its performance was similar to, but slightly inferior to, that of PcaProj. Finally, we did not include PCOut given its Type I error and FDR performance.

The simulation was performed with p = 96, N = 200 and a single correlation matrix that mimics the structure of a real data set. The levels of contamination are identical to those explored in the previous simulation experiments, but the range of Mahalanobis distance d (in 96-dimensional space) was increased to 12 to 30.

All three methods have reasonably good performance under the null hypothesis of no outliers in (1) and maintain the Type I error at an acceptable level (< 0.013). Notably, mROUT has the lowest Type I error, while PcaHubert and PcaProj are slightly elevated (Fig 3A). There is a strong suggestion that PcaProj generates an overage of false positives in systems with low levels of contamination (ε = 1% and 5%). PcaHubert displays slightly elevated FDR in scenarios with ε = 1% (Fig 3B). mROUT is the only approach that keeps the FDR under control below Q = 0.01 across all settings, an observation consistent with the results from the p = 2 and 3 simulation studies. In terms of statistical power, mROUT and PcaProj show comparable performance and both outperform PcaHubert over the range of ε and d tested, but the FDR of PcaProj can sometimes be undesirably large (Fig 3B and 3C and Table 2). For a given contamination level, mROUT and PcaProj were able to detect outliers at least 26 Mahalanobis units away from the center of the data with 80% power, and the power remained relatively constant as the level of contamination grew from 1% to 20%.

Fig 3. Performance comparison of mROUT, PcaHubert and PcaProj on 96-dimensional simulated data.

(A) Type I error, (B) FDR, and (C) power were estimated through 10,000 simulations for each combination of ε and d with N = 200. Dashed line in (A) and (B) represents Monte Carlo error at 0.013. Dashed line in (C) represents 80% power.

https://doi.org/10.1371/journal.pone.0310433.g003

Table 2. Statistical power and false discovery rate (FDR) of mROUT, PcaHubert and PcaProj in high-dimensional simulations using Q = 0.01.

https://doi.org/10.1371/journal.pone.0310433.t002

Intrinsic performance evaluation

In the comparative analysis presented in the previous section, the number of observations (N) was fixed at 200 to emulate the size of a typical data set encountered in hit screening assays from our laboratories. Using the same simulation setup, we extended our experiments to assess the outlier detection efficacy of mROUT in systems with fewer observations, dropping N to 20, 40 and 100. We found that, while Type I error and FDR are maintained at levels below 0.01, there is a noticeable decrease in statistical power when the number of observations is limited (N = 20 and 40), even for p = 2 and 3. The decline in performance is likely due to insufficient data for accurately estimating the inactive distribution and the robust statistics. As the number of observations increases (N = 100), there is a marked improvement in statistical power, irrespective of the number of dimensions and the level of contamination in the system (see S3 File).

Controlling the FDR is an important consideration for multivariate outlier detection because, in practice, the candidate outliers are not known in advance, and all observations are tested sequentially. Up to this point, we have adhered to the Q value recommended by Motulsky and Brown in the original ROUT paper [17]. By setting Q = 0.01, we are aiming to restrict the ‘hits’ to observations that fall beyond the 99th percentile of the inactive distribution. In certain applications, a different Q value may be preferable.

To provide insight into the trade-offs between true positive rate (sensitivity) and false positive rate (1 − specificity) for our proposed method, we used simulated data to evaluate the effects of varying Q within the range of 0.001 to 1. The simulations were conducted with p = 96 and N = 200, incorporating contamination levels of 1%, 5%, 10% and 20%. Mahalanobis distance (MD) values of 15, 20, 25 and 30, which yield reasonable discovery rates, were used. In each scenario, 10,000 Monte Carlo runs were performed to estimate the true positive rate (TPR) and false positive rate (FPR). The simulation outcomes are presented as receiver operating characteristic (ROC) curves in Fig 4. The results show that mROUT effectively constrains the FPR across the range of tested Q values. While the TPR (sensitivity) remains low when MD = 15 (representing a small effect size that is difficult to detect), it rises to over 75% when MD exceeds 25, without compromising the FPR. In all scenarios tested, increasing Q from 0.01 to 0.05 leads to an improvement in the TPR but has minimal impact on the FPR. These findings suggest that mROUT is well-suited for applications where prioritizing higher sensitivity is advantageous, as it alleviates concerns about incurring an excessively high false positive rate when using a higher Q value.

Fig 4. ROC curves of mROUT obtained by varying the significance threshold Q.

False positive rate (x-axis) is calculated as FP / (FP + TN), where FP is the number of false positives and TN is the number of true negatives. True positive rate (y-axis) is calculated as TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives. The numbers in brackets indicate the false positive rate (first number) and true positive rate (second number) observed for Q = 0.01 or 0.05.

https://doi.org/10.1371/journal.pone.0310433.g004

In the next section, we will demonstrate the performance of mROUT on a real data set generated from an in-house phenotypic screen.

Application to real data

Although hit screening was described as a method to detect the activity of compounds, it can be viewed more generically, such as in this high-content image-based CRISPR screening data set (provided as S4 File). This data set was collected as part of a high-throughput arrayed CRISPR knockout screening effort for identifying potential novel drug targets for chronic kidney disease (CKD) in a disease-relevant renal cell model–primary human glomerular microvascular endothelial cells (HGMECs, Cell Systems, catalog no. ACBRI 128). A gene knockout that produced the desired phenotypic changes in HGMECs upon treatment with pro-inflammatory cytokines is considered a “hit”. In this study, the activity of single-shot gene knockouts was examined relative to inactive and active controls. The entire experiment was supported on a single 384-well plate with 218 gene knockouts (KO), 14 inactive control (IC) wells, 14 active control (AC) wells, and 6 lethal control (LC) wells. A priori, it was assumed that the majority of knockouts would not show activity.

Images of the wells were acquired using a Cell Voyager 7000S spinning disk confocal microscope (Yokogawa) and processed with a multi-task convolutional neural network (CNN). The CNN was trained on a self-supervised task using contrastive learning to represent the image content [46], together with a binary classification task to differentiate between inactive and active controls. The two tasks are trained jointly by adding the loss functions from each task to give a single scalar loss to be minimized. We use the output from the classification task to assign cells in each well as active or inactive, and then take the fraction of active cells in each well as the standard univariate response variable. The data were initially analyzed with a univariate statistical approach, but we found that it lacked the sensitivity to detect gene knockouts showing subtle and distinct phenotypic changes. Recognizing that CKD is a complex disease with multiple patient endotypes, a more sophisticated approach that scores the phenotype beyond simple classification was considered. Instead of analyzing the results from the final classification node of the neural net, we captured p = 96 dimensional data from the penultimate node and took the median across cells in each well to give a single vector per treatment well. In this high dimension, the data have no intrinsic meaning but, given the self-supervised neural net training, should separate gene knockouts that show activity different from the inactive controls based on their visual appearance in the images.

The mROUT procedure was applied to the 218 KOs and 14 ICs from the p = 96 data. Before PCA, the data were mean-centered and standard deviation-scaled. The number of PCA components required to explain 99% of the variability in the original data was k = 11. Using a significance threshold Q = 0.01, mROUT identified 12 gene KOs as hits. All the ACs and LCs were detected (true positives) and none of the ICs were declared hits (true negatives). Fig 5A shows a two-dimensional PCA plot of the study data, with the PCA model used to project the AC and LC wells. Most of the gene KOs (light blue circles) were mapped to the same cluster as the ICs (orange squares), with the detected KOs (dark blue circles) appearing detached from the main cluster and pointing in different directions.

Fig 5. mROUT was able to identify biologically relevant hits in real data set.

(A) Two-dimensional representation of the feature space data. Acronyms are KO = gene knockout, AC = active control, IC = inactive control, LC = lethal control. Dark blue circles represent observations detected by mROUT as potentially active KOs. (B) to (G) are confocal images of phenotypes seen in the IC, AC and four active KOs detected by mROUT. The IC showed a stressed phenotype induced by cytokine treatment. The AC showed a healthy phenotype. KO70 overlaps with the AC in feature space and showed a similar phenotype to the AC. KO38, KO189 and KO47, which deviate from the IC-AC axis, showed fibrotic- and apoptotic-like phenotypes suggestive of novel disease-driving mechanisms. The cells were stained with CD31 (orange), Hoechst (blue) and CellMask (red).

https://doi.org/10.1371/journal.pone.0310433.g005

While it is beyond the scope of this paper to delve into the specific biology associated with each of the detected KOs, the two active KOs pointing in the direction of the ACs are likely the most biologically relevant hits, because they prevented the cytokine-induced stressed phenotype seen in the ICs while overlapping with the feature space of the ACs (Fig 5B–5D). It may, however, also be worth exploring the other active KOs that deviated from the IC-AC axis, because some of them showed distinct phenotypes that could be associated with novel molecular mechanisms driving the disease (Fig 5E–5G).

Instead of relying solely on a two-dimensional visualization, one can enrich the mROUT results by additionally calculating a multivariate angle for each KO relative to the means of IC and AC. Angle calculations will be briefly explored in the discussion section.

Discussion

The mROUT procedure was illustrated through simulation and a real data set, and shown to detect outliers with statistical power competitive with rPCA-rMD methods while also maintaining Type I error and FDR. Thus, mROUT can be a powerful tool for multidimensional hit screening. There are several considerations for future work, which are discussed below.

The neural network architecture we have implemented tends to generate approximately normally distributed features in our hands; however, for other morphological profiling features, such as those generated by CellProfiler [47], this is usually not the case. To enable application of mROUT to such data, we propose incorporating a variational autoencoder (VAE) [48,49] into the image feature extraction workflow. The VAE serves to enforce the distribution of the latent space alongside the reconstruction loss through the inclusion of a Kullback-Leibler divergence term in the overall loss function. This regularization encourages the VAE to learn a latent space adhering to a specific distribution (in our case, a standard normal distribution) while simultaneously optimizing for precise data reconstruction.

As implied in the real data section, the relative angle between a gene KO and two reference points (e.g., inactive and active controls) may provide additional information to enrich, or perhaps triage, the list of detected KO activity. In terms of mathematical vector properties, the mROUT system can detect outliers that are of a certain magnitude with no regard for direction. If a system contains two reference points, such as the robust mean of the inactives (from the Cauchy step of mROUT) and the mean of the active controls in the PCA scores space, the relative angle of a KO can be calculated. Let μ0 and μ1 denote the robust inactives mean and the active control mean in the scores space, and let y denote the scores vector for a KO of interest. The angle of interest is between y − μ0 and μ1 − μ0, where the angle in radians between two vectors u and v is given by $\theta = \arccos\!\left(\frac{u'v}{\|u\|\,\|v\|}\right)$. By this calculation, direction may also be considered in hit screening.
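A short R sketch of the angle calculation (the vector names are ours):

```r
# Angle in radians between two vectors u and v
vec_angle <- function(u, v) acos(sum(u * v) / sqrt(sum(u^2) * sum(v^2)))

# theta <- vec_angle(y - mu0, mu1 - mu0)  # y, mu0, mu1 in the PCA scores space
```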

The quality of an assay run should be examined prior to implementing mROUT for hit detection. Given a system with a set of inactive controls and active controls, one can calculate a multidimensional analogue of the Strictly Standardized Mean Difference (SSMD) quality statistic proposed by Zhang [50] for low and high univariate controls, given by $\mathrm{SSMD} = \frac{\mu_H - \mu_L}{\sqrt{\sigma_L^2 + \sigma_H^2}}$, where μL and μH denote the low and high control means and σL and σH denote the low and high control standard deviations. A Mahalanobis distance version of SSMD for multivariate data is given by $\mathrm{MD}_{qc} = \sqrt{(\mu_{C1} - \mu_{C2})'\,(\Sigma_{C1} + \Sigma_{C2})^{-1}\,(\mu_{C1} - \mu_{C2})}$, where μC1 and μC2 denote the means and ΣC1 and ΣC2 denote the variance-covariance matrices of the two controls. Quality control cut-offs for MDqc can be determined empirically through observations from assay development and the remaining assay life cycle.
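A short R sketch of this quality statistic (argument names are ours):

```r
# Mahalanobis-distance analogue of SSMD for two multivariate controls
md_qc <- function(mu_c1, mu_c2, sigma_c1, sigma_c2) {
  d <- mu_c1 - mu_c2
  sqrt(drop(t(d) %*% solve(sigma_c1 + sigma_c2) %*% d))
}
```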

Finally, we consider the issue of combining data from multiple plates. Possible solutions include normalization to remove the plate effect and modelling the plate effect in one of the mROUT steps. In our process, the image data are normalized to the [0, 1] range prior to the neural net step to remove gross plate-to-plate differences in image intensity. We also apply a plate mean subtraction step at the p-dimensional level (i.e., on the raw data). If additional location effects are suspected, one may also correct for spatial effects (i.e., systematic row and column effects on the plate) using normalization techniques such as B-score [11] or Loess [51] corrections. Alternatively, plate, row, and column terms could be included in the PCA or the Cauchy step of the mROUT procedure. We have not fully explored the pros and cons of each plate-adjustment step and view these corrections as worthy of further consideration.

Conclusions

We propose a new method, mROUT, for identifying outliers in multivariate assays based on principal components analysis and robust Mahalanobis distance. Analyses of computer-simulated data demonstrate that mROUT attains high true discovery rates while controlling Type I error and FDR. Relative to competitor procedures found in the literature with publicly available software, mROUT is the only method that includes an FDR-controlling step, implying that, without our modifications, the competitor methods do not control FDR. Even with our modifications, the competing procedures sometimes produce FDRs that are larger than expected. When applied to real phenotypic screening data, mROUT was able to identify biologically relevant hits and uncovered potential novel mechanistic insights through genotype-phenotype mapping. Thus, for multivariate assays with multivariate normal data, such as the neural net-processed CRISPR image data in our example, mROUT can be a powerful tool for hit screening. Future work may involve comparisons to techniques incorporating a nonlinear dimension reduction step, while also exploring their applicability to non-Gaussian data.

Supporting information

S1 Appendix. Outlier generation procedure.

Details of the outlier generation method used in the simulation study.

https://doi.org/10.1371/journal.pone.0310433.s001

(DOCX)

S1 File. Computer code.

Computer code written in the R programming language that illustrates the mROUT procedures on a simulated multivariate data set.

https://doi.org/10.1371/journal.pone.0310433.s002

(R)

S2 File. Correlation matrix used in high-dimensional simulations.

The 96×96 correlation matrix used in the simulation study with p = 96.

https://doi.org/10.1371/journal.pone.0310433.s003

(CSV)

S3 File. Effect of sample size.

Performance of mROUT estimated for 2-, 3- and 96-dimensional simulations with different number of observations.

https://doi.org/10.1371/journal.pone.0310433.s004

(DOCX)

S4 File. CRISPR screen data.

Real CRISPR screening data that contains p = 96 features from the penultimate node of the neural net.

https://doi.org/10.1371/journal.pone.0310433.s005

(CSV)

S1 Table. Type I error of simulation studies.

Type I error of mROUT and other outlier detection methods estimated for 2-, 3- and 96-dimensional simulations.

https://doi.org/10.1371/journal.pone.0310433.s006

(DOCX)

S2 Table. FDR of low-dimensional simulation studies.

FDR of mROUT and other outlier detection methods estimated for 2- and 3-dimensional simulations.

https://doi.org/10.1371/journal.pone.0310433.s007

(DOCX)

S3 Table. Power of low-dimensional simulation studies.

Statistical power of mROUT and other outlier detection methods estimated for 2- and 3-dimensional simulations.

https://doi.org/10.1371/journal.pone.0310433.s008

(DOCX)

References

  1. Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R. Statistical practice in high-throughput screening data analysis. Nat Biotechnol. 2006;24(2):167–75. pmid:16465162.
  2. Bray MA, Singh S, Han H, Davis CT, Borgeson B, Hartland C, et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat Protoc. 2016;11(9):1757–74. Epub 20160825. pmid:27560178; PubMed Central PMCID: PMC5223290.
  3. Mahalanobis PC. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta). 1936;2:49–55.
  4. Wiedenheft B, Sternberg SH, Doudna JA. RNA-guided genetic silencing systems in bacteria and archaea. Nature. 2012;482(7385):331–8. pmid:22337052.
  5. Makarenkov V, Zentilli P, Kevorkov D, Gagarin A, Malo N, Nadon R. An efficient method for the detection and elimination of systematic error in high-throughput screening. Bioinformatics. 2007;23(13):1648–57. Epub 20070426. pmid:17463024.
  6. Goedken ER, Devanarayan V, Harris CM, Dowding LA, Jakway JP, Voss JW, et al. Minimum significant ratio of selectivity ratios (MSRSR) and confidence in ratio of selectivity ratios (CRSR): quantitative measures for selectivity ratios obtained by screening assays. J Biomol Screen. 2012;17(7):857–67. Epub 20120514. pmid:22584786.
  7. Landqvist C, Middleton B, Jones B, O'Donnell C. AstraZeneca's Novel Approach To Monitor Primary DMPK Assay Performance. Drug Discovery World (Fall 2014). Available from: https://www.ddw-online.com/astrazenecas-novel-approach-to-monitor-primary-dmpk-assay-performance-1508-201410/.
  8. PhRMA CMC Statistics and Stability Expert Team. Identification of out-of-trend stability results, a review of the potential regulatory issue and various approaches. 2003;27(4):38–52.
  9. Yu B, Zeng L, Ren P, Yang H. A Unified Framework for Detecting Out-of-Trend Results in Stability Studies. Statistics in Biopharmaceutical Research. 2018;10(4):237–43.
  10. Hadi AS, Simonoff JS. Procedures for the Identification of Multiple Outliers in Linear Models. Journal of the American Statistical Association. 1993;88(424):1264–72.
  11. Brideau C, Gunter B, Pikounis B, Liaw A. Improved Statistical Methods for Hit Selection in High-Throughput Screening. Journal of Biomolecular Screening. 2003;8(6):634–47. pmid:14711389.
  12. Sondag P, Zeng L, Yu B, Yang H, Novick S. Comparisons of outlier tests for potency bioassays. Pharmaceutical Statistics. 2020;19(3):230–42. pmid:31762118.
  13. Wilcox R. Chapter 10: Robust Regression. In: Wilcox R, editor. Introduction to Robust Estimation and Hypothesis Testing. 3rd ed. Boston: Academic Press; 2012. p. 471–532.
  14. Hampel FR. The Breakdown Points of the Mean Combined with Some Rejection Rules. Technometrics. 1985;27(2):95–107.
  15. Wilcox R. Chapter 3: Estimating Measures of Location and Scale. In: Wilcox R, editor. Introduction to Robust Estimation and Hypothesis Testing. 3rd ed. Boston: Academic Press; 2012. p. 43–101.
  16. Huber PJ. Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc.; 1981. p. 43–153.
  17. Motulsky HJ, Brown RE. Detecting outliers when fitting data with nonlinear regression: a new method based on robust nonlinear regression and the false discovery rate. BMC Bioinformatics. 2006;7:123. Epub 20060309. pmid:16526949; PubMed Central PMCID: PMC1472692.
  18. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):289–300.
  19. Johnson RA, Wichern DW. Chapter 5: Inferences about a mean vector. In: Applied Multivariate Statistical Analysis. 6th ed. Prentice-Hall; 2007. p. 210–72.
  20. Rousseeuw PJ. Least Median of Squares Regression. Journal of the American Statistical Association. 1984;79(388):871–80.
  21. Rousseeuw PJ. Multivariate Estimation With High Breakdown Point. Mathematical Statistics and Applications Vol B. 1985:283–97.
  22. Rousseeuw PJ, Driessen KV. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics. 1999;41(3):212–23.
  23. Davies PL. Asymptotic Behaviour of S-Estimates of Multivariate Location Parameters and Dispersion Matrices. The Annals of Statistics. 1987;15(3):1269–92.
  24. Tatsuoka KS, Tyler DE. On the Uniqueness of S-Functionals and M-Functionals under Nonelliptical Distributions. The Annals of Statistics. 2000;28(4):1219–43.
  25. Maronna RA, Zamar RH. Robust Estimates of Location and Dispersion for High-Dimensional Datasets. Technometrics. 2002;44(4):307–17.
  26. Cabana E, Lillo RE, Laniado H. Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. Statistical Papers. 2021;62(4):1583–609.
  27. Adrover J, Yohai V. Projection estimates of multivariate location. The Annals of Statistics. 2002;30(6):1760–81.
  28. Hubert M, Rousseeuw PJ, Branden KV. ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics. 2005;47:64–79.
  29. Croux C, Ruiz-Gazen A. High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis. 2005;95(1):206–26.
  30. Croux C, Filzmoser P, Oliveira MR. Algorithms for Projection-Pursuit Robust Principal Component Analysis. Econometrics eJournal. 2007.
  31. Filzmoser P, Maronna R, Werner M. Outlier identification in high dimensions. Computational Statistics & Data Analysis. 2008;52(3):1694–711. https://doi.org/10.1016/j.csda.2007.05.018.
  32. Todorov V, Filzmoser P. An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software. 2009;32(3):1–47.
  33. van der Maaten L, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9(86):2579–605.
  34. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2020.
  35. Bach FR, Jordan MI. Kernel independent component analysis. Journal of Machine Learning Research. 2003;3:1–48.
  36. Schölkopf B, Smola A, Müller K-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation. 1998;10(5):1299–319.
  37. Xue J, Zhang B, Qiang Q. Local Linear Embedding with Adaptive Neighbors. Pattern Recognition. 2023;136:109205.
  38. Vishwakarma GK, Paul C, Elsawah AM. A hybrid feedforward neural network algorithm for detecting outliers in non-stationary multivariate time series. Expert Systems with Applications. 2021;184:115545. https://doi.org/10.1016/j.eswa.2021.115545.
  39. Vishwakarma GK, Paul C, Hadi AS, Elsawah AM. An automated robust algorithm for clustering multivariate data. Journal of Computational and Applied Mathematics. 2023;429:115219.
  40. Touny HM, Moussa AS, Hadi AS. Scalable fuzzy multivariate outliers identification towards big data applications. Applied Soft Computing. 2024;155:111444. https://doi.org/10.1016/j.asoc.2024.111444.
  41. Caicedo JC, Cooper S, Heigwer F, Warchal S, Qiu P, Molnar C, et al. Data-analysis strategies for image-based cell profiling. Nature Methods. 2017;14(9):849–63. pmid:28858338.
  42. Jolliffe IT, Cadima J. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2016;374(2065):20150202. pmid:26953178.
  43. Nelder JA, Mead R. A Simplex Method for Function Minimization. The Computer Journal. 1965;7(4):308–13.
  44. Ruszczyński A. Nonlinear Optimization. Princeton, N.J.: Princeton University Press; 2006.
  45. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
  46. Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul. IEEE Computer Society.
  47. Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology. 2006;7(10):R100. pmid:17076895.
  48. Kingma DP, Welling M. An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning. 2019;12(4):307–92.
  49. Kingma DP, Welling M. Auto-Encoding Variational Bayes. 2022.
  50. Zhang XD. A new method with flexible and balanced control of false negatives and false positives for hit selection in RNA interference high-throughput screening assays. J Biomol Screen. 2007;12(5):645–55. Epub 20070521. pmid:17517904.
  51. Mpindi JP, Swapnil P, Dmitrii B, Jani S, Saeed K, Wennerberg K, et al. Impact of normalization methods on high-throughput screening data with high hit rates and drug testing with dose-response data. Bioinformatics. 2015;31(23):3815–21. Epub 20150807. pmid:26254433; PubMed Central PMCID: PMC4653387.