Classification of estrogenic compounds by coupling high content analysis and machine learning algorithms

Environmental toxicants affect human health in various ways. Of the thousands of chemicals present in the environment, those with adverse effects on the endocrine system are referred to as endocrine-disrupting chemicals (EDCs). Here, we focused on a subclass of EDCs that impacts the estrogen receptor (ER), a pivotal transcriptional regulator in health and disease. Estrogenic activity of compounds can be measured by many in vitro or cell-based high throughput assays that record various endpoints from large pools of cells, and increasingly at the single-cell level. To simultaneously capture multiple mechanistic ER endpoints in individual cells that are affected by EDCs, we previously developed a sensitive high throughput/high content imaging assay that is based upon a stable cell line harboring a visible multicopy ER responsive transcription unit and expressing a green fluorescent protein (GFP) fusion of ER. High content analysis generates voluminous multiplex data comprised of minable features that describe numerous mechanistic endpoints. In this study, we present a machine learning pipeline for rapid, accurate, and sensitive assessment of the endocrine-disrupting potential of benchmark chemicals based on data generated from high content analysis. The multidimensional imaging data was used to train a classification model to ultimately predict the impact of unknown compounds on the ER, either as agonists or antagonists. To this end, both linear logistic regression and nonlinear Random Forest classifiers were benchmarked and evaluated for predicting the estrogenic activity of unknown compounds. Furthermore, through feature selection, data visualization, and model discrimination, the most informative features were identified for the classification of ER agonists/antagonists. The results of this data-driven study showed that highly accurate and generalized classification models with a minimum number of features can be constructed without loss of generality, where these machine learning models serve as a means for rapid mechanistic/phenotypic evaluation of the estrogenic potential of many chemicals.

Introduction Characterization and prediction of the endocrine disruptive potential of complex chemical mixtures are essential to prevent their adverse effects on human health while understanding the biological pathways that lead to such undesirable health outcomes [1]. A key target of endocrine-disrupting chemicals (EDCs) is the Estrogen Receptor (ER), a modulator of important physiological and pathological states, including reproduction, metabolism, hormone-sensitive cancers, and obesity. There are many natural and man-made compounds that are capable of binding to the ER interfering with its activity, either as agonists, which activate a biological response (i.e., genistein, bisphenol A); or as antagonists, which generally compete with the endogenous hormones (i.e., 17β-Estradiol (E2)) to suppress the receptor function (i. e., 4-hydroxytamoxifen, fulvestrant). Mechanistically, E2 activates the ER pathway cascade through enabling a specific ER conformational change, receptor dimerization, DNA binding to regulatory elements in the genome, coregulator recruitment, and gene transcription activation/repression [2][3][4].
The estrogenic potential of different chemicals can be measured using cell-based or cellfree in vitro assays by recording several facets of the ER mechanism of action (i.e., ligand binding, cell proliferation, gene expression, etc.) [5][6][7]. Previously, a high content/high throughput microscopy-based assay in HeLa cells, engineered to harbor a visible multicopy integration of the ER response element present within the prolactin promoter/enhancer, was developed to capture several mechanistic steps of the ER pathway by imaging [5,[7][8][9][10]. Coupled with stable expression of GFP-ERα, this high content analysis-based approach allows the direct visualization of ligand-induced DNA binding by ER as a visually distinct nuclear spot/array when using standard fluorescent microscopy. Importantly, we observed differences in nuclear spot size and intensity when cells were treated with ER agonists and antagonists, which could potentially facilitate the characterization of ligands based upon their effect on ER activity when compared to known agonists and antagonists.
Furthermore, recent efforts have also focused on coupling high throughput experimentation with computational methods for enabling the rapid diagnosis of the estrogenic potential of various chemicals via in silico predictions [6,[11][12][13][14][15][16]. Judson et al. [6] used a linear model to predict the estrogenic activity of 1812 commercial and environmental chemicals based on the activity patterns across in vitro assays. The accuracy of this linear model is further tested by Browne et al. [13] for evaluating the ER agonist bioactivity, in which the authors postulated an integrated methodology to discriminate bioactivity from assay-specific interference. Similarly, Kleinstreuer et al. [15] used high throughput screening data of 1855 chemicals along with a linear additive model to predict the Androgen Receptor (AR) activity. Furthermore, Li and Gramatica [14] used AR data to develop quantitative structure-activity relationship (QSAR) models to classify binders as AR agonist or antagonist. The authors also investigated the performance of 4 different classification models, namely k-nearest neighbors (kNN), local lazy method (lazy IB1), alternating decision tree (ADTree), and an integrated consensus model [14]. In another study by Chierici et al. [16], deep learning and support vector machine (SVM) models were developed using the Collaborative Estrogen Receptor Activity Prediction Project (CERAPP) ToxCast data set for predicting the effects of EDCs on ER binding activity. A further detailed overview of in silico toxicity predictions using machine learning algorithms is provided in the notable review by Idakwo et al. [17].
Different from the aforementioned studies, we present here an integrated, data-driven framework for characterizing the endocrine-disrupting potential of chemicals that affect ER functions. In this framework, the high throughput/high content image analysis data, which provides hundreds of intensity and geometry-based features per cell, are used to generate classification models for promptly detecting the endocrine disruptive potential of unknown compounds as ER agonists or antagonists. We benchmark our approach using a group of control chemicals and present a systematic computational approach for predicting the estrogenic potential of unknown chemicals. Furthermore, by incorporating feature selection steps in this framework, we identify the most informative image-based features that enable a highly accurate separation between an ER agonist and antagonist without the loss of generality.

Results and discussion
During and after environmental emergencies (i.e., hurricanes), humans are exposed to a number of chemicals, which in return creates an urgent need for the precise identification of their estrogenic potentials using rapid assessment techniques. Towards this goal, we collected 18 biologically independent experiments each containing 4 technical replicates of 45 benchmark compounds and 3 control treatments used at a single concentration with resulting high content analysis generating 7680 data points (192 observations x 40 features) per experiment. We aim to construct robust, generalized data-driven models that can accurately predict the endocrine-disrupting potential of unknown compounds from a limited number of experimental observations. To this end, one experimental data set is randomly selected for constructing the classification models among the 18 repeated image analysis experiments.
The selected data set is first pre-processed to extract the active subset of the 45 benchmark chemicals, and the uncorrelated features are identified using the computational methodology described in section Materials and Methods. A total of 64 observations were removed from each data set due to a failure to detect adequate visible DNA binding to characterize the compound. These observations were associated with compounds known to be inactive or very weak estrogens. Then, the clean data is split into training and test sets. Although it is common to split the data set using 80-20 or 70-30 rules (i.e., 80% training-20% testing), the experimental analysis on the 32 active benchmark chemicals yields an unbalanced data set due to the limited number antagonist versus agonist compounds. Hence, we constructed our training data set using all technical replicates of the 4 antagonist compounds and the 5 randomly selected agonist compounds with varying potency (Table 1) such that the classification models are trained on data where the distinct characteristics of the two classes of estrogenic activity are learned precisely. The remaining 23 agonist compounds and their respective 4 technical replicates in the chosen experimental data are reserved as the test set, enabling the fair assessment of the classification accuracy and other performance metrics of the model. As a result, the final training data matrix size with 9 chemicals becomes 36 observations x 5 features, the final testing data matrix size with 23 chemicals becomes 92 observations x 5 features of the selected image analysis experiment. The distribution of this training and testing data is also visualized through Principal Component Analysis (PCA) where a biplot of the first two components is provided in S1 Fig. Besides, the other 17 biologically independent image analysis experiments of the same list of benchmark chemicals are used for quantifying the classification model performance of estrogenic potential of chemicals subject to experimental noise.

Linear classification results
The five biologically relevant features that were identified (see Materials and Methods for feature selection and descriptions) are used to construct individual linear classification models with a single descriptor. The best performing model is then selected out of these five logistic regression classifiers based on their Akaike Information Criteria (AIC) value, 5-fold training cross-validation (CV) accuracy, and testing accuracy. The results of linear classification training with the logistic regression model are provided in Table 2.
The results show that a logistic regression model with a single image analysis feature can accurately map the separation between agonist and antagonist compounds in the training phase. Specifically, we observe that linear classifiers trained with "Array PI Variance," "Array Furthermore, the results show that "Array Area" and "Array Total PI" features have a negative effect on the linear classifier whereas the rest of the features have a positive effect. Specifically, values of the β 1 parameter for "Array PI Variance" and "Array to Nucleoplasm Intensity Ratio" are the highest, respectively, indicating that a compound with higher values of these two features has an increased probability of being an antagonist. In addition, among these two most prominent features for the linear classification of estrogenic potentials of unknown chemicals, we observe that the model parameters of "Array to Nucleoplasm Intensity Ratio" and "Array PI Variance" have a relatively wider range of 95% confidence intervals.
Finally, the testing accuracy of trained models is evaluated using all technical replicates of the remaining 23 active compounds in this experiment. The testing accuracy results show that although "Array PI Variance" has a larger weight in the linear classifier compared to the rest of the descriptors, "Array to Nucleoplasm Intensity Ratio" has a higher testing accuracy for predicting the class information of the unseen chemicals. Table 2 shows that the linear classifier with "Array to Nucleoplasm Intensity Ratio" has a testing accuracy of 96% where this number drops to 87% when "Array PI Variance" is used as the sole predictor in the linear model. As a result, both predictors can perfectly map the separating linear boundary between the agonistic and antagonistic behaviors of chemicals in the training phase, whereas the linear model with "Array to Nucleoplasm Intensity Ratio" has a superior testing performance with a higher potential for achieving generality.

Nonlinear classification results
The nonlinear classification analysis results are summarized in Table 3 where it shows the ranking of the experimental features based on the mean decrease in the Gini index score. The mean decrease in Gini index score is a measure of how strong a feature is for separating different classes of information, where prominent features lead to a larger decrease in this index. The results of the Random Forest (RF) model indicate that "Array to Nucleoplasm Intensity Ratio" is the top informative feature followed by "Array Area" and "Array PI Variance." The mean decrease in Gini index for these 3 descriptors are very close to each other, showing that they are equally important for modeling the estrogenic potential of chemicals. The nonlinear classification results are consistent with the linear model, where these 3 features had 100% training CV accuracy and minimum AIC. Through careful consideration of the model parameters and the testing accuracy in linear models, we were able to distinguish "Array to Nucleoplasm Intensity Ratio" and "Array PI Variance" as the top two informative features for linear

PLOS COMPUTATIONAL BIOLOGY
classification of agonist and antagonist compounds. Different than the linear analysis, we observe that the "Array Area" is the second most important feature for the nonlinear classification of estrogenic compounds whereas in the linear model the second-best feature was identified as "Array PI Variance." Moreover, the model performance assessment with the training and testing data showed 100% and 93% classification accuracy, respectively. This high performance on the training data is expected as the model has learned the patterns within this set with high precision. The high testing accuracy of this model, on the other hand, shows that RF retains its predictive capability over a set of compounds that the model has not seen. As these initial tests show satisfactory results, further characterization of the model performance over biologically independent experiments is provided in the following section.

Visualization and evaluation of agonist and antagonist distributions
In addition to the classification model development and using their mathematical properties to extract valuable information on the experimental features, additional insights on the separation between agonist and antagonist compounds are obtained through exploratory data analytics. To this end, we have plotted the density distributions of agonist/antagonist compounds for all independent experiments using the top important features identified by both linear and nonlinear classification analysis, namely the "Array to Nucleoplasm Intensity Ratio" and "Array PI Variance." The density plots are provided in Figs 1 and 2 where the separation between agonistic and antagonistic behaviors of the chemicals, based on the values of the aforementioned descriptors, are visualized. The results in Figs 1 and 2 show that the "Array PI Variance" and "Array to Nucleoplasm Intensity Ratio" lead to a clear distinction between an agonist and antagonist for all experiments. To clearly distinguish between these two prominent features, the separation between the agonist and antagonist density distributions are quantified by calculating the Hellinger Distance (HD). This metric provides a measure of the distance between probability distributions and takes the values between 0 and 1, where smaller HD indicates that the two distributions are similar and the separation between them through the use of this feature is not statistically significant. In Fig 1, the results show that the HD between the density distributions of agonist and antagonist compounds based on the "Array to Nucleoplasm Intensity Ratio" is high (min = 0.61, max = 0.90). Specifically, for independent experiments 3, 5, 9, 11, 15, and 17 there is a clear separation between agonistic and antagonistic behavior based on this descriptor, hence a linear classifier enables a highly accurate separation between these two estrogenic potential classes. However, in experiments 4, 10, and 16, a portion of the density distributions of these two estrogenic activities overlap and may lead to misclassification of compounds if the normalized "Array to Nucleoplasm Intensity Ratio" value of an agonist/antagonist falls into this overlapping region. Overall, we observe that ER antagonists possess strong signals for this experimental feature, thus enabling the separation of these two classes via a linear logistic regression model.
Similarly, in Fig 2, we observe that the HD distance of "Array PI Variance" measurements for all independent experiments is high (min = 0.66, max = 0.85), indicating that this feature is a valid descriptor for agonist versus antagonist separation. Different from Fig 1, here we observe that ER agonists have strong signals for this experimental feature in 12 out of 18 experiments. Furthermore, although the range of HD values are similar for both features, the number of experiments with HD > 0.8 is higher for "Array to Nucleoplasm Intensity Ratio" compared to "Array PI Variance." This result indicates that "Array to Nucleoplasm Intensity Ratio" is more favorable for classification model building as this feature provides a clear distinction between ER agonists and antagonists over multiple biologically independent experiments. Furthermore, we perform unpaired two-sample Wilcoxon test (i.e., Wilcoxon rank-sum test), a non-parametric alternative to the unpaired two-sample t-test, on the training data to compare the distributions of agonist and antagonist compounds per experimental feature. The Wilcoxon rank-sum test results in Table 4 show that the median measurements of agonist and antagonist compounds for "Array to Nucleoplasm Intensity Ratio," "Array Area," "Array PI Variance," and "Array Mean PI" are different with p-values equal to zero. This statistically significant difference between agonist and antagonist distributions further strengthens the evidence that these features are appropriate for modeling the separation between two-classes of ER activity. In the case of "Array Total PI", we fail to reject the null hypothesis, indicating that the distributions of agonist and antagonist compounds do not possess a statistically significant difference for this feature. Hence, the separation of the agonist and antagonist compounds with this descriptor is less desirable and will lead to misclassifications with worse predictive capability. This outcome is also consistent with linear and nonlinear classification results where this feature ranked the worst in training and testing accuracy (Table 2), as well as in feature importance (Table 3), respectively.

Classification model performance over independent experiments
In addition to the HD calculations and the Wilcoxon rank-sum test, the predictive capabilities of the trained and tested logistic regression (linear) and RF (nonlinear) classifiers are validated with a set of new 17 biologically independent experiments, and their corresponding predictive performance is quantified with a two-fold approach. First, the model performance is quantified using the 4 technical replicates of 23 unseen agonist compounds in 17 independent experiments that the model has not been trained on (Fig 3).
The blind validation accuracy results in Fig 3 shows that the predictive performance of the logistic regression model with "Array PI Variance" feature is inferior to the logistic regression model with "Array to Nucleoplasm Intensity Ratio" and the RF classifier. In 4 out of 17 experiments, the validation accuracy of this model is below 80% whereas, for the logistic regression model with "Array to Nucleoplasm Intensity Ratio" and the RF classifier, the validation accuracies exceed 90% for all experiments. Furthermore, the predictive capability of the latter two models is comparable to each other, where only in experiments 1 and 6, a relatively high difference between the validation accuracies of these two models is observed. In experiments 1 and 6, the validation accuracy of the RF model is 95% and 97%, respectively, whereas for the logistic regression model with "Array to Nucleoplasm Intensity Ratio" has a validation accuracy of 91% for both replicate testing. The lowest prediction accuracy for these two models is 95% for the RF classifier and 91% for the logistic regression model with "Array to Nucleoplasm Intensity Ratio" whereas this number drops to 75% for the logistic regression model with "Array PI Variance." Moreover, the 95% CI of the validation accuracy is also provided in Fig 3. The RF and logistic regression models using "Array to Nucleoplasm Intensity Ratio" have tighter CI around the validation accuracy whereas the other logistic regression model has a wider CI on the prediction accuracy for all independent experiments. Overall, the blind validation accuracy results indicate that the nonlinear RF and linear logistic regression model with "Array to Nucleoplasm Intensity Ratio" are more favorable for predicting the estrogenic potential of unknown compounds as they have a more robust performance and can sustain their predictive capabilities over multiple biologically independent experiments.

PLOS COMPUTATIONAL BIOLOGY
Second, in Fig 4, accuracy, sensitivity, specificity, and balanced accuracy of different classification models are reported for these 17 experiments, but with all 32 active compounds. These results indicate that all three classification models have high accuracy and sensitivity for predicting the estrogenic potential of all active compounds considered in this study. Specifically, all models predict > 90% accuracy in 11 out of 17 experiments. Moreover, we observe that specificity and balanced accuracy of the logistic regression model with "Array PI Variance" is higher overall, when compared to other models, whereas the sensitivity of the RF classifier and logistic regression model with "Array to Nucleoplasm Intensity Ratio" is higher across  PLOS COMPUTATIONAL BIOLOGY different replicates. For the latter two models, we observe that the specificity value is zero for 5 experiments, indicating that these models identified all compounds as agonists and failed to classify the 4 antagonist compounds correctly. As a result, their balanced accuracy is also lower (50%), as this performance metric averages specificity and sensitivity values.

Data quality monitoring
This biphasic performance of each model on the identification of antagonist compounds suggests an underlying property of the experimental data sets that determines model performance.
An exhaustive analysis of the correlation between data set features and model performance identified cell density (number of cells per microscopic image) as having a strong negative correlation (-0.54 to -0.71) with model balanced accuracy. Cell density varies by 12.6% across independent experiments (Fig 5A). Using a threshold of 255 cells/well to divide them into 'Low-Density' and 'High-Density' experiments, we observe superior performance by all three models in the 'Low-Density' replicates ( Fig 5B) with average balanced accuracy of 91% for the logistic regression model with "Array PI Variance" predictor and 95% for the other models, respectively. Reassessing the model performance metrics with only low-density experiments reveals that the RF and the logistic regression model with the "Array to Nucleoplasm Intensity Ratio" predictor are highly accurate and precise in predicting both classes of information ( Fig  6). This result is not surprising based on the technical limitations of the assay and features used by the model. Increasing cell density increases the likelihood of any individual cell in a field being slightly out of focus. Since both "Array PI Variance" and "Array to Nucleoplasm Intensity Ratio" are contrast based features, they are dependent on focus quality. While the original optimization of cell density was based simply on the ability to detect the presence of a nuclear spot, slightly lower cell densities may be required to produce higher quality data required for high classification performance. Overall, the nonlinear RF classification model and the linear logistic regression model with "Array to Nucleoplasm Intensity Ratio" are found to be highly accurate and robust for predicting the endocrine-disrupting potential of compounds on the ER. These holistic classification models allow us to predict the ER activity of compounds without running in separate agonist or antagonist modes and are shown to possess comparable classification performances with the previously reported values [6,13,16]. In addition, among 40 different experimental features studied in this work, "Array to Nucleoplasm Intensity Ratio" is found to be the top informative feature through a series of supervised and unsupervised analyses, and the results indicate that it is essential for predicting the ER activity of compounds through generalized predictive models.
As this study benchmarked the predictive capabilities of our hybrid data-driven classification models with satisfactory results, this methodology can be further used to predict the endocrine disruptive potential of novel chemicals. This prediction will require the high throughput high content image analysis experiment to be conducted that meets the data quality monitoring criteria (i.e., low-density experiment), where a measurement for the 5 selected features will be recorded. For making predictions with the logistic regression model, we have already identified that this methodology provided superior predictions with the "Array to Nucleoplasm Intensity Ratio" feature. By using the model parameters that are identified as a result of this work and the pre-processed experimental measurement corresponding to "Array to Nucleoplasm Intensity Ratio", the probability of this chemical being an antagonist is calculated by evaluating Eq 2 in Materials and Methods. If this probability is less than 0.5, the compound will be classified as an agonist, else as an antagonist. In the case of making predictions with the RF model, the pre-processed measurements of all 5 experimental features will be used to predict an outcome. As the RF is a decision tree-based model, an outcome will be calculated from all the trees and the majority vote is reported as the final class of the novel chemical. Hence, the established models can rapidly identify the endocrine disruptor potential of chemicals by function evaluation and subsequently classify them as ER agonists or antagonists. This prompt evaluation provides key information regarding the potential health effects of unknown chemicals and enables decision makers to take mitigating actions during exposures.

Benchmark chemicals
Forty-five chemicals (Table 5) with varying estrogenic potentials were obtained from the United States Environmental Protection Agency (EPA) and were utilized for benchmarking our data-driven framework. The same compounds have been used by NIEHS/EPA as a set for developing computational models of the ER pathway [6].

Experimental data generation
High throughput microscopy and high content analysis-based experiments were performed using the GFP-ERα:PRL-HeLa cell line model following the experimental methodology described previously [5,[7][8][9][10]. This cell model allows for the direct simultaneous visualization of compound-dependent effects on several aspects of the ER signaling pathway, including ER expression, nuclear translocation, chromatin binding, and chromatin remodeling (Fig 7). Importantly, we have previously observed qualitative and quantitative differences in ER pathway signaling endpoints in the cell model-based compound activity class (Fig 7B and 7C) [5,[7][8][9][10]. 384 multiwell plates seeded with cells were treated for 2 hrs with a single concentration (10 uM) of 45 reference compounds provided by the EPA along with 3 control treatments. Control treatments included externally sourced agonist 17β-estradiol (E2, 20 nM), antagonist 4-hydroxytamoxifen (4HT, 20 nM), and negative control (DMSO, 0.5%). The experimental design included 4 technical replicates of each treatment, resulting in 192 different sample observations. Samples were imaged using an automated epifluorescent image cytometer and analyzed using myImageAnalysis/Pipeline Pilot to generate 40 different descriptors per cell  Table 5. Summary of benchmark chemicals analyzed in this work. The ER activity information is adapted from Judson et al. [6].

CASRN
Compound Name ER Activity [6] Potency [6] 140-66-9 PLOS COMPUTATIONAL BIOLOGY [18]. The single-cell descriptors capture GFP-ERα fluorescence intensity (i.e., pixel intensity-PI) and morphology features of each cell, nucleus, and PRL array. The single-cell population is filtered to remove artifacts generated from cell toxicity, cell clusters, and incorrect segmentation. The remaining cell population (minimum of 500 cells per sample) data is averaged per sample to yield a data matrix size of 192 observations x 40 features, where the categorical output information for classification is provided in Table 5 in "ER Activity" column. A full list of experimental features is provided in [10].

Computational methodology
The computational methodology follows a similar approach described in [19] where key steps of the framework are summarized in Fig 8. First, a series of pre-processing steps are executed to ensure accurate in silico predictions of ER activity with classification models. Once the preprocessing is completed, the data set is then passed on to the feature selection phase, where collinear features are eliminated from the analysis using hierarchical clustering. Later, a two-class classification problem is formulated using a subset of the features that are identified as independent and biologically relevant in the previous step. Finally, model validation is performed, and the predictive capability of the resulting classification model is quantified using model performance metrics. A detailed description of each step is provided below. Data pre-processing. The pre-processing steps used in this analysis are: (1) missing data handling, (2) data cleaning, (3) outlier detection via unsupervised analysis, and (4) data normalization. The experimental data is first analyzed for missing data entries. If any missing data is detected, several procedures can be followed including, deletion of the entire row, deletion of the entire column, or data imputation [20]. As the experimental data from the image analysis did not have any missing points, no action is taken at this step, and the data matrix size of 192 observations x 40 features are retained.
In the next pre-processing step, the data set is cleaned by removing the observations corresponding to inactive compounds (Table 5). After this cleaning step, the data matrix size is reduced to 32 active compounds with 4 technical replicates per compound (128 observations) x 40 features. For outlier detection, the replicate observations of each compound are averaged, yielding a data matrix size of 32 average observations x 40 features. Hierarchical clustering is

PLOS COMPUTATIONAL BIOLOGY
performed on the Euclidean distance-based dissimilarity matrix of this aggregate data with complete linkage. The clustering analysis is visualized using a dendrogram tree as shown in Fig 9. The results of the clustering analysis indicate that there are no global outliers present in the data set as none of the compounds significantly differ from each other. We observe that the active compounds are generally clustered under two groups based on their feature-specific patterns and are not presenting themselves on a separate branch at the root node of the dendrogram tree. As a result, the imaging data for all 32 compounds are viable for further analysis. The clustering is performed in R (version 4.0.0) using the hclust function under the stats library.
In the final pre-processing step, the remaining 32 active compounds are normalized using column-wise mean absolute deviation with respect to the control agonist E2 (Eq 1). The normalization is performed on the complete cleaned data set with technical replicates (Data PLOS COMPUTATIONAL BIOLOGY matrix size: 128 observations x 40 features). In Eq 1, i represents the rows in the data set (i.e., technical replicates of each compound) and j represents the columns in the data set (i.e., features).

ER Data normal
Feature selection. The ultimate goal of this work is to present a data-driven methodology that integrates high content, high throughput image analysis-based data with machine learning algorithms for developing a robust, generalized classification model that accurately predicts the estrogenic potential of unknown chemicals. Within the scope of this work, as the image analysis provides numerous fluorescence intensity and morphology features, several challenges come to rise in classification model development: (1) Only a subset of experimental features may provide valuable knowledge for the separation of agonist/antagonist ER activity and identification of those is a challenging task; (2) A subset of the features may be highly correlated, and may cause bias, leading to loss of generality, precision and accuracy in the predictive capability of the data-driven model; (3) Modeling with a high number of features without an adequate amount of samples may lead to overfitting.
In this work, we address the aforementioned challenges by incorporating a feature selection step in our data-driven modeling framework. Feature selection or variable selection is one of the key processes in machine learning model building, where the aim is to identify a subset of features among many others that are uncorrelated and the most informative set of descriptors, for a given data-driven modeling problem. There is a growing interest within various fields of engineering and sciences for developing computationally efficient feature selection algorithms that enable the identification of the minimum number of features for maximum predictive capabilities in data-driven models [21][22][23][24][25]. Here, the feature selection is done in two steps: (1) Through hierarchical clustering for identifying the groups of similar and correlated features, and (2) Through a heuristic feature selection step, in which a single feature is selected from each cluster based on the ER pathway model presented in [6].
In step 1, hierarchical clustering is performed on the pairwise similarity of experimental features, calculated using the Pearson correlation, with complete linkage. From the clusters of correlated features, groups that possess less than 5% similarity are identified as unique and uncorrelated for classification analysis. The clustering outcome is shown in Fig 10 with 20 independent feature groups of which we can select a subset of these for analysis. Like in the outlier analysis, the clustering for feature selection is performed in R (version 4.0.0) using the hclust function under the stats library.
In step 2, the goal is to further reduce the number of features for the classification analysis such that they are: (1) Selected from the independent groups of features following the clustering analysis (Fig 10); and, (2) the selected features are biologically relevant. The biological relevance of features is assessed through cross-referencing the image-based features to the ER pathway nodes presented in Judson et al. [6]. So, from 20 independent groups of features, the top 5 biologically relevant features (one shape and four PI-related descriptors) that are closely associated with a node on the ER signaling pathway, are selected. A summary of these features along with their descriptions are provided in Table 6 as well as in Fig 7D. This selection reduces the initial 40 image-derived features to 5, yielding a final data matrix size of 128 observations x 5 features that are passed on to the model development stage of the presented framework. The final data matrix is also visualized using PCA, where a biplot of the first two principal components is presented in S3 Fig.

Classification model development using logistic regression and Random Forest classifier
Once the feature selection step is completed, the clean normalized data set is split into training and testing sets, and the training data are passed on to the model building phase for supervised  PLOS COMPUTATIONAL BIOLOGY analysis. In this work, the training set is a balanced subset of the active compounds that contain both agonist and antagonist chemicals and their corresponding 4 technical replicates, whereas the testing set is the remaining unseen active compounds with their corresponding 4 technical replicates, not used in the training phase. Supervised learning algorithms are widely studied in many fields of engineering and sciences primarily in classification and regression-type problems for predicting either a categorical output or a continuous output, respectively [24,[26][27][28][29][30][31][32]. Classification is the problem of finding the categorical output of a new observation and distinguishing between different classes of information via statistical recognition of patterns in a training data set. In this study, we develop classification models to predict the endocrine disruptor activity of a set of benchmark chemicals. In this effort, both linear and nonlinear models are tested and their predictive performance on unknown chemicals is shown for comparison.
Linear classification is performed using the logistic regression model and the variables are selected using the Akaike Information Criterion (AIC). The logistic regression model with one predictor is provided in Eq 2, where x is the value of a predictor, P(antagonist) is the probability that the outcome is an "antagonist", and β 1 and β o are the parameters of the linear model where their values are estimated using the training data. The goal of the logistic regression training stage is twofold: (1) To create a highly accurate and precise linear separating boundary between ER agonist and antagonist compounds; and, (2) to identify the most descriptive feature out of the 5 selected in the Feature selection step such that the in silico distinction between an ER agonist and antagonist is achieved without loss of generality. To this end, an exhaustive search is performed where individual logistic regression models, of the form presented in Eq 2, are constructed for all possible combinations of single features. The best performing model in this training phase with the minimum AIC, the highest CV training accuracy, and the highest testing accuracy is selected. In addition, the most informative feature for the linear classification problem is identified through analyzing the β 1 parameter given that exp (−β 1 ) quantifies the increase in the odds of a compound being an antagonist. As a result, an important feature with a larger weight in the closed-form equation will have a larger impact on the classification predictions.
For nonlinear classification, we use the RF algorithm with built-in feature ranking. RF is a nonparametric, tree-based ensemble learning method that uses multiple decision trees, independently constructed with a bootstrap sample of training data, to predict an outcome based on a majority vote [33]. The algorithm can identify "strong features" that causes a larger mean decrease in accuracy and display the relevance of features used in the training stage via the "Gini index" score [33,34]. In our implementation, RF classifiers are constructed with 130 decision trees on the training data. The data-driven models for linear and nonlinear classification are implemented in R (version 4.0.0) using the glm function in stats library, and random-Forest function in randomForest library, respectively.
Model validation and performance metrics. Model validation is done with a two-fold approach: (1) Using the technical replicates of unseen compounds that the model has not been trained on with model performance metrics quantified over independent experiments; and, (2) Using all technical replicates of active compounds in Table 5 with model performance metrics quantified over independent experiments. This two-fold approach allows us to quantify and observe the robustness of the unbiased predictive performance of the trained classification model over multiple biologically independent experiments and enable us to diagnose potential issues with the experimental data set (i.e., data quality monitoring).
The classification model performances are assessed using several evaluation metrics. These include accuracy, sensitivity, specificity, and balanced accuracy. Accuracy is defined as TPþTN TPþFPþTNþFN , sensitivity (i.e., true positive rate) is defined as TP TPþFN , specificity is defined as TN TNþFP , and the balanced accuracy is defined as the average of sensitivity and specificity, 1 this study, a "true positive" (TP) is defined as an agonist being correctly identified as an agonist and a "false positive" (FP) is defined as an agonist being misclassified as an antagonist. On the contrary, a "true negative" (TN) is defined as an antagonist being correctly classified as an antagonist and a "false negative" (FN) is defined as an antagonist being misclassified as an agonist.

Conclusions
We have developed an integrated data-driven framework that enables the rapid identification of unknown pure chemicals that affect the estrogen receptor (ER) pathway as either agonists or antagonists. High throughput microscopy and high content analysis-based data are utilized to formulate highly accurate classification models by following a series of pre-processing, visualization, unsupervised, and supervised analysis steps. The framework is benchmarked with a set of chemicals with known ER activity. In the presented framework, a detailed pre-processing step is executed where: (1) experimental image analysis data is scanned for missing data points; (2) data is cleaned by removing the inactive compounds; (3) data set outliers are detected via hierarchical clustering; and, (4) experimental features are normalized via mean absolute deviation. Following pre-processing, the framework continues with a two-step feature selection methodology where uncorrelated features are first identified by hierarchical clustering using the pairwise similarity of the descriptors; secondly, the biologically relevant descriptor(s) are selected for analysis. Both linear and nonlinear classifiers are tested as a part of this framework for modeling endocrine-disrupting potentials of chemicals that affect ER functions, and their predictive performances are quantified via evaluation metrics. The linear and nonlinear classification model results show that high throughput microscopy and high content analysis-based experimental data can be used to train robust, highly accurate classifiers with a minimum number of features and sampling points (i.e., one feature for linear classification and five features for nonlinear classification). Through the results of this framework, we can identify the topmost important feature for the classification of ER agonists/antagonists without loss of generality and provide recommendations for the appropriate model selection. In addition, the presented data-driven framework serves as a guideline for rapidly scanning unknown chemicals and obtaining their estrogenic potentials with high accuracy. This property of the framework will be profound during environmental emergencies, where it is of the utmost importance to rapidly identify the potential biological risks of unknown chemicals.