Virtual screening by a new Clustering-based Weighted Similarity Extreme Learning Machine approach

Machine learning techniques are becoming popular in virtual screening tasks. One powerful machine learning algorithm is the Extreme Learning Machine (ELM), which has been applied in many areas and, recently, to virtual screening. We propose the Weighted Similarity ELM (WS-ELM), which is based on a single-layer feed-forward neural network and uses 16 different similarity coefficients as activation functions in the hidden layer. It is known that the performance of conventional ELM is not robust due to random weight selection in the hidden layer. Thus, we propose a Clustering-based WS-ELM (CWS-ELM) that deterministically assigns weights by utilising clustering algorithms, i.e. k-means clustering and support vector clustering. The experiments were conducted on one of the most challenging datasets, the Maximum Unbiased Validation (MUV) dataset, which contains 17 activity classes carefully selected from PubChem. The proposed algorithms were then compared with other machine learning techniques such as support vector machine and random forest, as well as with similarity searching. The results show that CWS-ELM in conjunction with support vector clustering yields the best performance when utilised together with the Sokal/Sneath(1) coefficient. Furthermore, the ECFP_6 fingerprint presents the best results in our framework compared to the other types of fingerprints, namely ECFP_4, FCFP_4, and FCFP_6.


Introduction
Drug screening is the process of identifying drug candidates that are active against relevant biological targets. Recently, computers have been used to speed up the development process in order to reduce the time required to launch drugs onto the market. Moreover, this offers potential savings of millions of dollars compared to testing in vitro. Virtual screening is a set of computational techniques which aims to rank molecule structures in a database [1]. This ensures that chemists can first assay the molecules which have a higher probability of being active against the relevant biological target. A conventional technique in virtual screening is called "similarity searching". It ranks all molecules in a database on the basis of similarity or dissimilarity to a query molecule.
Machine learning techniques are becoming popular in many applications today. They also play an important role in the drug discovery process, e.g. prediction of target structures and optimisation of hit compounds. Examples of techniques used in the drug discovery process are support vector machine (SVM) [2][3][4], binary discriminant analysis [2,5], artificial neural networks [6], and decision trees [7]. Many techniques used in virtual screening have been well documented and reviewed in [8][9][10]. Among these techniques, SVM is one of the most powerful and popular in this area, resulting in an increasing number of publications in recent decades [10].
Although SVM is a powerful algorithm, its main drawback is that it requires quadratic programming to solve the problem, so at least its space complexity is quadratic. When the training dataset becomes large, its computational cost becomes very high. In addition, SVM requires two or more user-specified parameters which directly affect the model's performance. These parameters must be tuned in order to obtain an optimal model; thus, the more parameters there are to tune, the higher the computational cost. In 2004, Extreme Learning Machine (ELM), which makes use of a single-hidden-layer feed-forward neural network, was proposed by Huang et al. [11]. Their algorithm is fast and able to obtain the optimal solution. It has proved to be competitive with SVM in performance while training remarkably faster. Moreover, ELM requires less human intervention than SVM because the only important parameter is the number of hidden nodes [12,13]. ELM has been applied to protein sequence classification [14][15][16]. To the best of our knowledge, ELM was first applied to the virtual screening task by [17] as Weighted Tanimoto ELM (WELM JT). The algorithm is customised for 2D binary fingerprint descriptors. WELM JT replaces the activation function of the hidden-layer neurons with the Jaccard/Tanimoto (JT) similarity coefficient. Moreover, instead of randomly generating hidden nodes from a continuous distribution as in the conventional ELM, WELM JT randomly selects hidden nodes from the training set. Since many similarity coefficients are available, we adopt a weighted similarity ELM (WS-ELM) algorithm which employs different similarity coefficients, in order to find a suitable similarity coefficient for the virtual screening task with 2D fingerprint descriptors.
In addition, the performance of WS-ELM, like that of ELM, is not robust due to random weight selection. A deterministic assignment of hidden weights should therefore be considered to increase robustness. We propose an approach that carefully selects the weights of WS-ELM by employing clustering techniques to choose representative candidates for the weights. The proposed algorithm, called "Clustering-based Weighted Similarity ELM" (CWS-ELM), is compared to conventional techniques in a well-designed experimental framework on one of the most challenging databases, the Maximum Unbiased Validation (MUV) dataset, which consists of 17 activity classes.

Methods
In this section, we explain all methods used in this work together with our proposed techniques.
Similarity coefficients have been introduced and re-introduced, as they are in very common use in many applications [5,18,19].
In this paper, we investigate 16 coefficients selected from [5,20,21], as shown in Table 1. Some coefficients are excluded, e.g. Dice: it is monotonic with Jaccard/Tanimoto, so the two give identical rankings. The similarity s(x_i, x_j) and dissimilarity d(x_i, x_j) of two molecules are usually calculated from four quantities: (i) a, the number of bits set in both molecules i and j; (ii) b, the number of bits set in molecule i and unset in molecule j; (iii) c, the number of bits set in molecule j and unset in molecule i; and (iv) d, the number of bits unset in both molecules i and j. The sum of these four quantities (a + b + c + d) equals the number of bits m in the fingerprints of molecules i and j. The coefficients are divided into three main groups as follows:
• Association coefficients are based upon the inner product operation. Most range over [0, 1], where 0 indicates no similarity and 1 indicates complete similarity.
• Correlation coefficients measure the degree of correlation between the molecules.
• Distance coefficients quantify the degree of difference between two objects. The more similar two objects are, the smaller the distance value is. A distance function can be converted to a similarity function by d(
If multiple active molecules (n_A) are available, we can calculate the similarity value between a molecule x_j in the unranked database and a set of query molecules, for all x_i ∈ Actives, by
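The four bit counts a, b, c and d described above can be illustrated with a short Python sketch. It assumes the commonly used definitions of Jaccard/Tanimoto, a / (a + b + c), and Sokal/Sneath(1), a / (a + 2b + 2c); Table 1 is not reproduced here, so these forms, the function names and the toy fingerprints are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def bit_counts(x, y):
    """Return (a, b, c, d) for two equal-length binary fingerprints."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))    # bits set in both molecules
    b = int(np.sum(x & ~y))   # bits set in x only
    c = int(np.sum(~x & y))   # bits set in y only
    d = int(np.sum(~x & ~y))  # bits unset in both molecules
    return a, b, c, d

def jaccard_tanimoto(x, y):
    """Assumed form: a / (a + b + c)."""
    a, b, c, _ = bit_counts(x, y)
    denom = a + b + c
    return a / denom if denom else 0.0

def sokal_sneath_1(x, y):
    """Assumed form: a / (a + 2b + 2c), i.e. mismatches count double."""
    a, b, c, _ = bit_counts(x, y)
    denom = a + 2 * (b + c)
    return a / denom if denom else 0.0

# Hypothetical 8-bit fingerprints, for illustration only.
x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0, 1, 0]
print(bit_counts(x, y), jaccard_tanimoto(x, y), sokal_sneath_1(x, y))
```

Because Sokal/Sneath(1) only doubles the mismatch terms b and c, it never exceeds Jaccard/Tanimoto for the same pair of fingerprints.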

Extreme Learning Machine
Extreme Learning Machine (ELM) was first proposed by Huang et al. [11]. It is based on a single-layer feed-forward neural network architecture. Consider the matrix of m-dimensional sample vectors X = [x_1, x_2, ..., x_n]^T and a target vector y comprising y_i ∈ {−1, +1}. The output of ELM can be defined as a linear sum of the hidden-layer outputs weighted by β_i, the weights connecting the hidden neurons to the output. There are l nodes in the hidden layer. Each hidden-layer output applies an activation function g(·) to a linear combination of the input x with the synaptic weights w_i and bias b_i that connect the hidden neuron to the input neurons. Therefore the model can be defined as:
\[ f(x) = \sum_{i=1}^{l} \beta_i \, g(w_i \cdot x + b_i) \]
where w_i = [w_{i1}, ..., w_{im}] is randomly generated. Therefore, the activity of the hidden nodes can be represented as
\[ H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_l \cdot x_1 + b_l) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_n + b_1) & \cdots & g(w_l \cdot x_n + b_l) \end{bmatrix} \]
The ELM aims to minimise the mean squared error,
\[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
where \hat{y}_i is a predicted target. Thus, the Moore-Penrose pseudo-inverse H^{\dagger} is employed to achieve the optimal solution for this problem. Hence, β can be defined by
\[ \beta = H^{\dagger} y \]
The prediction score can be computed from
\[ \hat{y} = H \beta \]
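The conventional ELM described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Matlab implementation: the tanh activation, the Gaussian distribution of the random weights, and the names `elm_fit` and `elm_predict` are assumptions made for the example.

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    m = X.shape[1]
    W = rng.standard_normal((n_hidden, m))  # random input weights w_i (assumed Gaussian)
    b = rng.standard_normal(n_hidden)       # random biases b_i
    H = np.tanh(X @ W.T + b)                # hidden-layer output matrix, shape (n, l)
    beta = np.linalg.pinv(H) @ y            # Moore-Penrose pseudo-inverse solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Prediction score: hidden-layer outputs times output weights."""
    return np.tanh(X @ W.T + b) @ beta

# Toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = np.where(rng.random(20) < 0.5, 1.0, -1.0)
W, b, beta = elm_fit(X, y, n_hidden=40, rng=rng)
scores = elm_predict(X, W, b, beta)
```

Only `beta` is learned; `W` and `b` stay at their random values, which is the source of the run-to-run variability discussed in the text.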

Weighted Similarity Extreme Learning Machine
The objective of the proposed Weighted Similarity ELM (WS-ELM) consists of two parts: (i) an empirical likelihood function, the mean squared error, and (ii) a penalised likelihood function, the ridge penalty. The activation function g(·) in the conventional ELM is replaced by s(·, ·); hence, H is represented as
\[ H = \begin{bmatrix} s(x_1, w_1) & \cdots & s(x_1, w_l) \\ \vdots & \ddots & \vdots \\ s(x_n, w_1) & \cdots & s(x_n, w_l) \end{bmatrix} \]
C is a regularisation parameter that controls the complexity of the model. w is randomly selected from the training set (w ⊂ X) instead of from a continuous distribution. This ensures that the selected weights are binary, sparse, and of the same dimension as the samples. The virtual screening task faces a dramatic imbalance between the number of active (n_A) and inactive (n_I) molecules. In order to deal with this imbalanced class problem, a diagonal matrix Γ of size n × n is defined, with one entry per training sample, so that the minority class is given higher importance than the majority class. The resulting weighted likelihood function can be minimised using standard ℓ2-regularised weighted least squares. Instead of calculating H^T Γ H, we can calculate (γ · H)^T (γ · H), which speeds up the computation [17]. This leads to the solution in Eq 12, where \hat{H} = γ · H and γ_i scales the i-th row of H, so that γ_i^2 is the i-th diagonal entry of Γ. The architecture of WS-ELM is shown in Fig 1.
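A minimal Python sketch of the weighted, ridge-regularised solution described above follows. Several details are assumptions, since the corresponding equations are not legible in this text: the regulariser is written as I/C, the class weights use inverse class frequency, and the names `tanimoto_matrix`, `ws_elm_fit` and `ws_elm_score` are hypothetical. Jaccard/Tanimoto stands in for any of the 16 coefficients.

```python
import numpy as np

def tanimoto_matrix(X, W):
    """Jaccard/Tanimoto similarity between binary rows of X (n x m) and W (l x m)."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    a = X @ W.T                                         # bits set in common
    union = X.sum(1)[:, None] + W.sum(1)[None, :] - a   # a + b + c
    return np.divide(a, union, out=np.zeros_like(a), where=union > 0)

def ws_elm_fit(X, y, n_hidden, C, rng):
    """Weighted ridge solution with hidden weights drawn from the training set."""
    idx = rng.choice(len(X), size=n_hidden, replace=False)
    W = X[idx]                                  # hidden weights are training fingerprints
    H = tanimoto_matrix(X, W)
    n_act, n_inact = np.sum(y == 1), np.sum(y == -1)
    # Assumed weighting: gamma_i = sqrt(Gamma_ii), Gamma_ii = 1 / (size of own class).
    gamma = np.where(y == 1, 1 / np.sqrt(n_act), 1 / np.sqrt(n_inact))
    H_hat = gamma[:, None] * H                  # row-scaled H, avoids forming Gamma
    A = H_hat.T @ H_hat + np.eye(n_hidden) / C  # assumed regulariser placement: I / C
    beta = np.linalg.solve(A, H_hat.T @ (gamma * y))
    return W, beta

def ws_elm_score(X_new, W, beta):
    """Higher score means more likely active."""
    return tanimoto_matrix(X_new, W) @ beta

# Toy imbalanced data, for illustration only.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(30, 16))
X[:, 0] = 1                                     # guarantee no all-zero fingerprint
y = np.array([1] * 5 + [-1] * 25)
W, beta = ws_elm_fit(X, y, n_hidden=8, C=10.0, rng=rng)
```

Scaling the rows of H by γ gives the same normal equations as forming the n × n matrix Γ explicitly, but needs only an n × l product.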

Clustering-based Weighted Similarity Extreme Learning Machine
Due to the randomness of the weights between the input and hidden layers, the prediction of the conventional ELM is not stable. The same applies to WS-ELM, because a subset of samples in the training set is randomly selected to represent its weights. A deterministic assignment of hidden weights should therefore improve on the conventional approach. To enable such a deterministic assignment, we utilise cluster analysis methods to organise and summarise the data through group prototypes. Thus, we propose a new algorithm called "Clustering-based WS-ELM" (CWS-ELM). Cluster analysis is an unsupervised learning technique for grouping samples in the space into k groups. It aims to minimise the distance between samples within each cluster while maximising the distance between groups. Many clustering algorithms have been introduced and are well documented [22,23]. In this paper, we investigate the k-means clustering and support vector clustering algorithms. The rationale behind this selection is the choice of how each group is represented: a cluster can be represented either by its centroid, identified by the k-means clustering algorithm, or by a set of samples bounding the cluster. Brief details of these two algorithms are given below. The pseudo-code for CWS-ELM is shown in Algorithm 1.
k-means clustering. This is the conventional clustering technique which aims to minimise the Euclidean distance between the samples and the centroid in each cluster. The number of clusters (k) must be determined by the user. Instead of Euclidean distance, we can also adopt other distance or similarity coefficients listed in Table 1. In order to ensure that CWS-ELM picks a binary weight, we choose the sample that is closest to the centroid. Thus, the number of nodes used in CWS-ELM is equal to the number of centroids representing all clusters in the training data.
Support vector clustering. Support Vector Clustering (SVC), introduced by [24], is inspired by the well-known "Support Vector Machine" algorithm. SVC employs a kernel trick to map all samples into a high-dimensional feature space and finds the smallest sphere which contains the mapped samples. The sphere can be mapped back to the original feature space, where it forms a set of contours which enclose the samples. Samples within the same contour belong to the same cluster. Furthermore, any points lying on the boundary of the sphere (the cluster boundary) are considered support vectors. Moreover, embedding a soft margin in SVC allows the sphere not to enclose all points. Thus the algorithm has the ability to deal with outliers. The similarity functions in Table 1 can be adopted as kernel functions, similarly to [5]. In CWS-ELM, the number of nodes is equal to the number of support vectors bounding the clusters in the training data.
Algorithm 1 Clustering-based Weighted Similarity Extreme Learning Machine
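The k-means variant of the weight selection step can be sketched as follows. The body of Algorithm 1 is not reproduced in this text, so this is only one plausible reading of the description above: run k-means, then replace each centroid by the nearest training fingerprint so that the hidden weights stay binary. The plain Lloyd iteration with Euclidean distance, the first-k-samples initialisation, and the names `kmeans` and `select_prototypes` are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic start (first k samples)."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every sample to every centroid, shape (n, k).
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([
            X[labels == j].mean(0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def select_prototypes(X, k):
    """Indices of the training samples closest to each k-means centroid."""
    X = np.asarray(X, dtype=float)
    centroids = kmeans(X, k)
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # One nearest sample per centroid; duplicates collapse, so fewer than k may remain.
    return np.unique(d.argmin(0))

# Two hypothetical groups of 6-bit fingerprints, interleaved.
X = np.array([
    [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1],
])
idx = select_prototypes(X, k=2)
hidden_weights = X[idx]   # binary rows used as the hidden-layer weights
```

Because nothing in this selection is random, repeated runs on the same training set return the same hidden weights, which is the robustness argument made in the text.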

Maximum Unbiased Validation Dataset
The experiments were conducted on a well-known, publicly available virtual screening dataset, the "Maximum Unbiased Validation" (MUV) dataset, which was created by the Institute of Pharmaceutical Chemistry, Braunschweig University of Technology, Germany [25]. The dataset consists of 17 bioactivity data sets carefully selected from PubChem, an open archive of the biological activities of millions of molecules, as shown in Table 2. Each set consists of 30 active compounds together with 15,000 carefully selected confirmed inactive compounds (also known as decoys). An active compound is a compound which causes the corresponding biological activity, while an inactive compound does not. The active compounds in each activity class are designed to be structurally heterogeneous (sometimes called diverse), with an average of only 1.14 compounds per distinct scaffold in each activity class. The scaffold is the core structure which forms the main component of a molecule. Moreover, the classes are grossly imbalanced, with over 99.8% of compounds belonging to the inactive group. Therefore, this dataset is one of the most challenging in virtual screening.
We represent the data with two popular fingerprints generated by Pipeline Pilot software, namely: Extended Connectivity Fingerprint (ECFP), and Functional-class Fingerprint (FCFP) [26]. The reason behind the selection of these two types of fingerprints is that Gardiner et al.
As mentioned earlier, the MUV dataset is very diverse. Another widely used indicator of diversity among the substructures of molecules in a database is the mean pairwise similarity (MPS) score. The lower the score, the more heterogeneous an activity class is, and hence the more difficult its actives are to retrieve in a virtual screening task. The MPS of each compound with every other compound in the class, calculated with different fingerprints using the Jaccard/Tanimoto similarity coefficient, is shown in Table 2. It can be seen that the average MPS is only 0.19 on a scale from 0 to 1.
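As an illustration of the MPS score, the sketch below averages the Jaccard/Tanimoto similarity over all distinct pairs of fingerprints in a class. Averaging over unordered pairs is our reading of the description; the function name and the toy fingerprints are hypothetical.

```python
import numpy as np

def mean_pairwise_similarity(F):
    """Mean Jaccard/Tanimoto similarity over all distinct pairs of rows of F."""
    F = np.asarray(F, dtype=float)
    a = F @ F.T                                   # common bits for every pair
    counts = F.sum(1)
    union = counts[:, None] + counts[None, :] - a
    S = np.divide(a, union, out=np.zeros_like(a), where=union > 0)
    upper = np.triu_indices(len(F), k=1)          # each unordered pair once, no self-pairs
    return S[upper].mean()

# Three hypothetical 4-bit fingerprints.
F = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 1, 1]]
print(mean_pairwise_similarity(F))  # pairs give 1/3, 0 and 1/3, so the mean is 2/9
```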

Experiment settings
The dataset is divided into training and test sets. The training sets are created similarly to [30][31][32]. All 30 active molecules from the 17 activity classes in the MUV dataset are collected in a data pool. Then, for each activity class under consideration, we randomly select 170 molecules (n_Tr) as a training set consisting of 10 active and 160 inactive molecules. The remaining samples in the data pool, combined with the inactive samples for the activity class under consideration, constitute the test set. Active and inactive molecules are labelled as 1 and −1, respectively.
The experiments are divided into three parts as follows:
• Evaluating WS-ELM in conjunction with the different similarity coefficients against the baseline method, similarity searching, on the four fingerprints considered, in order to find the similarity coefficient and fingerprint best suited to WS-ELM.
• Comparing our proposed algorithm, CWS-ELM, with ELM variants.
• Comparing CWS-ELM and WS-ELM with the best similarity coefficient against other approaches.
All experiments are run 10 times with different random splits into training and test sets. In addition, the hyper-parameters of each algorithm are identified by estimating the generalisation error via five-fold cross-validation on the training set, on the basis of the area under the Receiver Operating Characteristic curve (AUROC). There are many criteria for the evaluation of virtual screening tasks, e.g. AUROC, Enrichment Factor (EF), Robust Initial Enhancement (RIE), and Boltzmann-Enhanced Discrimination of ROC (BEDROC) [33][34][35][36]; we select AUROC because it is simple and is a standard metric in many fields.
In WS-ELM, there are two parameters which need to be tuned: the number of hidden nodes (l) and a regularisation parameter (C). The range of l was {1, ..., n_Tr}, while the range of C was {10^−6, 10^−5, ..., 10^5, 10^6}. For CWS-ELM, the regularisation parameter (C) is tuned using the same range as for WS-ELM. In addition to this base hyper-parameter of WS-ELM, the number of clusters (k) for k-means-based WS-ELM (CWS-ELM KMC) is tuned within a range of 1 to n_Tr, while SVC-based WS-ELM (CWS-ELM SVC) has another regularisation parameter, C_S, with a range from 0.1 to 1.0 in increments of 0.1.
A model is trained with the training data with a set of optimal parameters. The model is tested on the test set and evaluated with a widely used performance measure in a virtual screening task-the average proportion of the maximum possible number of active molecules (hit rate) which is retrieved from the top 1% of the ranked database. The molecules are ranked based on the predicted score from the output layer of WS-ELM and its variants. The higher the score, the more likely the molecule is to be active.
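The hit rate in the top 1% can be computed from the predicted scores as in the sketch below. It follows one reading of the definition above, namely the number of actives found in the top fraction divided by the maximum number that could have been found there; the function name `hit_rate_at` and the toy ranking are assumptions.

```python
import numpy as np

def hit_rate_at(scores, labels, fraction=0.01):
    """Actives retrieved in the top fraction, relative to the maximum possible."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n_top = max(1, int(round(fraction * len(scores))))
    order = np.argsort(-scores, kind="stable")       # highest score first
    hits = int(np.sum(labels[order[:n_top]] == 1))
    max_hits = min(n_top, int(np.sum(labels == 1)))  # best achievable in the top list
    return hits / max_hits if max_hits else 0.0

# Toy ranking: 1000 molecules, 10 actives, 2 of them ranked in the top 10.
scores = np.arange(1000, 0, -1)
labels = -np.ones(1000, dtype=int)
labels[[0, 3] + list(range(500, 508))] = 1
print(hit_rate_at(scores, labels))  # 2 of a possible 10 hits -> 0.2
```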
All experiments are carried out using the Matlab environment. SVC toolbox is available to download at https://sites.google.com/site/daewonlee/research/svctoolbox and the proposed CWS-ELM can be downloaded at https://github.com/dsmlr/cwselm.

Results and discussions
A comparison of similarity searching and WS-ELM with the 16 similarity coefficients on four types of fingerprint
WS-ELM together with the 16 coefficients and similarity searching were evaluated on the 17 activity classes with four types of fingerprint. The experimental results are shown in Tables 3 and 4 for similarity searching and WS-ELM, respectively. Each element in these tables contains the mean hit rate in the top 1% of the ranked database, averaged across the four fingerprints and 10 different data splits. It is clear that Sokal/Sneath(1) achieves the best performance, followed by the Jaccard/Tanimoto and Sokal/Sneath(3) coefficients, in both similarity searching and WS-ELM. It should be noted that Sokal/Sneath(1) is a modified version of the Jaccard/Tanimoto function which gives double weight to non-matches. The worst similarity coefficients in similarity searching and WS-ELM are Roger/Tanimoto and Yule, respectively.
There is a degree of variation in the performance of the 16 similarity coefficients (N objects) across the 17 activity classes (k judges). The ranks in Tables 5 and 6 are assigned according to Tables 3 and 4, respectively. The degree of agreement between the assigned rankings can be determined by a statistical analysis called the "Kendall Coefficient of Concordance" [37]. This can be calculated by Eq (14),
\[ W = \frac{12 \sum_{i=1}^{N} \bar{R}_i^2 - 3N(N+1)^2}{N(N^2-1) - \frac{1}{k} \sum_{j=1}^{k} T_j} \tag{14} \]
where \bar{R}_i is the average of the ranks assigned to the i-th object and T_j is a correction factor for tied ranks, given by Eq (15),
\[ T_j = \sum_{i=1}^{g_j} (t_i^3 - t_i) \tag{15} \]
where t_i is the number of tied ranks in the i-th grouping of ties, and g_j is the number of groups of ties in the j-th set of ranks. The significance of the computed value of W can be obtained from the table of critical values for N ≤ 7 [37] or from a table of the chi-square distribution with N − 1 degrees of freedom for N > 7. We can calculate chi-square from
\[ \chi^2 = k(N-1)W \]
Table 5. Ranks assigned to the 16 similarity coefficients (similarity searching) by the 17 activity classes from Table 3.
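The concordance test described here can be sketched in Python as follows. The sketch assumes the standard tie-corrected form of Kendall's W based on average ranks and the large-sample statistic χ² = k(N − 1)W, which reproduces the χ² values reported in the next paragraph from the stated W values; the function name `kendall_w` and the toy rank tables are illustrative.

```python
import numpy as np

def kendall_w(ranks):
    """Kendall's W with tie correction.

    ranks: (k, N) array; row j holds the ranks that judge j assigns to the
    N objects, with tied objects sharing their average rank.
    """
    ranks = np.asarray(ranks, dtype=float)
    k, N = ranks.shape
    mean_ranks = ranks.mean(axis=0)                   # average rank of each object
    T = 0.0
    for row in ranks:
        _, t = np.unique(row, return_counts=True)     # sizes of the tie groups
        T += np.sum(t ** 3 - t)                       # untied ranks contribute 0
    W = (12 * np.sum(mean_ranks ** 2) - 3 * N * (N + 1) ** 2) / (
        N * (N ** 2 - 1) - T / k
    )
    chi2 = k * (N - 1) * W                            # compare with chi-square, N - 1 dof
    return W, chi2

# Three judges in full agreement on four objects.
W_agree, chi2_agree = kendall_w([[1, 2, 3, 4]] * 3)
# Two judges in complete disagreement.
W_oppose, _ = kendall_w([[1, 2, 3, 4], [4, 3, 2, 1]])
print(W_agree, chi2_agree, W_oppose)
```

As a consistency check on the reported figures, 17 × 15 × 0.8103 ≈ 206.6 and 17 × 15 × 0.5161 ≈ 131.6.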

The computed values of W for similarity searching and WS-ELM are 0.8103 and 0.5161, respectively, which correspond to χ² values of 206.64 and 131.61, respectively (p < 0.001 for 15 degrees of freedom). Because the agreement among the rankings assigned by the activity classes is significant, this leads to the following ordering in the similarity searching case:
The ranking of the 16 coefficients in the WS-ELM case is as follows:
WS-ELM is then compared with the similarity searching technique. According to Tables 3 and 4, WS-ELM achieves a higher percentage of the maximum possible actives retrieved (9.10%) than similarity searching (7.19%) on average across 17 activity classes, 16 similarity coefficients, four fingerprints, and ten runs. The t-test is used to test the significance of the difference between the means of two independent samples [37]. It confirms that WS-ELM performs better than similarity searching on average at p < 0.001.
Next, the performance of Sokal/Sneath(1) in similarity searching and WS-ELM is analysed. As shown in Tables 3 and 4, similarity searching and WS-ELM achieve 11.79% and 12.16% of the maximum percentage of actives retrieved, respectively. However, this difference is not statistically significant (p = 0.5759), so it cannot be concluded that WS-ELM with Sokal/Sneath(1) outperforms similarity searching with Sokal/Sneath(1).
Fig 2 shows the relative improvement or worsening of WS-ELM with respect to similarity searching on average across 16 similarity coefficients, four fingerprints, and 10 runs. The entries are sorted by MPS score. It is hardly surprising that WS-ELM performs better than similarity searching: similarity searching only uses active molecules in its training set, while WS-ELM has a proper training set consisting of active and inactive molecules. WS-ELM was more effective than similarity searching in 16 out of 17 cases, especially in the cases with low MPS (heterogeneous classes). This means that including inactive molecules in the training sets can improve overall performance. However, it might not be very useful in some homogeneous classes, i.e. I01, I02, I05, and I07.
Further analysis is conducted using a violin plot (Fig 3) to evaluate the distribution of the results for WS-ELM in conjunction with each similarity coefficient. Two distinct distributions are clearly visible for each coefficient. These two distributions reflect the activity classes with high and low MPS scores. The distribution with higher performance contains activity classes I01, I02, I03, I04, and I05, with an average MPS of 0.21, while the distribution with lower hit rates contains the remaining activity classes, with an average MPS of 0.18. In other words, the five most homogeneous activity classes in the MUV dataset achieve higher hit rates than the others.
On the other hand, if the active molecules are very structurally heterogeneous, it is difficult to achieve a high hit rate in that activity class as shown in      average.
Again, the Kendall Coefficient of Concordance is applied in order to obtain an ordering of the four fingerprints (4 objects) by 17 judges. The computed W values are 0.1657 and 0.1352 for WS-ELM and similarity searching, respectively. These correspond to chi-square values of 84.5153 and 68.9718, respectively; both are significant at the 0.001 level. This suggests the same ordering of fingerprints for both WS-ELM and similarity searching:

A comparison of CWS-ELM and WS-ELM with the best two similarity coefficients
The two best similarity coefficients for the MUV dataset from the first part, Sokal/Sneath(1) and Jaccard/Tanimoto, are employed in the proposed CWS-ELM algorithm. The proposed algorithms are compared with WS-ELM in the same framework. The maximum percentage of active molecules retrieved in the top 1% and the number of hidden nodes used in the model are reported in Table 7. Each element is an average across four fingerprints and 10 runs. The proposed algorithm is reported as CWS-ELM KMC and CWS-ELM SVC for CWS-ELM in conjunction with k-means clustering and SVC, respectively.
Overall, the proposed CWS-ELM yields the highest performance measure in 15/17 activity classes. The best technique is CWS-ELM SVC in conjunction with Sokal/Sneath(1), which achieves the best percentage of active molecules retrieved in 9/17 cases, at 13.02% on average across all activity classes, but it requires the highest number of nodes in the hidden layer, at 71.0%. This is followed by CWS-ELM SVC in conjunction with Jaccard/Tanimoto, which achieved a 12.03% hit rate and was the best performer in 4/17 cases. However, the performance of CWS-ELM KMC is slightly worse than that of WS-ELM because it uses a smaller number of hidden nodes on average than WS-ELM. The correlation coefficient between the mean percentage hit rate and the number of nodes used in the model is 0.93, which is a very high correlation. Due to the high degree of diversity in the dataset, the number of nodes in the hidden layer can directly affect the performance of the model. If the model is too simple, the performance of the classifier can degrade.
Our proposed algorithm embeds two clustering techniques to select the representative samples in WS-ELM: one selects the centroids of the clusters and the other utilises the support vectors bounding the clusters, so they are different in nature. For the same number of clusters in the space, SVC requires more than one support vector to bound and identify a cluster, while k-means clustering needs only one centroid to represent it. Therefore, there is a high chance of SVC performing better than k-means clustering on this dataset, as its activity classes are very diverse.
Again, we applied the Kendall Coefficient of Concordance to test the significance of the ranking of the six contenders in Table 8. The computed W is 0.2581, leading to a χ² of 21.94, which indicates that the results are highly statistically significant. This gives the following ranking:
Furthermore, the effect of the number of nodes in the hidden layer on performance is investigated. The most homogeneous (I01) and the most diverse (I17) classes in the dataset are evaluated with the ECFP_6 fingerprint. The regularisation parameter for each model, with a different number of nodes used, is tuned by five-fold cross-validation on the basis of AUROC. Again, the experiment is conducted ten times with different random splits. In this experiment, only WS-ELM and CWS-ELM KMC are evaluated because the number of hidden nodes of these two can be directly adjusted and compared. In CWS-ELM SVC, by contrast, the number of nodes depends on C_S. The results for I01 and I17 are displayed in Figs 6 and 7, respectively.
It is clear that CWS-ELM KMC retrieves more actives than WS-ELM when a small number of nodes (1-28%) is used in the model for I17, as shown in Fig 7. Moreover, it is more robust than WS-ELM, resulting in smaller standard deviations in performance. This means that careful selection of the samples used as hidden nodes is important. According to Fig 6, CWS-ELM KMC is comparable to WS-ELM on I01. Comparing the performance of the classifiers in both activity classes, the AUROC for I01 converges at 15% of nodes used, while the AUROC for I17 converges at 30% of nodes used. This shows that the performance of classifiers on I01 converges more quickly than on I17.

We also show an enrichment plot which is a very useful method for evaluating the quality of virtual screening methods. It is a cumulative sum plot of the active molecules retrieved from the top 1% of the ranked database. Figs 8 and 9 show enrichment plots for I01 and I17, respectively. Clearly, CWS-ELM's performances are better than the conventional WS-ELM JT in both I01 and I17. Performances by all methods on I01, the most homogeneous activity class, are better than on I17, the most heterogeneous activity class.
In addition to comparing the overall performance results by using enrichment plots, the individual molecules that are being retrieved are shown in Figs 10 and 11 for activity I01 and I17, respectively. It can be seen that CWS-ELM SVC-SN is the best in I01. Basically, any molecules retrieved by other approaches can be retrieved in the top 1% of the list by CWS-ELM SVC-SN but in different orders. This is because Sokal/Sneath(1) is a modified version of the Jaccard/Tanimoto function as mentioned in the previous section. In I17 case, WS-ELM SN fails to retrieve any active molecules in the top 1% while the other methods can retrieve one or two active molecules.

A comparison of CWS-ELM and WS-ELM with the best similarity coefficient against other approaches
The proposed methods CWS-ELM and WS-ELM are compared against other approaches, namely SVM, RF, and similarity searching. Apart from RF, all methods are based on the Sokal/Sneath(1) coefficient. The hyper-parameters of SVM and RF are tuned within the same framework as the proposed methods. As mentioned earlier, there are many criteria for evaluating the algorithms; in the previous experiments, AUROC was chosen for its simplicity, together with the percentage hit rate in the top 1%, which gives the same picture as EF. However, AUROC has been criticised because it is a global measure that does not pay attention to the top-ranked molecules; therefore Truchon & Bayly proposed a generalised ROC metric called "Boltzmann-Enhanced Discrimination of ROC" (BEDROC), which addresses the early recognition problem [34]. Best practices for evaluating virtual screening have been recommended in [35,38], and EF gives very much the same results as BEDROC but is easier to understand [36]. Therefore we follow the evaluations suggested in [35] by reporting the following measures: (i) EF at 0.5%, 1.0%, 2.0%, and 5.0%, and (ii) the ratio of true positive to false positive rates at 0.5%, 1.0%, 2.0%, and 5.0%.
Fig 12 shows EF and the ratio of true positive to false positive rates at the top 0.5%, 1.0%, 2.0%, and 5.0% of the ranked database. Both criteria display the same overall picture. CWS-ELM SVC is still the best contender among all algorithms, followed by SVM, at EF 0.5% and EF 1.0%. The worst is the similarity searching technique, as expected. These findings are confirmed by the Kendall Coefficient of Concordance (with N = 6 and k = 17), whose W values are 0.2186 (p < 0.01) and 0.1576 (p < 0.05) for EF at 0.5% and 1.0%, respectively, and lead to the following ranking:
Unfortunately, the values of W are not significant at the p = 0.05 level in the case of EF at 2.0% and 5.0%.
Furthermore, we also evaluate the task with BEDROC, whose parameter α relates to the number of top-ranked molecules considered in the database. The higher the value of α, the smaller the number of molecules considered. As we are interested in the top 1% of the ranked database, α is set to 160.9 (refer to [34]). The BEDROC results are shown in Fig 13, which also shows that EF 1.0% correlates with BEDROC(160.9) with a correlation coefficient of 0.9917. Although EF and BEDROC are strongly correlated, EF does not take into account the ratio of active and inactive molecules while BEDROC does.
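The two early-recognition measures discussed above can be sketched as follows. EF is taken as the fraction of actives in the top x% divided by the fraction of actives in the whole database, and BEDROC is computed in the min-max normalised RIE form associated with Truchon & Bayly [34]; both forms, the function names and the toy rankings are assumptions for illustration and are not taken from the authors' code.

```python
import numpy as np

def enrichment_factor(scores, labels, fraction):
    """EF: active rate in the top fraction divided by the overall active rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = len(scores)
    n_top = max(1, int(round(fraction * n)))
    order = np.argsort(-scores, kind="stable")
    hits_top = np.sum(labels[order[:n_top]] == 1)
    return (hits_top / n_top) / (np.sum(labels == 1) / n)

def bedroc(scores, labels, alpha=160.9):
    """BEDROC as (RIE - RIE_min) / (RIE_max - RIE_min)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")
    ranks = np.flatnonzero(labels[order] == 1) + 1   # 1-based ranks of the actives
    N, n = len(scores), len(ranks)
    ratio = n / N
    # RIE: observed exponential-weighted sum over its expectation for random ranks.
    random_sum = (n / N) * (1 - np.exp(-alpha)) / (np.exp(alpha / N) - 1)
    rie = np.sum(np.exp(-alpha * ranks / N)) / random_sum
    rie_max = (1 - np.exp(-alpha * ratio)) / (ratio * (1 - np.exp(-alpha)))
    rie_min = (1 - np.exp(alpha * ratio)) / (ratio * (1 - np.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)

# Toy rankings of 1000 molecules with 10 actives each.
scores = np.arange(1000, 0, -1)
best = -np.ones(1000, dtype=int)
best[:10] = 1                       # all actives ranked first
worst = -np.ones(1000, dtype=int)
worst[-10:] = 1                     # all actives ranked last
mixed = -np.ones(1000, dtype=int)
mixed[[0, 3] + list(range(500, 508))] = 1
print(enrichment_factor(scores, mixed, 0.01), bedroc(scores, mixed))
```

With α = 160.9 the exponential weight decays so quickly that actives ranked outside roughly the top 1% contribute almost nothing to the score, which is why BEDROC(160.9) and EF at 1.0% track each other closely.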

Conclusion
This study proposes a modified ELM, termed WS-ELM, which improves the overall performance of virtual screening tasks. It demonstrates the capability of WS-ELM on the MUV dataset, which is known as one of the most challenging datasets. The results show that Sokal/Sneath(1) and Jaccard/Tanimoto are the two best performers in this task among the 16 similarity coefficients. Moreover, statistical analysis shows that the ECFP fingerprint is better than the FCFP fingerprint, and that a circular substructure with a diameter of six bonds is generally better than one with a diameter of four bonds. Because the weights in the hidden nodes are generated randomly, the stability and robustness of WS-ELM cannot be guaranteed, which can lead to inaccurate predictions. Thus, WS-ELM is extended to CWS-ELM, which adopts a clustering algorithm, namely k-means clustering or SVC, to select the weights in the hidden nodes carefully instead of randomly. Experimental results confirm that CWS-ELM performs better and is more robust than WS-ELM. CWS-ELM SVC-SN is the best approach, consistently appearing in the top ranks compared with its variants and other machine learning techniques.