Average Information Content Maximization—A New Approach for Fingerprint Hybridization and Reduction

Fingerprints, bit representations of compound chemical structure, have been widely used in cheminformatics for many years. Although fingerprints with the highest resolution display satisfactory performance in virtual screening campaigns, the presence of a relatively high number of irrelevant bits introduces noise into data and makes their application more time-consuming. In this study, we present a new method of hybrid reduced fingerprint construction, the Average Information Content Maximization algorithm (AIC-Max algorithm), which selects the most informative bits from a collection of fingerprints. This methodology, applied to the ligands of five cognate serotonin receptors (5-HT2A, 5-HT2B, 5-HT2C, 5-HT5A, 5-HT6), proved that 100 bits selected from four non-hashed fingerprints reflect almost all structural information required for a successful in silico discrimination test. A classification experiment indicated that a reduced representation is able to achieve even slightly better performance than the state-of-the-art 10-times-longer fingerprints and in a significantly shorter time.


Introduction
Fingerprints are one of the most popular methods of converting chemical structures into a form that can be used in, e.g., machine learning experiments. They encode a compound's structural features into a bitstring, where "1" and "0" mean the presence or absence, respectively, of a particular pattern. Fingerprints are divided into two subgroups: non-hashed fingerprints (e.g., Substructure fingerprint, Klekota-Roth fingerprint), which encode precisely defined structural patterns, and hashed fingerprints (e.g., Extended fingerprint, Graph-only fingerprint), whose bits have no assigned meaning (Fig 1). Fingerprints are widely used in classification problems and similarity searching; therefore, they have found application in computer-aided drug design campaigns [1][2][3][4][5][6][7][8].
The multitude of structural features present in chemical compounds leads to long fingerprints, the longest of which contains 4860 bits [9]. The physical impossibility of the occurrence of hundreds of chemical substructures in low-molecular-weight chemical compounds and the biological insignificance of many bits increase the noise level in classification experiments. Moreover, the high resolution of the data increases the computational time, which is crucial in large virtual screening cascades.
Therefore, the reduction of fingerprint length without the loss of any meaningful information has become an important cheminformatics challenge in recent years. Several methodologies, e.g., consensus fingerprints [10], bit scaling [11], reverse fingerprints [12] and bit silencing [13], were introduced to reduce fingerprints via the weighting of particular bits. Another approach, proposed by Nisius et al., selects fingerprint bits according to their discrimination power, which is measured by the Kullback-Leibler divergence [14]. The method was applied to single fingerprints as well as to collections of fingerprints, leading to a successful attempt at fingerprint hybridization [15].
In this study, we introduce a new method for fingerprint hybridization and reduction: Average Information Content Maximization (the AIC-Max algorithm). The algorithm uses an extended version of mutual information, hereafter referred to as the Average Information Content (AIC), to select the most informative bits of different fingerprints needed for separating active from inactive compounds. In contrast to the aforementioned techniques, the AIC-Max algorithm may construct an optimal fingerprint for several biological targets. This approach substantially extends its application area. The strength of the AIC-Max algorithm stems from the fact that the selection process evaluates the discrimination power of entire groups of bits instead of single ones. Consequently, the algorithm will not select two features that carry similar information.
The proposed methodology was applied to create a reduced representation dedicated to the analysis of five closely related serotonin receptors: 5-HT2A, 5-HT2B, 5-HT2C, 5-HT5A and 5-HT6 (members of the G-protein coupled receptor superfamily), which play an important role in, e.g., the central nervous system (CNS) [16]. The algorithm was additionally tested on four other target families: carbonic anhydrases, cathepsins, histamine receptors and kinases (see S1 File). Although the advantages of hashed fingerprints cannot be denied, only non-hashed fingerprints were considered in the current study. Hashed fingerprints were deliberately omitted because they lack predefined substructural features and commonly suffer from bit collisions (the same bit is set by multiple patterns) [17], which make the structural interpretation of particular fingerprint coordinates nearly impossible. A hybrid fingerprint, reduced to 100 bits, reflects 99.77% of the information needed to distinguish active compounds from inactive ones (Fig 2) and contains structural patterns typical for serotonin receptor ligands, such as positively polarizable nitrogen atoms and aromatic systems.
The reduced representation significantly outperformed four standard non-hashed fingerprints in a classification experiment and achieved slightly better results than the hashed fingerprints generated by the PaDEL software [18] when a random forest classifier [19] was used. Moreover, the average training time of the random forest predictor was reduced almost 20 times compared to the Extended fingerprint. The constructed fingerprint generalized well to related biological targets such as the 5-HT1A receptor, as shown by additional tests. The results indicate that the AIC-Max algorithm is an efficient method for fingerprint reduction and hybridization, opening new perspectives for both virtual screening campaigns and structural analysis of the chemical space covered by ligands acting on similar targets.

Materials and Methods
The Average Information Content Maximization algorithm (AIC-Max algorithm) uses the notion of Average Information Content (AIC) to rank the features by their significance. The AIC quantifies the percentage of information that a set of features $\mathcal{X} = \{X_1, \ldots, X_N\}$ carries about the activity with respect to a set of biological receptors $\mathcal{R} = \{1, \ldots, K\}$ (the corresponding set of activity variables will be denoted by $\mathcal{Y} = \{Y_1, \ldots, Y_K\}$). The AIC is defined as the mutual information $MI(\mathcal{X}; Y_i)$ normalized by the entropy $SE(Y_i)$ [20][21][22], averaged over $\mathcal{R}$:

$$AIC_{\mathcal{Y}}(\mathcal{X}) = \frac{1}{K} \sum_{i=1}^{K} \frac{MI(\mathcal{X}; Y_i)}{SE(Y_i)} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{SE(Y_i)} \sum_{x \in S_N} \sum_{y \in \{0,1\}} P_i(x;y) \log_2 \frac{P_i(x;y)}{P(x)\,P_i(y)}, \qquad (1)$$

where $S_N = \{0,1\}^N$ is the set of all binary sequences of length $N$ and $P_i(y)$, $P(x)$, $P_i(x;y)$ denote the probabilities that $\{Y_i = y\}$, $\{\mathcal{X} = x\}$ and $\{\mathcal{X} = x, Y_i = y\}$, respectively.

If $\mathcal{X}$ fully determines the activity of all receptors, then AIC = 1; for $\mathcal{X}$ independent of all elements of $\mathcal{Y}$, AIC = 0. A set of features that reflects all the information about the activity against $l$ receptors and none of the information for the remaining $(K - l)$ receptors gives $AIC = l/K$, as demonstrated in Table 1. For closely related biological targets, however, the most informative features usually overlap to a large extent.
The important point is that the value of AIC depends on the joint information contained in all features included in $\mathcal{X}$. In particular, if $X_2$ is fully determined by $X_1$, then $AIC_{\mathcal{Y}}(X_1, X_2) = AIC_{\mathcal{Y}}(X_1)$. The above equality always holds if the correlation between $X_1$ and $X_2$ equals 1. In other words, the repeated addition of the same feature does not increase the value of AIC. In contrast, the extension of the set of features by an additional element cannot decrease AIC, as illustrated in Table 2.
To calculate AIC for a given set of receptors $\mathcal{R}$, the datasets of compounds for each $r \in \mathcal{R}$ can be created separately. Consequently, a single instance (compound) does not need a known activity label for all considered receptors. This is an important property because most compounds have proven activity (or inactivity) for only one receptor. It is worth mentioning that this reasoning cannot be applied to classical mutual information, where the activity of every compound against every receptor has to be provided to perform an analogous evaluation.
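The definition above can be illustrated with a minimal sketch that estimates AIC from empirical frequencies, assuming one separate dataset per receptor. The function names (`entropy`, `mutual_information`, `aic`) and the toy data are illustrative assumptions, not part of the original study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def mutual_information(rows, labels, cols):
    """MI between the joint value of the chosen fingerprint columns and the labels."""
    total = len(labels)
    keys = [tuple(row[c] for c in cols) for row in rows]
    p_x, p_y, p_xy = Counter(keys), Counter(labels), Counter(zip(keys, labels))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (n / total) * log2(n * total / (p_x[x] * p_y[y]))
        for (x, y), n in p_xy.items()
    )


def aic(datasets, cols):
    """Average of MI / SE over receptors; every receptor brings its own dataset.

    Assumes each label list contains both classes, so that SE > 0.
    """
    ratios = [
        mutual_information(rows, labels, cols) / entropy(labels)
        for rows, labels in datasets
    ]
    return sum(ratios) / len(ratios)


# Toy data (hypothetical): two receptors, each with its own compounds.
receptor_1 = ([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 1, 1])  # activity = bit 0
receptor_2 = ([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 0, 1])  # activity = bit 1
datasets = [receptor_1, receptor_2]

one_bit = aic(datasets, [0])        # explains one receptor out of two
both_bits = aic(datasets, [0, 1])   # explains both receptors
duplicated = aic(datasets, [0, 0])  # a repeated bit adds nothing
```

In this toy setting a single bit explains one of the two receptors, both bits explain both, and repeating a bit leaves the score unchanged, which mirrors the properties stated in the text.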
Table 1. Minimal and maximal values of AIC. The 3-bit fingerprint representation $X_1 X_2 X_3$ of eight compounds and their activity labels $Y_1$, $Y_2$, $Y_3$ for three biological targets, as listed in the table. The activity of the i-th receptor is fully determined by the single feature $X_i$; consequently, two bits give $AIC = 2/3$, since the activity of two out of three receptors is fully reflected by two bits. doi:10.1371/journal.pone.0146666.t001

Given a set $\mathcal{F}$ of all features (fingerprint coordinates), the goal is to find an N-element subset $\mathcal{X}$ of $\mathcal{F}$ such that $AIC_{\mathcal{Y}}(\mathcal{X})$ is maximal. In practice, it might be impossible to calculate AIC for all subsets of features to determine the most informative one (e.g., the number of m-element subsets of n features equals $\binom{n}{m}$, which even for n = 1000 and m = 10 gives about $2 \cdot 10^{23}$). The proposed AIC-Max algorithm uses a heuristic search in the space of all features $\mathcal{F}$ to reduce the computational time of the entire selection process. It iteratively picks the coordinate $X \in \mathcal{F} \setminus \mathcal{X}$ that maximizes $AIC_{\mathcal{Y}}(\mathcal{X} \cup \{X\})$, i.e., the information contained in the already chosen features extended by $X$. The selection of N features is described as follows:

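AIC-Max algorithm:
Input: $\mathcal{F}$, the set of given features
Output: $\mathcal{X}$, the set of selected features
1. Set $\mathcal{X} := \emptyset$.
2. Repeat N times:
   a. find $X^* \in \mathcal{F} \setminus \mathcal{X}$ that maximizes $AIC_{\mathcal{Y}}(\mathcal{X} \cup \{X\})$ over all $X \in \mathcal{F} \setminus \mathcal{X}$;
   b. set $\mathcal{X} := \mathcal{X} \cup \{X^*\}$.

To provide more efficient computations, the calculation of AIC in step 2a can be performed for a randomly selected n-element subset of $\mathcal{X}$, where $n \le N$; in the experiments we used n = 10.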
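The greedy selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `normalized_mi`, `aic` and `aic_max` and the toy data are hypothetical, probabilities are estimated by simple counting, and the random-subset speed-up (n = 10) is omitted for brevity.

```python
from collections import Counter
from itertools import product
from math import log2


def normalized_mi(rows, labels, cols):
    """MI(selected columns; labels) divided by the label entropy SE(labels)."""
    total = len(labels)
    keys = [tuple(row[c] for c in cols) for row in rows]
    p_x, p_y, p_xy = Counter(keys), Counter(labels), Counter(zip(keys, labels))
    mi = sum((n / total) * log2(n * total / (p_x[x] * p_y[y]))
             for (x, y), n in p_xy.items())
    se = -sum((n / total) * log2(n / total) for n in p_y.values())
    return mi / se


def aic(datasets, cols):
    """Average the normalized MI over all receptor datasets."""
    return sum(normalized_mi(r, l, cols) for r, l in datasets) / len(datasets)


def aic_max(datasets, n_features, n_select):
    """Greedy forward selection: add the bit that maximizes AIC of the group."""
    selected = []
    for _ in range(n_select):
        candidates = [c for c in range(n_features) if c not in selected]
        # Score each candidate together with the bits already chosen.
        best = max(candidates, key=lambda c: aic(datasets, selected + [c]))
        selected.append(best)
    return selected


# Toy fingerprints (hypothetical): bit 1 duplicates bit 0, bit 3 is noise.
rows = [(a, a, b, c) for a, b, c in product((0, 1), repeat=3)]
datasets = [
    (rows, [row[0] for row in rows]),  # receptor 1: activity follows bit 0
    (rows, [row[2] for row in rows]),  # receptor 2: activity follows bit 2
]
selected = aic_max(datasets, n_features=4, n_select=2)
```

Because candidates are scored jointly with the already selected bits, the duplicate bit 1 gains nothing once bit 0 is chosen, so the sketch selects bits 0 and 2. Selecting N bits this way needs on the order of N times |F| AIC evaluations rather than one per subset.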
The concept of the AIC is based on information theory and is partially related to the Asymmetric Clustering Index [23]. The most fundamental concept in information theory is Shannon entropy (SE), which quantifies the information contained in a given feature X [20]. Formally, if X takes values in {1, ..., k}, then

$$SE(X) = -\sum_{i=1}^{k} P(i) \log_2 P(i),$$

where P(i) is the probability of observing {X = i}. Note that SE(X) = 0 if X is constant. In contrast, if all values of X are equally probable, then SE attains its maximal value of $\log_2 k$.
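As a worked illustration of the entropy formula, consider a binary activity label Y in a hypothetical dataset with one active per nine inactives (the ratio used for putative inactives in this study; the numbers below are only an example):

```latex
\begin{align*}
SE(Y) &= -\left(0.1 \log_2 0.1 + 0.9 \log_2 0.9\right) \\
      &\approx 0.1 \cdot 3.322 + 0.9 \cdot 0.152 \\
      &\approx 0.469 \text{ bits}.
\end{align*}
```

This is well below the maximum of $\log_2 2 = 1$ bit reached for a balanced label, which is one reason why dividing by the entropy is useful when datasets differ in their class balance.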
To measure the joint information shared by two features, the notion of mutual information (MI) has to be used [20]. For X and Y taking values in {1, ..., k}, the MI is formulated as

$$MI(X;Y) = \sum_{i=1}^{k} \sum_{j=1}^{k} P(i;j) \log_2 \frac{P(i;j)}{P(i)\,P(j)},$$

where P(i;j) is the probability that {X = i, Y = j}. It can also be naturally extended to sets of features $\mathcal{X} = (X_1, \ldots, X_n)$, $\mathcal{Y} = (Y_1, \ldots, Y_k)$: the indexes i and j in the above expression must be replaced by sequences of indexes $(i_1, \ldots, i_n)$, $(j_1, \ldots, j_k)$, respectively [20].

Table 2. Influence of dependent and independent bits on AIC. The activity of a given receptor depends only on two out of four features: $X_1$ and $X_2$. The addition of feature $X_3$ to $X_1$ does not change AIC because it is independent of Y, which results in $AIC_Y(X_1) = AIC_Y(X_1, X_3) = 0.38$. The same holds for $X_4$, which is completely correlated with $X_1$, and $AIC_Y(X_1) = AIC_Y(X_1, X_4) = 0.38$.
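A small worked example may help to read the MI formula. Assume, purely for illustration, a binary feature X and a binary label Y with joint probabilities P(0;0) = P(1;1) = 0.4 and P(0;1) = P(1;0) = 0.1, so that all marginals equal 0.5:

```latex
\begin{align*}
MI(X;Y) &= 2 \cdot 0.4 \log_2 \frac{0.4}{0.5 \cdot 0.5}
         + 2 \cdot 0.1 \log_2 \frac{0.1}{0.5 \cdot 0.5} \\
        &= 0.8 \log_2 1.6 + 0.2 \log_2 0.4 \\
        &\approx 0.542 - 0.264 = 0.278 \text{ bits}.
\end{align*}
```

Since SE(Y) = 1 bit in this example, the normalized value MI(X;Y)/SE(Y) is also about 0.278, i.e., the feature carries roughly 28% of the information about the label.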
The evaluation of MI for a set of features $\mathcal{X}$ and a set of receptors $\mathcal{R}$ requires a single dataset of chemical compounds and the corresponding activity labels $\mathcal{Y}$ for all receptors. This makes it technically impossible to apply MI to determine the most informative subset of features with respect to various receptors, because there usually does not exist a representative dataset in which each compound has proven activity or inactivity for every $r \in \mathcal{R}$.
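To overcome this problem, the calculation of $MI(\mathcal{X}; \mathcal{Y})$ was replaced by the computation of individual factors $MI(\mathcal{X}; Y_i)$. These partial results are gathered into the final form by averaging:

$$AIC_{\mathcal{Y}}(\mathcal{X}) = \frac{1}{K} \sum_{i=1}^{K} \frac{MI(\mathcal{X}; Y_i)}{SE(Y_i)}.$$

The normalization by the entropy of $Y_i$ ensures that every factor describes the percentage of joint information instead of the absolute amount of information. In particular, $0 \le MI(\mathcal{X}; Y_i) / SE(Y_i) \le 1$ for every i.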

Results and Discussion
The experiments concerned the application of the AIC-Max algorithm to the selection of the most significant bits for ligands acting on five closely related biological receptors: 5-HT2A, 5-HT2B, 5-HT2C, 5-HT5A and 5-HT6. Among all fingerprints generated by the PaDEL software, only non-hashed fingerprints were considered: EState, MACCS, PubChem and Substructure (possessing 1434 bits in total), to enable the structural analysis of the selected bits (Table 3).
Although hashed representations can be more efficient for classification purposes, their coordinates do not have a straightforward meaning. Therefore, they were not incorporated into the selection process. Moreover, the longest fingerprint (KRFP), although non-hashed, was skipped because its high number of bits results in a rapid increase of the computational time required by the feature selection process. Clearly, some of the chemical patterns can be duplicated when the above four fingerprints are concatenated. Nevertheless, since the repeated addition of the same feature does not increase the value of AIC, there is no risk that the algorithm will pick two identical (or even very similar) bits for the final representation. All ligands were extracted from the ChEMBL database version 20 (February 2015) [27]. Ligands with an inhibition constant (K_i) less than or equal to 100 nM were considered active; ligands with K_i higher than 1000 nM were used as inactives. Putative inactive compounds were randomly selected from the ZINC database [28] in a ratio of 9 inactives per 1 active (Table 4) [29].
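The labeling rule and the 9:1 sampling of putative inactives can be sketched as below. The function names, compound identifiers and K_i values are hypothetical; treating compounds with K_i between 100 nM and 1000 nM as excluded is an assumption, since the text does not state how they were handled.

```python
import random


def label_by_ki(ki_nm, active_max=100.0, inactive_min=1000.0):
    """Return 1 (active), 0 (inactive) or None (excluded) for a Ki value in nM."""
    if ki_nm <= active_max:
        return 1
    if ki_nm > inactive_min:
        return 0
    return None  # assumption: intermediate affinities are left out


def sample_putative_inactives(pool, n_actives, ratio=9, seed=0):
    """Draw ratio * n_actives compounds at random, without replacement."""
    return random.Random(seed).sample(pool, ratio * n_actives)


# Hypothetical Ki measurements (nM) keyed by made-up compound identifiers.
measurements = {"cpd_a": 12.0, "cpd_b": 100.0, "cpd_c": 450.0, "cpd_d": 2500.0}
labels = {name: label_by_ki(ki) for name, ki in measurements.items()}

# Placeholder pool standing in for compounds drawn from a decoy database.
decoy_pool = [f"decoy_{i}" for i in range(100)]
n_actives = sum(1 for value in labels.values() if value == 1)
decoys = sample_putative_inactives(decoy_pool, n_actives)
```

A fixed seed is used here only to make the sketch reproducible.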
To evaluate the significance of the selected features, a 10-fold cross-validation was performed [30]. In this approach, a dataset is randomly partitioned into 10 equally sized subsets. Then, a single subset is retained as test data while the remaining 9 subsets are used in training. This process is repeated 10 times, so that each of the 10 subsets is used exactly once as the test data, and the results are averaged. The AIC-Max algorithm was run on a training dataset (including actives, inactives and putative inactives), and the evaluation of the selected features was reported for a test set. The score was measured by the normalized mutual information Eq (2) between the constructed representation and the true activity labels for each of the receptors.
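The cross-validation protocol can be sketched with the standard library alone. The generator name `k_fold_indices` is a hypothetical helper, and the fold sizes differ by at most one when the dataset size is not divisible by k.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled indices gives k nearly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10, seed=1))
```

In each split, feature selection and model fitting would use only the `train` indices, and the score would be computed on the `test` indices, so that the selected bits are never evaluated on compounds that influenced their selection.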
Information stored in a reduced fingerprint grows gradually with the number of features selected by the AIC-Max algorithm (Fig 3). The level of 90% was rapidly attained by a representation containing approximately 20 bits, both for datasets containing true inactives and for those containing compounds selected from ZINC. Nevertheless, to distinguish almost all considered active compounds from true inactives, a set of 100 bits is required (more than 99% of information), while for putative inactives, only 30 bits suffice (close to 100% of information). This outcome has two causes: the close structural similarity between actives and true inactives and the small number of compounds with confirmed inactivity (Table 4).
Because the AIC-Max algorithm returned slightly different subsets of bits in each fold, the algorithm was additionally applied to the entire dataset to obtain a single set of features. The reduced fingerprint (see S1 File for details) contained features that are crucial in ligand-protein interaction for serotonin receptors: a positively polarizable nitrogen atom and an aromatic system [31]. Moreover, the bit encoding the tertiary nitrogen atom is the most desirable in the reduction and hybridization process. Polarizable nitrogen atoms are encoded by several bits listed among the top-scored instances. The same situation can also be observed for the aromatic system, which appears three times among the 10 most desirable bits. Amide and sulfonamide moieties (and their subelements) are other popular patterns present in the universal fingerprint, which reflects current trends in medicinal chemistry [32][33][34][35][36].
The quality of the bits chosen by the AIC-Max algorithm was verified in a classification experiment conducted for the ligands of the five underlying serotonin receptors. As a classification method, the random forest technique [19] implemented in the randomForest R package was used because it is known to be one of the state-of-the-art approaches in activity prediction [6]. The accuracy of classification was evaluated via the Matthews Correlation Coefficient (MCC), a well-known validation measure that is especially suited to imbalanced datasets. This measure is defined as [37]:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP stands for the number of true positives (actives labeled as actives), TN for true negatives, FP for false positives (inactives labeled as actives) and FN for false negatives. MCC takes values from −1 to +1: +1 represents perfect prediction, 0 represents random prediction and −1 represents an inverse prediction.
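The MCC formula translates directly into code. The sketch below is illustrative; returning 0 when the denominator vanishes is a common convention and an assumption here, not something stated in the text, and the confusion-matrix counts are made up.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical dataset with 10 actives and 90 inactives (the 1:9 ratio).
perfect = mcc(tp=10, tn=90, fp=0, fn=0)
inverse = mcc(tp=0, tn=0, fp=90, fn=10)
chance_like = mcc(tp=5, tn=45, fp=45, fn=5)
all_inactive = mcc(tp=0, tn=90, fp=0, fn=10)
```

The last case shows why MCC is preferred for imbalanced data: a classifier that labels every compound as inactive reaches 90% accuracy on this toy dataset, yet its MCC is 0.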
The experiment also followed a 10-fold cross-validation procedure; a training set was used for the selection of bits and the training of a classifier, which was then evaluated on a test set. In each fold, the AIC-Max algorithm was run on a merged set of actives, inactives and putative inactives to enforce the generality of the representation. The classifier, on the other hand, was trained and tested separately on compounds of proven activity and on datasets containing active and putative inactive compounds.
The addition of new features leads to a statistical improvement of the classification results (Fig 4). The highest increase was reported for representations including fewer than 20 bits. For a higher number of features, the classification accuracy changes only slightly. Because the gain in MCC value for representations containing more than 100 bits is negligible, longer representations were not taken into further consideration.
The classification performance of the representations containing 25, 50 and 100 bits was then compared with that of the original (raw) fingerprints (Tables 5 and 6). The reduced representations including 100 as well as 50 bits outperformed existing fingerprints on all receptors when putative inactive compounds were used. This case is considered the most important one because it [...] or inactivity (Table 7). Although the reduced representation gave a significantly lower result, MCC = 0.663, it performed better than any of the non-hashed fingerprints. In the case of putative inactives, the performance of the constructed representation was slightly better than that of the MACCS and Extended fingerprints.
To complement the study and to investigate the discriminative power of the Extended fingerprint more deeply, we also considered a representation created from all fingerprints (Table 3), including hashed ones but excluding KRFP. The results (Table 7) showed that the addition of bits from the hashed fingerprints significantly improved the statistics and gave almost ideal separation of actives from inactives.
Analogous experiments were also conducted for four other families of biological targets: carbonic anhydrases, cathepsins, histamine receptors and kinases (see S1 File).

Conclusion
The paper introduced the AIC-Max algorithm as a method for fingerprint reduction and hybridization. The algorithm iteratively picks features uncorrelated among themselves to maximize AIC, a modified version of mutual information. In the present study, the algorithm was applied to construct an essential representation of ligands of five families of closely related targets. Such a representation can compete with raw fingerprints in classification experiments while significantly reducing CPU time. The obtained results confirm that existing fingerprints contain much irrelevant information that may negatively influence screening performance. The conducted experiments indicate that the generation and application of a reduced and hybridized fingerprint allow rapid and effective calculations. The power of the methodology is underlined by the presence, in the universal representation, of bits that encode the most important structural features of serotonin receptor ligands: a polarizable nitrogen atom and an aromatic system.
Supporting Information
S1 File. The additional file, which can be retrieved from http://www.ii.uj.edu.pl/~smieja/aic, contains the full list of the 100 most informative bits selected from four non-hashed fingerprints for five GPCR receptors (Table A in S1 File) and the results of experiments conducted for the families of carbonic anhydrases (Tables B, F, J and K in S1 File), cathepsins (Tables C, G, L and M in S1 File), histamine receptors (Tables D, H, N and O in S1 File) and kinases (Tables E, I, Q and P in S1 File). (PDF)

Table 7. Classification performance on a dataset containing active and inactive compounds of the 5-HT1A receptor (middle column) as well as actives and putative inactives (last column). The reduced representation was constructed from four non-hashed fingerprints based on five biological targets (first 3 rows). The reduced representation from all fingerprints (except KRFP) was also evaluated (last row).

Fig 1. Exemplary hashed (A) and non-hashed (B) fingerprints. Presence of "1" and "0" corresponds to the presence or absence of a particular pattern, respectively. In the case of the hashed fingerprint (A), the bit collision phenomenon is presented: one bit encodes more than one motif. doi:10.1371/journal.pone.0146666.g001

Fig 2. The relationship between the number of bits selected by the AIC-Max algorithm and the information related to activity. The information, measured by AIC Eq (1), was averaged over all datasets used in the underlying study. doi:10.1371/journal.pone.0146666.g002

Fig 3. The relationship between the number of bits selected by the AIC-Max algorithm and the associated information about activity. The information score was measured by the normalized mutual information calculated for the constructed representations for every receptor, averaged over all folds and reported on a test set. doi:10.1371/journal.pone.0146666.g003

Table 4. The summary of datasets used in the selection process.

Table 6. Classification performance on a dataset containing actives and putative inactives.