Discovering Associations in Biomedical Datasets by Link-based Associative Classifier (LAC)

Associative classification mining (ACM) can be used to provide predictive models with high accuracy as well as interpretability. However, traditional ACM ignores the differences in significance among the features used for mining. Although weighted associative classification mining (WACM) addresses this issue by assigning different weights to features, most implementations can only be utilized when pre-assigned weights are available. In this paper, we propose a link-based approach to automatically derive weight information from a dataset itself, using link-based models which treat the dataset as a bipartite graph. By combining this link-based feature weighting method with a traditional ACM method, classification based on associations (CBA), a Link-based Associative Classifier (LAC) is developed. We then demonstrate the application of LAC to biomedical datasets for association discovery between chemical compounds and bioactivities or diseases. The results indicate that the novel link-based weighting method is comparable to the support vector machine (SVM) and RELIEF methods and is capable of capturing significant features. Additionally, LAC is shown to produce models with high accuracy and to discover interesting associations which may otherwise remain unrevealed by traditional ACM.


Introduction
Chemical and biological data contain information about various characteristics of compounds, genes, proteins, pathways and diseases. Thus a wide spectrum of data mining methods is used to identify relationships in these large and multidimensional datasets and to generate predictive models with high accuracy and interpretability. Recently, associative classification mining (ACM) has been widely used for this purpose [1][2][3][4]. ACM is a data mining framework that uses association rule mining (ARM) techniques to construct classification systems, also known as associative classifiers. An associative classifier consists of a set of classification association rules (CARs) [5] of the form X → Y, whose right-hand side Y is restricted to the class attribute. X → Y can be simply interpreted as "if X then Y". ARM was introduced by Agrawal et al [6] to discover rules which satisfy user-specified constraints, denoted respectively by the minimum support (minsup) and minimum confidence (minconf) thresholds. Given a dataset with each row representing a compound, where each column (called an item, feature or attribute) is a test result of this compound on a tumor cell line and every compound is labeled with the class active or inactive, a possible classification association rule is {MCF7 inactive, HL60(TB) inactive} → inactive with support = 0.6 and confidence = 0.8. This particular rule states that when a compound is inactive to both the MCF7 cell line and the HL60(TB) cell line, it tends to be inactive. The support, which is the probability of a compound being inactive to both MCF7 and HL60(TB) and being classified as inactive, is 0.6; the confidence, which is the probability of a compound being classified inactive given that it is inactive to both MCF7 and HL60(TB), is 0.8. In ACM, the relationship between attributes and class is based on the analysis of their co-occurrences within the database, so it can reveal interesting correlations or associations among them.
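The support and confidence computations described above can be sketched in a few lines. The compound names and values below are hypothetical toy data chosen for illustration (they do not reproduce the 0.6/0.8 figures from the text exactly):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Each transaction: a compound's discretized test results plus its class label.
compounds = [
    {"MCF7:inactive", "HL60:inactive", "class:inactive"},
    {"MCF7:inactive", "HL60:inactive", "class:inactive"},
    {"MCF7:inactive", "HL60:inactive", "class:inactive"},
    {"MCF7:inactive", "HL60:inactive", "class:active"},
    {"MCF7:active",  "HL60:inactive", "class:active"},
]

lhs = {"MCF7:inactive", "HL60:inactive"}
rhs = {"class:inactive"}
print(round(support(lhs | rhs, compounds), 2))   # 0.6
print(round(confidence(lhs, rhs, compounds), 2)) # 0.75
```

Here the rule {MCF7 inactive, HL60 inactive} → inactive covers 3 of 5 compounds (support 0.6) and holds in 3 of the 4 compounds matching its left-hand side (confidence 0.75).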
For this reason, it has been applied to the biomedical domain especially to address gene expression relations [7][8][9][10][11], protein-protein interactions [12], protein-DNA interactions [13], and genotype and phenotype mapping [14] inter alia.
Traditional ACM does not consider feature weight; therefore all features are treated identically, namely with equal weight. However, in reality, the importance of features/items differs. For instance, {beef} → {beer} with support = 0.01 and confidence = 0.8 may be more important than {chips} → {beer} with support = 0.03 and confidence = 0.85, even though the former has lower support and confidence: the items in the first rule carry more profit per unit sale, so they are more valuable. Wang et al [15][16][17] proposed a framework called weighted association rule mining (WARM) to address the importance of individual attributes. The main idea is that a numerical weight can be assigned to every attribute to represent its significance. For example, {Hypertension = yes, age > 50} → Heart_Disease with weights {Hypertension = yes, 0.8} and {age > 50, 0.3} is a rule mined by WARM: the importance of hypertension and of age > 50 to heart disease differs, denoted by the values 0.8 and 0.3 respectively. The major difference between ARM and WARM is how the support is computed, and several frameworks have been developed to incorporate weight information into the support calculation [15][16][17][18][19][20][21][22]. Studies on WARM have been carried out using pre-assigned weights; nonetheless, most datasets do not contain such pre-assigned weight information.
In machine learning, feature selection and feature weighting are broadly used to deal with the significance of features and to derive weight information automatically from the dataset itself. Feature selection selects a subset of relevant features by removing features of low significance; feature weighting approximates the optimal degree of influence of individual features. Feature weighting preserves all features by assigning smaller weights to relatively insignificant ones; it has the advantage of taking all features into account and of not requiring a search for an appropriate cut-off threshold [23]. In some circumstances it may be the only option, namely when eliminating features with a low contribution to classification is inappropriate. In particular, to understand the overall relationship between genes and a disease, a small subset of genes, although having good predictive ability, may not have sufficient discriminating power [24]. Like feature selection, feature weighting approaches fall into two categories: 1) filter methods, which are performed in a preprocessing step before modeling; 2) wrapper methods, which are iterative and generally use the same learning algorithm as the modeling itself, with the evaluated relevancy used for feature weighting. Usually, wrapper methods perform better than filter methods, while filter methods are faster and cheaper.
Sun et al. [25] proposed a link-based filter feature weighting approach. The weights are derived from the dataset itself by extending Kleinberg's HITS (Hyperlink-Induced Topic Search) model [26] and algorithm to bipartite graphs. HITS and PageRank are the two major link-based ranking algorithms. PageRank was developed by Brin and Page [27] and has been used commercially with great success in the Google search engine. HITS ranks webpages by analyzing their in-links and out-links: webpages pointed to by many other pages are defined as ''authorities'', while webpages linking to many other pages are called ''hubs''. HITS emphasizes the notion of ''mutual reinforcement'' between authorities and hubs. Its intuitive interpretation is that a good ''authority'' is pointed to by many good ''hubs'' and a good ''hub'' points to many good ''authorities''. PageRank uses a very similar idea, that a ''good'' webpage should be linked to by, or link to, other ''good'' webpages; but unlike the mutual reinforcement approach, it focuses on hyperlink weight normalization and web surfing based on random walk models. Both approaches have pros and cons. The computation of PageRank is stable and its behavior is well defined due to its probabilistic interpretation. Furthermore, PageRank can be used on large page collections because, even though the larger communities affect the final ranking, they do not overwhelm the small ones. In contrast, HITS is not stable and cannot be applied to large page collections, since only the largest web community influences the final ranking; however, it can capture the relationships among webpages in more detail [28]. Hence, an algorithm capable of integrating both HITS and PageRank may improve Sun's weighting method.
The general PageRank cannot be applied to bipartite graphs, as it produces different rankings for webpages with the same in-links [29]; as a result, a better ranking scheme integrating PageRank and HITS is needed for bipartite graphs [30]. SALSA (stochastic approach for link-structure analysis) [31][32][33] combines the random surfer model of PageRank with the hub/authority principle of HITS. It generates a bipartite undirected graph H based on the web graph G: one subset of H contains all nodes with positive in-degree (the potential ''authorities'') and the other subset consists of all nodes with positive out-degree (the potential ''hubs''). Each traversal is a two-step random walk, for example from a ''hub'' to an ''authority'' and from the ''authority'' back to a ''hub''. As in PageRank, each individual walk is a Markov process with a well-defined transition probability matrix [31]. Nevertheless, SALSA does not really implement the ''mutual reinforcement'' of HITS, because the authority and hub scores are not related by hub-to-authority and authority-to-hub reinforcement operations; its score propagation differs from that of HITS (it is a similarity-mediated score propagation). Moreover, its random walk model does not directly simulate the behavior of the surfer in PageRank either: in SALSA, a surfer can jump from webpage p_i to p_j even though there is no hyperlink between them, and there are no link-interrupt jumps. Based on a similar approach to SALSA, Ding et al proposed a unified framework integrating HITS and PageRank [34]. Figure 1 shows that a database can equally be represented by a bipartite graph [25]: the table layout on the left can be represented by the bipartite graph on the right, and compounds and features linked to each other can be viewed as webpages.
As a consequence, the link-based algorithms used to rank webpages, such as HITS or PageRank, can be utilized to rank compounds or features. These algorithms hold that if a webpage has many important links pointing to it, the links from it to other webpages become important too. In our case, this means that a highly weighted compound should contain many highly weighted features, and a highly weighted feature should exist in many highly weighted compounds. Accordingly, the ranking score can be used for feature weighting. Although Ding's unified framework can derive the ranking score automatically, it cannot distinguish the contributions of different types of connections. In chemical dataset mining, each chemical feature may connect to both active and inactive compounds; in biological dataset mining, each gene may connect to a disease either as a suppressor or as an activator. Chemical features occurring frequently in active compounds, or genes mainly associated with suppression, are of greater interest. In Figure 1, when we consider the contribution of compounds to the weight of node/attribute 78, we want to distinguish the contribution of compound 5469540 from the contributions of compounds 840827 and 5911714. Ding's unified framework treats the contributions of the nodes equally, as a homogenous system [34]; Chen et al developed a framework calculating weights for either homogenous or heterogeneous systems [35]. In Chen's model, connections can have different impacts on a node.
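The mutual-reinforcement intuition described above can be sketched on a toy compound-by-feature incidence matrix. The matrix below is hypothetical; the point it illustrates is that two features with identical frequency can receive different link-based weights, depending on how well-connected the compounds containing them are:

```python
import numpy as np

# Hypothetical compound-by-feature incidence matrix (rows: compounds,
# cols: structural features). A 1 means the compound contains the feature.
L = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 0, 1]], dtype=float)

c = np.ones(L.shape[0])  # compound scores (the "hubs")
f = np.ones(L.shape[1])  # feature scores (the "authorities")
for _ in range(50):         # mutual reinforcement, HITS-style
    f = L.T @ c             # a feature inherits weight from its compounds
    c = L @ f               # a compound inherits weight from its features
    f /= np.linalg.norm(f)  # normalize to keep scores bounded
    c /= np.linalg.norm(c)

# Features 0 and 2 both occur in three compounds, but feature 0
# co-occurs with better-connected compounds, so it ranks highest.
print(np.argsort(-f))  # [0 2 1]
```

This is exactly the behavior frequency-based weighting cannot provide: it would tie features 0 and 2.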
In this paper, we describe a link-based unified weighting framework which combines the mutual reinforcement of HITS with the hyperlink weight normalization of PageRank, based on Ding's and Chen's frameworks, resulting in highly efficient link-based weighted associative classifier mining from biomedical datasets without pre-assigned weight information.
Our main contributions are: 1) development of a novel link-based weighting scheme for mining biomedical datasets; 2) implementation of a novel link-based associative classifier by combining the feature weighting method, weighted association rule mining (WARM) and the CBA algorithm [5]; 3) application of this method to two important biomedical datasets.
In the following sections, the dataset, link-based feature weighting, WARM and algorithm of LAC will be discussed, followed by the application of LAC to two datasets. In the end, we present our conclusions and future work.

Data Set
LAC is applied to two datasets: a) the Ames mutagenicity dataset [36]; b) the NCI-60 tumor cell line dataset [37]. The Ames dataset contains 6,512 compounds provided in SMILES format and has been benchmarked with SVM, Random Forests, k-Nearest Neighbors, and Gaussian Processes. The authors used 5-fold cross validation to evaluate the generated models; the area under the ROC curve (AUC), which ranges from 0.79 to 0.86, is used to assess performance. The GI50 data of NCI-60, which is the concentration of an anti-cancer drug that inhibits the growth of cancer cells by 50%, is used and processed as follows. First, among the 60 tumor cell lines, IGR-OV1, MDA-MB-468 and MDA-N are removed due to too many missing values. Then, compounds having missing values are also discarded. The final dataset includes 5,937 compounds with 57 bioassay results in total. For the Ames dataset, if a compound is positive, it is carcinogenic; for NCI-60, a compound is ''active'' only if its GI50 is greater than 5.

MDL Public Keys
The MDL public key set, also called the MACCS key set, is a 166-bit string with each bit encoding a predefined chemical structure feature. MDL public keys are extensively used in biomedical research due to their relatively high performance and the one-to-one map between structural feature and fingerprint [37,38]. The fingerprint is computed using the CDK [39] software package and reformatted for LAC.

Bio Fingerprint
Bioassay readouts have been used as features (''biospectra'' or ''bio fingerprint'') for data mining in several studies and have produced high quality models [40,41]. These bioactivity profiles link potential targets with chemical compounds and provide insights into the relationships among diseases, compounds and bioactivities. In this study, the results of related bioassay analyses are used as features for the classification of chemical compounds. Each GI50 value is transformed into ''active'' (GI50 greater than or equal to 5) or ''inactive'' (GI50 less than 5). T-47D is used as the class label and the results from the other cell lines are used as features.
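The discretization step above amounts to thresholding each GI50 readout at 5. A minimal sketch, with hypothetical readout values:

```python
# Discretizing GI50 readouts into the binary bio fingerprint used as
# features (threshold of 5, as described above). Values are hypothetical.
def discretize(gi50):
    return "active" if gi50 >= 5 else "inactive"

readouts = {"MCF7": 6.2, "HL60(TB)": 4.1, "SK-MEL-2": 5.0}
fingerprint = {cell: discretize(v) for cell, v in readouts.items()}
print(fingerprint)
# {'MCF7': 'active', 'HL60(TB)': 'inactive', 'SK-MEL-2': 'active'}
```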
For each of the 6,512 compounds in the Ames data, we attempt to predict whether it is carcinogenic or not based on the MDL public keys. For the 5,937 compounds in NCI-60, we first use the bio fingerprint to predict whether they are agonists or antagonists to the T-47D cell line. Then, for the 3,199 compounds in the NCI-60 dataset having 2D structures available in the downloaded structure file, a hybrid fingerprint is generated by combining the MDL public keys and the bio fingerprint to build models.
Let L = (L_ij) be the adjacency matrix of the web graph G = (V, E), where V is the set of webpages and E is the set of links between them: L_ij = 1 if page i links to page j and L_ij = 0 otherwise. L^T is the transpose of L. If the graph is directed, the in-degree matrix D_in and the out-degree matrix D_out are also defined, given by D_in = diag(d_in) and D_out = diag(d_out).
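The graph quantities just defined can be illustrated on a small hypothetical directed graph:

```python
import numpy as np

# L[i, j] = 1 iff page i links to page j (hypothetical 4-page web graph).
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d_out = L.sum(axis=1)   # out-degree of each page (row sums)
d_in  = L.sum(axis=0)   # in-degree of each page (column sums)
D_out = np.diag(d_out)  # D_out = diag(d_out)
D_in  = np.diag(d_in)   # D_in  = diag(d_in)

print(d_out)  # [2. 1. 2. 1.]
print(d_in)   # [1. 1. 3. 1.]
```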

HITS
In HITS, the vectors x = (x_1, x_2, …, x_n)^T and y = (y_1, y_2, …, y_m)^T represent the authority and hub scores respectively. HITS defines the recursive equations:
x^(k) = L^T y^(k−1)  (1)
y^(k) = L x^(k)  (2)
where k ≥ 1, y^(0) = e, e is a vector of all 1s, and x^(k) denotes the k-th iteration. Equation 1 says that authoritative pages are those linked to by good hub pages, and equation 2 says that good hubs are pages that link to authoritative pages. The equations can be rewritten as:
x^(k) = (L^T L) x^(k−1)  (3)
y^(k) = (L L^T) y^(k−1)  (4)
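The HITS iteration of equations 1 and 2 can be sketched directly. The graph below is hypothetical, and a normalization step (standard in HITS implementations) is added to keep the scores bounded:

```python
import numpy as np

# HITS iteration (equations 1 and 2) on a hypothetical 3-page graph.
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

y = np.ones(L.shape[0])          # y^(0) = e (hub scores)
for _ in range(100):
    x = L.T @ y                  # (1) authorities scored by their hubs
    y = L @ x                    # (2) hubs scored by their authorities
    x /= np.linalg.norm(x)       # normalize for numerical stability
    y /= np.linalg.norm(y)

# Page 2 has the most in-links and ends up with the top authority score.
print(np.argmax(x))  # 2
```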

PageRank
In PageRank, given x = (x_1, x_2, …, x_n)^T, x_i is the PageRank of page i; the recursive PageRank equation is defined in matrix notation as:
x^(k) = P^T x^(k−1)  (5)
where P = (P_ij) is a stochastic matrix (each row of P sums to 1) with P_ij = 1/o_i if page i links to page j, o_i being the out-degree of page i. P^T can be expressed as:
P^T = L^T D_out^{−1}  (6)
If both the link-tracking jump and the link-interrupt jump are considered, the full transition probability can be written as:
x^(k) = [α L^T D_out^{−1} + (1 − α)(1/n) e e^T] x^(k−1)  (7)
where α is the damping factor between 0 and 1.
As in SALSA, if the web graph is transformed into a bipartite graph, the above x becomes the authority score and the hub score y can be defined as:
y^(k) = L D_in^{−1} x^(k)  (8)
Comparing the equations of HITS and PageRank (equations 1 & 2 versus 5 & 8), it is possible to derive a unified framework combining the advantages of both. Chen's model [35] divided the web pages into homogenous and heterogeneous systems, so the authority and hub scores contain the reinforcement of links from both systems; different weights can be assigned to the homogenous or heterogeneous systems to adjust the importance of their links in the final ranking. Similarly, in our case the nodes, such as compounds, are classified as active/inactive or positive/negative, and the dataset is thus converted to a heterogeneous system. Relatively higher weight values can be assigned to the active/positive compounds to promote their importance in the final feature weighting. Our link-based framework can be written accordingly, where a represents the ''active'' system and b the ''inactive'' system: links to the active system carry a weight β and links to the inactive system carry (1 − β) (in the degree matrices D_in and D_out, β and (1 − β) are replaced by their square roots). The choice of β has an impact on the accuracy and size of the classifiers, along with the rules in the classifiers. Generally, to assign higher weight values to active/positive compounds, β can be any value greater than 0.5. In our study, β is set to 0.9.
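The effect of the class weight β can be illustrated with a simplified sketch. This is not the paper's exact formulation (which places β and 1 − β inside the degree normalization); here, links to "active" compounds are simply scaled by β = 0.9 and links to "inactive" compounds by 1 − β, on a hypothetical incidence matrix:

```python
import numpy as np

# Simplified sketch of class-weighted link-based feature scoring.
beta = 0.9
L = np.array([[1, 1, 0],     # compound 0 (active)
              [1, 0, 1],     # compound 1 (active)
              [0, 1, 1]],    # compound 2 (inactive)
             dtype=float)
active = np.array([True, True, False])

# Per-link weights: beta for links from active compounds, 1-beta otherwise.
link_w = np.where(active[:, None], beta, 1 - beta) * L

f = np.ones(L.shape[1])
for _ in range(50):
    c = L @ f                # compound scores from feature scores
    f = link_w.T @ c         # feature scores, class-weighted
    f /= np.linalg.norm(f)   # normalize to keep scores bounded

# Feature 0 (two active compounds) outranks features 1 and 2 (one
# active, one inactive each), although all occur in two compounds.
print(f.argmax())  # 0
```

Features with identical frequency are separated purely by the class of the compounds they connect to, which is the behavior the heterogeneous weighting is designed to achieve.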

Unified Framework
Based on the comparison of implementations in [34], normalized authority and hub operators A_op and H_op, which combine the mutual reinforcement of HITS with the degree normalization of PageRank, are used. The support of a rule X → Y is the probability of transactions containing both X and Y (X ∪ Y) among all presented cases. An itemset is frequent only if its support satisfies a minimum support threshold h. Additionally, the confidence of the rule is defined as the support of X ∪ Y divided by the support of X, which is the conditional probability that Y is true given X. The process of discovering, pruning, ranking and selecting CARs and applying them to classification is called associative classification.
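The exact operator definitions are those of [34]; one plausible instantiation of a normalized authority operator, shown here only as an illustrative sketch, is A_op = D_in^{−1/2} L^T D_out^{−1} L D_in^{−1/2} (the symmetric normalization of the SALSA authority chain). Its dominant eigenvector is proportional to the square roots of the in-degrees on a connected graph:

```python
import numpy as np

# Illustrative sketch of a normalized authority operator (one plausible
# instantiation; the exponents used in the paper's framework may differ).
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)

d_out, d_in = L.sum(axis=1), L.sum(axis=0)
Dout_inv = np.diag(1.0 / d_out)
Din_half = np.diag(1.0 / np.sqrt(d_in))

# A_op = D_in^{-1/2} L^T D_out^{-1} L D_in^{-1/2}
A_op = Din_half @ L.T @ Dout_inv @ L @ Din_half

x = np.ones(L.shape[1])
for _ in range(200):         # power iteration to the dominant eigenvector
    x = A_op @ x
    x /= np.linalg.norm(x)

# The top-scoring authority is the page with the most in-links (page 2).
print(x.argmax())  # 2
```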

Weighted Associative Classification Mining
For weighted associative classification (WAC) [15][16][17], each feature f_i is associated with a weight w_i ∈ W = {w_1, w_2, …, w_n}. A pair (f_i, w_i) is called a weighted item, and each transaction/compound is a set of weighted items plus the class type. The straightforward definition of the weight W(is) of an itemset is, where is denotes the itemset, is the average weight of its items:
W(is) = (Σ_{f_i ∈ is} w_i) / |is|
The weighted support WS(is) of an itemset is then its weight multiplied by its relative frequency:
WS(is) = W(is) × |S| / |T|
where T is the set of all transactions and S is the set of transactions containing the itemset. In classical associative classification, the differences in significance between items are not taken into account, and it holds that if an itemset is frequent, then all of its subsets are frequent as well; this principle is called the downward closure property (DCP). With weighted support, the DCP does not hold: an itemset may be frequent even though some of its subsets are not, which can be illustrated with the compounds C1-C6, their features and the feature weights (h = 0.3). As shown in Table 3, the supports of {83, 84} and {81, 83} are both 0.27, so they are not frequent. Several frameworks have been proposed to maintain the DCP [15][16][17][18][19][20][21][22][25]. Before introducing the framework, we define the transaction weight of a transaction t as:
TW(t) = (Σ_{f_i ∈ t} w_i) / |t|
We then define the adjusted weighted support as:
AWS(X) = Σ_{t ∈ S} TW(t) / Σ_{t ∈ T} TW(t)
with S and T as above. This definition ensures that if X ⊆ Y then AWS(Y) ≤ AWS(X), since any transaction containing Y also contains X. By using AWS, the DCP is not violated. The discovered association rules are ranked, evaluated and pruned using the CBA approach [5]. The algorithm of PageRank-based associative classification is given in Figures 2 & 3.
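The adjusted weighted support can be sketched directly from the definitions above. The item weights and transactions below are hypothetical (the item names echo the bits 81, 83 and 84 mentioned in the example):

```python
# Sketch of adjusted weighted support (AWS): each transaction is weighted
# by the mean weight of its items, and AWS(X) is the weight of transactions
# containing X divided by the total transaction weight. Data hypothetical.
weights = {"81": 0.9, "83": 0.2, "84": 0.4, "85": 0.7}

transactions = [
    {"81", "83"},
    {"83", "84"},
    {"81", "85"},
    {"84", "85"},
]

def tw(t):
    """Transaction weight: average weight of the items it contains."""
    return sum(weights[i] for i in t) / len(t)

def aws(itemset):
    total = sum(tw(t) for t in transactions)
    covered = sum(tw(t) for t in transactions if itemset <= t)
    return covered / total

# Supersets can never out-support their subsets: AWS({83,84}) <= AWS({83}),
# so the downward closure property holds under AWS.
assert aws({"83", "84"}) <= aws({"83"})
print(round(aws({"83"}), 3), round(aws({"83", "84"}), 3))  # 0.386 0.136
```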
All computations are carried out on a PC with a Q6600 2.4 GHz CPU and 6 GB of memory, running the Windows 7 64-bit operating system. The classifier is implemented in C#. To explore all possible rules, mining is performed with the following settings: MinSup = 20% and MinConf = 70% for the Ames dataset; MinSup = 1% and MinConf = 0% for the NCI-60 dataset. In all experiments, the maximum length of the rules is set to 4 and the maximum number of candidate frequent itemsets is 200,000. On the Ames dataset, the SVM and RELIEF weighting methods are applied for comparison; their weights are computed using RapidMiner 5.1 [42].

Model Assessment and Evaluation
The classification performance is assessed using 10-fold cross validation (CV), because this approach not only provides a reliable assessment of classifiers but also generalizes well to new data. The accuracy of the classification can be determined by evaluation methods such as error rate, recall-precision, any-label and label-weight, etc. The accuracy used here is computed as the ratio of the number of successfully classified cases to the total number of cases in the test dataset. This method has been widely adopted in the assessment of CBA [5], CPAR [47] and CMAR [4].

Comparison of Feature Weight and Rank
The comparison is performed on the Ames dataset, for which identifying features that are good indicators of ''positive'' compounds is preferable; the ''positive'' class is therefore treated as ''active''. The weights generated by LAC are compared to those generated by the frequency of the bits, SVM and RELIEF. Figure 4 shows that the results of RELIEF and SVM are very similar. To confirm this, a correlation analysis is performed with SPSS 19 [43]. Table 4 shows that, at the 0.01 level (2-tailed), SVM and RELIEF, and LAC and frequency, are highly correlated, with coefficients of 0.949 and 0.958 respectively. The coefficients of SVM, RELIEF and LAC with frequency are all greater than 0.75, indicating that all are correlated with frequency; among them, LAC has the strongest correlation with frequency. The differences are mainly caused by bits 3, 8, 11, 36 and 166. For bits 3, 8 and 11, since their frequencies are not 0, both LAC and frequency assign small weight values, while SVM and RELIEF set their weights to 0. On the contrary, the weights of bits 36 and 166 are set to 0 by LAC and frequency but not by SVM and RELIEF. The correlation of LAC and frequency can be explained by the principle of link-based weighting: mutual reinforcement. As expected, the ranks and weights of features differ between LAC and frequency. In Table 5, all features are ordered by ascending weight: 69 features (bold) are promoted and 61 features (*) are demoted, while the rest remain unchanged in LAC. Generally, higher frequency leads to higher ''authority'', resulting in bigger weight (Figure 4). For example, bit 135 has high weight in both frequency and LAC; bits 127 and 141 are much bigger in LAC (red data labels) than in frequency (black data labels), since most of their connections are ''active'' compounds (58.6% and 56.6% respectively). Table 5 lists the rank of the features in each scheme.
The bigger the number, the higher the rank and the more important the feature. Some features (bold) have a relatively low rank by frequency; they may get higher ranks due to promotion from connecting to compounds with higher ''rank'' values. Likewise, features (*) connected to many ''bad'' compounds may be demoted. The promotion or demotion depends on the number and type of a feature's connections.

Comparison of Accuracy of Classification
The average accuracies of frequency, LAC, RELIEF, SVM and CBA are 90.11%, 91.57%, 89.05%, 89.26% and 90.63% respectively (Table 6). The major purpose of WACM is to find more rules containing interesting items, in other words items with higher significance, while trying to achieve high accuracy at the same time. Most current comparisons of performance between WARM and traditional ARM focus on time and space scalability, such as the number of frequent items, number of interesting rules, execution time and memory usage [18][19][20][43][44][45]; the results showed that the differences between WARM and ARM are minor. Comparisons of WACM and traditional ACM are scant due to the lack of easily accessible weighted associative classifiers. Soni et al [46] compared their WACM results with those generated by the traditional ACM methods CBA [5], CMAR [4] and CPAR [47] on three biomedical datasets, and their results showed that WACM offered the highest average accuracy. In our study, among all four weighting schemes and CBA, LAC has the highest accuracy.

Comparison of Classifiers
Ten models are generated for each weighting scheme, and we are interested in the comparison between the classifiers of CBA and LAC. Taking model 1 as an example, there are 30 rules in the classifier of frequency and 132 in that of LAC. Among them, 14 rules are exclusive to the frequency classifier, 116 appear only in the LAC classifier, and 16 rules are shared by both. Table 7 shows that among the top 20 rules, 11 are shared by both classifiers, 9 rules (*) appear only in the classifier of frequency, and none of the top 20 LAC rules (bold) are included in the classifier of frequency. All rules are ordered based on the CBA definition. During classification, the matching of a new compound starts from the first rule and stops immediately upon a hit. As a result, although those 11 rules are in both classifiers, they may have different impacts on the final classification result.

Rule Interpretation
Our recently submitted paper [48] showed that rules generated by associative classification based on chemical fingerprints and properties can be interpreted with chemical knowledge and shed light on molecule design. In this study, we focus on the analysis of association rules generated by LAC using the bio fingerprint (NCI-60 dataset); the analysis for those generated by frequency can be done in the same manner. The accuracies of both frequency and LAC are 99.93% (Table 6) and the average size of the classifier is around 350 rules.
For all ten models, the top 5 rules are the same but with different order, support and confidence. The intuitive explanation of Rule 1 in Table 8 is that if a compound is inactive to MCF7 and HL60(TB), it will be inactive to T47D as well. The adjusted weighted support of this rule is 29.1% and the weighted confidence is 95.9%; among the 5,937 compounds, 1,730 are covered by this rule. All the cell lines in the top 5 rules fall into two categories: a) breast cancer and b) leukemia. On one hand, this means there are many compounds that are inactive to both breast cancer cell lines and leukemia cell lines; on the other hand, it suggests that there might be associations between these two types of cancer. Studies that clustered the cell lines based on their gene expression data [49,50] also indicated that the cell lines of these two categories were clustered together or that their clusters were very close to each other. The association of MCF7 and T47D is not surprising, as they belong to the same category, breast cancer. The rules here may also point to a potential direction for studying drug resistance in breast cancer and leukemia. A novel ABC transporter, the breast cancer resistance protein (BCRP), was discovered and so named because of its identification in MCF-7 human breast carcinoma cells [50][51][52]. Drug-sensitive cells become drug-resistant after transfection or overexpression of BCRP. Relatively high expression of BCRP mRNA was also observed in around 30% of acute myeloid leukemia (AML) cases, suggesting a novel mechanism of drug resistance in leukemia.
A hybrid feature set integrating the chemical fingerprint and the bio fingerprint is generated by combining the MDL public keys and the bio fingerprint. Since we are only interested in compounds which are active against tumor cell lines, the ''inactive'' value of a bioassay is treated as a feature ''not existing'' in the compound; this also helps to treat the chemical fingerprint and the bio fingerprint equally.
The average accuracy of the classification is 99.7% (Table 6). Each rule in the final classifier, for example (A, B → Active), is converted to (A associated-with Active) and (B associated-with Active). All the rules are transformed and plotted with Cytoscape 2.8.2 [53]; for clarity, nodes with degree less than 10 are removed. The top 2 rules in the classifier indicate that compounds containing phosphorus and active to MCF7 or SK-MEL-2 will be active to T-47D too (Table 9). 22 out of 23 compounds match both rules 1 and 2. Among them, the once abandoned drug NSC 280594 (triciribine) has attracted much attention and undergone a phase I trial due to its potential activity against a common cancer-causing protein [53][54][55]. These rules reveal that phosphorus might be an important chemical feature for anti-cancer drugs.
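The rule-to-edge conversion described above, splitting each multi-item antecedent into pairwise association edges for network plotting, can be sketched as follows. The rules listed are hypothetical stand-ins shaped like the top rules in Table 9:

```python
# Converting classifier rules of the form (A, B -> Active) into pairwise
# association edges for network visualization. Rule data hypothetical.
rules = [
    ({"phosphorus", "MCF7:active"}, "Active"),
    ({"phosphorus", "SK-MEL-2:active"}, "Active"),
]

edges = []
for antecedent, consequent in rules:
    for item in sorted(antecedent):
        edges.append((item, consequent))  # (A, B -> C) yields A-C and B-C

for e in edges:
    print(e)
```

The resulting edge list (here, four edges, two of which connect phosphorus to the class node) is what a tool such as Cytoscape would ingest for plotting.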

Conclusions
In this paper, we describe a novel link-based feature weighting framework for datasets without pre-assigned weight information. The algorithm employs a unified framework which integrates the advantages of HITS and PageRank, namely mutual reinforcement and normalized weights, to derive useful weights; it utilizes both connectivity and connection type information. Combined with a weighted support scheme, it offers an effective way to find useful associations by taking into account both the significance of occurrence and the quality of features, the latter captured through their connections to the transactions.
Based on this new weighting scheme, a CBA-based classifier, LAC, is developed and applied to two cases: a chemical-fingerprint featured dataset and a bio-fingerprint featured dataset. Our experimental results show that although the weighting differs from traditional RELIEF and SVM, it is able to capture the important features and afford good results. Especially for sparse datasets, the link-based analysis can discover significant features that would be ignored by other methods.
The link-based classifier discovers interesting associations of bioactivities with chemical features and potential relationships among diseases, for instance the relationship between phosphorus and bioactivity against T47D, and the potential relationship between breast cancer and leukemia. Our next step will be to apply this method to large semantic datasets to mine information from RDF resources such as ChEMBL [56] and KEGG [57].