Automatic Context-Specific Subnetwork Discovery from Large Interaction Networks

Genes act in concert via specific networks to drive various biological processes, including progression of diseases such as cancer. Under different phenotypes, different subsets of the gene members of a network participate in a biological process. Single gene analyses are less effective in identifying such core gene members (subnetworks) within a gene set/network, as compared to gene set/network-based analyses. Hence, it is useful to identify a discriminative classifier by focusing on the subnetworks that correspond to different phenotypes. Here we present a novel algorithm to automatically discover the important subnetworks of closely interacting molecules to differentiate between two phenotypes (context) using gene expression profiles. We name it COSSY (COntext-Specific Subnetwork discoverY). It is a non-greedy algorithm and thus unlikely to have local optima problems. COSSY works for any interaction network regardless of the network topology. One added benefit of COSSY is that it can also be used as a highly accurate classification platform which can produce a set of interpretable features.


Automatic Context-Specific Subnetwork Discovery from Large Interaction Networks (Supporting Information)
Ashis Saha, Aik Choon Tan, Jaewoo Kang

S1. Estimated t-score
Welch's t-test is a widely used metric for measuring the differential expression of a probe or gene (see Eq. S1).

$$t = \frac{\bar{X}_+ - \bar{X}_-}{\sqrt{\dfrac{\sigma_+^2}{n_+} + \dfrac{\sigma_-^2}{n_-}}} \qquad \text{(S1)}$$

where $\bar{X}_+$ and $\bar{X}_-$ are the mean expressions, $\sigma_+$ and $\sigma_-$ are the standard deviations, and $n_+$ and $n_-$ are the numbers of samples of the positive and negative classes, respectively. The higher $|t|$ is, the higher the differential power. We use a slightly modified version of the t-test to reduce the influence of noise in the microarray data: we use the median instead of the mean, and we estimate the standard deviation from the interquartile range (IQR), which is defined as the difference between the upper and lower quartiles. About 50% of the data lie within $\tfrac{1}{2}$ IQR of the median. Our estimation comes from the empirical rule: about 68.2% of the values of a normal distribution lie within 1 standard deviation of the mean. The estimated standard deviation ($\hat{\sigma}$) is given by Eq. S2. Our estimated t-test score ($\hat{t}$) is therefore given by Eq. S3.

$$\hat{t} = \frac{\tilde{X}_+ - \tilde{X}_-}{\sqrt{\dfrac{\hat{\sigma}_+^2}{n_+} + \dfrac{\hat{\sigma}_-^2}{n_-}}} \qquad \text{(S3)}$$

where $\tilde{X}_+$ and $\tilde{X}_-$ are the median expressions, $\hat{\sigma}_+$ and $\hat{\sigma}_-$ are the estimated standard deviations, and $n_+$ and $n_-$ are the numbers of samples of the positive and negative classes, respectively. The higher $|\hat{t}|$ is, the higher the differential power. We sort the probes of each MIS by the absolute value of the estimated t-score ($|\hat{t}|$) in decreasing order and select the top five probes as the representative probeset for the corresponding MIS.
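The scoring and probe selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formula of Eq. S2 is not legible in this text, so the sketch substitutes the standard normal-distribution relation IQR ≈ 1.349 σ as a stand-in, and the function names (`estimated_sd`, `estimated_t`, `top_probes`) and the example data are hypothetical.

```python
import math
import statistics

# Assumption (stand-in for Eq. S2, not taken from the source):
# for a normal distribution, IQR is about 1.349 standard deviations.
IQR_PER_SIGMA = 1.349


def estimated_sd(values):
    """Estimate the standard deviation from the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / IQR_PER_SIGMA


def estimated_t(pos, neg):
    """Welch-style score with medians and IQR-based deviations (form of Eq. S3)."""
    numerator = statistics.median(pos) - statistics.median(neg)
    variance_term = (estimated_sd(pos) ** 2 / len(pos)
                     + estimated_sd(neg) ** 2 / len(neg))
    # Raises ZeroDivisionError if both classes have an IQR of zero.
    return numerator / math.sqrt(variance_term)


def top_probes(pos_expr, neg_expr, k=5):
    """Sort the probes of one MIS by |t| in decreasing order and keep the top k."""
    scores = {p: abs(estimated_t(pos_expr[p], neg_expr[p])) for p in pos_expr}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical expression values for two probes of a single MIS.
pos_expr = {"probe_a": [5.0, 6.0, 7.0, 8.0, 9.0],
            "probe_b": [1.0, 2.0, 3.0, 4.0, 5.0]}
neg_expr = {"probe_a": [1.0, 2.0, 3.0, 4.0, 5.0],
            "probe_b": [1.5, 2.5, 3.5, 4.5, 5.5]}
print(top_probes(pos_expr, neg_expr, k=1))
```

Because the median and the IQR depend only on the middle of the sorted values, a single extreme measurement shifts this score far less than it would shift the mean-and-variance form of Eq. S1.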

S2. Binary Vote
Binary voting is applied when the voting weights for both classes are equal, which should be very infrequent. In the binary voting system, each top MIS casts a vote in favor of either the positive or the negative class, i.e., the voting weight for each class is either 1 or 0. As in weighted voting, which is described in the main paper, the binary vote for a new sample is determined from the closest cluster. The majority class in the closest cluster gets the total vote (weight = 1). If $P_c > N_c$, then $W_i(\text{positive}) = 1$ and $W_i(\text{negative}) = 0$. Similarly, if $P_c < N_c$, then $W_i(\text{positive}) = 0$ and $W_i(\text{negative}) = 1$. If $P_c = N_c$, the voting weight is determined in the same way from the normalized numbers of positive and negative samples in the next closest cluster to $x_{new}$, and so on. If $T$, the number of voting MISs, is odd, then $W(\text{positive})$ and $W(\text{negative})$ in binary voting will never be equal. If $W(\text{positive}) > W(\text{negative})$, the class predicted by binary voting, $\text{binary}(x_{new})$, is positive; otherwise, it is negative.
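The binary voting rule can be illustrated with a minimal Python sketch. It assumes that, for each voting MIS, the clusters have already been ordered from closest to farthest from the new sample and are given as pairs of normalized positive and negative sample counts. The function names and the input layout are hypothetical, and the handling of an MIS whose clusters all tie is an assumption, since the text does not cover that case.

```python
def mis_binary_vote(clusters_by_distance):
    """Return (W_i(positive), W_i(negative)) for one MIS.

    clusters_by_distance: list of (p_norm, n_norm) pairs, closest cluster first.
    """
    for p_norm, n_norm in clusters_by_distance:
        if p_norm > n_norm:
            return (1, 0)
        if p_norm < n_norm:
            return (0, 1)
        # Tie: fall through to the next closest cluster.
    # Assumption: an MIS whose clusters all tie casts no vote.
    return (0, 0)


def binary_predict(per_mis_clusters):
    """Sum the binary votes of all voting MISs and return the predicted class."""
    w_positive = 0
    w_negative = 0
    for clusters in per_mis_clusters:
        vote_pos, vote_neg = mis_binary_vote(clusters)
        w_positive += vote_pos
        w_negative += vote_neg
    return "positive" if w_positive > w_negative else "negative"


# Hypothetical input: three voting MISs; the second one ties in its closest
# cluster and is decided by its next closest cluster.
example = [
    [(0.6, 0.4)],
    [(0.5, 0.5), (0.2, 0.8)],
    [(0.7, 0.1)],
]
print(binary_predict(example))
```

With three voting MISs and no unresolved ties, the two totals cannot be equal, which is the point the text makes about an odd $T$.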

S3. Dataset Download Sources

S4. An Experiment with the Appropriate Range
We set the appropriate range, [minRange, maxRange], to generate the molecular interaction subnetworks. We experimented with different ranges and chose the range producing the highest average LOOCV accuracy over the datasets. Initially, we set minRange to 3, 5, or 7 and maxRange to 15, 20, or 25 for KEGG, and minRange to 5, 7, or 10 and maxRange to 15, 20, or 25 for STRING. Later, we expanded the list of ranges based on the observed results. The results of the MISs with different appropriate ranges using KEGG and STRING are shown in Tables S2 and S3, respectively.

'Appro. Range' denotes appropriate range. * The optimal appropriate range, producing the highest average LOOCV accuracy, is shown in bold font.

Figure S1. An illustration of the MIS generation. Let us consider one example of a connected molecular interaction network (MIS) and its community dendrogram, as shown in the figure. Let the appropriate range be [5, 10]. The size (the total number of leaf nodes) of the dendrogram is 24. As this is greater than maxRange (10), we divide the dendrogram by removing edge $E_1$, which leaves two dendrograms (A-Q and R-X). The size of the right dendrogram (R-X) is 7 (5 ≤ 7 ≤ 10), so we take it as an appropriate community ($C_3$). However, because the size of the left dendrogram (A-Q) is above 10, we divide it again by removing edge $E_2$. We have to divide the dendrogram further by removing edge $E_3$. Thus we get four parts of the original community dendrogram: $C_1$, $C_2$, $X_1$, and $C_3$. The sizes of three of them ($C_1$, $C_2$, and $C_3$) fall within the appropriate range [5, 10], so we take them as appropriate communities. However, because the size of $X_1$ is less than 5, we discard it. We then assign the nodes in $X_1$ (N, O, P, and Q) individually to their closest communities in the original network. P is 1 hop away from $C_1$, so P is merged with $C_1$; N and O are 1 hop away from $C_2$, so they are merged with $C_2$. In the next iteration, Q is merged with its closest community, $C_2$.
Thus, we get three MISs: $C_1$ (A, B, C, D, E, F, P), $C_2$ (G, H, I, J, K, L, M, N, O, Q), and $C_3$ (R, S, T, U, V, W, X).

Figure S2. Topology of two out-of-size MISs generated with an appropriate range of 5-15 from the STRING network. A) The MIS with 46 nodes has a star topology. B) The MIS with 55 nodes is too dense.

U80040_at (ACO2), V00572_at (PGK1), X07834_at (SOD2), Z68129_cds1_at (IDH3G), X65965_s_at (SOD2)
10  AB003177_at (PSMD9), D26599_at (PSMB2), D38047_at (PSMD8), D78151_at (PSMD2), X71874_cds1_at (PSMB10)
11
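The divide-and-merge procedure walked through in the Figure S1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the dendrogram is given as nested tuples with string leaves and the network as a list of undirected edges. The function names, the rule of merging a leftover node into the community it has the most links to (ties going to the earlier community), and the example edges are all assumptions made for the sketch.

```python
def leaves(tree):
    """Return the leaf labels of a nested-tuple dendrogram (a leaf is a str)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def split(tree, min_range, max_range, communities, leftovers):
    """Divide the dendrogram until each part is no larger than max_range."""
    members = leaves(tree)
    if len(members) > max_range and not isinstance(tree, str):
        for child in tree:
            split(child, min_range, max_range, communities, leftovers)
    elif len(members) >= min_range:
        communities.append(set(members))   # size within the appropriate range
    else:
        leftovers.update(members)          # too small: reassign node by node


def generate_mis(tree, edges, min_range, max_range):
    """Return communities after merging undersized parts into neighbors."""
    communities, leftovers = [], set()
    split(tree, min_range, max_range, communities, leftovers)
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    while leftovers:
        # Each iteration only uses community membership from its start, so
        # nodes 1 hop away merge first and farther nodes merge in later rounds.
        snapshot = [set(c) for c in communities]
        merged = set()
        for node in sorted(leftovers):
            links = [len(neighbors.get(node, set()) & c) for c in snapshot]
            best = max(links, default=0)
            if best > 0:
                communities[links.index(best)].add(node)
                merged.add(node)
        if not merged:
            break  # remaining nodes touch no community
        leftovers -= merged
    return communities


# Hypothetical dendrogram and edges with the same part sizes as Figure S1.
tree = ((tuple("ABCDEF"), (tuple("GHIJKLM"), tuple("NOPQ"))), tuple("RSTUVWX"))
edges = [("P", "A"), ("N", "G"), ("O", "H"), ("Q", "N")]
for community in generate_mis(tree, edges, 5, 10):
    print(sorted(community))
```

Note that merging leftover nodes can push a community past maxRange, which is one way an out-of-size MIS could arise in a sketch like this.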