Estimation of the proteomic cancer co-expression sub networks by using association estimators

In this study, the association estimators, which have significant influences on the gene network inference methods and used for determining the molecular interactions, were examined within the co-expression network inference concept. By using the proteomic data from five different cancer types, the hub genes/proteins within the disease-associated gene-gene/protein-protein interaction sub networks were identified. Proteomic data from various cancer types is collected from The Cancer Proteome Atlas (TCPA). Correlation and mutual information (MI) based nine association estimators that are commonly used in the literature, were compared in this study. As the gold standard to measure the association estimators’ performance, a multi-layer data integration platform on gene-disease associations (DisGeNET) and the Molecular Signatures Database (MSigDB) was used. Fisher's exact test was used to evaluate the performance of the association estimators by comparing the created co-expression networks with the disease-associated pathways. It was observed that the MI based estimators provided more successful results than the Pearson and Spearman correlation approaches, which are used in the estimation of biological networks in the weighted correlation network analysis (WGCNA) package. In correlation-based methods, the best average success rate for five cancer types was 60%, while in MI-based methods the average success ratio was 71% for James-Stein Shrinkage (Shrink) and 64% for Schurmann-Grassberger (SG) association estimator, respectively. Moreover, the hub genes and the inferred sub networks are presented for the consideration of researchers and experimentalists.


Introduction
In bioinformatics, various approaches are leveraged to understand the molecular perturbations observed on cells or tissues caused by a disease, such as cancer, autism, diabetes, Alzheimer's. One of these approaches is inferring the gene networks, which can illustrate the gene-gene and protein-protein interactions from an expression dataset to understand the cell physiology and disease pathogenesis and to estimate the genome-wide working mechanism of proteins and genes [1,2]. In this study, effects of the association estimators on the network inference methods, which are frequently utilized in bioinformatics studies to detect molecular structures related to a given disease, were evaluated via the proteomic data of five different cancer types generated by TCPA [3]. The five cancer types used in this study are among the most common types according to the recently published results by the American Cancer Society [4].
In previous studies, both the synthetic and real gene expression data sets obtained from the microarray assays are mainly used to analyze the association estimators' effect on the gene network inference techniques. Olsen et al. [5] evaluated the performance of Pearson, Spearman, Empirical, Miller-Madow (MM), and Shrink association estimators on three network inference algorithms (Algorithm for the Reconstruction of Accurate Cellular Networks-ARACNE [6], Context-Likelihood of Relatedness-CLR [7], Network inference with maximum relevance/minimum redundancy feature selection-MRNET [8]) by using synthetic and real microarray data sets. In addition, the Empirical, MM and Shrink estimators were used to examine the equal width (EW) and equal frequency (EF) discretization methods' effects on the performance of the association estimators [5]. As a result, MRNET with the Spearman and CLR with the Pearson yielded more successful results on the synthetic dataset, and also significant results were obtained on the real dataset, as well [5]. Simoes and Streib [9] used the MM, ML, Shrink and SG association estimators with three different discretization methods (EW, EF, global equal width-GEW) together with The Conservative Causal Core NETwork (C3NET) [10] network inference algorithm by using the synthetic gene microarray data sets to examine the effects of both the estimators and the discretization methods on the network inference algorithms. They found that the MM estimator outperformed the other estimators when used with the EW discretization method. Daub et al. [11], investigated the performances of the B-spline (BS) and Kernel Density Estimator (KDE) methods on large-scale gene expression data sets. As a result, they found that the performance of the BS estimator varied for different spline values. The order of success is as follows BS (spline order = 3) > KDE > BS (spline order = 1). Kurt et al. [12], used two synthetic and two biological data sets to evaluate 14 association estimators with 3 different network inference algorithms. Also they observed that the influences of the Copula Transform (CT) pre-processing operation on the performance of the association estimators. B-spline, Pearson-based Gaussian and Spearman-based Gaussian estimators are observed as the best performing ones among all. Also CT operation increased inference performances of the estimators for synthetic datasets.
Network inference methods have been used in multiple studies on different cancer types. Şenbabaoğlu et al. [13] analyzed the protein expression data, which was obtained from TCPA, from 3467 patients with 11 different types of cancer by using 13 different network inference methods. The most successful network inference method was varying based on the cancer type. Madhamshettiwar et al. [14] searched the biological mechanisms related to the ovarian cancer using 7 different network inference methods.
WGCNA [15] is one of the most commonly used gene network inference and clustering approaches and is used to find highly correlated gene/protein modules (cluster, sub-network) related to a given disease and hub genes/proteins with high connectivity in these modules. WGCNA has been used to analyze gene/protein expression data from brain cancer [16], yeast cell cycle [17], mouse genetics [18,19], diabetes [20] and many other diseases and complex traits.
It has been reported that the correlation-based methods are not sufficient to estimate the non-linear relationships between the cell molecules such as the complex molecular interactions related to a cancer type [21]. Many algorithms have been developed to solve this problem by using mutual information based methods [6][7][8]10]. We proposed a system integrating the MI-based association estimators with the co-expression network clustering technique to identify the sub networks associated with the diseases. Leveraging the MI-based estimators in WGCNA instead of the correlation-based ones, which are the default and only options presented in the WGCNA, can capture both linear and nonlinear interactions between the protein-protein pairs and improves the estimation of diseaseassociated molecular mechanisms and interactions, while the correlation-based ones can only capture the disease mechanisms based on the linear interactions. To estimate the disease associated sub networks and mechanisms, we leveraged the MI-based estimators within the coexpression network methodology different than the previous studies [13,14] that were interested in global networks and not focusing on the individual sub networks or co-expression modules. Also different discretization methods were used to estimate the association estimators that are widely used in gene network inference, and an evaluation framework was developed to analyze the as-obtained results unlike [13,14]. Besides, we identified the hub genes of the sub networks to reveal the potential key regulators of these sub networks, which were not searched in [6][7][8]. We observed that according to whole cancer datasets used in our study Shrink, SG, BS, MM and Shrink methods identified the most disease-relevant sub networks and genes with average precision scores of 0.85, 0.84, 0.59, 0.76 and 0.76 (for BRCA, GBM, LUSC, KIRC, and SKCM datasets), respectively. Moreover, unlike previous studies [5], [9], [11,12], in our study, five real cancer proteomic data sets that are obtained from TCPA were examined by biological network inference method along with nine different association estimators to compare the association estimators' performance.

Materials
Studies in bioinformatics field have gained a great momentum along with the developments in high-throughput techniques. One of these techniques is the reverse phase protein array (RPPA) [22] technology, which is a high-throughput antibody-based technique and it supplies protein expression data for proteomics research. Sheehan et al. [23] performed an analysis on ovarian cancer data obtained by RPPA. TCPA has obtained protein expression data from numerous cell line and tumor samples using RPPA technique. The detailed information about the process of preparing the data set before the analysis is given in [3].
The RPPA data used in this study was downloaded from Download section (level 4) in TCPA's web site [24]. It is comprised of proteomic expression data of 2230 cancer patients in 5 cancer types. There are more than 218 antibodies for each patient. To obtain the individual gene-protein matching (S1 Table) of the data, a process was performed after taking the relevant information from My Protein section of TCPA's [24] web site. The file containing the expression data of the selected antibodies (proteins) for each cancer type is given (S2-S6 Tables). The type of the cancers, their sample sizes, number of proteins (antibodies), number of selected antibodies and their abbreviations are listed in Table 1. Furthermore, to evaluate the performances of the association estimators, DisGeNET and MSigDB were used as gold standards. DisGeNET curates the gene-disease and genetic variantdisease associations that were reported in previous studies and stored in multiple publically available data sources such as UNIPROT database, the Comparative Toxicogenomics Database, GWAS catalog, and also it contains 429,036 associations between 17,381 genes and 15,093 diseases [25,26]. In MSigDB, there are 17,779 gene sets in total, from 8 different collections, namely hallmark gene sets, positional gene sets, curated gene sets, motif gene sets, computational gene sets, Gene Ontology (GO) gene sets, oncogenic gene sets, and immunologic gene sets [27].
Based on each cancer type, the search terms used to obtain the data, the Unified Medical Language System-Concept Unique Identifiers (UMLS-CUIs), number of the genes obtained from DisGeNET with a number of supporting publications (PMIDs) ! 2 and the overlapped number of the genes between the selected cancer dataset and DisGeNET are given in Table 2. The gene level analysis was performed by using the relevant cancer-related DisGeNET data. By using the selected disease genes from DisGeNET for each cancer type, the top associated 100 biological pathways were identified from the BioCarta, KEGG, and Reactome databases with a false discovery rate (FDR)<0.05 by using MSigDB online tool. FDR score is the adjusted version of the raw p-value for the multiple hypothesis testing. Here, Bonferroni correction was used to correct the raw p-values [28], in which the p-values are multiplied by the number of hypotheses that were tested. The pathway level analysis was performed by using the top relevant biological pathways of the disease genes obtained from the DisGeNET.

Methods
In this study, the association scores between the proteins were calculated by using Spearman [5], Pearson [5], Kendall Tau (Kendall) [29] correlation methods. BS [11], Empirical [30], KDE [31], MM [32], Shrink [33], and SG [34] MI-based association estimators were also used in our analyses. Moreover, the influence of the discretization methods (EW, EF, GEW) on the Empirical, MM, Shrink, and SG estimators' inference performances was also examined. Details of the association estimator and the discretization methods used in this study can be found in [5,12,33,35]. All these methods were used as an alternative to the correlation-based adjacency function of WGCNA package that is used to calculate the Pearson and Spearman correlation scores between the protein pairs. We compared the MI-based and correlation-based methods within the co-expression network concept and evaluated their disease relevance in both pathway and gene-level.
The WGCNA steps were performed as follows; • Firstly, the association matrix (adjacency) of the proteins in the data set is calculated using the protein expression data. • A network-related similarity matrix is obtained by using a topological overlap measure (TOM) [36], which has been successfully applied in different studies [37] in the literature, to identify the heavily interrelated protein clusters (TOMsimilarity measure).
• The obtained similarity matrix is subtracted from one (1-TOM) to identify a dissimilarity measure. From the dissimilarity matrix, a clustering tree is formed to identify the cluster of related proteins via hierarchical clustering (hclust).
• To find highly correlated gene/protein co-expression modules in the generated clustering tree of the gene networks, we used the dynamic branch cut methods [38] (cutreeDynamic) with different parameters (see S7 Table). Finally, gene/protein with the highest connectivity in each module (hub gene) is determined. (chooseOneHubInEachModule ).
In this study, we used one-tailed version of the Fisher's exact test (FET) [39], which is identical to the hypergeometric test to calculate the P-values representing the association of the coexpression modules with the given cancer type. A hypergeometric test is adopted from [40], whose distribution (used to calculate P-value) is given in (1) where n and k as integer, ð n k Þ is the binomial coefficient, I is the number of inferred genes/proteins of network inference method, V is the number of genes/proteins in the DisGeNET database to use for verification, O is the number of overlap between I and V. AGP (all genes/proteins) is the number of all known genes/proteins for human genome.
If the calculated P-value is less than 0.05, the overlapped proteins between the inferred modules and disease-related DisGeNET genes and their relevant pathways, are considered less likely to be random and these modules are more likely to be disease-associated and biologically interesting. Fig 1 and Table 3 are illustrated to provide a better understanding for the use of FET for the overall assessment of the association estimators' performances. As given in Fig 1, based on the given two clusters (e.g., inferred and validated genes/proteins), it was determined whether the overlap is statistically significant according to all genes/proteins (AGP) in the literature. Recent studies have revealed that the number of genes in the human genome is around 19,000 [41] and we used this value as AGP in our study. Table 3 summarizes the regions shown in Fig 1 and is used as an input in the FET. Ultimate assessment metric used in the evaluation process is the precision score: where p is precision, TP is number of the true positives, FP is number of the false positives. TPs denote the modules which satisfying the required conditions in the pathway-level or genelevel analysis (P-value<0.05). FPs denote the number of modules which are not satisfying the required conditions in the pathway-level or gene-level analyses. The required conditions are described in Proposed Framework section. Association estimators used in the reconstruction process of the co-expression networks. For the analyses, the build.mim function from the minet [35], the obtain.mim function from the DepEst [42] and the chooseOneHubInEachModule, adjacency, TOMsimilarity, cutreeDynamic and hclust functions from the WGCNA packages were used. The selected association estimators are indicated by the specific parameter values provided to the build.mim and obtain.mim functions.
Proposed framework. In this section, the proposed framework to analyze the performances of the association estimators is illustrated in Fig 2. 1. In the first step the proteomic data is downloaded from TCPA and prepared for the analysis (see Materials section).
2. In the second step, co-expression networks and modules were created using WGCNA as described in Methods section. The interaction score matrix to be used in the subsequent steps was calculated by using the adjacency function and other association estimators provided by DepEst [42] and minet [35] packages. In addition, parameters that generate at least 7 modules (sub-networks) for each association estimator were found by using certain parameters (see S7 Table). 7, 8, and 9 modules are generated from the co-expression networks for each association estimator from each dataset. The reason why the module numbers are selected as 7-8-9 is shown in (S1 and S2 Figs) and details are explained in S1 Text. The modules created with the parameters listed in S8 Table were used in the comparison of the association estimators.
3. In the 3rd step, the genes that were confirmed to be related with the given cancer type in at least two different studies (PMIDs ! 2) were obtained for each cancer type by using DisGe-NET web page [43].
4. In this step, by using the MSigDB web page [44], we identified the top 100 biological pathways from the BioCarta, KEGG and Reactome databases, that are significantly associated with the disease-related gene list with a FDR q-value <0.05. 5. Pathway-level analysis by FET: In this step, overlapping ratios and corresponding P-values of the association between the pathways obtained at the 4th step and the modules identified in step 2 were found by using FET. To correct the raw P-values for the multiple hypothesis testing, the FDR q-value was calculated by multiplying the P-value by 100 since we selected the top 100 pathways associated with the DisGeNET genes in the previous step. To evaluate our findings in the pathway level, WGCNA modules containing at least 5 shared genes with at least two of the disease-associated pathways found in Step 4 with an FDR q-value <0.05 were considered as a successful hit (TP) in terms of disease relevance. The reason why at least 2 pathways and 5 genes are selected as cut-off is shown in S3 Fig and details are explained in S2 Text. Number of the modules satisfying these conditions are divided by the total number of the modules (TP+FP) to obtain the pathway level performance scores (precision) of the association estimators. The precision ratios at the pathway level are given in Table 4 and Fig 3 by module numbers for each association estimator.
6. Gene-level analysis by FET: To evaluate our findings in gene level, the overlapping ratios between the genes in modules passing the overlap analysis test at the pathway level overlap analysis and the genes obtained from DisGeNET were calculated via FET, and the resulting modules which have p-values lower than 0.05 were considered as significantly associated (TP) with the corresponding cancer type. By dividing the number of modules that satisfy these conditions by the total number of modules (TP+FP), we obtained the gene level performance scores (precision) of the association estimators estimator (see S9 Table for details of the gene-level analysis scores by FET). The precision scores at the gene level are given in Table 5 by module numbers for each association estimator, and the graphical representation of the hub genes of the significantly disease-relevant modules is given in Fig  4. In this step, the hub genes within the disease-associated sub-networks, which are identified by the most successful estimator of step 5, and the neighboring genes that are located at the center of these hub genes are illustrated in Figs 5-9 for BRCA, GBM, KIRC, LUSC, SKCM cancer type, respectively. The hub genes and the genes that have been experimentally confirmed to be associated with the disease and reported in DisGeNET within the hub gene neighborhood are illustrated with colored and larger nodes in Figs 5-9. The genes that are not colored but have a red frame in Figs 5-9 have a PMID value of one. There is no entry in DisGeNET for the grey colored nodes. Also, the top three pathways, to which each module is related, are given above or below the relevant module to annotate each module.

Results
As shown in Table 4 and Fig 3, the rankings of the association estimators identified by the average performance score (Avg) are varying according to the constructed module sizes and the cancer type. In the pathway-level analysis, best performing methods based on the cancer type are as follows: Shrink for BRCA, SG for GBM, Shrink for KIRC, BS for LUSC, and MM for SKCM. In addition, although the most successful predictor varies for each data set, the MI based predictors outperformed other methods in BRCA, GBM, KIRC, LUSC datasets, however only in SKCM data set, MM estimator and Kendall correlation method were found to be successful with very close average precision scores according to the pathway level analysis. According to the average precision scores for BRCA dataset that are given in Table 4, the ranking of the estimators is as follows: Shrink, Empirical, SG, Adjacency, BS, Pearson, Table 5. Gene level precision ratios by module numbers for each cancer type (P-value <0.05).   The hub genes and neighbors in the disease-related sub-networks obtained by the most successful SG method (in terms of precision score) on GBM dataset. IGF1R is validated in one study according to DisGeNET. In recent study, RAD50 gene (protein) was also shown to be associated with the GBM, though it was not reported in the DisGeNET. The genes registered in DisGeNET and experimentally confirmed for the diseases, are shown with colored and larger nodes. Among those, genes that are not colored but have a red frame have a PMID value of one. There is no entry in DisGeNET for the grey colored nodes. According to the average precision scores for KIRC dataset that are given in Table 4, the ranking of the estimators is as follows: Shrink, MM, Adjacency, BS, Spearman, SG, Kendall, Pearson, Empirical and KDE. Here, Shrink method is identified as the most accurate one in terms of identifying disease associated sub networks with a precision score of 0.59. MM,  Based on the average precision scores for LUSC dataset that can be seen in Table 4, the ranking of the estimators is as follows: BS, Shrink, Empirical, Adjacency, MM, SG, KDE, Pearson, Kendall  According to the average precision scores for SKCM dataset as given in Table 4, the accuracy ranking of the estimators is as follows: MM, Kendall, Shrink, Pearson, Spearman, Adjacency, BS, Empirical, KDE and SG. MM and Kendall methods could identify the most diseaserelevant sub networks and genes with slightly different values of 0.758 and 0.757, respectively. Shrink, Pearson, Spearman, Adjacency, BS, Empirical, KDE and SG methods follow them with lower scores of 0.73, 0.71, 0.71, 0.68, 0.68, 0.68, 0.68 and 0.67, respectively. We designed a 2-level evaluation process, i.e. pathway-level and gene-level, with stringent cut-offs to have a complementary system as explained in the methods section. In pathwaylevel evaluation we searched the disease-associated modules. Then, in gene-level evaluation we focused on the disease-associated modules and looked for the overlapping ratios between the member genes of these disease-associated modules and the previously reported disease, i.e. given cancer type, genes.

BRCA
As given in Table 5 and Fig 4, the best performing association estimators according to the average precision values in the gene-level analysis are as follows: Shrink for BRCA, SG for GBM, Shrink for KIRC, BS for LUSC, and MM for SKCM. In addition, as in the pathway-level analysis, although the most successful predictor changes for each data set, the MI-based predictors were found to be more successful in BRCA, GBM, KIRC, LUSC dataset. Only in SKCM data set, MM estimator and Kendall correlation method were found to be successful with very close precision values according to both pathway-level and gene-level analyses for P-value < 0.05. The hub genes and neighbors in the disease-related sub-networks obtained by the most successful MM method (in terms of precision score) on SKCM dataset. In recent studies, ETS1 gene (protein) was also shown to be associated with the SKCM, though it was not reported in the DisGeNET. The genes registered in DisGeNET and experimentally confirmed for the diseases, are shown with colored and larger nodes. Among those, genes that are not colored but have a red frame have a PMID value of one. There is no entry in DisGeNET for the grey colored nodes. https://doi.org/10.1371/journal.pone.0188016.g009 Influences of the association estimators on the coexpression network inference According to the average precision scores for BRCA dataset, which are given in Table 5, the estimators are listed as follows: Shrink, Empirical, SG, Adjacency, Pearson, MM, Kendall, BS, Spearman, and KDE. Shrink method is the most accurate one in terms of identifying diseaseassociated genes with a precision score of 0.85. Empirical, SG, Adjacency, Pearson methods follow this one with lower scores of 0.67, 0.63, 0.61, 0.59, respectively. Additionally, MM, Kendall, BS methods obtain comparatively lower scores of 0.52, 0.51, 051, while Spearman and KDE methods get the lowest scores of 0.45, 0.42, respectively.
Based on the average precision scores for GBM dataset that are given in Table 5, the estimators are listed as follows: SG, MM, Kendall, Shrink, BS, KDE, Empirical, Adjacency, Pearson and Spearman. SG is the most accurate method in terms of identifying disease-associated genes with a precision score of 0.80. MM, Kendall, Shrink, BS, KDE methods follow this one with lower scores of 0.72, 0.67, 0.63, 0.62, and 0.60, respectively. Additionally, Empirical, Adjacency and Pearson methods obtain comparatively lower scores of 0.59, 0.58, 0.54, while Spearman method gets the lowest score of 0.46.
According to the average precision scores for KIRC dataset (Table 5) According to the average precision scores for SKCM dataset (Table 5), the estimators are listed from high to low scores as follows: MM, Kendall, Shrink, Pearson, Adjacency, BS, Empirical, KDE, SG, and Spearman. MM and Kendall are the most accurate methods in terms of identifying disease-associated genes with precision scores of 0.758 and 0.757, respectively. Shrink, Pearson, Adjacency, BS, Empirical, KDE, and SG methods follow them with lower scores of 0.73, 0.71, 0.68, 0.68, 0.68, 0.68, and 0.67 respectively. Additionally, Spearman method has the lowest score of 0.62.
Finally, the sub networks of the module hub genes identified by the association estimators, which provide the highest precision scores based on a statistical test (with a P-value < 0.05), are generated by Cytoscape [45] and shown in Figs 5-9 for all cancer types for the consideration of the other researchers studying in this field. The genes registered in DisGeNET and experimentally confirmed for the diseases, are shown in Figs 5-9 with colored and larger nodes. Among those, genes that are not colored but have a red frame have a PMID value of one. There is no entry in DisGeNET for the grey colored nodes. Also, the most associated top three biological pathways, to which each module is related, are given above or below the relevant module to annotate each module.

Conclusion and discussion
Performances of nine association estimators used in the network inference algorithms were examined on the proteomic data of five different cancer types in both pathway and gene-levels.
To make this assessment, selected association estimators were used instead of the adjacency matrix construction procedure in the WGCNA, which is based on either Pearson or Spearman correlation and has been used in many different studies in the literature. In conclusion, from Tables 4 and 5 and Figs 3 and 4, it can be clearly observed that in terms of the precision scores, the MI based methods provide better results than the correlation-based methods and the adjacency function which is provided as a default choice in WGCNA. In parallel with the studies in the literature [46], our findings confirmed that the correlation-based methods are not sufficient to estimate the non-linear relationships between the cell molecules such as cancer-related complex molecular interactions.
As a conclusion, despite not being included in DisGeNET, the genes that are found to be related with the disease via recent studies in the literature (PDK1, SMAD4, RAD50, FOXM1, CDKN1A, ETS1 [47][48][49][50][51][52][53][54]) were detected with our proposed framework. Studies verify that these genes (proteins) are associated with cancer-related processes. Du et al. identified the PDK1 as a potential therapeutic target for BRCA [47]. Dupuy et al. also determined the PDK1 as a key regulator of metabolism and metastatic potential in BRCA [48]. Liu at el. indicated that the assessment of SMAD4 protein level may provide additional prognostic information about BRCA [49]. Mishima et al. found out that MRE11-RAD50-NBS1 complex inhibitor can effectively increase radiosensitivity in GBM [50]. Sun et al. showed the prognostic significance of FOXM1 expression in LUSC [51]. Zhang et al. remarked the FOXM1 as a novel biomarker of LUSC [52]. Fukazawa et al. found out that the tumorigenic effect of SOX2 on LUSC is mediated in part by suppression of CDKN1A [53]. Keehn et al. stated that ETS1 may be important in the pathogenesis of invasive SKCM [54]. Thus, since our proposed method could capture this long list of previously studied genes, it is suggested that it might capture a more comprehensive list of the disease associated gene-gene interactions that were missed in previous studies.
The most significant contribution of our study is the use of different association estimators in biological network inferring methodologies, which can make a significant improvement in identifying the disease-associated co-expression modules when they are integrated with the WGCNA method. In addition, similar performance scores of each estimator in pathway-level and gene-level analysis also indicate the consistency of our study.