Identification of modules and functional analysis in CRC subtypes by integrated bioinformatics analysis

Colorectal cancer is one of the top three causes of cancer-related mortality globally, but no predictive molecular biomarkers are currently available for identifying the disease stage of colorectal cancer patients. Common molecular patterns in the disease, beyond superficial manifestations, can be significant in determining treatment choices. In this study, we used microarray data from colorectal cancer and adjacent normal tissue from the GEO database. These data were categorized into four consensus molecular subtypes based on distinct gene expression signatures. Weighted gene-based protein–protein interaction network analysis was performed for each subtype. NUSAP1, CD44, and COL4A1 modules were found to be statistically significant and present among all the subtypes and displayed though similar but not identical functional enrichment results. Reference of the characteristics of the subtypes to functional modules is necessary since the latter can stay resistant to platform changes and technique noise when compared with other analyses. The CMS4-mesenchymal group, which currently has a poor prognosis, was examined in the study. It is composed mainly of genes involved in immune and stromal expression, with modules focused on ECM dysregulation and chemokine biological processes. Hub genes detection and its’ mapping into the protein–protein interaction network can be indicative of possible targets against specific modules. This approach identified subtypes using enrichment-oriented analysis in functional modules. Proper annotation of functional analysis of modules from different subtypes of CRC might be directive for finding extra options for treatment targets and guiding clinical routines.


Introduction
Colorectal cancer (CRC) is a complex and heterogeneous disease and has a significant contribution to cancer mortality [1,2,3]. One of the major drawbacks of the treatment of CRC is its heterogeneity, as evidenced by multiple clinical manifestations, mutational profiles, and survival rates. This heterogeneity leads to variability in the efficiency of standard treatment approaches [4]. Traditionally, CRC has been characterized using pathology and clinical phenotype [5,6], which sometimes works well; however, there is usually a delay in the detection prior to the onset of symptoms. There are also problems in distinguishing diseases with complex and overlapping clinical signs.
With the advancement of technology for genomic sequencing, heterogeneity can be identified at a molecular level. Previously, microsatellite instability (MSI) and chromosomal instability were major criteria for molecular classifications. A third molecular subtype was later added to the two well-characterized subtypes, taking into account microsatellite stability and molecularly with a higher level of CpG island methylation. These classifications provided valuable additional information [7]. It became apparent that the newly added subtypes exhibited chemoresistance against epidermal growth factor receptor-targeted therapy [8]. Much more precise classifications of CRC were established using four biologically distinct consensus molecular subtypes: MSI-immune, epithelial and canonical, epithelial and metabolic, and mesenchymal subtypes [9]. These four subtypes cover more than 85% of CRCs.
Despite the promising results of intrinsic subtype analyses, it's transformation into personalized treatment strategies is still limited. Many patients underwent the same standard care based on their pathological stage or clinical manifestations; however, discovering that these approaches end up with variable survival rates is not uncommon [10]. The application of promising results from research into actual clinical treatment are often delayed [11], leaving many positive discoveries untrialed.
Because response feedback systems for targeted malignancy treatments are rather substandard, the clear division of tumors into subtypes has gained practical importance as a way for researchers to investigate the molecular features of malignancy and identify precise molecular phenotypes [12], which can be used to inform decisions, such as treatment regimens and issues pertaining to medical care. However, the interactions between the genes involved in each subtype of CRC are not fully understood. In particular, enriched functional analysis is a valuable approach. Further, the extent to which processes which are enriched in these networks are related to clinical outcomes is unclear.
PPI network is an original element we adopted in this study, and thesis that it can be informative in the target perspectives was established far early [13], although there is debate on the aspects of whether a drug target protein can be represented as the hub of the PPI network. It was also suggested that multiple specific motifs from the PPI networks can be meaningful in detection of functional dependency of most drugs [14]. Meanwhile, the necessity of mapping drug targets into the integrated biological networks to identify the optimal points of PPI for drug discovery was reported [15]. There are other methodologies applied with PPI network that can be helpful in predicting drug targets and finding hub genes [16], approaching a better profile of interactions among molecular function in the whole system, among which, module screening in the PPI was a must.
Module and function analysis can be important [17]. A module is a stable functional unit in a gene expression set. For example, in breast cancer, therapy choices based on subtypes determined by clinical markers have proven to be effective compared with treatment based simply on the pathological stage [18]. Each subtype can have its own unique functional subnetwork of enriched genes. However, it is not known how the recently determined consensus CRC molecular subtypes relate to clinically relevant pathological subtypes and treatment choices [19]. The objective of this study is to discover enriched functional modules in each subtype of CRC, with the aim of better identification of the variances at the molecular level. A crucial element of the study was the use of unsupervised modules, which are robust to noise and tend to identify at least a few member genes represented across multiple platforms. To the best of our knowledge, this is the first attempt to investigate modules for separate subtypes of CRC since the establishment of consensus molecular subtype of CRC.
In this study, we analyzed biological functional modules and provided evidence that targeted treatment selection based on modules at the molecular level can be realized as subtypes of CRC and may be valuable for developing integrated models that can predict clinical outcomes.

Study design
This study focused on discovering functional modules in a CRC molecular network. The raw data from CRC samples was classified into four categories based on consensus molecular subtypes, then compared with normal samples. We used gene expression data from CRC samples. Differentially expressed genes (DEGs) were categorized into upregulated and downregulated groups, and network inference algorithms were used to construct protein-protein interaction (PPI) networks, which were visualized using Cytoscape [20]. Significant modules in each part of the CRC subnetwork were studied, with the aim of identifying promising targetable points inside CRC modules. A complete workflow for this study is depicted in Fig 1.

Microarray data
The CRC gene expression profiles used in this study were downloaded from the NCBI database with the accession number GSE39582, which was processed using the Affymetrix HGU133A platform, and contained 462 samples, including 443 CRC samples and 19 normal colon samples. Statistical analysis of the GEO dataset was performed using R (Version 1.1.453). All raw data were normalized and converted to log 2 ratio format using the robust multichip average algorithm [21]. RMA algorithm was often applied in generating matrix from gene chip and microarray data, which consisted of three sections in dealing with large amount of data: background subtraction, quantile normalization and summarization, and exceeded in preparing data for numerous downward R packages analysis.

Identification of subtypes
The analysis of the raw CRC data was carried out using the R package CMScaller [22], which implements an algorithm optimized for the comparison of consensus molecular subtypes. CRC data were divided into four consensus molecular subtypes, and gene expression data that were excluded in the four subtypes were removed from this study.

DEGs
The DEGseq R package [23], in which SAM algorithm is kernelled, was used to detect DEGs with an absolute log fold change (FC) > 2, and a P-value < 0.05 was considered to be statistically significant. DEGs were used to sort genes with upregulated and downregulated expression in each CRC subtype data vs. the normal colon epitheliums.

Integration of PPI network and module detection
The PPI network was created using the online database STRING (Search Tool for the Retrieval of Interacting Genes) [24], which holds data about known and predicted protein-protein interactions. DEGs in each subtype were mapped using this tool. The criterion for retaining interactions was a combined score > 0.4. The Cytoscape plug-in Molecular Complex  Detection (MCODE) was used to identify modules in the PPI networks. The criteria for the identification of significant modules were MCODE score > 3 and number of included nodes > 4.

Functional analysis of modules
All modules identified for each consensus molecular subtype were examined for overrepresented GO categories [25,26,27]. The analysis of Gene Ontology term enrichment (GO) is widely used for interpreting the biological significance of sets of genes and the processes in which they are involved. The DAVID database [28] was used to map genes from modules to detect the relevant biological annotations of GO terms. A P-value < 0.05 was considered to be statistically significant. The PPI networks of CMS4 can also be visualized using REVIGO [29], which helps to unveil the inner connections of all CMS4-enriched biological processes.

Survival analysis of hub genes
The HCAR3 module of CMS4 was selected for further analysis. This module included 32 genes. The top ten ranked genes determined by the number of interactions it formed in this module were selected as the signature gene set representative of the module. The relationship between recurrence-free survival and possession of the HCAR3 module signature gene set was assessed using Cox regression survival analysis in SurvExpress [30], and Kaplan-Meier survival plots stratified by these ten genes were constructed.

Validation of the workflow using bioinformatic approach
The applicability of this workflow in our study was partially examined using a bioinformatic approach with another dataset from the GEO database, with the accession number of GSE20916. This dataset included both malignant samples and paired normal tissue. DEGs were screened by comparison with normal tissue gene expression levels, following the identification of CMSs. Modules with a high proportion of enriched processes were investigated in depth in CMS4.

Identification of DEGs
Using a CMS classification of GSE39582, a total of 2930, 2846, 2286, and 2627 DEGs were identified in CMS1, CMS2, CMS3, and CMS4 for each subtype, respectively. Heat maps of the 50 top-ranked genes with respect to DEGs expression in each subtype are shown in Fig 2(

PPI network construction and module analysis
DEGs in each subtype were mapped using the STRING database to construct a PPI network with a total of 2163 nodes and 23939 edges in CMS1, 2163 nodes and 23939 edges in CMS2, 1699 nodes and 16578 edges in CMS3, and 1974 nodes and 12939 edges in CMS4. The modules identified for each subtype are shown in Table 1, which briefly summarizes some of the key differences in modules between subtypes. The number of significant modules for each subtype varied, and the structure of the networks for each subtype differed in node composition and number of links in a component. The number of links in a module is the number of connections between nodes and represents the interactivity of the component. In Fig 3, we show a visualization of the PPI network for each subtype, with nodes colored based on modules. It is easy to observe the contrasts in the architectures of the PPI network between each network subtype. Both the number of colors and the structure of each module were unique for each CRC subtype.

GO terms are associated with modules for each CRC CMS
All of the components of each subtype were analyzed using GO term enrichment (Table 2). Each module was named after the top-ranked gene. Consistent with the results shown in Fig 3, most of the enriched modules varied across subtypes, with respect to the number of enriched modules and the gene composition in the modules, as well as the number of processes observed to be enriched in each module. Gene composition was not directly correlated with the number of enriched processes in a module. For instance, the IFITM2 module from CMS3  contained only four genes, yet ten processes were enriched in this module; in contrast, the RSL1D1 module from the same subtype, which was composed of 38 genes, had only three enriched processes.   The NUSAP1 module had the same 47 genes among subtypes, with 21 enriched processes shared across subtypes, most of which included protein phosphorylation, ubiquitin-protein ligase activity, and microtubule-based movement processes. In the CD44 module, gene composition shared little resemblance between subtypes and was dependent on each subtype. Two common processes are enriched in the COL4A1 module among all subtypes: ECM structural constituent and metal ion binding.

NUSAP1, CD44, and COL4A1 are present in all four consensus molecular subtypes
One of the significantly enriched processes in the CD44 module of CMS3 was negative regulation of canonical Wnt signaling pathway. CMS3 is often referred to as the subtype with the least variance; thus, the Wnt pathway tended to be the canonical transduction signal pathway overstimulated in the progression of CMS3.

Analysis of modules in CMS4
There were 16 modules detected in CMS4, of which 12 were enriched. All of the enriched biological processes were evaluated using REVIGO, as shown in Fig 5. It is clear that the modules with the most interacting enriched processes consist of inflammatory and immune response and apoptosis processes. This may indicate that although numerous sets of nodes model modules, there can be stronger interactions formed between certain modules, as in the case of the  HCAR3 module, which was the only module in which the chemokine-associated process was observed.

HCAR3 module and mesenchymal invariance
As seen in Fig 6, there were two subgroups in the HCAR3 module of CMS4, centered on ECM dysregulation and chemokine-associated processes. Subgroup centralities in the HCAR3 module indicated that some genes related to functional variance and coordination overview were involved. The tightly concentrated module of chemokine-associated processes and the ECM dysregulation process are often linked together as common mesenchymal traits in malignancy, with a synchronized mutual regulation.

The HCAR3 module is associated with survival in CRC
Using TCGA COAD clinical data as references, survival analysis of every gene contained in the HCAR3 module was performed. No significant differences were observed using single- gene analysis; however, survival analysis using multiple genes from the HCAR3 module indicated that colorectal cancer patients with higher expression of multiple genes from HCAR3 module had worse survival outcomes (Fig 7). As a potential new signature in the context of CRC survival analysis, the hazard ratio was 2.09, indicating a worse hazard of death from the patients possessing these genes. A Kaplan-Meier curve was generated, the log rank P-value of which was 0.002143 and the concordance index 0.64, suggesting a relatively good interpretation of survival prediction.

Bioinformatic validation of possible biomarkers
DEGs generated from comparisons of cancer samples in each subtype and 24 paired normal tissues were further mapped to the PPI network; the details are shown in S1 Table. In CMS4, the IL-6 module had a relatively higher number of enriched processes compared with the number of genes included in the module. Thirteen of the 22 genes in this module were part of the HCAR3 module of CMS4 in the analysis of GSE39582. Among them, genes with high weight included several from the CXCL family and the HCAR3 gene. Survival analysis of multiple genes in the IL-6 module was performed (S1 Fig) confirming the possible value of modules as biomarkers.

Discussion
Efforts have long been made to understand the heterogeneity of CRC, starting with the initial polymerase chain reaction-based method, in order to establish effective treatment strategies. Previous subtypes have, to some extent, achieved this purpose [31]. For example, MSI+ tumors, which have genomic mutation and exhibit immune cell infiltration, respond well to chemotherapy, including immune checkpoint inhibitors. CMS classification is the method applied in this study and is formed through the merging of information from weighted analyses of major subtypes as classified at that time. Retrospective analysis of clinical trials, for instance, those involvingCMS4 with a mesenchymal-like phenotype which benefits less from chemo-regimes containing anti-EGFR ingredients, has demonstrated the potential predictive value of the CMS classification. Three genes, NUSAP1, CD44, and COL4A1, were shared among the four subtypes identified, despite their molecular differences. NUSAP1 plays an important role in regulating BRCA1 protein levels [32] and is reported to be a predictor of poor prognosis in CRC [33]. In our findings, the expression level of the NUSAP1 module was upregulated in all CRC subtypes. Even though gene composition varied among different subtypes, they were involved in similar processes, such as mitosis and microtubule movement dysfunction. CD44, which is a wellknown immune membrane factor, coordinating alteration of signal expression during the progression of malignancy, acts as a "marker molecule" in the process of EMT: a common process for majority of malignant transformations [34]. COL4A1, a tumor angiogenesis indicator, is regulated by the p53 gene and functions in association with endothelial cells with destabilized matrix [35]. All three genes have vital roles to play in CRC progression.
CMS4 is the only subtype in which adjacent mesenchymal expression is involved. Often, patients diagnosed with CMS4 have the worst prognosis. In this study, CMS4 had 16 modules after PPI network screening, of which 12 were statistically significant in Gene Ontology enrichment analysis. Processes concerning inflammation, the immune system, and apoptosis are the most highly enriched categories. Among all the enriched modules, the HCAR3 module consisted of 32 genes but had relatively higher number of enriched processes per gene compared with other modules, although the HCAR3 gene alone was not a significant factor in the survival of CRC patients. The ten top-ranked genes in this module were shown to be essential in the prediction of survival. Patients with upregulated expression of these genes have lower rates of survival. Most of the ten genes share a gene signature common to metastatic traits: ECM1 is expressed throughout the intestine, and overexpression of ECM1 suggests malignant epithelial cancer as in CRC [36]; CXCL2 [37] and CXCL1 [38] mediate metastatic processes; cells with CXCR4 upregulation shows less sensitivity toward radiotherapy [39]; ACTN1 [40] and S1PR3 [41] aid in invasion enhancement [36]; and C5AR1 can increase cell permeability [42]. In contrast, CXCL9 has been shown to predict good outcomes for cancer patients [43]; HCAR3 and GNG4 exhibit suppressor effects on the process of tumorigenesis in mammary cancer [44,45] and in GBM [46]. The function of HCAR3 in CRC remains to be elucidated; however, a member of the same HCA receptor family with similar structure, named, HCAR2, which functions as a tumor suppressor gene, has been reported to have reduced expression levels in colorectal cancer cell lines [47]. In this study, HCAR3 was one of the most upregulated genes, forming the largest amount of interactions inside a module, so it is worth investigating as a potential novel target for a drug acting on oncogenic or suppressor processes.
Many bioinformatic pipelines exist, based on different assumptions and algorithms, and produce varying results. Results generated via the methodology applied in this study can be combined with those of other bioinformatic workflows and experimental conclusions to join information from a range of origins to build a solid framework for exploring these issues. The modules in the CMSs of CRC have not been explored in detail, but attempts have been made in studies using weighted correlation network analysis (WGCNA) on CRC microarray data [48], and recurrence-associated modules have been identified.
In the validation section, in which a different expression dataset was used, similar results were found during the detection of CMS4 modules. The major gene composition had a similar trend as the HCAR3 module at the level of enriched functional processes. The CXCL family has long been reported as prognostic biomarkers in colon cancer [49], being involved with inflammatory processes, including genes such as IL-6, C5AR1, and C3 in the IL-6 module. We applied strict rules on the use of datasets in this study; they must have cancer samples that are paired with normal tissue. In this way, we minimized the effects of timing and different data sources.
There is an urgent need for the elucidation of key oncogenic modules to provide an unbiased molecular classification of CRC in order to help tailor treatment choices in the future. In our research, not only modules but also the composition of modules varied among CRC subtypes. Despite these variances, NUSAP1, CD44, and COL4A1 were present in all subtypes. They share processes related to protein phosphorylation, ubiquitin-protein ligase activity, microtubule-based movement, ECM structural constituent, and metal ion binding, all of which are key pathways involved in the progression of malignancy. In CMS4, ten genes from the HCAR3 module are associated with the prediction of patient survival, and the HCAR3 gene is especially important, as it is engaged in multiple interactions and thus might be worthy of further attention with respect to the development of drugs against colorectal cancer.