Abstract
Hepatocellular carcinoma (HCC) is the most prevalent and deadly form of liver cancer, and its mortality rate is gradually increasing worldwide. Existing studies used genetic datasets taken from various platforms but focused only on the differentially expressed genes (DEGs) common across platforms. Consequently, these studies may have missed some genes that are important in HCC. To address this problem, we took datasets from multiple platforms and designed a statistical and machine learning-based system to determine platform-independent key genes (KGs) for HCC patients. DEGs were determined from each dataset using limma. Individual combined DEGs (icDEGs) were identified for each platform, and grand combined DEGs (gcDEGs) were then determined from the icDEGs of all platforms. Differentially expressed discriminative genes (DEDGs) were determined based on classification accuracy using a support vector machine. We constructed a PPI network on the DEDGs and identified hub genes using MCC. We determined the optimal modules using the MCODE scores of the PPI network and selected their gene combinations. We combined all hub genes obtained from previous studies to form metadata, known as meta-hub genes. Finally, six KGs (CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1) were selected by intersecting the hub genes, meta-hub genes, and hub module genes. The discriminative power of the six KGs and their prognostic potential were evaluated using AUC and survival analysis.
Citation: Hasan MAM, Maniruzzaman M, Huang J, Shin J (2025) Statistical and machine learning based platform-independent key genes identification for hepatocellular carcinoma. PLoS ONE 20(2): e0318215. https://doi.org/10.1371/journal.pone.0318215
Editor: Alexis G. Murillo Carrasco, Instituto do Cancer do Estado de Sao Paulo / University of Sao Paulo, BRAZIL
Received: August 21, 2024; Accepted: January 10, 2025; Published: February 5, 2025
Copyright: © 2025 Hasan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: This study utilized datasets that are available in the Gene Expression Omnibus (GEO) repository with accession numbers: GSE121248, GSE69715, GSE14520, GSE36376, GSE39791, GSE87630, GSE54236, GSE115018, and GSE47197. These datasets can be easily downloaded from the following link: www.ncbi.nlm.nih.gov/geo/. Moreover, TCGA-LIHC dataset can also be easily downloaded from the TCGA database (https://portal.gdc.cancer.gov/).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Hepatocellular carcinoma (HCC) is the leading cause of liver cancer-related deaths, with its mortality rate steadily increasing throughout the world [1]. HCC accounts for more than 80% of liver cancers [2] and is more common in men than in women [3]. It is typically diagnosed between 30 and 50 years of age [3]. HCC is strongly linked to various factors, including hepatitis (B/C) virus infections, obesity, alcohol abuse, smoking, and type 2 diabetes [4]. Hepatitis B, in particular, is noted as a significant risk factor [5]. Despite therapeutic advancements including radiation, chemotherapy, and targeted therapy, patients with HCC still face low survival rates [6]. Late diagnosis of HCC significantly contributes to rising incidence and mortality rates, underscoring the critical need for early detection methods and improved access to treatment to mitigate these risks [7, 8].
Today, statistical model-based bioinformatic analysis is effective in identifying key genes (KGs) and their correlated molecular pathways in various cancers, including HCC [6, 9–59]. It helps in understanding the molecular mechanisms underlying cancer development and progression, which play a vital role in identifying targets for targeted therapy and improving patient outcomes. For instance, Zhao et al. [12] determined seven hub genes and their molecular pathways (gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways). Liu et al. [13] introduced ten prognostic biomarkers that played an important role in the development of HCC. Li et al. [16] proposed five genes that were also strongly correlated with HCC.
In existing studies, researchers have attempted to propose promising biomarkers for detecting HCC. The majority of these studies used genetic datasets taken from a particular platform [9, 12, 13, 19, 41, 59], or from different platforms while considering only the differentially expressed genes (DEGs) common among those platforms [14, 18, 23, 24]. As a result, both types of studies may have missed important genes in their analysis. There is therefore scope to identify platform-independent KGs by including all important genes from all microarray platforms. Most existing studies derived hub genes or key genes from a PPI network. Recently, machine learning (ML)-based approaches have been widely used and have gained popularity for identifying potential biomarkers from genetic data [60, 61]. The integration of machine learning, statistical models, and bioinformatics can therefore strengthen the evidence and confidence that an identified biomarker is a KG.
In this study, six datasets from three microarray platforms were used as training datasets to identify KGs using statistical and ML-based approaches. To determine KGs, we performed several experiments. Firstly, DEGs were identified individually from each dataset. Secondly, individual combined DEGs (icDEGs) were identified from the two datasets of each platform. Thirdly, grand combined DEGs (gcDEGs) were identified by combining the icDEGs across all platforms. Fourthly, differentially expressed discriminative genes (DEDGs) were determined from the gcDEGs using a support vector machine (SVM). Fifthly, enrichment analysis was performed using clusterProfiler. The PPI network was constructed using STRING and visualized in Cytoscape, and hub genes were determined using maximal clique centrality (MCC). Module analysis was performed using Molecular Complex Detection (MCODE) to determine the significant modules and their associated genes. To incorporate the findings of existing studies, metadata were formed by combining all hub genes from previous studies, yielding meta-hub genes. Finally, the KGs were identified as the genes common to the hub genes, significant module genes, and meta-hub genes. These identified KGs demonstrate greater discriminatory power than other genes due to their statistical significance, robust validation, association with critical pathways, enhanced performance in predictive models, and unique identification through necessity analysis. Three further independent test datasets, one from each microarray platform, were used to validate these KGs and confirmed their discriminative power. The flowchart of our proposed methodology for determining KGs in HCC patients is shown in Fig 1.
Materials and methods
Microarray data acquisitions
Ten publicly available datasets were used to conduct the study. Nine datasets with GEO accession numbers GSE121248 [62], GSE69715 [63], GSE14520 [64], GSE36376 [65], GSE39791 [66], GSE87630 [67], GSE54236 [68], GSE115018 [69], and GSE47197 were extracted from the GEO database (www.ncbi.nlm.nih.gov/geo/). The TCGA-LIHC dataset was extracted from the TCGA database (https://portal.gdc.cancer.gov/). Three GEO datasets (GSE121248, GSE69715, and GSE14520) were derived from the Affymetrix platform (GPL570 and GPL571). The GSE36376, GSE87630, and GSE39791 datasets were from the Illumina platform (GPL10558 and GPL6947), and another three datasets (GSE54236, GSE115018, and GSE47197) were from the Agilent platform (GPL6480, GPL20115, and GPL16699). The GSE115018 dataset on the GPL20115 Agilent platform is a combined lncRNA + mRNA microarray; this study analyzed only its mRNA data. For each dataset, probe IDs were mapped to gene symbols for the corresponding platform and incorporated into the dataset matrix. Six datasets (GSE121248, GSE69715, GSE36376, GSE39791, GSE54236, and GSE115018) were employed to identify the KGs. Another four independent datasets (GSE14520, GSE87630, GSE47197, and TCGA-LIHC) were used to validate the KGs. The ten datasets are summarized in Table 1.
Effects of platform independent vs platform dependent data analysis
Microarray gene expression datasets are usually deposited in rich public repositories such as GEO [70] to study various diseases, including cancer. These datasets are generated on different platforms, namely Affymetrix, Illumina, and Agilent, and the number of genes differs among platforms. For example, there are 54675 probe IDs on the Affymetrix platform, 47323 on the Illumina platform, and 41000 on the Agilent platform. These correspond to 22835, 20760, and 19570 unique gene symbols on the Affymetrix, Illumina, and Agilent platforms, respectively, as shown in Table 2; their Venn diagram is presented in Fig 2. The majority of existing studies used genetic datasets taken from a particular platform [9, 12, 13, 19, 41, 59], or from different platforms while considering only the DEGs common among those platforms [14, 18, 23, 24]. As a result, both types of studies missed some important genes in their analysis. For instance, 17484 genes are common to the Affymetrix, Illumina, and Agilent platforms. If only these 17484 common genes are considered, the remaining important genes are excluded from consideration as KGs. On the other hand, if only the genes on the Affymetrix platform (2984+1092+17484+1275 = 22835; see Fig 2) are considered, important genes found only on the other platforms are also missed. To solve these problems, we took datasets from three different microarray platforms and designed a statistical and ML-based computational approach for identifying platform-independent KGs for HCC by considering all important genes.
Identification of DEGs from individual GEO dataset
This study applied log2 transformation and quantile normalization to the selected GEO datasets. DEGs were obtained using the “limma” R package (version 4.1.2) [71], with an empirical Bayes procedure used to analyze the gene expression difference between HCC and normal tissues [72]. The |log2FC| and adjusted p-value were computed for each gene. Probe IDs were converted into gene symbols using the “Bioconductor annotation” package [73]. When a gene symbol matched multiple probe IDs, we kept the probe (and its corresponding expression values) with the minimum adjusted p-value. Genes with |log2FC| ≥ 1.2 and adjusted p-value < 0.01 were chosen as DEGs [74, 75]. We used the ggplot2 package (version 3.3.6) [76] to generate volcano plots and the NMF package (version 0.24.0) [77] to create heatmaps of the DEGs in R.
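As a minimal sketch of the filtering rule described above, the following Python snippet keeps one probe per gene symbol (the one with the smallest adjusted p-value) and then applies the two thresholds. It assumes that the limma statistics have already been computed; the function name, the tuple layout, and the example rows are hypothetical and purely illustrative.

```python
# Thresholds taken from the text; everything else is an illustrative assumption.
LOG2FC_CUTOFF = 1.2
ADJ_P_CUTOFF = 0.01

def select_degs(probe_rows):
    """probe_rows: iterable of (probe_id, gene_symbol, log2fc, adj_p) tuples."""
    best = {}
    for probe_id, symbol, log2fc, adj_p in probe_rows:
        # Per gene symbol, keep the probe with the smallest adjusted p-value.
        if symbol not in best or adj_p < best[symbol][2]:
            best[symbol] = (probe_id, log2fc, adj_p)
    # Apply |log2FC| >= 1.2 and adjusted p-value < 0.01 to the retained probes.
    return {
        symbol: rec
        for symbol, rec in best.items()
        if abs(rec[1]) >= LOG2FC_CUTOFF and rec[2] < ADJ_P_CUTOFF
    }

# Hypothetical probe-level results (values are made up for illustration).
example_rows = [
    ("p1", "GENE_A", 2.1, 0.0004),
    ("p2", "GENE_A", 0.9, 0.0200),
    ("p3", "GENE_B", -1.5, 0.0030),
    ("p4", "GENE_C", 1.3, 0.0500),
]
degs = select_degs(example_rows)
```

In this toy input, GENE_A is represented by probe p1, and GENE_C is dropped because its adjusted p-value does not pass the cutoff.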
Identification of icDEGs from each platform
Generally, GEO microarray datasets come from the Affymetrix, Illumina, and Agilent platforms. Most existing studies identified DEGs from individual datasets of a particular platform [9, 12, 13, 19, 41, 59] or of different microarray platforms and took only the common DEGs for analysis [14, 18, 23, 24]. As a result, some potential DEGs were missed because of platform dependence [78]. To solve this problem, we considered all DEGs obtained from the GEO datasets of the same platform using the following formula:
(1)
where rs is the number of GEO datasets from each platform (rs = 2) and j is the number of platforms (j = 3).
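One way to write Eq (1) that is consistent with the description above (all DEGs from the datasets of one platform are pooled) and with the reported counts is a set union. The index symbols i and p below are introduced only for illustration and may differ from the authors' notation.

```latex
\mathrm{icDEGs}_{p} \;=\; \bigcup_{i=1}^{rs} \mathrm{DEGs}_{p,i},
\qquad p = 1, \dots, j
```

Here \mathrm{DEGs}_{p,i} denotes the DEG set of the i-th dataset on platform p.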
Identification of gcDEGs from all platforms
Most existing studies used GEO datasets from different platforms. These studies identified DEGs from each dataset, determined the DEGs common to the platforms, and used only these common DEGs for analysis. As a result, many significant DEGs were missed or ignored [78]. In contrast, we identified grand combined DEGs (gcDEGs) from three microarray platforms (Affymetrix, Illumina, and Agilent) using the following formula:
(2)
where rd is the number of platforms (here, rd = 3). Here, “Combined DEGs” refers to DEGs identified by integrating data from multiple datasets across platforms through bioinformatics analyses.
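A form of Eq (2) consistent with this description is the union of the per-platform icDEG sets. The platform index p is an illustrative symbol and not necessarily the authors' notation.

```latex
\mathrm{gcDEGs} \;=\; \bigcup_{p=1}^{rd} \mathrm{icDEGs}_{p}
```

Under this reading, a gene belongs to the gcDEGs if it is a DEG in at least one dataset of at least one platform.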
SVM based DEDGs identification
SVM is widely used for classification, regression, gene or feature selection, and outlier detection. SVM aims to find a hyperplane in a high-dimensional space [79] that effectively separates the classes (HCC versus control) by maximizing the following objective:
(3)
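For reference, the standard soft-margin SVM dual problem is sketched below. This is the textbook formulation and may differ in notation from Eq (3); the symbols \alpha_i (Lagrange multipliers), x_i (expression values), y_i \in \{-1, +1\} (class labels), and n (number of samples) are assumptions introduced here, while C and \gamma are the cost and kernel hyperparameters of an RBF kernel.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
 \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n}
 \alpha_i \alpha_k \, y_i y_k \, K(x_i, x_k)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0,

K(x_i, x_k) = \exp\!\left(-\gamma \,\lVert x_i - x_k \rVert^{2}\right)
```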
This study used the radial basis function (RBF) kernel for SVM and tuned the corresponding hyperparameters (cost (C) and gamma (γ)) using a grid search. The SVM model was employed to determine DEDGs using the following procedure:
- Step 1: Select 80% of the dataset as the training set and the remainder as the test set.
- Step 2: Select one gene from the list of gcDEGs identified from all platforms.
- Step 3: Train an SVM with the RBF kernel using 5-fold CV.
- (a) Calculate the accuracy for a DEG if it is found in one dataset.
- (b) Calculate the average accuracy for a DEG if it is found in multiple datasets.
- Step 4: Repeat Step 1 to Step 3 for all gcDEGs (2,264) obtained from all platforms.
- Step 5: Sort the DEGs by their accuracy or average accuracy in descending order.
- Step 6: Take as DEDGs the DEGs whose accuracy or average accuracy is more than 95.0%.
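The aggregation part of the steps above (3(a)/(b), 5, and 6) can be sketched as follows. This is a minimal illustration that assumes the per-dataset accuracies of the single-gene SVMs have already been computed; the SVM training itself is not shown, and the function name, dataset labels, and toy accuracies are hypothetical.

```python
from statistics import mean

ACCURACY_CUTOFF = 0.95  # "more than 95.0%" in Step 6

def select_dedgs(per_dataset_accuracy):
    """per_dataset_accuracy: {gene: {dataset_id: accuracy}} for every gcDEG.

    Each accuracy is assumed to come from an RBF-kernel SVM trained on that
    single gene (Steps 1-3); the SVM fit is outside the scope of this sketch.
    """
    # Steps 3(a)/(b): one dataset -> its accuracy; several -> their mean.
    scores = {g: mean(acc.values()) for g, acc in per_dataset_accuracy.items()}
    # Step 5: sort genes by (average) accuracy, largest first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Step 6: keep genes strictly above the cutoff.
    return [(g, s) for g, s in ranked if s > ACCURACY_CUTOFF]

# Hypothetical accuracies, for illustration only.
toy = {
    "GENE_A": {"DS1": 0.98, "DS2": 0.96},
    "GENE_B": {"DS1": 0.99},
    "GENE_C": {"DS1": 0.97, "DS2": 0.90},
}
dedgs = select_dedgs(toy)
```

In the toy input, GENE_C exceeds the cutoff in one dataset but not on average, so it is not retained.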
Enrichment analysis
Enrichment analysis was performed on the DEDGs using the clusterProfiler package (version 3.10.0) [80] and the “org.Hs.eg.db” annotation package (version 3.10.0) [81] in R to understand the molecular mechanisms and progression of HCC. clusterProfiler was used to determine GO terms, including biological process (BP), cellular component (CC), and molecular function (MF), as well as the associated pathways involved in HCC. The KEGG database was used for the biological interpretation of pathways. Significant GO terms and KEGG pathways were selected using p-value < 0.05.
Analysis of PPI network and hub gene identification
In our study, we constructed the protein-protein interaction (PPI) network using data from the STRING biological database (version 11.5) (www.string-db.org), which is typically used to gather known protein interactions [82]. Proteins are represented as nodes, and the interactions between them as edges. A PPI network can be constructed from experimental data or computational predictions, resulting in a graph whose nodes (proteins) are linked by edges that signify physical or functional interactions. Previous studies have applied confidence thresholds such as 0.30, 0.50, and 0.70 when analyzing PPI networks. Interactions with lower confidence scores are often unreliable and may represent false positives. Setting the threshold to 0.70 filters out many of these false positives, so that the remaining interactions are more likely to be biologically meaningful. Therefore, we used a confidence score threshold of 0.70 and set the maximum number of interactors to ‘0’ to further refine the network. Cytoscape (version 3.9.1) [83] was employed to generate the PPI network of the DEDGs. Additionally, the Cytoscape plugin cytoHubba was employed to rank the DEDGs (nodes) in the PPI network using eleven topological measures, such as degree, edge percolated component, MCC, and six centralities based on shortest paths (Bottleneck, EcCentricity, Closeness, Radiality, Betweenness, and Stress) [84]. We chose MCC because, among the eleven methods, it showed better precision in predicting essential proteins from PPI networks in Chin et al. [84]. Consequently, we ranked the genes by MCC value from largest to smallest and selected the top 20 DEDGs as hub genes, which were used to determine KGs for patients with HCC.
Hub modules and identifying their associated genes
The MCODE (Molecular Complex Detection) method clusters a PPI network into highly interconnected subgraphs. MCODE was chosen for its ability to detect functional protein complexes efficiently. It is easy to implement, works well for large networks, and is sensitive to local density variations, making it well suited to detecting biologically relevant sub-networks in PPI data. We used degree cutoff (= 2), k-core (= 2), node score cutoff (= 0.2), and max depth (= 100) as the criteria for module analysis [85, 86]. The optimal modules (om) were those with an MCODE score ≥ 6 and at least 6 nodes, and the hub module genes were selected using the following formula:
(4)
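As an illustration of the selection rule stated above (MCODE score ≥ 6 and at least 6 nodes, followed by pooling the genes of the retained modules), a minimal Python sketch is given below. The data layout, the function name, and the toy modules with placeholder gene names are assumptions, not output of MCODE.

```python
MIN_SCORE = 6   # MCODE score threshold from the text
MIN_NODES = 6   # node-count threshold from the text

def hub_module_genes(modules):
    """modules: list of dicts with 'score' (MCODE score) and 'genes' (a set)."""
    genes = set()
    for module in modules:
        if module["score"] >= MIN_SCORE and len(module["genes"]) >= MIN_NODES:
            # Pool the genes of every module that passes both thresholds.
            genes |= module["genes"]
    return genes

# Hypothetical modules with placeholder gene names.
toy_modules = [
    {"score": 7.0, "genes": {f"G{i}" for i in range(1, 8)}},   # 7 nodes
    {"score": 6.0, "genes": {f"G{i}" for i in range(8, 14)}},  # 6 nodes
    {"score": 2.8, "genes": {"G14", "G15", "G16"}},            # fails both
]
module_genes = hub_module_genes(toy_modules)
```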
Meta hub genes formation from existing studies
Previous studies that identified DEGs for HCC using the three platforms (Affymetrix, Illumina, and Agilent) were summarized [6, 9–59], and the hub genes they reported were pooled to form the meta-hub genes. These meta-hub genes were calculated using the following formula:
(5)
where M is the number of existing studies that identified hub genes (here, M = 48). These meta-hub genes were also used to identify KGs.
Identification of KGs
The following formula was used to determine KGs for HCC:
(6)
where K is the number of gene identification approaches (here, K = 3). This study considered three identification approaches (hub genes, significant module genes, and meta-hub genes) to determine the important genes for HCC.
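Since the KGs are described as the genes common to the three gene lists, the final step amounts to a set intersection. The sketch below illustrates this with placeholder gene names; the function name and the toy sets are hypothetical and do not reproduce the study's lists.

```python
def key_genes(*gene_sets):
    """Return the genes present in every supplied gene set."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets) if sets else set()

# Placeholder gene names, for illustration only.
hub = {"G1", "G2", "G3", "G4"}
module = {"G2", "G3", "G5"}
meta_hub = {"G2", "G3", "G4", "G6"}
kgs = key_genes(hub, module, meta_hub)
```

A gene that is missing from any one of the three lists, such as G4 in this toy example, is not retained.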
Validation of KGs
Discriminative capability analysis using AUC.
Four independent datasets (GSE14520, GSE87630, GSE47197, and TCGA-LIHC) were utilized to validate the KGs for HCC. Three of them were taken from the three microarray platforms (Affymetrix, Illumina, and Agilent). Detailed descriptions of these datasets are presented in Table 1. A logistic regression model was fitted, and the corresponding AUC values [87] were used to determine the discriminative power of the KGs.
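To make the AUC criterion concrete, the sketch below computes the AUC as the probability that a randomly chosen HCC sample receives a higher score than a randomly chosen control (the Mann-Whitney formulation). It assumes the scores are the predicted probabilities of a fitted model; the function name and the toy values are hypothetical.

```python
def auc_from_scores(scores, labels):
    """Mann-Whitney estimate of the AUC; labels are 1 (HCC) or 0 (control)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and class labels.
toy_scores = [0.9, 0.8, 0.5, 0.6, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

In the toy data, eight of the nine HCC-control pairs are ranked correctly, so the AUC is 8/9.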
Survival analysis.
We extracted 374 HCC patients from the TCGA database for survival analysis. For each gene, patients with HCC were divided into low-risk and high-risk groups using the median of its expression values. The relationship between the KGs and survival status (alive/dead) was assessed using the Cox proportional hazards model. We performed the survival analysis of the KGs using the “Survfit” based R package [88], with p < 0.05 considered statistically significant.
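The median split and the resulting survival curves can be illustrated with the sketch below. It uses the Kaplan-Meier estimator, which is a common way to draw survival curves for two groups; the Cox model fit is not shown. Assigning values equal to the median to the low group is an assumption, and all names and toy values are hypothetical.

```python
from statistics import median

def split_by_median(expression):
    """Return {patient: 'high' | 'low'}; values above the median are 'high'."""
    cut = median(expression.values())
    return {pid: ("high" if v > cut else "low") for pid, v in expression.items()}

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if death was observed at times[i], 0 if censored.
    """
    curve, surv = [], 1.0
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical expression values and follow-up data.
groups = split_by_median({"P1": 1.0, "P2": 2.0, "P3": 3.0, "P4": 4.0})
curve = kaplan_meier(times=[5, 8, 12, 20], events=[1, 0, 1, 1])
```

In practice, `kaplan_meier` would be called once per group, and the two curves compared.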
Necessity analysis of platform-independent KG identification
Following the procedure of existing studies, we determined the common DEGs from the GEO datasets of the same platform and the common DEGs from the datasets of all platforms, in order to show whether the six KGs could be identified in that way. We examined the presence or absence of the KGs from five viewpoints: (i) KGs vs. common DEGs from the two datasets of the Affymetrix platform, (ii) KGs vs. common DEGs from the two datasets of the Illumina platform, (iii) KGs vs. common DEGs from the two datasets of the Agilent platform, (iv) KGs vs. common DEGs from the six datasets of all platforms, and (v) KGs vs. grand combined DEGs from the six datasets of all platforms.
Experimental results
Identification of DEGs from individual dataset
The “limma” R package was used to obtain DEGs from each of the six GEO datasets (GSE121248, GSE69715, GSE36376, GSE39791, GSE54236, and GSE115018). Using the filtering criteria |log2FC| ≥ 1.2 and adjusted p-value < 0.01, a total of 645, 1084, 377, 279, 563, and 734 DEGs were determined from the GSE121248, GSE69715, GSE36376, GSE39791, GSE54236, and GSE115018 datasets, respectively. The volcano plots and corresponding heatmaps of the DEGs identified in the individual datasets are shown in Fig 3.
Dodger blue, gray, and firebrick represent downregulated, non-significant, and upregulated genes, respectively.
Identification of icDEGs from each platform
We identified icDEGs from the two datasets of each platform. The DEGs identified on the three platforms and their Venn diagrams are illustrated in Fig 4. As shown in Fig 4a–c, a total of 1415 (331+314+770) icDEGs were obtained from the Affymetrix platform, 458 (179+198+81) from the Illumina platform, and 1113 (380+183+550) from the Agilent platform. We also intersected the DEGs obtained from the six datasets across the three microarray platforms (see Fig 4d) and observed that only 31 DEGs were common to all six GEO datasets. If only these 31 common DEGs were considered, some potential DEGs might be missed from the analysis. For this reason, we considered the icDEGs from the two datasets of each platform.
Identification of gcDEGs from all platforms
We also intersected the icDEGs obtained from the Affymetrix, Illumina, and Agilent platforms; the Venn diagram is shown in Fig 5. We obtained 135 DEGs common to the icDEGs of the Affymetrix, Illumina, and Agilent platforms. If only these 135 common DEGs were considered, some potential DEGs might again be missed from the analysis. We therefore considered all 2264 (856+92+203+135+332+28+618) DEGs, called gcDEGs. These 2264 gcDEGs were used for further analysis.
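The reported counts can be cross-checked with a few lines of arithmetic. In the sketch below, the seven Venn region sizes are taken from the text, but the assignment of each count to a specific region is an inference from the per-platform icDEG totals (1415, 458, and 1113), not something read off Fig 5; the variable and function names are illustrative.

```python
# Exclusive Venn regions of the three icDEG sets. Region labels are inferred
# from the per-platform totals and should be treated as an assumption.
regions = {
    frozenset({"Affymetrix"}): 856,
    frozenset({"Affymetrix", "Illumina"}): 92,
    frozenset({"Illumina"}): 203,
    frozenset({"Affymetrix", "Illumina", "Agilent"}): 135,
    frozenset({"Affymetrix", "Agilent"}): 332,
    frozenset({"Illumina", "Agilent"}): 28,
    frozenset({"Agilent"}): 618,
}

def platform_total(platform):
    """Sum of all regions that include the given platform (its icDEG count)."""
    return sum(n for members, n in regions.items() if platform in members)

gc_total = sum(regions.values())  # size of the union, i.e. the gcDEGs
ic_totals = {p: platform_total(p) for p in ("Affymetrix", "Illumina", "Agilent")}
```

Under this assignment, the per-platform sums reproduce the three icDEG counts and the grand total is 2264.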
SVM based DEDGs identification
SVM was applied to the gcDEGs to determine DEDGs for HCC, and the classification accuracy was computed for each gene of the gcDEGs from all platforms. If a DEG was found in multiple datasets, its average classification accuracy was computed. The identification procedure of DEDGs using SVM is explained in the methodology section. The gcDEGs were then sorted in descending order of classification accuracy, as shown in Fig 6. In total, 518 DEDGs were identified because their accuracy or average accuracy was more than 95.0%.
Enrichment analysis on DEDGs
Enrichment analysis was performed on the 518 DEDGs to understand the molecular mechanisms of HCC development. The DEDGs were significantly enriched in a total of 309 BP-based GO terms, 17 CC-based GO terms, 18 MF-based GO terms, and 9 KEGG pathways (p<0.05). The top ten BP, CC, and MF-based GO terms and the nine KEGG pathways are illustrated in Fig 7.
PPI network construction and hub gene selection
A PPI network was constructed on the 518 DEDGs using STRING and is displayed in Fig 8. The PPI network comprised 233 nodes and 748 edges, with an average clustering coefficient of 0.283, a network density of 0.02, and 19 connected components. MCC scores were computed from the PPI network using the Cytoscape plug-in cytoHubba, and the genes were sorted by MCC. The top 20 DEDGs (CDC20, TOP2A, CENPF, DLGAP5, UBE2C, ARHGAP11A, RACGAP1, HIST1H2AJ, HIST1H2AH, HIST1H2AM, HIST1H2AK, HIST1H2BO, HIST1H4H, HIST2H2AB, HIST1H2BJ, HIST1H2BB, HIST1H3E, HIST1H2AD, HIST1H2BI, and HIST1H2BL) were selected and treated as hub genes in this work.
Hub modules and its associated genes identification
MCODE was employed for module analysis and identified 18 modules with MCODE scores from 2.8 to 7.0. Only module 1 and module 2 had both an MCODE score and a number of nodes of at least 6. These two modules were considered significant modules, and their PPI networks are shown in Fig 9. There were 7 nodes and 42 edges in module 1, whereas module 2 contained 6 nodes and 30 edges. To form the hub module, this study combined module 1 and module 2 and identified their 13 genes, known as significant module genes.
Meta hub genes determined from existing studies
Forty-eight previous studies related to gene identification for patients with HCC were summarized [6, 9–59]. To align our analysis with these studies, we compiled metadata by collecting all hub genes reported in earlier research and determining the meta-hub genes. The identified hub genes and corresponding metadata are listed in Table 3. In total, this study identified 138 hub genes from previous studies, which were used to identify KGs of HCC.
Identification of KGs
We determined 20 hub genes from the PPI network using MCC, 13 significant module genes from the significant hub modules, and 138 meta-hub genes from existing studies. We identified six genes common to the selected hub genes, significant module genes, and meta-hub genes, as illustrated in Fig 10. The six common genes were CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1, which were taken to be KGs that can be used to classify subjects as HCC patients or healthy controls.
Validation of KGs
Discriminative capability analysis using AUC.
Four independent datasets (GSE14520, GSE87630, GSE47197, and TCGA-LIHC) were used to validate the six KGs. Since these six KGs were identified from datasets of three independent platforms (Affymetrix, Illumina, and Agilent), we took the validation datasets from the same three platforms to show the discriminative power of the selected KGs. Moreover, we also checked the discriminative power of these KGs using the TCGA-LIHC dataset. The discriminative power of the six KGs was evaluated using AUC.
The ROC curves of the six KGs for the independent datasets are shown in Fig 11. Five of the six KGs were found in the GSE14520 dataset from the Affymetrix platform (see Fig 11a), all six KGs were found in the GSE87630 dataset from the Illumina platform (see Fig 11b), and four KGs were found in the GSE47197 dataset from the Agilent platform (see Fig 11c). As shown in Fig 11a, the AUC value of each KG was more than 0.900, which was taken to indicate high discriminative power: 0.954 for CDC20, 0.976 for TOP2A, 0.966 for CENPF, 0.946 for DLGAP5, and 0.972 for RACGAP1. Similarly, as shown in Fig 11b, the AUC values of the six KGs were 0.999 for CDC20, 1.000 for TOP2A, 0.986 for CENPF, 0.886 for DLGAP5, 0.995 for UBE2C, and 0.977 for RACGAP1. As shown in Fig 11c, the AUC values of the four KGs were 0.759 for CDC20, 0.927 for CENPF, 0.893 for UBE2C, and 0.878 for RACGAP1. Moreover, the ROC curves of the six KGs for the TCGA-LIHC dataset are shown in Fig 11d. The AUC values of CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1 were 0.967, 0.960, 0.967, 0.961, 0.963, and 0.961, respectively.
Survival analysis.
The survival analysis of the six KGs was performed using univariate Cox regression, and the corresponding findings are shown in Fig 12. The patients with HCC were classified into low-risk and high-risk groups, marked with different colors: the red line indicates patients in the high-risk group, and the green line indicates patients in the low-risk group. Specifically, high expression levels of CDC20 (p<0.0001), TOP2A (p = 0.0053), CENPF (p = 0.0019), DLGAP5 (p = 0.00024), UBE2C (p = 0.0013), and RACGAP1 (p = 0.0028) were significantly associated with survival status (alive/dead). Moreover, the hazard ratio (HR) of each of the six KGs was greater than one, indicating that higher expression of these genes was associated with an increased risk of death. HCC patients with over-expression of CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1 had significantly poorer survival than patients with lower expression of these genes, as shown in Fig 12.
Necessity analysis of platform-independent KGs identification
Six KGs were identified from the Affymetrix, Illumina, and Agilent platforms. We then performed a necessity analysis to check whether these six KGs were present or absent when the analysis was restricted to individual microarray platforms. Following the procedure of existing studies, we identified the common DEGs from the two datasets of each platform and the common DEGs from the six datasets of all platforms, in order to show whether the six KGs could be identified in that way. The result of the necessity analysis is presented in Fig 13. Only four KGs (TOP2A, CENPF, DLGAP5, and RACGAP1) were found among the common DEGs of the two Affymetrix datasets (see Fig 13a), three KGs (CDC20, TOP2A, and UBE2C) among the common DEGs of the two Illumina datasets (see Fig 13b), three KGs (TOP2A, CENPF, and DLGAP5) among the common DEGs of the two Agilent datasets (see Fig 13c), and only one KG (TOP2A) among the common DEGs of the six datasets of all platforms (see Fig 13d). In contrast, all six KGs were present in the grand combined DEGs obtained from the six datasets of all platforms by following our procedure. These results show that neither an analysis restricted to a specific platform (Affymetrix, Illumina, or Agilent) nor an analysis restricted to the DEGs common to all platforms can produce these six KGs.
Discussion
HCC is one kind of tumor, significantly correlated with poor survival rate and high death rate among most cancer types [89]. The poor survival rate has occurred due to the lack of early detection and the higher incidence rate. In existing studies, researchers have attempted to propose promising biomarkers for detecting HCC. Some of the existing studies conducted their research on genetic datasets that were taken from a particular platform [15, 27], and some researchers used datasets from different platforms [23, 24, 50, 52], but they considered only the common DEGs among those platforms. As a result, both types of studies missed some important genes in their analysis. In order to solve these problems, we have taken datasets from multiple platforms and designed a computational approach with the combination of statistics and ML-based approaches for identifying platform-independent KGs for HCC by considering all important genes.
To identify KGs for HCC, we used six datasets (GSE121248, GSE69715, GSE36376, GSE39791, GSE54236, and GSE115018) from Affymetrix, Illumina, and Agilent platforms. Two datasets were taken from each platform. First, we identified a total of 645 DEGs, 1084 DEGs, 377 DEGs, 279 DEGs, 563 DEGs, and 734 DEGs from GSE121248, GSE69715, GSE36376, GSE39791, GSE54236, and GSE115018 datasets, respectively and their volcano plots and heatmaps of DEGs for individual dataset were presented in Fig 3. After that, we selected 1415 icDEGs from the Affymetrix platform (GSE121248 and GSE69715) (see in Fig 4a), 458 icDEGs from the Illumina platform (GSE36376 and GSE39791) (see in Fig 4b), and 1113 icDEGs from the Agilent platform (GSE54236 and GSE115018) (see in Fig 4c). At the same time, a total of 135 common DEGs were identified from the Affymetrix, Illumina, and Agilent platforms (See in Fig 5). If we analyzed these 135 common DEGs for the next experiments, some important DEGs may be missed from the analysis. To solve this problem, we considered all 2264 (856+92+203+135+332+28+618) gcDEGs, obtained from all platforms. Moreover, we adopted SVM on gcDEGs and then calculated the accuracy of each gene from all 22264 gcDEGs, obtained from all platforms. If the DEGs were presented in multiple datasets, then compute the average classification accuracy. At the same time, DEGs were sorted based on classification accuracy, and we selected 518 DEGs out of 2264 DEGs because their classification accuracy or average classification accuracy was greater than 95.0% as shown in Fig 6. Enrichment analysis was employed on 518 DEDGs to understand their molecular pathways (see Fig 7). It was noticed that BP-based GO terms were significantly correlated of developing HCC patients, which were also enriched DEDGs, coincided with previous studies.
The BP terms that agreed with previous studies include detoxification of copper ion [36, 46], the stress response to copper ion [36], cellular response to copper ion [36, 90], detoxification of inorganic compound [36], the stress response to metal ion [36], cellular response to zinc ion [16], response to zinc ion [16, 54, 57], response to cadmium ion [16, 50, 54, 57], and regulation of cell-cell adhesion [6]. The CC-based GO terms were also significantly enriched with DEDGs and were consistent with existing studies, for example collagen-containing extracellular matrix [40]. For MF-based GO terms, DEDGs were also enriched with top GO terms such as chemokine activity [36, 59].
The KEGG pathways were closely associated with the DEDGs. Our findings were also consistent with previous studies, for example Mineral absorption [42, 46, 50, 51, 54, 91], Arginine biosynthesis [40], Viral protein interaction with cytokine and cytokine receptor [59], Cytokine-cytokine receptor interaction [25, 31, 45], Biosynthesis of amino acids, Proximal tubule bicarbonate reclamation, Alanine, aspartate and glutamate metabolism [40, 55], Complement and coagulation cascades [25, 35, 40], and PI3K-Akt signaling pathway [25].
A PPI network was constructed using STRING and Cytoscape, and the 20 hub genes were determined using MCC, as shown in Fig 8. The significant modules and their corresponding PPI networks are shown in Fig 9. In addition, 13 combined genes were determined from module 1 and module 2 and treated as significant hub module genes. Moreover, metadata were formed from 48 previous studies [6, 9–59], and 138 meta-hub genes were identified by combining all of their hub genes. Finally, this study proposed six KGs (CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1) by intersecting the identified hub genes, significant hub module genes, and meta-hub genes. After that, three independent test datasets from three independent platforms were used to validate the six identified KGs using AUC. The six KGs provided high discriminative power for differentiating HCC from healthy samples.
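The final selection step, intersecting three gene lists, can be illustrated with a minimal Python sketch. The function name `key_genes` and the `PLACEHOLDER_*` entries are hypothetical and stand in for the remaining members of each list, which are not enumerated in this section; only the six reported KGs are taken from the text.

```python
def key_genes(hub_genes, hub_module_genes, meta_hub_genes):
    """Return the genes present in all three lists, sorted alphabetically."""
    shared = set(hub_genes) & set(hub_module_genes) & set(meta_hub_genes)
    return sorted(shared)


# The six KGs reported in the text; they belong to every list by construction.
reported_kgs = ["CDC20", "TOP2A", "CENPF", "DLGAP5", "UBE2C", "RACGAP1"]

# PLACEHOLDER_* names are hypothetical stand-ins, not genes from the study.
hub_genes = reported_kgs + ["PLACEHOLDER_1", "PLACEHOLDER_2"]
hub_module_genes = reported_kgs + ["PLACEHOLDER_2", "PLACEHOLDER_3"]
meta_hub_genes = reported_kgs + ["PLACEHOLDER_1", "PLACEHOLDER_3"]

print(key_genes(hub_genes, hub_module_genes, meta_hub_genes))
```

A gene that appears in only one or two of the lists, like each placeholder here, is dropped, so the result contains only genes supported by the hub analysis, the module analysis, and the previous literature at the same time.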
CDC20 plays an essential role in human cell division [92]. CDC20 overexpression has been strongly associated with various cancers, such as colorectal [93] and breast cancer [94]. Existing research found that CDC20 overexpression was highly correlated with the development [95] and prognosis [96] of HCC. The current study also indicated that CDC20 is a KG that plays an important role in the development of HCC, consistent with existing studies [10, 13, 14, 17, 18, 20, 24, 30, 32, 37, 40, 41, 44, 48, 50, 55–57].
TOP2A is a cell cycle-related gene that encodes a DNA topoisomerase, which regulates and modifies the topological states of DNA during transcription. The overexpression of TOP2A has been considered a key biomarker of various cancers, including ovarian [97] and lung cancer [98]. TOP2A overexpression was significantly related to the progression and poor prognosis of HCC [99]. Our study also considered TOP2A a key biomarker for the development and progression of HCC, as supported by previous studies [12, 14, 15, 18, 20–25, 27, 32–36, 39, 41–43, 45, 46, 48, 50, 51, 54–59].
CENPF is a protein that plays a significant role in chromosome segregation during the cell cycle. CENPF has been significantly linked with numerous cancers, including HCC [100]. Huang et al. [100] proposed CENPF as a novel significant biomarker for HCC. The current study indicated that CENPF might play a significant role in the development of HCC, as supported by previous studies [101, 102].
DLGAP5 also plays a vital role in the development and progression of HCC [103, 109]. Liao et al. [104] showed that DLGAP5 expression levels were higher in HCC patients than in healthy subjects. Moreover, higher expression of DLGAP5 was significantly associated with venous permeation and cellular invasion, consistent with a previous study [9].
Similarly, this study considered UBE2C one of the key biomarkers for the development and prediction of HCC, as supported by previous studies [10, 18, 33, 36, 41, 44, 58]. One study showed that UBE2C was among the promising key biomarkers of HCC [105]. RACGAP1 is a key regulator in various cancers, such as colorectal cancer [106], ovarian cancer [107], and breast cancer [108]. One study showed that RACGAP1 was one of the diagnostic and prognostic biomarkers for early HCC [109]. Our study also showed that RACGAP1 might play a critical role in the development of HCC. This finding was consistent with existing studies [35, 38, 39, 43, 57, 59].
To show the discriminative power of the six proposed KGs, we used four independent test datasets. The validation results illustrated that CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1 had high discriminative capability as KGs. Moreover, we performed a prognostic analysis of these six KGs. Interestingly, the six KGs (CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1) were significantly correlated with poor prognosis of HCC. More research will be necessary on the roles these six genes play in the development of HCC. In the future, we will apply the same pipeline to the analysis of other cancer types. Although research to develop novel biomarkers is ongoing, our future work will address the clinical utilization of the identified key biomarkers to meet real-world demand.
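As a rough illustration of how AUC quantifies the discriminative power of a single gene, the following Python sketch computes AUC as the probability that a randomly chosen HCC sample has a higher expression value than a randomly chosen healthy sample, with ties counted as one half. This is a pure-Python stand-in written for illustration and is not the software used in the study. The function name `auc` and the expression values are hypothetical, and the sketch assumes the gene is overexpressed in HCC.

```python
def auc(scores_pos, scores_neg):
    """Pairwise (Mann-Whitney) estimate of AUC: the fraction of
    positive/negative pairs in which the positive sample scores higher,
    counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical expression values for one gene (not data from the study).
tumor = [8.1, 7.4, 6.9, 7.8]
normal = [5.2, 6.9, 5.8, 6.1]

print(auc(tumor, normal))
```

An AUC near 1 means the gene separates the two groups almost perfectly, whereas a value near 0.5 means it carries little discriminative information.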
Conclusion and future work direction
This work identified CDC20, TOP2A, CENPF, DLGAP5, UBE2C, and RACGAP1 as KGs for HCC using statistical and ML-based approaches. These KGs were closely linked to HCC and have been identified as biomarkers for diagnosis and as candidates for targeted therapy. This research will help readers as well as physicians to identify the molecular pathways associated with HCC. More experimental studies are necessary to validate these KGs for HCC.
We also plan to use alternative models, such as linear SVM, decision trees, and other machine learning algorithms, in our future studies to assess the discriminative ability of the selected key genes. We will also try to validate and verify the reliability of these key genes in future studies. Moreover, incorporating RNA-Seq data into our study will significantly broaden the scope of the research, offering deeper insights into gene expression regulation. We plan a subsequent study focused exclusively on RNA-Seq datasets for HCC. This future analysis will allow for a direct comparison of RNA-Seq with the platforms used in this study.
References
- 1. Parkin DM, Bray F, Ferlay J, Pisani P. Global cancer statistics, 2002. CA Cancer J Clin. 2005;55(2):74–108. pmid:15761078
- 2. Yang JD, Hainaut P, Gores GJ, Amadou A, Plymoth A, Roberts LR. A global view of hepatocellular carcinoma: trends, risk, prevention and management. Nat Rev Gastroenterol Hepatol. 2019;16(10):589–604. pmid:31439937
- 3. Kumar V, Abbas AK, Fausto N, Aster JC. Robbins and Cotran pathologic basis of disease, 9th ed. Elsevier Health Sciences; 2015.
- 4. Llovet JM, Kelley RK, Villanueva A, Singal AG, Pikarsky E, Roayaie S, et al. Hepatocellular carcinoma. Nat Rev Dis Primers. 2021;7(1):6–34. pmid:33479224
- 5. Akinyemiju T, Abera S, Ahmed M, Alam N, Alemayohu MA, Allen C, et al. The burden of primary liver cancer and underlying etiologies from 1990 to 2015 at the global, regional, and national level: results from the global burden of disease study 2015. JAMA Oncol. 2017;3(12):1683–1691. pmid:28983565
- 6. Zhang C, Peng L, Zhang Y, Liu Z, Li W, Chen S, et al. The identification of key genes and pathways in hepatocellular carcinoma by bioinformatics analysis of high-throughput data. Med Oncol. 2017;34(6):1–13. pmid:28432618
- 7. Choi DT, Davila JA, Sansgiry S, David E, Singh H, El-Serag HB, et al. Factors associated with delay of diagnosis of hepatocellular carcinoma in patients with cirrhosis. Clinical Gastroenterology and Hepatology. 2021;19(8):1679–1687. pmid:32693047
- 8. Cheo FY, Lim CHF, Chan KS, Shelat VG. The impact of waiting time and delayed treatment on the outcomes of patients with hepatocellular carcinoma: A systematic review and meta-analysis. Annals of hepato-biliary-pancreatic surgery. 2024;28(1):1–13. pmid:38092430
- 9. Maddah R, Shariati P, Arabpour J, Bazireh H, Shadpirouz M, Kafraj AS. Identification of critical genes and pathways associated with hepatocellular carcinoma and type 2 diabetes mellitus using integrated bioinformatics analysis. Inform Med Unlocked. 2022;30:100956–100963.
- 10. Yan G, Liu Z. Identification of differentially expressed genes in hepatocellular carcinoma by integrated bioinformatic analysis. bioRxiv. 2019; p. 570846–570874.
- 11. Qian Z, Yan Z, Zhengkui L. Mining of Gene Modules and Identification of Key Genes in Hepatocellular Carcinoma based on Gene Co-expression Network Analysis. In: Proc. 2020 12th Int. Conf. Bioinformatics Biomed. Technol.; 2020. p. 18–24.
- 12. Zhao Y, Xie Y. Study on Differential Expression Genes in HCC Based on GEO Database. In: Proc. 2021 Int. Conf. Bioinformatics Intell. Comput.; 2021. p. 63–69.
- 13. Liu J, Han F, Ding J, Liang X, Liu J, Huang D, et al. Identification of Multiple Hub Genes and Pathways in Hepatocellular Carcinoma: A Bioinformatics Analysis. Biomed Res Int. 2021;2021:1–11. pmid:34337056
- 14. Meng Z, Wu J, Liu X, Zhou W, Ni M, Liu S, et al. Identification of potential hub genes associated with the pathogenesis and prognosis of hepatocellular carcinoma via integrated bioinformatics analysis. J Int Med Res. 2020;48(7):1–23. pmid:32722976
- 15. Rosli AFC, Razak SRA, Zulkifle N. Bioinformatics analysis of differentially expressed genes in liver cancer for identification of key genes and pathways. Malays J Med Sci. 2019;15:18–24.
- 16. Li Y, Chen R, Yang J, Mo S, Quek K, Kok CH, et al. Integrated bioinformatics analysis reveals key candidate genes and pathways associated with clinical outcome in hepatocellular carcinoma. Front Genet. 2020;11:814–819. pmid:32849813
- 17. Li Z, Lin Y, Cheng B, Zhang Q, Cai Y. Identification and analysis of potential key genes associated with hepatocellular carcinoma based on integrated bioinformatics methods. Front Genet. 2021;12:571231–571245. pmid:33767726
- 18. Tian D, Yu Y, Zhang L, Sun J, Jiang W. A five-gene-based prognostic signature for hepatocellular carcinoma. Front Med. 2021;8:1–24. pmid:34568357
- 19. Wan Z, Zhang X, Luo Y, Zhao B. Identification of hepatocellular carcinoma-related potential genes and pathways through bioinformatic-based analyses. Genet Test Mol Biomarkers. 2019;23(11):766–777. pmid:31633428
- 20. Zhu Q, Sun Y, Zhou Q, He Q, Qian H. Identification of key genes and pathways by bioinformatics analysis with TCGA RNA sequencing data in hepatocellular carcinoma. Mol Clin Oncol. 2018;9(6):597–606. pmid:30546887
- 21. Wang J, Tian Y, Chen H, Li H, Zheng S. Key signaling pathways, genes and transcription factors associated with hepatocellular carcinoma. Mol Med Rep. 2018;17(6):8153–8160. pmid:29658607
- 22. Zhou L, Du Y, Kong L, Zhang X, Chen Q. Identification of molecular target genes and key pathways in hepatocellular carcinoma by bioinformatics analysis. Onco Targets Ther. 2018;11:1861–1869. pmid:29670361
- 23. Zhang P, Feng J, Wu X, Chu W, Zhang Y, Li P. Bioinformatics analysis of candidate genes and pathways related to hepatocellular carcinoma in China: A study based on public databases. Pathol Oncol Res. 2021;27:588532–588546. pmid:34257537
- 24. Mou T, Zhu D, Wei X, Li T, Zheng D, Pu J, et al. Identification and interaction analysis of key genes and microRNAs in hepatocellular carcinoma by bioinformatics analysis. World J Surg Oncol. 2017;15(1):1–9. pmid:28302149
- 25. Wu M, Liu Z, Li X, Zhang A, Lin D, Li N. Analysis of potential key genes in very early hepatocellular carcinoma. World J Surg Oncol. 2019;17(1):1–8.
- 26. Gui T, Dong X, Li R, Li Y, Wang Z. Identification of hepatocellular carcinoma-related genes with a machine learning and network analysis. J Comput Biol. 2015;22(1):63–71. pmid:25247452
- 27. Wang J, Peng R, Zhang Z, Zhang Y, Dai Y, Sun Y. Identification and validation of key genes in hepatocellular carcinoma by bioinformatics analysis. Biomed Res Int. 2021;2021:6662114–6662127. pmid:33688500
- 28. Lu H, Zhu Q. Identification of key biological processes, pathways, networks, and genes with potential prognostic values in hepatocellular carcinoma using a bioinformatics approach. Cancer Biother Radiopharm. 2021;36(10):837–849. pmid:32598174
- 29. Bhatt S, Singh P, Sharma A, Rai A, Dohare R, Sankhwar S, et al. Deciphering key genes and miRNAs associated with Hepatocellular carcinoma via network-based approach. IEEE/ACM Trans Comput Biol Bioinform. 2020;36(10):837–849.
- 30. Zhang Y, Lin Z, Lin X, Zhang X, Zhao Q, Sun Y. A gene module identification algorithm and its applications to identify gene modules and key genes of hepatocellular carcinoma. Sci Rep. 2021;11(1):1–14. pmid:33750838
- 31. Jiang X, Hao Y. Analysis of expression profile data identifies key genes and pathways in hepatocellular carcinoma. Oncol Lett. 2018;15(2):2625–2630. pmid:29434983
- 32. Zhang X, Luo X, Liu W, Shen A, et al. Identification of Hub Genes Associated With Hepatocellular Carcinoma Prognosis by Bioinformatics Analysis. J Int Med Res. 2021;12(04):186–207.
- 33. Wu M, Liu Z, Zhang A, Li N. Identification of key genes and pathways in hepatocellular carcinoma: A preliminary bioinformatics analysis. Medicine. 2019;98(5):1–7. pmid:30702595
- 34. Nguyen TB, Do DN, Nguyen-Thanh T, Tatipamula VB, Nguyen HT. Identification of Five Hub Genes as Key Prognostic Biomarkers in Liver Cancer via Integrated Bioinformatics Analysis. Biology. 2021;10(10):957–970. pmid:34681056
- 35. Zhou Z, Li Y, Hao H, Wang Y, Zhou Z, Wang Z, et al. Screening hub genes as prognostic biomarkers of hepatocellular carcinoma by bioinformatics analysis. Cell Transplant. 2019;28(1_suppl):76S–86S. pmid:31822116
- 36. Yu C, Chen F, Jiang J, Zhang H, Zhou M. Screening key genes and signaling pathways in colorectal cancer by integrated bioinformatics analysis. Mol Med Rep. 2019;20(2):1259–1269. pmid:31173250
- 37. Kakar M, Mehboob M, Akram M, Iqbal I, Ijaz H, Aziz U, et al. Identification of novel potential biomarkers in hepatocarcinoma cancer; a transcriptome analysis. Preprint. 2021; p. 1–21.
- 38. Ji Y, Yin Y, Zhang W. Integrated bioinformatic analysis identifies networks and promising biomarkers for hepatitis B virus-related hepatocellular carcinoma. Int J Genomics. 2020;2020:1–18. pmid:32775402
- 39. Chen D, Feng Z, Zhou M, Ren Z, Zhang F, Li Y. Bioinformatic Evidence Reveals that Cell Cycle Correlated Genes Drive the Communication between Tumor Cells and the Tumor Microenvironment and Impact the Outcomes of Hepatocellular Carcinoma. Biomed Res Int. 2021;2021:4092635–4092660. pmid:34746301
- 40. Qiang R, Zhao Z, Tang L, Wang Q, Wang Y, Huang Q. Identification of 5 hub genes related to the early diagnosis, tumour stage, and poor outcomes of hepatitis B virus-related hepatocellular carcinoma by bioinformatics analysis. Comput Math Methods Med. 2021;2021:1–20.
- 41. Wang J, Wang Y, Xu J, Song Q, Shangguan J, Xue M, et al. Global analysis of gene expression signature and diagnostic/prognostic biomarker identification of hepatocellular carcinoma. Sci Prog. 2021;104(3):1–7. pmid:34315286
- 42. Zhang Y, Tang Y, Guo C, Li G. Integrative analysis identifies key mRNA biomarkers for diagnosis, prognosis, and therapeutic targets of HCV-associated hepatocellular carcinoma. Aging (Albany NY). 2021;13(9):12865–12895. pmid:33946043
- 43. Kim SH, Hwang S, Song GW, Jung DH, Moon DB, Do Yang J, et al. Identification of key genes and carcinogenic pathways in hepatitis B virus-associated hepatocellular carcinoma through bioinformatics analysis. Ann Hepatobiliary Pancreat Surg. 2022;26(1):58–68. pmid:34907098
- 44. Zhang G, Kang Z, Mei H, Huang Z, Li H. Promising diagnostic and prognostic value of six genes in human hepatocellular carcinoma. Am J Transl Res. 2020;12(4):1239–1254. pmid:32355538
- 45. Sha M, Cao J, Zong ZP, Xu N, Zhang JJ, Tong Y, et al. Identification of genes predicting unfavorable prognosis in hepatitis B virus-associated hepatocellular carcinoma. Ann Transl Med. 2021;9(12):975–985. pmid:34277775
- 46. Chen H, Wu J, Lu L, Hu Z, Li X, Huang L, et al. Identification of hub genes associated with immune infiltration and predict prognosis in hepatocellular carcinoma via bioinformatics approaches. Front Genet. 2021;11:575762–575779. pmid:33505422
- 47. He B, Yin J, Gong S, Gu J, Xiao J, Shi W, et al. Bioinformatics analysis of key genes and pathways for hepatocellular carcinoma transformed from cirrhosis. Medicine. 2017;96(25):6938–6946. pmid:28640074
- 48. Zhang S, Peng R, Xin R, Shen X, Zheng J. Conjoint analysis for hepatic carcinoma with hub genes and multi-slice spiral CT. Medicine. 2020;99(45):e23099–e23110. pmid:33157984
- 49. Hu WQ, Wang W, Yin XF, et al. Identification of biological targets of therapeutic intervention for hepatocellular carcinoma by integrated bioinformatical analysis. Med Sci Monit. 2018;24:3450–3461. pmid:29795057
- 50. Zhang Q, Sun S, Zhu C, Zheng Y, Cai Q, Liang X, et al. Prediction and analysis of weighted genes in hepatocellular carcinoma using bioinformatics analysis. Mol Med Rep. 2019;19(4):2479–2488. pmid:30720105
- 51. Li N, Li L, Chen Y. The identification of core gene expression signature in hepatocellular carcinoma. Oxid Med Cell Longev. 2018;2018:1–15. pmid:29977454
- 52. Cao J, Zhang R, Zhang Y, Wang Y. Combined screening analysis of aberrantly methylated-differentially expressed genes and pathways in hepatocellular carcinoma. J Gastrointest Oncol. 2022;13(1):311–325. pmid:35284134
- 53. Yang L, Zeng Lf, Hong Gq, Luo Q, Lai X. Construction of a Novel Clinical Stage-Related Gene Signature for Predicting Outcome and Immune Response in Hepatocellular Carcinoma. J Immunol Res. 2022;2022:1–10. pmid:35865652
- 54. Wang M, Wang L, Wu S, Zhou D, Wang X. Identification of key genes and prognostic value analysis in hepatocellular carcinoma by integrated bioinformatics analysis. Int J Genomics. 2019;2019:1–22. pmid:31886163
- 55. Jiang N, Zhang X, Qin D, Yang J, Wu A, Wang L, et al. Identification of Core Genes Related to Progression and Prognosis of Hepatocellular Carcinoma and Small-Molecule Drug Predication. Front Genet. 2021;12:608017–608036. pmid:33708237
- 56. Li L, Lei Q, Zhang S, Kong L, Qin B. Screening and identification of key biomarkers in hepatocellular carcinoma: evidence from bioinformatic analysis. Oncol Rep. 2017;38(5):2607–2618. pmid:28901457
- 57. Xing T, Yan T, Zhou Q. Identification of key candidate genes and pathways in hepatocellular carcinoma by integrated bioinformatical analysis. Exp Ther Med. 2018;15(6):4932–4942. pmid:29805517
- 58. Zhu W, Xu J, Chen Z, Jiang J. Analyzing Roles of NUSAP1 From Clinical, Molecular Mechanism and Immune Perspectives in Hepatocellular Carcinoma. Front Genet. 2021;12:689159–689181. pmid:34354737
- 59. Dai Q, Liu T, Gao Y, Zhou H, Li X, Zhang W. Six genes involved in prognosis of hepatocellular carcinoma identified by Cox hazard regression. BMC Bioinformatics. 2021;22(1):1–12. pmid:33784984
- 60. Qing JB, Song WZ, Li CQ, Li YF. The Diagnostic and Predictive Significance of Immune-Related Genes and Immune Characteristics in the Occurrence and Progression of IgA Nephropathy. J Immunol Res. 2022;2022:1–20. pmid:35528619
- 61. Chen DL, Cai JH, Wang CC. Identification of Key Prognostic Genes of Triple Negative Breast Cancer by LASSO-Based Machine Learning and Bioinformatics Analysis. Genes. 2022;13(5):902–918. pmid:35627287
- 62. Wang SM, Ooi LLP, Hui KM. Identification and validation of a novel gene signature associated with the recurrence of human hepatocellular carcinoma. Clin Cancer Res. 2007;13(21):6275–6283. pmid:17975138
- 63. Sekhar V, Pollicino T, Diaz G, Engle RE, Alayli F, Melis M, et al. Infection with hepatitis C virus depends on TACSTD2, a regulator of claudin-1 and occludin highly downregulated in hepatocellular carcinoma. PLoS Pathog. 2018;14(3):e1006916–e1006946. pmid:29538454
- 64. Roessler S, Jia HL, Budhu A, Forgues M, Ye QH, Lee JS, et al. A unique metastasis gene signature enables prediction of tumor relapse in early-stage hepatocellular carcinoma patients. Cancer Res. 2010;70(24):10202–10212. pmid:21159642
- 65. Lim HY, Sohn I, Deng S, Lee J, Jung SH, Mao M, et al. Prediction of disease-free survival in hepatocellular carcinoma by gene expression profiling. Ann Surg Oncol. 2013;20(12):3747–3753. pmid:23800896
- 66. Kim JH, Sohn BH, Lee HS, Kim SB, Yoo JE, Park YY, et al. Genomic predictors for recurrence patterns of hepatocellular carcinoma: model derivation and validation. PLoS Med. 2014;11(12):e1001770–e1001786. pmid:25536056
- 67. Woo HG, Choi JH, Yoon S, Jee BA, Cho EJ, Lee JH, et al. Integrative analysis of genomic and epigenomic regulation of the transcriptome in liver cancer. Nat Commun. 2017;8(1):1–11. pmid:29018224
- 68. Villa E, Critelli R, Lei B, Marzocchi G, Camma C, Giannelli G, et al. Neoangiogenesis-related genes are hallmarks of fast-growing hepatocellular carcinomas and worst survival. Results from a prospective study. Gut. 2016;65(5):861–869. pmid:25666192
- 69. Shi J, Ye G, Zhao G, Wang X, Ye C, Thammavong K, et al. Coordinative control of G2/M phase of the cell cycle by non-coding RNAs in hepatocellular carcinoma. PeerJ. 2018;6:e5787–e5798. pmid:30364632
- 70. Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, et al. Toward a shared vision for cancer genomic data. N Engl J Med. 2016;375(12):1109–1112. pmid:27653561
- 71. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):1–13. pmid:25605792
- 72. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3(1):1–28. pmid:16646809
- 73. Carlson MR, Arora S, Obenchain V, Morgan M, et al. Genomic annotation resources in R/Bioconductor. In: Statistical Genomics. Springer; 2016. p. 67–90.
- 74. Cui Y, Wang L, Liang W, Huang L, Zhuang S, Shi H, et al. Identification and Validation of the Pyroptosis-Related Hub Gene Signature and the Associated Regulation Axis in Diabetic Keratopathy. Journal of Diabetes Research. 2024;2024(1):2920694. pmid:38529047
- 75. Sassi C, Nalls MA, Ridge PG, Gibbs JR, Lupton MK, Troakes C, et al. Mendelian adult-onset leukodystrophy genes in Alzheimer’s disease: critical influence of CSF1R and NOTCH3. Neurobiology of Aging. 2018;66:179–e17. pmid:29544907
- 76. Wickham H, Chang W, Henry L, Pedersen T, Takahashi K, Wilke C, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics (3.3.6) [Computer software]; 2022.
- 77. Gaujoux R, Seoighe C. Nmf: Algorithms and framework for nonnegative matrix factorization (nmf). R Package Version 020. 2015;6.
- 78. Kaur H, Dhall A, Kumar R, Raghava GP. Identification of platform-independent diagnostic biomarker panel for hepatocellular carcinoma using large-scale transcriptomics data. Frontiers in genetics. 2020;10:1306. pmid:31998366
- 79. Hasan MAM, Nasser M, Pal B, Ahmad S. Support vector machine and random forest modeling for intrusion detection system (IDS). J Intell Learn Syst Appl. 2014;2014:45–52.
- 80. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–287. pmid:22455463
- 81. Carlson M, Falcon S, Pages H, Li N. org.Hs.eg.db: Genome wide annotation for Human. R package version. 2019;3(2):3.
- 82. Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2016;45(D1):D362–D368. pmid:27924014
- 83. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–2504. pmid:14597658
- 84. Chin CH, Chen SH, Wu HH, Ho CW, Ko MT, Lin CY. cytoHubba: identifying hub objects and sub-networks from complex interactome. BMC Syst Biol. 2014;8(4):1–7. pmid:25521941
- 85. Li M, Zhao J, Yang R, Cai R, Liu X, Xie J, et al. CENPF as an independent prognostic and metastasis biomarker corresponding to CD4+ memory T cells in cutaneous melanoma. Cancer science. 2022;113(4):1220–1234. pmid:35189004
- 86. Hasan MAM, Maniruzzaman M, Shin J. Differentially expressed discriminative genes and significant meta-hub genes based key genes identification for hepatocellular carcinoma using statistical machine learning. Scientific Reports. 2023;13(1):3771. pmid:36882493
- 87. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12(1):1–8.
- 88. Therneau T, Lumley T. R survival package. R Core Team. 2013.
- 89. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34. pmid:30620402
- 90. Lai X, Wu Yk, Hong Gq, Li Jk, Luo Q, Yuan J, et al. A Novel Gene Signature Based on CDC20 and FCN3 for Prediction of Prognosis and Immune Features in Patients with Hepatocellular Carcinoma. J Immunol Res. 2022;2022:1–22. pmid:35402624
- 91. Guan L, Luo Q, Liang N, Liu H. A prognostic prediction system for hepatocellular carcinoma based on gene co-expression network. Exp Ther Med. 2019;17(6):4506–4516. pmid:31086582
- 92. Weinstein J. Cell cycle-regulated expression, phosphorylation, and degradation of p55Cdc: a mammalian homolog of CDC20/Fizzy/slp1. J Biol Chem. 1997;272(45):28501–28511. pmid:9353311
- 93. Wu Wj, Hu Ks, Wang Ds, Zeng Zl, Zhang Ds, Chen Dl, et al. CDC20 overexpression predicts a poor prognosis for patients with colorectal cancer. J Transl Med. 2013;11(1):1–8.
- 94. Tang J, Lu M, Cui Q, Zhang D, Kong D, Liao X, et al. Overexpression of ASPM, CDC20, and TTK confer a poorer prognosis in breast cancer identified by gene co-expression network analysis. Front Oncol. 2019;9:310–324. pmid:31106147
- 95. Li J, Gao JZ, Du JL, Huang ZX, Wei LX. Increased CDC20 expression is associated with development and progression of hepatocellular carcinoma. Int J Oncol. 2014;45(4):1547–1555. pmid:25069850
- 96. Zhang X, Zhang X, Li X, Bao H, Li G, Li N, et al. Connection between CDC20 expression and hepatocellular carcinoma prognosis. Med Sci Monit. 2021;27:e926760–e926765. pmid:33788826
- 97. Gao Y, Zhao H, Ren M, Chen Q, Li J, Li Z, et al. TOP2A promotes tumorigenesis of high-grade serous ovarian cancer by regulating the TGF-β/Smad pathway. J Cancer. 2020;11(14):4181–4192. pmid:32368301
- 98. Ma W, Wang B, Zhang Y, Wang Z, Niu D, Chen S, et al. Prognostic significance of TOP2A in non-small cell lung cancer revealed by bioinformatic analysis. Cancer Cell Int. 2019;19(1):1–17. pmid:31528121
- 99. Meng J, Wei Y, Deng Q, Li L, Li X. Study on the expression of TOP2A in hepatocellular carcinoma and its relationship with patient prognosis. Cancer Cell Int. 2022;22(1):1–18.
- 100. Huang Y, Chen X, Wang L, Wang T, Tang X, Su X. Centromere protein F (CENPF) serves as a potential prognostic biomarker and target for human hepatocellular carcinoma. J Cancer. 2021;12(10):2933–2951. pmid:33854594
- 101. Chen H, Wu F, Xu H, Wei G, Ding M, Xu F, et al. Centromere protein F promotes progression of hepatocellular carcinoma through ERK and cell cycle-associated pathways. Cancer Gene Ther. 2021; p. 1–10. pmid:34857915
- 102. Ho DWH, Lam WLM, Chan LK, Ng IOL. Investigation of Functional Synergism of CENPF and FOXM1 Identifies POLD1 as Downstream Target in Hepatocellular Carcinoma. Front Med. 2022;9:860395–860406. pmid:35865168
- 103. Zhang H, Liu Y, Tang S, Qin X, Li L, Zhou J, et al. Knockdown of DLGAP5 suppresses cell proliferation, induces G2/M phase arrest and apoptosis in ovarian cancer. Exp Ther Med. 2021;22(5):1–8. pmid:34539841
- 104. Liao W, Liu W, Yuan Q, Liu X, Ou Y, He S, et al. Silencing of DLGAP5 by siRNA significantly inhibits the proliferation and invasion of hepatocellular carcinoma cells. PLoS One. 2013;8(12):e80789–e80798. pmid:24324629
- 105. Xiong Y, Lu J, Fang Q, Lu Y, Xie C, Wu H, et al. UBE2C functions as a potential oncogene by enhancing cell proliferation, migration, invasion, and drug resistance in hepatocellular carcinoma cells. Biosci Rep. 2019;39(4):1–8. pmid:30914455
- 106. Zhou T, Wang Y, Qian D, Liang Q, Wang B. Over-expression of TOP2A as a prognostic biomarker in patients with glioma. Int J Clin Exp Pathol. 2018;11(3):1228–1237. pmid:31938217
- 107. Wang C, Wang W, Liu Y, Yong M, Yang Y, Zhou H. Rac GTPase activating protein 1 promotes oncogenic progression of epithelial ovarian cancer. Cancer Sci. 2018;109(1):84–93. pmid:29095547
- 108. Ren K, Zhou D, Wang M, Li E, Hou C, Su Y, et al. RACGAP1 modulates ECT2-Dependent mitochondrial quality control to drive breast cancer metastasis. Exp Cell Res. 2021;400(1):112493–112506. pmid:33485843
- 109. Liao S, Wang K, Zhang L, Shi G, Wang Z, Chen Z, et al. PRC1 and RACGAP1 are Diagnostic Biomarkers of Early HCC and PRC1 Drives Self-Renewal of Liver Cancer Stem Cells. Front Cell Dev Biol. 2022;10:864051–864066. pmid:35445033