Table 1.
Microarray samples (biological replicates), platforms and GEO accession numbers.
Table 2.
The 26 feature selection methods.
Fig 1.
Data partition and aggregation procedures.
The data are randomly partitioned into mutually exclusive sets P1, P2, P3, P4 and P5. Feature selection is performed in each partition, yielding one feature subset per partition. Frequency-based aggregation then adds the most frequent features from these subsets one at a time and stops when the performance of a mining algorithm starts to decrease. The result is a single ensemble subset.
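The aggregation step described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `frequency_aggregate`, the `score` callback (standing in for the performance of the mining algorithm) and the alphabetical tie-breaking rule are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Sequence


def frequency_aggregate(
    subsets: Iterable[Sequence[str]],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Greedy frequency-based aggregation of per-partition feature subsets.

    `score` is a hypothetical callback returning the performance of the
    mining algorithm for a candidate feature list (higher is better).
    """
    # Count in how many partition subsets each feature appears.
    counts = Counter(f for subset in subsets for f in set(subset))
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda f: (-counts[f], f))

    selected: List[str] = []
    best = float("-inf")
    for feature in ranked:
        candidate = selected + [feature]
        current = score(candidate)
        if current < best:
            # Performance started to decrease: stop adding features.
            break
        selected, best = candidate, current
    return selected
```

In practice `score` would wrap a cross-validated classifier or a cluster validity index evaluated on the candidate features; any such choice is outside what the caption specifies.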
Fig 2.
Tree structure in which each stage of the disease forms a single cluster. The RFE_clust_Dunn algorithm selected the variables used as input to pvclust [43], which performed the hierarchical clustering.
Fig 3.
Mouse and human HCC clustering.
The gene expression data of human HCC of mixed etiologies were integrated with HCC samples from the GNMT and MAT1A mouse KO models of HCC derived from NAFLD by selecting the orthologous genes using the HomoloGene database. The integrated data hold 1691 genes, obtained by matching the orthologous genes among the genes with two-fold regulation in at least 9 samples of the human HCC series, the 15-month MAT1A KO and the 8-month GNMT KO mouse models. Complete-linkage hierarchical clustering with Pearson correlation distinguishes clusters A and B, which differ significantly in survival length, with the mouse models lying together in cluster A.
Fig 4.
Survival signature common to human and mouse in an independent HCC dataset. Complete-linkage hierarchical clustering with Pearson correlation as the similarity measure, applied to the expression values of the genes composing the signature, renders 3 main clusters (A, C and B) representing HCC subtypes of differential survival.
Table 3.
5-fold cross-validation classification performance, stability calculated as the Average Normalized Hamming Distance (ANHD), and number of selected genes in the signatures of NAFLD progression from smoothed and raw data.
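To illustrate the stability measure named in this caption, the sketch below computes an average normalized Hamming distance between the gene subsets selected in different cross-validation folds. It assumes a common definition: each subset is encoded as a binary mask over all genes, the Hamming distance between two masks is divided by the total number of genes, and the result is averaged over all pairs of subsets. The exact formula used for Table 3 is not stated here, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Collection, Sequence


def average_normalized_hamming(
    subsets: Sequence[Collection[str]], n_features: int
) -> float:
    """Mean pairwise normalized Hamming distance between feature subsets.

    0.0 means every run selected the same features; larger values mean
    less stable selection. Requires at least two subsets.
    """
    sets = [set(s) for s in subsets]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    # The Hamming distance between two binary masks equals the size of the
    # symmetric difference of the corresponding sets.
    total = sum(len(a ^ b) / n_features for a, b in pairs)
    return total / len(pairs)
```

Under this definition, lower values indicate that the folds agree more closely on which genes to select.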
Fig 5.
Enriched KEGG pathway signatures selected by the two supervised clustering-based feature selection methods that produced the optimal clustering result on smoothed data and by the two ensemble signatures derived from 14 feature selection algorithms from raw and smoothed data used to build the signatures of NAFLD progression.
KEGG enrichment analysis was performed on the genes selected in the 5 feature selection runs of the external 5-fold cross-validation procedure, and pathways with a significant p-value (p < 0.05) were selected.
Table 4.
Unique ensemble gene survival signature common to human and mouse, resulting from the frequency-based aggregation of the signatures produced by the 5 feature selection methods.
Table 5.
Survival signature of pathways common to human and mouse, resulting from the signatures produced by the 5 runs of the 5 feature selection methods.
Fig 6.
Kaplan-Meier plots showing the survival probability over time (days) of the 3 main clusters representing HCC subtypes of differential survival. The clusters were found in the independent HCC dataset by clustering the expression values of the genes composing the survival signature common to human and mouse.
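For readers unfamiliar with how such curves are obtained, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator for one cluster. It is illustrative only: the function name and the input layout (follow-up times in days plus a flag marking observed deaths versus censored samples) are assumptions, and the software used to produce the figure is not specified here.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    times: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    `events[i]` is True if the death was observed at `times[i]` and False
    if the sample was censored at that time.
    """
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(times)):
        # Samples still under observation just before time t.
        at_risk = sum(1 for x in times if x >= t)
        # Observed deaths exactly at time t (censored samples excluded).
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve
```

Running the estimator separately on the samples of each cluster gives one step curve per cluster, which is what a plot of this kind overlays.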