Fig 1.
The 3Mont pipeline contains four main components: ProG (pro-grouping), FIS (feature importance scoring), S (pro-group scoring), and M (model creation). The features are grouped and groups are arranged based on the shared gene lists among them. Following the normalization of the pro-group sizes using the FIS component, the pro-groups are ordered according to their S component scores. The highest scoring pro-groups are selected to build ML models for classifying BRCA molecular subtypes.
Fig 2.
Internal cross-validation step of S component in 3Mont.
The expression profiles of each feature and the class labels in the pro-groups (represented as a two-class dataset, aggregated from 3 omics datasets) are given as input to the S component. Each dataset is further split into internal training (90%) and internal testing (10%) datasets (encoded by shades of gray). Random splits are repeated 10 times, and the mean accuracy is assigned as the score for each pro-group.
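The scoring loop described in this caption can be sketched as repeated random 90/10 splits whose test accuracies are averaged. The sketch below is illustrative only: the caption does not name the classifier, so a simple nearest-centroid classifier is used as a stand-in, and the function names (`score_pro_group`, `nearest_centroid_fit`, `nearest_centroid_predict`) and the `seed` parameter are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_fit(X, y):
    """Compute one mean vector per class (stand-in classifier, an assumption)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def score_pro_group(X, y, n_repeats=10, test_fraction=0.10, seed=0):
    """Mean test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, round(n * test_fraction))
    accuracies = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random split on every repeat
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        centroids = nearest_centroid_fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        hits = sum(
            nearest_centroid_predict(centroids, X[i]) == y[i] for i in test_idx
        )
        accuracies.append(hits / n_test)
    return mean(accuracies)
```

Here `X` would hold the expression values of the features in one pro-group (one row per sample) and `y` the two class labels; the returned mean accuracy plays the role of the pro-group score.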
Fig 3.
Distribution of pro-group sizes before and after selecting the top 10 features.
The size indicates the total number of features within a pro-group. The Gini importance function is used to rank features and the best features in each pro-group are chosen based on their Feature Importance Scores (FIS). In the right panel, a predetermined cut-off value of 10 is used.
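As a minimal sketch of the size normalization described here, the function below keeps the highest-scoring features of a pro-group, up to a cut-off of 10. It assumes the importance scores have already been computed (for example, Gini importances from a tree-based model); the function name `trim_pro_group` and the feature names in the usage note are hypothetical.

```python
def trim_pro_group(importances, k=10):
    """Return the names of the k features with the highest importance scores.

    `importances` maps feature name -> precomputed importance score
    (assumed to be given; how it is computed is outside this sketch).
    Groups with fewer than k features are returned unchanged in size.
    """
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

For example, `trim_pro_group({"mRNA_A": 0.4, "miRNA_B": 0.1, "CpG_C": 0.5}, k=2)` keeps `CpG_C` and `mRNA_A`.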
Table 1.
Performance evaluation metrics of 3Mont across different cancer types. The total sample size column lists the numbers of the tumor and control samples, respectively. A 1:1 sampling ratio is applied in each iteration of the cross-validation step to prevent class imbalance.
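One way to obtain a 1:1 class ratio in each iteration is random undersampling of the larger class. The caption does not state how the ratio is achieved, so the sketch below is an assumption, and the function name `balanced_sample` and the `seed` parameter are hypothetical.

```python
import random


def balanced_sample(labels, seed=0):
    """Return shuffled sample indices with equal counts for every class.

    Assumption: balance is reached by randomly undersampling each class
    down to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    n_min = min(len(ix) for ix in by_class.values())
    chosen = []
    for ix in by_class.values():
        chosen.extend(rng.sample(ix, n_min))  # draw without replacement
    rng.shuffle(chosen)
    return chosen
```

Calling it with a different `seed` in each cross-validation iteration would draw a different balanced subset each time.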
Table 2.
The 3Mont summary statistics for the top 10 most significant pro-groups identified for the BRCA molecular subtype (HR+ and HR-) dataset. These statistics include the Frequency of group (the number of times the pro-group appears), the Average Score (the score assigned in the S component), and the Robust Rank Aggregation (RRA) score over 10 iterations. The genes regulated by the methylated CpG site and the pro-group-associated features across -omics datasets are listed in the last two columns, respectively.
Fig 4.
Generated network illustrating the pro-groups and their associated features that are identified by 3Mont for differentiating among the BRCA molecular subtypes (HR+/HR- cases). The network visualizes associations among mRNAs, CpG IDs, and miRNAs within the pro-groups. The node size represents the scaled average score of each associated feature within the top 10 pro-groups. The scores are obtained from the Average Score column of Table 2. Different colors represent distinct communities (clusters) detected by the community detection algorithm.
Fig 5.
Comparative performance evaluation of 3Mont with other feature selection algorithms over 10 iterations.
The average performance metrics of the best scoring groups, along with standard deviations over iterations, are shown in panels labeled accuracy, area under the ROC curve, sensitivity, and specificity. All algorithms use mRNA, miRNA, and methylation data, but 3Mint only uses mRNA expression in training and testing the classifier. Abbreviations: SKB: SelectKBest, FCBF: Fast Correlation Based Filter, IG: Information Gain, CMIM: Conditional Mutual Information Maximization, MRMR: Minimum Redundancy Maximum Relevance.