Fig 1.
Overview of the mixOmics multivariate methods for single and integrative ‘omics supervised analyses.
X denote a predictor ‘omics data set, and y a categorical outcome response (e.g. healthy vs. sick). Integrative analyses include N-integration with DIABLO (the same N samples are measured on different ‘omics platforms), and P-integration with MINT (the same P ‘omics predictors are measured in several independent studies). Sample plots depicted here use the mixOmics functions (from left to right) plotIndiv, plotArrow and plotIndiv in 3D; variable plots use the mixOmics functions network, cim, plotLoadings, plotVar and circosPlot. The graphical output functions are detailed in Supporting Information S1 Text.
Table 1.
Summary of the eighteen multivariate projection-based methods available in mixOmics version 6.0.0 or above for different types of analysis frameworks.
Note that our block.pls/plsda and sparse variants differ from the approaches from [28–31]. The wrappers for rgcca and sgcca are originally from the RGCCA package [32] but the argument inputs were further improved for mixOmics.
Fig 2.
Prediction area visualisation on the Small Round Blue Cell Tumors data (SRBCT [35]) data, described in the Results Section, with respect to the prediction distance.
From left to right: ‘maximum distance’, ‘Centroid distance’ and ‘Mahalanobis distance’. Sample prediction area plots from a PLS-DA model applied on a microarray data set with the expression levels of 2,308 genes on 63 samples. Samples are classified into four classes: Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), Neuroblastoma (NB), and Rhabdomyosarcoma (RMS).
Table 2.
Example of computational time for the data sets presented in the Results section with a macbook pro 2013, 2.6GHz, 16Go Ram.
Fig 3.
Illustration of a single ‘omics analysis with mixOmics.
A) Unsupervised preliminary analysis with PCA, A1: PCA sample plot, A2: percentage of explained variance per component. B) Supervised analysis with PLS-DA, B1: PLS-DA sample plot with confidence ellipse plots, B2: classification performance per component (overall and BER) for three prediction distances using repeated stratified cross-validation (10×5-fold CV). C) Supervised analysis and feature selection with sparse PLS-DA, C1: sPLS-DA sample plot with confidence ellipse plots, C2: arrow plot representing each sample pointing towards its outcome category, see more details in Supporting Information S1 Text. C3: Clustered Image Map (Euclidean Distance, Complete linkage) where samples are represented in rows and selected features in columns (10, 300 and 30 genes selected on each component respectively), C4: ROC curve and AUC averaged using one-vs-all comparisons.
Fig 4.
Illustration of N-integrative supervised analysis with DIABLO.
A: sample plot per data set, B: sample scatterplot from plotDiablo displaying the first component in each data set (upper diagonal plot) and Pearson correlation between each component (lower diagonal plot). C: Clustered Image Map (Euclidean distance, Complete linkage) of the multi-omics signature. Samples are represented in rows, selected features on the first component in columns. D: Circos plot shows the positive (negative) correlation (r > 0.7) between selected features as indicated by the brown (black) links, feature names appear in the quadrants, E: Correlation Circle plot representing each type of selected features, F: relevance network visualisation of the selected features.
Fig 5.
Illustration of MINT analysis in mixOmics.
A: Parameter tuning of a MINT sPLS-DA model with two components using Leave-One-Group-Out cross-validation and maximum distance, BER (y-axis) with respect to number of selected features (x-axis). Full diamond represents the optimal number of features to select on each component, B: Performance of the final MINT sPLS-DA model including selected features based on BER and classification error rate per class, C: Global sample plot with confidence ellipse plots, D: Study specific sample plot, E: Clustered Image Map (Euclidean Distance, Complete linkage). Samples are represented in rows, selected features on the first component in columns. F: Loading plot of each feature selected on the first component in each study, with color indicating the class with a maximal mean expression value for each gene.