mixOmics: An R package for ‘omics feature selection and multiple data integration

doi:10.1371/journal.pcbi.1005752

Fig 1.

Overview of the mixOmics multivariate methods for single and integrative ‘omics supervised analyses.

X denote a predictor ‘omics data set, and y a categorical outcome response (e.g. healthy vs. sick). Integrative analyses include N-integration with DIABLO (the same N samples are measured on different ‘omics platforms), and P-integration with MINT (the same P ‘omics predictors are measured in several independent studies). Sample plots depicted here use the mixOmics functions (from left to right) plotIndiv, plotArrow and plotIndiv in 3D; variable plots use the mixOmics functions network, cim, plotLoadings, plotVar and circosPlot. The graphical output functions are detailed in Supporting Information S1 Text.

More »

Expand

Table 1.

Summary of the eighteen multivariate projection-based methods available in mixOmics version 6.0.0 or above for different types of analysis frameworks.

Note that our block.pls/plsda and sparse variants differ from the approaches from [28–31]. The wrappers for rgcca and sgcca are originally from the RGCCA package [32] but the argument inputs were further improved for mixOmics.

More »

Expand

Fig 2.

Prediction area visualisation on the Small Round Blue Cell Tumors data (SRBCT [35]) data, described in the Results Section, with respect to the prediction distance.

From left to right: ‘maximum distance’, ‘Centroid distance’ and ‘Mahalanobis distance’. Sample prediction area plots from a PLS-DA model applied on a microarray data set with the expression levels of 2,308 genes on 63 samples. Samples are classified into four classes: Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), Neuroblastoma (NB), and Rhabdomyosarcoma (RMS).

More »

Expand

Table 2.

Example of computational time for the data sets presented in the Results section with a macbook pro 2013, 2.6GHz, 16Go Ram.

More »

Expand

Fig 3.

Illustration of a single ‘omics analysis with mixOmics.

A) Unsupervised preliminary analysis with PCA, A1: PCA sample plot, A2: percentage of explained variance per component. B) Supervised analysis with PLS-DA, B1: PLS-DA sample plot with confidence ellipse plots, B2: classification performance per component (overall and BER) for three prediction distances using repeated stratified cross-validation (10×5-fold CV). C) Supervised analysis and feature selection with sparse PLS-DA, C1: sPLS-DA sample plot with confidence ellipse plots, C2: arrow plot representing each sample pointing towards its outcome category, see more details in Supporting Information S1 Text. C3: Clustered Image Map (Euclidean Distance, Complete linkage) where samples are represented in rows and selected features in columns (10, 300 and 30 genes selected on each component respectively), C4: ROC curve and AUC averaged using one-vs-all comparisons.

More »

Expand

Fig 4.

Illustration of N-integrative supervised analysis with DIABLO.

A: sample plot per data set, B: sample scatterplot from plotDiablo displaying the first component in each data set (upper diagonal plot) and Pearson correlation between each component (lower diagonal plot). C: Clustered Image Map (Euclidean distance, Complete linkage) of the multi-omics signature. Samples are represented in rows, selected features on the first component in columns. D: Circos plot shows the positive (negative) correlation (r > 0.7) between selected features as indicated by the brown (black) links, feature names appear in the quadrants, E: Correlation Circle plot representing each type of selected features, F: relevance network visualisation of the selected features.

More »

Expand

Fig 5.

Illustration of MINT analysis in mixOmics.

A: Parameter tuning of a MINT sPLS-DA model with two components using Leave-One-Group-Out cross-validation and maximum distance, BER (y-axis) with respect to number of selected features (x-axis). Full diamond represents the optimal number of features to select on each component, B: Performance of the final MINT sPLS-DA model including selected features based on BER and classification error rate per class, C: Global sample plot with confidence ellipse plots, D: Study specific sample plot, E: Clustered Image Map (Euclidean Distance, Complete linkage). Samples are represented in rows, selected features on the first component in columns. F: Loading plot of each feature selected on the first component in each study, with color indicating the class with a maximal mean expression value for each gene.

More »

Expand