The authors have declared that no competing interests exist.

Conceived and designed the experiments: JR JST DFR. Performed the experiments: JR DFR. Analyzed the data: JR DDA OS JST DFR. Contributed reagents/materials/analysis tools: DDA OS DFR. Wrote the paper: JR JST DFR.

Biomarker discovery aims to find small subsets of relevant variables in ‘omics data that correlate with the clinical syndromes of interest. Although clinical phenotypes are usually characterized by a complex set of clinical parameters, current computational approaches assume univariate targets, e.g. diagnostic classes, against which associations are sought. We propose an approach based on asymmetrical sparse canonical correlation analysis (SCCA) that finds multivariate correlations between the ‘omics measurements and the complex clinical phenotypes. We correlated plasma proteomics data to multivariate overlapping complex clinical phenotypes from tuberculosis and malaria datasets. We discovered relevant ‘omic biomarkers that have a high correlation to profiles of clinical measurements and are remarkably sparse, containing 1.5–3% of all ‘omic variables. We show that using clinical view projections improves diagnostic class prediction by up to 11% in tuberculosis and up to 5% in malaria. Our approach finds proteomic-biomarkers that correlate with complex combinations of clinical-biomarkers. Using the clinical-biomarkers improves the accuracy of diagnostic class prediction while not requiring the measurement of plasma proteomic profiles for each subject. Our approach makes it feasible to use ‘omics data to build accurate diagnostic algorithms that can be deployed to community health centres lacking the expensive ‘omics measurement capabilities.

Many infectious diseases such as tuberculosis and malaria are challenging both for scientists trying to understand the biochemical basis of the diseases and for medical doctors making a diagnosis. The challenges arise both from the dependence of the diseases on sets of proteins and from the complexity of the symptoms. Biomarkers denote small sets of measurements that correlate with the phenotype of interest. They have potential use both in advancing the basic biomedical research of infectious diseases and in facilitating predictive diagnostic tools. We propose a new method for biomarker discovery that works by finding canonical correlations between two sets of data, the plasma proteomic profiles and the clinical profiles of the subjects. We show that the method is able to find candidate proteomic biomarkers that correlate with combinations of clinical variables, called the clinical biomarkers. Using the clinical biomarkers improves the accuracy of diagnostic class prediction while not requiring the expensive plasma proteomic profiles to be measured for each subject.

The aim of biomarker discovery is to find small subsets of measurements in ‘omics data that correlate with the clinical syndromes or phenotypes of interest. Despite the fact that most clinical phenotypes (e.g. diseases) are characterized by a complex set of clinical parameters, with a variable degree of overlap, current computational approaches do not take into consideration the multivariate nature of the phenotypes. The challenges arise both from the dependence of the diseases on several proteins and from the complexity of the symptoms. To overcome this limitation, in our framework, the data to be analysed is represented by two views, namely a plasma proteomics profile, and a set of clinical data composed of patient history, signs, symptoms and clinical laboratory measurements of the individuals with syndromes of interest. This type of problem can be described as multivariate in both the views, and the aim is to discover a sparse set of ‘omic variables (proteomic-biomarkers) that correlates with a combination of clinical variables (clinical-biomarkers).

Given the typically high number of variables and small number of patient samples in clinical ‘omic studies, dimensionality reduction techniques such as principal component analysis (PCA) and canonical correlation analysis (CCA) have become popular. PCA allows one to discover a set of latent variables in the data that explain most of the variance, but these may not correlate with the clinical syndrome of interest. In contrast, CCA performs dimensionality reduction for two co-dependent datasets simultaneously so that the latent variables extracted from the two datasets are maximally correlated. Thus, the latent variables computed from one of the datasets can be used to predict the ones computed from the other, which is the basic goal in biomarker discovery. However, in both PCA and CCA, the latent variables depend on all variables, which hinders clinical interpretation as well as biomarker discovery and validation. To address these limitations, sparse variants of PCA (SPCA) and CCA (SCCA) have been independently developed.

Here we use our asymmetrical SCCA algorithm for the unsupervised discovery of candidate biomarkers by correlating plasma proteomics data to multivariate clinical parameters from two human infectious diseases, namely tuberculosis and malaria, which present with overlapping complex clinical phenotypes in the affected host.

Tuberculosis is the leading bacterial cause of death worldwide, with an estimated 8.8 million new cases of active disease and 1.6 million deaths per year.

Cerebral malaria (CM) and severe malarial anemia (SMA) are the major severe disease syndromes in African children with a high level of mortality in the under-five age group. The current WHO case definitions for severe malaria combine

To validate the biomarkers discovered by the SCCA approach, we study the prediction of diagnostic classes using the biomarkers. In particular, we study a scenario where the expensive proteomics data is available only when the models are trained, whilst at prediction time only clinical data and a previously learned biomarker model are available. We believe this is a realistic setup considering possible real-world deployment of decision support systems in resource-poor health care centres.

The active TB dataset

The childhood severe malaria dataset consists of data from 944 patients, organized in three parts: plasma proteome profiles measured by mass spectrometry (774 variables), clinical data (57 variables) and diagnostic classes (Cerebral Malaria (CM), Severe Malaria Anaemia (SMA), Uncomplicated Malaria (UM), Disease Control (DC) and Community Control (CC)).

Plasma was subjected to high-throughput proteomic profiling by mass spectrometry as previously described.

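The two views are given as matrices $X_a$ of size $n_a \times m$, $X_a = (x_a^{(1)},\dots,x_a^{(m)})$, and $X_b$ of size $n_b \times m$, $X_b = (x_b^{(1)},\dots,x_b^{(m)})$. The columns of $X_a$ and $X_b$ describe the same $m$ examples, so the data consist of pairs $(x_a^{(i)}, x_b^{(i)})$.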

Canonical correlation analysis (CCA) is a family of statistical algorithms designed for situations where there are two available views or measurements of the same phenomenon and the goal is to find latent variables that explain both views (‘the generating model’). CCA seeks projection directions $w_a$ and $w_b$ such that the scores $s_a^{(i)} = w_a^T x_a^{(i)}$ and $s_b^{(i)} = w_b^T x_b^{(i)}$ are maximally correlated. The formulation uses the covariance matrix between the views, $C_{ab} = X_a X_b^T$, and the kernel matrices $K_a = X_a^T X_a$ and $K_b = X_b^T X_b$, as well as dual variables $\alpha = (\alpha_1,\dots,\alpha_m)$ and $\beta = (\beta_1,\dots,\beta_m)$ giving weights to the examples in the two views, so that $w_a = X_a \alpha$ and $w_b = X_b \beta$.
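To make the idea of maximally correlated scores concrete, the following is a minimal primal CCA sketch in NumPy. It is not the implementation used in this work: the function names, the toy data, the fixed loadings and the small ridge term `reg` are all assumptions made for illustration. The columns of each view are the paired examples.

```python
import numpy as np


def cca_leading_pair(Xa, Xb, reg=1e-6):
    """Leading CCA directions for two views whose columns are paired examples.

    Xa has shape (n_a, m) and Xb has shape (n_b, m).
    `reg` is a small ridge term for numerical stability (an assumption).
    """
    Xa = Xa - Xa.mean(axis=1, keepdims=True)  # centre every variable
    Xb = Xb - Xb.mean(axis=1, keepdims=True)
    Caa = Xa @ Xa.T + reg * np.eye(Xa.shape[0])
    Cbb = Xb @ Xb.T + reg * np.eye(Xb.shape[0])
    Cab = Xa @ Xb.T
    La = np.linalg.cholesky(Caa)  # Caa = La La^T
    Lb = np.linalg.cholesky(Cbb)  # Cbb = Lb Lb^T
    # Whitened cross-covariance M = La^{-1} Cab Lb^{-T}
    M = np.linalg.solve(La, Cab)
    M = np.linalg.solve(Lb, M.T).T
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Map the top singular vectors back to the original coordinates
    wa = np.linalg.solve(La.T, U[:, 0])
    wb = np.linalg.solve(Lb.T, Vt[0])
    return wa, wb, S[0]  # S[0] is the leading canonical correlation


def make_toy_views(m=200, seed=0):
    """Hypothetical toy data: both views are noisy images of one latent signal z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(m)
    load_a = np.array([1.0, -0.5, 0.8, 0.0, 0.3])
    load_b = np.array([0.7, -1.2, 0.4])
    Xa = np.outer(load_a, z) + 0.1 * rng.standard_normal((5, m))
    Xb = np.outer(load_b, z) + 0.1 * rng.standard_normal((3, m))
    return Xa, Xb
```

Calling `cca_leading_pair(*make_toy_views())` returns one weight vector per view together with the correlation of the resulting scores `wa @ Xa` and `wb @ Xb`. Note that both weight vectors are dense, which is the interpretability problem that sparse variants address.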

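CCA can be applied to the data iteratively via deflation, that is, the projection of the data to the orthogonal complement of $(w_a, w_b)$. This yields a sequence $(w_a^{(1)}, w_b^{(1)}), (w_a^{(2)}, w_b^{(2)}), \dots, (w_a^{(r)}, w_b^{(r)})$ of projection directions for $X_a$ and $X_b$ that are orthogonal within each view, $(w_c^{(i)})^T w_c^{(j)} = 0$ for $i \neq j$ and $c \in \{a, b\}$, with $(w_a^{(1)}, w_b^{(1)})$ the leading pair. The directions are collected into the matrices $W_a = (w_a^{(1)},\dots,w_a^{(r)})$ and $W_b = (w_b^{(1)},\dots,w_b^{(r)})$.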

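Sparse canonical correlation analysis differs from kernel CCA (KCCA) in that it aims to find sparse projection directions, those that only depend on a small number of variables. We use an asymmetric formulation of SCCA, in which view a is projected with a primal weight vector $w$ and view b with a dual vector $e$ whose entries $e_k$, $k = 1,\dots,m$, weight the examples $x_b^{(i)}$.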

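The first term of the objective aims to make the projection scores $s_a = w_a^T X_a$ and $s_b = K_b e$ close to each other. One entry of the dual vector is fixed, $e_k = 1$, and the remaining entries $e_l$, $l \neq k$, are optimized.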
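The effect of a sparsity-inducing objective on the primal direction can be illustrated with a simplified stand-in. The sketch below is not the SCCA algorithm of this work: it assumes the scores of the other view are held fixed as a `target` vector, assumes an L1 penalty with weight `mu`, and uses plain iterative soft-thresholding (ISTA) as the solver. All names are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink every entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_primal_direction(Xa, target, mu=1.0, n_iter=500):
    """Minimise 0.5 * ||Xa^T w - target||^2 + mu * ||w||_1 with ISTA.

    Xa has shape (n_a, m), columns are examples.
    target has shape (m,) and plays the role of the fixed scores of the
    other view (an assumption of this sketch).
    """
    # Step size 1/L, where L is the largest eigenvalue of Xa Xa^T
    L = np.linalg.norm(Xa, 2) ** 2
    w = np.zeros(Xa.shape[0])
    for _ in range(n_iter):
        grad = Xa @ (Xa.T @ w - target)  # gradient of the squared-error term
        w = soft_threshold(w - grad / L, mu / L)
    return w
```

With a larger `mu`, more entries of `w` are driven exactly to zero, so the resulting score depends on fewer variables of view a, at the cost of a looser fit to `target`.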

SCCA can be applied iteratively to extract a series of projection directions by using one of two alternative approaches. In the deflation approach, as in CCA, deflation is used to arrive at a sequence $(w^{(1)},e^{(1)}), (w^{(2)},e^{(2)}),\dots,(w^{(r)},e^{(r)})$. In the second approach, a set $s_1,\dots,s_k$ is used to obtain a sequence $(w^{(1)},e^{(1)}), (w^{(2)},e^{(2)}),\dots,(w^{(k)},e^{(k)})$.

Without loss of generality, we assume that the sequence is sorted so that the pair $(w^{(1)},e^{(1)})$ attains the highest correlation.

The view $X_b$ is projected with $W_b$, giving $W_b^T x_b^{(i)}$ for each example. The Radial Basis Function (RBF) kernel

We compared the CCA-based methodology with two baselines: raw clinical data used as the input to an SVM without the biomarker model, and proteomics data processed with principal component analysis, where the first 30 principal components were used as the input to the SVM.
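As an illustration of the PCA baseline, the following sketch computes the scores on the first principal components with a singular value decomposition. The function name and the row-per-sample layout are assumptions made for this example (the views elsewhere in the text store examples as columns); the downstream classifier is not shown.

```python
import numpy as np


def pca_scores(X, k=30):
    """Project samples onto the first k principal components.

    X has shape (n_samples, n_features); the result has shape (n_samples, k).
    """
    Xc = X - X.mean(axis=0)  # centre every feature
    # Rows of Vt are the principal axes, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(k, Vt.shape[0])
    return Xc @ Vt[:k].T
```

The returned matrix would then replace the raw proteomics variables as the classifier input. Because PCA ignores the clinical view, nothing guarantees that these components carry the clinically relevant signal.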

Statistical significance of the results was estimated using randomization tests. In randomization, a background data distribution consistent with the null hypothesis is generated by simulation, where the statistical connection to be tested has been broken, but the data distribution is otherwise kept close to the original data. In the case of canonical correlation analysis, we want to test whether the correlation between the two views is significant, so we use the null hypothesis H0: the data $X_a$ and $X_b$ are uncorrelated, which is simulated by randomly permuting the examples of $X_b$.
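A randomization test of this kind can be sketched as follows. The sketch assumes that the connection between the views is broken by permuting the examples (columns) of one view, and it uses a deliberately simple stand-in statistic; the function names, the number of permutations and the statistic are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np


def permutation_p_value(Xa, Xb, statistic, n_perm=199, seed=0):
    """Estimate a p-value for `statistic` under the null of unrelated views.

    Xa and Xb have shapes (n_a, m) and (n_b, m); columns are paired examples.
    The pairing is destroyed by shuffling the columns of Xb.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(Xa, Xb)
    m = Xb.shape[1]
    at_least_as_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(m)
        if statistic(Xa, Xb[:, perm]) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed pairing itself as one draw
    return (at_least_as_extreme + 1) / (n_perm + 1)


def abs_correlation(Xa, Xb):
    """Toy statistic: |Pearson correlation| of the first variable in each view."""
    return abs(np.corrcoef(Xa[0], Xb[0])[0, 1])
```

In practice the statistic would be the correlation attained by the fitted projection pair, recomputed on every permuted data set. The smallest attainable p-value is `1 / (n_perm + 1)`, so the number of permutations bounds the reportable significance level.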

Firstly, we assessed the capabilities of SCCA for extracting biomarkers from data that is organized in two views, ‘omics data and clinical profile data. For this purpose, we used the leading pair of SCCA projection directions as it encapsulates the highest correlation between the ‘omics and clinical views. Secondly, we assessed the utility of the extracted biomarkers in predicting diagnostic classes. Specifically, we studied a scenario where we assume that ‘omics data is available only in the training phase and not in the prediction phase.

The leading pair of projection directions extracted from the proteomics (view

The x-axis gives the score in the proteomics profile while the y-axis gives the score in the clinical profile. Data points labeled based on the three diagnostic classes Active TB (red), Symptomatic Control (green) and Asymptomatic Control (blue). Ellipses denote the mean and covariance of the class clusters.

The proteomic profile, given by the weights in the leading projection direction $w_a^{(1)}$, and the clinical profile, given by $w_b^{(1)}$

In the plasma proteomic profile the proteins at mass peaks

$w_a^{(1)}$, whilst all variables were present in the clinical profile $w_b^{(1)}$. The highest weights in the clinical profile were

The x-axis gives the score in the proteomics profile while the y-axis gives the score in the clinical profile. Data points labeled based on the four diagnostic classes Community Controls (black), Uncomplicated Malaria (purple), Severe Malaria Anaemia (blue) and Cerebral Malaria (red). Ellipses denote the mean and covariance of the class clusters.

| Comparison | PCA-Proteomics (ACC ± s.d.) | Raw-Clinical (ACC ± s.d.) | K+SCCA-Clinical (ACC ± s.d.) |
|---|---|---|---|
| Active TB vs. Symptomatic Latent TB | 0.77±0.07 | 0.87±0.07 | 0.86±0.05 |
| Active TB vs. Symptomatic No-Latent TB | 0.76±0.05 | 0.90±0.04 | 0.91±0.04 |
| Active TB vs. No-Symptomatic Latent TB | 0.98±0.03 | 0.92±0.08 | 0.92±0.08 |
| Active TB vs. No-Symptomatic No-Latent TB | 0.99±0.02 | 0.94±0.04 | 0.94±0.04 |
| Symptomatic Latent-TB vs. Symptomatic No-Latent TB | 0.52±0.18 | 0.68±0.16 | 0.70±0.15 |
| Symptomatic Latent-TB vs. No-Symptomatic Latent TB | 1.00±0 | 0.74±0.14 | 0.75±0.15 |
| Symptomatic Latent-TB vs. No-Symptomatic No-Latent TB | 1.00±0 | 0.79±0.11 | 0.90±0.11 |
| Symptomatic No-Latent TB vs. No-Symptomatic Latent TB | 1.00±0 | 0.90±0.09 | 0.87±0.06 |
| Symptomatic No-Latent TB vs. No-Symptomatic No-Latent TB | 0.99±0.02 | 0.77±0.13 | 0.85±0.05 |
| No-Symptomatic Latent TB vs. No-Symptomatic No-Latent TB | 0.59±0.10 | 0.78±0.11 | 0.85±0.10 |

| Comparison | PCA-Proteomics (ACC ± s.d.) | Raw-Clinical (ACC ± s.d.) | K+SCCA-Clinical (ACC ± s.d.) |
|---|---|---|---|
| CC vs. CM | 0.94±0.06 | 0.98±0.02 | 0.96±0.03 |
| CC vs. DC | 0.91±0.04 | 0.97±0.03 | 0.94±0.04 |
| CC vs. SMA | 0.98±0.04 | 0.99±0.02 | 0.99±0.02 |
| CC vs. UM | 0.92±0.03 | 0.99±0.01 | 0.99±0.02 |
| CM vs. DC | 0.75±0.08 | 0.99±0.02 | 0.88±0.06 |
| CM vs. SMA | 0.61±0.12 | 0.88±0.07 | 0.76±0.08 |
| CM vs. UM | 0.73±0.06 | 0.93±0.04 | 0.89±0.04 |
| DC vs. SMA | 0.79±0.09 | 0.92±0.05 | 0.97±0.03 |
| DC vs. UM | 0.79±0.06 | 1.00±0 | 0.99±0.02 |
| SMA vs. UM | 0.71±0.07 | 0.85±0.05 | 0.90±0.05 |

In this paper, we have put forward an approach for discovering biomarkers from plasma proteomic data by canonical correlation to clinical data collected from the same subjects. We have also shown an approach to predict diagnostic classes based on the selected biomarkers.

We analysed a set of data consisting of plasma proteome and clinical profiles. Sparse canonical correlation analysis was shown to be effective in extracting small sets of proteomic variables, each representing a plasma protein, that correlate with clusters of similar clinical phenotypes in a statistically significant manner (p-value 0.01). The sparsity of the extracted biomarker models is shown by the fact that 1.5% and 3% of the proteomics variables had non-negligible coefficients in the malaria and TB models, respectively. The sparsity of the ‘omic view of the SCCA model is crucial for interpretation by human experts, as the set of proteomic variables to be studied remains tractable. Unlike SCCA, the KCCA method does not impose sparsity, so KCCA models are less amenable to human analysis.

In diagnostic class prediction, we show that via canonical correlation analysis, it is possible to make use of proteomic data in order to improve on the diagnostic classification, even if no proteomics data is available at the time of prediction, only at the time of training the model. This is a close match to a real-world scenario of deploying a diagnostic tool to health care centres without expensive ‘omics measurement capabilities.

In our experiments, the proposed approach appears to be advantageous (a) when proteomics data contains a strong signal predictive of the classification (i.e. PCA-Proteomics accuracy higher than Raw-Clinical) that can be mediated by the K+SCCA model, or (b) when proteomics data alone does not predict well (i.e. PCA-Proteomics accuracy lower than Raw-Clinical) but there is a synergistic latent signal between the proteomic and clinical profiles that K+SCCA can pick up. In the case of TB, a strong proteomics signal (case a) is found in six of the ten comparisons. In three of those six cases K+SCCA matches the performance of Raw-Clinical, in two it improves on Raw-Clinical, and in one it marginally loses. Evidence of a synergistic signal (case b) is found in the remaining four comparisons, in three of which K+SCCA matches Raw-Clinical and in one of which it exceeds the accuracy of Raw-Clinical: determining the presence of latent TB when there are no symptoms (85% accuracy versus 78% with clinical data alone). In malaria, we first note that the clinical data is very strong: there are no comparisons where PCA-Proteomics exceeds the accuracy of Raw-Clinical (i.e. no instances of case a). Thus, in this data set, K+SCCA is required to pick out a synergistic latent signal (case b) between the proteomic and clinical variables in order to improve on the predictions from clinical data alone. This appears to take place in two experiments: the separation of severe malaria anemia from uncomplicated malaria and from disease controls. In the comparisons involving cerebral malaria (CM), we note that proteomics data seems to be weak in three of the four cases, and it seems that K+SCCA is hampered by this: although it significantly improves over PCA-Proteomics, it loses out to Raw-Clinical.

Another observation from the experiments is that K+SCCA benefits the prediction of the difficult class pairs more than the easier ones: in all of the comparisons where Raw-Clinical accuracy is below 85%, K+SCCA improves on the Raw-Clinical model.
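The counts behind these observations can be re-derived directly from the two accuracy tables. The short script below is only a bookkeeping aid: the accuracies are copied from the tables as integer percentages, the standard deviations are ignored, and the function names are hypothetical.

```python
# Rows: (PCA-Proteomics, Raw-Clinical, K+SCCA-Clinical) accuracy in percent,
# listed in the same order as the rows of the two accuracy tables.
TB = [
    (77, 87, 86), (76, 90, 91), (98, 92, 92), (99, 94, 94), (52, 68, 70),
    (100, 74, 75), (100, 79, 90), (100, 90, 87), (99, 77, 85), (59, 78, 85),
]
MALARIA = [
    (94, 98, 96), (91, 97, 94), (98, 99, 99), (92, 99, 99), (75, 99, 88),
    (61, 88, 76), (73, 93, 89), (79, 92, 97), (79, 100, 99), (71, 85, 90),
]


def case_counts(rows):
    """Count case (a) PCA-Proteomics > Raw-Clinical and case (b) PCA-Proteomics < Raw-Clinical."""
    case_a = sum(1 for pca, raw, _ in rows if pca > raw)
    case_b = sum(1 for pca, raw, _ in rows if pca < raw)
    return case_a, case_b


def hard_pairs_improved(rows, threshold=85):
    """Return (number of pairs with Raw-Clinical below threshold, how many K+SCCA improves)."""
    hard = [(raw, kscca) for _, raw, kscca in rows if raw < threshold]
    improved = sum(1 for raw, kscca in hard if kscca > raw)
    return len(hard), improved


print("TB case (a), case (b):", case_counts(TB))
print("Malaria case (a), case (b):", case_counts(MALARIA))
print("TB hard pairs, improved:", hard_pairs_improved(TB))
print("Malaria hard pairs, improved:", hard_pairs_improved(MALARIA))
```

For TB this gives six comparisons in case (a) and four in case (b), and all five comparisons with Raw-Clinical accuracy below 85% are improved by K+SCCA. In malaria no comparison falls in case (a), and no comparison has Raw-Clinical accuracy below 85%.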

Finally, we note that the canonical correlation analysis can also be used in the opposite way, namely to use the clinical data to extract more predictive biomarkers from proteomics data, and thus enhance the understanding of the systems biology underlying the complex phenotypes. Although this particular application was not the main focus in the present work, in our experiments the K+SCCA method often improved over the accuracy obtained with proteomics data alone.