Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics

Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet, there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises nearest-neighbour or support vector machine classifiers to integrate heterogeneous data sources and considerably improve the quantity and quality of sub-cellular protein assignments. We demonstrate the utility of our algorithms through evaluation of five experimental datasets, from four different species, in conjunction with four different auxiliary data sources to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis.

S5 File.

Table A. T-test results for the mouse dataset. P values from an unpaired two-sample t-test (with unequal variance) used to determine whether the population means between the k-NN TL and SVM TL methods are significantly different from one another for each sub-cellular class in the mouse stem cell dataset.

Sub-cellular class           P value
40S ribosome                 4e-12
60S ribosome                 3e-07
Cytosol                      3e-10
Endoplasmic reticulum        4e-05
Lysosome                     3e-19
Mitochondrion                7e-10
Nucleus - Chromatin          1e-01
Nucleus - Non-chromatin      3e-01
Plasma membrane              1e-04
Proteasome                   6e-08

Table B. T-test results for the human dataset. P values from an unpaired two-sample t-test (with unequal variance) used to determine whether the population means between the k-NN TL and SVM TL methods are significantly different from one another for each sub-cellular class in the human dataset.

Table C. T-test results for the callus dataset. P values from an unpaired two-sample t-test (with unequal variance) used to determine whether the population means between the k-NN TL and SVM TL methods are significantly different from one another for each sub-cellular class in the plant callus dataset.

Table E. T-test results for the fly dataset. P values from an unpaired two-sample t-test (with unequal variance) used to determine whether the population means between the k-NN TL and SVM TL methods are significantly different from one another for each sub-cellular class in the fly dataset.
k-NN transfer learning: Wu's original method

In Wu and Dietterich's original application of transfer learning (TL) [1], the k-NN TL classifier only allowed weighting by data source, and not on a per-class, per-data-source basis. We have extended the method by incorporating a multi-class, multi-data weighting schema that allows the integration of heterogeneous data types.
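The per-class weighted combination at the heart of this extension can be sketched as follows. This is an illustrative Python sketch, not the pRoloc implementation: the function names `knn_votes` and `tl_predict` and the weight convention (weight on the primary source, with the auxiliary source receiving the complement) are assumptions for the purpose of the example.

```python
from collections import Counter

def knn_votes(neighbour_labels):
    """Fraction of the k nearest neighbours voting for each class."""
    k = len(neighbour_labels)
    return {c: n / k for c, n in Counter(neighbour_labels).items()}

def tl_predict(primary_neighbours, auxiliary_neighbours, class_weights):
    """Combine primary and auxiliary k-NN votes with per-class weights.

    class_weights[c] is the weight given to the primary data for class c;
    the auxiliary data receives 1 - class_weights[c].  Wu and Dietterich's
    original scheme corresponds to a single weight shared by all classes.
    """
    p = knn_votes(primary_neighbours)
    a = knn_votes(auxiliary_neighbours)
    classes = set(p) | set(a)
    scores = {c: class_weights.get(c, 0.5) * p.get(c, 0.0)
                 + (1 - class_weights.get(c, 0.5)) * a.get(c, 0.0)
              for c in classes}
    return max(scores, key=scores.get)

# A class fully trusted in the primary data (weight 1.0) ignores the
# auxiliary votes entirely:
pred = tl_predict(["ER", "ER", "Golgi"],
                  ["Golgi", "Golgi", "Golgi"],
                  {"ER": 1.0, "Golgi": 1.0})  # -> "ER"
```

Setting a class weight to 1.0 recovers primary-only classification for that class, while 0.0 defers entirely to the auxiliary source; intermediate values blend the two vote distributions.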
We compared Wu's k-NN TL method with our own multi-class multi-data k-NN TL method (S5 File Figure F); hereafter we refer to these two methods as Wu and Breckels, respectively. As described in the methods section of the manuscript, to assess classifier performance we partitioned our labelled data into training and testing sets, and used the testing sets to assess the strength of our classifiers. Parameter optimisation was conducted on the labelled training data using 100 rounds of stratified 80/20 partitioning, in conjunction with 5-fold cross-validation, to estimate the k-NN TL weights via a grid search. Comparing the macro-F1 scores at the 0.01 significance level, we found that, as with the Breckels method, Wu's method was better than using primary data alone for all datasets except the callus dataset (mouse p = 2e−10, human p = 9e−5, callus p = 0.02, roots p = 9e−10, fly p = 6e−6). We found that the Breckels k-NN TL classifier outperformed Wu's method for the mouse (p = 4e−4) and roots (p = 4e−3) datasets. Both classifiers are implemented in the pRoloc package [2] in Bioconductor [3].

It is important to note that in two of the above cases, namely callus and fly, learning from auxiliary data has either a limited (fly) or no effect at all (callus), because the resolution in the primary data is already excellent (the primary F1 scores are close to 1). Neither the Wu nor the Breckels algorithm can gain much from TL in these cases, and hence the comparison of Wu's and Breckels' k-NN classifiers is not particularly telling here. For the case where some improvement is possible (fly), both algorithms result in an increase in performance, but the scope for improvement is so limited that it is impossible to separate them. If we consider the other datasets (mouse, human and roots), where integration of primary and auxiliary data is most useful, our k-NN TL algorithm outperforms Wu's original algorithm in 2 out of 3 cases (mouse and roots).
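Throughout, classifier performance is summarised by macro-F1 scores, whose distributions are then compared with an unpaired two-sample t-test with unequal variance (Welch's t-test; in Python, `scipy.stats.ttest_ind` with `equal_var=False` computes such p values). A minimal pure-Python sketch of the macro-F1 statistic itself, with hypothetical function names:

```python
def class_f1(y_true, y_pred, cls):
    """One-vs-rest F1 score for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so that small
    sub-cellular classes count as much as large ones."""
    classes = sorted(set(y_true))
    return sum(class_f1(y_true, y_pred, c) for c in classes) / len(classes)
```

Because the mean over classes is unweighted, a classifier cannot score well by predicting only the abundant compartments, which is why macro-F1 (rather than plain accuracy) is the statistic compared across the 100 test rounds.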

Negative transfer
Negative transfer is a term used in machine learning to describe the situation in which (often irrelevant) information transferred from an auxiliary source results in a decrease in the performance of the learner. A major hurdle in developing successful transfer learning methods is minimising negative transfer [4]. In their review of transfer learning [5], Pan and Yang address three key issues: (1) what to transfer, (2) how to transfer and (3) when to transfer, and how these relate to negative transfer and to the similarity between source/target domains and tasks. Olivas et al. [4] also provide insight into how to avoid negative transfer and choose source tasks wisely. One such way is to manually select what to transfer, which is possible with the two TL methods presented here by manually setting the class weights in the k-NN TL classifier and the data-specific SVM parameters in the SVM TL classifier. We observe some negative transfer events on a class-specific basis. For example, examining the class-F1 scores for the mouse dataset in Fig. 2 (bottom) of the main manuscript, we see that the k-NN TL classifier does not perform as well for the lysosome as using the primary data alone. We find, however, that a t-test shows this difference is not significant at the 0.01 level (p = 0.07). We observe the converse for the proteasome in terms of auxiliary performance, wherein adding primary information decreases the performance relative to the auxiliary data alone (p = 6e−3, combined versus auxiliary). As mentioned above, one of the advantages of the k-NN TL algorithm is the ability to set the weights for these organelles manually, so we can limit the cases where negative transfer may occur.
We have found in previous tests that straightforward concatenation of the primary and auxiliary data, i.e. where neither data source is weighted, fails in many cases, and indeed we see strong negative transfer effects. Following our usual protocol for testing classifier performance (as detailed in the methods), the 100 macro-F1 scores resulting from straightforward concatenation were compared to those obtained from training on the primary data alone and on the auxiliary data alone (S5 File Figure H). We find that simple concatenation of the primary and auxiliary data results in a significant decrease in classifier performance for some datasets compared to using the primary data alone, as seen in, for example, the human (p = 2e−20) and callus (p = 5e−50) datasets.