A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection

The cell is compartmentalised into complex micro-environments allowing an array of specialised biological processes to be carried out in synchrony. Determining a protein’s sub-cellular localisation to one or more of these compartments can therefore be a first step in determining its function. High-throughput and high-accuracy mass spectrometry-based sub-cellular proteomic methods can now shed light on the localisation of thousands of proteins at once. Machine learning algorithms are then typically employed to make protein-organelle assignments. However, these algorithms are limited by insufficient and incomplete annotation. We propose a semi-supervised Bayesian approach to novelty detection, allowing the discovery of additional, previously unannotated sub-cellular niches. Inference in our model is performed in a Bayesian framework, allowing us to quantify uncertainty in the allocation of proteins to new sub-cellular niches, as well as in the number of newly discovered compartments. We apply our approach across 10 mass spectrometry based spatial proteomic datasets, representing a diverse range of experimental protocols. Application of our approach to hyperLOPIT datasets validates its utility by recovering enrichment with chromatin-associated proteins without annotation and uncovers sub-nuclear compartmentalisation which was not identified in the original analysis. Moreover, using sub-cellular proteomics data from Saccharomyces cerevisiae, we uncover a novel group of proteins trafficking from the ER to the early Golgi apparatus. Overall, we demonstrate the potential for novelty detection to yield biologically relevant niches that are missed by current approaches.


mESC chromatin enrichment validation
For the mESC dataset, Novelty TAGM reveals 8 new putative phenotypes. Novelty TAGM recovers the masked annotations with phenotype 2 having the enriched terms associated with chromatin, such as chromatin and chromosome (p < 10 −80 ). Phenotype 3 corresponds to a separate nuclear substructure with enrichment for the terms nucleolus (p < 10 −60 ) and nuclear body (p < 10 −30 ). Thus, in the mESC dataset Novelty TAGM conrms the chromatin enrichment preparation designed to separate chromatin and non-chromatin associated nuclear proteins [52]. In addition, phenotype 4 demonstrates enrichment for the ribosome annotation (p < 10 −35 ). Phenotype 1 is enriched for centrosome and microtubule annotations (p < 10 −15 ), though observing the PSM in Fig1 we can see there is much uncertainty in this phenotype. This uncertainty quantication can then be used as a basis for justifying additional expert annotation.

Uncovering additional annotations in broblast cells HCMV-infected broblast cells
We apply Novelty TAGM to the dataset corresponding to the HCMV-infected broblast cells 24 hours post infection (hpi) [7], and discover 9 putative additional phenotypes (demonstrated in Fig2). Phenotype 2 contains a singleton protein and phenotypes 4, 6, 7, 8 and 9 are not signicantly enriched for any annotations. However, phenotype 3 is enriched for the mitochondrial membrane and mitochondrial envelope annotations (p < 10 −4 ); this is an addition to the already annotated mitochondrial class, indicating sub-mitochondrial resolution. Phenotype 1 is a mixed ribosomal/nuclear cluster with enrichment for nucleoplasm (p < 10 −5 ) and the small ribosomal subunit (p < 10 −4 ), which is distinct from phenotype 5 which is enriched for the large ribosomal subunit (p < 10 −10 ). This demonstrates unbiased separation of the two ribosomal subunits, which was overlooked in the original analysis [7].
However, we observe that phenotype 3 is enriched with the large ribosomal subunit with signicance at level p < 10 −7 . Phenotype 1 represents a mixed peroxisome (p < 10 −2 ) and mitochondrion cluster (p < 10 −2 ), an unsurprising result since these organelles possess similar biochemical properties and therefore similar proles during density gradient centrifugationbased fractionation [18,29]. The diering number of condently identied and biologically relevant phenotypes discovered between the two broblast datasets could be down to the diering levels of structure between the two datasets. Indeed, it is evident from Fig2 that we see diering levels of clustering structure in these datasets.

Additional organellar map datasets
Mouse primary neurons The mouse primary neuron dataset reveals 10 phenotypes after we apply Novelty TAGM. However, 8 of these phenotypes have no enriched GO annotations. This is likely a manifestation of the dispersed nature of this dataset, where the variability is generated by technical artefacts rather than biological signal. Despite this, Novelty TAGM is able to detect two relevant phenotypes: the rst phenotype is enriched for nucleolus (p < 0.01); the second for chromosome (p < 0.01). This suggests additional annotations for this dataset.
HeLa cells (Hirst et. al 2018) The HeLa dataset of [34], which we refer to as HeLa Hirst, reveals 7 phenotypes with at least 1 protein with discovery probability greater than 0.95. However, three of these phenotypes represent singleton proteins. Phenotype 1 reveals mixed cytosol/ribosomal annotations with the terms cytosolic ribosome (p < 10 −30 ) and cytosolic part (p < 10 −25 ) signicantly over- represented. There are no further phenotypes with enriched annotations (threshold p = 0.01), except phenotype 2 which represents a mixed extracellular structure/cytosol cluster.

Handling label switching
Bayesian inference in mixture models suers from an identiability issue known as label switching -a phenomenon where the allocation labels can ip between runs of the algorithm [58,66]. This occurs because of the symmetry of the likelihood function under permutations of these labels. We note that this only occurs in unsupervised or semi-supervised mixture models. This makes inference of the parameters in mixture models challenging. In our setting the labels for the known components do not switch, but for the new phenotypes label switching must occur. One standard approach to circumvent this issue is to form the so-called posterior similarity matrix (PSM) [22]. The PSM is an N ×N matrix where the (i, j) th entry is the posterior probability that protein i and protein j reside in the same component. More precisely, if we let S denote the PSM and T denote the number of Monte-Carlo iterations then where I denotes the indicator function. The PSM is clearly invariant to label switching and so avoids the issues arising from the label switching problem.   Figure 2: (a, c) PCA plots of the HCMV-infected broblast data 24 hpi and the mock broblast data 24 hpi. The points are coloured according to the organelle or proposed new phenotype and are scaled according to the discovery probability. (b, d) Heatmaps of the posterior similarity matrix derived from the infected broblast data and mock broblast data demonstrating the uncertainty in the clustering structure of the data. We have only plotted the proteins which have greater than 0.99 probability of belonging to a new phenotype and probability of being an outlier less than 0.95.  Plasma membrane (c) (d) Figure 3: (a),(c) PCA plots of the mouse primary neuron data and HeLa Hirst data. The pointers are scaled according to their discovery probability. (b),(d) Heatmaps of the mouse neuron data and HeLa Hirst data. Only the proteins whose discovery probability is greater than 0.99 and outlier probability less than 0.95 (10 −2 for the mouse primary neuron dataset to reduce the number of visualised proteins) are shown. The heatmaps demonstrate the uncertainty in the clustering structure present in the data.

Summarising posterior similarity matrices
To summarise the PSMs, we take the approach proposed by [22]. They proposed the adjusted Rand index (AR) [38,56], a measure of cluster similarity, as a utility function and then we wish to nd the allocation vectorẑ that maximises the expected adjusted Rand index with respect to the true clustering z. Formally, we writê which is known as the Posterior Expected Adjusted Rand index (PEAR). One obvious pitfall is that this quantity depends on the unknown true clustering z. However, this can be approximated from the MCMC samples: The space of all possible clustering over which to maximise is infeasibly large to explore. Thus we take an approach taken in [22] to propose candidate clusterings over which to maximise. Using hierarchical clustering with distance 1 − S ij , the PEAR criterion is computed for clusterings at every level of the hierarchy. The optimal clusteringẑ is the allocation vector which maximises the PEAR.

Details of MCMC
The MCMC algorithm used in [14] is insucient to handle inference of unknown phenotypes. As in [14], a collapsed Gibbs sampler approach is used, but a number of modications are made. Firstly, to accelerate convergence of the algorithm half the proteins are initial allocated randomly amongst the new phenotypes. Secondly, the parameters for the new phenotypes are proposed from the prior. Throughout the same default prior choices are used as in [14].

Further details of endosomal proteins
For completeness, this appendix provides additional details and important literature on the proteins discussed in the main text. First, P20339 (Rab5a) and P61020 (Rab5b) are two of the three isoforms of Rab5, a small GTPase which belongs to the Ras protein superfamily and is considered a master organiser of the endocytic system. Rab5a and Rab5b share a high level of amino acid sequence identity (approximately 85%) and are ubiquitously expressed in the mouse and human. Independently, these isoforms act as key regulators of clathrin-mediated endocytosis and early endosome dynamics by controlling the following processes in vivo and in vitro: (a) clathrincoated vesicle formation at the cell surface; (b) endocytosed vesicle transport from the plasma membrane towards, and fusion with, early endosomes; (c) early endosome biogenesis and maintenance; (d) molecular motor-driven, microtubule-dependent early endosome motility along the endocytic route; (e) early endosome docking/tethering and homotypic fusion, and (f) Rab conversion and early-to-late endosome maturation [13,28,43,59,64,82].
Rab5a and Rab5b play crucial roles in the internalisation and recycling/degradation of cell surface receptors such as EGFR (epidermal growth factor receptor), TfR (transferrin receptor) and several GPCRs (G-protein-coupled receptors) and integrins as well as peripheral plasma membrane-associated signalling molecules, thereby regulating important intracellular signal transduction pathways [5,12,45,73]. We observe a mixed steady-state potential localisation between the endosome and PM for both Rab5a and Rab5b (Fig ??D). According to previously published information, both Rab5a and Rab5b are mainly localised to (and considered well-established constituents of) the early endosome compartment but have also been detected on the PM and clathrin-coated vesicles, in support of our results [51,64,77]. Moreover, according to the HPA Cell Atlas, Rab5b resides in the vesicles (which, in this context, include the endosomes, lysosomes, peroxisomes and lipid droplets). There is no information regarding the sub-cellular location of Rab5a in this database.
Second, Q92738 (RN-tre) is a GTPase-activating protein (GAP) which controls the activity of several Rab GTPases. RN-tre is a major Rab5 (see above) regulator and therefore a key player in the organisation and dynamics of the endocytic pathway [28,42]. This protein modulates the internalisation of and signal transduction mediated by cell surface receptors such as EGFR, TfR and β1 integrins [17,42,48,55]. It also controls early endosome-to-Golgi retrograde transport and Golgi membrane organisation [32]. We observe a steady-state snapshot of the sub-cellular distribution of RN-tre with potential localisation to the endosome and PM (Fig ??D). In line with these results, RN-tre has been shown to reside in Rab5-positive early endosomes at steady state, but has also been detected at the PM and focal adhesions [17,28,42,48,55]. There is no information concerning the sub-cellular localisation of RN-tre in the HPA Cell Atlas database.
Third, Q96L93 (KIF16B) is a plus end-directed molecular motor which belongs to the kinesin-3 protein family. This kinesin regulates early endosome motility along microtubules and is required for the establishment of the steady-state sub-cellular distribution of early endosomes as well as the balance between PM recycling and lysosome degradation of signal transducing cell surface receptors including EGFR and TfR [9,35]. In neuronal cells, KIF16B plays an important role in the establishment of somatodendritic early endosome localisation and in the tracking of AMPA and NGF receptors [21]. In epithelial cells, this protein controls the transcytosis of TfR from juxtanuclear recycling endosomes to apical recycling endosomes [6]. KIF16B is also involved in tubular endosome biogenesis and ssion by regulating early endosome fusion [65]. Lastly, this kinesin has been shown to mediate biosynthetic Golgi-to-endosome transport of FGFR (broblast growth factor receptor)-carrying vesicles and thereby control FGFR cell surface presentation and signalling during in vivo mouse embryogenesis [74]. Our results indicate a mixed localisation to the endosome and PM for KIF16B (Fig ??D). In line with our observations, it has been reported that this protein is associated with early endosome membranes at steady state in mouse and human cells [21,35]. Additionally, it has been demonstrated that KIF16B co-localises with, and its spatial distribution and activity is regulated by, the small GTPase Rab5, whose isoforms Rab5a and Rab5b we also identied as potentially localised to the endosome and PM in the U-2 OS hy-perLOPIT dataset (see above), on early endosomes [35,65]. Taking the above into account, a mixed distribution between the endosome and PM is reective of the molecular function of KIF16B. However, the HPA Cell Atlas database classies KIF16B as a component of the mitochondria (Fig ??B), contradicting our ndings as well as previously published information regarding the sub-cellular localisation and biological role of this protein. We speculate that this disagreement arises from the uncertainty associated with the specicity of the chosen antibody [71]. Indeed, the reliability of the mitochondrial annotation for KIF16B is classied as "uncertain" in this database.
Fourth, Q8NHG8 (ZNRF2) is an E3 ubiquitin ligase which has been shown to regulate mTOR signalling as well as lysosomal acidity and homeostasis in mouse and human cells [37]. This protein has been found to control the sub-cellular localisation and biological function of mTORC1, the V-ATPase and the Na+/K+-ATPase α1 [36,37]. ZNRF2 is membraneassociated but can be released into the cytosol upon phosphorylation by various kinases [37]. We observe a mixed steady-state distribution between the endosome and PM for this protein (Fig ??D). In support of this result, we nd that ZNRF2 has been detected on the endosomes, lysosomes, Golgi apparatus and PM according to the literature [1,37]. There is no information in regard to the sub-cellular location of ZNRF2 in the HPA Cell Atlas database.
Fifth, O15498 (Ykt6) is a SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) protein which is conserved from yeast to humans. This protein regulates a wide variety of intracellular tracking and membrane tethering and fusion processes including ER-to-Golgi vesicular transport, intra-Golgi trac, retrograde Golgi-to-ER transport, retrograde endosome-to-TGN (trans-Golgi network) tracking, homotypic fusion of ER membranes, Golgi-to-PM transport and exosome/secretory vesicle-PM fusion, Golgi-tovacuole trac (in yeast), homotypic vacuole fusion (in yeast), autophagosome formation and autophagosome-lysosome fusion [20,44,49,68,69,80]. Ykt6 lacks a transmembrane domain and is able to cycle between intracellular membranes and the cytosol in a palmitoylationand farnesylation-dependent manner [25,50]. The membrane-associated form of Ykt6 has been detected on the PM, ER, Golgi apparatus, endosomes, lysosomes, vacuoles (in yeast), and autophagosomes as part of various SNARE complexes [20,25,44,49,50,68,69,80]. In line with this information, our results show a mixed sub-cellular distribution for Ykt6 with potential localisation to the endosome and cytosol (Fig ??D). The cytosolic localisation for Ykt6 is also supported by the HPA Cell Atlas annotation corresponding to this protein ( Fig   ??B), further reinforcing our ndings.
Sixth, Q9NZN3 (EHD3) is another important regulator of endocytic tracking and recycling. This protein promotes the biogenesis and stabilisation of tubular recycling endosomes by inducing early endosome membrane bending and tubulation [3,33]. Additionally, EHD3 is essential for early endosome-to-recycling endosome transport, retrograde early endosometo-Golgi trac, Golgi apparatus morphology maintenance, and recycling endosome-to-PM transport [8,31,33,53,54]. It plays an important role in the recycling of cell surface receptors and the biosynthetic transport of lysosome proteins [8,31,53,54]. We observe a mixed steady-state potential localisation to the endosome and PM for EHD3 (Fig ??D). Our results are in agreement with previously published studies which have reported that EHD3 is resident in the early endosomes and recycling endosomes at steady state [8,31,53,54], and our PM localisation-related observation is supported by the HPA Cell Atlas-derived annotation for this protein (Fig ??B).
Our ndings provide insights on the dynamic sub-cellular distribution of proteins which play important roles in development, physiology and disease. For example, Rab5/Rab5a has been identied as a master regulator of cancer cell migration, tumour invasion and dissemination programs in vitro and in vivo. It has been demonstrated that Rab5/Rab5a expression is dysregulated in many invasive human cancers, Rab5/Rab5a is overexpressed in metastatic foci compared to the matched primary tumours, and Rab5/Rab5a activity critically promotes the acquisition of invasive properties by poorly invasive tumour cell types [19,23,45,46,51,62,72]. Several publications have reported that elevated Rab5/Rab5a expression correlates with, and is predictive of, increased local invasiveness and metastatic potential, as well as poor patient prognosis in a variety of human cancer types [19,23,26,39,51,79,81,84]. Due to its established role in cancer progression and metastasis, Rab5/Rab5a is considered a fundamental cancer-associated protein and a potential diagnostic marker or therapeutic target [23,39]. Recently, Rab5 was identied as a promising therapeutic target for colorectal cancer, as inhibition of Rab5 (and Rab7) activity led to elimination of colorectal cancer stem cells and disruption of colorectal cancer foci [70]. Moreover, individual ablation of Rab5a but also Rab5b was shown to impair the invasion and dissemination ability of dierent cancer cell types [23]. In addition to its important role in cancer, there is some evidence suggesting that Rab5a might also be involved in the early pathogenesis of Alzheimer's disease [10,11,60]. Lastly, the Rab5 machinery has also been identied as an important factor in several bacterial, parasitic and viral infections. Bacterial pathogens such as Mycobacterium tuberculosis, Listeria monocytogenes, Tropheryma whipplei and Salmonella typhimurium [47], as well as parasites such as Leishmania donovani have evolved specic subversion mechanisms with which they are able to control the intracellular distribution and/or activity of Rab5 and its eectors as a way to avoid neutralisation by the immune system or facilitate invasion [76]. L. donovani specically controls the expression and function of the Rab5a isoform in this context [76]. Additionally, Rab5 was shown to participate in adenovirus endocytosis [57], both Rab5a and Rab5b were found to play functional roles in web formation and viral genome replication during HCV (hepatitis C virus) infection [67], and Rab5a was identied as a crucial target of HBV (hepatitis B virus) during HBV-related hepatocellular carcinoma pathogenesis [63].
Apart from Rab5a and Rab5b, the other proteins also possess demonstrated roles in development and disease. RN-tre is overexpressed in a subset of aggressive basal-like breast cancers, where high levels of this protein prevent the endocytosis and recycling of EGFR, leading to Akt overstimulation. In turn, Akt activity stabilises the glucose transporter GLUT1 at the cell membrane, resulting in an increase in glycolysis and cancer cell proliferation. RN-tre has been proposed as a potential therapeutic target for these types of breast cancer [2]. This protein also plays a functional role in infection, as it was shown to regulate the uptake and intracellular tracking of Shiga toxins [24]. Furthermore, it has been reported that KIF16B is essential for early post-implantation mouse embryo development, as Kif16b-knockout animals display peri-implantation embryonic lethality [74]. In addition, recent studies have shown that ZNRF2 is overexpressed in human non-small cell lung cancer, osteosarcoma and papillary thyroid cancer, and that high levels of this protein are correlated with disease progression and poor patient prognosis in these cases [15,78,83]. Moreover, Ykt6 was found to be necessary for glycosome biogenesis and function in the kinetoplastid parasite Trypanosoma brucei, which causes African sleeping sickness, with Ykt6 ablation signicantly reducing the viability of the parasite in both its pro-cyclic and bloodstream forms [4]. Finally, EHD3 has been identied as an essential factor for heart physiology [16].

Summary of convergence diagnostics
We provide a summary of convergence diagnostics using parallel chains analysis [30]. We compute the number of proteins allocated to the outlier component at each iteration of the Markov-chain and monitor this quantity for convergence. TheR statistic between parallel chains in then computed and reported in the

Prior specication and sensitivity
To complete the Bayesian specication, here we provide details of the priors on the model parameters. In the multivariate Gaussian components of the Novelty TAGM model, as with TAGM, a common and practical choice is the use of a normal-inverse-Wishart prior. That is, for each mixture component and where d is the dimension of the data. To complete this discussion, we need to specify the hyperparameters, µ 0 , λ 0 , ν 0 and S 0 . We use diusive priors that make minimal assumptions about the data, but they are set semi-empirically as to obtain the correct scale of the data. The hyperparameters are selected as follows The hyperparameters are interpreted in the following ways. The prior mean, µ 0 , is the mean of the data. Then λ 0 is viewed as the number of observations with data µ 0 which are added to each component specic mean. This value is small to avoid strong prior inuence. The marginal prior distribution (or prior predictive) for a cluster specic mean µ is given by a student's t-distribution. This can be observed by recalling that the student's t-distribution arises by marginalisation of the covariance from a normal distribution. Now, to ensure this t-distribution has nite covariance we require that ν 0 > d + 1. Thus, the choice presented here is the smallest integer value of ν 0 that ensures a nite covariance matrix. Hence, we have a well dened t-distribution with heavy tails. The empirically chosen scale matrix S 0 is chosen to roughly partition the range of the data into K balls of equal size. Previous work has shown that these priors lead to good predictive performance [14]. For π, we take a conjugate symmetric Dirichlet prior with parameter β, so that π 1 , . . . , π Kmax ∼ Dirichlet(β). Note that to apply the principle of overtted mixtures, we have to choose max j β j < d/2 [61], which is satised in all examples by setting β j = 1 for every j. Empirically Van Havre et al. [75] have recommended smaller values of β j ≈ n −1 to encourage stronger shrinkage.

Sensitivity to the choice of β j
To explore the sensitivity of our inferences to the specication of β j , we considered setting β j = 0.1, 0.01, as well as β j ≈ n −1 for the mESC example, which in this case n −1 ≈ 0.0002. As before, we hid nucleus, chromatin and ribosome annotations and sought to use our model to rediscover them. As we now summarily describe, we found that our results can be sensitive to the choice of β j and hence it should be set carefully. For example when β j = 0.1, we were unable to detect a ribosomal phenotype. Furthermore, there was a joint nucleus and chromatin phenotype, phenotype 1, rather than two distinct phenotypes. Chromosome was enriched for this phenotype (p < 10 −100 ), as well as nucleolus (p < 10 −60 ). When β j = 0.01 the results were somewhat improved with a phenotype 1 enriched for chromosome (p < 10 −100 ) but phenotype 3 was enriched for cytosolic ribosome (p < 10 −48 ) and nucleolus (p < 10 −50 ). Setting β j = 0.0002 provided the expected results with 3 distinct phenotypes for chromatin (phenotype 1) (p < 10 −100 ), nucleolus (phenotype 4) (p < 10 −50 ), and cytosolic ribsome (phenotype 3) (p < 10 −59 ), successfully matching our test components. Hence, based on these results, we would recommend either β j = 1 or β j ≈ n −1 depending on the desired amount of shrinkage.

Impact of reducing the proportion of labelled proteins
In all the examples we considered previously, the proportion of labelled proteins is roughly 20% of the total number of proteins. To assess the impact of the relative proportion of labelled and unlabelled proteins, we reconsidered our mESC example, where the goal was to detect ribosomal, nuclear and chromatin niches without annotation. In addition to masking these annotations as test components, we also masked, uniformly at random, an additional 10%, 20% and 50% of labelled proteins and assessed our ability to rediscover the ribosomal, nuclear and chromatin testing classes. Briey, we were able to rediscover two distinct phenotypes according to two nuclear clusters in all cases. When we masked 10% of the labels, the enrichments for the two nuclear phenotypes were chromosome (p < 10 −99 ) and nucleolus (p < 10 −59 ), the results were the same when we removed 20% and 50% of labels. However, only in the scenario were 20% of the labels were hidden did we nd a ribosome enriched phenotype (p < 10 −30 ). In the other cases, the ribosome clustered with the other large protein complex: the proteasome. This reects the similar biochemical properties of these subcellular niches. Furthermore, removing annotations renders the proteasome prole less well dened, resulting in a more diuse cluster. In practice, careful quality control would mitigate these scenarios [27]. In applications where there are very few annotated niches and the analysis is close to the unsupervised setting, it may be valuable to increase K novelty above 10 -others have found n/2 to work well [41].