The proteome of the malaria plastid organelle, a key anti-parasitic target

The apicoplast is an essential plastid organelle in malaria parasites (Plasmodium spp.) and a validated anti-parasitic target. A major hurdle to uncovering cryptic apicoplast pathways required for malaria pathogenesis is the lack of an organellar proteome. Here we combine proximity biotinylation-based proteomics (BioID) and a new machine learning algorithm to generate the first high-confidence apicoplast proteome consisting of 346 proteins. Critically, the accuracy of this proteome significantly outperforms previous prediction-based methods. Half of identified proteins have unknown function, and 77% are predicted to be important for normal blood-stage growth. We validate the apicoplast localization of a subset of novel proteins and show that an ATP-binding cassette protein ABCF1 is essential for blood-stage survival and plays a previously unknown role in apicoplast biogenesis. These findings indicate critical organellar functions for newly-discovered apicoplast proteins. The apicoplast proteome will be an important resource for elucidating unique pathways and prioritizing antimalarial drug targets.


Introduction
Identification of new antimalarial drug targets is urgently needed to address emerging resistance to all currently available therapies. However, nearly half of the Plasmodium falciparum genome encodes conserved proteins of unknown function (Aurrecoechea et al., 2017), obscuring critical pathways required for malaria pathogenesis. The apicoplast is an essential, non-photosynthetic plastid found in Plasmodium spp. and related apicomplexan pathogens (McFadden et al., 1996; Kohler et al., 1997). This unusual organelle is an enriched source of both novel cellular pathways and parasite-specific drug targets (van Dooren and Striepen, 2013). It was acquired by secondary (i.e., eukaryote-eukaryote) endosymbiosis and has

To target the promiscuous biotin ligase BirA* to the apicoplast, the N-terminus of a GFP-BirA* fusion protein was modified with the apicoplast-targeting leader sequence from acyl carrier protein (ACP) (Figure 1A). Since apicoplast proteins transit the parasite endoplasmic reticulum (ER) en route to the apicoplast (Waller et al., 2000), we also generated a negative control in which GFP-BirA* was targeted to the ER via an N-terminal signal peptide and a C-terminal ER-retention motif (Figure 1A). Each of these constructs was integrated into an ectopic locus in Dd2 attB parasites (Nkrumah et al., 2006) to generate BioID-Ap and BioID-ER parasites (Figure S1). Live imaging of these parasites confirmed GFP-BirA* localization to a branched structure characteristic of the apicoplast or a perinuclear region characteristic of the ER, respectively (Figure 1B).

To test the functionality of the GFP-BirA* fusions in the apicoplast and ER, we labeled either untransfected Dd2 attB, BioID-Ap, or BioID-ER parasites with DMSO or 50 µM biotin and assessed biotinylation by western blotting and fixed-cell fluorescent imaging.
As has been reported (Khosh-Naucke et al., 2018), significant labeling of GFP-BirA*-expressing parasites above background was achieved even in the absence of biotin supplementation, suggesting that the 0.8 µM biotin in RPMI growth medium is sufficient for labeling (Figure 1C). Addition of 50 µM biotin further increased protein biotinylation. Fluorescence imaging of biotinylated proteins revealed staining consistent with apicoplast morphology in BioID-Ap parasites and the ER and other endomembrane structures in BioID-ER parasites (Figure 1D). These results confirm that GFP-BirA* fusions are active in the P. falciparum apicoplast and ER and can be used for compartment-specific biotinylation of proteins.

For large-scale identification of apicoplast proteins, biotinylated proteins from late-stage BioID-Ap and BioID-ER parasites were purified using streptavidin-conjugated beads and identified by mass spectrometry. A total of 728 unique P. falciparum proteins were detected in the apicoplast and/or ER based on presence in at least 2 of 4 biological replicates and at least 2 unique spectral matches in any single mass spectrometry run (Figure 2A and Table S1). The abundance of each protein in apicoplast and ER samples was calculated by summing the total MS1 area of all matched peptides and normalizing to the total MS1 area of all detected P. falciparum peptides within each mass spectrometry run.

To assess the ability of our dataset to distinguish between true positives and negatives, we then generated control lists of 96 known apicoplast and 451 signal peptide-containing non-apicoplast proteins based on published localizations and pathways (Table S2). Consistent with an enrichment of apicoplast proteins in BioID-Ap samples, we observed a clear separation of known apicoplast and non-apicoplast proteins based on apicoplast:ER abundance ratio (Figure 2A).
Based on the apicoplast:ER abundance ratio, we considered the 187 proteins that were ≥5-fold enriched in apicoplast samples (Figure 2A, dotted line) to be the BioID apicoplast proteome (Table S1). This dataset included 50 of the 96 positive control proteins for a sensitivity of 52%. Further, manual inspection of the proteins on the ≥5-fold enriched apicoplast list identified 54 true positives and 5 likely false positives (Table S1) for a positive predictive value (PPV; the estimated fraction of proteins on the list that are true positives) of 92%.

To benchmark our dataset against the current standard for large-scale identification of apicoplast proteins, we compared the sensitivity and PPV of our apicoplast BioID proteome to

BioID identified fewer known apicoplast proteins than PATS or PlasmoAP, which had sensitivities of 89% and 84%, respectively, but outperformed the 40% sensitivity of ApicoAP (Figure 2B). However, we expected that the advantages of apicoplast BioID would be the ability to detect proteins without classical targeting presequences and its improved discrimination between true and false positives (Figure 2A). Indeed, bioinformatic algorithms had poor PPVs ranging from 19%-36% compared to the 92% PPV of BioID (Figure 2C). Even a dataset consisting only of proteins predicted by all three algorithms achieved a PPV of just 25%. Consistent with these low PPVs, many proteins predicted by the bioinformatic algorithms are not enriched in BioID-Ap samples, suggesting that many of these proteins are likely to be false positives (Figure S2). Altogether, identification of apicoplast proteins using BioID provided a dramatic improvement in prediction performance over bioinformatic algorithms.
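The abundance normalization, ≥5-fold enrichment cutoff, sensitivity, and PPV described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the protein names and MS1 areas are invented toy values, the `floor` pseudo-abundance for proteins absent from ER samples is an assumption (the text does not say how such cases were handled), and replicate averaging is omitted.

```python
from typing import Dict

ENRICHMENT_CUTOFF = 5.0  # ">=5-fold enriched in apicoplast samples" (from the text)


def normalize(ms1_area: Dict[str, float]) -> Dict[str, float]:
    """Divide each protein's summed MS1 area by the total MS1 area of the run."""
    total = sum(ms1_area.values())
    return {protein: area / total for protein, area in ms1_area.items()}


def enrichment_ratio(ap: float, er: float, floor: float = 1e-9) -> float:
    """Apicoplast:ER abundance ratio. `floor` is an assumed pseudo-abundance
    that avoids division by zero when a protein is absent from ER samples."""
    return ap / max(er, floor)


def sensitivity(controls_found: int, controls_total: int) -> float:
    """Fraction of known positive controls recovered."""
    return controls_found / controls_total


def ppv(true_pos: int, false_pos: int) -> float:
    """Fraction of judged calls that are true positives."""
    return true_pos / (true_pos + false_pos)


# Hypothetical single-run MS1 areas (toy values, not data from the study).
ap_run = {"protA": 600.0, "protB": 300.0, "protC": 100.0}
er_run = {"protA": 50.0, "protB": 450.0, "protC": 500.0}

ap_norm, er_norm = normalize(ap_run), normalize(er_run)
enriched = sorted(
    p for p in ap_norm
    if enrichment_ratio(ap_norm[p], er_norm.get(p, 0.0)) >= ENRICHMENT_CUTOFF
)

# Counts reported in the text: 50 of 96 controls; 54 true and 5 likely false positives.
reported_sensitivity = sensitivity(50, 96)
reported_ppv = ppv(54, 5)
```

With the counts given in the text, the two helper functions reproduce the reported 52% sensitivity and 92% PPV after rounding.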

Apicoplast BioID identifies proteins of diverse functions in multiple subcompartments
To determine whether lumenally targeted GFP-BirA* exhibited any labeling preferences, we assessed proteins identified based on the presence of transmembrane domains, their sub-organellar localization, and their functions. First, we determined the proportion of the 187 proteins identified by apicoplast BioID that are membrane proteins. To ensure that proteins were not classified as membrane proteins solely due to misclassification of a signal peptide as a transmembrane domain, we considered a protein to be in a membrane only if it contained at least one predicted transmembrane domain more than 80 amino acids from the protein's N-terminus (as determined by annotation in PlasmoDB). These criteria suggested that 11% of identified proteins (20/187) were likely membrane proteins (Figure 3A), indicating that lumenal GFP-BirA* can label apicoplast membrane proteins.

Second, apicoplast proteins may localize to one or multiple sub-compartments defined by the four apicoplast membranes. It was unclear whether BirA* targeted to the lumen would label proteins in non-lumenal compartments. Based on literature descriptions, we classified the 96 known apicoplast proteins on our positive control list as either lumenal (present in lumenal space or on the innermost apicoplast membrane) or non-lumenal (all other sub-compartments) and determined the proportion that were identified in our dataset. Apicoplast BioID identified 56% (45/81) of the classified lumenal proteins and 33% (5/15) of the non-lumenal proteins (Figure 3B), suggesting that the GFP-BirA* bait used can label both lumenal and non-lumenal proteins but may have a preference for lumenal proteins (though this difference did not reach statistical significance).

Finally, we characterized the functions of proteins identified by apicoplast BioID.
We grouped positive control apicoplast proteins into functional categories and assessed the proportion of proteins identified from each functional group (Figure 3C). BioID identified a substantial proportion (67-100%) of proteins in four apicoplast pathways that are essential in blood stage and localize to the apicoplast lumen, specifically DNA replication, protein translation, isoprenoid biosynthesis, and iron-sulfur cluster biosynthesis. Conversely, BioID identified few proteins involved in heme or fatty acid biosynthesis (0% and 17%, respectively), which are lumenal pathways that are non-essential in the blood-stage and which are likely to be more abundant in other life cycle stages (Yu et al., 2008; Vaughan et al., 2009; Pei et al., 2010; Nagaraj et al., 2013; Ke et al., 2014). We achieved moderate coverage of proteins involved in protein quality control (44%) and redox regulation (38%). Consistent with the reduced labeling of non-lumenal apicoplast proteins, only a small subset (29%) of proteins involved in import of nuclear-encoded apicoplast proteins were identified. Overall, apicoplast BioID identified soluble and membrane proteins of diverse functions in multiple apicoplast compartments with higher coverage for lumenal proteins required during blood-stage infection.
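Two of the criteria in this section lend themselves to a short sketch: the membrane-protein rule (a predicted transmembrane domain more than 80 residues from the N-terminus) and the comparison of lumenal versus non-lumenal coverage. The Python below is a hedged illustration only. It assumes that "more than 80 amino acids from the N-terminus" refers to the start position of the domain, and it uses a two-sided Fisher's exact test because the text does not name the test that was applied; both choices are assumptions, and the function names are hypothetical.

```python
from math import comb
from typing import List, Tuple

MIN_TM_START = 80  # residues from the N-terminus (threshold from the text)


def is_membrane_protein(tm_domains: List[Tuple[int, int]]) -> bool:
    """Return True if any predicted TM domain starts beyond residue 80.
    `tm_domains` holds (start, end) positions, 1-based (assumed convention)."""
    return any(start > MIN_TM_START for start, _end in tm_domains)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


# Counts from the text: 45 of 81 lumenal and 5 of 15 non-lumenal controls identified.
p_value = fisher_exact_two_sided(45, 81 - 45, 5, 15 - 5)
```

Under these assumptions the p-value is above 0.05, in line with the statement that the lumenal preference did not reach statistical significance.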

The PlastNN algorithm expands the predicted apicoplast proteome with high accuracy
Apicoplast BioID provided the first experimental profile of the blood-stage apicoplast proteome but is potentially limited in sensitivity due to 1) difficulty in detecting low abundance peptides in complex mixtures; 2) inability of the promiscuous biotin ligase to access target proteins that are buried in membranes or protein complexes; or 3) stage-specific protein expression. Currently-available bioinformatic predictions of apicoplast proteins circumvent these limitations, albeit at the expense of a low PPV (Figure 2C). We reasoned that increasing the number of high-confidence apicoplast proteins used to train algorithms could improve the accuracy of a prediction algorithm while maintaining high sensitivity. In addition, inclusion of exported proteins that traffic through the ER, which are common false positives in previous prediction algorithms, would also improve our negative training set.

We used our list of previously known apicoplast proteins (Table S2) as well as newly-identified apicoplast proteins from BioID (Table S1) to construct a positive training set of 205 apicoplast proteins (Table S4). As a negative training set, we used our previous list of 451 signal peptide-containing non-apicoplast proteins (Table S2). For each of the 656 proteins in the training set, we calculated the frequencies of all 20 canonical amino acids in a 50 amino acid region immediately following the predicted signal peptide cleavage site. In addition, given that apicoplast proteins have a characteristic transcriptional profile in blood-stage parasites (Bozdech et al., 2003) and that analysis of transcriptional profile has previously enabled identification of apicoplast proteins in the related apicomplexan Toxoplasma gondii (Sheiner et (Bartfai et al., 2010).
Altogether, each protein was represented by a vector of dimension 28 (20 amino acid frequencies plus 8 transcript levels). These 28-dimensional vectors were used as inputs to train a neural network with 3 hidden layers (Figure 4A and Table S5). Six-fold cross-validation was used for training, wherein the training set was divided into 6 equal parts (folds) to train 6 separate models. Each time, 5 folds were used to train the model and 1 fold to measure the performance of the trained model.

We named this model PlastNN (ApicoPLAST Neural Network). PlastNN recognized apicoplast proteins with a cross-validation accuracy of 96 ± 3% (mean ± s.d. across 6 models), along with sensitivity of 95 ± 5%, and PPV of 94 ± 4% (Figure 4B). This performance was higher than logistic regression on the same dataset (average accuracy = 91%). Combining the transcriptome features and the amino acid frequencies improves performance: the same neural network architecture with amino acid frequencies alone as input resulted in a lower average accuracy of 91%, while using transcriptome data alone resulted in an average accuracy of 90% (Table S6). Comparison of the performance of PlastNN to existing prediction algorithms indicates that PlastNN distinguishes apicoplast and non-apicoplast proteins with higher accuracy than any previous prediction method (Figure 4C). To identify new apicoplast proteins, PlastNN was used to predict the apicoplast status of 450 predicted signal peptide-containing proteins that were not in our positive or negative training sets. Since PlastNN is composed of 6 models, we designated proteins as "apicoplast" if plastid localization was predicted by ≥4 of the 6 models. PlastNN predicts 118 out of the 450 proteins to be targeted to the apicoplast (Table S7). Combining these results with those from apicoplast BioID (Table S1) and with experimental localization of proteins from the literature (Table S2) yielded a compiled proteome of 346 putative nuclear-encoded apicoplast proteins (Table S8).

bioRxiv preprint doi: https://doi.org/10.1101/265967; this version posted February 14, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
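To make the PlastNN input representation and the ensemble rule concrete, the following Python sketch builds the 28-dimensional feature vector (20 amino acid frequencies from the 50 residues after the signal peptide cleavage site, plus 8 transcript levels) and applies the "≥4 of 6 models" vote. It is an illustration, not the authors' implementation: the function names are hypothetical, the cleavage site is assumed to be given as the 0-based index of the first residue after cleavage, frequencies are normalized by the actual window length, and the example sequence and transcript values are invented.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
WINDOW = 50        # residues following the predicted cleavage site (from the text)
N_TRANSCRIPT = 8   # transcript levels per protein (from the text)
VOTES_NEEDED = 4   # ">=4 of the 6 models" (from the text)


def aa_frequencies(sequence: str, cleavage_site: int) -> List[float]:
    """Frequencies of the 20 amino acids in the window after the cleavage site.
    `cleavage_site` is assumed to be the 0-based index of the first residue
    of the mature protein."""
    window = sequence[cleavage_site:cleavage_site + WINDOW]
    if not window:
        raise ValueError("no residues after the cleavage site")
    return [window.count(aa) / len(window) for aa in AMINO_ACIDS]


def feature_vector(sequence: str, cleavage_site: int,
                   transcript_levels: Sequence[float]) -> List[float]:
    """Concatenate 20 amino acid frequencies and 8 transcript levels."""
    if len(transcript_levels) != N_TRANSCRIPT:
        raise ValueError("expected 8 transcript levels")
    return aa_frequencies(sequence, cleavage_site) + list(transcript_levels)


def ensemble_call(model_votes: Sequence[bool]) -> bool:
    """Label a protein 'apicoplast' if at least 4 of the 6 models agree."""
    return sum(model_votes) >= VOTES_NEEDED


# Toy example: a 20-residue signal peptide followed by an N/K-rich region.
toy_sequence = "M" * 20 + "NK" * 25
toy_vector = feature_vector(toy_sequence, 20, [0.1] * N_TRANSCRIPT)
```

The vote threshold is kept as a named constant so that a stricter or looser consensus could be explored without touching the rest of the sketch.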

The apicoplast proteome contains a multitude of novel and essential proteins
To determine whether candidate apicoplast proteins from this study have the potential to reveal unexplored parasite biology or are candidate antimalarial drug targets, we assessed the novelty and essentiality of the identified proteins. We found that substantial fractions of the BioID and PlastNN proteomes (49% and 71%, respectively) and 50% of the compiled apicoplast proteome represented proteins that could not be assigned to an established apicoplast pathway and therefore might be involved in novel organellar processes (Figure 5A). Furthermore, we identified orthologs of identified genes in the 150 genomes present in the OrthoMCL database (Chen et al., 2006): 39% of the compiled apicoplast proteome were unique to apicomplexan parasites, with 58% of these proteins found only in Plasmodium spp. (Figure 5B). This analysis indicates that many of the proteins identified are significantly divergent from proteins in their metazoan hosts.

Consistent with the critical role of the apicoplast in parasite biology, a recent genome-scale functional analysis of genes in the rodent malaria parasite P. berghei showed that numerous apicoplast proteins are essential for blood-stage survival (Bushell et al., 2017). Using this dataset, we found that 77% of those proteins in the compiled apicoplast proteome that had P. berghei homologs analyzed by PlasmoGEM were important for normal blood-stage parasite growth (Figure 5C). Notably, of 49 proteins that were annotated explicitly with "unknown function" in their gene description and for which essentiality data are available, 38 are important for normal parasite growth, indicating that the high rate of essentiality for apicoplast proteins is true of both previously known and newly discovered proteins. Overall, these data suggest that we have identified dozens of novel proteins that are likely critical for apicoplast biology.

Localization of candidate proteins confirms accuracy of protein identification
To confirm the utility of our approaches, we experimentally determined the localization of several candidate apicoplast proteins. A rhomboid protease homolog ROM7 and 3 conserved Plasmodium proteins of unknown function (PF3D7_0521400, PF3D7_1472800, and PF3D7_0721100) were each overexpressed as C-terminal GFP fusions and tested in apicoplast localization assays. First, we detected the apicoplast-dependent cleavage of each candidate as a marker of its import. Most nuclear-encoded apicoplast proteins are proteolytically processed to remove N-terminal targeting sequences following successful import into the apicoplast (Waller et al., 1998; van Dooren et al., 2002). This processing is abolished in parasites rendered "apicoplast-minus" by treatment with an inhibitor (actinonin) to cause apicoplast loss (Yeh and

apicoplast-intact and -minus parasites showed that ROM7, PF3D7_0521400, and PF3D7_1472800 (but not PF3D7_0721100) were cleaved in an apicoplast-dependent manner (Figure 6A).

Second, we localized the candidate-GFP fusions by live fluorescence microscopy and assessed their mislocalization in apicoplast-minus parasites. Consistent with apicoplast localization, ROM7-GFP, PF3D7_0521400-GFP, and PF3D7_1472800-GFP localized to branched structures characteristic of the apicoplast (Figure 6B). In apicoplast-minus parasites, these proteins mislocalized to diffuse puncta (Figure 6B), as previously observed for apicoplast proteins (Yeh and DeRisi, 2011). Interestingly, while in untreated parasites PF3D7_0721100-GFP localized to a few large bright puncta not previously described for any apicoplast protein, this protein also relocalized to the typical numerous diffuse puncta seen for genuine apicoplast proteins in apicoplast-minus parasites (Figure 6B). Taken together, these data validate the apicoplast localization of ROM7, PF3D7_0521400, and PF3D7_1472800. Though targeting peptide cleavage and the characteristic branched structure were not detected for PF3D7_0721100, the mislocalization of PF3D7_0721100-GFP to puncta characteristic of apicoplast-minus parasites indicates that this protein may also be a true apicoplast protein.

We performed immunofluorescence analysis (IFA) to determine whether ABCB7-HA and ABCF1-HA colocalized with the apicoplast marker ACP. ABCF1-HA exhibited clear co-localization with ACP, confirming its apicoplast localization (Figure 6C). ABCB7-HA localized to elongated structures that may be indicative of an intracellular organelle but rarely co-localized with ACP, indicating a primarily non-apicoplast localization (Figure 6C). Overall, of 8 candidates of unknown localization at the start of this study, we identified 6 confirmed apicoplast proteins, 1 likely apicoplast protein, and 1 potential false positive.

A novel apicoplast protein ABCF1 is essential and required for organelle biogenesis
We determined the essentiality and knockdown phenotype of a newly identified apicoplast protein, ABCF1, taking advantage of the TetR-binding aptamers inserted into its 3ʹ UTR as described above. In the presence of anhydrotetracycline (ATc), binding of the aptamer by a TetR-DOZI repressor is inhibited and ABCF1 is expressed. Upon removal of ATc, repressor

2) loss of transit peptide processing of nuclear-encoded apicoplast proteins, and 3) relocalization of apicoplast proteins to puncta. Indeed, the apicoplast:nuclear genome ratio drastically decreased in ABCF1 knockdown parasites beginning 1 cycle after knockdown (Figure 7C), and western blot showed that the apicoplast protein ClpP was not processed in ABCF1 knockdown parasites (Figure 7D). Furthermore, IFA of the apicoplast marker ACP confirmed redistribution from an intact plastid to diffuse cytosolic puncta (Figure 7E). In contrast to ABCF1, a similar knockdown of ABCB7 caused no observable growth defect after four growth cycles despite significant reduction in protein levels (Figure S4). Together, these results show that ABCF1 is a novel and essential apicoplast protein with a previously unknown function in organelle biogenesis.

Discussion
Since the discovery of the apicoplast, identification of its proteome has been a pressing priority. We report the first large-scale proteomic analysis of the apicoplast in blood-stage malaria parasites, which identified 187 candidate proteins with 52% sensitivity and 92% PPV.
A number of groups have also profiled parasite-specific membrane compartments using proximity biotinylation but observed contamination with proteins in or trafficking through the ER, preventing accurate identification of these proteomes without substantial manual curation and

GFP-BirA* to detect enrichment of apicoplast proteins from background ER labeling and 2) strong positive and negative controls to set an accurate threshold. We suspect a similar strategy to detect nonspecific ER background may also improve the specificity of proteomic datasets for other parasite-specific, endomembrane-derived compartments.

Leveraging our successful proteomic analysis, we used these empirical data as an updated training set to also improve computational predictions of apicoplast proteins. PlastNN identified an additional 118 proteins with 95% sensitivity and 94% PPV. Although two previous prediction algorithms, PATS and ApicoAP, also applied machine learning to the problem of transit peptide prediction, we reasoned that their low accuracy arose from the small training sets used (ApicoAP) and the use of cytosolic as well as endomembrane proteins in the negative training set (PATS). By using an expanded positive training set based on proteomic data and limiting our training sets to only signal peptide-containing proteins, we developed an algorithm with higher sensitivity than BioID and higher accuracy than previous apicoplast protein prediction models. Moreover, PlastNN suggests testable hypotheses regarding the contribution of sequence-based and temporal regulation to protein trafficking in the ER.
Overall, we have compiled a high-confidence apicoplast proteome of 346 proteins that are rich in novel and essential functions (Figure 5A and 5C). This proteome likely represents a majority of soluble apicoplast proteins, since 1) our bait for proximity biotinylation targeted to the lumen and 2) most soluble proteins use canonical targeting sequences that can be predicted. Further improvements to the apicoplast proteome will focus on expanding the coverage of membrane proteins, which more often traffic via distinctive routes (Mullin et al., 2006; Parsons et al., 2007). Performing proximity biotinylation with additional bait proteins may identify such atypical apicoplast proteins. In the current study, our bait was an inert fluorescent protein targeted to the apicoplast lumen to minimize potential toxicity of the construct. The success of this apicoplast GFP bait gives us confidence to attempt more challenging baits, including proteins localized to sub-organellar membrane compartments or components of the protein import machinery. Performing apicoplast BioID in liver and mosquito stages may also define apicoplast functions in these stages.

The apicoplast proteome will be a valuable resource for uncovering cryptic pathways

Following >50 transfections, 3 essential and 4 non-essential apicoplast membrane proteins were identified. One newly-identified essential apicoplast membrane protein was then validated to be required for apicoplast biogenesis in P. falciparum. In contrast, even though our study was not optimized to identify membrane proteins, the combination of BioID and PlastNN identified 2 known apicoplast transporters, 4 of the new apicoplast membrane protein homologs, and 56 additional proteins predicted to contain at least one transmembrane domain. A focused screen of higher quality candidates in P. falciparum is likely to be more rapid and yield the most relevant biology. Our high-confidence apicoplast proteome will streamline these labor-intensive screens, focusing on strong candidates for downstream biological function elucidation. As methods for analyzing gene function in P. falciparum parasites continue to improve, this resource will become increasingly valuable for characterizing unknown organellar pathways.

AlexaFluor 546-conjugated streptavidin (ThermoFisher S11225) for one hour followed by three washes in PBS. No labeling of GFP was necessary, as these fixation conditions preserve intrinsic GFP fluorescence. Coverslips were mounted onto slides with ProLong Gold antifade reagent with DAPI (ThermoFisher) and were sealed with nail polish prior to imaging.

For immunofluorescence analysis, parasites were processed as above except that fixation was performed with 4% paraformaldehyde and 0.0075% glutaraldehyde in PBS for 20 minutes and blocking was performed with 5% BSA in PBS. Following blocking, primary antibodies were used in 5% BSA in PBS at the following concentrations: 1:500 rabbit-α-ACP (Gallagher and Prigge, 2010); 1:100 rat-α-HA 3F10 (Sigma 11867423001). Coverslips were washed three times in PBS, incubated with goat-α-rat 488 (ThermoFisher A-11006) and donkey-α-rabbit 568 (ThermoFisher A10042) secondary antibodies at 1:3000, and washed three times in PBS prior to mounting as above.
The resulting spectra were searched against a "target-decoy" sequence database (Elias and Gygi, 2007) consisting of the PlasmoDB protein database (release 32, released April 19, 2017), the Uniprot human database (released February 2, 2015), and the corresponding reversed sequences using the SEQUEST algorithm (version 28, revision 12). The parent mass tolerance was set to 50 ppm and the fragment mass tolerance to 0.6 Da for CID scans, 0.02 Da for HCD scans. Enzyme specificity was set to trypsin. Oxidation of methionines was set as variable modification and carbamidomethylation of cysteines was set as static modification. Peptide identifications were filtered to a 1% peptide false discovery rate using a linear discriminator analysis (Huttlin et al., 2010). Precursor peak areas were calculated for protein quantification.

C-score was used to predict the signal peptide cleavage position, and the remaining portion of the protein was inspected for presence of a putative apicoplast transit peptide using the rules described for PlasmoAP (Foth et al., 2003), implemented in a Perl script.

P. falciparum proteins predicted to localize to the apicoplast by ApicoAP were accessed from the original paper (Cilingir et al., 2012). Genes predicted to encode pseudogenes were excluded.
A positive control list of 96 high-confidence apicoplast proteins (Table S2)

To generate the positive training set for PlastNN, we took the combined list of previously known apicoplast proteins (Table S2) and apicoplast proteins identified by BioID (Table S1) and removed proteins that (1) were likely false positives based on manual inspection; (2) were likely targeted to the apicoplast without the canonical bipartite N-terminal leader sequence; or (3) did not contain a predicted signal peptide based on the SignalP 3.0 D-score. This yielded a final positive training set of 205 proteins (Table S4). The negative training set was the previously generated list of known non-apicoplast proteins (Table S2)

examples. We trained models using 6-fold cross-validation; that is, we trained 6 separate models with the same architecture, each using 5 of the 6 folds for training and then using the one remaining fold as a cross-validation set to evaluate performance. Accuracy, sensitivity, and PPV

Gene products with annotations that could clearly assign a given protein to an established cellular pathway were labeled as "Known Pathway;" gene products with a descriptive annotation that did not clearly suggest a cellular pathway were labeled as "Annotated Gene Product, Unknown Function;" and gene products that explicitly contained the words "unknown function" were labeled as "Unknown Function."

Figure S2, Table S1, Table S2, and Table S3.

See also Table S1 and Table S2.

Table S2, Table S3, Table S4, Table S5, Table S6, Table S7.

identified that are essential, cause slow growth when deleted, or are dispensable based on PlasmoGEM essentiality data of P. berghei orthologs. Absolute number of proteins identified as indicated. See also Table S1, Table S7, and Table S8.

Table S1, Related to Figures 2, 3, and 5. Abundances of 728 P. falciparum proteins identified by mass spectrometry in ≥2 biological replicates and with ≥2 unique peptides in at least one mass spectrometry run.