PhotoModPlus: A webserver for photosynthetic protein prediction from a genome neighborhood feature

Identification of photosynthetic proteins and their functions is essential for understanding and improving photosynthetic efficiency. We present here a new webserver called PhotoModPlus as a platform to predict photosynthetic proteins via genome neighborhood networks (GNN) and a machine learning method. GNN facilitates users to visualize the overview of the conserved neighboring genes from multiple photosynthetic prokaryotic genomes and provides functional guidance to the query input. We also integrated a newly developed machine learning model for predicting photosynthesis-specific functions based on 24 prokaryotic photosynthesis-related GO terms, namely PhotoModGO, into the webserver. The new model was developed using a multi-label classification approach and genome neighborhood features. The performance of the new model was up to 0.872 of F1 measure, which was better than the sequence-based approaches evaluated by nested five-fold cross-validation. Finally, we demonstrated the applications of the webserver and the new model in the identification of novel photosynthetic proteins. The server was user-friendly designed and compatible with all devices and available at http://bicep.kmutt.ac.th/photomod or http://bicep2.kmutt.ac.th/photomod.


31
Photosynthesis is a complex and diverse process developed in green organisms for 32 converting sunlight into useable chemical form. Discovering new components in the 33 photosynthetic process is substantially important, as it may point the way to improve 34 photosynthesis efficiency [1-4] and its applications [5][6][7][8]. 35 Considering the efficient pipeline used in the identification of novel photosynthetic 36 proteins, it is clear that computational approaches are crucial as a first screening step before a 37 more in-depth study with experimental approaches [9][10][11]. An earlier attempt of the 38 computational approach has been made based on sequence similarity search. It is trustable 39 when high sequence similarity is found in the existing database, but its accuracy drops quickly 40 when the lower similarity is found. It was shown that the classification of photosynthetic 41 proteins suffers from the diversity of photosystem, which can cause a high false-positive rate The next steps are the process for building a machine learning model, which consists 156 of i) selecting the optimal feature set, and ii) training the model. Due to a large number of 157 features in the genome neighborhood profile, we removed redundant and irrelevant features in 158 the first step to increase learning speed and model performance. We employed the multi-label 159 feature selection method, RF-ML [22], which is the extended version of the single-label feature 160 selection method, Relief. The RF-ML method was developed by using the dissimilarity 161 function between multi-labels to determine the optimal set of features. That is, it considers the 162 relationship between labels making the prediction performance of the model better. We 163 modified the original RF-ML script to obtain the top-N features ranked by dissimilarity score, 164 where N is a number of features in the feature subset. The N number was fine-tuned during the 165 model development process. 166 The next step is the building of a machine learning model, in which we employed a 167 problem transformation method to transform a multi-label dataset into the multi-class dataset 168 allowing usual learning algorithms to learn. We initially selected three different multi-label 169 transformation methods for model development, and the method with the best performance 170 was selected as the final model. The first method is Binary Relevance (BR) [23], a classical 171 approach in multi-label classification. It constructs a binary classifier for each label, and the 172 final prediction is determined by aggregating the classification results from all classifiers. The 173 problem of this method is that it ignores label relationships resulting in poor performance when 174 applied on real-world datasets. The second method is Label Power-set (LP) [24], which it is 175 developed to take label dependency into account by constructing binary classifiers from all 176 possible sets of those labels. Therefore, it usually suffers from worst-case time complexity 177 when the size of the label set is large. The last method is RAndom k labEL sets (RAkEL) [25], 178 an ensemble of LP classifiers, each of which is built from a different random label subset of a 179 size k. Therefore, the classification performance of this method depends on the random strategy 180 of the label subset. We applied these transformation methods with the random forest algorithm 181 as a base binary classifier, which performs the best in our previous study [15]. We developed 182 classification models by using the MEKA framework [26]. Three parameters are tuned to 183 optimize the model, i.e., number of the tree, the number of random features used in the tree in 184 the random forest, and the number of selected features from the RF-ML method. 185 Nested cross-validation 186 We used five repetitions of nested 5x3-fold cross-validation (CV) to evaluate the model 187 performance with the optimal parameter set. The schema of the nested fold CV is demonstrated 188 in S2 Fig. In [28]. F1 score is the harmonic mean between precision and recall, and F1max is the 213 maximum F1 score overall classification thresholds of each model. 214 (1) 215 (2) 216 (3) 217 (4) 218 First, precision (pri) and recall (rci) are calculated as equations (1) and (2). For every 220 protein i in the dataset, f is a GO term label, Ti is a set of true GO term labels and Pi(t) is a set 221 of predicted GO term labels of threshold t. I is an identity function which returns 1 for true and 222 0 for the false condition. Second, average precision (AvgPr) and average recall (AvgRc) are 223 calculated as equations (3) and (4), where m(t) is a number of proteins that contain at least one 224 predicted label and n is a total number of proteins. The F1max is calculated as equation (5) with 225 the thresholds t in range of 0 to 1 and a step size of 0.1. 226

227
The PhotoModPlus web server was developed using Flask framework version 1.0.2 and 228 python script and implemented using Apache HTTP web server version 2.4.18. PhotoModPlus 229 web server is available at bicep.kmutt.ac.th/photomod and bicep2.kmutt.ac.th/photomod. 230

232
In this study, we developed a new multi-label classifier, namely PhotoModGO, to 233 predict functional subclasses in the photosynthetic system. Photosynthetic subclasses were 234 selected from the list of 61 photosynthetic specific GO terms [12], in which only 24 GO terms 235 (S3 Table) belong to prokaryotes. Data preprocessing was carried out as described in the 236 method section. 237 Among the multi-label classifiers, RAkEL performed the best with the F1max value of 238 0.872 following by the BR method with a value of 0.862 (Table 1). The performance between 239 these two algorithms was not significantly different (p-value = 0.166) observed by the 240 Wilcoxon signed-rank test (S4 Table). LP performed the worst with an F1max value of 0.693, 241 which was significantly different from the two methods (p-value < 0.00001). Although the 242 recall value of the LP model was highest, the precision value was lowest compared to the 243 others. It is consistent with the previous benchmark [29] showing that the ensemble 244 transformation methods generally perform better than simple transformation methods. 245 The same dataset was used to train two sequence-based models, DeepGoPlus and 246 BLAST, for performance comparison. We found that DeepGoPlus was able to achieve 0.702 247 of F1max, 0.771 of precision, and 0.693 of recall. The classical BLAST method was the worst-248 performing method in this comparison with F1max, Precision, and Recall of 0.603, 0.601, and 249 0.624, respectively. Although the result showed that RAkEL and BR methods performed better 250 than DeepGoPlus significantly (p < 0.00001), the performance of the LP method was not 251 statistically different from the performance of DeepGoPlus (p = 0.135). It is clear that the 252 genome neighborhood-based models outperformed the sequence-based models on average. 253 we examine the role of sequence identity in our model performance. We created a new dataset 259 containing sequence identity  70%, called easy dataset. PhotoModGO-RAkEL, DeepGOPlus, 260 and BLAST models were trained with this dataset and evaluated by using five replications of 261 nested 5x3 fold CV. We observed that DeepGoPlus and BLAST methods performed better on 262 the easy dataset (identity  70%) comparing to identity  50% dataset, but we cannot observe 263 the difference in PhotoModGO performance (Fig 2) 283 We demonstrated the use of our new model by predicting the function of novel 284 photosynthetic genes/proteins. We retrieved 17 recently identified photosynthetic proteins, 285 which have never been collected in the Uniprot database (Sep 2016). As shown in Table 2, the 286 functional annotations of many proteins from literature are crudely described, which makes it 287 difficult to do a further experiment. We can apply the new model to these proteins to specify 288 functional classes related to photosynthesis. We found that many proteins were predicted to be 289 related to PSII, such as RfpA, IflA and DpxA, while the only IsiX was predicted to be related 290 to PSI. Interestingly, sll0272 product has been documented as a homolog of NdhV in 291

Functional prediction of novel photosynthetic proteins
Arabidopsis thaliana. Deletion of this gene reduced the NADPH dehydrogenase (NDH-1)-292 dependent cyclic electron transport around photosystem I (NDH-CET) activity, which has 293 previously shown as a protective role against several stress conditions e.g. high light, high salt 294 and high temperature [31-33]. It has been proposed that sll0272 product is a ferredoxin-binding 295 domain of cyanobacterial NDH-1 localized in the thylakoid membrane [31]. Based on our 296 prediction, we propose that the product of this gene functions in the photosynthetic system as 297 a regulator or stabilizer for PSII components. A few studies supporting this prediction showed 298 that the absence of NDH-CET activity impairs PSII repairing process under heat stress 299 conditions [34,35]. Besides, three proteins i.e. ApcD4, MpeZ and all4940 were predicted to 300 be related to phycobilisome consistent with the function from the literature. 301 Surprisingly, we observed that the predicted GO terms from the DeepGoPlus model 302 were sparse and not informative (S5 Table). For example, all of the query genes were assigned 303 by GO:0034357 (photosynthetic membrane) and GO:0009579 (thylakoid), while only ssl3829 304 and asr1131 were assigned by a more specific term, GO:0009523 (photosystem II). Moreover, 305 only six from 17 query sequences could be assigned with photosynthetic GO terms by SVMprot 306 webserver (S5 Table). These results indicated the advantage of PhotoModGO in the 307 classification of photosynthetic subclasses over the sequence-based approaches. 308 Functional prediction of unknown proteins in Synechocystis sp.

PCC 6803
312 Identification of novel photosynthetic proteins might provide us a new route to improve 313 photosynthesis efficiency. It has been found that more than 50% (1,885 from 3672) of protein-314 coding genes in the model organism of photosynthetic prokaryote, Synechocystis sp. PCC 315 6803, are unknown (data from http://genome.microbedb.jp/cyanobase, Sep 2018). Therefore, 316 we applied two machine learning models, PhotoMod and PhotoModGO, implemented in 317 PhotoModPlus webserver to screen the unknown genes in Synechocystis sp. PCC 6803 genome 318 to find the potential novel photosynthetic genes. As a result, we could predict ~100 high 319 confident photosynthetic gene candidates and their function (as shown in S6 Table), some of 320 which are shown in Table 3 as interesting candidates with informative annotations. 321 sll0249 localizes next to isiAB operon, thus it was designated to isiC. While IsiA protein 322 was shown to associate with PSI to form a ring-like complex around a PSI reaction center under 323 iron deficiency condition [49], isiC was observed to be transcriptionally enhanced during iron 324 deficiency with no functional designation [50]. Also, the isiA mutant was previously reported 325 to be sensitive to high light [51], while the isiC mutant showed high sensitivity to oxidative 326 stress [52]. It is known that the oxygenic photosynthetic bacteria consume a large amount of 327 iron for maintaining photosynthesis (e.g. a part of several iron-sulfur cluster-containing 328 proteins). Thus, to minimize harmful radicals from oxidative interaction, iron accumulation in 329 cells should be tightly regulated [53]. According to our prediction result and such pieces of 330 evidence, we proposed that sll0249 or IsiC appears to play an important role in balancing PSI 331 activity and oxidative stress response. 332 We also predicted that sll0611 functions in PSII. DNA microarrays of Synechocystis 333 PCC6803 showed that sll0611 gene responses to the depletion of lexA gene, which is the key 334 regulator of the SOS response induced by DNA damage [54]. A more comprehensive study 335 achieved by RNA-seq analysis showed that the depletion of lexA gene resulted in altered 336 expression of many genes, including those in photosynthesis, although no significant change of sll0611 was observed in this study [55]. A further study focusing on sll0611 is required to 338 clarify photosynthesis-related function. slr0022 product was previously identified as core 339 cyanobacterial clusters of orthologous groups of proteins (CyOGs) [11] and was predicted by 340 the sequence similarity-based approach as Fe-S cluster protein. illustrates an example of the prediction of an sll0272 protein sequence, which was classified as 364 photosynthetic protein shown with its predicted photosynthesis-specific GO terms. 365 parameters to submit query sequences, and the link to the output page will be sent to the user 380 by email. The summited sequences are blasted to our protein sequence database collected from 381 only photosynthetic organisms. The output is visualized as GNN via Cytoscape.js [60] plugin. 382 The network was designed to be portable, informative, and easy to use. As demonstrated in Fig  383   4A, the hexagonal node represents the query, while the circular node represents its genome 384 neighbors. The labeled text in each node represents a protein cluster ID. The size of the genome 385 neighbor node varies according to its conservation score (Phylo score). There are three build-386 in protein clustering cutoffs (E-value: 1E-10, 1E-50, and 1E-100) available on the top of the 387 network to define the level of homologous relationship of proteins. These pre-calculations 388 allow the user to immediately adjust the network according to the stringency. Users can click 389 on the node to observe the most enriched functions in each protein cluster in the network. We 390 also provide a link from the node to the hub of protein databases (Interpro), allowing users to 391 gain more insight into protein functions. The lower part of the output page ( Fig 4B) shows the 392 list of GO terms that are statistically enriched among the group of genome neighborhoods. 393 Users can use this tool for, such as functional guidance, operon prediction, and protein 394 evolution by observing the genome neighborhood patterns. The standalone version of this tool 395 is freely available at https://github.com/asangphukieo/PhotoModGNN. 396 We demonstrated the application of the GNN in Fig 4 by using the sll0272 protein  397 sequence as an input. The GNN (Fig 4A) showed that sll0272 was conserved with 11 genome 398 neighborhoods with moderate Phylo scores in range 2.37-3.96. The distribution of the Phylo 399 scores collected from the entire dataset was shown in S7 Fig. The list of GO terms (Fig 4B) that were enriched among the group of the genome neighborhoods indicated that the query