PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features

A new web server called PhotoModPlus is presented as a platform for predicting photosynthetic proteins via genome neighborhood networks (GNN) and genome neighborhood-based machine learning. GNN enables users to visualize the overview of the conserved neighboring genes from multiple photosynthetic prokaryotic genomes and provides functional guidance on the query input. In the platform, we also present a new machine learning model utilizing genome neighborhood features for predicting photosynthesis-specific functions based on 24 prokaryotic photosynthesis-related GO terms, namely PhotoModGO. The new model performed better than the sequence-based approaches with an F1 measure of 0.872, based on nested five-fold cross-validation. Finally, we demonstrated the applications of the webserver and the new model in the identification of novel photosynthetic proteins. The server is user-friendly, compatible with all devices, and available at bicep.kmutt.ac.th/photomod.


Introduction
Photosynthesis is an important, complex biological process by which solar energy is converted into biochemical energy by some diverse organisms, including plants, algae, cyanobacteria, phytoplankton, and photosynthetic bacteria, for the production of biological compounds. However, there is still limited information about the complexity of the photosynthetic system, and many genes and protein functions are yet to be explored. Discovering new genes/proteins in the photosynthetic process is substantially important, as it may point the way to improve photosynthesis efficiency [1][2][3][4] and its applications [5][6][7][8].
Considering the efficient pipeline used in the identification of novel photosynthetic proteins, it is clear that computational approaches are a crucial screening step preceding in-depth studies by experimental approaches [9][10][11]. Earlier attempts at computational discovery had based methods, BLAST [21] and DeepGOPlus [18], investigated by nested cross-validation evaluation. The new model serves to explore functional subclasses of photosynthetic proteins, which might be retrieved from the candidates of the first application or other sources. The last, PhotoModGNN, is the application for genome neighborhood network (GNN) generation and visualization, which allows users to observe genome neighborhood patterns and explore photosynthetic functions by network analysis. Our PhotoModPlus provides an easy-to-use web interface for input submission and a convenient interpretable output for photosynthetic protein discovery.

Dataset collection
Photosynthetic protein datasets used in PhotoModPlus comprised two types: i) known photosynthetic protein datasets for training and testing the model, and ii) novel photosynthetic protein datasets.
Training and test datasets. The photosynthetic protein dataset was retrieved from the UniprotKB database. 15,191 protein sequences consisting of at least one of 61 photosynthesisspecific GO terms identified by Ashkenazi et al. [12] were included in the dataset. To avoid incomplete gene neighborhood identification, only photosynthetic proteins from 154 photosynthetic prokaryotes with complete genomes were included in the dataset, as previously reported [15]. To reduce sequence redundancy, we used the USEARCH analytical tool [22] to cluster similar sequences (sequence identity � 50% as a diverse dataset and � 70% as an easy dataset), using the command: usearchÀ cluster fast½input�À id½identity�À centroids½output�; where the input file is in FASTA format, identity is the percent sequence identity cutoff for the cluster, and output is the selected representative sequence in FASTA format. The number of sequences in the dataset was reduced to 369 sequences with identity � 50% and 1,021 sequences with identity � 70%, and these were applied for model development and model comparison.
Novel photosynthetic protein dataset. A set of novel proteins collected from the literature was used to evaluate the model performance. Proteins that had never been deposited in the Uniprot database by September 2016 were considered novel as our model was trained using data retrieved then. BLAST was used to check their availability in the database with the criteria: sequence identity � 50% and query and subject coverage � 70%. The lack of sequence homolog makes them ideal for testing the feasibility of our model to facilitate functional annotation of novel proteins.

PhotoModGO development
The PhotoModGO was newly developed using a multi-label classification approach, which allows each data point to be assigned to more than one functional class at the same time, to classify photosynthesis subclasses of photosynthetic proteins based on the genome neighborhood feature. The data preparation and preprocessing of PhotoModGO are similar to Photo-Mod procedures, allowing a connection between the two approaches, whereby the output from PhotoMod can be used as input data in PhotoModGO. In the pipeline, the PhotoModGO shares data preprocessing steps, including genome neighborhood calling, Phylo score calculation, and Phylo score normalization with PhotoMod. The photosynthetic subclasses were selected from the list of 61 photosynthesis-specific GO terms [12], which are child terms of photosynthesis (GO:0015979) at every level and are not linked to other functions in the BP category. Therefore, they can be considered as a part of the same tree of the photosynthesis term. Among these, only 23 GO terms belong to photosynthetic prokaryotes. In addition, we manually added one term of phycobilisome (GO:0030089), which is a light-harvesting protein complex in cyanobacteria and red alga. Although this GO term is not the child term of photosynthesis, it can be considered as marginally connected to photosynthesis through a photosynthetic membrane (GO:0034357). The description of all 24 selected GO terms is shown in S1 Table. The process started with the retrieval of the photosynthetic protein dataset with photosynthesis-specific GO terms, followed by the removal of redundant data as described in the dataset collection (Fig 2). Then, the loci of those photosynthetic protein-coding genes were determined and used for calling the gene neighborhood. The genes were considered neighbors if they were within 250 bp on the same strand. Two gene clusters were merged into the same neighborhood gene cluster if they were in a range of 200-1000 bp in the divergent direction, following the operon interaction concept as demonstrated in S1 Fig. The homologous relationship between protein sequences was determined by the protein clustering method with three stringent criteria: 1E-10, 1E-50, and 1E-100, according to a previous study [15]. The genome neighborhood conservation scores (Phylo scores) were calculated based on the phylogenetic tree of organisms that conserve those gene neighborhoods, as described previously [15]. The quantile cutoff points were determined and used for converting the raw Phylo scores to simple numeric forms (i.e., 0, 1, 2, and 3). Hereafter, we call the list of gene neighbors and the Phylo score of the query gene as a genome neighborhood profile. The matrix of the query genes and their genome neighborhood profiles was built and concatenated with the binary matrix of GO term annotations corresponding to the query genes. This processed matrix was converted to ARFF format, which is a standard format for machine learning software. Note that we keep using a matrix form to display the data in Fig 2 as it is easier to understand.
The next steps include the process for building the machine learning model: i) selection of the optimal feature set and ii) training of the model. Due to a large number of features in the genome neighborhood profile, we removed redundant and irrelevant features in the first step to increase learning speed and model performance. We employed the multi-label feature selection method, RF-ML [23], which is the extended version of the single-label feature selection method, Relief. The RF-ML method was developed using the dissimilarity function between multi-labels to determine the optimal set of features. That is, it considers the relationship between labels, making the prediction performance of the model better. We modified the original RF-ML script to obtain the top-N features ranked by the dissimilarity score, where N is the number of features in the feature subset. The N number was fine-tuned during the model development process.
The next step was the building of the machine learning model, during which we employed a problem transformation method to transform the multi-label dataset into a multi-class dataset, allowing usual learning algorithms to learn. We initially selected three different multi-label transformation methods for model development, and the method with the best performance was selected as the final model. First, we explored the Binary Relevance (BR) [24] algorithm, a classical approach in multi-label classification, which constructs a binary classifier for each label with the final prediction determined by aggregating the classification results from all classifiers. The

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features problem with this method is that it ignores label relationships, resulting in poor performance when applied to real-world datasets. Next, we explored the Label Powerset (LP) method [25], which takes label dependency into account by constructing binary classifiers from all possible sets of labels. Therefore, it usually suffers from worst-case time complexity when the size of the label set is large. Finally, we used RAndom k-labELsets (RAkEL) [26], an ensemble of LP classifiers, each of which is built from a different random label subset of size k. Therefore, its classification performance depends on the random strategy of the label subset. We applied these transformation methods with the random forest algorithm as a base binary classifier, as it performed the best in our previous study [15]. We developed the classification model using the MEKA framework [27]. Three parameters: the numbers of trees, random features used in the tree in the random forest, and selected features from the RF-ML method, were tuned to optimize the model.

PhotoModGNN development
The PhotoModGNN is portable, informative, and easy to interpret. As demonstrated in S1C Fig, the hexagonal node represents the query, while the circular node represents its genome neighbors. The labeled text in each node represents a protein cluster ID. The size of the genome neighbor node varies according to its conservation score (Phylo score). There are three built-in protein clustering cutoffs (E-value: 1E-10, 1E-50, and 1E-100) available at the top of the network to define the level of homologous relationship between proteins. These precalculations allow users to immediately and dynamically adjust the network according to the stringency. The network is visualized via the Cytoscape.js [28] plugin. Users can click on the node to observe the most enriched functions in each protein cluster in the network. We also provided a link from the node to the hub of protein databases (InterPro), allowing users to gain more insight into protein functions. The output page shows the list of GO terms that are statistically enriched among the group of genome neighborhoods. Users can use this tool for functional guidance, operon prediction, and protein evolution analysis by observing the genome neighborhood patterns. The standalone version of this tool is freely available at https://github.com/asangphukieo/PhotoModGNN.

PhotoModPlus web server implementation and demonstrations
The web server was developed using Flask framework version 1.0.2 and a Python script and implemented using Apache HTTP web server version 2.4.18. In addition to the previous binary classification model 'PhotoMod' [15], which combines protein clustering, genome neighborhood extraction and scoring, and random forest algorithms to classify photosynthetic proteins, PhotoModPlus includes the development of 'PhotoModGO' and 'PhotoModGNN' in the pipeline for function categorization of photosynthetic proteins.
The workflow of PhotoModPlus is shown in Fig 1. Users can submit single or multiple protein sequences in FASTA format directly as input. Then, the homologs of the input are identified, and the genome neighborhood profiles are automatically generated. The input sequences with genome neighborhood profiles are delivered to PhotoMod [15] for classifying the photosynthetic proteins. Then, users can choose the candidate photosynthetic proteins for the next steps. PhotoModGO can be used to predict photosynthesis-related functions of photosynthetic proteins, PhotoModGNN can be used to create and visualize genome neighborhood network of photosynthetic proteins for inferring their functions, and InterProScan [29] can be used to identify other functions via a protein motif search. Users can use default parameters to submit query sequences, and the link to the output page is sent to the user by email. The web server is available at bicep.kmutt.ac.th/photomod.

Photosynthetic function prediction by PhotoModGO.
We demonstrated the application of PhotoModGO by predicting the function of a photosynthetic protein, sll0272. After the homologs of the input are identified with the E-value cutoff of 1E-25 and the maximum number of BLAST-matches limited to 1 (Fig 3A), the genome neighborhood profile is automatically generated. The prediction output from the binary classification model (PhotoMod) indicates that sll0272 has a high chance (83%) of being a photosynthetic protein (Fig 3B), whereas the output from the multi-label classification model (PhotoModGO) indicates the association of sll0272 with the photosystem II and regulation of photosynthesis ( Fig 3C).
Functional guidance by PhotoModGNN. We demonstrated the application of Photo-ModGNN using the sll0272 protein sequence as an input. The submitted sequences were searched by BLAST against our protein sequence database, which contains sequences collected from photosynthetic organisms only. The GNN ( Fig 4A) shows that sll0272 is conserved with 11 genome neighborhoods and moderate Phylo scores in the range of 2.37 to 3.96. The distribution of the Phylo scores collected from the entire dataset is shown in S2 Fig. The list of enriched GO terms ( Fig 4B) among the group of genome neighborhoods indicated that the query might be involved in a photosynthesis-related function and nucleic acid-related activity.

Baseline comparison methods and evaluation
To evaluate the performance of our newly developed model, PhotoModGO, we selected the recently developed sequence-based machine learning model, DeepGOPlus, and standard sequence similarity method, BLAST, for comparison. Nested fold cross-validation and F1 max score were used to evaluate the prediction performance.

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features dataset. Due to the high computational requirement of DeepGOPlus, we selected two optimal parameter sets, fine-tuned from the original paper, for use in this study.
BLAST. We also used a sequence similarity method based on Diamond BLAST as a baseline method for comparison. The training set was used as a database, and the test set as a query. Annotation from the training dataset was transferred to a similar sequence in the test set if it passed the E-value criteria, which ranged from 20 to 1E-30. The percent sequence identity was used as a probability score. The diamond BLAST command is shown below.
diamondblastpÀ d½training data�À q½test data�À e½EÀ value� À À outfmt 6 qseqidsseqidpident > ½output� Evaluation metric. To evaluate the prediction performance, we used simple and easily interpretable metrics, F1 max [30], as recommended by the Critical Assessment of Functional Annotation (CAFA) [30]. The F1 score is the harmonic mean of precision and recall, and F1 max is the maximum overall F1 score classification threshold of each model. AvgPr

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features First, precision (pr i ) and recall (rc i ) are calculated according to Eqs (1) and (2). For every protein i in the dataset, f is a GO term label, T i is a set of true GO term labels, and P i (t) is a set of predicted GO term labels of threshold t. I is an identity function which returns 1 for true and 0 for the false condition. Second, average precision (AvgPr) and average recall (AvgRc) are calculated as Eqs (3) and (4), where m(t) is the number of proteins that contain at least one predicted label, and n is the total number of proteins. The F1 max is calculated as Eq (5) with the threshold t in range of 0 to 1 and a step size of 0.1.
Nested cross-validation. We used five repetitions of nested 5x3-fold cross-validation (CV) to evaluate the model performance with the optimal parameter set. The schema of the nested CV is presented in S3 Fig. For every fold, a training set was separated and used to perform nested CV to find the best parameter set before validating with the test set. That meant we could test the model in each fold with the best parameter set without the leaking of information during the step of model training.

Performance of PhotoModGO in comparison to the baseline methods
We applied PhotoModGO to predict proteins of functional subclasses in the photosynthetic system. Among the multi-label classifiers, RAkEL performed the best with an F1 max value of 0.872, followed by the BR method with a value of 0.862 (Table 1). The performance between these two algorithms was not significantly different (p-value = 0.166) based on the Wilcoxon signed-rank test (S2 Table). LP performed the worst with an F1 max value of 0.693, which was significantly different from the other two methods (p-value < 0.00001). Although the LP model showed the highest recall value, its precision was low compared to the others. This is consistent with a previous benchmark [31] showing that ensemble transformation methods generally perform better than simple transformation methods.
The same dataset was used to train two sequence-based models, DeepGoPlus and BLAST, for performance comparison. DeepGoPlus achieved an F1 max value of 0.702, a precision value of 0.771, and a recall value of 0.693. The classical BLAST method had the worst performance, with F1 max , precision, and recall values of 0.603, 0.601, and 0.624, respectively. RAkEL and BR performed significantly (p< 0.00001) better than DeepGoPlus, but LP was not statistically different from DeepGoPlus (p = 0.135) in this regard. The genome neighborhood-based models outperformed the sequence-based models on average.

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features Although PhotoModGORAkEL showed high predictive performance overall, we found that four GO terms could not be predicted by this model, as shown by the performance for each GO term in S1 Table. One of the reasons is that the number of instances is not enough for training the model. Moreover, the predictive performance of the phycobilisome term (GO:0030089; F1 max = 0.548), which is marginally connected to the photosynthesis term, was not significantly different (one sample two-tailed t-test: p< 0.01) from the average performance per GO term (F1 max = 0.563). This indicates that the model development protocol is not seriously biased toward the original 23 photosynthesis-specific GO terms. In addition, it suggests that this protocol can be applied to other photosynthesis-related functions and other systems.

Sequence identity-independent performance of PhotoModGO
The more similar protein sequences are, the more they tend to have similar functions [32]. Therefore, we examined the role of sequence identity in our model performance. We created a new dataset, called the easy dataset, containing a sequence identity of �70%. PhotoModGO-R-AkEL, DeepGOPlus, and BLAST models were trained with this dataset and evaluated using five replications of nested 5x3-fold CV. We observed that DeepGoPlus and BLAST methods performed better on the easy dataset (identity �70%) compared to the dataset with identity�50%, but we observed no difference in PhotoModGO performance (Fig 5), indicating that PhotmoModGO does not suffer from the situation where proteins with similar functions share low sequence identity scores.
DeepGoPlus achieved an F1 max measure of 0.887 with the easy dataset, while BLAST method could achieve a value of 0.885. This better performance of sequence-based models indicates that they take advantage of the more similar sequence to predict the function. On the other hand, PhotoModGO could not take advantage of such sequence property, resulting in no significant difference between high and low sequence identity datasets in the prediction of performance. The reason is that PhotoModGO considers both sequence relationships and genome neighborhood profiles for the classification, i.e., weakly homologous sequences with the same genome neighborhood profile were generally classified under the same functional class.

Functional prediction of novel photosynthetic proteins
Moreover, our new model revealed the potential to predict the function of novel photosynthetic genes / proteins. For example, we retrieved 17 recently identified photosynthetic proteins, which had never been deposited in the Uniprot database as of September 2016. As shown in Table 2, the functional annotations of many proteins from literature are poorly described, which makes it difficult to work with them. We could apply the new model to these proteins to specify functional classes related to photosynthesis. We found that many proteins, including RfpA, IflA and DpxA, were predicted to be related to PSII, while only IsiX was predicted to be related to PSI. Interestingly, gene sll0272 has been documented as a homolog of NdhV in Arabidopsis thaliana. Its deletion reduces the activity of NADPH dehydrogenase (NDH-1)and NDH-1-dependent cyclic electron transport around photosystem I (NDH-CET), which plays a protective role against several stress conditions, e.g. high light intensity, salt and temperature [33][34][35]. It has been proposed that the sll0272 protein is a ferredoxin-binding domain of cyanobacterial NDH-1 localized in the thylakoid membrane [33]. Based on our model prediction, it was proposed that the product of this gene functions in the photosynthetic system as a regulator or stabilizer of PSII components. A few studies supporting this prediction showed that the absence of NDH-CET activity impairs PSII repairing process under heat stress conditions [36,37]. Besides, three proteins, ApcD4, MpeZ and all4940, were predicted to be related to phycobilisomes as reported in the literature.
Surprisingly, we observed that the predicted GO terms from the DeepGoPlus model were sparse and not informative (S3 Table). For example, all of the query sequences were assigned by GO:0034357 (photosynthetic membrane) and GO:0009579 (thylakoid), while only ssl3829 and asr1131 were assigned by a more specific term, GO:0009523 (photosystem II). Moreover, only six of 17 query sequences could be assigned with photosynthetic GO terms by the SVMprot web server (S3 Table). These results indicated the advantage of PhotoModGO over the sequence-based approaches in the classification of photosynthetic subclasses.

Functional prediction of unknown proteins in Synechocystis sp. PCC 6803
Identification of novel photosynthetic proteins might provide us a new route to improve photosynthetic efficiency. More than 50% (1,885 from 3672) of protein-coding genes in Synechocystis sp. PCC 6803, a model photosynthetic prokaryote, are unknown (data from http:// genome.microbedb.jp/cyanobase, Sep 2018). Therefore, we applied two machine learning models, PhotoMod and PhotoModGO implemented in the PhotoModPlus web server, to The unique sequence dataset with � 50% identity (diverse dataset) is represented by a black bar, while the dataset with� 70% identity (easy dataset) is represented by a white bar. Asterisks indicate statistically significant differences, based on the Wilcoxon signed-rank test ( � ,0.01<p<0.05; ��� , p<0.00001). https://doi.org/10.1371/journal.pone.0248682.g005

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features screen for unknown genes in the Synechocystis sp. PCC 6803 genome and identify potential novel photosynthetic genes. As a result, we could predict~100 high confident photosynthetic gene candidates and their functions (as shown in S4 Table). Some of these interesting candidates with informative annotations are shown in Table 3.

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features As sll0249 is localized next to the isiAB operon, it was designated isiC. IsiA protein has been associated with PSI and forms a ring-like complex around the PSI reaction center under iron deficiency condition [51]. isiC was observed to be transcriptionally enhanced during iron deficiency with no functional designation [52]. Also, the isiA mutant was previously reported to be sensitive to high light [53], while the isiC mutant showed high sensitivity to oxidative stress [54]. It is known that oxygenic photosynthetic bacteria consume a large amount of iron for maintaining photosynthesis (e.g. a part of several iron-sulfur cluster-containing proteins). Thus, to minimize harmful radicals from oxidative interaction, iron accumulation in cells should be tightly regulated [55]. According to our prediction result and the pieces of evidence, we propose that sll0249 and IsiC play an important role in balancing PSI activity in response to oxidative stress.
Besides, the predicted result also demonstrated that sll0611 is associated with PSII. DNA microarrays of Synechocystis PCC6803 showed altered expression of the sll0611 gene responding to the deletion of lexA, which is a key regulator of the SOS response induced by DNA damage [56]. A more comprehensive study achieved by RNA-seq analysis showed that deletion of the lexA gene resulted in altered expression of many genes, including those involved in photosynthesis, although no significant change in sll0611 expression was observed in this study [57]. A further study focusing on sll0611 is required to clarify its link with photosynthesis. The slr0022 protein was previously identified in the core cyanobacterial clusters of orthologous groups of proteins (CyOGs) [11] and was predicted by the sequence similarity-based approach as Fe-S cluster protein. Our model was able to predict more specific functions of slr0022 related to photosynthesis. Additionally, our prediction result for slr2049 is consistent with a

PLOS ONE
PhotoModPlus: A web server for photosynthetic protein prediction from genome neighborhood features recent publication [58] showing that slr2049 functions as lyase to catalyze phycobilin chromophores, which are a part of phycobiliproteins in phycobilisomes.

Discussion
In our PhotoModPlus web server, we developed PhotoModGO, which integrates protein clustering, genome neighborhood profile generation, and multi-label classification algorithm to classify the photosynthesis subclasses. The model can assign photosynthetic functions to proteins with high accuracy, even though their sequences are highly diverse from the training sequences. This is an advantage over sequence-based approaches, including DeepGO and BLAST, which perform very well with the easy dataset but handles diverse datasets poorly. We demonstrated that the BLAST method performs efficiently when the sequences in the dataset are similar (�70% sequence identity). However, if the sequences in the dataset are more diverse (�50% sequence identity), the BLAST method suffers from a lack of information, resulting in low performance. Therefore, the machine-learning method plays an important role in this situation. We showed that the DeepGOPlus method, which combines sequence similarity-based prediction and the motif sequence-based neuron network model, performed better than the classical BLAST method with the diverse dataset. Moreover, the potential of PhotoModGO to uncover the photosynthetic function of novel photosynthetic proteins was demonstrated to provide more informative GO terms than that of the sequence-based approaches.
The sequence-based approaches are suited for use as a first tool for predicting protein functions are considering their speed, ease of use and performance efficiency. When higher precision is required or in the absence of similar sequences in the database, it is necessary to consider other suitable approaches carefully. We can use the information from other sources to predict functions, for example, protein-protein interaction, protein structure, and genomic contexts. The genomic context seems the cheaper and easier data option given the substantial reduction in the cost of genome sequencing. Two well-established genomic context-based methods are the phylogenetic profile [59,60] and gene neighborhood [30]. The concept of both methods is similar, i.e., functionally linked proteins tend to co-occur in genomes [61]. However, gene neighborhood-based methods, particularly PhotoMod, focus on gene clusters and operons by observing only the surrounding region of the query gene. This generally can reduce noise or unrelated genes and provide high-quality functional relationships [15].
Additionally, by integrating the sequence similarity network and GNN [62], we developed PhotoModGNN to create and visualize GNN for inferring photosynthesis-related functions. Compared to the original GNN [62], PhotoModGNN was improved to circumvent complicated interpretations. One of the improvements is the use of Phylo scores as the indicative signal for detecting related functions from genome neighborhoods, instead of using the cooccurrence frequency, which might cause a bias toward dominant species in the dataset. In addition, enriched functions among genome neighborhoods are calculated and provided to users with the p-value to support users' decisions. One more feature of GNN is the ability to adjust the E-value threshold to separate protein families [62]. The range of the E-value threshold varies (ranging from 1E-8 to 1E-84) depending on the size and diversity of the dataset [62][63][64]. Therefore, we selected three E-value thresholds of 1E-10, 1E-50 and 1E-100 based on the minimum and maximum sequence similarity found in photosynthetic reaction center homologs. GNNs based on these three selected E-value thresholds are provided in PhotoModGNN to allow users to dynamically explore the genome neighborhood patterns from the different cutoffs. Therefore, PhotoModGNN can be used to explore functions of novel photosynthetic proteins other than the 24 subclasses that are predictable by PhotoModGO, as demonstrated.
Although PhotoModGO performs efficiently in predicting the photosynthetic subclasses, it is important to mention its limitations. Firstly, most of the photosynthetic-specific GO terms used in this study are in the biological process and cellular component categories (12 in the biological process, 10 in the cellular component, and 2 in molecular function). We showed that PhotoModGO generally outperformed the sequence-based approaches except for the molecular function category, where the sequence-based methods generally work well [65]. Thus, in the PhotoModPlus web server, we also enabled the prediction of functions based on the protein domain, via InterProScan [29], to improve the prediction quality. Secondly, gene neighborhood scoring is a time-consuming process. PhotoModGO calculates gene neighborhood conservation based on the phylogenetic tree, modified from Zheng et al. [66]. It requires a genome distance matrix based on shared gene contents between genomes to calculate the score. Our dataset contains 154 genomes, implying a huge amount of calculation (11,781 pairwise genome distance). Instead of using shared gene contents, one might estimate the distance between genomes using 16s rRNA, which can be obtained from public databases such as SILVA (https://www.arb-silva.de), to increase calculation speed and make the application easily expandable. Finally, our methods have been developed based on the concept of gene clusters and operons, which make it particularly applicable in only prokaryotic genomes, where gene clusters and operons commonly exist. Although we showed that genome neighborhood features could be used to classify photosynthetic proteins and their sub-functional classes, the framework can be applied to other systems. One of the most important systems on earth is the nitrogen assimilation and fixation system, which is crucial for carbon and nitrogen cycles and the global climate [67]. This complicated system is exclusively conducted by prokaryotes, and many related genes are organized as operons, which make it compatible with our framework.

Conclusions
Among the diversity of available data types and computational algorithms for protein function prediction, the PhotoModPlus represents a new and unique method based on genome neighborhood analysis. The current version of the web server consists of i) PhotoMod: a genome neighborhood-based model to classify photosynthetic proteins [15], ii) PhotoModGO: a genome neighborhood-based multi-label model to classify sub-functional classes of photosynthesis, and iii) the PhotoModGNN platform for visualizing the genome neighborhood network to infer protein functions. Using genome neighborhood features with machine learning model PhotoModGO, we were able to increase the functional prediction performance by~17-27% compared to other sequence-based approaches. Notably, the model can handle diverse sequence datasets well. The application of PhotoModPlus in the prediction of novel photosynthetic protein functions was demonstrated. We are assured that the user-friendly web interface of PhotoModPlus can increase its accessibility and usability for those interested in the discovery of photosynthetic genes / proteins.

Implementation and availability
PhotoModGO is available as free software in Github at https://github.com/asangphukieo/ PhotoModGO and in Dockerhub at https://hub.docker.com/r/asangphukieo/photomod2. The PhotoModPlus web server is available at bicep.kmutt.ac.th/photomod.  [68]. (B) Gene neighborhoods are called from all of the genomes that contain a homolog of the query sequence. (C) After applying protein clustering and calculating the Phylo score, a genome neighborhood network (GNN) can be constructed. The hexagonal node represents the query, while the circular node represents its genome neighbors. The label in each node represents a protein cluster ID. The size of the genome neighbor node varies according to its Phylo score. There are three built-in protein clustering cutoffs (E-value: 1E-10, 1E-50, and 1E-100) available on top of the network to define the level of homologous relationship between proteins. (PDF) For every outer fold, the training set (white color) is used to perform 3-fold cross-validation to find the best parameter set. The best parameter set is used to build a model again from the whole training set in the outer fold and tested with an independent test set (black color). (PDF) S1 Table. List of 24 GO terms specific to the photosynthetic system and PhotoModGO prediction performance of each GO term. (PDF) S2 Table. P-value table for Table. List of potential photosynthetic gene candidates predicted from the PhotoMod-Plus web server. The unknown protein-coding genes from Synechocystissp. PCC 6803 genome (1,885 genes) were used as input in PhotoModPlus prediction. BLAST best match and E-value of 1E-25 were used as BLAST parameters. The potential candidates were selected using the criteria: i) predicted as a positive class with more than 80% probability from the PhotoMod binary model and ii) containing at least one predicted GO term with more than 50% probability from the PhotoModGO multi-label model.