Propensity Scores for Prediction and Characterization of Bioluminescent Proteins from Sequences

Bioluminescent proteins (BLPs) are a class of proteins with various mechanisms of light emission such as bioluminescence and fluorescence from luminous organisms. While valuable for commercial and medical applications, identification of BLPs, including luciferases and fluorescent proteins (FPs), is rather challenging, owing to their high variety of protein sequences. Moreover, characterization of BLPs facilitates mutagenesis analysis to enhance bioluminescence and fluorescence. Therefore, this study proposes a novel methodological approach to estimating the propensity scores of 400 dipeptides and 20 amino acids in order to design two prediction methods and characterize BLPs based on a scoring card method (SCM). The SCMBLP method for predicting BLPs achieves an accuracy of 90.83% for 10-fold cross-validation higher than existing support vector machine based methods and a test accuracy of 82.85%. A dataset consisting of 269 luciferases and 216 FPs is also established to design the SCMLFP prediction method, which achieves training and test accuracies of 97.10% and 96.28%, respectively. Additionally, four informative physicochemical properties of 20 amino acids are identified using the estimated propensity scores to characterize BLPs as follows: 1) high transfer free energy from inside to the protein surface, 2) high occurrence frequency of residues in the transmembrane regions of the protein, 3) large hydrophobicity scale from the native protein structure, and 4) high correlation coefficient (R = 0.921) between the amino acid compositions of BLPs and integral membrane proteins. Further analyzing BLPs reveals that luciferases have a larger value of R (0.937) than FPs (0.635), suggesting that luciferases tend to locate near the cell membrane location rather than FPs for convenient receipt of extracellular ions. Importantly, the propensity scores of dipeptides and amino acids and the identified properties facilitate efforts to predict, characterize, and apply BLPs, including luciferases, photoproteins, and FPs. The web server is available at http://iclab.life.nctu.edu.tw/SCMBLP/index.html.


Introduction
Bioluminescent proteins (BLPs) are a class of proteins with various mechanisms of light emission such as bioluminescence and fluorescence from luminous organisms. Bioluminescence of organisms is the emission of visible light by a natural chemical reaction in diverse forms of morphology within a living organism [1]. In the bioluminescence process, the luciferin-luciferase reaction involves two chemicals, luciferin and luciferase. Luciferins are a class of small light-emitting heterocyclic compounds that are oxidized in the presence of the enzyme luciferase. Luciferase generally refers to the class of oxidative enzymes that catalyzes the oxidation of luciferin to emit light with an intermediate called oxyluciferin. Luciferins typically undergo enzyme-catalyzed oxidation, and the unstable oxyluciferin emits light upon decaying to its ground state. The amount of light emitted is proportional to the concentration of the substrate luciferins. Firefly luciferin is a wellstudied luciferin found in many firefly species, in which oxygen, ATP, and magnesium ions are required for light emission [2].
As a stable protein complex, photoprotein consists of a catalyzing protein (i.e. a luciferase variant) and factors required for light emission including the luciferin and oxygen. Photoproteins do not exhibit a luciferin-luciferase reaction and emits light without the intervention of an enzyme. Photoproteins are triggered to produce light upon binding with another ion or cofactor such as Ca 2+ , causing a conformational change in the protein. Photoproteins referring to luciferase-like proteins display luminescence proportional to the amount of the catalyzing protein.
The number of photoproteins with known structures is relatively small, compared with that of BLPs. As a photoprotein isolated from luminescent jellyfish Aequorea victoria, aequorin has been used to measure calcium ion concentration. Aequorin consists of two distinct units, the apoprotein apoaequorin and a latently luminescent molecule coelenterazine, i.e. a luciferin. When a calcium ion Ca 2+ binds to aequorin, the protein complex decomposes into apoaequorin, coelenteramide and CO 2 , accompanied by the emission of light [3].
Rather than producing their own light, fluorescence molecules absorb photons, which temporarily excite electrons to a higher energy state. When relaxing rapidly to their ground state, the electrons rerelease their energy, usually at a longer wavelength. Fluorescent proteins (FPs) belong to a structurally homologous class of proteins that share the unique property of being-sufficient to form a visible chromophore from a sequence of three amino acids within their own polypeptide sequence [4]. The tightly packed nature of a barrel excludes solvent molecules, thereby protecting the chromophore fluorescence from quenching by water. Fluorescent proteins are well established markers for gene expression and protein targeting in intact cells and organisms. The mechanism of light production through a chemical reaction distinguishes bioluminescence from fluorescence [4]. The green fluorescent protein (GFP) from jellyfish Aequorea victoria can be colocalized with its bioluminescence counterpart (i.e. aequorin).
BLPs serve in a variety of functions of cellular processes. With the increasing number of innovative commercial and medical applications using BLPs, understanding BLPs provides fascinating challenges for fundamental sciences and numerous opportunities for practical applications. Prediction and characterization of BLPs are of priority concern in a variety of research fields. Despite the availability of experimental methods to investigate BLPs [5], such methods are often time-consuming and severely limited in scope, leading to a lack of large datasets of reviewed BLPs. Therefore, developing computational methods to predict and characterize BLPs, including luciferases and FPs, from a sequence is more challenging, owing to their high variety in the protein sequence and lack of a large BLP dataset. The sequence-based prediction method can fast identify putative BLPs from sequences with/ without known structure information for further validation.
By establishing a dataset consisting of 441 BLPs and 18,502 non-BLPs, Kandaswamy et al. [6] proposed a computational method (known as BLProt) to predict BLPs. BLProt is trained using a dataset consisting of 300 BLPs and 300 non-BLPs and, then, evaluated by an extremely imbalanced test dataset (141 BLPs and 18,202 non-BLPs). BLProt uses a support vector machine (SVM) and 100 physicochemical properties selected by adopting three feature selection methods. BLProt achieves an accuracy of 80.0% for 5-fold cross-validation (5-CV) from training and an accuracy of 80.06% from testing. Huang et al. [7] subsequently proposed an improved SVM-based method (known as PBLP), in which 15 physicochemical properties are used to predict BLPs; in addition, the training (5-CV) and test accuracies are 84.50% and 81.79%, respectively. Zhao et al. [8] developed a BLPre method with a training accuracy of 85.17% for 10-CV to predict BLPs by using a model based on position specific scoring matrix (PSSM) and auto covariance. Fan and Li [9] recently proposed an accurate SVM-based method (known as SVM-Hybrid) by integrating multiple features, including dipeptide composition, reduced amino acid composition, PSSM, and auto covariance of averaged chemical shift. SVM-Hybrid achieves a training accuracy of 90.50% for 10-CV. Although yielding an acceptable accuracy, existing SVM-based prediction methods suffered from obtaining human-interpretable knowledge for further understanding BLPs.
Predicting and characterizing BLPs as well as luciferases and FPs in general conditions from a sequence are worthwhile yet challenging tasks. This study proposes a novel methodological approach to estimating the propensity scores of 400 dipeptides and 20 amino acids in order to design two prediction methods, i.e. SCMBLP and SCMLFP, based on a scoring card method (SCM) [10,11]. In addition to using the existing datasets of BLPs for comparison, this study also establishes a balanced dataset consisting of 274 non-BLPs and 274 BLPs, including 141 BLPs from seed proteins of the Pfam database, 94 BLPs using the GO term GO: 0008218 ''bioluminescence'', and 39 BLPs using the keyword ''photoprotein'' from the Protein Data Bank (PDB database) to evaluate SCMBLP. According to those results, SCMBLP achieves an accuracy of 90.83% for 10-CV, better than the existing SVM-based methods, and the test accuracy of 82.85%. Additionally, a dataset of 269 luciferases and 216 FPs is established to design a novel SCMLFP method for distinguishing luciferases from FPs. Evaluation results indicate that the SCMLFP method yields training and test accuracies of 97.10% and 96.28%, respectively. This study also identifies informative physicochemical properties from the AAindex database [12,13] by using propensity scores of 20 amino acids to gain further insight into BLPs, luciferase, photoproteins, and FPs. Results of this study demonstrate that the propensity scores of dipeptides and amino acids and the identified properties greatly facilitate mutagenesis analysis to enhance the bioluminescence and fluorescence of BLPs.

Materials and Methods
This study proposes a novel methodological approach to estimating the propensity scores of 400 dipeptides and 20 amino acids to design two prediction methods, i.e. SCMBLP and SCMLFP, based on a scoring card method (SCM). Figure 1 shows the flowchart of system designs involving datasets, methods, and analysis of propensity scores to predict and characterize BLPs.

Datasets
While obtaining 300 BLPs from seed proteins of the Pfam database [14], Kandaswamy et al. [6] enriched this dataset consisting of 441 BLPs with sequence identity , = 40%. The used training set, known as BLP-TRN, consists of 300 BLPs selected from the 441 BLPs and 300 non-BLPs from seed proteins of Pfam protein families that can be obtained from this work [6]. The test dataset used in the studies [6][7][8][9] consists of 141 BLPs and 18202 non-BLPs, which is an extremely imbalanced dataset. After manual examination by inquiring the domain classification in the Pfam database for each BLP, the dataset of 441 BLPs consist of three groups: 269 luciferases, 84 FPs, and 88 others (without obviously unique categorization into luciferase or FP). Owing to the extreme imbalance of the test dataset, the overall accuracy only is an unsatisfactory performance index used in related studies [6][7][8][9]. This study also established a balanced dataset, known as BLP-TEST, consisting of 274 non-BLPs (randomly selected from the 18,202 non-BLPs) and 274 BLPs including the 141 BLPs from seed proteins of the Pfam database, 94 BLPs from 44 species using the GO term GO:0008218 ''bioluminescence'' annotated on   SwissProt, and 39 BLPs (without duplicated sequences) using the keyword ''photoprotein'' from the PDB database.
To investigate the properties of discriminating between luciferases and FPs from a sequence, a dataset consisting of 269 luciferases and 216 FPs was created (given in Files S1 and S2, respectively). The 269 luciferases were obtained from the 441 BLPs in the Pfam database. There are 512 sequences obtained by using the keyword ''fluorescent protein'' from the PDB database. Only the 189 sequences with the classification category ''fluorescent protein'' were adopted. Deleting the duplicated sequences among the 189 and 84 FPs (from the Pfam database) led to a final dataset of 216 FPs. The luciferases and FPs in the dataset were randomly divided into ten groups. Among which, nine groups were used for training (i.e. LFP-TRN) and the other one for testing (i.e. LFP-TEST) in turn to design the SCMLFP method. Therefore, 10 independent training and test datasets are available to evaluate SCMLFP.
This study also calculates the Pearson's correlation coefficients (the R values) between the propensity scores of amino acids to be a BLP and the physicochemical properties of amino acids in the AAindex database. The range of R is [-0.733, 0.592]. Table 4 lists four physicochemical properties of amino acids with a large value of R estimated in a general condition. Figure 5 shows the correlations of propensity scores and the four identified properties of 20 amino acids. The four properties of interest with high absolute values of R are discussed below.

Scoring Card Method
The scoring card method (SCM) [10,11] is a general-purpose method for predicting and analyzing protein functions from primary sequences by estimating propensity scores of 400 dipeptides and 20 amino acids to be the protein with the investigated function. To apply the SCM method for designing a SCM-based predictor, the procedure mainly comprises the following steps: 1) both positive and negative datasets are prepared as input, 2) an initial scoring card with 400 propensity scores of dipeptides is generated using a statistical method, 3) propensity scores of 20 amino acids are derived from those of 400 dipeptides, 4) the scoring card is refined using a global optimization method, and 5) a binary SCM classifier with a threshold value is established. The algorithm of the SCMBLP method, based on SCM for predicting BLPs, is described below. Design of the SCMLFP method resembles that of SCMBLP by simply replacing the training dataset BLP-TRN with LFP-TRN. Further details of the SCM method and its applications can be found in these studies [10,11].
Step 1: Adopt a training dataset BLP-TRN, which consists of two subsets for positive (BLP) and negative (non-BLP) datasets.
Step 2: Generate an initial scoring card, which consists of propensity scores of 400 dipeptides by using a statistical method as follows: a. Calculate the dipeptide composition of BLPs and non-BLPs from the positive and negative datasets; b. Obtain the propensity score of each individual dipeptide by subtracting the composition value of the dipeptide in non-BLPs from that of the dipeptide in BLPs; and. c. Normalize the scores of all dipeptides into the range [0, 1000].
Step 3: Calculate the propensity score of each amino acid O by averaging the 40 scores of dipeptides OX and XO where X can be any amino acid.
Step 4: Optimize the scoring card (referred to herein as Scard) of dipeptides, which consists of 400 scores by using an intelligent genetic algorithm (IGA) [15]. In the chromosome representation, the 400 real-valued variables within the range [0, 1000] are encoded into a chromosome of IGA. The fitness function of IGA is Table 2. The comparisons of SCMBLP with some comparable methods using BLP-TRN in terms of 5-or 10-fold cross-validation accuracy (%).  to maximize the prediction accuracy in terms of the area under the ROC curve (AUC) [16] and also the Pearson's correlation coefficient (the R value) between the initial and optimized propensity scores of 20 amino acids. To avoid overfitting, the fitness function is calculated using a 10-CV assessment [10]. The fitness function is as follows (W 1 = 0.9 and W 2 = 0.1 in this study).

Max Fit Scard
Step 5: Predict a sequence P bases on the scoring function S(P), i.e. a weighted-sum score, and a threshold value determined by maximizing the prediction accuracy of using the training dataset.
where w i and S i denotes the composition and propensity score of the i-th dipeptide, respectively. Additionally, P is classified as the positive class (i.e. BLP) when S(P) is greater than the threshold value; otherwise, P is the negative class (i.e. non-BLP).

Identifying Informative Physicochemical Properties
Physicochemical properties of amino acids are well recognized to be effective features for predicting and analyzing protein functions from primary sequences [7,12,17]. Kandaswamy et al.
[6] and Huang et al. [7] developed prediction methods for BLPs using SVM and a set of physicochemical properties selected from the AAindex database [13]. Since the mathematical vectors representing the 544 physicochemical properties in AAindex are distributed diversely, a set of properties can achieve a high prediction accuracy for binary classifier problems if the feature selection method is satisfactory, especially when using optimal feature selection approaches [7,12,17].
To characterize BLPs from the aspect of physicochemical properties of amino acids, this study proposes a novel method in which the properties of particular interest are identified, based on the scoring card of SCM while considering two factors. First, many physicochemical properties of amino acids in AAindex are investigated in a specific condition. Second, bioluminescence of organisms occurs in diverse forms of morphology with various mechanisms of light emission, which also heavily depends on environmental conditions. Therefore, this study elucidates informative physicochemical properties in a generalized condition to discriminate between BLPs and non-BLPs, as well as luciferases and FPs.
The propensity scores of 400 dipeptides and 20 amino acids to be a BLP obtained from the above-mentioned SCMBLP classifier greatly facilitate efforts to understand BLPs. The procedure to identify informative physicochemical properties is described below. First, whether or not the bioluminescence is a global property of sequence for general BLPs is examined. The examination method analyzes the distribution of locations of highscore dipeptides on the BLP and non-BLP sequences [8,9]. A situation in which the high-score dipeptides are evenly distributed on the sequences implies that the bioluminescence is a global property of a sequence; otherwise, bioluminescence may occur in specific segments. Second, the procedure identifies the candidate physicochemical properties in AAindex with a large Pearson's correlation coefficient between the properties and the propensity scores or composition of 20 amino acids. The candidate physicochemical properties investigated in a generalized condition are preferred for further analysis.

Results and Discussion
Propensity Scores of BLPs Figure 2 shows the propensity scores of 400 dipeptides to be a BLP obtained by using the SCMBLP method with the BLP-TRN dataset consisting of 300 BLPs and 300 non-BLPs. The propensity scores of 20 amino acids can be derived from the propensity scores of 400 dipeptides. Table 1 lists the propensity scores of amino acids to be a BLP and the amino acid compositions of BLPs and non-BLPs. The total number of amino acids in the used datasets of BLPs and non-BLPs is 130,904 and 120,607, respectively. The propensity scores and amino acid composition can be analyzed to further gain insight into how to characterize BLPs.
Closely examining the dipeptide composition (not shown) reveals that the 20 top-ranked dipeptides according to propensity score (DI, LP, HP, YD, CN, GM, KN, CA, VM, QH, FH, CG, DV, GY, RG, IY, LS, MD, MQ, VR in order) make up BLPs (5.23%) and non-BLPs (4.37%). There are 19 individual dipeptides (except for IY) in which their compositions in BLPs are larger than those in non-BLPs. Compositions of the dipeptide IY for BLPs (0.1738%) and non-BLPs (0.1779%) are very close. The results indicate that BLPs have a larger number of dipeptides with high propensity scores than non-BLPs. Similarly, BLPs have a small number of dipeptides with low propensity scores. This result reveals that the propensity scores of dipeptides can discriminate between BLPs and non-BLPs.
The set of hydrophobic residues is {Ala, Ile, Leu, Met, Phe, Val, Cys, Gly}, and the other residues are hydrophilic residues according to the work [18]. The three top-ranked amino acids according to propensity scores are Gly = 573.025, Cys = 572.850, and Phe = 548.075, which belong to the class of hydrophobicity. The three amino acids with the smallest scores are Lys = 336.275, Ala = 359.750, and Ser = 368.500. The three hydrophobic residues Gly, Cys, and Phe make up BLPs (14.20%) and non-BLPs (11.67%). Based on the composition differences between BLPs and non-BLPs, the three individual amino acids Gly, Cys, and Phe are ranked at 1, 2, and 3, respectively. The top-10 high-score residues of BLPs have nine residues (except for Ile) with positive composition differences. Table 1 lists the compositions of BLPs and non-BLPs for residue Ile, which are 5.60% and 5.66%, respectively. The correlation coefficient R equals to 0.97 between the propensity scores of amino acids and composition difference of amino acids between BLPs and non-BLPs. Analysis results of amino acid composition indicate that BLPs have more hydrophobic residues and less hydrophilic residues than non-BLPs. Moreover, the high correlation coefficient suggests that the propensity scores of amino acids can also discriminate between BLPs and non-BLPs. From the propensity scores of dipeptides and amino acids, we can infer that two individual amino acids with high propensity scores do not necessarily form a dipeptide with a high propensity score. For example, although the two top-ranked amino acids are Gly and Cys, the dipeptides Gly-Cys and Cys-Gly have propensity scores of 442 and 923, respectively. The propensity scores of two dipeptides with the same two residues are not the same, owing to the different dipeptide compositions of BLPs. It is hypothesized that the charged polar residues play an important role in placing the luciferin and chromophore in the right orientation facing the hydrophobic surface of substrate-binding cavity [19], resulting in the asymmetry scores of dipeptides. Notably, the dipeptide with the smallest score is Ala-Ser with a score 0. The score of Ser-Ala is also small, which equals to 164. The two residues Ala and Ser have small propensity scores, ranking 19 and 18, respectively.

Prediction Accuracy of SCMBLP
The propensity score of a sequence to be a BLP is useful for discrimination between PLPs and non-BLPs. The histogram of sequence scores for BLPs and non-BLPs in the training dataset is given in Fig. 3. The distribution range of sequence scores for BLPs is reduced and shifted to the end of high score after the optimization on propensity scores. Furthermore, the distributions for BLPs and non-BLPs are more separable after optimization. A higher score of a sequence implies a larger probability that the sequence is a BLP.
This study also compares SCMBLP with the existing SVMbased methods [6][7][8][9] using the promising features: dipeptide composition (DPC), physicochemical properties (PCPs), PSSM and hybrid feature set. The SVM-DPC method using SVM with the DPC features is implemented for comparison. Owing to the extreme imbalance of the test dataset (i.e. 141 BLPs and 18,202 non-BLPs; 99.23% for non-BLP), the test accuracy heavily depends on the specificity performance of using the training dataset. If the design of the prediction methods tends to have a higher specificity performance than sensitivity performance (i.e. preference towards non-BLP), the test accuracy is easily higher than 90%. Therefore, it is not fair to evaluate the predictors using the test accuracy only. The training accuracy of 5-CV or 10-CV was commonly used for performance comparisons in these studies [6][7][8][9]. The comparisons of SCMBLP with existing methods using the training dataset BLP-TRN are given in Table 2.
Above comparisons reveal that the best SVM-based method is SVM-Hybrid with an accuracy of 90.50% for 10-CV using a hybrid feature set by integrating DPC, RAAC, PSSM, and auto covariance of average chemical shift (ACS) [9]. That SVM-DPC has an accuracy of 83.50% for 10-CV further reveals that DPC is a more effective feature type than PSSM and PCPs in terms of predicting BLPs. The proposed SCMBLP method of using SCM and DPC (90.83%) is superior to all the SVM-based methods. The high performance of SCMBLP arises mainly from the used optimization approach to adjusting 400 propensity scores of dipeptides, i.e. 400 features for classification. Owing to the difference in the feature number and optimization usage used, evaluating the effectiveness of individual types of features is relatively difficult. Based on the above results, we can conclude that the feature of dipeptides is effective in predicting BLPs, and the SCM classifier is comparable to the high-performance SVM classifier. Moreover, SVM is satisfactory in designing an accurate predictor and SCM is appropriate for analyzing the investigated functions with an acceptable accuracy from a sequence. However, the dipeptide features with propensity scores are more easily interpreted and helpful for subsequent analysis [10,11].
To evaluate SCMBLP using test performance for observing the generalization ability, a balanced dataset consisting of different resources for comparison and analysis is created. The test dataset BLP-TEST consists of 274 non-BLPs and 274 BLPs. Table 3 summarizes the performance of SCMBLP on BLP-TEST. From Table 3, the following observations are made. First, the overall test accuracy is 82.85% with a sensitivity of 85.40% and a specificity of 80.29%. The threshold value of sequence scores used is 439.63, and can be adjusted according to the preference of a decision marker on the sensitivity and specificity. Second, while considering the training accuracy of 90.83% and the test accuracy of 82.85%, not-severe overtraining seems to occur. This overtraining can be alleviated by way of increasing the size of the training dataset and including all domains of luciferases and FPs with a sufficient number of samples. Third, the accuracies of BLPs from PDB, SwissProt, and the Pfam database are 94.87%, 86.17%, and 82.27%, respectively. The prediction accuracies on the datasets consisting of reviewed BLPs from PDB and SwissProt are larger than that of using seed proteins from the Pfam database. The 39 BLPs from the PDB database and 94 BLPs from SwissProt with their sequence scores are given in Tables S1 and S2, respectively.

Physicochemical Properties for BLPs
The propensity scores come from the statistics of whole sequences to discriminate between BLPs and non-BLPs. Figure 4 shows the distribution of dipeptide scores on two typical sequences of BLP (Pfam ID A8QZJ6 of length 96 with domain Oxidored) and non-BLP (NCBI sequence entry GI: 20137223 of length 101) with high and low sequence scores of 551.65 and 256.86, respectively. This finding suggests that both high-and low-score dipeptides are evenly distributed on the sequences. Moreover, there are more high-score dipeptides in the BLP than in the non-BLP. Distribution of the dipeptide scores indicates no obviously high-score segment in the BLP. The above finding that top-ranked dipeptides do not tend to cluster in a certain region suggests that bioluminescence and fluorescence are a global property of BLP sequences.
1. The property of free energy of transfer (DG t~R T ln f ) from the inside to the protein surface in globular proteins [20] has R = 0.532. Variable f of each amino acid is a ratio of the molar fractions of the buried and accessible residues. For example, amino acid Cys has the largest free energy of transfer DG t 0.9 with f = 4.6, in which the molar fractions of the buried and accessible residues are 4.1 and 0.9, respectively. The propensity score of Cys is 572.85 at rank 2, which is quite close to 573.025 of Gly at rank 1. The residue with the smallest transfer free energy21.8 is Lys at rank 20. This finding is consistent with a situation in which Lys has the smallest propensity score of 336.275 at rank 20. The residues with high propensity scores to be a BLP tend to be buried with a high free energy of transfer. 2. The property of average scale of membrane preference (AMP07) [21] has R = 0.415. The membrane-preference value for the amino acid residue j is defined as follows: where f j, MEM denotes the frequency of occurrence of residue j in the transmembrane regions of the proteins and f j, TOT represents the frequency of occurrence of residue j along the entire sequences of the proteins, i.e. the molar fraction of residue j. Average scale of the membrane preference AMP07 has been computed as the arithmetical mean of seven different scales of the membrane preference. Residue Cys has the largest scale of membrane preference 1.60 at rank 1. Residue Lys having the smallest propensity score has the scale of membrane preference 0.33 at rank 18. Residues with high propensity scores to be a BLP tend to have residues with a high frequency of occurrence in the transmembrane regions. Consider the BLP with Pfam ID A8QZJ6 as an example. Three segments are in the transmembrane regions, which are in the positions 6,22, 29,50, and 56,78, annotated in the Pfam database [14].
3. The property of hydrophobicity scale from the native protein structures of globular proteins [22] has R = 0.478. The structure-derived hydrophobic interaction along can distinguish a substantial number of native conformation from a large pool of misfolded structures. The hydrophobic interactions largely contribute to the stability of native folds, which agrees with experimental findings. Residue Cys has the largest hydrophobicity scale 1.9, which ranks 1 st . The cysteine pair appears to be strongly hydrophobic since the formation of disulphide bonds increases the hydrophobicity of the reactants [23]. Residue Lys with the smallest propensity score has the smallest hydrophobicity scale 21.6 at rank 20. The hydrophobic potential is an important stabilizing interaction that is closely related to the native conformation. Moreover, the large propensity scores assigned to the hydrophobic residues agree with the hydrophobic characterization of luciferins, chromophore, and substratebinding cavity [19]. 4. The property of composition of amino acids in nuclear proteins [24] has R = 20.660. Correlation analysis of the amino acid composition and the cellular location of a protein is performed, which discriminates among the following five protein classes: integral membrane proteins, anchored membrane proteins, extracellular proteins, intracellular proteins and nuclear proteins [24]. BLPs tend to have a dislike of the amino acid composition of nuclear proteins, compared to that of integral membrane proteins.
Both integral membrane proteins and nuclear proteins were more closely examined by analyzing the compositions of BLPs, integral membrane proteins, and nuclear proteins, and their corresponding R values were analyzed, as shown in Table 5. The correlation coefficient between the compositions of BLPs and integral membrane proteins was R = 0.921, which is larger than R = 0.766 for nuclear proteins. The mean differences of compo-sition of every amino acid were 0.77% and 1.31% for integral membrane proteins and nuclear proteins, respectively. Membrane proteins are rich in hydrophobic residues, which correspond to proteins having several transmembrane stretches of a secondary structure and poor in charged residues [24]. As is well recognized, the function of a protein is correlated with its subcellular localization, because the environment of a protein provides a portion of the relevant circumstance necessary for function. From the above analysis of the four properties, we hypothesize that BLPs tend to locate themselves near the cell membrane in order to execute their functions such as the emission of visible light (i.e. appropriate for the photic environment to surface regions) and receipt of extracellular ions for luciferases and photoproteins to trigger the bioluminescence reaction. Residue Cys in BLPs has the largest hydrophobicity scale and highest transfer free energy, which tends to be a buried residue. Consequently, the Cys composition was small, i.e. 2.09%, which ranked 19 th (i.e. larger  than that of residue Trp). The above results support the analysis derived from the propensity scores of the SCMBLP method.

Propensity Scores of Using SCMLFP
The propensity score of a sequence obtained using the prediction method SCMLFP is useful for discrimination between luciferases and FPs. The SCMLFP method uses the whole dataset consisting of 269 luciferases and 216 FPs to estimate the propensity scores of 400 dipeptides to be a luciferase against FP, as shown in Fig. 6. The training accuracy is 98.56%, and the sensitivity and specificity are 98.51% and 98.61%, respectively. Table 6 lists the propensity scores of 20 amino acids to be a luciferase against FP. For closely examining the different properties between luciferases and FPs, this table also shows the compositions of luciferases and FPs, composition difference between luciferases and FPs, and composition of integral membrane proteins. The total number of amino acids in the used luciferases and FPs is 107,742 and 133,473, respectively.
The correlation coefficient between the propensity scores to be a luciferase and the compositions of luciferases and FPs are R = 0.503 and 20.237, respectively. A higher correlation of luciferases than that of FPs in terms of the absolute value of R reveals that the propensity scores characterize luciferases more appropriately. The two residues Leu and Ala have the highest scores, i.e. 729.900 and 726.425, which also have the largest percentages of composition in the luciferases, i.e. 10.41% and 8.25%, respectively. Notably, the top-two residues with the largest percentages (11.0% and 8.1%) of composition of residues in the integral membrane proteins are Leu and Ala, too. Leu and Ala are abundant in the transmembrane regions of the membrane proteins [21]. The hydrophobic residue Leu is frequently observed in transmembrane proteins [25].
The correlation coefficient between the propensity scores and the composition difference between luciferases and FPs is R = 0.9996. According to Eq. (1), the high correlation indicates that the estimated dipeptide scores can distinguish luciferases from FPs by simultaneously maximizing the prediction accuracy in terms of AUC and maximizing the correlation between the optimized and initial propensity scores of 20 amino acids (i.e. composition difference between luciferases and FPs).
The correlation coefficient between the propensity scores to be a luciferase and the composition of integral membrane proteins was R = 0.522. This study also examines the difference between luciferases and FPs from the aspect of composition. The correlation coefficients between the compositions of integral membrane proteins and those of luciferases and FPs were Clytin is a Ca 2+ -regulated photoprotein comprising coelenterazine (CZ) as a substrate for its bioluminescence reaction. When Ca 2+ binds to clytin, the bioluminescence reaction is triggered to produce the excited state product coelenteramide (CA) and CO 2 with emission of a broad blue bioluminescence (maximum 470 nm). Due to energy transfer, Clytia GFP fluorescence is in green (maximum 500 nm). (B) The structure of clytin (22.4 kDa, PDB code 3KPX) and its representation using hydrophilic (outer part in blue) and hydrophobic (inner part) residues. (C) The structure of Clytia GFP (PDB code 2HPW) and its representation using hydrophilic and hydrophobic residues using Discovery Studio 3.5. The visible fluorophore of Clytia GFP (CSY) is a sequence of three amino acids (68S, 69Y, and 70G). To protect the chromophore fluorescence from quenching by water, the tightly packed nature of the barrel excludes solvent molecules resulting in that the environment comprising this fluorophore is also hydrophobic. doi:10.1371/journal.pone.0097158.g007 R = 0.937 and 0.635, respectively. The amino acid composition is commonly used as an informative feature when predicting subcellular localization [26]. Despite the inability of the amino acid composition to define protein location, a correlation is made between composition and location. We hypothesize that luciferases prefer a location near the cell membrane location rather than FPs for convenient receipt of extracellular ions.

Prediction Accuracy of SCMLFP
The correlation coefficient between the compositions of luciferases and FPs was R = 0.72. The high value of R reveals that the prediction method using the feature of amino acid composition only cannot distinguish luciferases from FPs accurately. The SCMBLP method optimizing the propensity scores of 400 dipeptides by maximizing prediction accuracy is more effective in predicting BLPs than when using the SVM-based methods. Table 7 summarizes the classification accuracy of SCMLFP using 400 dipeptide scores from 10 independent runs on the training (LFP-TRN) and test (LFP-TEST) datasets. The training accuracy was 97.1060.38% using LFP-TRN. The independent test performance was 96.2863.35% with a sensitivity of 98.8761.81% and a specificity of 93.0167.37% on the LFP-TEST consisting of 26 luciferases and 22 FPs. This study also implemented the SVM-based methods using amino acid composition (SVM-AAC) and dipeptide composition (SVM-DPC) for comparisons; their test accuracies were 93.1864.80% and 96.0864.41%, respectively. Above results demonstrate that SCMLFP is better than SVM-AAC and comparable to SVM-DPC.
Based on the above analysis of propensity scores of amino acids for luciferases and FPs, as is expected, the propensity scores of dipeptides can distinguish luciferases from FPs satisfactorily from the following two aspects. From the aspect of the classification mechanism, the residues and dipeptides with high propensity scores make up luciferases with large percentages of composition. Therefore, as is widely recognized, the weighted-sum propensity score of a sequence based on the SCM method can effectively distinguish luciferases from FPs. From the aspect of subcellular localization, the correlation between the compositions of luciferases and membrane proteins is as high as R = 0.937. The high correlation between the propensity scores and the composition of luciferases can characterize luciferases located near the membrane position.

Illustrative Examples and Discussion
In this study, propensity scores are proposed to predict and characterize BLPs from primary sequences rather than informative crystal structures. Although the meaning of this apparently simple relationship (the R value) between the propensity scores and the identified physicochemical properties of amino acids is not very clear, additional comments are necessary to gain further insight into BLPs, including luciferases, photoproteins, and FPs. To characterize BLPs more accurately, this study discusses the propensity scores and the identified properties by using illustrative models with known structures of BLPs and illustrative applications of photoproteins.
Titushin et al. presented a review towards understanding the role of protein-protein interactions in the function of various bioluminescence systems [19]. That review provides valuable insight into the detailed mechanism of bioluminescence using the most reliable structural model which is available for the protein-protein complex of the Ca 2+ -regulated photoprotein clytin and GFP from the jellyfish Clytia gocking [19]. Figure 7 schematically depicts the bioluminescence system of the jellyfish Clytia with the GFP-clytin complex. Clytin shares a high structural and sequence similarity with the other Ca 2+ -regulated photoproteins such as aequorin Figure 8. Schematic presentation of the G protein-coupled receptor (GPCR) reporter cell line for drug discovery. The drug discovery of high-throughput screening is monitored by activation of the calcium-sensitive photoprotein light production. Ligand-mediated activation of Gscoupled receptors stimulates cAMP synthesis by adenylyl cyclase (AC) and opening of the cAMP-gated heteromultimeric cyclic nucleotide-gated (CNG) channel. Calcium ions from the extracellular solution enter the cell through the CNG channel and are detected by photoprotein luminescence measurements. Gq-coupled receptor activation stimulates the phospholipase C (PLC)/IP3 pathway detected via photoprotein luminescence stimulated by calcium released from the endoplasmic reticulum (ER). doi:10.1371/journal.pone.0097158.g008 Prediction of Bioluminescent Proteins from Aequorea. Clytin comprises the hydrophobic coelenterazine as a substrate for its bioluminescence reaction, as shown in Fig. 7(A). When Ca 2+ binds to the hydrophobic substrate-binding cavity of clytin, the bioluminescence reaction is triggered to produce the excited state product coelenteramide and CO 2 with emission of a broad blue bioluminescence (maximum 470 nm). Clytia GFP has a low sequence identity and a structure that is highly similar to those of GFPs from Aequorea [27]. Owing to the energy transfer, Clytia GFP extends the bioluminescence to slightly longer wavelengths; the resulting fluorescence is in green (maximum 500 nm). Closely examining the structure of the GFP-clytin complex reveals that the complexation is governed by several hydrophobic contacts and a hydrogen bond network [19]. The SCMBLP and SCMLFP methods predicted the photoprotein clytin and obtained the propensity scores of 463.63 and 572.16, respectively. The threshold values to classify BLPs and luciferases are 439.63 and 577.12, respectively. According to the prediction results, clytin is a BLP with high confidence and a FP with low confidence (the sequence score close to the threshold value). Additionally, the SCMBLP and SCMLFP methods predicted the Clytia GFP and obtained the propensity scores of 464.98 and 560.94, respectively. The prediction result was accurate that Clytia GFP was categorized into BLP and FP with high confidence. Figure 7(B) shows the structure of clytin (PDB code 3KPX) and its representation using hydrophilic and hydrophobic residues. This figure reveals the hydrophobic coelenterazine and the hydrophobic substrate-binding cavity of clytin. Similarly, Fig. 7(C) shows a visible fluorophore of Clytia GFP (CSY) from a sequence of three amino acids (68S, 69Y, and 70G). To protect the chromophore fluorescence from quenching by water, the tightly packed nature of the barrel excludes solvent molecules resulting in that the environment comprising this fluorophore is also hydrophobic.
Based on Fig. 7 and the prediction results, we can infer the following. First, the propensity scores of sequences can classify the clytin and Clytia GFP into BLPs accurately. The SCMLFP method was designed using the training datasets consisting of luciferases and FPs. Therefore, the propensity score of the photoprotein clytin was close to the decision threshold value. Luciferases (oxidative enzymes) have higher average propensity scores of using SCMLFP than photoproteins. Second, the hydrophobic coelenterazine and substrate-binding cavity of clytin, as well as the hydrophobic contacts in the protein-protein interaction in the GFP-clytin complex support the identified physicochemical properties. Notably, the residues of BLPs with high propensity scores tend to be buried with high transfer free energy and hydrophobic.
Owing to the easy detection of the emitted light with an illuminometer, photoproteins are widely used in molecular biology to measure intracellular Ca 2+ levels. The G protein-coupled receptor (GPCR) family is the largest known class of molecular targets with a proven therapeutic value. Cell surface GPCRs drive numerous signaling pathways involved in the regulation of a broad range of physiological processes [28]. GPCRs can couple with Ca 2+ signaling in the drug discovery of high-throughput screening. Photoproteins are also increasingly used to detect activation of other molecular target classes such as ligand-gated ion channels and transporters [29]. Figure 8 schematically depicts the GPCR reporter cell line for drug discovery monitored by activation of the calcium-sensitive photoprotein light production [30]. Ligandmediated activation of Gs-coupled receptors stimulates cAMP synthesis by adenylyl cyclase (AC) and opening of the cAMP-gated heteromultimeric cyclic nucleotide-gated (CNG) channel. While entering the cell through the CNG channel, Calcium ions from the extracellular solution are detected by photoprotein luminescence measurements. Gq-coupled receptor activation stimulates the phospholipase C (PLC)/IP3 pathway detected via photoprotein luminescence stimulated by calcium released from the endoplasmic reticulum (ER).
Conventionally, changes in intracellular Ca 2+ levels have been readily detected using fluorescent dyes that emit light in proportion to changes in intracellular Ca 2+ concentration. An alternative approach to measuring indirectly the changes in Ca 2+ concentrations involves using recombinantly expressed biosensor photoproteins, of which aequorin is a prototypic example [31]. These biosensors provide an intense luminescent signal in response to elevations in intracellular Ca 2+ [29]. Bovolenta et al. engineered a new photoprotein, Photina, to overcome some of the limitations of aequorin by exchanging a portion of the sequence of the photoprotein obelin [32] with a corresponding region in clytin to combine the best properties of each photoprotein [33]. The signal of Photina cells is approximately 3-folds higher than the signal of cells expressing aequorin in the mitochondria. Cainarca et al. stably expressed c-Photina, a Ca 2+ -sensitive photoprotein, driven by a ubiquitous promoter in a mouse embryonic stem cell line [34].
Photoproteins can detect both extracellular and intracellular Ca 2+ influxes. For the receipt of extracellular magnesium or calcium ions, luciferases and photoproteins have a similar composition of integral membrane proteins. The residues with high propensity scores to be a luciferase tend to have residues with a high frequency of occurrence in the transmembrane regions. Photoprotein mutants and modified versions of coelenterazine can increase the range of the recorded calcium concentrations. The presented propensity scores and characterization of BLPs greatly facilitates mutagenesis analysis to enhance the bioluminescence and fluorescence of BLPs.

Conclusions
Predicting and characterizing bioluminescent proteins (BLPs), including luciferases, photoproteins, and fluorescent proteins (FPs), are valuable for commercial and medical applications yet more challenging, owing to their high variety of protein sequences and small number of crystal structures. Given the increasing number of innovative tools in drug discovery based on engineered luciferases, photoproteins, and FPs, accurate and easily interpreted computational methods must be developed to predict and analyze BLPs. This study has proposed a scoring card method (SCM) based approach to estimate the propensity scores of 400 dipeptides and 20 amino acids from sequences in order to design two prediction methods (i.e. SCMBLP and SCMLFP) and characterize BLPs using some identified physicochemical properties. Analysis results indicate that the SCMBLP method performs better than the support vector machine (SVM) based prediction methods. By establishing the datasets of luciferases and FPs, to our knowledge, the proposed SCMLFP method is the first prediction method capable of distinguishing luciferases from FPs. The two sets of propensity scores to be BLPs and luciferases are highly promising for use in discovering new properties to more thoroughly elucidate BLPs. For example, the residues of BLPs with high propensity scores tend to be buried with a high transfer free energy and hydrophobic in nature. As for the high correlation between the compositions of luciferases and integral membrane proteins, we hypothesize that luciferases prefer a location near the cell membrane location for convenient receipt of extracellular ions. Furthermore, the propensity scores and identified properties for BLPs greatly facilitate mutagenesis analysis to enhance the bioluminescence and fluorescence of BLPs.