Scoring Protein Relationships in Functional Interaction Networks Predicted from Sequence Data

The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome-wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with high confidence and coverage. We use the network for predicting functions of uncharacterized proteins.
Availability: Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and Python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.


Introduction
In recent years we have experienced an exponential growth of biological data, including primary data such as genomic sequences resulting from worldwide DNA sequencing efforts, as well as functional data from high-throughput experiments. This abundance of primary sequence data and the large availability of public gene and protein sequence databases have the capability to provide many new insights into the biology of organisms. Several studies have shown that very often functional properties of a protein are not necessarily determined by the whole sequence but only by some of its sub-sequences [1]. Sequences sharing similar or conserved features are referred to as homologous sequences, and these features can be used for inferring and scoring protein pair-wise functional connections. One of these features is a protein domain, defined as a part of a protein sequence and structure that can evolve, function and exist independently of the rest of the protein chain [2].
Discovering sequence homology and modelling functional interactions between homologues from sequence and experimental data constitutes an important problem in molecular biology, as these can help to describe their behaviour in cellular processes and reveal the interplay between particular genes and proteins. In order to determine functional similarity between proteins, many approaches try to identify the sub-sequences of the proteins that may contribute to their function. Several bioinformatics tools have been designed for deriving and storing these functional features. These include standard sequence comparison tools such as BLAST [3,4], protein sequence databases such as UniProt [5], and protein signature databases such as InterPro [6], which integrates predictive models or protein signatures representing protein domains, families and functional sites from multiple source databases, namely PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER [7].
Using homologous datasets obtained from pair-wise sequence similarities, and protein domains and families in public databases, the inference of functional connections can be carried out based on the fact that two proteins sharing common domains or belonging to the same family are more likely to be functionally linked [8], i.e., have similar functions with respect to molecular function and biological process. Note that the interactions discussed here are potential functional interactions, not direct physical interactions. These functional associations may be set in Boolean or binary form, i.e., either two genes or proteins are functionally linked, in which case the score is 1, or they are not and the score is 0. Such a scoring scheme is not consistent since it does not take into account the nature of the parameters used to derive these functional associations. Understanding the properties of these functional relationships is key to successful mathematical modelling of such a system and to developing efficient scoring techniques.
There are several problems with generating functional interaction networks using diverse data types such as sequence and functional genomics data. Considering that we are dealing with inaccurate data obtained from different experiments [9,10], the uncertainty of data and noise inherent in each experiment must be efficiently managed by systematically weighing or scoring these functional associations [11]. This is referred to as a reliability or confidence score of functional associations for the particular computational approach used for prediction. This produces a graph with confidence-weighted relationships between each protein pair, which weighs each evidence type on the basis of its accuracy. Data-driven prediction methods should be able to extract essential features from particular datasets and to discount unwanted information. So, these scoring schemes must be data source and technology dependent, meaning that a given scoring scheme should normally vary according to the data sources and be designed on the basis of the technology used. Furthermore, the effectiveness of a scoring scheme for functional associations is critical for the quality of the analyses performed on the resulting network, including functional and structural analysis. An inability to accurately infer and score these protein pair functional associations leads to the propagation of annotation errors [12] and may negatively impact on the prediction analyses performed on the basis of these networks.
Several scoring schemes have been proposed for sequence data and are, so far, limited to only finding the similarity scores of proteins that are referred to as scoring functions. In the case of protein domain and family data, the scoring function is deduced from the number of common signatures shared by two proteins [10,13]. These schemes miss other features related to the data under consideration including their nature and sources. On the other hand, for sequence similarity data this scoring function is just the E-value obtained from sequence comparison tools, and pair-wise functional interactions between proteins are obtained by simply applying an E-value cut-off [10,14-17]. However, there is no single fixed E-value describing where homology ends and non-homology begins. This shows that these schemes are not equipped to meet the requirements for scoring functional relationships, i.e., they do not capture all information shared between sequences.
In order to overcome these shortcomings, we propose an information-theoretic based measure to score protein-protein relationships in functional interaction networks predicted from homology data. This approach is shown to be effective for scoring functional pair-wise relationships from homology data, and translating the amount of biological content shared between proteins into the score of their functional relationships. We apply our method to score functional relationships between proteins in Mycobacterium tuberculosis (MTB) strain CDC1551 to produce a functional network from sequence data for this organism. This approach is compared to the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) [11,18] homology scoring system for sequence similarity, and to existing scoring schemes for protein family and domain sharing [10,13] in terms of functional classification coherence. Results show that the new scoring approach is as effective as that of the STRING approach, but produces a reliable functional network with higher coverage. The MTB functional network produced is then used to predict the functional class of proteins of unknown function, evaluated using leave-one-out cross validation.

Materials and Methods
This section describes novel scoring schemes for protein family and domain data extracted from protein family databases, as well as for protein sequence similarity obtained by running sequence comparison tools such as the Basic Local Alignment Search Tool (BLAST). Sequences in FASTA format and InterPro data for the organism were downloaded from the Integr8 project of the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk/integr8. Scoring functional relationships for data from protein families and domains has been widely addressed by the bioinformatics community. However, the approaches described so far in the literature are limited to finding the similarity scores between proteins by the number of common signatures shared by proteins. Two examples of such a scheme are given below.
Scheme 1: Scoring Function of Pfam Domain Sharing [10]. The scoring function $S_{pfam}$ of Pfam domain sharing is simply the number of common domains of the two proteins, defined as follows:
$$S_{pfam}(p_i, p_j) = \left| D_{p_i} \cap D_{p_j} \right| \tag{1}$$
where $D_{p_k}$ is the set of Pfam domains found in protein $p_k$.
Scheme 2: Scoring Function based on Protein Signature Profiling [13]. The similarity score between a pair of proteins $(p_i, p_j)$ is computed using a binary similarity function between a pair of their signature profiles and is given by
$$m(p_i, p_j) = \frac{\sum_{k=1}^{n} \left( S_{ik} \wedge S_{jk} \right)}{\sum_{k=1}^{n} \left( S_{ik} \vee S_{jk} \right)} \tag{2}$$
where $n$ is the number of signatures contained in proteins of a genome of interest and $P_\ell = [S_{\ell 1}, S_{\ell 2}, \ldots, S_{\ell n}]$ is the signature profile of protein $p_\ell$, with $S_{\ell k} = 1$ if the signature $S_k$ exists in protein $p_\ell$ and $S_{\ell k} = 0$ otherwise.
Note that scheme 1, expressed by equation (1), can be rewritten using the Boolean operator 'and ($\wedge$)' as follows:
$$S_{pfam}(p_i, p_j) = \sum_{k=1}^{n} \left( S_{ik} \wedge S_{jk} \right)$$
and similarly, scheme 2 in equation (2) can also be written using the set operators 'intersection ($\cap$)' and 'union ($\cup$)' as
$$m(p_i, p_j) = \frac{\left| D_{p_i} \cap D_{p_j} \right|}{\left| D_{p_i} \cup D_{p_j} \right|}$$
with $P_k$ and $D_{p_k}$ as defined above. These two schemes just count the number of shared signatures without taking into account the nature of the data and experiments used to derive them. In addition, the limitation of the second scheme can be seen in this small illustration: consider three proteins $p_1$, $p_2$, and $p_3$, with 3, 4, and 9 detected signatures, respectively. If we assume that $p_1$ and $p_2$ share 2 signatures and 3 signatures are shared by $p_2$ and $p_3$, we have $m(p_1, p_2) = 0.400$ and $m(p_2, p_3) = 0.273$. So, $m(p_2, p_3) < m(p_1, p_2)$, whereas one should expect $m(p_1, p_2) < m(p_2, p_3)$ when looking at the number of common signatures shared by these proteins. In fact, the scoring function, as a function of the number of common signatures shared by a pair of proteins, is expected to be increasing. This property does not hold for scoring functions based on protein signature profiling, which makes them unattractive.
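The two existing schemes can be illustrated with a minimal Python sketch that works directly on sets of signature identifiers. The function names and the signature identifiers are hypothetical, and scheme 2 is written in the intersection-over-union form, which is an assumed reading of the set-operator formulation above. The example shows a pair sharing more signatures receiving a lower scheme 2 score.

```python
def domain_sharing_score(signatures_i, signatures_j):
    """Scheme 1: number of signatures common to both proteins."""
    return len(set(signatures_i) & set(signatures_j))


def profile_similarity(signatures_i, signatures_j):
    """Scheme 2 in set form: shared signatures over signatures found in either protein."""
    a, b = set(signatures_i), set(signatures_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical signature sets; the identifiers are placeholders, not real entries.
p1 = {"sig01", "sig02"}
p2 = {"sig01", "sig02", "sig03"}
p3 = p2 | {f"sig{n:02d}" for n in range(4, 13)}  # 12 signatures, 3 shared with p2

for a, b, name in [(p1, p2, "p1-p2"), (p2, p3, "p2-p3")]:
    print(name, domain_sharing_score(a, b), round(profile_similarity(a, b), 3))
```

In this toy case the number of shared signatures rises from 2 to 3 while the profile-based score falls, which is the non-monotonic behaviour discussed above.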
In the case of sequence similarity, the existing scoring schemes rely on the use of the negative logarithm of E-values obtained from a sequence similarity tool. As pointed out previously, the problem with these scoring schemes is that there is no single fixed E-value describing where homology ends and non-homology begins. This constitutes an impediment to these scoring schemes, beyond the fact that they lead to singularities caused by taking the logarithm of zero.
Thus, these schemes are not equipped to capture all the parameters related to the data under consideration and technology used to derive them. In order to overcome these shortcomings, we introduce novel scoring schemes based on the information-theoretic approach, taking into account the nature of the data and technology used and where the user can tune parameters based on their confidence in the data source.

Scoring Scheme For Protein Family and Domain
Consider two proteins denoted $p_i$ and $p_j$, sharing signatures or entries $S_1, \ldots, S_M$. We define the similarity score $\eta_{ij}$ of proteins $p_i$ and $p_j$ as the minimum number of occurrences of these signatures in proteins $p_i$ and $p_j$, i.e.,
$$\eta_{ij} = \sum_{k=1}^{M} \min\left( n_{ki}, n_{kj} \right) \tag{3}$$
where $n_{k\ell}$ is the number of occurrences of signature $S_k$ in the protein $p_\ell$. Broadly speaking, the reliability or confidence score increases with the confidence-level of data, which depends on the data source, and is reduced by the uncertainty-level of data linked to the dispersion measure $\sigma$. As we are dealing with data from experiments containing a certain level of uncertainty, which propagates into the data, it is natural to use the normal distribution, as these data can be summarized in terms of mean and standard deviation. In fact, in this case this distribution constitutes an attractive approximation as it maximizes information entropy in the data. Thus, we set the confidence-level $\delta$ of the similarity score $\eta$ as in equation (4) [equation (4): not recoverable from the source text], with the function $\phi$ the cumulative probability of the standard Gaussian distribution defined by
$$\phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt$$
and $\alpha$ the calibration control parameter, with $\alpha \geq 0.5$, strengthening the impact of the confidence-level for the data under consideration, in which case $\alpha = 0.5$ is associated with low confidence data. The training dataset $D$ consists of all pairs $(S_k, x_k)$, where $x_k$ is the number of times the signature $S_k$ was observed. In order to get rid of observations that lie at abnormal distances from the data, referred to as outliers, it is recommended to use the rectified dataset $D_S$, the subset of the training dataset $D$ consisting of the data points which fall inside $1.5\,(IQR)$, i.e.,
$$D_S = \left\{ (S_k, x_k) \in D : Q_1 - 1.5\,IQR \leq x_k \leq Q_3 + 1.5\,IQR \right\}$$
with $Q_1$ and $Q_3$, respectively, the 1st (lower) and 3rd (upper) quartile, and $IQR = Q_3 - Q_1$ the interquartile range. $\sigma$ is thus the standard deviation of the rectified dataset, estimated from maximum likelihood and given by
$$\sigma = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( x_k - \bar{x} \right)^2}$$
where $N$ is the number of signatures found in the rectified dataset, and $\bar{x} = \sum_{k=1}^{N} x_k / N$ is the mean or average of the set.
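The similarity score and the dispersion estimate described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the function names are hypothetical, the similarity score is read as the sum over shared signatures of the smaller occurrence count, the outlier fences are taken as the usual 1.5 IQR fences around the quartiles, and the quartile convention (`method="inclusive"`) is one of several reasonable choices.

```python
import statistics


def similarity_score(counts_i, counts_j):
    """Sum, over shared signatures, of the smaller occurrence count (assumed reading)."""
    shared = counts_i.keys() & counts_j.keys()
    return sum(min(counts_i[s], counts_j[s]) for s in shared)


def rectified_sigma(observations):
    """Drop points outside the 1.5*IQR fences, then return (ML std dev, kept points)."""
    q1, _, q3 = statistics.quantiles(observations, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [x for x in observations if low <= x <= high]
    # pstdev divides by N, which matches the maximum-likelihood estimate.
    return statistics.pstdev(kept), kept


# Hypothetical occurrence counts per signature for two proteins.
protein_a = {"sigA": 2, "sigB": 1}
protein_b = {"sigA": 1, "sigB": 3, "sigC": 1}
print(similarity_score(protein_a, protein_b))

# Hypothetical observation counts x_k, with one obvious outlier.
sigma, kept = rectified_sigma([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
print(round(sigma, 4), kept)
```

Using the population standard deviation rather than the sample version follows the maximum-likelihood wording of the text; with large training sets the two differ only marginally.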
Given the confidence-level $\delta$ of the similarity score $\eta$ defined in equation (4), the uncertainty measure related to the outcome $\eta$ resulting from the data is obtained from the binary entropy function, given by
$$H_2(\delta) = -\delta \log_2 \delta - (1 - \delta) \log_2 (1 - \delta)$$
In fact, the uncertainty measure function $H_2$ maps $[0, 1]$ onto $[0, 1]$. Finally, we set up the capacity of inferring the functional relationship score between two proteins belonging to the same family or sharing common signatures as
$$C = 1 - H_2(\delta)$$
and the reliability or confidence score of the functional relationship between two proteins by [equation not recoverable from the source text]. Note that for $\eta$ significantly large, $\delta$ converges to 1. Therefore, the uncertainty measure $H_2(\delta)$ converges to 0, leading to the maximum capacity of inferring the functional relationship of 1. This means that the reliability of a functional relationship between two proteins is given by [equation not recoverable from the source text]. To illustrate the dependency of this new measure on the data under consideration and the technology used to produce them, we plot the variation of confidence level $\delta$, uncertainty $H_2$ and capacity $C$ in terms of common domains $\eta$ between proteins, for different values of $\alpha$, which keeps track of the technology used to produce data, and $\sigma$, which controls the impact of the data under consideration. These are user-tunable parameters and results are shown in figures 1-4.
These results show that the confidence level $\delta$ increases as the number of common signatures between the two proteins increases, and that for a higher value of $\alpha$, indicating the efficiency level of the technology used to derive data, the confidence level $\delta$ is higher, and so is the reliability or confidence score, due to the fact that in this case the uncertainty component is smaller. Similarly, the impact of data obtained from each technology is taken into account through $\sigma$. Interestingly, this confidence score formula accommodates the case where no common pattern is found between two proteins in the training dataset, in which case the confidence score or reliability of a functional relationship is 0. In addition, this scoring scheme takes into account a false positive assignment of any of the common patterns by narrowing down the confidence score of proteins containing only one common signature, depending on the measure of dispersion $\sigma$, which can provide a hint on the nature of the data under consideration. Indeed, the measure of dispersion $\sigma$ impacts on the confidence score in the sense that if data is far away from the average, in which case $\sigma$ is high, the uncertainty component might be large and significant while calculating the confidence score, thus yielding a lower confidence score. Thus, with knowledge of the data source, the measure of dispersion $\sigma$ can be penalized by a factor $\varepsilon$ between 0 and 1, in order to reduce the impact of the uncertainty component.
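The limiting behaviour described above can be checked numerically with a short sketch. It takes the confidence-level as an input rather than computing it, and it assumes that the capacity is one minus the binary entropy, which is the form suggested by the statement that zero uncertainty gives a maximum capacity of 1. The function names are hypothetical.

```python
import math


def std_normal_cdf(x):
    """Cumulative probability of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binary_entropy(delta):
    """Binary entropy H2 in bits; defined as 0 at the end points 0 and 1."""
    if delta <= 0.0 or delta >= 1.0:
        return 0.0
    return -delta * math.log2(delta) - (1.0 - delta) * math.log2(1.0 - delta)


def capacity(delta):
    """Assumed form of the capacity: one minus the binary entropy."""
    return 1.0 - binary_entropy(delta)


# A confidence-level of 0.5 (the Gaussian CDF at 0) carries maximal uncertainty.
for delta in (std_normal_cdf(0.0), 0.6, 0.9, 0.99, 1.0):
    print(round(delta, 2), round(binary_entropy(delta), 4), round(capacity(delta), 4))
```

Under this reading, a pair with no shared signal sits at a confidence-level of 0.5 and therefore has zero capacity, while the capacity grows towards 1 as the confidence-level approaches 1.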

Scoring Scheme For Protein Sequence Similarity
For a given set of pair-wise homologous sequences, Bastian [19,20] showed that their biological evolution can be formalized by the evolution of their shared amount of information. This is measured by the mutual information in the sense of Hartley [21,22], estimating the information they share due to their common origin and parallel evolution under similar selective pressure. Moreover, this mutual information is proportional to the bit score computed with standard methods in sequence comparisons.
Let $S(s_1, s_2)$ be the bit score of the alignment of homologous sequences $s_1$ and $s_2$, set with its standard units, and $I(s_1, s_2)$ the mutual information between these two sequences. We have
$$I(s_1, s_2) = \lambda \, S(s_1, s_2) \tag{11}$$
where $\lambda$ is a constant defining the unity, which depends on the statistical parameter scale $K$ for the search size (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html) derived from the scoring matrix and amino acid composition of the sequence [23]. Therefore, generally $S(s_1, s_2) \neq S(s_2, s_1)$, and they are equal only if they have the same scale for the search size. However, the mutual information $I(s_1, s_2)$ between two sequences $s_1$ and $s_2$ satisfies $I(s_1, s_2) = I(s_2, s_1)$ and $I(s_1, s_2) \geq 0$ [24]. Equation (11) shows that the mutual information $I(s_1, s_2)$ increases with the bit score $S(s_1, s_2)$, which measures the average information available per position to distinguish an alignment from chance, calculated using the relative entropy of target and background distributions [25] as
$$H = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{q_i \, q_j} \tag{12}$$
where $q_{ij}$ is the "target" residue substitution frequency, the probability of finding a residue $i$ aligned with a residue $j$ after a certain amount of evolution, given that they have both evolved from a common ancestor that had a residue $k$ at that position. $q_i$ is the probability of occurrence of a residue $i$ in a collection of sequences, i.e., the probability that a residue $i$ would align by chance based solely on its frequency in a sequence. Thus, we define the reliability or confidence score $R(s_1, s_2)$ of a functional relationship between two protein sequences $s_1$ and $s_2$ as the normalized mutual information, calculated [26] as
$$R(s_1, s_2) = \frac{I(s_1, s_2)}{\max\left\{ H(s_1), H(s_2) \right\}} \tag{13}$$
measuring how well the protein sequence $s_1$ is able to predict the protein sequence $s_2$, and where $H(s)$ is the relative entropy obtained by aligning a protein sequence $s$ with itself.
Indeed, the increase of mutual information with relative entropy yields bias, and this bias is corrected by dividing the mutual information by the maximum entropy of the sequence pair.
Using equation (11), the mutual information $I(s_1, s_2)$ can be computed from equation (14) [equation (14): not recoverable from the source text], where $\lambda$ and $\lambda'$ are constants defining unity for $S(s_1, s_2)$ and $S(s_2, s_1)$, respectively. For a protein sequence $s$, $H(s) = I(s, s)$, obtained using equation (14) and given by
$$H(s) = \lambda \, S(s, s)$$
It is obvious that this scoring scheme relies only on the two protein sequences for which the confidence score is being computed. When the mutual information of two protein sequences, which captures the evolutionary history embedded in their similarity score, is 0, the two sequences are not similar and their confidence score is also 0. Thus, this scoring scheme accommodates the case where no similarity is found between two protein sequences, and the error due to the arbitrary growth of the mutual information between two proteins is corrected by the maximum entropy induced.
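A rough sketch of the normalization step, working directly on bit scores, is given below. It rests on assumptions that are not confirmed by the text: the symmetric mutual information is estimated by the mean of the two directional bit scores, both unit constants are set to 1, and each self-entropy is taken as the self-alignment bit score. The function name, the clamp to 1 and the example numbers are likewise hypothetical.

```python
def sequence_confidence(score_12, score_21, self_score_1, self_score_2):
    """Normalised mutual information from bit scores (illustrative assumptions).

    Assumed, not taken from the source: the symmetric mutual information is the
    mean of the two directional bit scores with unit constants equal to 1, and
    the self-entropies are the self-alignment bit scores.
    """
    mutual = 0.5 * (score_12 + score_21)
    max_entropy = max(self_score_1, self_score_2)
    if mutual <= 0.0 or max_entropy <= 0.0:
        return 0.0  # no detectable similarity gives a confidence score of 0
    # Guard added for this sketch: keep the score inside [0, 1].
    return min(1.0, mutual / max_entropy)


# Hypothetical bit scores for one protein pair and the two self-alignments.
print(sequence_confidence(200.0, 180.0, 500.0, 400.0))
```

Dividing by the larger self-score keeps a short protein that is fully contained in a much longer one from receiving a score of 1, which is the bias correction the normalization is meant to provide.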

MTB Functional Network Derived from Sequence Data
The computation of relationship scores (as described in the methods section) was performed on the whole Mycobacterium tuberculosis strain CDC1551 proteome to produce functional links between proteins from homology data, including pair-wise links from sequence similarity and protein family data derived from the InterPro database. Sequence similarity searches were carried out using BLASTP with the BLOSUM62 matrix, based on the premise that if the E-value is less than 0.01, the hit is similar to the query sequence and is likely to be evolutionarily related [27]. Resulting functional link scores are provided in Table S1.
We investigated the general behaviour of the link confidence scores induced from homology datasets. Results are depicted in Table 1 in terms of number and frequency of functional links in a given bin $S{:}x$, where $S{:}x$ corresponds to link score values ranging between $(x-1)/10$ and $x/10$, i.e., $(x-1)/10 < \text{score} \leq x/10$. These results indicate that the link confidence scores from protein family data are either low ($\leq 0.4$) or high ($> 0.7$). This is due to the calibration control parameter applied to data from the InterPro database, which is $\alpha = 1$ with penalty parameter $\varepsilon = 0.45$, producing either low or high confidence according to whether two proteins share only one domain or more than one domain, respectively. Moreover, in most cases, prediction of functional links from sequence similarity matches that of protein family data but at different confidence levels. The link score $s_{ij}$ between proteins $p_i$ and $p_j$ obtained for the combined data is given by
$$s_{ij} = 1 - \left( 1 - r^{S}_{ij} \right) \left( 1 - r^{F}_{ij} \right)$$
under the assumption of independence, where $r^{S}_{ij}$ and $r^{F}_{ij}$ are link confidence scores obtained from sequence similarity and protein family datasets, respectively.
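As a small illustration of the binning and of combining two evidence types, the sketch below assumes the usual independence rule, in which the combined score is one minus the product of the complements of the individual scores. The function names and the tolerance used to absorb floating-point noise at bin edges are choices made for this example.

```python
import math


def combined_score(r_seq, r_fam):
    """Combine two independent link scores (assumed form: 1 - (1 - r_seq)(1 - r_fam))."""
    return 1.0 - (1.0 - r_seq) * (1.0 - r_fam)


def score_bin(score):
    """Return x such that (x - 1)/10 < score <= x/10, for scores in (0, 1]."""
    # The small tolerance keeps values such as 0.4 in bin 4 despite rounding error.
    return max(1, math.ceil(score * 10.0 - 1e-12))


# Hypothetical link scores from sequence similarity and protein family data.
for r_seq, r_fam in [(0.38, 0.0), (0.38, 0.35), (0.5, 0.5), (0.2, 0.9)]:
    s = combined_score(r_seq, r_fam)
    print(r_seq, r_fam, round(s, 3), f"S:{score_bin(s)}")
```

With this rule a link supported by both evidence types never scores lower than its stronger single source, and a link supported by only one source keeps that source's score.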

Evaluating the Scoring Scheme
We compared our approach for scoring functional interactions inferred from sequence similarity to the STRING homology scoring scheme. STRING is a database of known and predicted protein-protein associations for a large number of organisms derived from high-throughput experimental data, the mining of databases and literature, and from predictions based on genomic analysis. For this assessment we used only their links derived from homology data, which uses a scoring scheme based on E-values obtained from the Smith-Waterman algorithm with a reasonably strict cut-off score to ensure high quality matches [28]. We also compared our approach for scoring functional interactions from protein family and domain to the scoring scheme for protein signature profiling (SFSP).
The STRING scheme classifies its functional link confidence scores into three different categories, low, medium and high confidence, with corresponding scores less than 0.4, between 0.4 and 0.7, and greater than 0.7, respectively [11]. These scores measure our confidence in pair-wise functional interactions in the networks produced. Even though sequence data are initially accurate, computational tools used to produce sequence similarity data may introduce noise due to certain unpredictable factors, such as arbitrary increases of bit score or over-estimation of similarity patterns between sequences. In order to take into account these uncertainties in sequence similarity data while ensuring the accuracy of functional interactions produced, one can set a cut-off score above which a given interaction is more likely to occur. Therefore, the comparison was performed in terms of functional classification accuracy for links with a medium confidence level and upwards (link score greater than 0.4). The number of associations predicted in different MTB functional networks produced using different approaches are shown separately in Table 1 for each approach and confidence ranging from low to high.
Since the SFSP as defined by equation (2) may produce several link scores for the same number of shared domains, we considered the maximum score when over-estimating, the minimum score when under-estimating, and the average score, referred to as SFSP-Max, SFSP-Under and SFSP-Mean, respectively. We plot the scores obtained using our approach and those from SFSP, and results are shown in figure 5. As pointed out previously, the scoring function should be increasing, since our confidence level increases with the number of common signatures shared between a pair of proteins. These results show that only SFSP-Under estimation provides an increasing scoring function, but unfortunately it yields poor coverage and for this reason it is not considered for further performance evaluation. The scoring scheme developed here produces an increasing scoring function and provides a better trade-off between SFSP-Max and SFSP-Mean. Considering the confidence score cut-off applied, the configuration of the network produced from SFSP-Max estimation is the same as that derived using the scheme based on the scoring function of domain sharing described by equation (1).

Statistical significance of Functional Interactions Derived
We evaluated the statistical significance and biological relevance of the functional interactions inferred using our scoring approach in terms of functional classification coherence. To measure this, an interaction between two proteins is said to be significant or correct if these proteins belong to the same functional class.
The functional classes were extracted from Tuberculist (http://genolist.pasteur.fr/Tuberculist), and the distribution of interacting proteins in the functional network per functional class or category for different configurations is shown in Table 2. The evaluation was done using a sub-network generated by each protein in the functional network, consisting of functional interactions between a protein under consideration and its direct neighbours, referred to as a P-subgraph. The proteins in the unknown functional class were excluded from the evaluation.
To assess functional category coherence of functional interactions against a random model, we compute the P-value for each P-subgraph, defined as the probability that the P-subgraph under consideration occurs by chance or is comprised of randomly drawn interactions. The hypergeometric distribution, which yields the probability of observing at least $\ell$ interactions between proteins from a given P-subgraph of size $S$ by chance among $I$ interactions of the same type in the entire functional network considered to be a background distribution, is used to model the P-value [14], given by
$$P = \sum_{k=\ell}^{\min(S, I)} \frac{\binom{I}{k} \binom{L - I}{S - k}}{\binom{L}{S}}$$
where $L$ is the size of the functional network, i.e., the number of functional links in the network, with all the proteins in the unknown class removed.
We assessed functional category coherence of functional interactions derived using our approach and STRING homology data for sequence similarity, as well as those inferred using our scheme for protein family and domain, and those obtained using SFSP-Mean and SFSP-Max estimation. Results displayed in figures 6 and 7 show that the functional interactions induced have a very low probability of occurring by chance. Note that this statistical test against a random distribution aims at checking if a given P-subgraph in the functional network consists of randomly grouped proteins. These figures show that, using a significance level of 0.05 as the threshold, more P-subgraphs derived using our approach are statistically significant than those obtained from the STRING homology scoring, and that our approach provides a roughly equal percentage of statistically significant P-subgraphs to the SFSP-Mean and SFSP-Max schemes. A total of 205 out of 378, representing 54.2% of P-subgraphs in our network, are significant, compared to 213 out of 485, representing 43.9% of P-subgraphs, for the STRING scoring system for sequence similarity. For protein family and domain, a total of 1078 out of 1515, representing 71.2% of P-subgraphs in our network, are significant, compared to 901 out of 1261, representing 71.5% of P-subgraphs, for SFSP-Mean and to 1517 out of 2024, representing 75%, for SFSP-Max.
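The tail probability described above can be computed exactly with integer binomial coefficients. The sketch below assumes the standard upper-tail hypergeometric form (population of L links, I links of the given type, S links drawn, at least a given number observed); the function name and the toy numbers are illustrative and are not taken from the MTB network.

```python
from math import comb


def p_subgraph_pvalue(observed, subgraph_size, same_type_links, network_size):
    """Upper-tail hypergeometric probability of at least `observed` matching links."""
    upper = min(subgraph_size, same_type_links)
    total = comb(network_size, subgraph_size)
    tail = sum(
        comb(same_type_links, k) * comb(network_size - same_type_links, subgraph_size - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Toy example: 10 links in the network, 4 of one functional class,
# a P-subgraph with 3 links, all 3 of that class.
p = p_subgraph_pvalue(3, 3, 4, 10)
print(round(p, 4), "significant at 0.05" if p < 0.05 else "not significant")
```

Summing exact integers before a single division avoids the rounding problems that arise when many small floating-point terms are added for large networks.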

Effectiveness of The Novel Scoring Scheme
To evaluate the classification power of the new scoring scheme, we used the modified Receiver Operator Characteristic (ROC) curve analysis that measures the number of true positive (TP) predictions (number of functional interactions correctly identified) against the number of false positive (FP) predictions (number of functional interactions incorrectly identified) [29], in which case the area under the ROC curve (AUC) is used as a measure of discriminative power. The larger the upper AUC value (the portion between the curve and the line TP = FP), the more powerful the scheme is. For a given number of P-subgraphs ranging from 5 to 485, we randomly generated 1000 independent samples and computed the average number of correctly and incorrectly predicted interactions, which is expected to be normally distributed by the central limit theorem. Thus, we performed modified ROC analyses for the two scoring approaches, and results are shown in figure 8 for sequence similarity. These results indicate that our approach outperforms the STRING scheme, with an average of 95.9% and 4.1% of functional interactions correctly and incorrectly identified, respectively, out of 378 P-subgraphs, compared to the STRING scheme, which provides an average of 89.3% and 10.7% of functional interactions correctly and incorrectly identified, respectively, out of 485 P-subgraphs. This shows not only that it is not sufficient to ensure high quality matches [28] by just applying a reasonably strict cut-off score when using the Smith-Waterman algorithm, but also that this practice may lead to poor coverage. Results in figure 9 indicate that our method performs comparably to the SFSP-Max and SFSP-Mean schemes, and provides a better trade-off between over-estimating and averaging scores for SFSP schemes in terms of precision and coverage. Our approach provides an average of 79% and 21% of functional interactions correctly and incorrectly identified, respectively, out of 1515 P-subgraphs. SFSP-Mean yields an average of 80.5% and 19.5% of functional interactions correctly and incorrectly identified, respectively, out of 1261 P-subgraphs, while SFSP-Max produces an average of 73.3% and 26.7% of functional interactions correctly and incorrectly identified, respectively, out of 2024 P-subgraphs. Apart from the general limitation common to scoring schemes inferred from signature profiling based approaches, SFSP-Max produces poor precision. This poor performance is due to the fact that, when over-estimating, it includes all false positives; our approach corrects this, providing improved precision and coverage.
Figure 6. Significance of functional interactions derived using our approach and the STRING scheme. At each significance level $\alpha$ in these graphs, we counted all relevant predicted associations for the two approaches and computed the percentage. Each $\alpha$ corresponds to the number of associations with p-value $\beta$ and $\alpha' \leq \beta < \alpha$, where $\alpha'$ is the significance level just before $\alpha$ in the plot. doi:10.1371/journal.pone.0018607.g006
Figure 7. Significance of functional interactions derived using our approach and SFSP approach. At each significance level $\alpha$ in these graphs, we counted all relevant predicted associations for the two approaches and computed the percentage. Each $\alpha$ corresponds to the number of associations with p-value $\beta$ and $\alpha' \leq \beta < \alpha$, where $\alpha'$ is the significance level just before $\alpha$ in the plot. doi:10.1371/journal.pone.0018607.g007

General Analysis of the Structure of the Functional Network Produced
We performed a general analysis of the homology-based functional network produced by integrating into a single network all functional interactions inferred from sequence similarity and protein family and domain data using our scheme. The number of functional links in the combined network, which contains a total of 2206 proteins (nodes), is given in Table 3. The results in figure 10 show that this network exhibits scale-free topology, i.e., the degree distribution of proteins approximates a power law $P(k) \sim k^{-\gamma}$, with the degree exponent $\gamma \approx 1.55$. We analyzed the general behavior of this network by finding the number of cliques and the distribution of hubs. Here protein hubs are described as "single points of failure" able to disconnect the network. This functional network contains 262 clusters, or cliques, with 174 hubs and with the biggest cluster containing 1957 gene products.

Predicting Protein Functional Class
Several approaches have been proposed for predicting protein functions from functional networks. They fall mainly into two categories, namely global network topology based and local neighborhood based approaches. Global network topology based approaches use global optimization [30][31][32], probabilistic methods [33][34][35][36] or machine learning [37][38][39] to improve the prediction accuracy using the global structure of the network under consideration. Unfortunately, these approaches raise a scalability issue, and their cost might not be proportional to the improvement in predictions compared to more straightforward approaches, which rely only on the local neighborhood [40] of uncharacterized proteins.
In the case of local neighborhood based approaches, known as 'Guilt-by-Association', 'Majority Voting' or 'Neighbor Counting' [41], direct interacting neighbors of proteins are used to predict protein functions. However, the biggest limitation of approaches relying on the direct neighbors of the protein under consideration is that they are unable to characterize proteins whose direct interacting neighbors are all uncharacterized, which impacts negatively on annotation coverage. Investigating the relation between interacting neighbors of a given protein using network topology, Chua et al. [8,42] showed that in many cases a protein shares functional similarity with level-2 neighbors (2 branch-lengths away) and proposed a functional similarity weight (FS-Weight) method for predicting protein functions from protein interaction data. Here, we analyze the performance of using direct interacting neighbors and second level interacting neighbors. The second level interacting neighbors were used when we were unable to use direct interacting neighbors, in order to improve coverage.
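The fallback from direct to second level neighbors can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the network is an adjacency dictionary, unknown-class proteins are marked with `None`, and level-2 neighbors exclude the protein itself and its direct neighbors. Each level-2 neighbor is listed once per connecting direct neighbor, matching the instance counting used in the evaluation later in this section. All protein names are hypothetical.

```python
def level2_neighbors(adj, p):
    """Level-2 neighbors of p, one entry per connecting direct neighbor."""
    direct = adj[p]
    result = []
    for q in sorted(direct):
        for r in sorted(adj[q]):
            # Assumption: skip p itself and proteins that are already level-1.
            if r != p and r not in direct:
                result.append(r)
    return result

def neighbors_for_prediction(adj, classes, p):
    """Use level-1 neighbors if any is characterized, else fall back to level 2."""
    direct = sorted(adj[p])
    if any(classes.get(q) is not None for q in direct):
        return 1, direct
    return 2, level2_neighbors(adj, p)

# Hypothetical toy network: both direct neighbors of "p" are uncharacterized.
adj = {
    "p": {"u1", "u2"},
    "u1": {"p", "x", "y"},
    "u2": {"p", "x"},
    "x": {"u1", "u2"},
    "y": {"u1"},
}
classes = {"p": None, "u1": None, "u2": None,
           "x": "lipid metabolism", "y": "lipid metabolism"}

level, nbrs = neighbors_for_prediction(adj, classes, "p")
# level == 2 and nbrs == ['x', 'y', 'x']: "x" appears twice because it is
# reached through both "u1" and "u2".
```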
The functional network produced from sequence data was used to predict, where possible, the functional class of proteins in the Tuberculist unknown functional class using a local neighborhood based approach. Through this, a new functional class is assigned to an unknown protein based on the functional class most frequently occurring among its direct interacting neighbors. In this case, the score of a given functional class $c$ for a protein $p$ is given by the frequency $f_c(p)$ of occurrence of functional class $c$ among direct neighbors of $p$, and calculated as follows:

$$f_c(p) = \sum_{q \in N_p} \delta_q(c) \qquad (19)$$

where $N_p$ refers to the set of direct interacting partners of protein $p$, and $\delta_q$ is the $q$-function indicator given by

$$\delta_q(c) = \begin{cases} 1 & \text{if the protein } q \text{ performs the function } c \\ 0 & \text{otherwise.} \end{cases}$$
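The neighbor-counting score described above can be sketched in a few lines of Python. The sketch assumes an adjacency dictionary and a class dictionary in which `None` marks the unknown class, so unknown neighbors contribute nothing to any class count. The protein and class assignments are hypothetical.

```python
from collections import Counter

def class_frequencies(adj, classes, p):
    """Count, for each class c, the direct neighbors of p annotated with c."""
    return Counter(
        classes[q] for q in adj[p] if classes.get(q) is not None
    )

# Hypothetical toy data, not taken from the paper.
adj = {
    "p": {"q1", "q2", "q3", "q4"},
    "q1": {"p"}, "q2": {"p"}, "q3": {"p"}, "q4": {"p"},
}
classes = {
    "p": None,
    "q1": "lipid metabolism",
    "q2": "lipid metabolism",
    "q3": "regulatory proteins",
    "q4": None,  # unknown class, ignored when counting
}

freqs = class_frequencies(adj, classes, "p")
best_class, best_count = freqs.most_common(1)[0]
# best_class == "lipid metabolism", best_count == 2
```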
Since the objective is to assign to an unknown protein only one functional class, we make use of global network information, and the prediction of a given protein functional class is based on an over-represented functional class found amongst its direct neighbors. The functional class with the largest chi-square score is assigned to the protein. The chi-square score of functional class $c$ for protein $p$ [43] is given by

$$\chi^2_c(p) = \frac{\left(f_c(p) - \pi(p)\right)^2}{\pi(p)}$$

where $f_c(p)$ is defined in equation (19) and $\pi(p)$ is the global expected number of proteins belonging to the functional class $c$, given by $\pi(p) = n \times \pi_c$, with $\pi_c$ the fraction of proteins belonging to the class $c$ among all the proteins in the functional network under consideration and $n$ the order of the functional network, i.e., the number of proteins in the network.
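A minimal sketch of the chi-square based assignment is given below. To avoid assuming how the expected counts are obtained, they are passed in directly, and the values shown are hypothetical. Restricting the candidates to classes whose observed frequency exceeds the expected count is this sketch's reading of "over-represented", not a detail confirmed by the text.

```python
def chi_square_score(observed, expected):
    """(observed - expected)^2 / expected for one functional class."""
    return (observed - expected) ** 2 / expected

def assign_class(freqs, expected):
    """Return (class, score) with the largest chi-square score among
    over-represented classes, or None if no class is over-represented."""
    candidates = {
        c: chi_square_score(f, expected[c])
        for c, f in freqs.items()
        if f > expected[c]  # assumption: keep over-represented classes only
    }
    if not candidates:
        return None
    # Iterate in sorted order so that ties resolve deterministically.
    best = max(sorted(candidates), key=candidates.get)
    return best, candidates[best]

# Hypothetical observed frequencies and expected counts (must be positive).
freqs = {"lipid metabolism": 4, "regulatory proteins": 1}
expected = {"lipid metabolism": 1.5, "regulatory proteins": 0.8}

result = assign_class(freqs, expected)
# result[0] == "lipid metabolism"; its score is (4 - 1.5)**2 / 1.5
```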
As an illustration, protein 'fadA6' (MT3660 or Rv3557c), named Acetyltransferase FADA6 (UniProt accession P96834), which is involved in lipid metabolism (figure 11), is functionally linked to proteins annotated to the lipid metabolism class. This means that if 'fadA6' had not been classified, it would likely have been annotated to the lipid metabolism class. Similarly, protein 'lprJ' (MT1729 or Rv1690), named lipoprotein LPRJ (O33192), is also known to be involved in lipid metabolism (figure 12). All its direct interacting partners are of the unknown class, so if the class of 'lprJ' were not known, the use of level-1 neighbors would fail to classify this protein. However, using the level-2 neighbors would successfully classify it. Finally, figure 13 shows protein MT1417 (Rv1372, Q7D8I1), which is of unknown class in Tuberculist, but suggested by UniProt to belong to the chalcone/stilbene synthase family known to be involved in lipid metabolism. The prediction method annotates this protein to lipid metabolism, thus confirming the suspicion.
Once again, the classification performance of these approaches can be evaluated with modified ROC curve analyses. We used leave-one-out cross-validation to evaluate these prediction approaches by counting the proteins correctly classified and those incorrectly classified. Note that when using the level-2 interacting neighbors to classify a protein, each instance of a protein is counted, i.e., if a given level-2 neighbor interacts with two different direct interacting neighbors, it is counted twice. In order to compare the effectiveness of these approaches, we combined their modified ROC curves; the results are shown in figure 14. These results indicate that while the level-2 interacting partners may be used to improve the coverage, they contain many false positives, which impacts negatively on the precision. Combining level-1 and level-2 interacting partners slightly improves precision and coverage. These two measures of protein classification quality are computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \text{Coverage} = \frac{TP}{N}$$

where $TP$ (true positive) is the number of proteins correctly classified, i.e., the number of proteins for which the actual classification is the same as the one predicted, $FP$ (false positive) is the number of proteins for which the actual classification differs from the one predicted, and $N$ is the total number of classified proteins in the functional network. Thus, the precision measures the [...]

Combining level-1 and level-2 neighbors yields a precision of 0.8349459 with a coverage of 0.8172702. This is only a slight improvement over using level-1 neighbors only, but the illustration for LPRJ above shows the value in using both.
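The leave-one-out evaluation and the two quality measures can be illustrated with the following sketch. For brevity it predicts the majority class among annotated direct neighbors (ties broken alphabetically) rather than the chi-square score, which is a simplification introduced here. The network, class labels and function names are hypothetical.

```python
from collections import Counter

def predict_class(adj, classes, p):
    """Majority class among annotated direct neighbors of p, or None."""
    votes = Counter(classes[q] for q in adj[p] if classes.get(q) is not None)
    if not votes:
        return None
    # Highest vote count first; alphabetical order breaks ties.
    return min(votes, key=lambda c: (-votes[c], c))

def leave_one_out(adj, classes):
    """Hide each annotated protein in turn and try to recover its class."""
    annotated = [p for p, c in classes.items() if c is not None]
    tp = fp = 0
    for p in annotated:
        hidden = dict(classes)
        hidden[p] = None                      # leave this protein out
        predicted = predict_class(adj, hidden, p)
        if predicted is None:
            continue                          # no prediction possible
        if predicted == classes[p]:
            tp += 1
        else:
            fp += 1
    n = len(annotated)                        # N: classified proteins
    precision = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / n if n else 0.0
    return tp, fp, precision, coverage

# Hypothetical toy network, not data from the paper.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("g", "h")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
classes = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "g": "X", "h": None}

tp, fp, precision, coverage = leave_one_out(adj, classes)
# tp == 4, fp == 1, precision == 0.8, coverage == 4 / 6
```

In this toy case protein "g" has no annotated neighbor, so it lowers coverage without affecting precision.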

Conclusions
We have developed novel information-theoretic based schemes for calculating the link confidence scores or link reliability for homology data, i.e., data from protein family and sequence similarity. These convert the amount of biological content shared between proteins into confidence scores of their functional relationships. The methods could be used for a clustering analysis but here they are used for functional network generation.
We applied these schemes to the genome of Mycobacterium tuberculosis strain CDC1551 to produce a protein-protein functional network. Results showed that the novel scheme is efficient and effective compared to the existing schemes and can be used to improve functional networks inferred from sequence data in terms of precision and coverage.
We analyzed the global behaviour of the network obtained from the new scoring schemes. Furthermore, the functional network produced was used to classify proteins in the unknown class using a local neighborhood based approach extended to level-2 protein neighbors in order to improve genomic coverage.
Currently, we are integrating all pair-wise functional interactions obtained from different data sources, including genetic interactions and functional genomics data, into a single protein-protein functional network in order to predict functions, where possible, of uncharacterized proteins in the genome and to study the biology of the organism.

Supporting Information
Table S1. Scores of functional interactions derived from sequence data. (XLS)

Figure 14. Performance evaluation of classification prediction approaches. Number of proteins incorrectly classified (false positives) versus number of proteins correctly classified (true positives) using level-1, level-2, and combined level-1 and level-2 interacting partners to improve coverage. doi:10.1371/journal.pone.0018607.g014