On the Relevance of Sophisticated Structural Annotations for Disulfide Connectivity Pattern Prediction

Disulfide bridges strongly constrain the native structure of many proteins and predicting their formation is therefore a key sub-problem of protein structure and function inference. Most recently proposed approaches for this prediction problem adopt the following pipeline: first they enrich the primary sequence with structural annotations, second they apply a binary classifier to each candidate pair of cysteines to predict disulfide bonding probabilities and finally, they use a maximum weight graph matching algorithm to derive the predicted disulfide connectivity pattern of a protein. In this paper, we adopt this three step pipeline and propose an extensive study of the relevance of various structural annotations and feature encodings. In particular, we consider five kinds of structural annotations, among which three are novel in the context of disulfide bridge prediction. So as to be usable by machine learning algorithms, these annotations must be encoded into features. For this purpose, we propose four different feature encodings based on local windows and on different kinds of histograms. The combination of structural annotations with these possible encodings leads to a large number of possible feature functions. In order to identify a minimal subset of relevant feature functions among those, we propose an efficient and interpretable feature function selection scheme, designed so as to avoid any form of overfitting. We apply this scheme on top of three supervised learning algorithms: k-nearest neighbors, support vector machines and extremely randomized trees. Our results indicate that the use of only the PSSM (position-specific scoring matrix) together with the CSP (cysteine separation profile) are sufficient to construct a high performance disulfide pattern predictor and that extremely randomized trees reach a disulfide pattern prediction accuracy of on the benchmark dataset SPX, which corresponds to improvement over the state of the art. A web-application is available at http://m24.giga.ulg.ac.be:81/x3CysBridges.


Introduction
A disulfide bridge is a covalent link resulting from an oxidationreduction process of the thiol group of two cysteine residues. Both experimental studies in protein engineering [1][2][3] and theoretical studies [4,5] showed that disulfide bridges play a key role in protein folding and in tertiary structure stabilization. The knowledge of the location of these bridges adds strong structural constraints to the protein, which enable to drastically reduce the conformational search space in the context of protein structure prediction. Due to the technical difficulties and the expensive cost of experimental procedures for determining protein structures (by x-ray crystallography, NMR or mass spectrometry), machine learning approaches have been developed to predict the formation of disulfide bridges in an automatic way.
Given an input primary structure, the disulfide pattern prediction problem consists in predicting the set of disulfide bridges appearing in the tertiary structure of the corresponding protein. This problem can be formalized as an edge prediction problem in a graph whose nodes are cysteine residues, under the constraint that a given cysteine is linked to at most to a single other one. Most recent successful methods to solve this problem are pipelines composed of three steps which are illustrated in Figure 1.
First, they enrich the primary structure using evolutionary information and sometimes structural-related predictions. Second, they apply a binary classifier to each pair of cysteines to estimate disulfide bonding probabilities. Finally, they use a maximum weight graph matching algorithm to extract a valid disulfide pattern maximizing the sum of these probabilities.
The central component of this three step pipeline is the binary classifier that predicts bonding probabilities for all cysteine pairs. The wide majority of available binary classification algorithms cannot process complex objects such as cysteine pairs natively, hence they require the user to encode such objects into vectors of (categorical or numerical) features. Since the way to perform this encoding typically has a major impact on the classification accuracy, a large body of work has been devoted to studying different feature representations for cysteines and cysteine-pairs. However, it is often the case that these different studies rely on different kinds of binary classifiers and slightly differ in their experimental protocol. Therefore, the comparison of the conclusions of these works is difficult. In consequence, the relevance of some features is still a subject under heavy debate. It is for example not clear whether the use of (predicted) secondary structure or (predicted) solvent accessibility can significantly improve disulfide pattern predictors [6][7][8].
The main contribution of this paper is an extensive study which aims at establishing the relevance of various structural-related annotations and of various feature encodings in the context of a disulfide pattern predictor such as the one presented in Figure 1. We consider various structural annotations, some which were already studied in the context of disulfide pattern predictionposition-specific scoring matrix, secondary structure and solvent accessibility -and some others which are more original in this context: 8-class secondary structure, disordered regions and structural alphabet. For each such annotation, we consider four different procedures in order to encode it as a feature vector. The combination of annotations with feature encodings leads to a large set of possible feature functions. In order to identify a minimal subset of feature functions that are relevant to disulfide pattern prediction, we introduce a tractable and interpretable feature selection methodology, based on forward selection of feature functions. We adopt a computational protocol that avoids any risk of overfitting and apply our approach in combination with two usual classifiers: k-nearest neighbors (kNN) and support vector machines (SVM), as well as with one classifier, which was not yet considered for disulfide pattern prediction: extremely randomized trees (ET) [9].
As a result of this study, we show that only a very limited number of feature functions are sufficient to construct a high performance disulfide pattern predictor and that, when using these features, extremely randomized trees reach a disulfide pattern accuracy of 58:2% on the benchmark dataset SPXz, which corresponds to z3:2% improvement over the state of the art. However, since SPXz only contains proteins with at least one intrachain disulfide bridge, we further consider the more heterogeneous and less redundant benchmark dataset SPX{ which also contains a significant number of proteins without any intrachain bridge. We then investigate the behavior of our disulfide pattern predictor on both datasets by coupling it with filters predicting the presence of intrachain bridges and the bonding states of individual cysteines. We consider both the case where bonding states are known a priori and the case where bonding states are estimated thanks to another predictor. We show that predicting the bonding states significantly improves our disulfide pattern predictor on SPX{, but slightly degrades it on SPXz. When the bonding states are known a priori, we reach very high accuracies: 89:9% on SPX{ and 75:8% on SPXz.
The following two sections give an overall view of related work by first discussing multiple sub-problems of disulfide pattern prediction and then presenting the kinds of features that have been proposed to describe cysteines and cysteine pairs in supervised learning approaches. We refer the reader to [10] for an extensive recent overview of the field.

Disulfide bridge related prediction problems
While the ultimate goal of disulfide bridge prediction is to infer correctly the whole connectivity pattern of any protein from its primary sequence, several researchers have focused on intermediate simpler sub-problems, which are detailed below.
Chain classification. This sub-problem aims at predicting for a given protein, whether (a) none of its cysteines participate to a disulfide bridge, (b) some of its cysteines are involved in disulfide bridges or (c) all of its cysteines are involved in disulfide bridges. Frasconi et al. [11] proposed a support vector machine classifier to solve this task. Fiser et al. [12] have exploited the key fact that free cysteines (not involved in any bond) and oxidized cysteines (involved in a bond but not necessarily an intra-chain disulfide bridge) rarely co-occur and shown that theirs sequential environments are different. From those observations, subsequent studies have reduced this sub-problem to a binary classification task: (a) or (c).
Cysteine bonding state prediction. This second commonly studied sub-problem consists in classifying cysteines into those that are involved in a disulfide bridge and those that are not. To solve this binary classification problem, several machine-learning algorithms were proposed such as multi-layer neural networks [13], two-stage support vector machines that exploit chain classification predictions [11] and hidden neural networks [14].
Disulfide bonding prediction. While chain classification works at the protein level and cysteine bonding state prediction works at the cysteine level, disulfide bonding prediction works at the level of cysteine pairs and aims at predicting the probability that a specific pair of cysteines will form a disulfide bridge during protein folding. Depending on the studies, some authors assume to have an a priori knowledge on the bonding state of isolated cysteines. This prior knowledge can be the actual state [15][16][17] or a prediction made by a cysteine bonding state predictor [18].
Disulfide pattern prediction. Once one or several of the previous tasks have been solved, the most challenging step is to predict the disulfide connectivity pattern. Fariselli et al. [19] were the first to relate the problem of predicting the disulfide pattern to a maximal weight graph matching problem. Several authors have since adopted this approach and proposed disulfide pattern predictors that fit into the three step pipeline of Figure 1. Baldi et al. [6,20] have used two-dimensional recursive neural networks to predict bonding probabilities, which are exploited by a weighted graph matching algorithm. Lin et al. [7,21] used the same graph Figure 1. Three-step approach for disulfide pattern prediction. (A) an input primary structure, which contains four cysteine residues. (B) The sequence is first enriched using evolutionary information and sometimes structural-related predictions such as the secondary structure. (C) A bridge classifier, then, predicts disulfide bonding probabilities for each cysteine pair and finally (D) a graph matching algorithm extracts the disulfide pattern with maximal weight. doi:10.1371/journal.pone.0056621.g001 matching approach while predicting bonding probabilities with support vector machines.

Features for cysteines and cysteine pairs
Machine learning algorithms are rarely able to process complex objects such as cysteine pairs directly, hence it is necessary to define a mapping from these objects to vectors of features. A large body of research on disulfide bridge prediction has been devoted to the analysis of such encodings into feature vectors.
In 2004, Vullo et al. [15] suggested to incorporate evolutionary information into features describing cysteines. For each primary sequence, they generate a position-specific scoring matrix (PSSM) from a multiple alignment against a huge non-redundant database of amino-acid sequences. This evolutionary information was shown to significantly improve the quality of the predicted disulfide bridges, which led the large majority of authors to use it in their subsequent studies. Generally, the PSI-BLAST program [22] is used to perform multiple alignments against the SWISS-PROT non-redundant database [23].
Zhao et al. [24] introduced cysteine separation profiles (CSPs) of proteins. Based on the assumption that similar disulfide bonding patterns lead to similar protein structures regardless of sequence identity, CSPs encode sequence separation distances among bonded cysteine residues. The CSP of a test protein is then compared with all CSPs of a reference dataset and the prediction is performed by returning the pattern of the protein with highest CSP similarity. This approach assumes to have an a priori knowledge on the bonding state of cysteines. In this paper, we introduce a slightly different definition of CSPs based on separation distances among all cysteine residues (see Candidate feature functions).
From the earlier observation that there is a bias in the secondary structure preference of bonded cysteines and non-bonded cysteines, Ferrè et al. [8] have developed a neural network using predicted secondary structure in addition to evolutionary information. Cheng et al. [6] proposed to also include predictions about the solvent accessibility of residues. The predictions of secondary structure and/or solvent accessibility used in their experiments were however not accurate enough to obtain significant performance improvements. Nevertheless, they observed that using the true values of secondary structure and solvent accessibility can lead to a small improvement of 1%. More recently, Lin et al. [7] proposed an approach based on support vector machines with radial basis kernels combined with an advanced feature selection strategy. They observed a weak positive influence by using predicted secondary structure descriptors, but their experimental methodology could suffer from overfitting so that this result should be taken with a grain of salt. Indeed, in this study, the same data is used both for selecting features and for evaluating the prediction pipeline. As detailed in [25], proceeding in this way often lead to an overfitting effect and hence to over-optimistic scores. Notice that the three studies [6][7][8] were all based on the secondary structure predicted by the PSIPRED predictor [26].
More recently, Savojardo et al. [27] reported an improvement of their predictive performance by taking into consideration the relevance of protein subcellular localization since the formation of disulfide bonds depends on the ambient redox potential.

Notations and problem statement
This section introduces notations and formalizes the disulfide pattern prediction problem. Let P be the space of all proteins described by their primary structure and P[P one particular protein. We denote C(P)~(C 1 (P), . . . ,C nC (P)) the sequence of n C~D C(P)D cysteine residues belonging to protein P, arranged in the same order as they appear in the primary sequence. A disulfide bonding connectivity pattern (or disulfide pattern) is an undirected graph G~(C(P),B) whose nodes C(P) are cysteines and whose edges B are the pairs of cysteines (f(C i ,C j )g that form a disulfide bridge.
Since a given cysteine can physically be bonded to at most one other cysteine, valid disulfide patterns are those that respect the constraint degree(C i )ƒ1, Vi[½1,n C . This constraint enables to trivially derive an upper bound on the number b of disulfide bridges given the number of cysteines: bƒt n c 2 s, where t : s is the floor function. If we know in advance the number b §1 of disulfide bridges, we can derive the number of valid disulfide patterns using the following closed form formula [28]: where C 2b denotes the number of possible subsets of size 2b of the set of n C cysteines. As an example, a protein with n C~6 cysteines and b~3 bridges has 15 possible disulfide patterns and a protein with n C~1 1 cysteines and b~5 bridges has 11|945~10 395 possible patterns. Figure 2 illustrates the three possible disulfide connectivity patterns of a protein with four cysteines and two disulfide bridges. When the number of bridges is unknown, the number of possible disulfide connectivity patterns for a protein P with n C cysteines becomes Note that the term z1 represents the case where no cysteine residue is bonded. As an example, a protein with n C~1 0 cysteines has 45|1z210|3z210|15z45|105z1|945z1~9 496 possible valid disulfide patterns. We adopt a supervised-learning formulation of the problem, where we assume to have access to a dataset of proteins (represented by their primary structure) with associated disulfide patterns. We denote this dataset D~f(P (i) , B (i) )g i[½1,N , where P (i) [P is the i-th protein and B (i) is the set of disulfide bridges associated to that protein. We also denote n (i) C~D C(P (i) )D the number of cysteines belonging to the protein P (i) . Given the dataset D, the aim is to learn a disulfide pattern predictor f ( : ): a function that maps proteins P[P to sets of predicted bridgeŝ B B~f (P). Given such a predicted set, we can define the predicted connectivity pattern as following:Ĝ G~(C(P),B B).
We consider two performance measures to evaluate the quality of predicted disulfide patterns: Q p and Q 2 . Q p is a protein-level performance measure that corresponds to the proportion of entirely correctly predicted patterns: where Pr f g is the indicator function whose value is 1 if Pr is true or 0 otherwise. Q 2 is a cysteine-pair level performance measure that corresponds to the proportion of cysteine pairs that were correctly labeled as bonded or non-bonded: Note that both Q p and Q 2 belong to the interval ½0,1 and are equal to 1 in case of perfectly predicted disulfide patterns. While the ultimate goal of disulfide pattern prediction is to maximize Q p , we will also often refer to Q 2 since, in the pipeline depicted in Figure 1, Q 2 is directly related to the quality of the cysteine pair classifier.

Disulfide pattern prediction pipeline
This section first presents the datasets and the five kinds of structural-related predictions we consider. It then details the different steps of our prediction pipeline: the dataset annotation, the pre-processing step that enriches the primary structure with evolutionary information and structural-related annotations, the classification step of cysteine pairs that predicts bridge bonding probabilities and the post-processing step that constructs a disulfide pattern from these probabilities using maximum weight graph matching.
Dataset and annotations. In order to assess our methods, we use two datasets that have been built by Cheng et al. [6] and extracted from the Protein Data Bank [29]. The first one, SPXz, is a collection of 1 018 proteins that contain at least 12 amino acids and at least one intrachain disulfide bridge. We use this dataset for the problem of pattern prediction. However, since it does not contain any protein without disulfide bridges it is not adapted to address chain classification and cysteine bonding state prediction. For these tasks, we use the other dataset, SPX{, which is made of 1 650 proteins that contain no disulfide bridge and 897 proteins that contain at least one bridge. In order to reduce the overrepresentation of particular protein families, both datasets were filtered by UniqueProt [30], a protein redundancy reduction tool based on the HSSP distance [31]. In SPX{, Cheng et al. used a HSSP cut-off distance of 0 for proteins without disulfide bridge and a cut-off distance of 5 for proteins with disulfide bridges. In SPXz, the cut-off distance was set to 10. To properly compare our experiments with those of Cheng et al., we use the same train/ test splits as they used in their paper. Statistics of the two datasets are given in Table 1.
We enrich the primary structure (denoted as AA) by using two kinds of annotations: evolutionary information in the form of a position-specific scoring matrix (PSSM) and structural-related predictions, such as predicted secondary structure or predicted solvent accessibility. We computed the PSSMs by running three iterations of the PSI-BLAST program [22] on the non-redundant NCBI database. To produce structural-related predictions, we use the iterative multi-task sequence labeling method developed by Maes et al. [32]. This method enables to predict any number of structural-related properties in a unified and joint way, which was shown to raise state of the art results. We consider here five kinds of predicted annotations: secondary structure (SS3, 3 labels), DSSP secondary structure (SS8, 8 labels), solvent accessibility (SA, 2 labels), disordered regions (DR, 2 labels) and a structural alphabet (StAl, 27 labels, see [33]). The two versions of secondary structure give two different levels of granularity. The structural alphabet is a discretization of the protein backbone conformation as a series of overlapping fragments of four residues length. This representation, as a prediction problem, is not common in the literature. Here, it is used as a third level of granularity for local 3D structures. To our best knowledge, predicted DSSP secondary structure, predicted disordered regions and structural alphabet annotations have never been investigated in the context of disulfide pattern prediction.
In order to train the system of Maes et al., we rely on supervision information computed as follows: secondary structures and solvent accessibility are computed using the DSSP program [34], disordered regions and structural alphabet are computed by directly processing the protein tertiary structure. Since the disorder classes are not uniquely defined, we use the definition of the CASP competition [35]: segments longer than three residues but lacking atomic coordinates in the crystal structure were labelled as disordered whereas all other residues were labelled as ordered.
Note that it is often the case that supervised learning algorithms behave differently on training data than on testing data. For example, the 1-nearest neighbor algorithm always has a training accuracy of 100%, while its testing accuracy may be arbitrarily low. In order to assess the relevance of predicted annotations, we expect our input enrichment step to provide ''true'' predictions, i.e., representative of predictions corresponding to examples that were not part of training data.
We therefore use the cross-validation methodology proposed in [36] that works as follows. First, we randomly split the dataset into ten folds. Then, in order to generate ''true'' predictions for one fold, we train the system of Maes et al. on all data except this fold. This procedure is repeated for all ten folds and all predictions are concatenated so as to cover to whole dataset. Table 2 reports the cross-validation accuracies that we obtained with this procedure. The default scoring measure is label accuracy, i.e., the percentage of correctly predicted labels on the test set. Since disordered regions labeling is a strongly unbalanced problem, label accuracy is not appropriate for this task. Instead, we used a classical evaluation measure for disordered regions prediction: the Matthews correlation coefficient [37].
Candidate feature functions. The feature generation step aims at describing cysteine pairs in an appropriate form for classification algorithms. This encoding is performed through cysteine-pair feature functions w : P|C|C?R d that, given a protein P and two of its cysteines (C i ,C j ), computes a vector of d realvalued features. In our experiments, we extracted cysteine-pairs (C i ,C j ) in such a way that 1ƒivjƒn C , where n C is the number of cysteine residues of P. Consequently, we extract n C |(n C {1) 2 Feature Functions for Disulfide Pattern Prediction PLOS ONE | www.plosone.org cysteine-pairs from P. The purpose of the feature selection methodology described in the next section is to identify a subset of relevant w functions among a large panel of candidate ones that we describe now. Our set of candidate feature functions is composed of primarystructure related functions and annotation related functions. The former are directly computed from the primary structure alone and are the following ones: N Number of residues: computes one feature which is the number of residues in the primary structure. N Normalized position difference: returns one feature which corresponds to the number of residues separating C i from C j in the primary structure, i.e., pos(C j ){pos(C i ), divided by the protein length. Note that as jwi and therefore pos(C j )wpos(C i ), this difference is always greater than zero. Note that in the general case, a A p,l probabilities may take any value in range ½0,1 to reflect uncertainty about predictions. Since the primary structure (AA) is always known perfectly, we have: equal to a pos(C k )zd,l . When the position is out-of-bounds, i.e., pos C k ð Þzd = [ 1,DPD ½ , the feature is set to 0.
Our candidate feature functions are summarized in Table 3. Note that three of them are parameterized by window size parameters. Figure 3 shows an illustration of the three kinds of histograms. We will see how to tune window sizes and how to select a minimal subset of feature functions in the next section.
Cysteine pair classifiers. Let fw 1 , . . . ,w m g be a subset of the candidate feature functions described above and let d i denote the dimensionality of the i-th function of this set. A cysteine pair classifier processes feature vectors of dimension P m i~1 d i , in order to predict disulfide bonding probabilities. In this study, we consider three such binary classifiers: N K-nearest neighbors (kNN) is a simple and well-known method for classification. In order to determine the disulfide bonding probability of a new example, the algorithm first search for the k nearest training samples and then returns the frequency of bonded cysteines among these neighbors. The distance between two feature vectors A and B is computed using a normalized version of the l2-norm, which is defined as follows: where A j i and B j i denote the j-th components of the i-th feature generator w i , and where s j i denotes the empirical standard deviation of this component, computed on the training data.  Note that previous studies on disulfide pattern prediction [7,38] also relied on the Gaussian radial basis function. In our experiments, we used the well-known LibSVM implementation [39]. In order to convert SVM predictions into probabilities, we use the default probability estimation method of LibSVM, which was proposed by Platt [40] and Wu et al. [41].
N Extremely randomized trees (ETs). This tree-based ensemble method, proposed by Geurts et al. [9], is similar to the popular Random Forests approach [42]. The main differences with the latter are that ETs does not rely on bootstrap replicates (unlike the Random Forests method, each tree is built using all learning samples), and that cut-points are selected in a random fashion, which was shown to lead to better generalization performances. The method has three hyper-parameters: K, the number of random splits tested per node creation, T, the  number of trees composing the ensemble, and N min , the minimum number of samples required to allow for splitting a node. We use the probabilistic version of ETs, in which each leaf is associated to a bonding probability, which is the empirical proportion of bonded cysteine pairs among the training samples associated to that leaf. In order to make one prediction, we traverse each of the T trees and return the average of the bonding probabilities associated to the corresponding T leaves. To our best knowledge, tree-based ensemble methods, and in particular ETs, were not yet applied to disulfide connectivity pattern prediction, despite the fact that several studies have shown that these methods very often outperform other methods such as support vector machines or neural network [43].
Maximum weight graph matching. Given bonding probabilities for every cysteine pair of a protein, the aim of this last step of the disulfide pattern prediction pipeline is to select a subset of disulfide bridges so as to respect the constraint degree(C i (P))ƒ1,Vi[½1,n C . As proposed previously, this problem is formalized as a maximum weight graph matching problem: the weight of a disulfide pattern is defined as the sum of probabilities attached to its edges and the aim is to find the valid pattern with maximal weight.
A naive solution to solve the maximum weight graph matching problem is to perform an exhaustive search over all valid disulfide patterns. The complexity of this procedure is however exponential in the number of cysteines, which is problematic for large proteins. This issue is often solved using the maximum weight matching algorithm of Gabow [44] whose time complexity is cubic w.r.t. the number of cysteines n C and whose space complexity is linear w.r.t. n C . In our experiments, we used Blossom V, which is a more recent and optimized implementation proposed by Kolmogorov [45].
Notice that, because this algorithm searches for a full matching, i.e., where each cysteine is associated to another one, it cannot be directly applied on proteins that have an odd number n C of cysteines. To deal with such proteins, we run the matching algorithm on each one of the n C subsets of n C {1 cysteines and select the solution with maximal weight.

Forward feature function selection
This section describes our forward feature function selection algorithm, which aims at determining a subset of relevant feature functions among those described above. Feature selection is an old topic in machine learning and a common tool in bioinformatics [46]. Our feature selection problem departs from traditional feature selection w.r.t. three unique aspects: N Feature function selection: we want to select feature functions rather than individual features. Given that feature functions can be parameterized by window sizes, our algorithm has to perform two tasks simultaneously: determining a subset of feature functions and determining the best setting for associated window sizes.
N Insertion in a pipeline: we want to optimize the performance Q p of the whole pipeline rather than the accuracy Q 2 of the classifier for which we perform feature selection. Preliminary studies have shown that these two performance measures are not perfectly correlated: a binary classifier with higher accuracy can lead to worse disulfide pattern predictions when combined with the graph matching algorithm, and conversely. N Interpretability: our approach not only aims at constructing a pipeline maximizing Q p , but also at drawing more general scientific conclusions on the relevance of various annotations of the primary structure. We thus require the result of the feature selection process to be interpretable.
In order to fulfill these requirements, we adopt a wrapper approach that repeatedly evaluates feature function subsets by cross-validating the whole pipeline and that is directly driven by the cross-validated Q p scores. In order to obtain interpretable results, we rely on a rather simple scheme, which consists in constructing the feature function set greedily in a forward way: starting from an empty set and adding one element to this set at each iteration.
In order to treat feature functions with parameters and those without parameters in an unified way, we express the feature functions as a set of parameterized feature functions W~fW (1) , . . . ,W (M) g where each W (i) contains a set of alternative feature functions W (i)~f w (i) 1 , . . . ,w (i) ai g. In the case where the feature function has no parameters (e.g., number of residues or labels global histogram), this set is a singleton W~fwg. Otherwise, when the feature function is parameterized by a window size, there is one alternative per possible window size, e.g., W csp W~f csp(1),csp (3), . . . ,csp (19)g. Our forward feature function selection approach is depicted in Algorithm. We denote by S( : , : , : )[R the objective function that evaluates the Q p score associated to a given set of feature functions, based on a cysteine pair classifier C and a dataset of proteins D. In our experiments, this objective function is computed by performing a 10-fold cross-validation of the whole prediction pipeline and by returning the test Q p scores averaged over the ten folds. 4remove the best parameterized feature function 6: until some stopping criterion is fulfilled 7:return U 4return feature function set The feature function is first initialized to an empty set 6 0 (line 1). Each iteration then consists in inserting a candidate feature functions w (i) j taken in the set W into U. For this, we try to add each candidate w (i) j to the current feature function set and select the best feature function w.r.t. the obtained cross-validation Q p scores (line 3). This feature function is then inserted into U (line 4) and the corresponding set of alternatives W (i Ã ) is removed from W. After a given stopping criterion is fulfilled, the constructed function set U is returned (line 7). In our experiments, this stopping criterion is simply a fixed number of iterations. An alternative consists in stopping the algorithm when no additional feature functions enable to improve the S score.
Note that due to its greedy nature, our feature selection may fall into local minima. However, compared to traditional feature selection, it may be the case that selecting feature functions instead of individual features significantly reduces the importance of this problem (the dimensionality of our search problem is much smaller than in the case of individual feature selection). We show in the next section that this algorithm is a tractable feature function selection approach that provides interpretable results, from which we can draw some general conclusions about the relevance of primary structure annotations.

Results: Disulfide Pattern Prediction
This section describes our experimental study on disulfide pattern prediction using the SPXz benchmark dataset. We first make an overall comparison of the three binary classification algorithms described previously and show that extremely randomized trees lead to significantly better results than the two other algorithms. We then apply our forward feature function selection approach using this algorithm and show that only a few feature functions are sufficient to construct a high performance disulfide pattern predictor. We finally compare this predictor with the state of the art and propose an analysis of the sensitivity of extremely randomized trees w.r.t. their hyper-parameters. Note that, for the moment, our prediction pipeline always tries to construct fully connected disulfide patterns and that it does not enable predicting partially connected disulfide patterns. We address this issue in the next section, by coupling our predictor with filters based on the bonding state of individual cysteines.

Comparison of the cysteine pair classifiers
Comparing cysteine pair classifiers in our context is not trivial for two reasons. First, we are primarily interested in the Q p score of the whole prediction pipeline rather than in the classification accuracy. Second, we do not have a fixed feature representation and different classification algorithms may require different feature function sets to work optimally. To circumvent these difficulties, we compare cross-validated Q p scores obtained with the three classifiers on a large number of randomly sampled feature function sets. To sample a feature function set of size m[½1,18, we proceed as follows. First, we draw a subset fW (1) , . . . ,W (m) g from W. Then, for each member W (i) of this subset, we select a feature function w (i) j , using the following rules: (i) local window sizes are sampled according to the Gaussian distribution N (15,15 2 ), (ii) local histogram sizes are sampled according to N (51,50 2 ) and (iii) CSP window sizes are sampled from N (7,11 2 ). These values were chosen according to preliminary studies using the three classifiers.
We set the hyper-parameters in the following way: N kNN. By studying the effect of k, we found out that large values of k drastically decrease the performance of kNN and low values do not enable to distinguish patterns well since the set of possible predicted probabilities is limited to kz1 values. In the following, we use the default value k~5, which we found to be a generally good compromise.
N SVM. It turns out that the best setting for c and C is highly dependent on the chosen feature function set. For each tested set of feature functions, we thus tuned these two parameters by testing all combinations of c[f2 {14 ,2 {7 ,2 0 ,2 7 ,2 14 g and C[f2 0 ,2 5 ,2 10 ,2 15 g and by selecting the values of (c,C) that led to the best Q p scores.
N ETs. We use a default setting that corresponds to an ensemble of 1 000 fully developed trees (T~1 000,N min~2 ) and K is set to the square root of the total number of features ffiffiffi d p , as proposed by Geurts et al [9].
The results of our comparison on SPXz are given in Figure 4. As a first remark, note the large range in which the Q p scores lie: from^15% to^60%. This shows that all three classifiers are highly sensitive to the choice of the features used to describe cysteine pairs, which is a major motivation for our work on feature function selection. The experiments are color-encoded w.r.t the size m of their feature function set. This color-encoding enables us to notice that, in general, larger feature function sets lead to better classifiers.
The mean and standard deviations of these results are 34:23%+7:45% for kNN classifiers, 43:96%+5:31% for SVM classifiers and 47:85%+7:17% for ETs classifiers. In 73.25% of the experiments, the best pattern accuracy is given by ETs and in 20.35% of them by SVMs. In the remaining 6.40% experiments, exactly the same number of disulfide patterns were correctly predicted by ETs and SVM. kNN was always outperformed by the other two classifiers. We have used the paired t-test to assess the significance of the out-performance of ETs. The p-value against kNN is O(10 {128 ) and the p-value against SVM is O(10 {38 ), which make it clear that ETs significantly outperform kNN and SVM. Moreover, ETs work well with a default setting contrarily to SVM that required advanced, highly time-consuming, hyperparameters tuning.
Given those observations, we proceed in the remainder of this study by restricting to the ETs method.

Feature functions selection
We now apply our feature function selection approach on top of extremely randomized trees. We rely on the set of parameterized feature functions W described in Table 3 and consider the following window size values: N Cysteine separation profile window: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19. N Local histograms: 10, 20, 30, 40, 50, 60, 70, 80, 90. N Local windows: 1, 5, 9, 11, 15, 19, 21, 25. This setting leads to a total of 150 candidate features functions. As cysteine pair classifier, we use ETs with the same default setting as previously (T~1 000, K~ffi ffiffi d p , N min~2 ). The simplest way to apply our algorithm would be to apply it once on the whole SPXz dataset. By proceeding in this way, the same data would be used for both selecting the set of feature functions and assessing the quality of this selected set. It has been shown that this approach is biased due to using the same data for selecting and for evaluating and that it could lead to highly overestimated performance scores [25].
To avoid this risk of overfitting, we adopted a more evolved approach, which consists in running the feature selection algorithm once for each of our 10 different train/test splits. In this setting, the whole feature selection algorithm is executed on a training dataset composed of 90% of the data and the generalization performance of the selected feature functions is evaluated using the remainder 10% of data. There are thus two different objective functions. We call cross-validated score the value returned by S ( : , : , : ), i.e., the 10 cross-validated Q p score using 90% of the data, and we call verification score the Q p score computed over the remainder 10% of the data. Figure 5 shows the evolution of the cross-validated score and the verification score for five iterations of the feature selection algorithm on each of the 10 train/test splits. Note that, since the cross-validated score is the score being optimized, its value increases at every iteration of each of the 10 runs. The evolution of the verification score, which represents the true generalization performance, is far from being so clear, as, in most cases, the optimum is not located after the fifth iteration. Table 4 reports the selected feature functions for each of the 10 runs. We observe that the first selected feature function is always w(PSSM, : ) with a window size varying in f9, 11,15,19g. This means that, taken alone, the best individual feature function is always a window over the position-specific scoring matrix. The fact that this results was observed during each run is very strong, since the selection algorithm has to select between 150 different functions. Similarly, the second selected feature function is always csp( : ) with a window size varying in f9, 13,17,19g.
After the two first iterations, the selected feature functions become more disparate and only lead to tiny improvements. This probably indicates that the system starts to overfit, by selecting feature functions that are specifically tailored to a specific subset of the training proteins. In iterations 3-4, we note that h local (SS8, : ) occurs slightly more often than the other feature functions (6 times over 20). From the two last rows, which give the averaged crossvalidated scores and the averaged verification scores, we observe that while the cross-validated score systematically increases, the verification score becomes unstable after the two first iterations. These observations reinforce the fact that the selected feature functions become more and more specific to training samples. From these results, it is clear that the feature functions w(PSSM, : ) and csp( : ) bring the major part of the predictive power that can be obtained by our feature functions.
According to these results, we focus in the following on the feature functions w(PSSM,15), csp (17) and h local (SS8,77), where we chose windows sizes by taking the average sizes reported in Table 4. Note that, contrarily to the observation of Figure 4 that suggested large feature function sets, our method carefully selected

Evaluation of the constructed prediction pipeline
We now compare our constructed prediction pipeline with the state of the art. We consider three baselines that were evaluated using the same experimental protocol as ours (10 cross-validated Q p ). The first baseline is the recursive neural network approach proposed by Cheng et al. [6]. These authors, who introduced the SPXz dataset, reached a pattern accuracy of 51% using the true secondary structure and solvent accessibility information. Lin et al. [7] proposed to predict the bonding state probabilities using a fine tuned support vector machine. They obtained a pattern accuracy of 54:5% by using the same data for feature selection and for evaluation, making this results probably over-estimated. Vincent et al. [47] proposed a simple approach based on a multiclass onenearest neighbor algorithm that relies on the fact that two proteins tend to have the same disulfide connectivity pattern if they share a similar cysteine environment. This method reaches a pattern accuracy of 55%. Table 5 reports the performance obtained by ETs with feature functions w(PSSM,15), csp (17) and h local (SS8,77). We observe that using only w(PSSM,15) already leads to a pattern accuracy of 51:6%, which is better than the baseline of Cheng et al. [6]. A significant improvement of z6:6% is achieved by adding the feature function csp (17), which leads to a model that significantly outperforms the state of the art. The feature function h local (SS8,77) leads to small further improvement of the Q p score, but due to the large variance, this improvement cannot be shown to be significant.
From these results, we conclude that only the following two feature functions are sufficient for high-quality disulfide pattern prediction in combination with ETs: local PSSM windows and CSP windows. Note that it might be the case that, by using larger datasets, feature functions such as medium-size histograms on predicted DSSP secondary structure could slightly improve the quality of the system. Table 6 reports the pattern accuracy as a function of the true number of disulfide bridges. By comparing the results with the three baselines, we observe that our method outperforms the baselines, except for proteins with 4 potential disulfide bonds where the approach proposed by Vincent et al. [47] obtains a better pattern accuracy.

Sensitivity of extremely randomized trees to its hyperparameters
This series of experiments aims at studying the impact of the hyper-parameters (T, K and N min ) when using the feature functions fw(PSSM,15),csp(17)g. With these two feature functions, the number of features is d~662. The default setting is T~1 000,K~ffi ffiffi d p , N min~2 and we study the parameters one by one, by varying their values in ranges T[½10,10 4 , K[½1,d and N min [½2,100. Figure 6 reports the Q p and Q 2 scores in function of the three hyper-parameters. As a matter of comparison, we also reported the Q p scores of the three baseline described previously. We observe that the Q p score grows (roughly) following a logarithmic law w.r.t. T. The value of T~1 000 occurs to be very good tradeoff between performance and model complexity. Concerning K, we observe that the value maximizing Q p is K^50, which is a bit larger than the default setting K~ffi ffiffi d p . Note that the proteinlevel performance measure Q p and the cysteine-pair level performance measure Q 2 do not correlate well in terms of the effect of parameter K, which confirms the interest of directly optimizing Q p in our feature function selection algorithm. N min controls the complexity of built trees and, hence, the bias-variance tradeoff by averaging output noise. It is usually expected that a small value of N min improves performance. In our case, we observe that increasing N min never improves the performance measures and that Q p has a large variance.

Results: Chain Classification and Cysteine Bonding State Prediction
Until now, our pipeline relies on a perfect graph matching algorithm that always attempts to predict patterns involving all cysteines. Due to this, our approach is, for the moment, unable to deal with partially connected disulfide patterns (except for proteins with an odd number of cysteines having a single non-bonded cysteine). This can be harmful, especially on datasets containing many non-bonded cysteines. For example, if we apply our pipeline to the SPX{ dataset, the pattern accuracy Q p is only 22%, since most proteins of this dataset do not contain any disulfide bridges. We now focus on this issue by coupling our predictor with filters based on the output of a chain classifier and on the output of a cysteine bonding state predictor. We first construct a chain classifier and a cysteine bonding state predictor by applying our feature function selection approach. We then study combinations of these predictors with our disulfide pattern predictor.

Chain classification
We consider the binary chain classification problem, which consists in classifying proteins into those that have at least one disulfide bridge and those that have no disulfide bridge. In order to construct a chain classifier, we apply the same methodology as before: we perform feature function selection on top of extremely randomized trees. Since chain classification works at the level of proteins, the set of candidate feature functions is restricted to labels global histograms. We also include as candidates the simple feature functions returning the number of residues, the number of cysteines and the parity of the number of cysteines. We use the following default setting for ETs: T~1 000,K~d and N min~2 . According to preliminary experiments, we found out K~d to be a good default setting for this task. This is probably due to the fact that we have far less features than we had before.
We performed ten runs of the feature function selection algorithm on the SPX{ dataset, which contains both proteins without disulfide bridge and proteins with disulfide bridges. The performance measure is the accuracy, i.e., the percentage of proteins that are correctly classified. In every feature function selection run, the first selected feature function was h global (PSSM) and the second one was h global (AA). Starting from the third   iteration, the results are more diverse and the system starts to overfit. By keeping the two first feature functions, we reach a 10 fold cross-validation accuracy of 79:5% on SPX{, which is not very far from the 82% accuracy obtained by [47].

Cysteine bonding state prediction
Cysteine bonding state prediction consists in classifying cysteines into those that are involved in a disulfide bridge and those that are not. To address this task, we apply our feature function selection approach on top of extremely randomized trees (T~1 000,K~ffi ffiffi d p and N min~2 ). The set of candidate feature functions is composed of those depending only on the protein (number of residues, number of cysteines, parity of the number of cysteines, labels global histograms) and those depending on the protein and on a single cysteine (labels local histograms, labels local windows, cysteine separation profile window). We consider the same window size values as in previous section. The evaluation measure is binary accuracy, i.e., the percentage of cysteines that are correctly classified.
We ran the feature selection algorithm once for each of the ten different train/test splits of SPX{. We observed that the selected feature functions set fw(PSSM,11),h global (PSSM),n C g led to a binary accuracy of 87:4%, which outperforms the result of 87% obtained by Vincent et al. [47]. On SPXz, we obtain a similar accuracy of 87:8%.
Note that once we have a cysteine bonding state predictor, we can use it to also solve the chain classification task as follows. In order to predict whether a protein contains disulfide bridges or not, we run the cysteine bonding state predictor on each cysteine, and see if at least one cysteine is predicted as being bonded. By applying this strategy to SPX{, we obtain a chain classification accuracy of 81:4%, which is comparable to the score of [47]. Table 7 summarizes the feature functions that were selected for the three tasks that we consider in this paper.

Impact on pattern prediction
Now that we have constructed a chain classifier and a disulfide bonding state predictor, we focus on the question of how to exploit the corresponding predictions in order to improve disulfide pattern prediction. Note that, in some cases, the user may have prior knowledge of either the chain class (whether the proteins contains any disulfide bridges or not) or of the cysteine bonding states (which are the cysteines that participate to disulfide bridges). To take the different possible scenarios into account, we study the following four settings: N Chain class known: in this setting, we assume that the chain classes are known a priori and simply filter out all proteins that are known to not contain any disulfide bridge. For the proteins that contain disulfide bridges, we run our disulfide pattern predictor as in previous section.
N Chain class predicted: in this setting, we replace the knowledge of the chain class by a prediction. We therefore rely on the chain classifier derived from the cysteine bonding state predictor, which obtained a chain classification accuracy of 81:4%.
N Cysteine states known: we here assume that the bonding states of cysteines is known a priori. We modify the disulfide pattern predictor by setting a probability of zero to any cysteine pair containing at least one non-bonded cysteine.
N Cysteine states predicted: in this setting, we first run our cysteine state predictor and then perform disulfide pattern prediction by only considering cysteine pairs in which both cysteines where predicted as bonded.  Note that, since the SPXz dataset is entirely composed of proteins with at least one bridge, our two first settings based on chain classification are irrelevant for this dataset. In these experiments, we learnt models using a 10-fold cross-validation of ETs (T~1 000,N min~2 and ffiffiffi d p ). Table 8 summarizes the results of our experiments on chain classification, cysteine bonding state prediction and disulfide pattern prediction with our four different settings. When the chain classes are known, we observe a significant improvement of the Q p score: from 22% to 82:5% on SPX{. When replacing the true chain classes with predicted chain classes, we still have a relatively high Q p score: 70:9%. This result is detailed in Table 9 as a function of the true number of disulfide bridges. We observe that our method clearly outperforms the method of Vincent et al.
[47] on proteins containing one or two disulfide bonds and performs slightly worst on proteins with three disulfide bonds. Given that a majority of proteins in SPX{ contain less than two bonds, these results leads to an overall score that is significantly better than that of Vincent et al. When the cysteine bonding states are known, we obtain impressive disulfide pattern accuracies: more than 75% on SPXz and almost 90% on SPX{. When using predicted cysteine bonding states, we still observe an impressive improvement on SPX{: from 22% to 71:4%. However, on SPXz, the score slightly degrades ({1:4%). This is probably related to the fact that, as soon as one cysteine is falsely predicted as being non-bonded, it becomes impossible to recover the correct disulfide pattern.

Discussion
Disulfide connectivity pattern prediction is a problem of major importance in bioinformatics. Recent state of the art disulfide pattern predictors rely on a three step pipeline, in which the central component is a binary classifier that predicts bridge bonding probabilities given cysteine pair representations. However, the comparison of the conclusions of these works is difficult because it is often the case that these different studies rely on different kinds of binary classifiers and slightly differ in their experimental protocol. Therefore, the relevance of some features is still a subject under heavy debate. This paper has proposed an extensive study on the best way to represent cysteine pairs in the form of features. We considered three classification algorithms: knearest neighbors, support vector machines and extremely randomized trees, and we proposed a forward feature function selection algorithm that we applied on the standard benchmark dataset SPXz.
Our experiments have shown that extremely randomized trees (ETs) are highly promising in terms of predicted disulfide pattern accuracy Q p . ETs are easy to tune and thanks to their use of decision trees, they benefit from good scaling properties, making them applicable to large sets of training proteins and large sets of features. The result of our feature selection experiments with ETs is that the primary structure related features functions w(PSSM,15) (a local window of size 15 on the evolutionary information) and csp(17) (a window of size 17 on the cysteine separation profile) are sufficient to reach a very high performing disulfide pattern predictor: ETs with these two kinds of features predict correct disulfide connectivity patterns in 58:2% of proteins, which outperforms the state of the art [47] with a z3:2% improvement. Furthermore, we showed that appending any other feature function does not lead to significant subsequent improvements or even decreases the accuracy.
We also investigated the question of how to exploit our disulfide pattern predictor with filters based on the output of either a chain classifier or of a cysteine bonding state predictor. Among the four scenarios that we considered, we observed an important potential for improvement when the cysteine bonding states are known, with scores reaching 75% on SPXz and almost 90% on SPX{. When using predicted cysteine bonding states, we still observe an impressive improvement on SPX{ (from 22% to 71:4%) but the score slightly degrades ({1:4%) on SPXz. This degradation is probably due to the fact that, as soon as one cysteine is falsely predicted as being non-bonded, it becomes impossible to construct the correct disulfide pattern. Therefore, one direction of future research is to develop more sophisticated methods to couple the cysteine bonding state prediction task with the pattern prediction task. One direction for such a better coupling is to apply the ideas developed in [32] on multi-stage and multi-task prediction, e.g., by iteratively re-estimating the disulfide bond probabilities.
Note that despite the fact that several studies have shown that tree-based ensemble methods often reach state of the art results in supervised learning (see e.g. [43]), these methods were surprisingly few applied to structural bioinformatics problems yet. We believe that ETs in combination with feature function selection provide a general methodology that can be applied to a wide range of protein related prediction problems and more generally to any kind of classification problems involving many different possible representations. We report the mean and standard deviation of the binary accuracy for chain classification and cysteine bonding state prediction while the Q p score is used for disulfide pattern prediction. The symbol -indicates that all cysteines are used in the experiment. doi:10.1371/journal.pone.0056621.t008