The Loss and Gain of Functional Amino Acid Residues Is a Common Mechanism Causing Human Inherited Disease

Elucidating the precise molecular events altered by disease-causing genetic variants represents a major challenge in translational bioinformatics. To this end, many studies have investigated the structural and functional impact of amino acid substitutions. Most of these studies were however limited in scope to either individual molecular functions or were concerned with functional effects (e.g. deleterious vs. neutral) without specifically considering possible molecular alterations. The recent growth of structural, molecular and genetic data presents an opportunity for more comprehensive studies to consider the structural environment of a residue of interest, to hypothesize specific molecular effects of sequence variants and to statistically associate these effects with genetic disease. In this study, we analyzed data sets of disease-causing and putatively neutral human variants mapped to protein 3D structures as part of a systematic study of the loss and gain of various types of functional attribute potentially underlying pathogenic molecular alterations. We first propose a formal model to assess probabilistically function-impacting variants. We then develop an array of structure-based functional residue predictors, evaluate their performance, and use them to quantify the impact of disease-causing amino acid substitutions on catalytic activity, metal binding, macromolecular binding, ligand binding, allosteric regulation and post-translational modifications. We show that our methodology generates actionable biological hypotheses for up to 41% of disease-causing genetic variants mapped to protein structures suggesting that it can be reliably used to guide experimental validation. Our results suggest that a significant fraction of disease-causing human variants mapping to protein structures are function-altering both in the presence and absence of stability disruption.


Introduction
Spurred by the advances in DNA sequencing, the accumulation of human genetic variation (and with it amino acid substitution data) has over the past two decades been unprecedented. Multiple databases and resources now enumerate and annotate amino acid substitutions, their functional impact, and association with inherited disease [1][2][3]. However, to further our understanding of human genetic variation and its impact on disease, it is necessary to elucidate the associated molecular alterations [4][5][6]. Thus, the step of identifying the underlying molecular mechanisms constitutes a serious impediment to understanding and treating human disease.
A straightforward approach to integrating genetic and molecular data is to search databases for structural and functional annotations at the variation site or in the neighborhood of interest, and then provide both the possible and the likely effects of mutations on these annotations [7][8][9][10][11][12]. Although this approach is useful, its major limitation is its dependence on previously observed and curated functional information as well as our inability, except in limited cases [9,10], to cover mutations that create functional residues. Furthermore, the deterministic nature of data integration does not easily lend itself to a principled strategy of prioritizing many of the possible molecular mechanisms based on their likelihood to impact clinical phenotype, especially when a variant resides in the neighborhood of the functional site.
A more comprehensive approach to analyzing the effects of amino acid substitutions involves the use of statistical inference methods that predict functional impact. Although there are many studies that have adopted this strategy using the data from protein sequence or structure [13], most methods make inferences without specifying which functional property has been impacted. Such an approach, however, is feasible if the methodology can be developed to predict a specific function, say a phosphorylation site or a catalytic residue, which is then applied to sequence variants [14][15][16]. Furthermore, these specific functional predictions can be integrated with general variant effect predictors to provide probabilistic estimates of molecular mechanisms of disease [17].
While many successful machine learning models can be made based on sequence information alone, structural information can provide additional benefits [18,19]. This suggests that further improvements could be made if specific predictors of protein function could be integrated into this pipeline [20]. Wang and Moult published a seminal work on the impact of germline variants on protein function [7]. They searched the Protein Data Bank (PDB) and used homology modeling to obtain 3D structures of wild-type proteins as a means to characterize the structural and functional effects of both disease and neutral variants. They reported that the majority of disease-causing substitutions affect protein stability, whereas a relatively small proportion directly disrupt molecular function. By contrast, Sahni et al. observed a rather larger fraction of function-impacting variants in their experimental studies into the impact of variants on protein-protein and protein-DNA interactions [21]. Their work therefore challenges the traditional view of the dominance of structure-impacting changes. Finally, Steward et al. examined the structural, functional and physicochemical features of wild-type protein structures where disease-causing variants occur [22]. Unfortunately, the scope of these and several other studies was limited to characterizing the functional effects of amino acid substitutions across a handful of protein functions [23,24]. There is therefore a need for large-scale studies that use statistical inference methods based on protein structure to explore the relative contributions made by disruption of functional sites in disease pathogenesis.
In this work, we carry out a systematic study of the alterations of specific functional sites as the underlying molecular mechanisms of disease over a data set comprising germline diseasecausing amino acid substitutions mapped to protein 3D structures. In particular, we develop multiple structure-based functional residue predictors and assess the impact of disease-associated substitutions on catalytic residues, metal-binding sites, macromolecular binding sites, ligand-binding sites, allosteric sites and post-translational modification (PTM) sites. We then quantify the extent to which disruption or introduction of particular types of functional site accounts for the deleterious impact of amino acid substitutions. Our results provide evidence to support the view that the increased and decreased propensity of particular functional activities are common in human inherited disease.

Probabilistic model for alteration of residue function
For a given protein structure and a missense variant, we are interested in estimating whether a particular residue functionality, say f, has been impacted. To achieve this, we broadly distinguish between two scenarios resulting in alteration of function: (i) the mutation disrupts protein stability and subsequently impacts residue function and (ii) the mutation does not impact stability and structure yet still leads to an altered function as a consequence of modified functional propensity. The latter scenario might occur, for example, for tyrosine-to-phenylalanine substitutions that result in minimal structural changes, yet the mutation itself may have significant impact on protein phosphorylation and downstream events. This is because phenylalanine cannot be phosphorylated to subsequently create an SH2 binding site [25]. We informally refer to the events of increased or decreased functional propensity as gain and loss of functional activity.
We formalize this approach as follows. Let x be a collection of features that encodes a particular mutation in a protein structure and let P(loss of f|x) be the probability of the loss of residue function f consequent to mutation. Then, we can write Pðloss of f jxÞ ¼ Pðloss of f jS; xÞPðSjxÞ þ Pðloss of f j S; xÞPð SjxÞ; The posterior probability of stability disruption P(S|x) can be determined by developing a computational model given a representative data set of variants that significantly impact stability of the protein (both stabilizing and destabilizing mutations) and a representative set of variants that do not. The negative data set can also be substituted by a large representative set of variants for which the impact on stability is unknown [26,27]. This formulation falls into the category of positive-unlabeled learning [28], a version of semi-supervised learning in which the set of negative examples is unavailable or ignored; e.g., because available negative examples are biased.
Next, we discuss estimating Pðloss of f j S; xÞ; i.e., the probability of the loss of function given that there is no significant stability disruption. Let x 0 be a collection of features that encodes the structural environment of a residue and let f be the specific residue function of interest; e.g., whether the residue is phosphorylated, DNA-binding, etc. Let Pðf jx 0 wt Þ denote the probability that the residue at this site is functional in the wild-type protein and Pðf jx 0 mt Þ the probability that the mutated residue, at the same locus in the protein, is functional. We then define the probability of the loss of residue function f as where 1 À Pðf jx 0 mt Þ gives the probability that the residue in the mutant protein is not functional. To estimate Pðloss of f j S; xÞ, we can employ the same functional residue predictor to compute prediction scores on the wild-type and mutant proteins. That is, because there are no structural changes, the same structure-based classifier can be used to compute both Pðf jx 0 wt Þ and Pðf jx 0 mt Þ, with the only difference being the replaced amino acid in the feature set x 0 mt . As before, the probabilistic model P(f|x 0 ) can be developed using data sets of positive and negative examples or, in the absence of representative negative examples, using positive and unlabeled data. We will discuss the details of approximating the posterior probability of protein function in the next section. Finally, we can consider that large changes in protein stability and structure always abolish the function of a residue; i.e., Pðf jx 0 mt Þ % 0. This implies that Pðloss of f jS; xÞ % Pðf jx 0 wt Þ; that is, the probability of the loss of function at a particular residue is roughly equal to the probability that the residue was functional in the first place. In addition to the loss of protein function, we can also consider the event of the gain of residue function, where Pðgain of f j S; xÞ ¼ ð1 À Pðf jx wt ÞÞ Á Pðf jx mt Þ: This formulation accounts for the changes in residue microenvironments that increase its functional propensity. While most amino acid substitutions found in nature are neutral or disruptive, there are many examples in which they lead to the gain-of-function events. For example, generation of a sequence motif NX[S/T] has been observed to result in gain of N-linked glycosylation events and disease [9]. Similarly, changes in catalytic residues have been observed to increase the efficiency of catalysis, also with phenotypic implications [29]. Furthermore, assuming that significant stability changes rarely lead to gain of function we can simply take Pðgain of f jS; xÞ % 0: In this paper, we are interested in specific types of residue function f for which sufficiently large data sets could be extracted from biological databases. We organize these residues into catalytic residues, metal-binding residues, macromolecule-binding residues, ligand-binding residues, post-translationally modified sites, and allosteric residues. For the purposes of our study, we consider certain types of residues to be functional although they may also be important for protein stability. For example, a disruption of certain metal-binding sites, say a Zn 2+binding residue, will be considered here as disruption of functional residues that consequently impacts protein stability. We do, however, note that this distinction is somewhat philosophical.

Training stability predictors and functional residue predictors
All classification models in this work were trained using the positive-unlabeled framework in which we are given a set of positive examples and a set of unlabeled examples. In the case of stability predictors, P(S|x), the positive examples represent mutations that have been experimentally shown to significantly impact protein stability; based on the previous studies we selected these mutations to be either stabilizing or destabilizing with |ΔΔG| > 0.5 kcal/mol [30], although some other studies use higher values [31,32]. The set of unlabeled examples, on the other hand, was selected using a database of human variants, dbSNP, mapped to available protein structures in PDB. In the case of functional residue predictors P(f|x 0 ), the positive examples were selected by integrating structural and molecular data that provide experimentally observed functional residues, whereas the unlabeled examples were selected from a set of monomeric proteins in PDB. We will describe all data sets precisely at the end of the Methods section. We next discuss how to train a classification model from positive and unlabeled data. Let Unfortunately, learning P(y = +1|x) is not straightforward because the negative examples are not available. To address this problem we rely on the body of work in semi-supervised learning that decomposes the problem into the training of a non-traditional classifier [26]; i.e., a model that distinguishes between labeled and unlabeled data, and estimating the class prior P(y = +1). We denote the posterior probability from a non-traditional classifier as P(l = +1|x), where l = +1 refers to the event of data point being labeled. We approximate these probabilities using kernel-based learning with support vector machine (SVM) classifiers as underlying optimization engines. Additionally, we estimate P(y = +1) using the AlphaMax algorithm [27], and point out other available options for an interested reader [26,33].
Under mild assumptions [27], the output of a non-traditional classifier P(l = +1|x) can be converted into the output of a traditional classifier P(y = +1|x) using where m and n are the sizes of labeled and unlabeled data sets, respectively. This predictor can now be applied to any data set D to compute the frequency of the phenomenon using the empirical mean formula 1 jDj P x2D Pðy ¼ þ1jxÞ.

The probability of alteration for multiple types of function
We previously considered the loss and gain of the specific function f at a particular residue of interest. We now extend this definition to multiple types of functional residues as follows. Consider an event of loss of any function f from a set F . We can use previous reasoning to re-write the earlier expression as Pðloss of F jxÞ ¼ Pðloss of F jS; xÞPðSjxÞ þ Pðloss of F j S; xÞPð SjxÞ: To compute this probability, we need to compute probabilities Pðloss of F jS; xÞ and Pðloss of F j S; xÞ. Because the functional data is too sparse to learn the joint (posterior) models of residue function, we consider two models to approximate this probability using the marginal (posterior) models that the residue is functional. In the first model that we refer to as the independence model, we consider each type of functional residue to be independent of others and write ð1 À Pðloss of f j S; xÞÞ: The expression above is the probability that at least one of the functions from F has been lost. Because the functions are not in reality independent, this model may lead to overestimation. The second, more conservative model, approximates the probability of loss as We refer to this model as the max model. Equivalent expressions can be written for Pðloss of F jS; xÞ as well as for the gain-of-function events. We note that F may contain particular groups of functions, say all types of metal binding, or can be used for all functions considered in this work.

Graphlet kernels
In this section we briefly summarize the graphlet kernel prediction framework and show how these kernels were used to train both stability predictors and functional site predictors. Graphs. A graph G is a pair (V, E), where V is a set of vertices (nodes) and E V × V is a set of edges. In a vertex-labeled (colored) graph, a labeling function g is defined as g: V ! S, where S is a finite alphabet, commonly referred to as vertex alphabet. A graph without selfloops, i.e. where (v, v) = 2 E, 8v 2 V, is said to be simple. An undirected graph is a graph where the order of the vertices in each pair (u, v) 2 E can be ignored; otherwise, the graph is said to be a directed graph. A rooted graph G is a graph together with a distinguished vertex termed the root.
Edit distance graphlet kernels. Consider a vertex-labeled graph G = (V, E, g, S), where |S| ! 1. Lugo-Martinez and Radivojac [38] defined the m-edit distance representation of vertex v as where c ðn i ;mÞ ðvÞ ¼ X n j 2Eðn i ;mÞ wðn i ; n j Þ Á φ n j ðvÞ: In the previous expression φ n j ðvÞ is the count of the j-th labeled n-graphlet rooted at v, κ(n, S) is the total number of vertex-labeled n-graphlets and E(n i , m) is a set of n-graphlets such that for each n j 2 E(n i , m) there exists an edit distance path of length at most m that transforms n i into n j . That is, the number of edit operations necessary to transform n i into n j is at most m, where edit operations are defined as insertion or deletion of vertices and edges, or in the case of labeled graphs, substitutions of vertex and edge labels. Finally, weights w(n i , n j ) ! 0 are used to adjust the influence of pseudo counts and control computational complexity; in this study, we set w-(n i , n j ) = 1 if n j 2 E(n i , m) and w(n i , n j ) = 0 otherwise. The length-m edit distance n-graphlet kernel k (n,m) (u, v) between vertices u and v can be computed as an inner product between the respective count vectors ϕ (n,m) (u) and ϕ (n,m) (v). Hence, the length-m edit distance graphlet kernel function can be expressed as where N is a small integer; typically defined up to N = 5 for undirected graphs. Additionally, one can define two subclasses of edit distance kernels referred to as (vertex) label-substitution k l (only allows substitutions of vertex labels) and edge-indel kernels k e (only allows insertion or deletion of edges). It is worth noting that if m = 0, then k m , k l m and k e m are all equivalent to the standard graphlet kernel on labeled graphs [37]. In this work, we only considered the normalized kernel calculated as vÞ. The normalized kernel has been previously shown to have favorable performance with respect to non-normalized kernels [37,38].
Practical aspects of training. We used the graphlet kernel framework and SVM classifiers to construct all functional site predictors. First, we modeled protein structures as protein contact graphs, where each amino acid residue was represented as a vertex and two spatially close residues (i.e. 4.5Å or less between any two atoms) were linked by an undirected edge. Fig 1 illustrates a contact graph for a protein kinase rooted at a tyrosine residue at position 148. Next, we computed a set of normalized graphlet kernel matrices K using k l m ðx i ; x j Þ; k e m ðx i ; x j Þ and k m (x i , x j ) for all pairs (x i , x j ). For each k 2 K, we used SVM light [39] and the default value for the capacity parameter to train a predictor. We incorporated evolutionary information by extending the vertex alphabet S from the 20 standard amino acids to 40 based on the median residue conservation observed over the entire data set [38]. For example, the amino acid alanine was split into highly conserved alanines (represented as A) and other alanines (represented as a). Once each predictor was trained, we used Platt's correction to adjust the outputs of the predictor to the 0-1 range [40].

Evaluation of in silico predictions
The performance of each predictor was first evaluated through a per-chain 10-fold cross-validation. In each iteration of cross-validation, 10% of protein chains were selected for the test set, whereas the remaining 90% were used for training. This enforces that all data points from the same protein sequence belong to either training or test set and, thus, reduces the chance of overestimating the accuracy of the models. We estimated the area under the ROC curve (AUC), which plots the true positive rate as a function of the false positive rate and the Matthews correlation coefficient (MCC).
It is impractical to validate, in vitro or in vivo, the functional effects of each amino acid substitution in our data sets. Therefore, we used independent mutagenesis experimental data to additionally evaluate the performance of our functional site predictors and also evaluate predictions of the loss of functional residues. More specifically, we downloaded all human mutagenesis experimental data from UniProt as of September 2014. This data set comprised 14,933 substitutions from 3,044 distinct proteins. We removed all entries associated with more than one substitution. The resulting 11,425 sites were mapped to high-quality PDB structures using the same steps described in the next section. The final data set comprised 3,356 amino acid substitutions from 2,809 different sites in 880 proteins.
For each site in this data set, we extracted functional annotations related to metal binding, PTMs, active sites, macromolecular binding, ligand binding and allosteric activity. Then, for each functional site predictor, we built an independent test set such that (i) each site belonged to a chain that was less than 40% identical to any chain in the training data, (ii) there were at least five positive sites in each test set. The resulting set was used to assess the performance of the functional predictors independently of the cross-validation.
Similarly, we created a test data set to evaluate loss-of-function predictions as follows: we searched the description of the mutagenesis experiment such that there was an experimentally . Right: Corresponding level-3 protein contact graph centered at Tyr148 (denoted with double circles). Nodes represent amino acid residues and edges correspond to spatially neighboring residues (i.e. 6Å or less between C α atoms). observed disruption of a functional site. The resulting non-redundant and filtered data sets were then used to estimate the AUC and MCC of all loss-of-function predictors. We attempted to carry out the same steps for gain-of-function events but there were insufficient data.

Disease-associated mutation data
Missense variants causing inherited disease were obtained from the Human Gene Mutation Database (HGMD) as of June 2013. A set of unlabeled inherited variants was downloaded from dbSNP v.137. All amino acid substitutions were then mapped to protein structures in PDB as follows: (i) a database of amino acid sequences from X-ray crystallographic protein structures with more than 50 amino acids and resolution less than 2.5Å was created, and (ii) for each variant, a 51 residue long sequence centered around the wild-type amino acid at the variant position was aligned using BLAST [51] against the atom sequences in PDB. All alignments without an exact match; i.e., with gaps or sequence identity lower than 100%, were excluded from this study and in the case of multiple exact matches, the structure with the best resolution was selected. This resulted in 10,629 (out of 52,406) disease-causing amino acid substitutions and 8,417 (out of 282,625) unlabeled amino acid substitutions being successfully mapped to highquality PDB structures. Table 1 summarizes both data sets. There exists an overlap between the HGMD (disease) and dbSNP (unlabeled) variants; the subset of dbSNP variants after the removal of HGMD variants will be referred to as putatively neutral variants (Table 1).

Functional site data sets
Metal ions annotated in X-ray structures from PDB as of May 2012 were selected using the HETATM field [52]. A metal-binding residue was defined as the residue that has at least one heavy atom (N, O or S) within 3Å of the metal ion. In order to build an unbiased classifier, we removed chains with: (i) more than 40% sequence identity with any other chain in the data set, (ii) crystallographic resolution greater than or equal to 2.5Å, and (iii) R-values greater than or equal to 30%. We only considered data sets with more than 100 metal-binding residues and we refer to these residues as positives. N-linked glycosylation sites were parsed from PDB and filtered in the same way as the metal-binding sites. In the case of phosphorylation sites, we used the data set assembled previously [20]. Catalytic residues were collected from the Catalytic Site Atlas v2.2.10 [53] and only literature-supported sites were kept as positive examples. As with the previous data sets, we filtered out chains with more than 40% sequence identity. DNAbinding sites were collected from Yan et al. [54], RNA-binding sites were downloaded from ccPDB [55] and protein-protein interaction (PPI) sites were obtained from Chung et al. [56]. Each data set was further filtered to remove redundancy. Protein-protein interaction hot spot residues were obtained from Lise et al. [57]. This data set was created from ASEdb [58] and only sites that mapped to PDB were used. Chains with more than 35% sequence identity were filtered out in the original work. In our analysis, we used ΔΔG greater than 1kcal/mol as the For each data set, we show the total number of amino acid substitutions (n AAS ), the number of substitutions mapped to PDB (n s ), the number of PDB entries (n PDB ) and the number of protein chains (n c ). doi:10.1371/journal.pcbi.1005091.t001 cutoff for hot spots. Ligand binding data sets were collected from ccPDB as of November 2014 and allosteric site data were downloaded from ASD v2.0 [59]. For each data set, we removed chains with resolution greater than or equal to 2.5Å. Protein stability data was collected from Capriotti et al. [41]. We used the S1615 data set which consists of 1615 single site mutations extracted from 42 different proteins in the ProTherm database [60]. The attributes for each data point included solvent accessibility, pH value, temperature and energy change ΔΔG. As previously noted, positive data points comprised mutations with |ΔΔG|>0.5. We further filtered out 112 redundant data points. Table 2 summarizes the protein stability and functional site data sets used in this study. In all situations, the unlabeled data set was constructed using a random sample of 10,000 residues selected from the 40% non-redundant set of monomers in PDB. This set was modified in the For each data set, we show the number of protein chains (n c ) and the number of positive examples (n + ). Additionally, we choose a score threshold corresponding to a specificity (sp) of 99% and report sensitivity (sn) and MCC at this threshold, as well as AUC. In each classification problem, the number of unlabeled examples was set to 10,000. S1 Table predictions lists the full name of each ligand code used. For the purposes of this work, structurally important amino acid residues such as specific metal ion binding residues were considered a part of the portfolio of available residue functions. case of post-translational modifications to include only modifiable residues; e.g., Asn for Nlinked glycosylation and Ser/Thr/Tyr for phosphorylation. It is important to emphasize that the set of negative examples was allowed to contain both buried and surface-exposed residues resulting in somewhat easier downstream classification problems. On the other hand, it allowed us to apply our methods to all PDB-mappable amino acid substitutions and make unbiased inferences related to different data sets.

Results
In this section we present the development of a stability model and a series of structure-based functional site predictors in order to examine the molecular effects of genetic variants. We evaluate the predictors through cross-validation and using an independent data set. We then summarize our results in relation to the functional impact of disease-causing substitutions and compare them to putatively neutral variants.

Assessment of functional site predictors
All classifiers developed in this study were constructed using positive and unlabeled data summarized in Table 2. Their performance was estimated via per-chain 10-fold cross-validation and is also shown in Table 2. S2 Table further lists the parameters for the best-performing kernel matrix obtained from a grid search over K, |S| = {20, 40}, m = {0, 1} and N = {4, 5}. Each predictor performance was assessed by means of the area under the ROC curve (AUC), sensitivity (sn) at 99% level of specificity (sp), and the Matthews correlation coefficient (MCC). The majority of predictors (26 out of 30) show good performance (! 70% AUC); however, we observe that functions related to smaller interfaces such as metal ions and active sites exhibit higher performance than other functional predictors. This result is not unexpected because predictors of macromolecular binding would have benefited from incorporating higher-order structural signatures such as clefts and pockets [20]. We also use an independent data set to evaluate a subset of functional site predictors, as depicted in S3 Table. Interestingly, most predictors, except for macromolecular binding models show similar or improved performance (AUC) values compared to those reported from crossvalidation (Table 2). Overall, despite the variability of performance accuracies, limited number of independent data sets and the relatively small size of the validation data, these results provide evidence that functional site predictions are of sufficient quality to identify possible molecular alterations resulting from specific missense mutations.
A literature survey suggests that our predictors perform well when compared to established structure-based methods. Extensive comparisons with other work are difficult and were beyond the scope of this study as our main goal was to probabilistically assess molecular mechanisms of disease. A set of predictors built using the same methodology was best suited to this task.

Estimating prior and posterior probabilities
To use the formal framework laid out in the Methods section, it is important that all methods approximate posterior distributions. Using positive and unlabeled data, we have approached this problem in two steps: (i) by developing classifiers that discriminate between labeled and unlabeled data, and (ii) by estimating the class priors of the positive class in the unlabeled data [27].
Estimated class priors are a particularly useful by-product of learning posterior distributions. For the stability predictor, we estimate up to 13% of unlabeled variants to significantly impact stability using the AlphaMax algorithm [27]. When the stability model was applied to disease variants only, we estimate 14% of these variants to be impactful using the empirical mean formula. It should be noted that when the known disease variants were removed from the unlabeled data set, only 7% of the remaining variants were estimated to severely impact stability. In the case of functional predictors, we applied the AlphaMax algorithm using a set of positive variants and a set of 10,000 variants randomly sampled from a set of non-redundant monomers in PDB (S4 Table). In the case of catalytic residues, we estimate that up to 3% of PDB residues to be catalytic; however, about 5% of disease-causing and 2% of putatively neutral variants were estimated to be catalytic residues, etc. Overall, we generally observe a larger fraction of function-impacting variants in the disease-causing data set as compared with the putatively neutral variants.

Applying loss and gain functional site predictors to human variants
We applied the structure-based predictors on both the wild-type and mutant structural environments as a means to identify and categorize the functional effects of amino acid substitutions causing inherited disease. The distribution of scores on the putatively neutral variants was used as an empirical null distribution. We then used a particular false positive rate (FPR) value to determine a prediction threshold at which to assess the fraction of disease mutations with loss or gain scores that are as high as or higher than the threshold. Table 3 summarizes the relative contributions of disease mutations that either decrease (loss) or increase (gain) the propensity of functional sites at a conservative threshold of 1% FPR for six different prediction outputs. Fig 2 visualizes a subset of these results for the case when stability is not impacted. Together with Table 3, it provides evidence that the loss and gain of functional sites exist even when protein stability is not disrupted; e.g., in the case of loss of function see columns Pðlossj S; xÞ and Pðloss; SjxÞ that roughly have the same values as P (loss|x). We mention that when either a loss or a gain of function event is found to be statistically significant, the mutation of this type of functional residue is considered to be an active mechanism of genetic disease. For instance, at 1% FPR, we observe that loss of catalytic residues (Cat; 3.34%; p-value = 1.93 Á 10 −28 ) and iron-binding residues, (Fe; 3.17%; p-value = 2.06 Á 10 −25 ) are among the most significantly affected molecular mechanisms. Table 3 also summarizes the statistical enrichment of impact on at least one functional site from the entire repertoire of functions using the independence and max models (see Methods). Here we observe a strong enrichment in all categories of loss of function, with or without impact on stability, for both the independence and max models. Additionally, we also see an enrichment in the gain-of-function events. These results provide statistical support for many individual studies that identify loss of function as a signature of human inherited disease.
Overall, our results suggest that with some exceptions, the loss of functional residues is enriched and common in human inherited disease; similarly, the gain of functional residues is observed to be an active mechanism in catalytic activity, most types of ligand-binding residues, and majority of metal-binding residues. In contrast to previous studies, our results suggest that the loss and gain of PTM sites do not show statistically significant enrichment in disease (although we observe enrichment for the loss); however, we note that this may be due to a considerable reduction of training data imposed by the availability of protein 3D structures, especially given a relationship between post-translational modifications and intrinsically disordered proteins [61][62][63][64]. Table 4 shows the proportions of disease and putatively neutral variants across functional categories for which molecular mechanisms can be computationally hypothesized. In the first part of the table, we compute the fraction of variants for which exactly one of the member predictors reports a score as high or higher than the FPR-value determined threshold. These fractions were then computed separately for disease and neutral variants. For convenience, when a predictor outputs a value as high or higher than the value determined by a 1% FPR, we refer to this prediction as actionable hypothesis of loss or gain of function. On the other hand, when the FPR-based threshold is adjusted using the Bonferroni correction, we refer to these predictions are confident. For example, at a conservative p-value cutoff of p < 8.62 Á 10 −4 , we find For each of the six prediction outputs and each function f, we show the percentage (%) of disease mutations that have a greater probability of loss and gain of function than a threshold corresponding to a 1% false positive rate (FPR). S1 and S2 Figs show an instance of the inverse cumulative distribution function of P(loss|x) and P(gain|x), respectively. These thresholds were estimated from the empirical null distributions of the probability of loss or gain of function on the set of dbSNP neutral data. *Indicates significant p-value measured by a one-tailed Fisher's exact test after Bonferroni correction for multiple comparisons. The p-value was separately estimated for each type of posterior distribution, jointly for loss and gain events (p < 0:05 58 ¼ 8:62 Á 10 À4 ). The p-values for the combined models were corrected separately (p < 0:05 4 ¼ 1:25 Á 10 À2 ).
doi:10.1371/journal.pcbi.1005091.t003 that 1.51% of mutations are likely to alter exactly one metal binding site and 1.43% may alter a single ligand binding site. For all groups of molecular mechanisms, we observe that the probability of observing a high alteration score is more than three times as likely as in the case of putatively neutral variants. Table 4 also shows situations with two or more functional perturbations consequent to the replacement of a given amino acid residue. The amino acid substitutions disrupting multiple functions may be important in a therapeutic context because addressing a single deficiency (e.g. iron binding) may still not result in a fully corrected phenotype because other deficiencies may still remain (e.g. ligand binding). Here, we have a significantly increased likelihood of observing multi-functional alterations in the disease set compared to the putatively neutral set; i.e., the disease set is several times more likely to contain multi-functional alterations than the putatively neutral set. For instance, 1.23% of disease mutations are likely to affect at least two metal binding sites versus only 0.27% of neutral variants, whereas 0.68% of disease variants may affect more than one ligand binding site as opposed to 0.19% of neutral polymorphisms.
If we combine the results for single and multiple mechanisms, we observe that 2.24% of disease variants are predicted, with high confidence, to impair metal-binding sites (1.51% loss of single site and 0.72% loss of multiple sites) and 1.67% probably impair ligand binding sites (1.43% loss of single site and 0.24% loss of more than one site), as depicted in Fig 3. Overall, we believe we can confidently propose molecular mechanisms of disease for 8.6% of all variants in the inherited disease data set whereas we only see about 2.4% of such variants in the neutral set. If we use a p-value cutoff of 0.01 without a Bonferroni correction, then we can computationally hypothesize a molecular mechanism for approximately 40.9% of disease variants.

Validation of loss of function predictions
In this study, we have proposed a novel methodology for identifying specific molecular alterations of disease mutations. Given that it is impractical to experimentally validate the predicted functional effects of each individual amino acid substitution, we use mutagenesis experimental data to independently assess the loss of functional site predictions, as shown in S5 Table. To the best of our knowledge, this is the first time a systematic assessment of computationally predicted disruptions of specific types of functional residues has been carried out in the published literature. For each functional site category, we show the relative contributions (%) of disease and neutral substitutions where at least one function f within a category has a greater P(loss|x)or P(gain|x) than a conservative threshold at 1% FPR. This threshold is estimated from the null distributions of P(loss|x) and P(gain|x) on the putatively neutral polymorphisms data set, respectively. The table is subdivided into two parts: (i) exactly one function (or mechanism) and (ii) two or more mechanisms. In both parts, the relative contributions are assessed at two p-value cutoffs of p < 8.62 Á 10 −4 and p < 0.01. Note that in a small number of cases, a loss of one function might result in the gain of another; thus, the sets of residues counted in the loss and gain may overlap. In general, our loss of function predictors performed as expected. However, more interestingly, if one restricts the loss of function predictions to those with significant p-values (i.e. p < 0.01), then performance (AUC) rises to at least 95% for all predictors. This provides compelling evidence that our methodology can be effectively used to identify molecular mechanisms of disease and hence can be used to prioritize experimental validation. Additionally, Fig  4 depicts two case studies of loss and gain of function predictions which have been experimentally validated. We discuss each case in detail below:  Loss of zinc binding in superoxide dismutase (SOD1). The functional role of SOD1 is to destroy radicals that are normally produced in cells and which are toxic to biological systems. SOD1 forms a zinc-binding pocket consisting of H63, H71, H80 and D83 [65,66] as shown in Fig 4 (left). Mutations in SOD1 are known to be causative of amyotrophic lateral sclerosis [66][67][68]. However, the molecular mechanisms underlying these mutations often remain unclear. We predicted a loss of multiple functional activities for mutation D83G and identified zinc binding as the primary underlying molecular mechanism of disease. In particular, D83G has a Pðf jx 0 wt Þ ¼ 0:99 and Pðf jx 0 mt Þ ¼ 3:3 Á 10 À3 leading to a P(loss|x)%1, which is above the 1% FPR threshold of 0.20 with an empirical p-value of 1.2Á10 −3 . A literature search for experimental evidence reveals that mutation D83G causes the destabilization of native structure which leads to protein aggregation with the formation of amyloid-like fibrils, and, ultimately, a gain of toxicity [69]. Zinc binding is a known stabilizer of protein structure and, therefore, the loss of the zincbinding residue D83 appears to be a plausible destabilizing mechanism that ultimately impacts the biological function of SOD1. We note that the quadruple (H63, H71, H80, D83) was not part of the training data for the zinc-binding predictor.
This example raises an interesting possibility that the loss of a functionally important residue (zinc-binding residue) results in a loss of stability, and ultimately leads to disease through the loss of the protein's function. In other words, protein structure and function appears to be intimately and bidirectionally interconnected. At this moment, however, this is only a theoretical possibility because of the lack of data about the structure and stability of the wild-type and mutant proteins in the absence of zinc ions.
Gain of zinc affinity in carbonic anhydrase 2 (CA2). CA2 is essential for bone resorption and osteoclast differentiation. CA2 has three zinc-binding residues at H94, H96 and H119 as shown in Fig 4 (right). There are multiple studies that have characterized the effects of variants in CA2 via mutagenesis experiments [70][71][72][73][74]. Among these mutations, we predicted a gain of zinc binding for T198E that was experimentally shown to increase zinc affinity. Specifically, T198E has a Pðf jx 0 wt Þ ¼ 3:8 Á 10 À4 and Pðf jx 0 mt Þ ¼ 0:99 leading to a P(gain|x) % 1, which is above the 1% FPR threshold of 0.35 with a p-value of 3.7Á10 −4 . The triple (H94, H96, H119) was not part of the training data for the zinc-binding residue predictor.

Discussion
This study builds on the extensive prior work in structural bioinformatics to provide statistical evidence of the important role that alterations of multiple types of functional residue play in human genetic disease. Most of the existing work has centered around understanding the impact of sequence variants on protein stability or has only considered single types of function such as catalytic residues or protein-interaction sites [7,20,23,24,30,[75][76][77][78]. This work extends these studies by integrating the stability models with a series of functional residue predictors involving metal binding, macromolecular binding, ligand binding and others. Overall, we show and validate the feasibility of computationally predicting mutations that impair specific function using protein 3D structure data.
Despite using sophisticated methodology to model loss and gain of functional residues, the nature of this research has limitations involving both data sets and methodology. First, despite major efforts employed by authors and database curators when annotating amino acid substitutions as being causative of a particular disease, it is possible that some amino acid substitutions have been misannotated as disease-causing by the original authors reporting them. Similarly, mutagenesis experimental data are known to be biased toward certain amino acid residues. For example, alanine mutations comprised about 50% of the independent amino acid substitutions data set (due to the frequent use of alanine-scanning mutagenesis). There are also limitations and biases in relation to the protein structures available in PDB as well as in selecting an appropriate set of unlabeled variants. Second, there exist both theoretical and practical limitations in the semi-supervised framework used in this work. The accuracy of our methods is predicated upon the assumption that the computational models are capable of accurately estimating the posterior probability of the class labels. This however could not be guaranteed and thus requires caution when interpreting our results. Furthermore, there are identifiability issues in estimating class priors in the positiveunlabeled framework; i.e., the estimates for the class priors do not have a unique solution and only an upper bound can be estimated [27]. On the practical side, we have been careful to prevent overfitting. We performed only minor parameter selection steps before the final functional predictors were built. Thus, there is the potential to further improve predictor performance through more extensive work. This includes the use of additional features, optimizing the distance threshold used to define an edge between two residues when constructing protein contact graphs, choice of the capacity parameter in SVM light , among others. Finally, this work was designed to probabilistically reason about molecular mechanisms of disease and not necessarily to develop classifiers that outperform specialized models across the board. If a user needs a tool for a particular prediction task, we recommend that the most accurate predictor for this task be selected.
Despite these limitations, we believe this work contributes to an improved understanding of the impact of sequence variants on protein function. We have provided a model that considers functional alteration both when stability of the protein is disrupted and when it is not disrupted (e.g. interestingly, sequence changes can exert a functional effect in disordered regions such as disorder-to-order transition [79]). We believe that our work suggests a new class of approaches to disease studies that might qualify as mechanism-driven and disease-agnostic, where one might be compelled to identify a set of molecular alterations underlying a disease phenotype without necessarily studying a single disease. While each molecular alteration is likely to require an individualized approach to drug design and therapy, we envisage that the next generation of researchers might decide to specialize in addressing particular types of functional deficiencies rather than beginning with a particular disease. For catalytic residue predictor, we plot the inverse CDF for P(gain|x) on the disease and putatively neutral data sets, respectively. (EPS) S1 Table. Mapping between ligand codes and names. (PDF) S2 Table. Selected kernel matrix parameters for each structural and functional site predictor. For each data set, we show the best-performing kernel matrix parameters obtained through a per-chain 10-fold cross-validation. The normalized edit distance kernel k m (u, v) outperformed both k l m ðu; vÞ and k e m ðu; vÞ on each data set. Note that the edit distance kernel with m = 0 is equivalent to a standard graphlet kernel. (PDF) S3 Table. Performance assessment of functional residue predictors using an independent data set. For each prediction method, we show number of proteins (n p ), number of positive examples (n + ), number of unlabeled examples (n u ), AUC and sensitivity (sn) and MCC at score threshold corresponding to specificity (sp) of 99%. In the case of N-linked glycosylation (Nglyco), we only predict if the wild-type residue is an asparagine, whereas for phosphorylation (Phos), we only make predictions on threonine, tyrosine or serine residues. (PDF) S4 Table. Class priors for structural and functional predictors. Fraction of residues in a data set estimated to be stability-impacting or functional. Estimates on the unlabeled data were made using the AlphaMax algorithm [27]; minor manual adjustments were made by observing the log-likelihood plots. Ã Indicates a confident prior estimate assessed by manually observing log-likelihood plots. Estimates on the disease and putatively neutral data were made using the empirical mean formula. (PDF) S5 Table. Performance assessment of loss of function predictions using mutagenesis experimental data. Independent validation of predicted loss of functional site events using a set of mutagenesis experimental data mapped to protein structures in PDB. This mutagenesis data set contains 3,356 AAS from 880 human proteins. For each functional feature, we show the number of experimentally determined losses (n l ), AUC, sensitivity (sn) and MCC corresponding to a 99% specificity (sp) threshold. Additionally, the last five columns show the number of statistically significant (p < 0.01) loss-of-function predictions (n Ã l ), as well as estimates for AUC (AUC Ã ), sensitivity (sn Ã ), specificity (sp Ã ) and MCC (MCC Ã ) on this filtered set. (PDF)