Mutation severity spectrum of rare alleles in the human genome is predictive of disease type

The human genome harbors a variety of genetic variations. Single-nucleotide changes that alter amino acids in protein-coding regions are one of the major causes of human phenotypic variation and diseases. These single-amino acid variations (SAVs) are routinely found in whole genome and exome sequencing. Evaluating the functional impact of such genomic alterations is crucial for diagnosis of genetic disorders. We developed DeepSAV, a deep-learning convolutional neural network to differentiate disease-causing and benign SAVs based on a variety of protein sequence, structural and functional properties. Our method outperforms most stand-alone programs, and the version incorporating population and gene-level information (DeepSAV+PG) has similar predictive power as some of the best available. We transformed DeepSAV scores of rare SAVs in the human population into a quantity termed “mutation severity measure” for each human protein-coding gene. It reflects a gene's tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. Genes implicated in cancer, autism, and viral interaction are found by this measure as intolerant to mutations, while genes associated with a number of other diseases are scored as tolerant. Among known disease-associated genes, those that are mutation-intolerant are likely to function in development and signal transduction pathways, while those that are mutation-tolerant tend to encode metabolic and mitochondrial proteins.

Introduction Genetic variations are major determinants of human diseases and phenotypes [1]. Accelerating pace of large-scale sequencing projects on genomes and exomes has greatly expanded the landscape of human genetic variations. It remains a challenging task to assess the functional impact of these variations [2]. Comprehensive analysis of genetic variations, especially those found in and near the exons of protein-coding genes [3], may shed light on gene-disease relationships and provide insight into the mechanisms of diseases and variations in phenotypes [4]. The increasing number of sequenced human genomes and exomes from the general population would enhance the statistical power of such analyses [5].
Different types of genetic variations occur at a range of scales from large structural variations such as chromosomal rearrangements and copy number variations (CNVs), to insertions and deletions (indels) of up to hundreds of nucleotide positions, and to single-base-pair (single-nucleotide) variations (SNVs) [6]. Any type of genetic variation could cause human disease with a variety of mechanisms, including effects on chromatin organization, gene expression and regulation, protein function, and genetic instability [7][8][9][10][11]. The observed frequencies of genetic variations in the general population are tied to their fitness cost as well as the evolutionary history of the human species and its ancestors. While common variations, most notably SNVs, were first documented, more rare genetic variations (e.g., those with minor allele frequency (MAF) less than 0.0001) at the individual level have been identified in large-scale sequencing projects of the general population [5] as well as patients with certain diseases such as cancer [12] and intellectual disability [13]. Although some recurring variations have been identified to be the drivers of diseases, a significant number of rare mutations are persistently found, and their clinical significance are difficult to evaluate. Genome-wide association studies can pinpoint the genetic loci, mostly marked by common SNVs, with statistically significant disease or phenotype associations [14,15]. Association of rare and de novo mutations to common and rare diseases could be unveiled through familial or trio studies that are facilitated by genome or exome sequencing nowadays [16,17]. Coupled with pathway profiling, systematic analysis of genetic variations in patients could shed light on the biological processes underlying diseases [18]. However, disease gene prioritization and disease-causing variation discovery are still difficult [19,20].
The identity change in a single base pair position is the most common type of genetic variation. In protein-coding regions, non-synonymous variations (missense mutations) result in the change of a single amino acid in the protein product [21]. Clinical consequences of these missense mutations, referred to as single amino acid variations (SAVs), are generally more difficult to evaluate than synonymous mutations (generally benign) and nonsense (stop codon) mutations (often resulting in loss of function). Deleterious SAVs could affect various aspects of protein function, including protein folding and stability, protein-protein interactions, protein localization and degradation, post-translational modification, and the activity of enzymes [22,23]. A number of computational methods [24] have been developed to assess the mutational effects of SAVs found in the human proteome encoded by around 20,000 proteincoding genes.
Essential genes compromise the viability of an individual when their function is lost. Such genes can be identified by observing intolerance to loss-of-function variants at the population level [25]. In genetic terms, essential genes tend to exhibit haploinsufficiency, where the loss of one of two gene alleles is detrimental. Genetic alterations of haploinsufficient genes are not only a major cause of dominant diseases [26], but also play key roles in developmental disorders [17]. On the one hand, haploinsufficient genes can function as tumor supressors [27]. On the other hand, essential genes tend to be expressed at higher levels in cancer cells than in normal cells [28]. Thus, knowledge about gene essentiality can help prioritize deleterious variants in genetic studies and could help prioritize therapeutic targets in cancer. Given the role of essential genes in human disease, considerable efforts have gone into developing methods for haploinsufficiency prediction [5,[29][30][31][32].
In this study, we developed a deep convolutional neural network-based method for predicting the clinical impact of SAVs in the human proteome based on analysis of their sequence, structural and functional properties. The neural network prediction results of SAVs observed in the general population were used to calculate a mutation severity measure that estimates tolerance of each human protein-coding gene to deleterious missense mutations. This measure correlates with gene essentiality and specific disease classes such as cancer and autism. Finally, we observed a dichotomy of mutation severity for disease-associated genes: those that are mutation-intolerant tend to function in development and signal transduction pathways, while those that are mutation-tolerant tend to function in metabolism.

Analysis of human disease-related genes and their variants
We obtained a set of likely pathogenic (disease-causing) genetic variants from two database resources: ClinVar [33] and UniProt [34]. ClinVar aggregates reported variant-disease associations from submissions of research studies. The ClinVar clinical interpretation of variants follows the ACMG (American College of Medical Genetics and Genomics) guideline [2] and have five categories: "Pathogenic", "Likely pathogenic", "Uncertain significance", "Benign", and "Likely benign". We consider the categories "Pathogenic" or "Likely pathogenic" as disease-causing variants and the categories "Benign" and "Likely benign" as non-disease-causing while ignoring the category of "Uncertain significance". It should be noted that errors or inconsistencies with other databases could be present in ClinVar interpretations, especially the categories of "Likely pathogenic" and "Likely benign", where the ACMG's recommended confidence level is greater than 90% certainty of being disease-causing and benign, respectively [2]. ClinVar variants annotated as "Pathogenic" or "Likely pathogenic" were found in~4,200 protein-coding genes, about one fifth of the human proteome. Among those variants, SAVs were found in the majority (3,410) of these genes. Non-SAV variants were also found in most of them (~3,300 genes). Non-SAV variants include indel variants, single-nucleotide variations in noncoding regions (mostly at splice sites), nonsense single-nucleotide variations (to stop codons), and a small number of synonymous variants (Fig 1A). The 31,171 SAVs made up about 30% of all ClinVar variants with "Pathogenic" or "Likely pathogenic" annotations ( Fig  1B). UniProt is another curated resource for pathogenic SAVs. The number of proteins with UniProt SAVs annotated as disease-related is 2,755 ( Fig 1C). Most of these genes (2,590) overlap with the ClinVar disease-associated gene set, with UniProt contributing only 165 disease-associated genes not found in the ClinVar set. On the other hand, more than half of the Uni-Prot pathogenic variants (15,697 out of 29,300, Fig 1D) were not found in the set of ClinVar pathogenic variants. The total number of likely pathogenic variants in the unified ClinVar and UniProt set is~47,000. We also obtained a set of benign variants (~45,000) by combining the ClinVar variants annotated as "Benign" or "Likely benign" and the UniProt variants in the category of "Polymorphism".
The number of likely pathogenic SAVs are not evenly distributed among the disease-associated genes. The three genes with the greatest number of SAVs encode long proteins: FBN1 (Fibrillin-1, 2,871 amino acids), LDLR (Low-density lipoprotein receptor, 860 amino acids), and SCN1A (Sodium channel protein type 1 subunit alpha, 2,009 amino acids), each of which has more than 500 pathogenic SAVs. In part, it may be due to the length of these proteins. 75 genes possess more than 100 pathogenic SAVs. More than half of the disease-associated genes with SAVs (2,003 out of 3,575) have less than 5 pathogenic SAVs, and 916 of them have only one pathogenic SAV. One cause of the uneven distribution of SAVs could be the bias in research studies of common diseases and certain genes (e.g., the LDLR gene involved in hypercholesterolemia).

Enrichment analysis of sequence, structure and functional properties in likely pathogenic SAVs and gnomAD SAVs
We compiled a set of protein sequence, structure, and functional properties predicted by computer programs or retrieved from UniProt Feature fields (see Materials and methods). An enrichment log-odds score was used to determine if any property is enriched or depleted in amino acid positions with pathogenic SAVs compared to the background frequency of that property in all human proteins (see Materials and methods). We observed a 1.7-fold enrichment of conserved positions (Consv3 in Fig 2A) and more than 3-fold depletion of variable positions (Consv1 in Fig 2A)  Predicted β-strands and α-helices are slightly preferred in pathogenic SAVs, while coil regions of secondary structure predictions are slightly depleted. The enrichment/depletion of predicted secondary structure types is not large (less than 1.33 fold). A previous study of secondary structure biases based on known structures also revealed that there is little difference between secondary structure distributions among disease-causing variations and benign variations, albeit the authors found a weak enrichment of coil/turn/bridge and weak depletion of β-strand for disease-causing variations [38]. The different findings could be due to differences in the datasets (full-length proteins in this study versus regions with known structures) and secondary structure estimation methods (predicted versus real). We also observed slight depletion of low complexity regions and coiled coil regions in pathogenic SAVs.
For regions with indications of subcellular localization, signal peptides and mitochondrial transit peptides are depleted in pathogenic SAVs, but transmembrane segments are enriched by more than 2 fold. Several properties derived from UniProt Features exhibit the strongest enrichments in pathogenic SAVs. They are related to protein stability (UniProt Feature DIS-ULFID: cysteine residues participating in disulfide bonds) or function (UniProt Features: SITE, ACT_SITE, METAL, MOTIF, and BINDING, see their explanations in Materials and methods and S2 Table). Except the MOTIF Feature, they exhibit more than 4-fold enrichment in pathogenic SAVs (log2-based odds score more than 2, Fig 2A).
We also analyzed SAVs found in more than 12,000 exomes (>24,000 alleles) in the gno-mAD (genome Aggregation Database) [5] database, which provides a comprehensive catalogue of natural variants from the general population. Enrichments of protein sequence, structure, and functional properties in each gnomAD SAV category were analyzed in the same way as for pathogenic SAVs (Fig 2B). Common gno-mAD SAVs (MAF � 0.01) generally exhibit opposite enrichment/depletion trends compared to pathogenic SAVs. Properties such as DISULFID, SITE, ACT_SITE, METAL, MOTIF, and BINDING exhibit the most prominent depletion in common gnomAD SAVs and the strongest enrichment in pathogenic SAVs. In contrast, properties enriched in common SAVs include variable positions (Consv1), coil regions of secondary structure prediction, predicted disordered regions, low complexity regions, signal peptides, and mitochondrial transit peptides. The enrichment or depletion of properties were gradually curtailed when moving from the category of common gnomAD SAVs to less frequent gnomAD SAV categories (Fig 2B). This behavior suggests that many low frequency SAVs, especially those with MAF less than 0.0001 in the general population could be deleterious, because functionally important residues (properties specified by UniProt Features SITE, ACT_SITE, BINDING, METAL, and MOTIF) are found more frequently in these rare SAVs than in the common SAVs.

DeepSAV-A deep neural network-based method for SAV pathogenicity prediction
We developed a neural network-based method (DeepSAV) that uses a deep-learning convolutional neural network to predict SAV pathogenicity based on input features of sequence, structure, and functional information (see Materials and methods). The features include amino acid type, sequence profile, sequence conservation, secondary structure and disorder predictions, coiled coil and low complexity region predictions, sequence regions indicating subcellular localization (signal peptide, transit peptide, transmembrane segments), and functional and stability properties from the UniProt database such as post-translational modifications, disulfide bond, active site, and motifs (S3 Table). Features of a window of 21 amino acid positions centered around the mutated amino acid were encoded as input. The neural network has mainly convolutional layers and applies techniques such as max-pooling, residual network, and dropout (S1 Fig). It is trained and tested on a large set (43,000 pathogenic and 43,000 benign) of SAVs from the ClinVar and UniProt database.
Computational methods of variant impact predictions differ in the sources of information used, the training data, and the scoring algorithms/machine learning techniques. Some of these methods such as SIFT  , combine various sources of information in addition to sequence conservation, such as predicted or real structural properties, functional and epigenetic information from experiments, and genomic context. DeepSAV is similar to these stand-alone methods in terms of integrating a set of input features with diverse sequence, structural and functional information. A cross validation test of DeepSAV showed that it yielded better performance (measured by area under the ROC (receiver operating characteristic) curve (AUC)) to differentiate pathogenic from benign SAVs than most stand-alone programs such as SIFT Fig 3A). DeepSAV trails the two best methods REVEL and VEST4 by about 0.06 in AUC. We were not able to find information about the algorithm of VEST4, an improved version of VEST [49], which shows the second best performance, and used VEST4 scores as given in dbNSFP for comparison [24]. DeepSAV scores correlate best with the prediction scores of REVEL and VEST4, with correlation coefficients of 0.767 and 0.783, respectively (S4 Table).
To study the effectiveness of different features used in DeepSAV predictor, we performed neural network prediction experiments by using various feature combinations (S3 Fig). If only the amino acid types (including wild type and mutated) are used as features, the neural network prediction power is significantly lower (ROC AUC: 0.785, S3 Fig) than when all features are used (ROC AUC: 0.879). Adding sequence conservation (AA+consv) or sequence profile (AA+prof) yields significant ROC AUC increases (to 0.849 and 0.86, respectively). These two features appear to be the most important determinants of neural network performance, as adding other types of features individually (AA+seg, AA+coiled.coil, AA+ss.pred, AA+disorder, AA+uniprotFeat) does not enhance the performance as large as adding them (S3 Fig). Similar behavior is observed in the experiments of leaving out a particular type of features. Leaving out either sequence conservation (ALL-consv with ROC AUC 0.864) or profile (ALL-prof with ROC AUC 0.865) while keeping all the other features resulted in the biggest drop of performance compared to leaving out other types features individually (ALL-disorder, ALL-seg, ALL-sec.struct, ALL-coiled.coil, and ALL-uniprotFeat). In fact, leaving out features such as coiled coil prediction (ALL-coiled.coil with AUC 0.876), low complexity region prediction (ALL-seg with ROC AUC 0.878), predicted secondary structure types (ALL-sec.struct with ROC AUC 0.877), predicted disorder propensities (ALL-disorder with ROC AUC 0.874), and features derived from UniProt Feature fields (ALL-uniprotFeat with ROC AUC 0.874) has little influence in the neural network performance. Collectively, these features give an improvement of 0.09 in ROC AUC when comparing the neural network with all features (ALL) to the neural network with amino acid types, conservation, and profile (AA+consv+prof).
The DeepSAV predictor is based on information derived from amino acid positions. Thus, the DeepSAV score reflects the level of deleterious effect to the target protein for any given variant based on its sequence, structural and functional properties. However, protein-level deleterious effects do not necessarily lead to human diseases, as a significant fraction of human protein-coding genes could be compromised without causing diseases. Variations in essential genes are more likely to cause diseases than in non-essential genes, and disease-associated genes are generally involved in more protein-protein interactions than genes not associated with diseases [39]. Adding information based on gene-level annotation or predictions such as gene essentiality and the number of protein-protein interactions have proven useful in improving the differentiation between disease-causing mutations and benign mutations [39]. Indeed, the performance of DEOGEN2 [54], which incorporates heterogeneous information such as the relevance of the gene and the number of protein interactions, is among the best (ROC AUC: 0.924) in our test. Another valuable source of information independent from the amino acid positional properties are the frequencies of the variants observed in the human general population, which is available as minor allele frequencies (MAFs) of the variants in the gnomAD database. We added gnomAD MAF and 17 gene-level features (extracted from dbNSFP, see Materials and methods) to the amino-acid-level features in DeepSAV. The resulting predictor DeepSAV+PG (DeepSAV with population and gene-level information) was able to boost the performance from ROC AUC 0.879 to 0.924, close to some of the best methods ( Fig 3A).
One direction for future improvement of our neural network method would be to use 3-dimensional (3D) structural information for regions with known structures or regions where structures can be reliably modeled, as some current methods do [45,55]. Predictions of protein stability and binding free energies changes could be used as features [22,56]. Available 3D structures would also enable the use of amino acid properties of residues that are structural neighbors of the target position, some of which could be forming long-range contacts that are not covered by a local sequence-based window. Such structural information could be useful in capturing the epistasis effects on protein stability [57, 58] between positions and may be combined with the epistasis effects derived from other sources, such as population frequencies of compensatory variations [59].
We further calculated DeepSAV scores and baseline fitness scores for human protein SAVs observed in the gnomAD database (available at http://prodata.swmed.edu/DeepSAV_data). They show a positive correlation (correlation coefficient: 0.57), and both exhibit bimodal distributions for SAVs in each of the four different MAF categories (MAF � 0.01 (common SAVs), 0.001 � MAF < 0.01, 0.0001 � MAF < 0.001, and MAF < 0.0001) (Fig 3B-3F). The range of DeepSAV scores is between 0 and 1, with higher scores suggesting an increasing likelihood of being deleterious (pathogenic). For common SAVs (MAF � 0.01), the distribution of DeepSAV exhibits a high peak in the low score range, and a flat tail in the high score range, suggesting that the majority of common SAVs are predicted to be benign. With increasing stringencies of rare SAVs, the volume of the peak in the low-score range decreases while the tail in the high-score range increases, suggesting that pathogenic SAVs are more likely to occur in rarer SAVs. The baseline fitness scores display similar behavior for SAVs in different MAF categories, although the peaks in high and low scoring ranges appears to overlap more compared to the DeepSAV scores.

Mutation severity scores enrich for essential genes with potential disease associations
Deep sequencing of human exomes has highlighted the contribution of rare SAVs to gene function and complex diseases [60,61]. Certain genes may be more tolerant to deleterious or partially deleterious SAVs due to their functional properties. To evaluate the mutation tolerance of genes, we chose to use the DeepSAV scores that reflect a variant's deleterious effect on the protein product, as the features used for training are based on protein sequence, structure, and functional properties. DeepSAV+PG scores are better than DeepSAV scores at discriminating pathogenic (disease-causing) variants from benign variants by adding gene-level information that correlates with the likelihood of a gene associated with diseases (e.g., gene essentiality and interaction numbers). However, to more objectively assess the mutation tolerance of genes, DeepSAV scores were used as they reflect the deleterious effects on the protein products regardless of whether the genes are disease-associated.
Mutation severity refers to the degree of deleteriousness associated with a genetic variation. Similar terms such as "degree of harmfulness" [38] and "perturbation index" [62] have been used in previous studies. The deleterious effects of SAVs could have functional impacts on various aspects of a protein's function, including protein stability, interactions with binding partners, enzymatic activity, protein degradation, and subcellular localization [23]. The free energy change in protein folding and binding can be quantified by experiments and structural modeling of the wild-type and mutated proteins. It has been shown that the relative magnitude of energy change (compared to the total folding or binding free energy of the wild type protein) serves as a better predictor of functional impact of a variant than the absolute values of energy change [22]. The same energy perturbation would have a more detrimental effect on a small protein with weak folding energy than the effect on a large protein with strong folding energy.
Natural variants with clinical impact tend to be rare in the human population [48]. Explosive population growth coupled with weak purifying selection in recent human history led to an excess of rare natural variants observed in large-scale sequencing of human genomes and exomes [40, 60,63]. Certain variants likely result in the loss of function (inactivation) of a protein, such as mutations to stop codons, at essential splice sites, and frameshift indels. These types of protein-truncating variants have been quantified in measures such as pLI (probability of being loss-of-function intolerant) and LOEUF (loss-of-function observed/expected upper bound fraction) based on their occurrences in human population. However, the functional impact of SAVs of the human population are more difficult to assess, while they could be statistically more powerful since SAVs are more frequent than commonly used loss-of-function mutations leading to stop codon, altered splicing and frameshift. DeepSAV scores predict the deleterious effects of SAVs with input features of protein sequence, structure and function, thus would serve as a measure of the mutation severity of SAVs. The number of deleterious rare SAVs at the population level could serve as a simple measure of a gene's tolerance to them in recent human history. Since the deleterious effect of any SAV is a continuous variable that correlates with the DeepSAV score, we transformed the DeepSAV scores of rare SAVs (MAF < 0.0001) present in the human population (from the gnomAD database [5]) into a mutation severity measure for each gene called the GTS (Gene Tolerance of rare SAVs) score (see Materials and methods). The GTS score is a weighted sum of DeepSAV scores for observed rare SAVs in the human population (normalized by protein length) (available in S1 Table). It is correlated with a simple measure of mutation severity based on the number of predicted deleterious rare SAVs (DeepSAV score > 0.75) (Fig 4A).
Human genes have been classified by a measure (LOEUF) that reflects their tolerance to inactivation (loss-of-function) [5]. To see how our GTS score correlates with the LOEUF score, we ranked human genes from low GTS score (mutation-intolerant) to high GTS score (mutation-tolerant). The LOEUF distribution for top-ranking mutation-intolerant genes was compared to that of a set of known disease-associated genes (Fig 4B). The top-ranking mutation-intolerant genes selected by the mutation severity measure (lowest GTS scores) include progressively more loss-of-function constrained genes with increased filtering of common On the x axis, 0 means the first LOEUF decile [0, 0.1] (the same extrapolation applies to other numbers). C) Distribution of gene frequency among GTS deciles (MAF cutoff 0.0001) for the same gene set with known pathogenic SAVs compared to essential and nonessential gene sets. D) Distribution of protein interactions from four databases (BioGrid [123], IntAct [124], DIP [125] and HPRD [126]) integrated in PICKLE [127] for gene sets within three different mutation severity GTS score deciles (0, 0.5 and 0.9). E) Venn diagram highlights overlap among essential genes with known pathogenic variants (labeled as "Pathogenic"), essential genes with lowest loss-of-function constraint scores (LOEUF), and essential genes with lowest mutation severity measure (GTS). F) Representation of disease class associated with genes from the overlapping set of top-ranked genes by LOEUF and GTS (126 genes, not including genes with known pathogenic SAVs). variants. In contrast, the disease-associated gene set displays a bimodal distribution of highly constrained genes at low LOEUF and less constrained genes at median LOEUF. Thus, the mutation severity measure for rare SAVs reiterates a gene's tolerance to inactivation, with topranking mutation-intolerant genes being more frequent in the percentile of lowest tolerance to inactivation.
A fraction of the disease-associated human gene set (17%) is annotated as essential by one or more CRISPR screens [64,65]. Among all curated gene-disease associations in DisGeNET (May 2019 version) [66], the essential disease-associated genes contribute to 2,477 diseases or syndromes and 1,847 neoplastic processes. We originally reasoned that genes able to accumulate numerous detrimental SAVs (evaluated by high GTS scores) were less likely to contribute to disease phenotypes. However, the GTS scores do not discriminate disease-associated genes collectively, giving similar gene frequencies displayed across the GTS deciles (Fig 4C, gray  bars). Instead, GTS scores tend to select for gene essentiality, with an increase in essential genes and a decrease in non-essential genes at the lowest GTS decile (Fig 4C). Sequence conservation could partially explain the correlation between GTS score and gene essentiality, as essential genes tend to be more conserved due to their functional importance and at the same time accumulate less deleterious rare mutations in recent human history. Similar to noted trends of both essential and disease-associated genes [67,68], human genes with GTS scores in the lowest decile, regardless of their essentiality, exhibit increased numbers of protein interactions than those from higher GTS deciles ( Fig 4D).
Although the loss-of-function constraint measure LOEUF [5] and the mutation severity measure GTS display similar trends in reflecting gene essentiality (Fig 4C), they define different gene sets that might be used to evaluate potential new disease-associated genes. A comparison of essential genes with the lowest LOEUF scores, essential genes with the lowest GTS scores, and essential genes with pathogenic SAVs highlights the divide among these gene sets ( Fig 4E). The overlap between low-GTS set and low-LOEUF set (126 genes, not including genes with pathogenic SAVs) provides a potential source of disease-associated genes. Indeed, despite the lack of documented pathogenic SAVs in the 126 SAV and inactivation-intolerant genes, curated DisGeNET gene-disease associations annotate almost half (55 genes) as being involved in disease. The set includes almost all disease classes, with several being over-represented when compared to all gene-disease associations: including virus diseases, stomatognathic diseases, immune system diseases, and neoplasms ( Fig 4F). Given their propensity to associate with disease, the essential genes selected by our GTS score could provide insight into novel gene-disease associations.

Mutation-intolerant essential genes cluster with disease-associated genes and contribute to diseases
While low GTS scores tend to reflect select for gene essentiality (Fig 4C), they do not necessarily distinguish among a collective set of known pathogenic genes. Thus, to help prioritize disease associated genes using GTS, we combined the score with other gene-level measures that helped boost the performance of our deep neural network predictor DeepSAV+PG (Fig 3A). The essential genes with known pathogenic SAVs (70 genes, Fig 4E) that overlap with both the loss-of-function-constrained genes (lowest LOEUF) and the low mutation severity genes (lowest GTS) set a standard to prioritize other potential disease-associated genes (126 overlapping genes shared by the low LOEUF and the low GTS sets, but without pathogenic SAVs, Fig 4E). Clustering these two sets of genes (70 known pathogenic +126 potential pathogenic) together using complete linkage of correlated distances over six gene-level scores that potentially correlate with disease-association (see Materials and methods) places potential disease-associated genes among those that are known to be associated with diseases (S2 Fig). One cluster with relatively low GTS scores failed to include any known pathogenic genes, suggesting that these might not be prioritized as disease associated (labels colored gray in S2 Fig). Alternatively, two relatively large clusters (total of 40 genes) with high proportions of known disease-associated genes could prioritize the similarly scored genes of unknown pathogenicity as being candidates of disease-causing genes. These clusters exhibit lower GTS scores than other clusters, indicating their intolerance to detrimental missense mutations (red labels in S2 Fig). Inspection of gene-disease associations for genes in these two clusters reveals that 68% are linked to curated diseases in the DisGeNET database [66], including almost half of the unknown set (10 out of 23 genes).
Enriched GO biological process terms are similar for each identified gene cluster, and annotation clustering of terms from the combined set (40 genes from two clusters, 17 with pathogenic SAVs) highlights their function in RNA splicing (enrichment score 9.02, 13 genes), gene expression (enrichment score 6.81, 31 genes), and chromosome segregation (enrichment score 4.34, 9 genes). Two disease-associated genes and eleven others belong to the most enriched cluster and function in RNA splicing, including pre-mRNA processing factor 3 (PRPF3) having variants associated with Retinitis pigmentosa, and splicing factor 3b subunit 1 (SF3B1) having variants associated with acute myeloid leukemia, among other neoplastic processes. Five of the potential disease-associated genes involved in RNA splicing (DHX15, HNRNPH1, SRSF1, PCBP2, and DHX9) are reported to be associated with myelodysplasias in DisGeNET [66], and the spliceosome has become a therapeutic target for myeloid malignancies [69,70].
The third most enriched functional cluster includes six disease-associated genes and three others that function in chromosome segregation. Three of the disease-associated genes (RAD21, SMC3, and SMC1A) functioning in chromosome segregation have genetic variants causing Cornelia de Lange syndrome (CdLS), which manifests developmentally as intellectual and growth retardations. The protein-coding products of these genes comprise three of the four subunits of the mitotic cohesion complex responsible for chromosome segregation. Mutations in this complex are known to cause a number of diseases termed cohesinopathies, of which CdLS is the best characterized [71]. One additional chromosome segregation gene from this set, PDS5 cohesin associated factor A (PDS5A), is associated with CdLS in DisGeNET literature [72]. Gene dosage appears to be an important component of CdLS severity, which is consistent with the essential nature of our selected gene set [25].

Mutation-intolerant disease-associated genes function in development and signaling pathways
Over half of the top 1,000 human genes ranked by low GTS (571 genes) are associated with 1,618 diseases, 262 phenotypes and 184 disease groups such as "Intellectual Disability" that encompass multiple similar diseases or phenotypes. To understand the functional context of mutation-intolerant genes that are associated with disease, we assigned them to pathways in Reactome [73] (467 genes). Functional enrichments of these pathways highlight involvement in axon guidance (P-value < 1.78E-15), development (P-value < 9.64E-14), and neurotransmitter receptors and postsynaptic signal transmission (P-value < 1.12E-13), among others. Those genes in the enriched category of "developmental biology" describe early steps in development that give rise to diverse tissues in the body and thus represent critical processes that should contribute to fitness. In fact, 25% of this gene set participate in development, and many are annotated as essential (47 genes) or conditionally essential (17 genes). However, a significant portion of the developmental genes are not considered essential (52 genes). Many of them encode protein kinases (18 genes), homeobox transcription regulators (4 genes) or proteins with other signaling domains that are expanded in the genome like rho-binding domains (3 genes), pleckstrin homology domains (4 genes), or SH3 domains (3 genes).
While a relatively small core set of essential genes exists in eukaryotes whose loss of function results in lethality, a larger subset of genes exhibits conditional lethality that also affects fitness [74]. For example, deleterious mutation of immune system genes might not necessarily result in a lethal phenotype. However, their contribution to survival under specific conditions like being challenged with an infectious agent could be considered as essential. This spectrum of gene essentiality is indeed reflected in the disease-associated genes functioning in development, as they exhibit essential and conditionally essential responses in CRISPR screens [64,65]. Furthermore, many of the mutation-intolerant and disease-associated genes not considered as essential belong to families like protein kinases that have expanded in the human genome and could be functionally redundant [75]. Thus, the concept of gene essentiality alone does not necessarily suggest the ability to cause disease.
The mitogen-activated protein kinases ERK1 and ERK2 function in development and signal transduction pathways. They represent a duplication that is thought to be functionally redundant [76]. However, ERK2 includes two known pathogenic variants that are associated with various neoplastic diseases (E322K) as well as with inborn genetic disease (R135T). The ERK2 structure ( Fig 5A) includes a relatively small set (9 positions) of DeepSAV-predicted deleterious SAVs from the gnomAD database (DeepSAV score >0.75). One of these SAVs (D106G) lines the ATP-binding pocket, and four are buried in the structure core (D44Y, G136E, R148H, and R194T), with R148 belonging to the HRD motif that controls kinase activation. The rest are in a C-terminal extension to the catalytic domain that lines the surface of the kinase in between the N-lobe and the C-lobe. The known pathogenic variants cluster together with many of the predicted deleterious mutations. Thus, while this kinase is thought to be functionally redundant, some variants have been reported as pathogenic, several others are predicted as detrimental, and the gene is intolerant to deleterious mutation (GTS score 3.24E-7 and ranked 142 out of more than 17,000 genes). Accordingly, the ERK2 gene was shown to be conditionally essential in a CRISPR screen [64], suggesting conditions exist where the functional redundancy of the two kinases breaks down.
Mutation intolerance appears to be a quality exhibited not only by genes associated with developmental disorders [17], but also by genes contributing to other various disease types such as cancer (COSMIC) [12], autism [77] (https://gene.sfari.org), and viral interacting proteins [78] (Virus_interact in Fig 5D). However, mutation severity does not select for collective disease-associated gene sets (PathVar (ClinVar and UniProt genes with pathogenic SAVs) and DisGeNET, Fig 5C), with nearly uniform distributions of the number of genes among GTS deciles. Genes associated with X-linked diseases (from the Clinical Genomic Database) [79] exhibit pronounced preference for high mutation intolerance, with~70% falling into the two lowest mutation severity deciles (Fig 5C). Genes associated with autosomal dominant diseases and genes associated with autosomal recessive diseases [79] show opposite trends in mutation intolerance (Fig 5C). The preference for mutation-intolerance in selected disease types suggests that the GTS score can be particularly useful for prioritizing disease genes when combined with additional considerations, such a disease type or functional pathways contributing to the disease state.

Mutation-tolerant genes function in metabolic pathways and mitochondria
The concept of functional redundancy from gene duplication extends not only to critical components of developmental and signal transduction pathways, but also to those of metabolic

Fig 5. Mutation-intolerant and mutation-tolerant genes prefer different pathways and disease types. A)
Top ranked genes with low GTS scores like ERK2 kinase (PDB 4fmq) have relatively few DeepSAV predicted deleterious variant positions (DeepSAV score > 0.75, red spheres). One of these (black sphere) is near (< 4Å) the active site (ANP substrate analog in black stick). B) Bottom ranked genes with high GTS scores like CD36 (PDB 5lgd) are tolerant to predicted deleterious mutations (DeepSAV score > 0.75, red spheres), including several positions (black spheres) lining the fatty acid (black stick) binding sites or with known pathogenic variation in platelet glycoprotein deficiency (blue spheres). C) Mutation severity spectrum of disease-associated genes in general, measured by their frequencies in GTS deciles. (PathVar-genes with pathogenic SAVs in ClinVar and UniProt, DisGeNET-genes with diseases in DisGeNET database, X-linked, Autosomal dominant, and Autosomal recessive correspond to sets of genes associated pathways [59]. Enriched functional pathways of mutation-intolerant genes that are associated with disease highlight repeated involvement of core genetic information processing (e.g., transcription and RNA processing) and signal transduction components, but they tend to exclude those of metabolism. In fact, mutation-tolerant genes (a numeric matched set of genes with the highest GTS scores) are significantly enriched in metabolism (P-value < 2.05E-13) in the Reactome pathway database [73].
An example of a mutation-tolerant gene product is platelet glycoprotein 4 (CD36), which functions in cell adhesion by serving as a receptor for thrombospondin in platelets as well as in the metabolism of lipids through binding long chain fatty acids. CD36 represents one of the most mutation-tolerant genes in the diseases-associated set, with 155 DeepSAV-predicted detrimental mutations (DeepSAV score>0.75) in 98 positions covering the structure, including seven lining the fatty acid binding site (Fig 5B). Although this gene is tolerant to mutation, known pathogenic variants (I413L, R386W, P90S, and F254L), with I413L lining the fatty acid binding pocket, cause platelet glycoprotein deficiency, a congenital disease of the hemic and lymphatic class. Furthermore, DisGeNET associates this gene with metabolic phenotypes of impaired glucose tolerance, insulin resistance, and insulin sensitivity. While mutations in CD36 can still lead to disease, the mutation tolerance of the gene might be explained by the recessive nature of the associated disease, by the ability of two paralogs, SCARB1 and SCARB2, to serve as functional replacements, or by the tissue-specific nature of the disease [80]. Therefore, a dichotomy seems to exist for disease-associated genes, where those that are mutation-intolerant tend to function in development and signal transduction pathways, while those that are mutation-tolerant tend to function in metabolism. These trends imply a greater overall fitness cost of mutations in developmental and signal transduction genes when compared to metabolic genes. However, extreme functional redundancy in some signal transduction proteins may lead to their tolerance to mutations. The mutation severity spectrum of signal transduction proteins with numerous paralogs that could exhibit functional redundancy are shown in Fig 5E. Paralogous olfactory receptors (OR), which represent a specialized set of G protein-coupled receptors (GPCRs) that detect odors, are more mutation-tolerant than other GPCRs. In fact, human OR paralogs include more pseudogenes [81] (464, not included in Fig 5E) that have accumulated enough mutations to render them inactive than functional genes (361, included in Fig 5E), and this well-known OR variability likely contributes to an individual's sense of smell [82]. Both non-OR GPCRs (Fig 5E, gray bars) and protein kinases (Fig 5E, orange bars) shift in the spectrum towards mutation intolerance when compared to either metabolic enzymes [83] (Fig 5E, dark green bars) or nucleus-encoded proteins functioning in the mitochondria [84] (Fig 5E, medium green bars), organelles that provide energy from nutrients using metabolic processes [85].
Metabolic enzymes exhibit similar tendency towards mutation-tolerance as the ORs ( Fig  5E). One explanation for the greater tolerance of metabolic genes to mutations might be the redundancy not only in gene duplications, but also in non-homologous proteins that can serve as functional analogs of the same reactions [86]. Metabolites can also be acquired through transport mechanisms, relieving the evolutionary constraints on certain metabolic enzymes. Finally, metabolic pathways exhibit both redundancy and plasticity, allowing for multiple ways to arrive at the same metabolite [87]. with X-linked, autosomal dominant, and autosomal recessive diseases in the Clinical Genome Database, respectively) D) Mutation severity spectrum of disease-associated genes measured by their frequencies in GTS deciles. Associated disease for each gene set is labeled above. E) Mutation severity spectrum of pathway gene sets and large paralogous gene sets measured by their frequencies in GTS deciles. https://doi.org/10.1371/journal.pcbi.1007775.g005 The mutation-tolerance observed for nucleus-encoded mitochondrial proteins might reflect their roles in metabolic processes [85]. However, this tendency is also exhibited by the ribosomal proteins that function in mitochondria compared to ribosomal proteins functioning in cytoplasm (Fig 6A): the majority of mitochondrial ribosomal proteins have high GTS scores while the majority of cytoplasmic ribosomal proteins have low GTS scores. As an example, sideby-side comparison of ribosomal L14P/L23E-like proteins functioning in the cytoplasm (RPL23, Fig 6B and 6C) and the mitochondria (MRPL14, Fig 6B and 6D) highlights their different levels of mutation-intolerance. Both proteins adopt similar small 5-stranded meander barrel folds with relatively long loops that interact with RNA in the assembled ribosome. Cytoplasmic RPL23 and mitochondrial MRPL14 have 34 and 89 gnomAD SAVs respectively, and exhibit quite different DeepSAV distributions (Fig 6B). The cytoplasmic RPL23 includes only a single predicted pathogenic variant (I40F, DeepSAV score = 0.788, Fig 6C) (protein length: 140 amino acid residues), while MRPL14 includes 34 predicted pathogenic variants (DeepSAV score > 0.75, all but one are rare with MAF < 0.0001) covering 28 positions (Fig 6D) (protein length: 145 residues). Neither of these examples possess known pathogenic variants, and only the cytoplasmic version is associated with a neoplastic process in DisGeNET. This marked difference in rare allele mutation severity cannot be explained by either domain or pathway redundancy. The main function of mitochondria is to supply energy, which can be partly salvaged by increasing nutrient intake and decreasing energy-demanding activities. In addition, mitochondria might be able to overcome lowered fitness of mutations in the ribosome through their processes of fusion and fission that help maintain both functional properties and the integrity of the mitochondrial genome that harbors the mitochondrial ribosomal RNA genes [85].

Mutation-intolerant and mutation-tolerant genes function in different disease classes
The set of mutation-intolerant genes define several over-represented disease classes, including virus diseases, behavior & behavior mechanisms, stomatognathic diseases, hemic & lymphatic diseases, immune system diseases, musculoskeletal diseases, nervous system diseases, neoplasms, pathological conditions, signs & symptoms, and mental disorders (Fig 7A). Development and signal transduction are enriched among the mutation-intolerant genes associated with these specific disease classes. Furthermore, top mutation-intolerant genes tend to participate in relevant functional pathways. For example, mutation-intolerant genes associated with behavior diseases are enriched in neurotransmitter receptors and postsynaptic signal transmission (P-value < 1.11E-16), and those involved in immune system diseases are enriched in cytokine signaling of the immune system (P- value < 1.4E-14).
The disease classes that are the most under-represented in mutation-intolerant (and overrepresented in mutation-tolerant) genes include eye diseases, nutritional and metabolic diseases, otorhinolaryngologic diseases, chemically induced disorders, digestive system diseases, respiratory tract diseases, and wounds and injuries. These disease classes tend to be either tissue-specific or related to metabolism. For example, eye diseases involve genes functioning in visual perception, in cilium morphogenesis, as structural constituents of the eye lens, and in phototransduction. Alternatively, nutritional diseases involve genes of respiratory electron transport, steroid metabolism, TCA cycle, and fatty acid metabolism, among others. The nutritional diseases associated with mutation-intolerant genes tend to be dominated by a clinically heterogeneous group of disorders that arise as a result of dysfunction of the mitochondrial respiratory chain (mitochondrial diseases), as well as by obesity and diabetes that display a range of severity in affected individuals and can develop in adolescence or later in life. The relatively modest impact of these diseases on survival may be a reason for the genes associated with such diseases to tolerate mutations.

Viruses exploit disease-causing mutation-intolerant genes for infection
Viral diseases are the highest over-represented disease class among mutation-intolerant genes. Potentially, viral strategies for successful replication and evasion of host immunity could benefit from targeting essential genes that accumulate fewer mutations. In fact, similar observations of viral proteins interacting with more evolutionarily constrained host genes suggest that viruses have driven close to 30% of adaptive amino acid changes in the human proteome, with HIV infection causing a statistically significant increase in adaptation [78]. These evolutionarily constrained viral-interacting host proteins tend to be mutation-intolerant (Fig 5D), while a set of similarly highly adaptive proteins that interact with Plasmodium [88] do not have the same degree of preference for mutation-intolerance (Plasmodium_interact in Fig 5D).
Over a third of the virus disease gene set is involved in HIV coinfection, which describes simultaneous infection of a single host cell by two or more virus particles. Identified HIV coinfection-associated gene products function in pathways such as signaling by interleukins/cytokines, regulation of RUNX3, stabilization of P53, and host interactions with HIV factors, among others (ordered by Reactome enrichment). Cytokines, including interleukins, play a critical role in immunity. Because HIV infects immune CD4 T cells, the connection to interleukin/cytokine signaling molecules that regulate T cell growth and differentiation (i.e. through IL2 or CCL2) is known [89,90], and two of the interleukin signaling examples (PSME3 and PSMC3) represent biomarkers for the disorder [91]. A significant portion of the HIV-related genes (19 out of 23 or 82%) are annotated as essential, including all host interaction factors (KPNB1, RAN, PSMC3, PSME3, PSMA5, PSMA6, PSMC5, and PSMA4), supporting the notion that infection strategies involving essential proteins are utilized by HIV, and potentially other viruses.
The essential HIV host factor importin subunit beta-1 (KPNB1) includes several gnomAD SAV positions predicted as detrimental using both DeepSAV and baseline fitness scores. However, the gene does not belong to our disease-causing set and has no disease associations in DisGeNET. KPNB1 mediates nuclear import of ribosomal proteins [92], and also works together with the RAN GTP-binding protein to bind and import HIV Rev into the nucleus where it exports viral mRNAs for translation [93]. In a structure of KPNB1 bound to RAN (Fig 7B) [94], these positions are buried (T150P, L238S, and A389V) or partially buried (R234G, C436Y) in the hydrophobic core of the KPNB1 repetitive α-hairpins. Such variations could result in local structure instability and loss of function. One surface SAV, R105L, interacts with a nearby E in RAN. Replacement of R by L, which removes a potential interaction of charges, could lower the key interaction of KPNB1 with RAN that drives nuclear import of HIV Rev.
Another example of a GTP-binding protein, RHOA, contributes to viral diseases such as Burkitt Lymphoma, which is a cancer of the lymphatic system with a subtype linked to Epstein-Barr virus (EBV) [95]. RHOA variants are deemed likely pathogenic for several other neoplastic disorders, and several missense variants are listed in DisGeNET, although not in association with Burkitt Lymphoma. Despite the apparent tumor-promoting effects of RHOA in various cancers, previous studies suggest mutations of the gene in the case of Burkitt lymohoma and other neoplastic processes are inhibitory [96,97]. There are only two predicted detrimental SAVs in RHOA in gnomAD. One of them (G189W) maps to the disordered Cterminus adjacent to a residue that gets farnesylated. The disordered and modified C-terminus adopts a coil structure when bound to the RAP1GDS1 guanine nucleotide exchange factor (GEF) (Fig 7C), and the replacement of a small G to W with the larger sidechain would incur steric problems in the GEF-bound conformation. Similarly, a larger sidechain adjacent to the farnesylation site might reduce the modification and influence RHOA localization. While the second SAV (D87N) is relatively conservative, its position near the GTP binding pocket adjacent to a K sidechain that mediates guanine nucleotide binding might influence enzymatic activity.

Human proteome, sequence alignment, and baseline fitness score
The human proteome was obtained from the UniProt database (version 2018.12) [98]. The orthologous groups of human proteins were obtained by OrthoFinder [99] applied to a set of representative vertebrate proteomes. For human proteins in large orthologous groups, we replaced their orthologous groups by the ones retrieved from the OMA database [100] that are usually much smaller and thus exhibit better alignment quality. Multiple sequence alignments of orthologs were obtained by MAFFT [101]. Sequence profile of each position of an alignment, represented as the estimated amino acid frequencies, was calculated as described before [102]. For any amino acid change, we used a previously devised baseline fitness score to represent the severity of the mutation, based on the log-odds ratio between the frequencies of original amino acid and mutated amino acid [50,103].

Positional features used in impact predictions by deep convolutional neural network
For each human protein position, we deduced features reflecting amino acid type, sequence profile, sequence conservation, structure properties, and available functional annotations. The type of 20 amino acids is used as one feature with one-hot encoding. Both the original amino acid and the variant amino acid are encoded in this way, resulting in 40 features. The estimated amino acid frequencies of each position in the multiple sequence alignment of orthologs were used as 20 features. Sequence conservation scores of the multiple sequence alignment of orthologs were calculated by AL2CO [104] and used as one feature. Predictions of 3-state secondary structures (helix, strand, and coil) were made by three programs (PSIPRED [105], SPIDER [106], and PSSpred [107]), resulting in nine features. Three features are based on disorder propensities predicted by three programs (DISOPRED3 [35], SPOT-Disorder [36], and IUPred2A [37]). In addition, low complexity region predictions by SEG [108] and coiled coil predictions by NCOILS [109] were encoded as two features. We also used features reflecting protein-targeting or functional regions/positions from the UniProt sequence annotations. Regions of Nterminal signal peptide (indication of proteins going through secretory pathway), transit peptide (indication of mitochondrion targeting), and transmembrane segments were obtained from UniProt feature records SIGNAL, TRANSIT, and TRANSMEM, respectively. Three post-translational modifications (phosphorylation, acetylation, and methylation) were extracted from the UniProt MOD_RES records. Other UniProt Features includes DISULFID (cysteines participating in disulfide bonds), CARBOHYD (site with covalently attached glycan group), METAL (binding site for a metal ion), BINDING (binding site for any chemical group (co-enzyme, prosthetic group, etc.)), ACT_SITE (amino acid directly involved in the activity of an enzyme), SITE (any single amino acid site that could be functionally relevant), LIPID (site with covalently attached lipid group(s)), and MOTIF (short, i.e. up to 20 amino acids, sequence motif of biological interest). For 1-dimensional convolutional network, the above 89 features from a window of 21 amino acids (the target position and 10 neighboring positions on each side) were used as input. Features in neighboring positions beyond the first or last residues were zero-filled (zero-padding). One additional feature encodes the indicator of zeropadding for such positions (1 for positions beyond the first or last residues, and zero for normal amino acid positions within the protein length). The number of features for each position is 90. By using a window of 21 positions, a total of 90 x 21 = 1890 values serve as the input of the convolutional neural network for each training and testing data point.

Architecture and hyperparameters of the deep-learning convolutional neural network
We used a deep-learning artificial neural network for prediction of SAV pathogenicity. The diagram of neural network structure is shown in S1 Fig. It consists of seven 1-dimensional convolutional (conv1d) layers, two max-pooling layers, and two dense layers before the output. The residual network architecture is implemented twice by combining the input of a conv1d layer with the output after several layers of that input (thick arrows, S1 Fig). The initial input has a window size 21 and 90 channels corresponding to 90 features encoding protein sequence, structure and functional properties (described above, S3 Table). The number of filters and the kernel size of other conv1d layer are 200 and 3, respectively. Each of the two dense layers has 100 nodes and has a following dropout layer with the dropout rate of 0.5. The ReLU activation function is used in all layers except the output layer that uses the softmax function. The batch size is set to 128 in the training process. The neural network was written in python with the TensorFlow package. The prediction score of any SAV, ranging from zero to one, reflects the likelihood of the SAV being pathogenic, and is termed DeepSAV score.

The DeepSAV neural network predictor
We obtained SAVs that were classified as likely pathogenic from the ClinVar [33] and UniProt database. For the ClinVar database, these SAVs are classified as "Pathogenic" or "Likely Pathogenic". For the UniProt database, these SAVs are classified as "Disease" in the SwissVar database [110]. Benign SAVs are those classified as "Benign" or "Likely Benign" in the Clinvar database and those classified as "Polymorphism" by SwissVar in the UniProt database. To evaluate the performance of our neural network predictor, we performed 4-fold cross validation tests. The sets of 43,000 pathogenic variants and 43,000 benign variants were divided into to 4 subsets of equal size. Three subsets of pathogenic variants and three subsets of the benign variants were used to train the neural network and the remaining variants are used for testing. This process is repeated four times with each of the four subsets serving as the validation set. We also obtained scores of various prediction methods from the dbNSFP database [24] and evaluated their performance on the 43,000 pathogenic variants and 43,000 benign variants.

DeepSAV+PG: Variant pathogenicity prediction incorporating population-level and gene-level information
We combined amino acid-level features used in DeepSAV with information from human general population (minor allele frequency of any variant from the gnomAD database) and gene-level information (17 features from dbNSFP) in a deep neural network predictor called Deep-SAV+PG. The 17 gene-level features include three numbers of protein-protein interactions (IntAct, BioGrid, and ConsensusPathDB) [111][112][113], four experimental measures of gene essentiality [64,65,114,115], four scores of estimated haploinsufficiency (P(HI), HIPred_score and GHIS) [26,29,116] or gene essentiality (Gene_indispensability_score) [117], estimated probability of the gene involved in recessive diseases (P(rec)) [118], gene damage index score (GDI-Phred)[31], a loss-of-function intolerance score (LoFtool_score) [119], and three measures of loss-of-function (lof) intolerance/tolerance (gnomAD_pLI (the probability of being loss-of-function intolerant), gnomAD_pRec (the probability of being intolerant of homozygous, but not heterozygous lof variants), and gnomAD_pNull (the probability of being tolerant of both heterozygous and homozygous lof variants)) [5]. We applied the same fourfold cross-validation test to evaluate the performance of DeepSAV+PG on the same set consisting of 43,000 pathogenic variants and 43,000 benign variants.

Enrichment analysis of amino acid properties in likely pathogenic SAVs and gnomAD SAVs
The enrichment log-odds score is defined as the logarithm (with base 2) of the ratio between two probabilities. This ratio is the probability of observing a property among a subset of amino acid positions (e.g., positions with pathogenic SAVs, or positions with gnomAD SAVs with MAF in a certain range) divided by the probability of observing that property among all amino acid positions in the human proteome. It reflects enrichment (if the log-odds score is above zero) or depletion (log-odds score less than zero) of the property in the subset compared to the background (the whole proteome).

Quantification of mutation severity of gnomAD SAVs at the gene level
We applied our deep neural network method to the prediction of mutational impact of gno-mAD [5] SAVs obtained from the dbNSFP database [24] (version 4.0). Rare SAVs were defined as those with MAF less than 0.0001. The mutation severity measure (Gene Tolerance of rare SAVs, GTS score) based on DeepSAV predictions of rare SAVs in the human population is calculated as follows: GTS ¼ sumðDeepSAV scoreðkÞ � MAFðkÞÞ=protein len; where DeepSAV_score(k) is the DeepSAV score of any rare SAV k, and MAF(k) is its minor allele frequency, and the normalization factor is the protein length (protein_len) of the gene.

Analysis of mutation severity measures for potential disease-associated genes
GTS scores were transformed into percentiles (using Excel percentrank) for 17,480 human protein-coding genes. For comparison of our average mutation severity scores to constrained genes that are more likely to be detrimental when inactivated (LOEUF score [5]), we transformed LOEUF scores by percentrank (for 16,670 human genes with LOEUF score). The resulting gene count distributions among LOEUF deciles were plotted for a set of genes with pathogenic SAVs or the top 3,284 genes ranked by four sets of GTS scores from lowest (unlikely to acquire damaging mutations) to highest (tolerates acquired mutations). We chose the GTS score for further evaluation and transformed the score into decile rank. For the disease-related gene set, we compared decile rank of GTS score to those of human genes annotated as essential by either of two large-scale CRISPR experiments [64,65] (2,108 genes) or annotated as non-essential (11,589 genes) in both.
To identify potential disease-associated genes, we compared essential genes having the lowest GTS and LOEUF scores with essential genes with pathogenic SAVs using a Venn diagram. The overlap between the GTS and LOEUF sets is considered to be enriched for potential disease-associated genes. The overlapping set (126 genes) was assigned to disease classes using the DisGeNET curated gene-disease associations (GDAs) [66]. We removed group and phenotype associations from the GDAs. MeSH (Medical Subject Headings) disease class frequencies for the set (observed frequencies) were compared to disease class frequencies assigned to all curated genes (expected frequencies) to evaluate over-and under-representation (observed/ expected frequency ratios).
To further select among potential disease-related genes, we clustered the gene set (126 genes) together with the genes with pathogenic SAVs (70 genes) using ClustVis [120] with six measures for each gene (GTS, LOEUF [5], lnGDI [31], number of rare gnomAD mutations (MAF filter: 0.0001), HIPred [29], and P(HI) [26]). The raw scores for each measure were converted to Z-scores and were pre-processed with row centering and no scaling. Principal component analysis using the SVD (singular value decomposition) with imputation option indicated the first and second components explain 40.8% and 25.9% of the data variance, respectively. Scores were plotted as a heatmap from high (red) to low (blue) Z-score, and both genes and measures were clustered using complete linkage of correlation distances. The genes were split into three large groups for visualization (S2 Fig), with the top 20 clusters separated by space in the resulting heatmaps. Functional analysis for potential disease-related genes were performed using DAVID clustering (medium stringency with 0.001 ease) of GOfat biological process terms [121] and GO enrichment of PANTHER classification [122].

DisGeNET gene-disease mapping
We mapped all human genes with GTS scores to curated DisGeNET diseases (using UniProt to GeneID provided in the Downloads section from the DisGeNET website [66]). Of the 9,414 total GeneIDs with curated GDAs, we mapped GTS scores and ranks from the complete GTS dataset to 8,426 genes with curated GDAs. Over-representation (enrichment) and underrepresentation (depletion) of disease classes from MeSH were calculated over various sets of genes as the ratio of the observed frequency of each class to the expected frequency of each class calculated from disease class frequencies in the entire curated gene-disease database. We chose sets of genes for plotting the distributions of overrepresented disease classes over all GTS ranks, where genes ranked up to 1,000 (top1000 set) tend to include increased frequencies, and genes ranked higher than 12,000 (bottom set) tend to have stable frequencies. We included additional sets surrounding the top1000 (top500 and top1500) to observe trends. We excluded disease classes with few representatives, including F02: psychological phenomena & processes, C22: animal diseases, C03: parasitic diseases, C01: bacterial infections & mycoses, C24: occupational diseases, and C21: disorders of environmental origin. We related genes from various disease classes to function using Reactome or DAVID enrichment analysis tools [73,121]. for each gene are colored from blue (low) to red (high) and clustered (20 clusters delimited by spaces) according to complete linkage of correlation distances. Two clusters with low GTS scores (mutation resistant) have a relatively high proportion of genes with known pathogenic variants and could help identify new disease-associated genes (red labels). ROC AUC values are reported for neural network predictions made by using subsets of features. The notations of the predictors using subsets of features are as follows: AA-using only amino acid types; AA+seg: using amino acid types and low complexity region predictions; AA +coiled.coil: using amino acid types and coiled coil region predictions; using AA+sec.struct: amino acid types and secondary structure predictions; AA+uniprotFeat: using amino acid types and features derived from UniProt Feature fields; AA+disorder: using amino acid types and disorder propensity; AA+consv: using amino acid types and sequence conservation; AA +prof: using amino acid types and sequence profile; AA+consv+prof: using amino acid types, sequence conservation, and sequence profile; ALL-prof: using all features except sequence profile; ALL-consv: using all features except sequence conservation; ALL-prof-consv: using all features except sequence profile and sequence conservation; ALL-uniprotFeat: using all features except those derived from UniProt Feature fields; ALL-disorder: using all features except disordered region predictions; ALL-coiled.coil: using all features except coiled coil region predictions; ALL-sec.struct: using all features except secondary structure predictions; ALL-seg: using all features except low complexity region predictions by seg; ALL: using all features. (DOCX) S1 Table. Information of human protein coding genes (GeneSymbol, UniProt, GTS, LOEUF, and interaction_count) and catalogues of gene sets based on their essentiality (essential1 and essential2), involvement in diseases (autism, DDD (deciphering developmental disorders), cosmic, PathVar, disgenet_path, virus_interacting), and functional categories (kinase, meta-bolic_enzymes, olfactory_receptors, otherGPCR, ribosomal_protein_mitochondrial, riboso-mal_protein_cytoplasmic, mitochondrial).