Fig 1.
Distribution of disease-associated genes and variants.
A) The number of ClinVar disease-associated genes with different types of variants. The non-SAV category combines the nonsense, synonymous, noncoding, and indel categories. B) The number of variants of different types in ClinVar disease-associated genes. C) Venn diagram of genes with pathogenic SAVs from UniProt and ClinVar. D) Venn diagram of pathogenic variants from UniProt and ClinVar.
Fig 2.
Enrichment of SAVs among sequence, structure and functional properties.
A) Enrichment/depletion of properties in pathogenic SAVs compared to all amino acid positions (the y-axis shows log2-based log-odds scores). Notations: Consv1: positions with low conservation scores (between 0 and 0.3); Consv2: positions with medium conservation scores (between 0.3 and 0.6); Consv3: positions with high conservation scores (larger than 0.6). H_psipred, E_psipred, and C_psipred are secondary structure predictions of α-helix, β-strand, and coil by PSIPRED. The same notation is used for the secondary structure prediction programs SPIDER (H_spd3, E_spd3, C_spd3) and PREDSS (H_predss, E_predss, C_predss). O_disopred and D_disopred correspond to ordered and disordered regions predicted by DISOPRED, respectively, and the same notation is used for the disorder prediction programs SPOT-DISORDER (O_spotd and D_spotd) and IUPRED2A (O_iupred2a and D_iupred2a). The notations ncoil and seg denote predicted coiled-coil regions and low-complexity regions, respectively. P_MODRES, A_MODRES, and M_MODRES are positions annotated in UniProt as modified by phosphorylation, acetylation, and methylation, respectively. SIGNAL, TRANSIT, and TRANSMEM are positions annotated as signal peptide, transit peptide, and transmembrane segment in UniProt. DISULFID, SITE, ACT_SITE, MOTIF, METAL, BINDING, CARBOHYD, and LIPID are positions annotated with these keywords in UniProt feature fields (see Materials and methods for their definitions). B) Enrichment/depletion of amino acid properties in gnomAD SAVs with different MAF ranges (from light blue to dark blue: MAF < 0.0001, 0.0001 ≤ MAF < 0.001, 0.001 ≤ MAF < 0.01, 0.01 ≤ MAF) compared to all amino acid positions.
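As a rough illustration of the log2-based scores plotted in Fig 2, the sketch below computes a log2 ratio between the frequency of a property among a set of variant positions and its frequency among all amino acid positions. This is an assumption about the general form of the score; the exact definition is given in Materials and methods and may differ. The function name `log2_enrichment` and the counts are hypothetical.

```python
import math


def log2_enrichment(n_prop_in_set, n_set, n_prop_background, n_background):
    """Return log2 of (property frequency in set) / (property frequency in background).

    Positive values indicate enrichment, negative values indicate depletion.
    Assumed form of the score; not taken from the paper's methods.
    """
    counts = (n_prop_in_set, n_set, n_prop_background, n_background)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive")
    freq_set = n_prop_in_set / n_set
    freq_background = n_prop_background / n_background
    return math.log2(freq_set / freq_background)


# Made-up counts for illustration only: the property covers 50% of the
# variant positions but only 25% of all positions.
enriched = log2_enrichment(500, 1000, 2500, 10000)
depleted = log2_enrichment(125, 1000, 2500, 10000)
```

With these made-up counts the property is twofold enriched in the first case and twofold depleted in the second, so the two scores have equal magnitude and opposite sign.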
Fig 3.
A) Performance of variant pathogenicity prediction programs measured by AUC (area under the ROC curve). B) Scatter plot of DeepSAV scores and baseline fitness scores for SAVs observed in gnomAD. Data points for four different MAF categories are shown. Density plots for each category are shown along the top (DeepSAV) and right (baseline fitness) axes. C) to F) Two-dimensional histograms (made with the R ggplot2 package) of DeepSAV scores and baseline fitness scores for gnomAD variants with MAF ≥ 0.01 (C), 0.001 ≤ MAF < 0.01 (D), 0.0001 ≤ MAF < 0.001 (E), and MAF < 0.0001 (F).
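The four MAF categories used in Fig 2B and Fig 3 can be expressed as a small binning function. The sketch below uses the thresholds stated in the legends; the names `MAF_EDGES`, `MAF_LABELS`, and `maf_category` are hypothetical and are not taken from the authors' code.

```python
from bisect import bisect_right

# Lower bounds of the second, third, and fourth categories (from the legends).
MAF_EDGES = [0.0001, 0.001, 0.01]
MAF_LABELS = [
    "MAF < 0.0001",
    "0.0001 <= MAF < 0.001",
    "0.001 <= MAF < 0.01",
    "MAF >= 0.01",
]


def maf_category(maf):
    """Map a minor allele frequency to one of the four categories."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("MAF must lie in [0, 1]")
    # bisect_right places a value equal to an edge in the upper category,
    # matching the closed lower bounds (e.g. 0.0001 <= MAF).
    return MAF_LABELS[bisect_right(MAF_EDGES, maf)]


examples = [maf_category(m) for m in (0.00005, 0.0001, 0.005, 0.01, 0.3)]
```

Using `bisect_right` keeps the boundary handling in one place, so a variant with MAF exactly equal to a threshold falls into the higher-frequency category, as the inequalities in the legends require.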
Fig 4.
Mutation severity measures based on DeepSAV identify potential disease-associated genes.
A) Mutation severity measure (GTS score) correlates with the average number of deleterious SAVs for 17,480 human genes. B) Distribution of gene counts among decile bins of the loss-of-function constraint measure (LOEUF) for a set of genes (>3,000) with pathogenic SAVs (red bars, labeled as "path") and for gene sets (having the same number of genes) with the lowest GTS scores computed at various cutoffs of minor allele frequency (0.0001, 0.001, 0.01 and 1). On the x-axis, 0 means the first LOEUF decile [0, 0.1] (the other numbers follow the same convention). C) Distribution of gene frequency among GTS deciles (MAF cutoff 0.0001) for the same gene set with known pathogenic SAVs compared to essential and nonessential gene sets. D) Distribution of protein interactions from four databases (BioGrid [123], IntAct [124], DIP [125] and HPRD [126]) integrated in PICKLE [127] for gene sets within three different mutation severity GTS score deciles (0, 0.5 and 0.9). E) Venn diagram highlighting the overlap among essential genes with known pathogenic variants (labeled as "Pathogenic"), essential genes with the lowest loss-of-function constraint scores (LOEUF), and essential genes with the lowest mutation severity measure (GTS). F) Representation of disease classes associated with genes from the overlapping set of top-ranked genes by LOEUF and GTS (126 genes, not including genes with known pathogenic SAVs).
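The decile distributions in Fig 4 can be pictured with a minimal rank-based binning sketch: genes are sorted by score, split into ten equal-sized bins, and the members of a gene set are counted per bin. This assumes rank-based deciles with ties broken by input order; the authors' actual procedure may differ. The function names, gene names, and scores are hypothetical.

```python
from collections import Counter


def decile_bins(scores):
    """Assign each gene a decile index 0..9 by ascending score rank."""
    ranked = sorted(scores, key=scores.get)  # stable sort: ties keep input order
    n = len(ranked)
    return {gene: (10 * rank) // n for rank, gene in enumerate(ranked)}


def decile_counts(gene_set, bins):
    """Count how many genes of gene_set fall into each of the ten deciles."""
    counts = Counter(bins[g] for g in gene_set if g in bins)
    return [counts.get(d, 0) for d in range(10)]


# Made-up scores for 20 hypothetical genes, for illustration only.
scores = {f"gene{i}": i / 20 for i in range(20)}
bins = decile_bins(scores)
counts = decile_counts(["gene0", "gene1", "gene2", "gene19"], bins)
```

Here decile 0 holds the lowest-scoring genes, which matches the convention in the legend where 0 labels the first decile.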
Fig 5.
Mutation-intolerant and mutation-tolerant genes prefer different pathways and disease types.
A) Top-ranked genes with low GTS scores like ERK2 kinase (PDB 4fmq) have relatively few DeepSAV predicted deleterious variant positions (DeepSAV score > 0.75, red spheres). One of these (black sphere) is near (< 4 Å) the active site (ANP substrate analog in black stick). B) Bottom-ranked genes with high GTS scores like CD36 (PDB 5lgd) are tolerant to predicted deleterious mutations (DeepSAV score > 0.75, red spheres), including several positions (black spheres) lining the fatty acid (black stick) binding sites or with known pathogenic variation in platelet glycoprotein deficiency (blue spheres). C) Mutation severity spectrum of disease-associated genes in general, measured by their frequencies in GTS deciles (PathVar: genes with pathogenic SAVs in ClinVar and UniProt; DisGeNET: genes with diseases in the DisGeNET database; X-linked, Autosomal dominant, and Autosomal recessive: sets of genes associated with X-linked, autosomal dominant, and autosomal recessive diseases in the Clinical Genome Database, respectively). D) Mutation severity spectrum of disease-associated genes measured by their frequencies in GTS deciles. The associated disease for each gene set is labeled above. E) Mutation severity spectrum of pathway gene sets and large paralogous gene sets measured by their frequencies in GTS deciles.
Fig 6.
A) Mutation severity spectrum of ribosomal proteins functioning in the cytoplasm (orange bars) and in the mitochondria (green bars) measured by their numbers in GTS deciles. B) DeepSAV score distribution for 34 gnomAD SAVs of cytoplasmic ribosomal protein RPL23 (orange) and 89 gnomAD SAVs of mitochondrial ribosomal protein MRPL14 (green). C) Cytoplasmic 60S ribosomal protein RPL23 (PDB 6ek0, chain LV) in orange cartoon has a single predicted detrimental SAV (red sphere). D) Mitochondrial 39S ribosomal protein MRPL14 (PDB 5oom, chain L) in green cartoon has multiple predicted detrimental SAVs.
Fig 7.
Mutation-intolerant genes exhibit pathway preferences and are exploited by viruses.
A) Genes are ranked from low to high by mutation severity measure, GTS. The top-ranked genes are mutation-intolerant, and the bottom-ranked are mutation-tolerant. Ratios of observed/expected frequencies of disease class associations for sets of mutation-intolerant (Top) and mutation-tolerant (Bottom) genes are shown. Diseases are ordered by the exp/obs frequency ratios in the top1000 set (top 1000 genes with the lowest GTS score). B) Ribbon diagram of KPNB1 (cyan) bound to Ran GTPase (green) with DeepSAV-predicted detrimental variants (red spheres), including R106L (stick) at the interaction interface (from PDB 1ibr). C) Ribbon diagram of GEF (green) bound to RHOA GTPase (cyan, PDB 5zhx), with labeled DeepSAV-predicted detrimental variants (red spheres) adjacent to a farnesylation site (orange stick) and near the active site (stick colored by atom, from superimposed GTPase 1tx4).