Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset

doi:10.1371/journal.pcbi.1003382

Table 1.

The different datasets constructed and used in this study and their composition.

Figure 1.

The amino acid exchanges observed in human protein variants.

The 1 kG and OMIM data sets*. Amino acids are arranged by one-letter code in order of increasing hydrophobicity (least hydrophobic on the left, most hydrophobic on the right) using the Fauchère and Pliska scale [58]. Yellow blocks indicate mutations where there are statistically significant differences between 1 kG and OMIM. Blue blocks indicate where no mutations were present in the 1 kG data set. White blocks show where there are no statistically significant differences. Green blocks show where there are proportionally more 1 kG mutations than OMIM mutations. Orange blocks show where there are proportionally more OMIM mutations than 1 kG mutations. The mutability scores (see Methods) for the 1 kG and OMIM sets are shown in the last column. *Note that these matrices are fundamentally different. The 1 kG data set gathers all the observed mutations in the 1 kG project, counting each only once; the OMIM data set combines information gathered from potentially many individuals but filtered to identify those mutations associated with a disease.

Figure 2.

Comparison of the number of mutating residues vs the amino acid frequency of occurrence.

Figure 3.

Amino acid mutability vs the number of codons in the 1 kG data.

Figure 4.

A visual representation of the asymmetry of the 1 kG data.

The plot shows the difference between how often an amino acid mutates and how often it is mutated to. These are raw counts and therefore also reflect each amino acid's frequency of occurrence. Each amino acid is coloured according to CpG content. Red: a CpG dinucleotide occurs in its codons; yellow: one of its codons starts with a G (with a C possibly preceding it); blue: no CpG in its codons. The black line indicates the diagonal where ‘mutations to’ equals ‘mutations from’.
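
The two quantities plotted here, and the colour classes, can be illustrated with a minimal sketch. It assumes the standard genetic code and a literal reading of the caption's colour rules; the function names `cpg_class` and `from_to_counts` and the input format (a list of wild type/mutant pairs in one-letter code) are hypothetical and are not taken from the study's own pipeline.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def cpg_class(aa):
    """Colour class for one amino acid, following the caption literally."""
    codons = [c for c, a in CODON_TABLE.items() if a == aa]
    if any("CG" in c for c in codons):
        return "red"      # a CpG dinucleotide occurs inside a codon
    if any(c.startswith("G") for c in codons):
        return "yellow"   # a codon starts with G, so a preceding C would form CpG
    return "blue"         # no CpG possible by either rule


def from_to_counts(substitutions):
    """Raw counts of how often each residue mutates and is mutated to."""
    mutated_from = Counter(wt for wt, _ in substitutions)
    mutated_to = Counter(mut for _, mut in substitutions)
    return mutated_from, mutated_to


# Hypothetical toy input: (wild type, mutant) pairs.
toy = [("R", "H"), ("R", "C"), ("K", "E")]
mutated_from, mutated_to = from_to_counts(toy)
```

A residue lying off the diagonal in the plot corresponds to `mutated_from[aa]` and `mutated_to[aa]` differing for that amino acid.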

Figure 5.

Site properties for all residues, 1 kG nsSNPs, OMIM nsSNPs and Humsavar nsSNPs in the structure 3D set.

(A) The solvent accessibility for the variants in the four datasets, (B) the secondary structure in which each of the variants occurs and (C) the functional annotation of every variant in the four datasets.

Table 2.

The various functions assigned to nsSNPs in each set.

Figure 6.

Comparison of the conservation scores in the four sets used.

The density distribution of residue conservation scores for all the amino acid positions in UniProt (9,532,474 residues, black), 1 kG (185,428 residues, blue), OMIM (8,099 residues, red) and Humsavar (21,446 residues, green). The conservation scores range from 0 for non-conserved residues to 1 for highly conserved residues.

Figure 7.

Comparison of the differences in observed mutations in the various sets.

Comparison of the differences in the percentage of observed mutations in the 1 kG (blue) and OMIM (red) sets for one amino acid mutating to all others. For example, proportionally more mutations from Lys to Glu are recorded in OMIM than in the 1 kG set. Each plot shows the results of mutation from a specific amino acid (e.g. Arg at top left) to every other amino acid.
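
As a rough illustration of the per-residue percentages compared in these plots, the sketch below normalises substitution counts by the total number of mutations from each wild type residue. The function name `target_percentages` and the toy data are hypothetical; the study's own normalisation may differ in detail.

```python
from collections import Counter, defaultdict


def target_percentages(substitutions):
    """For each wild type residue, the % of its mutations going to each target."""
    counts = defaultdict(Counter)
    for wt, mut in substitutions:
        counts[wt][mut] += 1
    result = {}
    for wt, targets in counts.items():
        total = sum(targets.values())
        result[wt] = {mut: 100.0 * n / total for mut, n in targets.items()}
    return result


# Hypothetical toy input: four observed substitutions from Lys.
toy_set = [("K", "E"), ("K", "E"), ("K", "R"), ("K", "N")]
percentages = target_percentages(toy_set)
```

Running the same function on two data sets and comparing the values for a given pair, such as Lys to Glu, gives the kind of proportional difference the caption describes.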

Figure 8.

Comparison between the physicochemical properties of the wildtype and the mutant models for each of the data sets.

Plots showing the differences between (A) Modeller DOPE scores for the wild type and mutant models (based on the 3D set, 10,628 mutations, and the Humsavar set, 21,446 residues), (B) changes in hydrophobicity between wild type and mutant in both sets and (C) changes in size between wild type and mutant in both sets.

Figure 9.

Bubble plots comparing the relative differences between the instantaneous rate change matrices of the data sets.

(A) 1 kG data, (B) PAM matrix and (C) WAG matrix. (D) A PCA (first two components) plot showing the separation of the 1 kG matrices from other matrices. Matrices included are 1 kG (with and without assuming direction), nuclear (WAG, JTT, LG, PAM, tm126, PCMA), mitochondrial (mtREV24, mtMam, mtArt, mtZoa), chloroplast (cpREV, cpREV64), exposed (alpha helix, beta sheet, coil, turn) and buried (alpha helix, beta sheet, coil, turn). Principal components one and two represent 34% and 20% of the variance, respectively. All other principal components represent 9% or less of the variance each. Amino acids are arranged according to increasing hydrophobicity.

Figure 10.

Dependence of mutation rates on the change in CpG status.

Rates of change from codons were calculated similarly to the amino acid rate matrix [36], but on a 61 by 61 codon matrix.
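
To make the 61 by 61 codon space concrete, the following sketch enumerates the sense codons of the standard genetic code and labels each single-nucleotide change by its effect on CpG status. It is an illustration only: it assumes CpG status is judged within the codon (ignoring flanking bases), which may not match the definition used in the study, and the helper names are hypothetical. It does not estimate rates.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The 61 sense codons index the rows and columns of the codon matrix.
SENSE_CODONS = [c for c, a in CODON_TABLE.items() if a != "*"]


def cpg_change(src, dst):
    """Classify a codon change by whether a within-codon CpG is lost or gained."""
    before, after = "CG" in src, "CG" in dst
    if before and not after:
        return "lost"
    if after and not before:
        return "gained"
    return "unchanged"


def single_nucleotide_pairs():
    """Ordered pairs of sense codons that differ at exactly one position."""
    return [
        (src, dst)
        for src in SENSE_CODONS
        for dst in SENSE_CODONS
        if sum(a != b for a, b in zip(src, dst)) == 1
    ]


pairs = single_nucleotide_pairs()
by_status = {}
for src, dst in pairs:
    by_status.setdefault(cpg_change(src, dst), []).append((src, dst))
```

Grouping observed codon changes by these labels is one way to compare rates between changes that destroy, create or preserve a CpG.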

Figure 11.

Amino acid mutability rank order plot comparing the mutability scores for 1 kG, OMIM and Humsavar residues.

The most mutable amino acids are at the top. Correlation coefficients for 1 kG vs OMIM, 1 kG vs Humsavar and OMIM vs Humsavar are 0.09, 0.17 and 0.51, respectively.
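
The caption does not state which correlation coefficient was used. Since the plot compares rank orders, one plausible choice is Spearman's rank correlation, sketched below in plain Python. This is an assumption, not the study's stated method; the function names and the toy scores are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(scores_a, scores_b):
    """Spearman correlation over the amino acids present in both score dicts."""
    keys = sorted(set(scores_a) & set(scores_b))
    ranks_a = average_ranks([scores_a[k] for k in keys])
    ranks_b = average_ranks([scores_b[k] for k in keys])
    return pearson(ranks_a, ranks_b)


# Hypothetical mutability scores for three residues in two data sets.
set_one = {"R": 3.0, "K": 1.0, "L": 2.0}
set_two = {"R": 0.9, "K": 0.2, "L": 0.5}
rho = spearman(set_one, set_two)
```

With real mutability scores for all twenty amino acids, the same call would be made once per pair of data sets.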
