Table 1.
Methods to assess the impact of non-frameshifting insertion/deletion variants.
Table 2.
Number of variants (proteins) in the training data set.
Table 3.
Predicted structural and functional features.
* indicates in-house predictors.
Fig 1.
Characteristics of variants included in the functional analyses.
(A) Training variants in canonical and noncanonical protein sequences. (B) Recurrently impacted residues in COSMIC. (C) Variant size in gnomAD, HGMD, COSMIC, and recurrent variants in COSMIC (COSMIC-R). The size of a complex indel is the larger of the number of amino acid residues inserted and the number deleted. (D) Variants per protein in COSMIC.
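The size definition used in panel (C) can be written out as a small function. This is only an illustrative sketch: the function name `indel_size` and its arguments are hypothetical and are not taken from the MutPred-Indel code. For a pure insertion or pure deletion, one of the two counts is zero, so the same rule reduces to the number of residues inserted or deleted.

```python
def indel_size(n_inserted: int, n_deleted: int) -> int:
    """Return the size of an indel in amino acid residues.

    Hypothetical helper: for a complex indel the size is the maximum of
    the residues inserted and the residues deleted, as stated in the caption.
    """
    if n_inserted < 0 or n_deleted < 0:
        raise ValueError("residue counts must be non-negative")
    return max(n_inserted, n_deleted)


# A complex indel that deletes 2 residues and inserts 5 has size 5.
complex_size = indel_size(n_inserted=5, n_deleted=2)
# A simple 3-residue deletion has size 3.
deletion_size = indel_size(n_inserted=0, n_deleted=3)
```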
Fig 2.
Relative enrichment of mechanisms impacted by pathogenic variants from HGMD compared to gnomAD.
Negative trend values correspond to enrichment in putatively neutral variation. * indicates statistical significance after Bonferroni correction.
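As a reminder of what the asterisk encodes, a Bonferroni correction compares each p-value against the significance level divided by the number of tests. The sketch below assumes a significance level of 0.05 and uses made-up p-values; neither the level nor the underlying statistical test is specified in this caption, and the function name is hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction.

    alpha=0.05 is an assumed default, not a value taken from the caption.
    Each p-value is compared against alpha divided by the number of tests.
    """
    if not p_values:
        return []
    corrected_alpha = alpha / len(p_values)
    return [p <= corrected_alpha for p in p_values]


# Illustrative p-values for four mechanisms (invented for this example).
example_p_values = [0.001, 0.02, 0.0125, 0.2]
flags = bonferroni_significant(example_p_values)
```

With four tests the per-test cutoff is 0.0125, so only the first and third values would be starred in this toy example.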
Fig 3.
Proportion of single-residue non-frameshifting insertion/deletion variants predicted to impact structural and functional mechanisms.
A variant was considered “predicted” if its score was at or above the 95th percentile of the gnomAD score distribution. We contrast the functional impact of variants from COSMIC, HGMD (n = 1556), and de novo variants (n = 168). The highly recurrent set includes variants at residues impacted by at least 25 missense and insertion/deletion variants in the COSMIC database (n = 98); it is compared with recurrent variants, at residues impacted at least twice (n = 3622), and non-recurrent variants (n = 2417).
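The thresholding rule in this caption can be sketched as follows. This is a minimal illustration, assuming a nearest-rank percentile definition (the caption does not say which percentile estimator was used); the function names and the toy scores are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed estimator, chosen for simplicity)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_predicted(scores, background_scores, pct=95):
    """Fraction of scores at or above the pct-th percentile of the background.

    In the caption's setting, background_scores would be the property scores
    of gnomAD variants and scores those of the data set being contrasted.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    threshold = percentile_nearest_rank(background_scores, pct)
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy background of 100 scores; the 95th percentile is the value 95.
background = list(range(1, 101))
proportion = fraction_predicted([94, 95, 99, 10], background)
```

By construction, about 5% of the background itself meets the threshold, so proportions well above that level in another data set indicate enrichment relative to gnomAD.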
Fig 4.
Proportion of COSMIC variants per histology type that impact structural and functional mechanisms compared to HGMD variants.
(A) Changes aggregated over each class of structural and functional mechanisms. (B) Proportions for a selection of individual mechanisms.
Fig 5.
Receiver Operating Characteristic (ROC) curves and Areas Under the ROC Curves (AUC).
(A) Cross-validation performance of MutPred-Indel with per-protein and per-cluster training, as well as the performance of a model with training data that includes singleton variants in gnomAD. (B) Cross-validation performance of MutPred-Indel on insertions, deletions, and complex indel variants separately. (C) Performance of MutPred-Indel and MutPred2 on single amino acid insertion/deletion variants. (D) Comparison of MutPred-Indel and three existing methods.
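For readers unfamiliar with the metric, the AUC reported in these panels equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral one, with ties counted as one half. The sketch below computes it directly from that definition; it is not the evaluation code used for MutPred-Indel, and the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for pathogenic (positive), 0 for neutral (negative).
    Quadratic in the number of variants, which is fine for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```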
Table 4.
Performance of MutPred-Indel without key feature sets.
Fig 6.
Histograms of predicted pathogenicity scores for (A) the training data using cross-validation, (B) cancer driver mutations from dbCID (yellow) and highly recurrent variants (COSMIC-R, red) compared to the background in COSMIC (blue), and (C) de novo non-frameshifting insertion/deletion variants in individuals with autism spectrum disorder (ASD, red) and de novo variation from unaffected siblings (Control, blue).