Curated multiple sequence alignment for the Adenomatous Polyposis Coli (APC) gene and accuracy of in silico pathogenicity predictions

Computational algorithms are often used to assess pathogenicity of Variants of Uncertain Significance (VUS) that are found in disease-associated genes. Most computational methods include analysis of protein multiple sequence alignments (PMSA), assessing interspecies variation. Careful validation of PMSA-based methods has been done for relatively few genes, partially because creation of curated PMSAs is labor-intensive. We assessed how PMSA-based computational tools predict the effects of the missense changes in the APC gene, in which pathogenic variants cause Familial Adenomatous Polyposis. Most Pathogenic or Likely Pathogenic APC variants are protein-truncating changes. However, public databases now contain thousands of variants reported as missense. We created a curated APC PMSA that contained >3 substitutions/site, which is large enough for statistically robust in silico analysis. The creation of the PMSA was not easily automated, requiring significant querying and computational analysis of protein and genome sequences. Of 1924 missense APC variants in the NCBI ClinVar database, 1800 (93.5%) are reported as VUS. All but two missense variants listed as P/LP occur at canonical splice or Exonic Splice Enhancer sites. Pathogenicity predictions by five computational tools (Align-GVGD, SIFT, PolyPhen2, MAPP, REVEL) differed widely in their predictions of Pathogenic/Likely Pathogenic (range 17.5–75.0%) and Benign/Likely Benign (range 25.0–82.5%) for APC missense variants in ClinVar. When applied to 21 missense variants reported in ClinVar and securely classified as Benign, the five methods ranged in accuracy from 76.2–100%. Computational PMSA-based methods can be an excellent classifier for variants of some hereditary cancer genes. However, there may be characteristics of the APC gene and protein that confound the results of in silico algorithms. 
A systematic study of these features could greatly improve the automation of alignment-based techniques and the use of predictive algorithms in hereditary cancer genes.


Multi-gene panel testing is now routine for identifying hereditary cancer susceptibility, leading to increased detection of pathogenic mutations, which can improve clinical management.

The classification of these VUS represents a major challenge in clinical genetics.

Computational (in silico) tools have been developed to help predict whether or not the protein function will be disrupted (reviewed in [5]). In silico tools often use Protein Multiple Sequence

Missense pathogenic variants are rare in some genes, including APC, the gene responsible for Familial Adenomatous Polyposis (FAP). APC has been sequenced frequently in clinical genetic testing, but few missense pathogenic variants have been identified, for reasons that have not been clearly demonstrated [14]. The increase in clinical DNA sequencing tests for cancer predisposition has led to an increase in missense VUSs in APC that require classification.

Here we systematically apply in silico methods to APC, assessing the logistics and results of using these commonly available tools to predict pathogenicity of missense variants in a gene for which missense is an uncommon mechanism of pathogenicity.

Results from searching the NCBI Gene database for "APC" initially yielded reliable full length APC protein sequences from 38 organisms. We encountered a number of challenges to the simple automated assembly of a meaningful APC PMSA, including:

We constructed two PMSAs. Our goal was to create a curated PMSA that would optimize predictions for pathogenicity of variants from computational algorithms. This 10-sequence PMSA contained species chosen to reflect as closely as possible the 14-species PMSA previously reported for analyzing variants and validating computational algorithms in the MMR genes, in which missense VUS are common and in silico interpretation is frequently used [8].

We identified full length APC sequences for 11 of these 14 species. The 10-species PMSA that we curated using the above criteria (Table 1, Table)

Classifications of: "Benign", "Likely Benign", "Pathogenic", "Likely Pathogenic", "Uncertain Significance" and "Conflicting Interpretations of Pathogenicity".

Table 3 Legend: Substitutions flanking the 12 splice sites found in Human APC were removed from the list of selected missense variants. A total of 1924 variants that met the above classification criteria and were not located in exon boundaries were used for analysis. Of the 1924 variants, 1.1% were classified as benign, none were classified as pathogenic and 98.9% were classified as uncertain or conflicting interpretation of pathogenicity.
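The percentages in the Table 3 legend can be checked with a short calculation. The sketch below assumes that the 1.1% benign fraction corresponds to the 21 securely classified Benign variants mentioned in the abstract; that mapping, and the variable and function names, are illustrative assumptions rather than values taken from Table 3 itself.

```python
# Sanity check of the Table 3 legend percentages.
# Assumption: the benign class holds the 21 variants cited in the abstract.
TOTAL_VARIANTS = 1924
benign = 21
pathogenic = 0  # the legend states that none were classified as pathogenic

# Everything else is "uncertain" or "conflicting interpretation".
uncertain_or_conflicting = TOTAL_VARIANTS - benign - pathogenic


def percent(count, total=TOTAL_VARIANTS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(benign, percent(benign))
print(uncertain_or_conflicting, percent(uncertain_or_conflicting))
```

Under this assumption the two classes account for all 1924 variants, and the rounded percentages agree with the 1.1% and 98.9% stated in the legend.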

The proportions of variants predicted to be "Benign" were MAPP 25.0%, PolyPhen2 41.0%, SIFT 68.1%, Align-GVGD 82.5% (Table 4A)

truncating variants are known to cause disease", is relevant to APC. By this criterion, any missense APC variant is given "Supporting" evidence, the lowest level, favoring benign classification of missense variants. Further study may help determine whether this criterion for benign classification should be upgraded from "Supporting" (for which estimated Odds of Pathogenicity is low [18], discussed below) to a higher level for these variants. The PP2 criterion for pathogenicity presupposes that missense is a common mechanism for mutation; future studies should assess whether it is being inappropriately used when missense is a rare or unknown mechanism for a given gene.

One cannot assume that in silico tools that are valuable predictors for one gene will perform as well for other genes. The majority of APC missense variants in ClinVar are likely to be benign,

The ClinGen Sequence Variant Interpretation working group has estimated that the "Supporting" level of evidence confers approximately 2.08:1 odds in favor of pathogenicity [18], or a 67.5%
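The 67.5% figure is consistent with the standard conversion from odds to probability, p = odds / (1 + odds). The following is a minimal sketch of that arithmetic; the function name is illustrative and is not taken from the ClinGen framework.

```python
def odds_to_probability(odds):
    """Convert odds in favor (odds:1) to a probability between 0 and 1."""
    return odds / (1.0 + odds)


# "Supporting" evidence level, as estimated in [18].
supporting_odds = 2.08
supporting_probability = odds_to_probability(supporting_odds)

print(f"{100 * supporting_probability:.1f}%")  # 67.5%
```

This conversion treats the odds in isolation; it does not model how the odds are combined with a prior probability or with other evidence.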

Nucleotide regions flanking prospective indels were analyzed using two splice site calculators: