Figure 1.
Distribution of variants according to sequence features and allele frequency.
The y-axis represents the percentage of variants for the allele frequencies and categories represented on the x-axis. Panel A, percentage of variants upstream of a functional domain. Panel B, in alternatively spliced sites. Panel C, in the principal isoform. Panel D, in regions targeted by NMD. The distribution is shown for synonymous (green), missense (blue), stop-gain (red) and frameshift (orange) variants according to minor allele frequency (MAF) intervals, where singletons (variants detected in only one individual) are represented separately. The pattern of OMIM disease variants and homozygous variants for each feature is also shown. The corresponding coding genome background (measured as the percentage of nucleotides displaying the feature) is shown as a grey line (partly hidden by the distribution of synonymous variants in some panels). Numbers of variants in each category are reported in Table S2. Logistic regression was used to model the probability of observing a given sequence feature in a given type of variant as a function of the logarithm of the MAF. The odds ratio estimates for stop-gain variants were significantly different from those of synonymous variants in all panels (p-values<5e-05, heterogeneity test [40]); for frameshifts, the difference was significant in panels B, C and D (p-values<5e-03).
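The logistic regression mentioned in this caption can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `fit_logistic`, the choice of log10 as the logarithm base, and the Newton-Raphson fitting routine are all assumptions made for the example.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson (hypothetical helper).

    x: predictor values, e.g. log10(MAF) of each variant (assumed base 10).
    y: 1 if the variant displays the sequence feature, 0 otherwise.
    Returns (b0, b1); exp(b1) is the odds ratio per unit increase in x.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g0 += yi - p
            g1 += (yi - p) * xi
            # Entries of the (negated) Hessian, X^T W X
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += (X^T W X)^-1 * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def odds_ratio(b1):
    """Odds ratio associated with a one-unit increase in the predictor."""
    return math.exp(b1)
```

With toy data in which 3 of 4 variants at log10(MAF) = -4 and 1 of 4 variants at log10(MAF) = -2 display the feature, the fitted slope is -ln 3, i.e. an odds ratio of 1/3 per tenfold increase in MAF. Comparing such estimates between variant types (the heterogeneity test cited as [40]) is not covered by this sketch.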
Figure 2.
Association of NMD-target variants with gene expression.
Panel A shows the distribution of average expression z-scores for genes from individuals carrying different types of variants (synonymous, missense, frameshift and stop-gain). Peer-factor normalized RPKM from [22] were used. The black sector represents the distribution of variants outside the NMD-target region and the colored sector those within the NMD-target region. Statistically significant differences were observed for stop-gain variants predicted to trigger NMD (n = 756) compared to synonymous variants (one-sided Wilcoxon rank-sum test p-value<2.2e-16). Panel B shows the distribution of average expression z-scores described in panel A for synonymous (grey) and stop-gain (dark and light purple) variants within the NMD-target region. The distribution of NMD-target stop-gains is represented separately for singletons (dark purple, n = 488) and non-singletons (light purple, n = 268). Distributions are statistically different (one-sided Wilcoxon rank-sum test p-value = 1.3e-10). Panel C shows the distribution of average expression z-scores described in panel A for synonymous (grey) and stop-gain (dark and light pink) variants within the NMD-target region of genes with multiple isoforms described in CCDS. The distribution of NMD-target stop-gains is represented separately for those affecting all isoforms (dark pink, n = 216) and those affecting only a fraction of isoforms (light pink, n = 85). Distributions are statistically different (one-sided Wilcoxon rank-sum test p-value = 2.5e-03). Results were reproduced using RPKM normalized expression values (Figure S3).
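The average expression z-score used in this figure can be illustrated with a short sketch. It assumes, for illustration only, that z-scores are computed per gene across all individuals using the sample standard deviation and then averaged over the carriers of a variant; the function name `carrier_mean_z` and the data layout are hypothetical.

```python
from statistics import mean, stdev

def carrier_mean_z(expression, carriers):
    """Average expression z-score of the individuals carrying a variant.

    expression: dict mapping individual ID -> normalized expression of one gene.
    carriers:   IDs of the individuals carrying the variant in that gene.
    Assumption: z-scores use the mean and sample standard deviation
    computed across all individuals for the gene.
    """
    values = list(expression.values())
    mu = mean(values)
    sd = stdev(values)
    z_scores = [(expression[i] - mu) / sd for i in carriers]
    return mean(z_scores)
```

A negative value indicates that carriers express the gene below the cohort average, as expected for stop-gain variants that trigger NMD.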
Figure 3.
Receiver operating characteristic of the performance of pathogenicity scores for stop-gain variants.
Panel A: The classification power of three pathogenicity scores was evaluated on a set of 1160 pathogenic stop-gain variants in the OMIM database, and 125 common stop-gain variants not known to be pathogenic. Shown are the ROC curves for the sequence-based classifier (SB) developed in this work, for the gene-based score reported in [19] (GB), and for the joint classifier (SB×GB). Dashed curves correspond to a randomization test in which rows in sequence features are shuffled column-wise (denoted by SB(r)). Panel B: AUC improvement achieved when combining the sequence-based scores with a gene-based score. The panel shows AUC values of ROC curves using two independent gene-based scores (MacArthur 2012 [19] and RVIS [6]), on two independent datasets of variants (ESP and 1000 Genomes) and two types of variants: stop-gains and frameshifts. The corresponding ROC curves and the numbers of pathogenic and common variants used for the benchmark are shown in Figure S4. Inclusion of sequence features led to an increased area under the ROC curve in all evaluated settings.
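The AUC values reported in this figure can be understood through the pairwise interpretation of the area under the ROC curve. The sketch below is illustrative only and not the authors' implementation; the function name `auc` and the convention that higher scores mean more pathogenic are assumptions.

```python
def auc(pathogenic_scores, common_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen pathogenic variant
    scores higher than a randomly chosen common variant; ties count 0.5.
    Assumption: higher score = more likely pathogenic.
    """
    wins = 0.0
    for p in pathogenic_scores:
        for c in common_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(pathogenic_scores) * len(common_scores))
```

An AUC of 0.5 corresponds to a score with no discriminative power, which is the behaviour expected from a randomized control such as SB(r).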
Figure 4.
Correlation between pathogenicity scores of truncating variants and impact on gene expression levels.
Shown are the distributions (y-axis) of three pathogenicity scores (Panel A: the sequence-based score developed in this work; Panel B: the gene-based score from MacArthur 2012 [19]; Panel C: the gene-based score RVIS [6]) within quintile bins (x-axis) of the average expression z-scores from individuals carrying stop-gain variants (Peer-factor normalized RPKM from [22] were used; see Methods and Figure 2). A total of 1060 stop-gain variants are represented, 212 in each quintile. Quintiles from 1 to 5 are ordered by decreasing impact on gene expression levels and correspond to the following intervals, respectively: z-score<−1.25, (−1.25, −0.66], (−0.66, −0.23], (−0.23, 0.23], (0.23, 5.15]. To allow comparison across scores, they are represented as rank percentiles, where the value for a given variant is the percentage of all stop-gain variants that had a score more pathogenic than that variant. Therefore, a rank percentile of “0” indicates a variant with the highest predicted probability of being pathogenic, while a rank percentile of “100” indicates a variant with the lowest predicted severity. A stronger correlation with expression levels was observed for the sequence-based score (Spearman rank correlation = 0.21±0.03, p-value<5e-12) than for either gene-based score (0.06±0.04, p-value>0.05 for the MacArthur 2012 score and 0.13±0.03, p-value<5e-05 for the RVIS score). None of the scores associated frameshift variants with gene expression levels.
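The rank percentile and quintile binning described in this caption can be sketched as follows. This is a hedged illustration: the function names, the `higher_is_more_pathogenic` flag, and the assignment of a z-score of exactly −1.25 to quintile 1 are assumptions not stated in the caption.

```python
from bisect import bisect_left

# Upper edges of quintiles 1 to 4, taken from the intervals in the caption.
QUINTILE_UPPER_EDGES = [-1.25, -0.66, -0.23, 0.23]

def rank_percentiles(scores, higher_is_more_pathogenic=True):
    """Percentage of variants with a more pathogenic score than each variant.

    A value of 0 marks the variant predicted to be most pathogenic.
    """
    n = len(scores)
    result = []
    for s in scores:
        if higher_is_more_pathogenic:
            more_pathogenic = sum(1 for t in scores if t > s)
        else:
            more_pathogenic = sum(1 for t in scores if t < s)
        result.append(100.0 * more_pathogenic / n)
    return result

def expression_quintile(z):
    """Quintile (1 to 5) of an average expression z-score.

    Intervals are right-closed, e.g. (-1.25, -0.66] maps to quintile 2.
    Assumption: a z-score of exactly -1.25 is placed in quintile 1.
    """
    return bisect_left(QUINTILE_UPPER_EDGES, z) + 1
```

Because the percentile counts only strictly more pathogenic variants, the least pathogenic of n variants receives 100·(n−1)/n under this definition, which approaches 100 for large n.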
Figure 5.
Complementarity between sequence-based and gene-based pathogenicity scores illustrated for OMIM genes with both pathogenic and non-pathogenic/non-annotated stop-gain variants.
Shown are the sequence-based scores (x-axis) for 273 stop-gain variants reported by the ESP and 1000 Genomes datasets in 75 OMIM genes carrying both OMIM pathogenic variants (grey dots) and non-pathogenic/non-annotated variants (orange dots). Genes are displayed in blocks from 1 to 9 (y-axis on the right) corresponding to deciles of the gene-based MacArthur 2012 rank percentile (e.g. 1: ≤10; 2: (10,20], etc.). Grey triangles beside the panels represent the direction of increasing pathogenicity for the corresponding scores.
Figure 6.
Discrimination of pathogenic and non-pathogenic variants within OMIM genes according to the degree of gene conservation.
Shown are boxplots representing the distribution of the average sequence-based score of pathogenic (dark grey) and non-pathogenic/non-annotated (orange) stop-gain variants in OMIM genes depicted in Figure 5. The distributions of the corresponding MacArthur 2012 and RVIS gene-based scores are shown in light grey. Genes are represented in two categories according to their conservation level in primates: dN/dS ratio below (Panel A; n = 54) and above (Panel B; n = 20) the protein-coding genome average.
Figure 7.
Pathogenicity score distributions for rare stop-gain variants in innate immunity genes.
Rank percentile distributions of pathogenicity scores for rare stop-gain variants (MAF<1%) are shown in different sets of genes: protein coding genome background (grey, “Genome”), innate immunity genes (light turquoise, “Inn Imm”) and their subset of interferon stimulated genes (dark turquoise, “ISGs”). The same categories are shown for OMIM disease variants. All variants are reported in ESP and 1000 Genomes datasets except for sets indicated with the § symbol (dashed boxes), which present scores for OMIM disease variants only reported in the OMIM database. Only three variants reported in ESP and 1000 Genomes were found to affect ISGs and to be annotated as pathogenic in OMIM; this category is not represented in the figure. Variants with the highest probability of being pathogenic have rank percentiles closer to zero (top of the panels). Panel A represents precomputed gene-based pathogenicity scores from [19]. Panel B represents sequence-based pathogenicity scores, i.e. posterior probabilities using the features described in the present work (see main text). Distributions of rank percentiles are represented as boxes, where each box spans from the 1st to the 3rd quartile and the median is denoted by a bold line in the middle. The total number of variants within each distribution is indicated. Differences in the number of variants in equivalent categories between panels A and B originate from the unavailability of the gene-based scores for some genes. Statistical differences against the genome reference (one-sided Wilcoxon rank-sum tests) are indicated with asterisks according to Bonferroni-corrected p-values: <5e-02 (*), <5e-03 (**) and <5e-04 (***). The genome-wide median is denoted by a red line. Spearman correlation between the sequence-based and gene-based pathogenicity scores was below 0.31 in all sets of genes analyzed (Figure S5).
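The asterisk annotation described in this caption can be illustrated with a small sketch. The thresholds are those given in the caption; the function names and the assumption that the correction multiplies each p-value by the number of tests in the family are illustrative, since the caption does not state how many tests were corrected for.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def significance_label(p_corrected):
    """Map a corrected p-value to the asterisk code used in the caption."""
    if p_corrected < 5e-04:
        return "***"
    if p_corrected < 5e-03:
        return "**"
    if p_corrected < 5e-02:
        return "*"
    return ""
```

For example, four raw p-values of 0.0001, 0.001, 0.02 and 0.5 become 0.0004, 0.004, 0.08 and 1.0 after correction, so only the first two receive asterisks.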