Integrating Multiple Genomic Data to Predict Disease-Causing Nonsynonymous Single Nucleotide Variants in Exome Sequencing Studies

Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing data, due to such facts as the large number of sequenced variants, the presence of non-negligible fraction of pathogenic rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, prevalent applications of exome sequencing have been appealing for an effective computational method for identifying causative nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.

(Supplementary Figure 1, A versus C). This is understandable since seed genes selected for the later, as a collection of genes known as associated with 10 diseases of the highest phenotype similarities to the query disease, might not be as reliable as those for the former.

Effects of the number of seed genes and the number of neighbouring diseases
The number of known disease-causing genes is quite different for different diseases whose genetic bases are partly known. On this scenario, we assessed whether the number of seed genes affect the prediction power of SPRING in detecting causative SNVs. We divided the 113 diseases annotated as associated with at least two genes into four groups: the first group contained 71 diseases with 2 seed genes; the second group contained 20 diseases with 3 seed genes; the third group contained 11 diseases with 4 seed genes; the fourth group included 11 diseases with the number of seed genes ranging from 5 to 12. We then repeated the validation experiment for each group. Results, as summarized in Supplementary Figure 2, demonstrate that the number of known disease-causing genes has little influence on the effectiveness of SPRING for partly known disease detection, since both the MRRs and the AUCs only show reasonable fluctuation among these groups.
In the detection of causative SNVs for query diseases whose genetic bases were unknown, we selected seed genes for a query disease as genes annotated as associated with 10 diseases of the highest phenotype similarities with the query disease. We then assessed how the number of such neighboring diseases affected the prediction power of our method by evaluating the performance of SPRING for all the 1,436 diseases with seed genes selected as those associated with 2, 4, 6, 8, 10, 12, and 15 diseases of the highest phenotype similarities with the query disease. Results, as shown in Supplementary Figure 3, illustrate the robustness of our method to the number of neighboring diseases. Briefly, when the number of neighboring diseases is smaller than or equal to 6, the performance of our method exhibits improvement with the increase of the neighbouring diseases. When the number of neighboring diseases is greater than or equal to 8, the performance of our method is quite stable, only showing slight fluctuations with different numbers of neighbouring diseases.
We further explored the reason behind the above observations. Typically, diseases share common causative genes have high phenotype similarity (Wu et al. 2009). Therefore, as the number of neighboring diseases increases, genes causative for a query disease will likely be included in the seed genes for the disease, and hence the performance of our method show improvements. On the opposite side, when the number of neighboring disease is too large, -3 -some redundant, useless or even wrong information about seed genes may be included, and thus the performance of our method on some diseases decreases, preventing further improvement of the overall performance. Moreover, it is obvious that the computational burden will increase as the number of neighboring disease increases. Therefore, to achieve a reasonable balance between the effectiveness and efficiency of our method, we select 10 neighboring diseases by default.

Functional effect scores of rare SNVs
We compared distributions of functional effect scores for these rare SNVs against those for the same number of SNVs selected at random from one of the eight HapMap individuals.
Results show that the rare SNVs are typically assigned more extreme functional effect scores than SNVs occurring in a normal individual (Supplementary Figure 4). Since the sequence conservation property has been used in the derivation of the functional effect scores, we conjecture that the effectiveness of our method in this validation experiment can be partly attributed to the rarity of such rare mutations in a random human.

Distributions of association scores for neutral and disease controls
For each of the 1,436 diseases in the validation experiments for diseases of unknown genetic bases, we calculated association scores for neutral and disease controls for each of the five genomic data sources. We then selected for each data source 100,000 scores at random from the resulting scores to draw distributions of association scores for neutral and disease controls. Results show that the distributions for these two groups of variants are not significantly different (Supplementary Figure 5). This observation helps to explain the reason why an association data source exhibits comparable effectiveness in distinguishing disease-causing SNVs from neutral, disease, and combined controls.

Selection of data sources for individual diseases
We calculated the pairwise Pearson's correlation coefficient of the data sources. Results show that the data sources are correlated (Supplementary Figure 6). We further adopted a sequential backward selection (SBS) strategy, a greedy algorithm, to select a subset of data sources for each of the 1,436 diseases. Briefly, for a certain disease, we started from a set including all data sources, repeatedly removed the data source of the least importance with respect to the remaining data sources until the smallest MRR was achieved, and obtained a subset of selected data sources as the retained ones. The results are summarized in Supplementary Table 4.
First, we observe that the coverage of functional effect data sources for different diseases is almost the same. For association data sources, the coverage of pathway and domain data is lower than that of PPI, whose coverage is in turn lower than that of GO and sequence data.
Moreover, two-sided Fisher's exact tests with multiple testing corrections do not support the existence of biases on the presences of individual data sources in Mendelian and complex diseases. A similar statistical comparison also suggests that the presences of individual data sources in immune and neurological disorders are not significantly different.
Second, we notice that GO, PPI, and pathway data are selected by more than 70% diseases (with the consideration of the presence of a data source before the selection procedure), while all the other data sources are selected by about half of the diseases, suggesting all data sources are important to our method. For Mendelian diseases, the pattern of selection is similar to those for all diseases. For complex diseases, PPI and pathway data are selected by a larger proportion of diseases, suggesting these data may play more important roles for these diseases. However, two-sided Fisher's exact tests with multiple testing corrections do not support the existence of significant difference between Mendelian and complex diseases in the selection of individual data sources. A similar statistical comparison also suggests that the selection of individual data sources is not significantly different between immune diseases and neurological disorders.
Third, we notice that the selection procedure is helpful in improving the performance of our method. For example, using only selected data sources, the average MRR for all the 1,436 diseases improves from 0.1265 to 0.0791, and the average AUC increases from 0.8735 to 0.9209 (Table S2).
Finally, we notice that the selected subsets of data sources are preferred by quite different numbers of diseases (Table S2). There are a total of 308 distinct subsets of data sources selected by the 1,436 diseases. Among these subsets, 273 diseases favor the use of all the 11 data sources, 175 diseases preferring all but the pathway information, and 82 diseases preferring all but the domain information. The total number of diseases preferring the most selected 21 subsets is 911, while the rest 287 subsets are only selected by 525 diseases. When looking deeply at the improvement in the performance of our method with respect to the number of selected data sources (Supplementary Figure 7), we notice that our method is more effective for diseases preferring more data sources, since the performance of our method for a total of 627 (43.66%) diseases that prefer 6 or more data sources is typically higher than that of 809 (56.34%) diseases that prefer 5 or less data sources. For diseases preferring more data sources, the improvement after selection is marginal. For diseases favoring less data sources, the performance does exhibit significant improvement with the selection procedure.
Nevertheless, such diseases exhibit quite different preferences. For example, there are a total of 275 distinct subsets of data sources selected by diseases that prefer 5 or less data sources, and on average each subset is only selected by 809/275  2.94 diseases (Table S2).
From these results, we make the following conclusions. First, the procedure of selecting of a subset of data sources can improve the performance of our approach, especially for more than half diseases that prefer less data sources. Second, the selection procedure should be done for individual diseases, since different diseases show different preferences on the selected data sources and such preferences are diverse. Third, considering that the selection procedure has marginal effects on about 43.66% diseases in our experiment, and for the rest of 56.34% diseases the preference is diverse, we suggest either seeking for the simplicity to use all data sources without selection or resorting to a cross-validation experiment to select a subset of data sources for a query disease when there is no strong prior knowledge indicating the preference of the disease to the data sources. In this paper, we use all data sources without selection by default unless otherwise stated.

Estimation of the false positive rate and true positive rate
To estimate the false positive rate (FPR) and the true positive rate (TPR) at a given q-value cut-off, we generated at random 12 sets of SNVs, each composed of 10,000 SNVs. In each set, the proportion of disease SNVs ranges from 0% to 100% separately. We then calculated q-values for SNVs in every set, estimated the false positive rate as the proportion of neutral SNVs whose q-values are less than or equal to the threshold and the true positive rate at a certain q-value threshold as the proportion of disease SNVs whose q-values are less 154 of these mutations can be mapped to the dbNSFP database. With the criterion that a seed gene should have been reported as associated with autism by independent studies before the publication of this data set, we selected from the OMIM database 34 seed genes and then applied SPRING to prioritize the candidate mutations (Table S3). both supporting the effectiveness of our method. We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 41 disease-associated mutations were significantly less than those of the other candidate mutations (suppose irrelevant to autism) and obtained a p-value of 8.03×10 -9 .
We noticed that CHD8 (Neale et al. 2012), NTNG1 (O'Roak et al. 2011), PTEN (Butler et al. 2005) and TBL1XR1 (Chung et al. 2011) had been previously annotated as causative for autism by independent studies and thus was included in seed genes in the above analysis. To simulate the scenario of identifying causative variants occurring in genes not yet studied, we removed CHD8, NTNG1, PTEN and TBL1XR1 from the seed genes and prioritized the candidate mutations again. We found that SPRING ranked 7 of the reported causative mutations among top 10 and 13 among top 20. Using Fisher's exact test, we estimated that the probability of ranking 7 or more mutations among top 10 by change was 3.85×10 -3 , and that of ranking 13 or more mutations among top 20 by change was 1.22×10 -4 . The one-sided Wilcoxon rank sum test yielded a p-value of 1.10×10 -7 , suggesting the q-values of the reported 41 disease-associated mutations were significantly less than those of the other candidate mutations.

PMID 22495311
From the literature (Neale et al. 2012), we collected 111 unique candidate nonsynonymous de novo mutations from the exome sequencing data of 175 probands, and 107 of these mutations can be mapped to the dbNSFP database. With the criterion that a seed gene should have been reported as associated with autism by independent studies before the publication of this data set, we selected the 34 seed genes identified previously and then applied SPRING to prioritize the candidate mutations (Table S3).
In the literature (Neale et al. 2012), 5 mutations were reported as causative for autism, and none of them occurred in genes previously annotated as causative for this disease. SPRING ranked these mutations at 2, 3, 7, 10 (tie with 3 other mutations), and 52. At the q-value cut-off 0.01, SPRING reported 13 mutations, among which 4 were reported in the literature (Neale et al. 2012). Using Fisher's exact test, we estimated that the probability of ranking 3 or more mutations among top 10 by change was 5.45×10 -3 , and that of ranking 4 or more mutations among top 20 by change was 4.11×10 -3 , both supporting the effectiveness of our method. We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 5 disease-associated mutations were significantly less than those of the other candidate mutations and obtained a p-value of 2.19×10 -3 .

PMID 22542183
From the literature (Iossifov et al. 2012), we collected 1,528 unique candidate nonsynonymous de novo mutations from the exome sequencing data of 343 families, and all these mutations can be mapped to the dbNSFP database. With the criterion that a seed gene should have been reported as associated with autism by independent studies before the publication of this data set, we selected the 34 seed genes identified previously and then applied SPRING to prioritize the candidate mutations (Table S3).
In the literature (Iossifov et al. 2012), 18 mutations were reported "likely gene disrupting" (LGD) in affected children and 116 were reported belonging to FMR1 targets.
SPRING reported 4 such mutations among top 10 and 6 among top 20. At the q-value cut-off 0.01, SPRING reported 43 mutations, among which 13 were reported either LGD or belonging to the FMR1 targets according to the literature (Iossifov et al. 2012). Using Fisher's exact test, we estimated that the probability of ranking 4 or more mutations among top 10 by change was only 7.82×10 -3 , and that of ranking 6 or more mutations among top 20 by change was only 5.62×10 -3 , both supporting the effectiveness of our method. We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 134 disease-associated mutations were significantly less than those of the other candidate mutations and obtained a p-value smaller than 2.2×10 -16 .
We noticed that SCN2A (Weiss et al. 2003) had been previously annotated as causative for autism by an independent study and thus was included in seed genes in the above analysis.
To simulate the scenario of identifying causative variants occurring in genes not yet studied, we removed SCN2A from the seed genes and prioritized the candidate mutations again. We found that SPRING ranked 4 of these mutations among top 10 and 8 among top 20. Using Fisher's exact test, we estimated that the probability of ranking 4 or more mutations among top 10 by change was 7.82×10 -3 , and that of ranking 8 or more mutations among top 20 by change was 1.45×10 -4 . The one-sided Wilcoxon rank sum test yielded a p-value smaller than 2.2×10 -16 , suggesting the q-values of the reported 134 disease-associated mutations were significantly less than those of the other candidate mutations.
In a recent study (Allen et al. 2013), exome sequencing was successfully applied to the detection of causal genetic variants for epileptic encephalopathies, showing strong statistical evidence on the association of several de novo mutations with this group of complex diseases.
We therefore applied SPRING to the analysis of exome sequencing data in this study (PMID -9 -

PMID 23934111
From the literature (Allen et al. 2013), we collected 192 unique candidate nonsynonymous de novo mutations from the exome sequencing data of 264 probands and their parents, and all these mutations can be mapped to the dbNSFP database. With the criterion that a seed gene should have been reported as associated with epileptic encephalopathies by independent studies before the publication of this data set, we selected from the OMIM database a total of 15 seed genes and then applied SPRING to prioritize the candidate mutations (Table S3).
In the literature (Allen et al. 2013), the authors discovered 30 likely functional de novo mutations. SPRING successfully ranked 10 of these mutations among top 10 and 17 among top 20. At the q-value cut-off 0.01, SPRING reported 18 mutations, among which 17 had been reported as likely functional (Allen et al. 2013). Using Fisher's exact test, we estimated that the probability of ranking 10 or more mutations among top 10 by change was only 2.03×10 -9 , and that of ranking 17 or more mutations among top 20 by change was only 1.24×10 -13 , both strongly supporting the effectiveness of our method. We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 30 disease-associated mutations were significantly less than those of the other candidate mutations (suppose irrelevant to epileptic encephalopathies) and obtained a p-value of 1.19×10 -11 .
We noticed that CDKL5 (Rosas-Vargas et al. 2008), KCNQ2 (Borgatti et al. 2004), SCN1A (Ohmori et al. 2002), SCN2A (Meisler and Kearney 2005), SCN8A (Veeramah et al. 2012) and STXBP1 (Saitsu et al. 2011) had been previously annotated as causative for epileptic encephalopathies by an independent study and thus was included in seed genes in the above analysis. To simulate the scenario of identifying causative variants occurring in genes not yet studied, we removed CDKL5, KCNQ2, SCN1A, SCN2A, SCN8A and STXBP1 from the seed genes and prioritized the candidate mutations again. We found that SPRING ranked 4 of these mutations among top 10 and 9 among top 20. Using Fisher's exact test, we estimated that the probability of ranking 4 or more likely functional mutations among top 10 by change was 5.20×10 -2 , and that of ranking 9 or more mutations among top 20 by change was 8.76×10 -4 . The one-sided Wilcoxon rank sum test yielded a p-value of 4.89×10 -8 , suggesting the q-values of the reported 30 disease-associated mutations were significantly less than those of the other candidate mutations.
-10 -We further extended the application of SPRING to the analysis of two independent data sets of intellectual disability (PMID 23033978 (Neale et al. 2012) and PMID 23020937 (Iossifov et al. 2012)).

PMID 23033978
From the literature (Neale et al. 2012), we collected 78 unique candidate nonsynonymous de novo mutations from the exome sequencing data of 53 patients, and 77 of these mutations can be mapped to the dbNSFP database. Since there is no exact entry for intellectual disability in the OMIM database, we selected the 15 genes associated with epileptic encephalopathies as seeds (Table S3).
In the literature (Neale et al. 2012), the authors discovered 16 risk contributing de novo mutations for intellectual disability. When looking at the ranking list produced by SPRING, we found that 3 mutations ranked among top 10 and 8 among top 20 were reported as associated with intellectual disability. At the q-value cut-off 0.1, SPRING reported 30 mutations, among which 10 had been reported as associated with intellectual disability according to the literature (Neale et al. 2012). Using Fisher's exact test, we estimated that the probability of ranking 3 or more mutations among top 10 by change was 0.34, and that of ranking 8 or more mutations among top 20 by change was 1.90×10 -2 . We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 16 risk contributing mutations were significantly less than those of the other candidate mutations and obtained a p-value of 3.02×10 -2 .
We noticed that SCN2A (Meisler and Kearney 2005) had been previously annotated as causative for intellectual disability by an independent study and thus was included in seed genes in the above analysis. To simulate the scenario of identifying causative variants occurring in genes not yet studied, we removed SCN2A from the seed genes and prioritized the candidate mutations again. We found that SPRING ranked 4 of these mutations among top 10 and 8 among top 20. Using Fisher's exact test, we estimated that the probability of ranking 4 or more mutations among top 10 by change was 0.12, and that of ranking 8 or more mutations among top 20 by change was 1.90×10 -2 . The one-sided Wilcoxon rank sum test yielded a p-value of 3.57×10 -2 , suggesting the q-values of the reported 16 disease-associated mutations were significantly less than those of the other candidate mutations.

PMID 23020937
From the literature (Iossifov et al. 2012), we collected 131 unique candidate nonsynonymous de novo mutations from the exome sequencing data of 51 patients, and 126 of these mutations can be mapped to the dbNSFP database. We also selected the 15 genes associated with epileptic encephalopathies as seeds (Table S3).
In the literature (Iossifov et al. 2012), the authors discovered 17 possibly disease-causing mutations. When looking at the ranking list produced by SPRING, we found that 6 mutations ranked among top 10 and 7 among top 20 were reported as associated with intellectual disability. At the q-value cut-off 0.01, SPRING reported 5 mutations, and all of them were reported as disease-causing in the literature (Iossifov et al. 2012). Using Fisher's exact test, we estimated that the probability of ranking 6 or more mutations among top 10 by change was 3.79×10 -4 , and that of ranking 7 or more mutations among top 20 by change was 6.36×10 -3 , both supporting the effectiveness of our method. We also adopted the one-sided Wilcoxon rank sum test to check whether the q-values of the reported 17 disease-associated mutations were significantly less than those of the other candidate mutations and obtained a p-value of 2.63×10 -5 .
We noticed that SCN2A (Meisler and Kearney 2005), SCN8A (Veeramah et al. 2012) and STXBP1 (Saitsu et al. 2011) had been previously annotated as causative for intellectual disability by an independent study and thus was included in seed genes in the above analysis.
To simulate the scenario of identifying causative variants occurring in genes not yet studied, we removed SCN2A, SCN8A and STXBP1 from the seed genes and prioritized the candidate mutations again. We found that SPRING ranked 4 of these mutations among top 10 and 6 among top 20. Using Fisher's exact test, we estimated that the probability of ranking 4 or more mutations among top 10 by change was 2.92×10 -2 , and that of ranking 6 or more mutations among top 20 by change was 2.97×10 -2 . The one-sided Wilcoxon rank sum test yielded a p-value of 6.02×10 -5 , suggesting the q-values of the reported 17 disease-associated mutations were significantly less than those of the other candidate mutations.
These results confirm the capability of SPRING in identifying causative nonsynonymous de novo mutations for complex diseases, not only in the case of finding novel causative variants occurring in known causative genes, but also in the situation of detecting causative variants occurring in genes not yet studied.