Exome Sequencing in 53 Sporadic Cases of Schizophrenia Identifies 18 Putative Candidate Genes

Schizophrenia (SCZ) is a severe, debilitating mental illness which has a significant genetic component. The identification of genetic factors related to SCZ has been challenging and these factors remain largely unknown. To evaluate the contribution of de novo variants (DNVs) to SCZ, we sequenced the exomes of 53 individuals with sporadic SCZ and of their non-affected parents. We identified 49 DNVs, 18 of which were predicted to alter gene function, including 13 damaging missense mutations, 2 conserved splice site mutations, 2 nonsense mutations, and 1 frameshift deletion. The average number of exonic DNV per proband was 0.88, which corresponds to an exonic point mutation rate of 1.7×10−8 per nucleotide per generation. The non-synonymous-to-synonymous mutation ratio of 2.06 did not differ from neutral expectations. Overall, this study provides a list of 18 putative candidate genes for sporadic SCZ, and when combined with the results of similar reports, identifies a second proband carrying a non-synonymous DNV in the RGS12 gene.


Introduction
Schizophrenia (SCZ) is a severe mental illness that has a population prevalence of 0.4 to 0.8% [1]. Individuals with SCZ experience psychosis, poor social functioning, and cognitive impairments. SCZ is a highly heritable disorder encouraging research into the genetic component of the disorder [2]. Large genome-wide studies have recently suggested that common single nucleotide polymorphisms (SNPs) collectively account for at least 32% of the variance in SCZ liability [3,4,5,6]. Rare copy number variants (CNVs) have also been shown to contribute to SCZ risk [7,8,9,10]. Despite these encouraging advances, many of the genetic components of the disease await identification. Besides inherited variants, new mutations may also contribute to risk.
A sizeable proportion of SCZ patients do not have a family history of the disease and it has been hypothesized that de novo variants (DNVs) could account for some of these sporadic cases [11]. Several lines of evidence support this hypothesis. Firstly, despite the markedly reduced reproductive rate among SCZ patients, the prevalence of the disorder has remained constant in the general population (0.4-0.8%) [12]. DNVs, which are not subject to negative selection, provide an explanation for risk alleles remaining frequent in the population. Secondly, it has been shown that DNV load correlates with paternal age [13], which could explain why older males who have accumulated new mutations are more likely than younger males to father schizophrenic children [14]. Thirdly, DNVs tend to be more deleterious to gene function than inherited ones because they have not been subjected to evolutionary selection (with the exception of mutations incompatible with life) and are thus disease-prone [15]. Consistently, individuals with sporadic SCZ carry significantly more de novo CNVs than do controls, whereas individuals with familial SCZ do not [16]. However, subsequent studies confirming a higher rate of de novo CNVs among individuals with SCZ compared with controls did not reveal a significant difference between individuals with sporadic or familial SCZ [17,18].
It has become evident that some fraction of disease alleles may occur as de novo events. With recent advances in high-throughput sequencing technologies, one can now identify a considerable fraction of de novo variation, including single nucleotide variants (SNVs) and small insertions/deletions (indels), thus increasing the proportion of disease risk that can be explained. Five studies have analyzed the contribution of de novo variants in sporadic cases of SCZ [19,20,21,22,23]. Collectively, these studies reported a mutation rate of ,1.5610 28 mutation per base per generation, which is consistent with neutral expectations. Regarding the burden of protein-altering variants, some discrepancies exist among these studies. Indeed, Xu et al. (2011Xu et al. ( & 2012 observed that individuals with SCZ exhibited a significantly higher ratio of non-synonymous-to-synonymous SNVs compared with controls, whereas Fromer et al. (2014) did not find any significant enrichment.
Here, we further explored the contribution of DNVs to the etiology of sporadic SCZ by sequencing the exomes of 53 affected individuals and of their healthy parents.

Exome sequencing data
Exome sequencing was conducted for 159 individuals. On average, 206 (¡78 SD) million reads were produced per sample. Of these, 198 (¡76 SD) million mapped to the reference genome (hg19). After the removal of duplicate reads, 150 (¡51 SD) million reads remained. Among these reads, 113 (¡40 SD) million were ontarget. These reads represented an average coverage of at least 86 for 95.01% (¡4.8 SD) of the coding portion of the RefSeq genes. On average, 24,401 (¡1,407 SD) variants were detected per individual. Table S2 summarizes the exome sequencing results for each individual.

Identification of damaging de novo variants
In our cohort of 53 proband-parent trios, we identified 47 exonic and 2 intronic conserved splice site DNVs, which were not previously reported in any public SNP database ( Table 1). The observed rate of de novo events in the RefSeq proteincoding exons was 0.88, corresponding to an exonic point mutation rate of 1.7610 28 , in accord with previous reports [21,24,25,26,27]. The distribution of the number of DNVs per trio did not differ from the Poisson distribution (Chisquare goodness of fit test: p50.42; Figure S1). Father's age at childbirth had no significant effect on the number of DNVs in the offspring (R 2 50.0113; p50.44; Figure S2A). However, when we separated the father according to the median age at childbirth, we observed a trend towards more DNVs in the children of the oldest fathers (0.79 and 1.08 DNVs in the offspring of the youngest (19-29 years old) and the oldest (30-52 years old) fathers, respectively; one-tailed Student's t-Test; p50.097; Figure S2B).
In addition, we observed four instances of variant alleles being present at low frequencies, ranging from 14 to 21% of reads, indicative of somatic mosaicism in blood cells. We found one nonsense mutation (p.K854X) in DSG3 and three missense mutations: p.S1006F in NLRP11, p.G1352D in LRRC7, and p.R1177C in CC2D2A. These 4 events were confirmed by Sanger sequencing, which showed a similar level of allelic imbalance ( Figure S3). As these post-zygotic mutations were not necessarily present in the brain of the patients, we did not analyze them further.
Among the 49 DNVs, we discovered 15 synonymous and 34 protein-altering DNVs, including 29 missense mutations, 2 nonsense mutations, 2 conserved splice-site mutations, and 1 frameshift insertion/deletion. The non-synonymousto-synonymous mutation ratio of 2.06 was similar to neutral expectations (2.23) [28]. We did not observe a significant increase in the relative rate of loss-offunction-to-missense mutations in our cohort when compared to controls from published data sets [21,22,24,26,29,30] (0.17 versus 0.13, respectively; X-squared 50.0765, p-value 50.78). The transition-to-transversion ratio for coding sequences was 2.61, consistent with neutral expectations [31]. Thirteen out of 29 missense DNVs were classified as damaging by at least 2 of the 3 prediction algorithms used (SIFT, Polyphen2 and Mutation Taster) ( Table 1). Among the nonsense mutations, the first creates a stop codon at residue 599 of the MAP4K4 protein, which would result in a truncation mutant lacking the 617 C-terminal amino acids (1273, R599X). The second nonsense mutation is in codon 171 of the CHRNG gene, which would result in a truncation mutant lacking the 376 Cterminal amino acids (517, Q171X). Concerning the splice site mutations, both affect a conserved ''AG'' dinucleotide of an acceptor site, one within intron 14 of the EIF3B gene and the other within intron 15 of the SETD1A gene; both are predicted to significantly impact normal splicing (Table S3). Finally, the indel variant corresponds to a single-nucleotide deletion in the FN1 gene, resulting in a frameshift at Ala93 and a premature stop codon after the introduction of 26 amino acids. The FN1 gene produces multiple protein isoforms, the shortest of which contains 657 amino acids (NM_054034.2). Globally, 36% of the DNVs identified in this study (18 out of 49) were predicted as damaging (missense) or loss-of-function (nonsense, conserved splice site and frameshift variants).
We did not observe genes recurrently mutated in our cohort. However, when data from all available studies (including ours) were combined [19,20,21,22,23], 21 genes were found to be recurrently mutated with likely damaging DNVs trios (Table S4). Among these 21 genes, 13 were found to carry non-synonymous de novo SNVs. We then determined the probability of such de novo events occurring in these 13 genes based on each gene-specific mutation rate and the total number of non-synonymous de novo SNVs observed across all five datasets (1,020 SCZ trios). The number of DNVs observed in these 13 genes was in agreement with the null expectation ( Table 2). As splice-site mutations and indels were not accounted for in the mutability calculation, we did not determine the probability of observing multiple events in genes carrying these types of mutations. This analysis, using data from all available studies, revealed a short list of 21 candidate genes, among which some may be confirmed as true risk genes upon the analysis of additional sporadic cases. Functional in silico analysis of 375 genes affected by protein-altering de novo variants (missense predicted as probably damaging by Polyphen, nonsense, conserved splice site (¡2) and frameshift Indels) [20,21,22,23] and this study (Table S5) revealed a marginally significant enrichment for genes involved in cellular component morphogenesis (GOTERM_BP_FAT Gene Ontology, Benjamini p-value 53.1610 22 ) and genes expressed in brain tissues (TISSUE Expression, Benjamini p-value 55.4610 28 ).

Discussion
Similar to other studies [19,20,21,22,23], we used whole-exome sequencing to estimate the contribution of protein-altering de novo mutations to sporadic SCZ and to identify susceptibility genes. Our study reveals that 34% of the sporadic cases analyzed (18 out of 53 cases) carried a predicted damaging de novo variant, and our results provide a list of 18 putative candidate genes. The average number of exonic DNV per proband was 0.88, which corresponds to an exonic point mutation rate of 1.7610 28 per nucleotide per generation. The non-synonymous-to-synonymous and loss-of-function-to-missense ratios did not differ from expectations or from those of controls, suggesting no difference in mutational processes in sporadic cases of schizophrenia. In a recent paper, Kong et al. (2012) reported on the importance of father's age on the risk of disease, including SCZ [13]. This group convincingly showed that the number of de novo mutations in one offspring can be explained by the father's age at childbirth. We did not observe any positive correlation between these two variables, probably owing to the limited size of our sample and the range of the fathers' ages at childbirth ( Figure S2A,B). Considering all five of the above-mentioned studies, 1,020 SCZ trios have been sequenced, and 21 genes were found mutated in more than one SCZ-affected individual (Table S4). Among these 21 genes, RGS12 was found mutated in this study (Table 2). At this stage, none of these genes can be considered definitive risk genes for SCZ, as the recurrence of DNVs is not significant after genome-wide correction ( Table 2). Because the estimated mutation rate for indels is less accurate than for SNVs, we did not calculate the probability of observing such events in LAMA2. However, the fact that there are 3 occurrences of probably damaging DNVs in this gene makes it a candidate for SCZ risk. Larger samples size will be required to unequivocally identify true risk variants in specific genes.
Functional in silico analysis of 375 genes affected by protein-altering de novo variants revealed a significant enrichment in genes expressed in brain tissues and genes involved in cellular component morphogenesis. These preliminary results suggest that genes involved in neuronal morphogenesis could be relevant for the altered neurodevelopmental processes in schizophrenia [32]. Among the 18 candidate genes identified in our study, RGS12 has emerged as one of the most interesting candidates in this neurodevelopmental context, as RGS12 is known to coordinate the Ras-dependent signals required for promoting and/or maintaining neuronal differentiation [33], a process that is perturbed in schizophrenic patients [32]. The RGS12 gene encodes a member of the 'regulator of G protein signaling' (RGS) gene family [34]. In PC12 cells and primary dorsal root ganglion neurons, RGS12 sustains nerve growth factor (NGF)-driven ERK activity and, therefore, neurite outgrowth by scaffolding a complete MAPK cascade, including the NGFreceptor TrkA, activated H-Ras, B-Raf, MEK2, and ERK proteins [33]. In addition, RGS12 is observed to be among the most down-regulated proteins in sensory-deprived barrel cortex synapses, suggesting that it may participate in the molecular mechanisms of sensory development [35].

Schizophrenia trios
The schizophrenia trios (case and healthy parents) were collected at 5 different psychiatric hospitals (Text S1). In each of the families selected, the proband had a diagnosis of schizophrenia, schizoaffective disorder or non-organic psychosis based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). Families for which we could not exclude the presence of psychosis, major depression, bipolar disorder, autism, mental retardation, learning and developmental delays, multiple hospitalizations in psychiatric units, or any somatic disorders in first and second-degree relatives were not considered. This research project was approved by the ethics committee of all participating centers (Ethics committee of the University Hospitals of Geneva). All participants or legally authorized representatives provided their written informed consent.

DNA extraction and exome sequencing
Genomic DNA extracted from blood was used except for sample SP-226-003 (Mother), whose DNA was extracted from lymphoblastoid cell line. Whole genome amplification was performed on DNA from SP-198-003 (Mother) and 16(1)-Father using the REPLI-g Mini Kit (Qiagen). Exome capture was conducted using the SureSelect Human ALL Exon kits (Agilent Technologies). Highthroughput sequencing was performed on a HiSeq2000 (Illumina). Fastq files were processed by our ''in-house'' pipeline running on the Vital-IT (http://www. vital-it.ch) Center for high-performance computing of the Swiss Institute of Bioinformatics (SIB) [36].

Identification of de novo variants
De novo variants were identified using VariantMaster [36]. Practically, heterozygous variants detected in the proband with SAMtools and PINDEL quality scores 100 and 600, respectively were retained for subsequent analysis. These variants were filtered so as to exclude variants with a MAF.0.01 in dbSNP (http://www.ncbi.nlm.nih.gov/SNP/), 1000Genomes [37], and Exome Variant Server (EVS, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/)), and variants found within segmental duplications. VariantMaster utilizes the raw (BAM) data to robustly estimate the conditional probability of a variant to be present in the parents. All variants classified as de novo by VariantMaster, were subsequently visually inspected using the SAMtools text alignment viewer. Finally, these candidate variants were validated using Sanger sequencing on each family member. The validation rate was at 100% for the SNVs. Most PINDEL calls classified as de novo were rejected during visual inspection mainly due to miscalling of indels in homopolymer tracts or trinucleotide repeats.

Statistical analysis
The mutation rate (M) of each of the RefSeq gene was calculated by adding up mutation rates of each nucleotide taking into account the mutation rates at CpG and non-CpG sites [13].
We used a germline mutation rate of (i) 6.18610 28 per base per generation for transition at non-CpG sites, (ii) 1.12610 27 per base per generation for transition at CpG sites, (iii) 3.76610 29 per base per generation for transversion at CpG site and (iv) 9.59610 29 per base per generation for transversion at CpG sites. Accordingly, the probability of finding a single mutation in a gene G was  667{N . Independently, we may also evaluate the probability of these N mutations in G to be non-synonymous. For each amino acid A i , the number of non synonymous mutations can be easily calculated as 3 2 {#Ai where #A is the number of codons coding for the amino acid A. Thus, the probability that a mutation in a gene with coding sequence A 1 ,:::::,A L f gwill be non-synonymous is P G (nonsyn)~1 L X L i 3 2 {#Ai 3 2 . For each gene, the final P value is calculated as P G (N):P G (nonsyn) N .

Functional In Silico Analysis
Functional annotation of genes carrying protein-altering de novo variants was performed by Gene Ontology analysis, using a modified Fisher's exact test with Benjamini correction for multiple testing as implemented in DAVID (http:// david.abcc.ncifcrf.gov/).