A Genome-Wide Investigation of SNPs and CNVs in Schizophrenia

We report a genome-wide assessment of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) in schizophrenia. We investigated SNPs using 871 patients and 863 controls, following up the top hits in four independent cohorts comprising 1,460 patients and 12,995 controls, all of European origin. We found no genome-wide significant associations, nor could we provide support for any previously reported candidate gene or genome-wide associations. We went on to examine CNVs using a subset of 1,013 cases and 1,084 controls of European ancestry, and a further set of 60 cases and 64 controls of African ancestry. We found that eight cases and zero controls carried deletions greater than 2 Mb, of which two, at 8p22 and 16p13.11-p12.4, are newly reported here. A further evaluation of 1,378 controls identified no deletions greater than 2 Mb, suggesting a high prior probability of disease involvement when such deletions are observed in cases. We also provide further evidence for some smaller, previously reported, schizophrenia-associated CNVs, such as those in NRXN1 and APBA2. We could not provide strong support for the hypothesis that schizophrenia patients have a significantly greater “load” of large (>100 kb), rare CNVs, nor could we find common CNVs that associate with schizophrenia. Finally, we did not provide support for the suggestion that schizophrenia-associated CNVs may preferentially disrupt genes in neurodevelopmental pathways. Collectively, these analyses provide the first integrated study of SNPs and CNVs in schizophrenia and support the emerging view that rare deleterious variants may be more important in schizophrenia predisposition than common polymorphisms. While our analyses do not suggest that implicated CNVs impinge on particular key pathways, we do support the contribution of specific genomic regions in schizophrenia, presumably due to recurrent mutation. On balance, these data suggest that very few schizophrenia patients share identical genomic causation, potentially complicating efforts to personalize treatment regimens.

checked the MAF report from PLINK [1] against the original locus report generated by the genotyping facility. We assured that the two MAF reports matched exactly.

Specification of gender
In order to confirm the gender assignment obtained from the phenotype database, we used the observed genotypes of SNPs on chromosome X, and Y where available. We individually inspected all individuals who were marked as "male" but with significant amount of heterozygous X genotypes (>=1%), or who were marked as "female" but with high frequency of homozygous X genotypes (>=80%) or Y genotype readings using the original data source. If no satisfactory correction could be obtained, these individuals were excluded from further analyses. In total we excluded 3 Munich control participants, 5 Aberdeen cases, and 2 Aberdeen control participants in this step.

Cryptic relatedness
We took this step to check for unexpected relatedness between study participants. We estimated the sharing of genetic information by calculating identity by state (IBS) using the PLINK software. All pairs of DNA samples showing ≥ 0.125 were individually inspected, and one sample in each pair was excluded from further analyses. In total, 2 Munich cases, 2 Munich controls, 4 Aberdeen cases, and 16 Aberdeen controls were removed in this step.

Skewed missingness
We took this step to determine whether the missing genotypes were skewed towards cases or controls and hence may give rise to spurious association p values. We used PLINK software to perform this check on the most associated SNPs and found no evidence of a missing data bias.

Hardy-Weinberg Equilibrium (HWE)
We performed this check using PLINK software on the most associated SNPs, separately in the Munich and Aberdeen cohorts. We defined a deviation from HWE with a criterion of p value less than 0.05 both in controls and in controls plus cases. No genotype data violated this check.

Recheck of the genotyping quality
We took this step as an individual recheck on the raw and normalized data for the most associated SNPs to be sure that they are called correctly as described in "Infinium BeadStudio Raw Data Analysis" process.

EIGENSTRAT outliers
We removed nine individuals (7 Munich cases, 1 Munich control, 1 Aberdeen case) because they were extreme outliers on one or more significant EIGENSTRAT axes.

Low MAF
We removed 939 SNPs with a MAF<0.0015. This criterion ensured that at least 6 individuals of the rare genotype were present in the combined Munich+Aberdeen dataset, to control for error in the estimation of asymptotic p values.

Curated EIGENSTRAT analyses
We selected EIGENSTRAT axes for use as covariates to adjust for ancestry in subsequent logistic regression analyses as follows. (1) To find EIGENSTRAT axes, we started with autosomal SNPs with MAF>0.01 from the "overlap" set of SNPs typed in both Munich and Aberdeen. (2) On inspection of SNP loadings for each PC axis (the "gamma" coefficients of Price et al. 2006 [2] ), we found several of the top 10 axes to be dominated by a small number of SNPs all mapping to the same region of the genome. For example, PC axis 2 was found to be dominated by SNPs mapping to a region of chr8p22-23.1 coinciding with a known inversion polymorphism. (3) To correct for these LD effects, and ensure that EIGENSTRAT axes reflected only effects that applied equally across the whole genome (as ancestry effects should), we re-applied principal components analysis to a reduced SNP set in which (i) certain known high LD regions were excluded (chr8:8000000..12000000, chr6:25000000..33500000, chr11:45000000..57000000, chr5:44000000..51500000); (ii) SNPs were thinned using the "--indeppairwise" option in PLINK software, such that all SNPs within a window size of 1500 (step size of 150) were required to have r 2 <0.2; (iii) Each SNP was regressed on the previous 5 SNPs, and the residual entered into the PCA analysis, as suggested by Patterson et al (2006) [3] . (4) Inspection of SNP loadings on all axes deemed significant by the Tracy-Widom method of Patterson et al (2006) [3], using Q-Q plots against Normal expectation, now revealed no axes dominated by single high-LD regions of the genome. However, inspection of individual scores revealed 9 "outlier" individuals who exerted a large influence on one or more PC axes. These 9 individuals were removed and PCA reapplied to the remaining dataset. (5) Tracy-Widom tests nominated the first 4 resulting PC axes as significant (p<0.05). In addition, schizophrenia case-control status associated significantly with PC axes 2-4, but no others in the first 10 axes. We therefore adopted the first four PC axes as covariates in subsequent analyses. PC axis 1 associated very clearly with division into Munich and Aberdeen samples (r=0.91) so to prevent multicollinearity the Munich/Aberdeen sampling location was not included as a covariate.
Finally, we note that hidden population structure appears to be low in our data. We calculated the Devlin and Roeder inflation factor λ [4,5], adjusting for the known Munich/Aberdeen split, and obtained a value of 1.013 indicating that stratification was very slight and did not significantly affect the genetic associations we detected.

Comparison of methods for correcting multiple testing
We compared a simple Bonferroni correction against a randomization strategy for adjusting for multiple comparisons across correlated tests (here due to SNPs in LD). Since we wished to adjust for hidden stratification using EIGENSTRAT covariates, and since simple permutation of the raw y variate is known under certain conditions to lead to destabilized Type I error in the presence of covariates [6] we adopted the following randomization strategy. We first entered the y variate (here, case-control status for schizophrenia) into a logistic regression against the reduced model made up only of covariates (gender plus 4 EIGENSTRAT axes). We then created 1000 randomized datasets by drawing values for y from a Bernoulli distribution with probability derived from the fitted values of the reduced model logistic regression. Multiplicity-adjusted p-values were obtained from the proportion of randomized datasets containing a p value (for any SNP) equal to or less than the observed p value in question. We found that Bonferroni-adjusted values (found by multiplying the raw p value by the number of separate tests, here corresponding to 312,565 SNPs) resulted in adjusted p values that were always larger than the p values adjusted from the randomization strategy (as one would expect for positively-correlated tests), but always within a factor of two for the top hits in our dataset. We therefore found Bonferroni correction to be conservative, but acceptably so.
We provided a joint p value for the most associated SNPs, combined across all arms of the study. We used Stouffer's method [7] to combine p values from the genome-wide and replication stages, weighted according to sample size. We applied a Bonferroni correction for multiple testing, using all SNPs entered into the main part of Stage 1 analysis (n=312,565). Note that more SNPs were typed in the Aberdeen cohort, but these were not used in the combined analysis of Munich+Aberdeen data as they were not typed in the Munich cohort. The total of 312,565 SNPs is the correct number to use in Bonferroni correction as in principle any of these SNPs could have been entered into the replication study under the Null hypothesis. The total ADAMTSL3 mRNA expression was determined using a custom designed Taqman assay targeting exon 30: using forward primer CTCAAGTTGCAGGTTTTCAACAGTT, reverse primer GATCTATGAAAATGCCATTAATGCCAACA, probe CTGGCCAGAGCTTCTA, and using a commercially available Taqman assay for ADAMTSL3 expression targeting the exon 17-18 boundary.

Genome-wide ExonArray Study
In a previously conducted study we assessed on a genome-wide scale how common genetic variation regulates the expression level of both whole transcripts and specific exons (as a measure of splicing) in human cortical brain tissue collected from healthy control subjects [8]. In this study, the Affymetrix Human ST 1.0 Exon chips were used to assess exonic and transcript expression levels. Genome-wide genotyping was performed using the HumanHap 550 whole-genome chips. The association study, relating genetic variants included in the gene and the regions immediately surrounding it to the exon and the transcript level, was conducted using PLINK software (linear regression model) incorporating age, gender, postmortem interval (brain tissue samples), and processing days.
To assess whether genetic variation influences these alternative splicing events, we examined 30 samples of subjects with no known neuropsychiatric conditions and another 32 with a diagnosis of Alzheimer's Disease, from the Kathleen Price Bryan Alzheimer's Disease Brain Bank. We first confirmed the presence of all four alternative transcripts in human prefrontal and temporal cortex ( Figure S1C). We next designed quantitative real time PCR assays to determine the abundance of each of the four alternative transcripts ( Figure S2). We found that the rs950169 and rs2135551 polymorphisms show highly significant correlation with the use of the alternative splice acceptor site, resulting in a truncated PLAC domain. There is a four-fold increase in the RSD-ASA form in the minor-allele homozygotes compared with the major allele homozygotes (p<.0001, Figure S2), with a smaller effect on use of the ASD ( Figure S3). Finally, we note that the effect of the polymorphisms on the absolute amount of the RSD-ASA is much higher than on the RSD-RSA transcript (the only 7 transcript form with a fully intact PLAC domain) suggesting the possibility that the truncating forms may represent a gain of function rather than reducing the amount of transcript encoding a protein with a functional PLAC domain.
Consistent with a role of these polymorphisms in splicing, rs950169 is located 10bp into exon 30 and the minor allele is predicted to increase binding of the SRp55 splicing factor (using the ESE finder software (http://rulai.cshl.edu/tools/ESE) [10], the C allele scores 2.91 and the T allele scores 5.16), and is a primary candidate for mediating the observed splicing change.

Expression of ADAMTSL3
ADAMTSL3 is a widely-expressed glycoprotein of unknown function that contains immunoglobulin (Ig) and thrombospondin I domains with a postulated role in the extracellular matrix [11]. In the mouse, Adamtsl3 mRNA is expressed in many brain regions (see the Allen Mouse Brain Atlas, http://www.brain-map.org), and, strikingly, is particularly high in the hippocampal formation, especially in the pyramidal cell layer of the CA1 and CA3 regions ( Figure S4). This expression corresponds with the anatomical observation that schizophrenia is associated with both smaller pyramidal cells [12,13] and altered arborization in the CA1 and CA3 regions (though some references suggest that CA1 may be relatively spared [14]).