Probing the Association between Early Evolutionary Markers and Schizophrenia

Schizophrenia is suggested to be a by-product of the evolution in humans, a compromise for our language, creative thinking and cognitive abilities, and thus, essentially, a human disorder. The time of its origin during the course of human evolution remains unclear. Here we investigate several markers of early human evolution and their relationship to the genetic risk of schizophrenia. We tested the schizophrenia evolutionary hypothesis by analyzing genome-wide association studies of schizophrenia and other human phenotypes in a statistical framework suited for polygenic architectures. We analyzed evolutionary proxy measures: human accelerated regions, segmental duplications, and ohnologs, representing various time periods of human evolution for overlap with the human genomic loci associated with schizophrenia. Polygenic enrichment plots suggest a higher prevalence of schizophrenia associations in human accelerated regions, segmental duplications and ohnologs. However, the enrichment is mostly accounted for by linkage disequilibrium, especially with functional elements like introns and untranslated regions. Our results did not provide clear evidence that markers of early human evolution are more likely associated with schizophrenia. While SNPs associated with schizophrenia are enriched in HAR, Ohno and SD regions, the enrichment seems to be mediated by affiliation to known genomic enrichment categories. Taken together with previous results, these findings suggest that schizophrenia risk may have mainly developed more recently in human evolution.

Introduction Schizophrenia has affected humans throughout history; it has heritability between 60-80% [1] and a global prevalence of around 1% [2,3]. It is characterized by hallucinations and delusions, often involving language and thought disorders, and higher order cognitive dysfunction [4]. Hence, schizophrenia has been suggested to represent, in part, a by-product of adaptive changes during the hominization process [3,5]. Archaeological and paleontological findings may provide us with evidence of the cultural and anatomical changes in skeletal structure, however, mental and psychiatric changes are hard to trace.
Humans differ from their ancestors on a range of skills including language, creativity, metacognition, executive function, and cooperation. [6,7] These are important characteristics that involved genetic changes leading to functional advantages, with cultural and societal effects. These changes which set us apart from our ancestors could have also made us vulnerable to psychiatric disorders like schizophrenia, plausibly a human-specific disease. [5,8] It is still unknown at which stage of human evolution the risk factors for schizophrenia emerged. We have previously found evidence that genetic risk appeared during the divergence of modern Homo sapiens from Homo Neanderthalensis. [9] As every stage of evolution was driven by genetic changes, it is possible that gene variants associated with schizophrenia emerged even earlier, for example, when new world and old world monkeys split into different branches of the evolutionary tree [10], or even as early as when vertebrates appeared on earth. [11]Building on the recent progress in genome-wide association studies (GWAS) [12], we now have the opportunity to investigate the origin of psychosis further back in the evolution. The human genome consists of evolutionarily new and ancient regions. By investigating single nucleotide polymorphisms (SNPs) association with schizophrenia in these regions, it is possible to roughly estimate when the risk potential appeared during evolution.
Here, we investigated schizophrenia SNPs located in human accelerated regions, segmental duplications and ohnologs to determine to what extent SNPs tagging these regions, which are proxies for various periods in early evolutionary history, are associated with schizophrenia or other human traits and diseases. These represent early evolutionary markers older than 200,000 years when modern humans are regarded to have first appeared on earth. [13] Human accelerated regions (HAR) are DNA sequences that experienced rapid changes after the divergence of humans (including our hominin ancestors) from chimpanzees after remaining constant throughout primate evolution. [14] While most HARs are exclusively non-coding sequences, research suggests that many HARs are developmental gene regulatory elements and RNA genes, most of which evolved their uniquely human mutations through positive selection before divergence of archaic hominins and diversification of modern humans [15]. These regions harbor genes that have been shown to play important roles in neurocognitive development [14,16]. It is known that various genomic regions show differential enrichment, [17] i.e. that certain genomic regions are more likely to harbor gene variants associated with human traits and diseases. Further, an enrichment of schizophrenia risk loci could be due to a general effect of brain related genes, [12] not necessarily of those implicated in evolution. Moreover, according to recent GWAS, gene variants at the major histocompatibility complex (MHC) region on human chromosome 6p22.1 are implicated in schizophrenia [12,18,19] and play a role in evolution. [20] Thus, it is important to disentangle the effect of these mediating factors from the effect of the evolutionary proxies.
Segmental duplications (SD), also known as low copy repeats, are a known source of genetic instability and evolution [10]. Most duplication in the hominid branch can be traced to events 35-40 million years ago [10], marking an early stage of hominid/primate evolution. Increased duplications are seen in regions that distinguish the hominid branch from other primates.
Eichler et al. [21] found that the primate ancestral branch leading to human and African great apes showed the most significant increase in duplication activity both in terms of base pairs and in terms of events. In light of the importance of SDs in contributing to copy-number changes associated with neurocognitive disease [22], it is suggested that this apparent acceleration had a profound impact on the reproductive success, adaptability and evolution of ancestral hominid populations [21].
Ohnologs (Ohno) are genes retained after whole genome duplication events, unlike segmental duplications which are duplications of smaller chunks of the genome. They are often over represented in copy number variations (CNVs) that cause complex neurodevelopmental disorders like schizophrenia, autism spectrum disorders, neurodevelopmental delay, intellectual disability and epilepsy [15,23]. Whole genome duplications events are said to have occurred early in the vertebrate lineage around 500 million years ago [24] and the human genome contains many more duplications than would be expected by chance. These are evolutionarily important since gene duplication and divergence are the primary source of new genes in eukaryotes [25,26].
We used a statistical framework suited for polygenic architectures [17,27]. These methods have been applied, with success, to GWAS of complex human phenotypes to probe the overlap between phenotypes association and Neanderthal selective sweep [9]. Here, we use them to investigate if SNPs in regions of early human evolution are enriched of association with schizophrenia, while controlling for potential confounders.

Samples
We obtained summary statistics for single nucleotide polymorphisms (SNPs) from genomewide association studies (GWAS) of schizophrenia (conducted by the Psychiatric Genomics Consortium (PGC)) [12] and other phenotypes representing a selection of morphological, cardiovascular, immunological and psychiatric phenotypes. They include anthropometric measures (body mass index (BMI) [28], height) [29], a cardiovascular disease risk factor (triglycerides (TG)) [30], an immune-mediated disease (Crohn's disease (CD)) [31] a neurological disorder (Alzheimer's disease (AD) [32] as well as another psychiatric disorder (bipolar disorder (BD)) [33] (S1 Table). In order to make the data comparable all summary statistics were aligned to a set of about 2.5 million variants.

Evolutionary proxies
We computed LD weighted HAR, SD and Ohno regional affiliation scores following the procedure detailed below.
Human accelerated region (HAR) score. The HAR score indicates a SNP's affiliation to HAR. All SNPs were first assigned a raw HAR indicator value of 1 or 0 according to whether they fell inside or outside any HAR, and subsequently an LD-weighted score. The list of HAR was obtained from http://www.broadinstitute.org/scientific-community/science/projects/ mammals-models/29-mammals-project-supplementary-info.
Human segmental duplication (SD) score. The human SD score indicates a SNP's affiliation to SD regions in humans. All SNPs were first assigned a raw score of 1 or 0 and subsequently an LD-weighted score. The lists of segmental duplications were downloaded from the SD database at http://humanparalogy.gs.washington.edu/.
Ohnolog (Ohno) score. This score indicates a SNP's affiliation to Ohno regions. All SNPs were first assigned a raw score of 1 or 0 and subsequently an LD-weighted score. The lists of Ohno regions was obtained from McLysaght et al. [25].

Confounding/mediating factors
We investigated if the following factors could affect the evolutionary enrichment of schizophrenia associations.
Brain genes. We used the NCBI resource (http://www.ncbi.nlm.nih.gov/gene) to select all genes with any relation to the brain. We identified a total of 2494 genes by filtering specifically for genes in Homo sapiens matching the query phrase "human brain". All 1000 Genomes Project SNPs in these genes were assigned a "Brain" value of 1, the rest were assigned a"Brain" value of 0. All SNPs were subsequently assigned LD-weighted "Brain" scores.
Major histocompatibility complex (MHC). The MHC has been implicated in schizophrenia as well as a number of other phenotypes, particularly immune-mediated diseases. The evolution of MHC itself may have involved SD and other large scale genetic variations [34]. It is therefore reasonable to expect that SNPs in these regions might be confounding some of the evolutionary enrichment results. To test for the effect of MHC, we removed the SNPs in the MHC region (chromosome 6 region between genomic positions 25652429 and 33421466 in the hg19 assembly) and repeated the analyses.
Annotation of genomic regions (LD based). The SNPs that fall within certain regions of interest may capture only a limited portion of the association signal actually ascribable to that region. We used an LD-weighted scoring algorithm [17] in order to identify SNPs that tag specific DNA regions even if they are not situated within them. For each SNP a pairwise correlation coefficient approximation to LD (r 2 ) was extracted for all 1KGP SNPs within a 1,000,000 base pairs (1Mb). All r 2 values < 0.2 were set to 0 and each SNP was assigned an r 2 value of 1.0 with itself. LD-weighted region annotation scores for all DNA regions of interest were computed as the sum of LD r 2 between the tag SNP and all 1KGP SNPs in those regions. Given SNP i , its LD-weighted region annotation score was computed as LDscore i = S j (δ j r ij 2 ), where r ij 2 is the LD r-squared between SNP i and SNP j and δ j takes values of 1 or 0 depending on whether the 1KGP SNP j is within the region of interest or not. LD scores were also assigned to exons, introns, 3'UTR and 5'UTR, and the total LD score (TotLD) was computed following the same procedure but extending the tagging region to the whole 1Mb window.
Intergenic correction. Intergenic SNPs are defined as having LD-weighted annotation scores for exon, intron, 3'UTR and 5'UTR equal to zero and being in LD with no SNPs in the 1KGP reference panel located within 100,000 base pairs of a protein coding gene, within a non-coding RNA, within a transcription factor binding site or within a miRNA binding site. Those singled out in this way are expected to form a collection of non-genic SNPs not belonging to any (annotated) functional elements within the genome (including through LD) and therefore represent a collection of likely null associations. Intergenic SNPs were used to estimate the inflation of GWAS summary statistics due to cryptic relatedness. We used intergenic SNPs because their relative depletion of associations (17) suggests they provide a set of reliably null SNPs that is less contaminated by polygenic effects. The inflation factor, λ GC , was estimated as the median squared z-score of independent sets of intergenic SNPs across one hundred LD-pruning iterations, divided by the expected median of a chi-square distribution with one degree of freedom.

Analytical approach
We employed a genetic enrichment method recently developed to uncover more of the genetic architecture of complex traits [17,[35][36][37][38]. Specifically, we investigated the enrichment of associations concurrent with the evolutionary affiliations in a covariate-modulated statistical framework [37]. We investigated whether SNPs located in the evolutionarily salient regions (HAR, SD, Ohno) or tagging other SNPs therein, are more likely associated with schizophrenia or other phenotypes using GWAS data from existing non-censored summary statistics.
All statistical analyses were carried out with a covariate-modulated enrichment analysis package developed on R (www.r-project.org) and MATLAB (www.mathworks.se/products/ matlab/) programming platforms.
Fold enrichment plots. To visually assess genetic enrichment, we used conditional fold enrichment plots [39]. For this purpose the covariate of interest, i.e. the region affiliation score, is used to subdivide SNPs into two strata. For LD-weighted annotation scores, the choice of a threshold score is somewhat arbitrary. We chose 1 since this is the score an isolated SNP within a salient region would have. It has been shown elsewhere [17] that the method is robust to the choice of threshold.
The enrichment plots were obtained by computing the empirical cumulative distribution of-log10(p)-values for SNP association with a given phenotype for all SNPs, and for the two dichotomous SNPs strata determined by the region affiliation score. Then each stratum's fold enrichment was calculated as the ratio CDF stratum /CDF all between the-log10(p) cumulative distribution for that stratum and the-log10(p) cumulative distribution for all SNPs. The nominal-log10(p) values are plotted on the x-axis, the fold enrichment in the y-axis. To assess polygenic effects below the standard GWAS significance threshold, we focused the fold enrichment plots on SNPs with nominal-log10(p) < 7.3 (corresponding to p > 5x10 -8 ). Enrichment is present if the line corresponding to the SNPs of interest has a positive deflection from the horizontal line through 1.
Partial least squares regression (PLSR). The fold enrichment plots give a visual impression of the different association propensities of SNPs affiliated to the evolutionarily salient regions. However, they do not give a quantitative measure of the eventual enrichment. One such measure is provided by squared association z-scores regression [9] which in addition allows controlling for covariates of no interest. Due to their nature, the effects of LD-weighted annotation scores can't be estimated by standard linear regression.
PLSR is a supervised subspace regression method that maximizes covariance between two data blocks: the so called descriptor data set (X) and response data set (Y) [40]. An important aspect of PLSR is that the regression model is statistically stabilized for data sets with highly inter-correlated variables [41]. The predictive PLSR model may be written as follows: Where X and Y are descriptor and response data matrices respectively (both data sets are mean-centered and scaled prior to data modelling), B A are the regression coefficients for a model including A latent variables (LVs) and F A is the residual matrix for the corresponding model.
For all of the PLSR models in this study, we chose the optimal number of LVs (A) as the number of LVs that explained more than 99% of the variance in the descriptor data set.
Statistical validation: Jackknife approximate t-tests of regression coefficients. In order to assess the contribution of the descriptor variables to the PLSR model we carried out approximate t-tests of regression coefficients based on jackknife variance estimates [42]. For this purpose, we ran 50-fold cross-validation on the SNP samples and re-calculated the regression coefficients in every cross-validation round. The LD-matrix was used for partitioning the SNP samples into the cross-validation subsets. We then calculated jackknife estimates for the standard deviations of the regression coefficients and, thereafter, t-statistics and approximate pvalues indicating the significance of the association with the corresponding descriptor variables in the PLSR model. Squared z-scores residuals versus TotLD stratified scatter plots. The PLSR results are better visualized by plotting descriptor and response variables directly. We residualized the squared z-scores in PLSR models deprived of TotLD, binned the latter and plotted the average squared z-scores for all bins against the corresponding TotLD bin centers. To control for the effect of the evolutionary measures we residualized the squared z-scores in a second series of PLSR models deprived of TotLD as well as the evolutionary measure of interest and again plotted the average squared z-scores for all bins against the corresponding TotLD bin centers.

Results
We assessed the influence exerted on schizophrenia association propensity by affiliation to HAR, SD and Ohno. The fold enrichment plots (Fig 1) suggest enrichment of schizophrenia association among SNPs in HAR and SD regions, and to some extent, Ohno.
MHC has a known association with schizophrenia [18]. To assess the effect of immunerelated genes, the analyses were repeated after exclusion of SNPs in the MHC region. The most significant reduction in fold enrichment occurred with SD but none was visible in HAR or Ohno (S1 Fig).
To test if the enrichment seen in schizophrenia genes was mediated by brain function, we investigated the fold enrichment of brain genes annotated to various combinations of regions of evolutionary interest (Fig 2). We observe a rather wider deflection from baseline for brain genes affiliated to HAR and SD, suggesting brain genes in these evolutionarily salient regions are more enriched compared to just any brain genes or just any SNPs in HAR, SD or Ohno.
The regression results are reflected by the squared z-scores versus TotLD scatter plots ( S2  Fig). These show the small but significant effect of TotLD alongside that of stratification. HAR has no apparent effect on the squared z-scores regression while SD and Ohno seem to dampen the effect of TotLD, suggesting a possible interaction effect.
Removal of MHC has no relevant effects on PLSR results apart from a loss of significance of the SD association (Tables 1-3B). In light of these results, we repeated the regression analyses Early Evolutionary Markers and Schizophrenia using only SNPs in the MHC region (Tables 1-3C). We observe that the regression coefficient for HAR changes direction but remains non-significant. In case of SD it remains negative but loses significance. The Ohno affiliation coefficient turns significant suggesting that Ohno affiliation might be an enrichment factor specific to the MHC region.
To determine the effect of LD on the enrichment analyses, we generated random sets of SNPs matching the original HAR, SD and Ohno SNPs in total LD and minor allele frequency and repeated the analyses on these sets. The enrichment is somewhat reduced (S3 Fig) compared to the original set of SNPs but is still present despite the lack of evolutionary content.
The specificity of the schizophrenia results was tested by repeating the same analyses for GWAS summary statistics of other human phenotypes: neurological and psychiatric (Alzheimer's disease and bipolar disorder), anthropometric (body mass index (BMI), height), cardiovascular (triglycerides (TG)), immune-mediated (Crohn's disease (CD)). These GWAS were selected to be roughly comparable in sample size to the schizophrenia GWAS and therefore roughly equally powered. The fold enrichment plots show an overall high enrichment in HAR and lower in SD and Ohno across all phenotypes (S4 Fig). The PLSR shows no significant associations in HAR, negative associations between SD affiliation and BD (non-significant) and BMI (nominally significant), but positive association between Ohno and BD (significant). Other covariates like intron affiliation score are positively associated, and significantly so, across most phenotypes when regressed together with HAR or SD scores.

Discussion
We used polygenic enrichment methods to investigate whether SNPs in the HAR, SD and Ohno regions of the human genome, related to early evolution are more likely to be associated with schizophrenia. Our results did not provide clear evidence that genetic variants in these regions are more likely to associate with schizophrenia. While association enrichment is present, it appears to be mediated by affiliation to known genomic enrichment categories like introns, 3'UTR, 5'UTR, and LD. Thus, the risk variants associated with schizophrenia do not seem to have an evolutionary origin prior to divergence between humans and chimpanzees.
While association enrichment is present, it is likely mediated by affiliation to known genomic enrichment categories like introns, 3'UTR, 5'UTR, and LD. schizophrenia association enrichment of brain genes with various regional affiliation scores. Enrichment plots for A) human accelerated regions (HAR), B) segmental duplications (SD) and C) ohnologs are shown for: SNPs annotated to genes with some relation to the brain, as established by an NCBI site search ("Brain"); SNPs affiliated to these regions of interest (HAR, SD or Ohno) and also annotated to genes with some relation to the brain (HAR Brain, SD Brain or Ohno Brain). In case of segmental duplications, the brain genes in the regions of interest (SD Brain) look more enriched (i.e. present a higher incidence of associations [lower p-values] with schizophrenia) compared to SD or just any Brain genes. In Ohnologs, the enrichment is way lower than in SD but Ohno Brain looks more enriched than other categories. HAR Brain and HAR show similar enrichment. All annotations are LD-weighted. doi:10.1371/journal.pone.0169227.g002

Early Evolutionary Markers and Schizophrenia
HARs have the most recent origin among the three evolutionary proxies used in the current study [14]. HAR fold enrichment plots show some enrichment in schizophrenia, bipolar disorder and BMI, and more noteworthy enrichment in height, triglycerides and Crohn's disease. However, regression analyses yielded no significant results in any phenotype, likely owing to the small number of SNPs in these regions. SD and Ohno show similar patterns of enrichment but generally less pronounced than in HAR. This suggests that the enrichment is not specific for schizophrenia but is influenced by the age and the extent of the evolutionary markers. We do observe significant and positive correlation between Ohno and BD but this is likely spurious since none of the GWAS from larger samples, including the schizophrenia GWAS, show any significant results.
Removal of MHC had the most visible effects on HAR followed by SD enrichments. HAR and SD, likely owing to their more recent origin, contain a higher proportion of MHC genes   (Fig 1A) is not supported by regression where no significant (p = 0.57) association is detected. The results remain non-significant after removing MHC SNPs or using only them. doi:10.1371/journal.pone.0169227.t001 Early Evolutionary Markers and Schizophrenia compared to Ohno (S2 and S3 Tables). Ohnos are the oldest regions and cover large portions of the human genome [25]. As such, while still confounded by LD, their enrichment is largely determined by SNPs with low LD scores and hence less likely to tag any causal variants. The removal of MHC had no visible effect on Ohno enrichment but our analysis of SNPs in MHC regions interestingly yielded a significant positive association, suggesting the presence of some evolutionary effect specific to the MHC [44]. Since brain genes are more likely to be associated with schizophrenia, we carried out separate analyses targeting these. Our results suggest that brain SNPs in evolutionary salient regions show enrichment of associations with schizophrenia compared to just any brain genes or non-brain genes within the same regions. Interestingly, regions that are shared by Ohno and SD show greater enrichment of associations with schizophrenia compared to either of them separately. However, this could be due to higher LD in these shared regions. SNPs with Phenotypes: psychiatric and neurological diseases (Alzheimer's disease (AD), bipolar disorder (BD), schizophrenia (SCZ)), anthropometric measures (body mass index (BMI), height), cardiovascular risk factors (Triglycerides (TG)), and immune-mediated diseases (Crohn's disease (CD)). A: regression on the complete set of SNPs, B: regression analysis after removing MHC SNPs and C: regression analysis keeping only MHC SNPs. The enrichment seen in schizophrenia (Fig 1B) is not supported by regression where the association is significant (p = 6.00x10 -6 ) but the effect is negative. The association loses significance upon removal of MHC. No significant associations are found when regressing MHC SNPs only. doi:10.1371/journal.pone.0169227.t002 Early Evolutionary Markers and Schizophrenia higher LD scores are likely to tag both Ohno and SD regions as much as they are likely to be associated with any causal variants. Our analyses on random sets of SNPs matching the original ones suggest that most of the enrichment in the evolutionary salient regions is due to LD. However, a residual enrichment is clearly visible in SD and Ohno and can be guessed in HAR despite its larger variance ( S3  Fig). Judging by the PLSR results, such residual enrichment should probably be attributed to Introns and UTRs (Tables 1-3). This is likely due to overall higher gene content in the case of SD and Ohno (S3 Table). The culprit in the case of HAR is likely to be found in LD with noncoding regulatory elements.
It has been proposed that schizophrenia is the price we paid for the development of a complex brain and our ability to have abstract reasoning and language [45]. The schizophrenia association enrichment in evolutionary markers appears to decrease as the age of the evolutionary Phenotypes: psychiatric and neurological diseases (Alzheimer's disease (AD), bipolar disorder (BD), schizophrenia (SCZ)), anthropometric measures (body mass index (BMI), height), cardiovascular risk factors (Triglycerides (TG)), and immune-mediated diseases (Crohn's disease (CD)). A: regression on the complete set of SNPs, B: regression analysis after removing MHC SNPs and C: regression analysis keeping only MHC SNPs. The enrichment seen in schizophrenia (Fig 1C) is not supported by regression where no significant (p-value = 0.694) association is detected. The results remain non-significant after removing MHC SNPs but become significant when using only them. doi:10.1371/journal.pone.0169227.t003 Early Evolutionary Markers and Schizophrenia markers increases. However, the extent of the LD-weighted evolutionary markers studied here must also be considered. Limited enrichment sensitivity may also be a reason for the lack of findings in the present study, especially in the case of HAR. Other investigators found HAR to be enriched in schizophrenia [46]. However, they distinguished HAR that have accelerated in humans compared to non-human primates (pHAR) and compared to other mammals (mHAR). Further, their analyses were restricted to schizophrenia GWAS loci below given significance levels. They used LD (r2 > 0.5) to define their regions of interest while we treat LD as a confounder and control for it. These factors may be the reason for our failure to reproduce their findings. However, our polygenic techniques were able to show enrichment in regions under positive selection in humans after divergence from Neanderthals [9]. The modern human genome is the result of a complex process of evolution and adaptation. Using multiple evolutionary markers we investigated traces of human-specific pathology in the ancestral regions of the human genome. Our investigation did not reveal conclusive evidence supporting increased prevalence of schizophrenia association in the markers of early human evolution, HAR, SD or Ohno. Taken together with our previous report on markers of human-Neanderthal divergence [9], the current findings suggest that the increase in polygenic schizophrenia risk may have developed during more recent periods of human evolution.
Supporting Information S1  . The SNPs are stratified based on their regional affiliation score: HAR, SD, Ohno vs non HAR, non SD, non Ohno, respectively. A scatter plot for all SNPs (None) is also reported in all figures. Mean squared z-scores are lower in SD and Ohno compared to non SD and non Ohno (Fig 1B and  1C). HAR has no apparent effect on squared z-scores. Non HAR, non SD and non Ohno essentially overlap with the non-stratified set (None). (TIF) S3 Fig. Enrichment plots for schizophrenia, conditioning SNPs according to their affiliation to SNP sets matched to ohnologs, segmental duplications and human accelerated regions in minor allele frequency and total LD. Plot A shows all SNPs stratified by affiliation to human accelerated regions (HAR) and non HAR. Plot B shows all SNPs stratified by affiliation to segmental duplication (SDLD) and non-segmental duplication regions (NonSD); Plot C shows all SNPs in ohnologous (Ohno) and non ohnologous regions (NonOhno). We observe some depletion of enrichment compared to the original set of SNPs but it is still present despite the lack of evolutionary content. (TIF)