A Novel, Functional and Replicable Risk Gene Region for Alcohol Dependence Identified by Genome-Wide Association Study

Several genome-wide association studies (GWASs) have reported tens of risk genes for alcohol dependence, but most of them have not been replicated or confirmed by functional studies. The present study used a GWAS to search for novel, functional and replicable risk gene regions for alcohol dependence. Associations of all top-ranked SNPs identified in a discovery sample of 681 African-American (AA) cases with alcohol dependence and 508 AA controls were retested in a primary replication sample of 1,409 European-American (EA) cases and 1,518 EA controls. The replicable associations were then subjected to secondary replication in a sample of 6,438 Australian family subjects. A functional expression quantitative trait locus (eQTL) analysis of these replicable risk SNPs was then performed to explore their cis-acting regulatory effects on gene expression. We found that within a 90 Mb region around the PHF3-PTP4A1 locus in AAs, a linkage disequilibrium (LD) block in PHF3-PTP4A1 formed the only peak associated with alcohol dependence at p<10−4. Within this block, 30 SNPs associated with alcohol dependence in AAs (1.6×10−5≤p≤0.050) were replicated in EAs (1.3×10−3≤p≤0.038), and 18 of them were also replicated in Australians (1.8×10−3≤p≤0.048). Most of these risk SNPs had strong cis-acting regulatory effects on PHF3-PTP4A1 mRNA expression across three HapMap samples. The distributions of −log(p) values for association and functional signals throughout this LD block were highly consistent across AAs, EAs, Australians and three HapMap samples. We conclude that the PHF3-PTP4A1 region appears to harbor a causal locus for alcohol dependence, and proteins encoded by PHF3 and/or PTP4A1 might play a functional role in the disorder.


Introduction
Alcohol dependence is a common, highly familial disorder that is a leading cause of morbidity and premature death. It results in serious medical, legal, social and psychiatric problems and influences many facets of American society. It affects 4 to 5% of the United States population at any given time, with a lifetime prevalence of 12.5% [1,2]. Family, twin and adoption studies have demonstrated that genetic factors constitute a significant cause of alcohol dependence. A large number of risk loci have been reported for alcohol dependence (AD) by the candidate gene approach. Several genome-wide association studies (GWASs) [3,4,5,6,7] have also reported tens of risk loci for alcohol dependence and alcohol consumption (summarized by Zuo et al. [3]). However, most GWAS findings have neither been replicated in independent samples nor confirmed by functional studies.
In the present study, we reanalyzed the data sets of the Study of Addiction Genetics and Environment (SAGE), the Collaborative Study on the Genetics of Alcoholism (COGA) and the Australian family study of alcohol use disorder (OZ-ALC). Using the following analytic strategies, we expected to discover novel (i.e., previously unimplicated) risk loci for alcohol dependence. First, we combined the SAGE and COGA datasets to increase the sample size and power (taking site-to-site variation and sample overlap into account), which might detect novel risk loci missed in previous studies. Second, we set AAs as the discovery sample, so that the top-ranked SNP list would differ from those in previous studies that used EAs, Germans or Australians as the discovery sample. Third, we used a replication and confirmation design to reduce the chance of false positive findings and thus to increase the α level (i.e., relax the significance threshold), which might detect novel risk loci missed in previous studies because of an overly conservative Bonferroni correction. Fourth, we completely separated EAs and AAs in the analysis to increase population homogeneity, and controlled for admixture effects in the association tests. Fifth, we used EAs and Australians as replication samples, and then used different samples of distinct ethnicity to detect eQTL signals, as a functional confirmation of the discovery association findings. Although using distinct samples in one study might increase the false negative rate because of sample heterogeneity, replication in distinct samples does make false positive findings less likely. Findings that replicate in distinct populations would be more generalizable to other populations, and would be more likely to point to the causal variants. Sixth, we applied an innovative definition of replication.
The primary target of investigation in the current study was not the top-ranked SNPs in the discovery sample, as in previous GWASs, but rather the replicable risk regions. This idea was similar to that in a prior study [4]. In a replicable risk region, not only should many individual markers replicate between the discovery and replication samples, but the overall distributions of association signals and functional signals throughout the whole region should also be consistent across the discovery, replication and confirmation samples (see rationales in Materials and Methods S1). Such important regions have not been reported in previous GWASs of alcohol dependence.
[5,6], and the Australian sample was the OZ-ALC (dbGaP: phs000181.v1.p1) dataset [7]. These datasets were originally collected mainly for the study of alcoholism. All Australian subjects were of European ancestry. Affected subjects met lifetime DSM-IV criteria for alcohol dependence [8], and Australian subjects were also assessed for alcohol consumption on a quantitative scale. Controls were defined as individuals who had been exposed to alcohol (and possibly to other drugs) in sufficient amounts for a sufficient time, but had never become addicted to alcohol or to illicit substances (lifetime diagnoses). This criterion for controls took into account the confounding effect of an environmental factor, i.e., drinking. Compared with general controls who had never used substances, our controls reduced the potential false negative rate, because a proportion of general controls might still be at risk of developing alcohol dependence if they drank. Additionally, controls were screened to exclude individuals with major axis I disorders, including schizophrenia, mood disorders, and anxiety disorders. More detailed demographic information is available in Materials and Methods S1 or elsewhere [3,5,6,9].
The AA and EA samples were genotyped on the Illumina Human1M beadchip, and the Australian sample was genotyped on the Illumina CNV370v1 beadchip.

Ethics Statement
All subjects gave written informed consent to participate in protocols approved by the relevant institutional review boards (IRBs). All subjects were de-identified in this study and the study was approved by Yale IRB.

Imputation
The CNV370 beadchip has only one-sixth of markers overlapping with the Human1M beadchip. To determine whether the risk markers identified in AAs and EAs (Human1M) could be replicated in Australians (CNV370), we imputed the genotype data in Australians to fill in the missing markers and then performed association tests. First, we pre-phased the original genotype data within 5 Mb around the risk genes of interest in Australians. Second, we used the 1000 Genomes Project and HapMap 3 CEU datasets as reference panels to impute the missing genotypes in this 5 Mb region with the program IMPUTE2 [10]. This program uses a Markov Chain Monte Carlo (MCMC) algorithm to derive full posterior probabilities of the genotypes of each SNP (burnin = 10, iteration = 30, k = 80 and Ne = 11,500). If the probability of one of the three genotypes of a SNP exceeded the threshold of 0.95, the genotype was expressed as the corresponding allele pair for the subsequent association analysis; otherwise, it was treated as missing. For SNPs that were directly genotyped, we used the direct genotypes rather than the imputed data. The imputed genotype data in Australians were checked for Mendelian errors with the program PEDCHECK [11].
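The genotype-calling rule described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `call_genotype`, the ordering of the three posterior probabilities (AA, AB, BB) and the allele labels are assumptions made for the example; only the 0.95 threshold comes from the text.

```python
from typing import Optional, Sequence, Tuple

CALL_THRESHOLD = 0.95  # posterior-probability threshold stated in the text


def call_genotype(posteriors: Sequence[float],
                  alleles: Tuple[str, str],
                  threshold: float = CALL_THRESHOLD) -> Optional[str]:
    """Convert three genotype posteriors into a hard call.

    posteriors: probabilities of the genotypes (AA, AB, BB); this ordering
                is an assumption of the sketch.
    alleles:    the two allele labels (A, B).
    Returns the allele pair as a string, or None (missing) when no genotype
    exceeds the threshold.
    """
    if len(posteriors) != 3:
        raise ValueError("expected exactly three genotype probabilities")
    a, b = alleles
    labels = (a + a, a + b, b + b)
    # Pick the most probable genotype, then apply the threshold to it.
    best = max(range(3), key=lambda i: posteriors[i])
    if posteriors[best] > threshold:
        return labels[best]
    return None
```

For example, posteriors of (0.97, 0.02, 0.01) for alleles A and G would be called as the allele pair AA, whereas (0.60, 0.39, 0.01) would be treated as a missing genotype.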

Association analysis
Before statistical analysis, we cleaned the phenotype data first and then the genotype data. This cleaning process yielded 805,814 SNPs in EAs, 895,714 SNPs in AAs and 300,839 SNPs in Australians. [Detailed cleaning steps were described previously [3]].
(a) Genome-wide association tests in the AA discovery sample: The allele and genotype frequencies were compared between cases and controls in AAs using genome-wide logistic regression analysis implemented in the program PLINK [12]. Diagnosis served as the dependent variable, alleles or genotypes served as the independent variables, and ancestry proportions (to control for admixture effects), sex, and age served as covariates. Ancestry proportions of each individual were estimated from 3,172 completely independent markers [3]. The top-ranked SNPs (p<10−4) were also tested by Fisher's exact tests without controlling for admixture effects. The p-values derived from these analyses are illustrated in Figure S1 and the top 5 SNPs are listed in Table S1.
(b) Association tests in the primary EA replication sample: Associations between the above top-ranked SNPs (p<10−4) and alcohol dependence were tested using logistic regression analysis (with ancestry proportions, sex and age as covariates) and Fisher's exact test (without covariates) in EAs, to identify risk genes that were enriched with replicable markers (here, the Plant HomeoDomain (PHD) finger protein 3 gene and the protein tyrosine phosphatase type IVA, member 1 gene; PHF3-PTP4A1). Then, associations between alcohol dependence and all nominally significant SNPs (p<0.05 in AAs) in PHF3-PTP4A1 were retested in EAs. The associations that were replicated across AAs and EAs are shown in Table 1 and Figure 1. Meta-analysis was performed to derive the combined p values between AAs and EAs.
(c) Family-based association tests in the secondary Australian family replication sample: Associations between alcohol dependence and the replicable risk SNPs in PHF3-PTP4A1 (Table 1) identified between AAs and EAs were retested in Australians using a family-based association test implemented in PLINK [12]. Meta-analysis was performed to derive the combined p values between EAs and Australians.
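The covariate-free allelic test mentioned in (a) and (b) can be illustrated with a small stand-alone implementation of Fisher's exact test on a 2×2 table of allele counts (cases versus controls, risk allele versus other allele). The analyses in the study were run in PLINK; the sketch below is only an illustration of the test itself, the function name is hypothetical, and the two-sided p-value is defined here as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common convention and an assumption of this example.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    For an allelic test: a, b = risk/other allele counts in cases;
    c, d = risk/other allele counts in controls.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with the observed table count.
    cutoff = p_obs * (1 + 1e-12)
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= cutoff)
    return min(1.0, p_value)
```

Because the test conditions on the table margins and uses no covariates, it does not adjust for admixture, which is why the text pairs it with the covariate-adjusted logistic regression.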

Cis-acting genetic regulation of expression analysis
To examine relationships between genetic variants and local gene expression levels in lymphoblastoid cell lines, we performed cis-acting expression quantitative trait locus (cis-eQTL) analysis. These relationships included those between all replicable risk SNPs in PHF3 and PHF3 mRNA expression levels, and those between all replicable risk SNPs in PTP4A1 and PTP4A1 mRNA expression levels. Expression data of 14,925 transcripts (14,072 genes) in 270  [13]. Differences in the distribution of mRNA expression levels between SNP genotypes were compared using a Wilcoxon-type trend test. P-values less than 0.05 are listed in Table 1 and plotted in Figure 1. Additionally, the effects of SNPs within 1 Mb surrounding the association peak SNP (rs9294269) are illustrated in Figures 1D-H.

Correction for multiple testing
The AA discovery sample was genotyped for one million SNPs. The association results could be corrected for one million tests (α = 5×10−8) to guard against false positive findings. However, this correction is overly conservative, because these 1 M markers are not completely independent. Instead, in the present study, we used multiple samples to replicate and confirm the discovery findings, in order to reduce the chance of false positive findings and to increase the α level from 5×10−8. First, we used EAs and Australians, populations genetically very distinct from AAs, as replication groups for association analysis. This would make the replicable findings more generalizable to other populations. Second, we aimed to detect replicable regions that were enriched with many risk markers rather than a single one, which also reduced the chance of false positive association findings. Third, functional analysis as a confirmation of the association analysis further reduced the chance of false positive findings. Additionally, functional analysis in multiple populations of distinct ethnicity, which also differed from the populations used for association analysis, would make the findings more generalizable. Fourth, the distributions of −log(P) values across the discovery, replication, and confirmation samples were compared for similarity using Pearson correlation analysis (see rationale in Materials and Methods S1). Consistency between them would significantly reduce the chance of false positive findings. Additionally, our analyses followed a fixed procedure (Materials and Methods S1) step by step, which reduced multiple testing. Therefore, α in the discovery sample did not need to be corrected for one million tests if an association was replicated.
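The comparison of −log(P) distributions between two samples can be sketched as a Pearson correlation over SNPs shared by both samples. This is an illustrative sketch only: the function names are hypothetical, the logarithm is assumed to be base 10 (the text does not state the base, and the correlation coefficient is unaffected by this choice), and the two p-value lists are assumed to be aligned on the same ordered set of SNPs.

```python
from math import log10, sqrt
from typing import List, Sequence


def neg_log10(pvalues: Sequence[float]) -> List[float]:
    """Transform p-values to -log10(p); base 10 is an assumption of this sketch."""
    return [-log10(p) for p in pvalues]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def profile_correlation(p_sample1: Sequence[float],
                        p_sample2: Sequence[float]) -> float:
    """Correlate the -log10(p) profiles of two samples over the same SNPs."""
    return pearson_r(neg_log10(p_sample1), neg_log10(p_sample2))
```

A coefficient near 1 indicates that the peaks and troughs of the association (or eQTL) signals fall on the same markers in both samples, whereas a negative coefficient indicates that the signals peak on different markers within the region.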
Furthermore, a discovery finding was taken as "significant" in the present study only when it was replicated and confirmed in multiple groups. For these replicable findings, a region-wide correction might be sufficient. Five independent markers, the effective number capturing the information content of all 30 replicable risk markers in the whole PHF3-PTP4A1 region (Table 1), were predicted by the program SNPSpD [14]. Thus, a region-wide corrected α could be set at 0.01 (= 0.05/5) for those replicable findings.
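The thresholds used in this study all follow the same Bonferroni arithmetic, dividing a family-wise α of 0.05 by the number of (effective) tests. The short sketch below reproduces that arithmetic and shows how a corrected threshold would be applied to a set of p-values; the function names and the marker label `rs_example` with its p-value are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def surviving_snps(pvalues: Dict[str, float], alpha: float) -> List[str]:
    """Return the SNPs whose p-value is below the corrected threshold."""
    return [snp for snp, p in pvalues.items() if p < alpha]


# Genome-wide: 0.05 / 1,000,000 tests gives 5e-8.
genome_wide_alpha = bonferroni_alpha(0.05, 1_000_000)
# Region-wide: 0.05 / 5 effective independent markers gives 0.01.
region_wide_alpha = bonferroni_alpha(0.05, 5)

# rs9294269 and its p-value are taken from the text; "rs_example" is a
# hypothetical marker added only to show a SNP that fails the threshold.
example = {"rs9294269": 1.6e-5, "rs_example": 0.038}
survivors = surviving_snps(example, region_wide_alpha)
```

The gain from replacing one million tests by five effective tests is the difference between a threshold of 5×10−8 and one of 0.01, which is why the region-wide correction is applied only to findings that have already been replicated.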

Transcriptome-wide expression correlation analysis
The expression data of 14,925 transcripts in 93 autopsy-collected frontal cortical brain tissue samples were evaluated using Affymetrix Human ST 1.0 exon arrays. These data were obtained from a research study [15] at Duke University. These individuals included 55 males and 38 females, from 34 to 104 years old with an average of 74±16 years. The postmortem intervals, i.e., the time from death to brain tissue collection, were 1.2-46 hours with an average of 14.3±9.5 hours. These individuals had no defined neuropsychiatric condition. Correlations between expression of the PHF3-PTP4A1 transcripts and expression of other genes across the transcriptome in these individuals were tested (Table S2). α was set at 3.4×10−6 (= 0.05/14,925).

Results
There were a total of 114 SNPs in 79 genes that were marginally (p<10−4) associated with alcohol dependence in the AA discovery sample (data available on request). The p values from the allelewise and genotypewise association analyses of the five top-ranked SNPs before and after controlling for admixture effects are listed in Table S1. Among these top-ranked SNPs, 22 SNPs (19.3%) in 10 genes were replicable in EAs (Table S3). Among these 10 genes, only the PHF3-PTP4A1 region was enriched, with 12 replicable top-ranked SNPs (Table S3).
Testing all available SNPs (n = 131) in the PHF3-PTP4A1 region in AAs, we found 38 SNPs that were nominally associated (1.6×10−5≤p≤0.050) with alcohol dependence, among which 28 survived region-wide correction for multiple testing (α = 0.01). Testing these 38 SNPs in EAs, we found 30 in one LD block (D′>0.9; Figure 1) that were well replicated in EAs (1.3×10−3≤p≤0.038), 23 of which survived region-wide correction (α = 0.01) (Table 1). Testing all of these 30 SNPs in Australians, we found 18 SNPs that were replicable in this sample (1.8×10−3≤p≤0.048), 9 of which survived region-wide correction (α = 0.01) (Table 1). Interestingly, 29 risk SNPs had the same direction of gene effects on alcohol dependence between EAs and Australians, but had opposite directions of effects between EAs and AAs (Table 1). Meta-analysis showed that these gene effects became less significant when AAs and EAs were combined, but became a little more significant when EAs and Australians were combined (data not shown). In spite of this, all risk alleles (OR>1; Table 1) of these 29 SNPs (except for rs1744134, rs1482451 and rs3003672) were the minor alleles (f<0.5) in both AAs and EAs (Table S4). Additionally, sex, age and admixture effects did not significantly affect our results (data not shown).
Cis-eQTL analysis showed that 24 of the 30 replicable risk SNPs had significant cis-acting regulatory effects on PHF3-PTP4A1 mRNA expression level in at least one of the HapMap CEU-Children, CEU-Parent and CHB populations (Table 1; Figure 1D-F), and 12 of them survived region-wide correction (α = 0.01). PHF3-PTP4A1 was enriched with many other functional signals across five HapMap populations (Figure 1), although the functional SNPs in JPT and YRI-Parent were not identical to, but were in high LD with, the replicable risk SNPs in the AA discovery sample (Figure 1G-H).
The LD block of PHF3-PTP4A1 containing the association signals overlapped extensively across AAs, EAs and Australians (Figure 1B and 1C; Table 2). The LD block that was enriched with functional signals across HapMap CEU-Children, CEU-Parent and CHB populations overlapped extensively with the region that had significant association signals across AAs, EAs and Australians (Figure 1B-C and 1D-F). The distributions of −log(p) values for association and functional signals across AA, EA, Australian, CHB and CEU-Children populations were highly consistent (Pearson correlation coefficient r≥0.465 with 2.5×10−21≤p≤4.0×10−4), and were negatively correlated with that in YRI-Children (r≤−0.407; 5.7×10−6≤p≤9.2×10−4; Table 2 and Figure 1).
[Figure 1 legend: Left Y-axis corresponds to −log(p) value; right Y-axis corresponds to recombination rates; quantitative color gradient corresponds to r²; red squares represent peak SNPs. (a) regional association plot in AAs for a 10 Mb region surrounding the peak association SNP (rs9294269) in PHF3-PTP4A1; (b, c) regional association plots in AAs or EAs for a 1 Mb region surrounding the peak association SNP (rs9294269) in PHF3-PTP4A1; (d-h) regional eQTL plots in HapMap populations for a 1 Mb region surrounding rs9294269; (i) LD map for all available markers for a region surrounding rs9294269 in EAs (Illumina Human1M beadchip), in which red bars represent the peak SNPs in each population]. doi:10.1371/journal.pone.0026726.g001
In the AA discovery sample, within the 25 Mb region around the peak association SNP (rs9294269; p = 1.6×10−5), this risk LD block formed the only peak that had association signals significant at p<10−3; within the 90 Mb region around this SNP, this risk LD block was the only peak that had association signals significant at p<10−4 (see Figure 1A, which depicts 10 Mb of this interval). In the EA replication sample, within the 10 Mb region around the peak SNP (rs9449312; p = 1.3×10−3), this risk LD block was the only peak that had association signals significant at p<1.5×10−3 (see Figure 1C, which depicts 1 Mb of this interval). Additionally, within a 1 Mb range, the most significant functional SNPs in HapMap CHB (rs9294269; p = 0.0023; Figure 1D), CEU-Child (rs1681957; p = 0.0036; Figure 1E), and JPT (rs6916092; p = 0.016; Figure 1G), and the second most significant functional SNPs in CEU-Parent (rs10943869; p = 0.017; Figure 1F) and YRI-Child (rs3757350; p = 0.012; Figure 1H) were all located in PHF3-PTP4A1. The peak SNPs in these populations were in high LD with one another (D′>0.9); in particular, the peak SNP in CHB (rs9294269) was the same as the peak SNP in AAs (Figure 1B vs. 1D). The closer together the peak SNPs were located (Figure 1I), the more significant were the correlations between the distributions of −log(p) values across the whole region (Table 2), which suggested that the peak SNP captured most of the information in the whole distribution across that region. The more significant those correlations were, the more consistent (replicable) the risk regions would be between populations. Thus, the distance between peak SNPs reflected the strength of replicability of association or functional signals between populations.

Discussion
In the present study, when merging 480 COGA subjects into the SAGE sample, we obtained results highly similar to those of previous studies that used the SAGE sample alone [5,22]. The top-ranked risk SNPs (p<10−5) in EAs, AAs, and AAs+EAs in those previous studies [5,22] were confirmed by our analysis (presented previously [3]). Similarly, many top-ranked risk SNPs (Table S1) in the present study were also listed as top-ranked genes previously. However, these top-ranked genes have not yet been replicated independently or confirmed by functional studies.
In the present study, using a new analytic strategy and integrating evidence from the functional analysis, we identified a risk region for alcohol dependence (i.e., the PHF3-PTP4A1 locus) that was missed previously. This region was enriched with functional SNPs that had replicable associations with alcohol dependence. This important risk region was not reported previously, because most of the risk SNPs in it had p-values between 10−5 and 10−3, which placed them outside the top-ranked risk SNP lists (p<10−5) of previous GWASs. Such p values were reasonable for alcohol dependence, because the effect sizes of individual loci for this complex trait are expected to be small. We used a replication design to reduce the false positive rate and increase the significance threshold (α) from 5×10−8, and thus discovered this risk region.
The PHF3-PTP4A1 region was enriched with 30 replicable risk SNPs for alcohol dependence in two genetically distinct populations, i.e., AAs and EAs. Twenty-six of these replicable risk SNPs were found to be functional by expression data obtained across multiple HapMap populations. All risk SNPs were in one LD block around the association peak SNP (i.e., rs9294269 in PHF3 in AAs). This risk LD block overlapped extensively across AAs, EAs, Australians and three HapMap populations, and the association or functional peak SNPs in each of these populations were in high LD with each other. In short, the association and functional signals in this LD block were highly consistent across six samples. These findings suggested that the PHF3-PTP4A1 region might harbor a causal locus and that the proteins encoded by PHF3 and PTP4A1 might contribute to the vulnerability to alcohol dependence. First, the risk LD block in the region of PHF3-PTP4A1 formed the only association peak within a 90 Mb region in AAs (threshold p = 10−4) and within a 10 Mb region in EAs (threshold p = 1.5×10−3). It is, thus, highly likely that the putative causal locus for alcohol dependence was located within this PHF3-PTP4A1 LD block. We speculated that there might be only one causal locus in this region, and that all risk SNPs might be in LD with this putative causal locus and, thus, presented association signals. If there were ≥2 independent causal loci, the risk markers in LD with the respective causal loci would be located in ≥2 independent risk LD blocks, which was not observed in the present study. Second, most replicable risk SNPs in this block had strong cis-acting regulatory effects on PHF3-PTP4A1 mRNA expression. This increased the possibility that PHF3-PTP4A1 per se played a direct functional role in the disorder. Third, many PHF3-PTP4A1 SNPs had significant (in PHF3) or slight (in PTP4A1) potential for altering the secondary RNA structure (predicted by MFOLD [23]) (Table S4), providing additional evidence in support of the hypothesis that PHF3-PTP4A1 per se contributed to alcohol dependence. Fourth, distributions of −log(P) values for gene-disease associations and for gene-expression associations were highly consistent across at least six populations. This might suggest that the majority of the functions of PHF3-PTP4A1 contributed to the risk for alcohol dependence, and that the regulatory pathway via which these SNPs caused alcohol dependence might be related to the PHF3 and PTP4A1 proteins per se. Taken together, these findings strongly supported the hypothesis that PHF3-PTP4A1 harbored a causal locus for alcohol dependence.
It is well known that gene expression is tissue-specific. In other words, consistent findings between lymphoblastoid cell lines and brain tissues are rare, but inconsistent findings between them are common. Suppose the alcoholism-associated markers have positive cis-eQTL signals in the brain: the chance that these markers happen to have negative cis-eQTL signals in the lymphoblastoid cell lines (i.e., the false negative rate) could be high; the chance that they happen to have positive cis-eQTL signals in the lymphoblastoid cell lines (i.e., the false positive rate) is low; and the chance that the distribution of their cis-eQTL signals in the lymphoblastoid cell lines happens to be highly consistent with the distributions of gene-disease association signals across different samples should be extremely low. That is, using lymphoblastoid cell lines for cis-eQTL analysis of brain disorder-related markers might increase the false negative rate because of the relatively poor conservation of cis-eQTLs between cell lines and brain tissue samples, but it should not significantly increase the false positive rate. In the present study, (1) we detected positive cis-eQTL signals in lymphoblastoid cell lines across multiple populations, (2) these markers were alcoholism-associated, and (3) the distributions of these cis-eQTL signals matched the distribution of the alcoholism-gene association signals. We believe that these findings are likely to be true positives, and that they suggest these markers might have positive cis-eQTL signals in the brain too. Independent validation of the cis-eQTL analysis in brain tissues is warranted in a follow-up study to test our hypothesis.
PHF3 and PTP4A1 might also influence alcohol dependence by interacting with other genes. Expression of PHF3 and PTP4A1 transcripts was significantly correlated with expression of many alcoholism-related genes in brain, including those in the dopaminergic, serotoninergic, GABAergic, glutamatergic, histaminergic and endocannabinoid systems [6,17,18,19,20,21]. These findings suggested that PHF3 and PTP4A1 might also be implicated in alcohol dependence via the classical neurotransmission systems or metabolic pathways.
It is worth noting that the putative causal locus within the PHF3-PTP4A1 region may not be identical to the risk markers implicated in the current study, and therefore may need to be identified by sequencing. First, none of the risk SNPs presented here were non-synonymous. Rather, they appear to have implications for risk and function by virtue of their being in LD with a putative causal locus and/or due to their location in regulatory regions (e.g., enhancer elements) that may in turn regulate transcription of the causal locus. Second, the SNPs employed by GWAS are common, not rare, variants. Numerous studies have shown that many gene-disease associations are not due to a single common variant, but rather due to a constellation of rarer, regionally concentrated, disease-causing variants. Thus, the signals of association credited to our common SNPs may be synthetic associations resulting from the contributions of multiple rare SNPs within the PHF3-PTP4A1 region, which need to be identified by sequencing. Third, both PHF3 and PTP4A1 were found to have significant association and functional signals. PHF3 had weaker association signals in AAs and EAs and weaker functional signals in lymphoblastoid cell lines than PTP4A1. However, associations for PHF3 markers were also replicated in the Australian sample. PHF3 had greater evidence of altered RNA secondary structures than PTP4A1. These positive signals might be due to LD with a single causal locus in the PHF3-PTP4A1 region, and this putative causal locus was more likely to be located in PHF3 based on our current evidence, which, again, needs sequencing to confirm. Finally, HapMap JPT and YRI-Children populations also presented functional signals, but the distributions of −log(P) values across the LD block in these two populations were negatively correlated with those in AAs, EAs, Australians, HapMap CHB and CEU-Children. It is likely that, in these two sets of populations, different phases of alleles might be in LD with the same causal allele.
The Plant HomeoDomain (PHD) finger proteins (PHFs) are members of the zinc finger protein (ZNF) superfamily. They are regulatory proteins in the nucleus and cytoplasm and are frequently associated with chromatin-mediated transcriptional regulation [24,25]. They can specifically recognize and bind to the trimethylated lysines (e.g., H3K4me3 or H3K9me3) on histones, and regulate their methylation status. PHF3 is ubiquitously expressed in normal tissues including brain. It has been reported that alcohol abuse could significantly up-regulate the gene expression level of PHF3 in the frontal cortex in alcoholics [26].
Additionally, the prenylated protein tyrosine phosphatases (PTPs) are cell signaling molecules that play regulatory roles in a variety of cellular processes. Over-expression of PTPs in mammalian cells confers a transformed phenotype, which implicates them in disease. It has been reported that, in mice, Ptp4a1 expression was significantly regulated by ethanol in the prefrontal cortex [27], and that transcript expression of Ptp4a1 (p = 3.2×10−11) was significantly associated with alcohol consumption [28]. These findings supported PHF3 and PTP4A1 as reasonable candidates for alcohol dependence, although the biological mechanisms warrant more studies in the future.
[*These SNPs are located in the transcription factor-binding site; Bold SNPs can significantly (underlined; in PHF3) or slightly (in PTP4A1) alter the RNA secondary structures. Some databases categorize the SNPs in the 3′ flanking region of PHF3 (from rs319924 to rs3003672) into LOC389405 that encodes a notch 5-like protein similar to Neurogenic locus Notch protein precursor. OR, odds ratio directions corresponding to Table 1