Genomic Regions Identified by Overlapping Clusters of Nominally-Positive SNPs from Genome-Wide Studies of Alcohol and Illegal Substance Dependence

Declaring “replication” from results of genome wide association (GWA) studies is straightforward when major gene effects provide genome-wide significance for association of the same allele of the same SNP in each of multiple independent samples. However, such unambiguous replication is unlikely when phenotypes display polygenic genetic architecture, allelic heterogeneity, locus heterogeneity and when different samples display linkage disequilibria with different fine structures. We seek chromosomal regions that are tagged by clustered SNPs that display nominally-significant association in each of several independent samples. This approach provides one “nontemplate” approach to identifying overall replication of groups of GWA results in the face of difficult genetic architectures. We apply this strategy to 1 M SNP GWA results for dependence on: a) alcohol (including many individuals with dependence on other addictive substances) and b) at least one illegal substance (including many individuals dependent on alcohol). This approach provides high confidence in rejecting the null hypothesis that chance alone accounts for the extent to which clustered, nominally-significant SNPs from samples of the same racial/ethnic background identify the same sets of chromosomal regions. It identifies several genes that are also reported in other independent alcohol-dependence GWA datasets. There is more modest confidence in: a) identification of individual chromosomal regions and genes that are not also identified by data from other independent samples, b) the more modest overlap between results from samples of different racial/ethnic backgrounds and c) the extent to which any gene not identified herein is excluded, since the power of each of these individual samples is modest. 
Nevertheless, the strong overlap identified among the samples with similar racial/ethnic backgrounds supports contributions to individual differences in vulnerability to addictions that come from newer allelic variants that are common in subsets of current humans.


Introduction
Genome wide association (GWA) is a method of choice for identifying genes whose variants influence vulnerability to complex disorders. Declaring “replication” of individual results of genome wide association studies is straightforward when major gene effects provide associations between marker and phenotype that display the same phase and “genome wide” levels of significance (p ca. 10^-8) in each of several independent samples. However, such “template” replication for individual markers is unlikely to be achieved in many otherwise-reasonable samples for many phenotypes. Phenotypes and samples that display polygenic genetic architecture, allelic heterogeneity, locus heterogeneity and sample-to-sample differences in fine structures of linkage disequilibrium can provide especial difficulties for this “template” approach. These difficulties can be exacerbated when data come from different genotyping platforms that do not assess allele frequencies for identical sets of SNPs. Much current genome wide association and linkage data suggest that we may have identified many or even most of the loci at which we might expect “template” analyses to identify reproducible genome wide significance in reasonably sized samples (see references below). Much of the risk attributable to genetic influences on common phenotypes appears likely to arise from polygenic influences whose properties are likely to provide many false negative results in searches for replicated “genome wide” significance in multiple independent samples that use “template” criteria for replication.
Vulnerability to heavy use and development of dependence on alcohol and/or an illegal abused substance (“addiction vulnerability”) appears to be such a trait. The substantial genetic influences on addiction vulnerability are documented by data from family, adoption and twin studies [1,2,3,4]. Twin studies also document shared heritable influences on vulnerability to dependence on addictive substances from different pharmacological classes, including alcohol and illegal drugs from several pharmacological classes [2,3,5]. Combined data from linkage and initial GWA studies [6,7,8,9,10,11,12,13,14,15,16,17,18,19] suggest that much of the genetic influence on vulnerability to substance dependence is likely to be polygenic.
We have developed a “nontemplate” strategy that identifies overall replication of sets of genome wide association (GWA) results in the face of difficulties with genetic architectures, samples and genotyping methods [9,14,20,21]. Such an approach can complement meta-analyses that seek to combine data from single markers whose significance in single samples does not achieve genome wide significance.
We now report application of this nontemplate strategy to identify overall replication of groups of results from GWA studies of samples of individuals with dependence on alcohol and illegal substances vs matched controls [21] (http://www.ncbi.nlm.nih.gov/gap). We separately compare data from independent samples of individuals with European-American genetic backgrounds and samples of individuals with African-American genetic backgrounds. These data come from individual genotyping and multiple-pool genotyping approaches that use 1 M SNP Illumina and Affymetrix platforms, respectively. The results focus attention on chromosomal regions that are identified by clusters of SNPs for which case vs control differences achieve nominal statistical significance in multiple samples from the same racial/ethnic group. We describe the high confidence with which this approach rejects the null hypothesis that clusters of nominally-significant SNPs from different samples of individuals from the same racial/ethnic group identify the same chromosomal regions with frequencies expected by chance. We note the more modest levels of confidence that this approach provides for identification of individual SNPs, individual chromosomal regions, individual genes and for the overlap between data from samples of the two racial/ethnic groups studied, except in genes in which we and other investigators have identified associations in independent samples. We discuss this work in light of its technical and analytic limitations and in its similarities with and differences from “template” GWA analyses and meta-analyses that seek reproducible associations of striking levels of significance at single SNP markers. The current “nontemplate” replication of sets of results may be useful in other settings in which the underlying properties of the disorder and of the samples create difficulties for searches for individual SNPs with replicated genome wide significance.

Materials and Methods
Subjects, genotyping and assignment of nominal significance of dependent vs control allele frequencies in each sample
1) dbGAP samples from the FSCD, COGA and COGEND studies. Genotypes from unrelated subjects who provided written consents and met DSM criteria for alcohol dependence and consenting control subjects with no evidence for dependence on any drug were assembled from three sets of subjects and deposited in dbGAP (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000092.v1.p1). Family Study of Cocaine Dependence (FSCD) subjects were recruited from treatment centers close to St. Louis, MO; 55% of contacted subjects participated [22]. Community-based comparison subjects were recruited through driver's license records from the Missouri Family Registry and were matched to alcohol dependent subjects based on date of birth, ethnicity, gender, and zip code. Eighty percent of screened and eligible comparison subjects participated. Other participants came from individuals who participated in the Collaborative Study on the Genetics of Alcoholism (COGA) [23] and the Collaborative Study on the Genetics of Nicotine Dependence (COGEND) [10]. Dependent individuals displayed DSM (Diagnostic and Statistical Manual IV) dependence on alcohol. Controls, defined in dbGAP variable phv00022939.v1.p1.c2 “final_type”, displayed no DSM dependence on alcohol, cocaine, marijuana, opioids or other drugs but may have evinced DSM nicotine dependence, FTND scores >4 and/or regular smoking as defined by smoking >100 cigarettes in their lives. We identified 1171 dependent and 1395 control unrelated European-American subjects and 652 dependent and 499 control unrelated African-American subjects for this analysis. Subjects were 45% male; 48% of the alcohol-dependent subjects were also dependent on cocaine.
Genotyping for these samples was performed using Illumina 1 M SNP arrays at the Center for Inherited Disease Research (CIDR), with quality controls and principal components analysis (PCA) controls for racial/ethnic background available at the dbGAP website. Genotypes from dependent and control individuals were selected from dbGAP files, excluding SNPs with minor allele frequencies less than 0.01 and 0.02 (for European-American and African-American samples, respectively) and those with missing call rates >5%. p values for each SNP were based on χ² tests.
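The per-SNP test can be illustrated with a minimal sketch. The text says only that p values were based on χ² tests, so the choice of a 2x2 allelic (allele-count) test without continuity correction is an assumption made here for illustration, and the function name `allelic_chi2` is hypothetical.

```python
import math

def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Assumption: an allelic test with no continuity correction; the paper
    does not state which chi-square variant was used.
    Returns (chi2, p_value).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    # A zero margin means the table carries no information about association.
    if 0 in (row_case, row_ctrl, col_a, col_b):
        return 0.0, 1.0
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (
        row_case * row_ctrl * col_a * col_b
    )
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Made-up allele counts, not data from the study.
chi2, p = allelic_chi2(60, 40, 40, 60)
nominally_significant = p < 0.05
```

A SNP would then be carried forward as "nominally positive" when its p value falls below 0.05, after the minor allele frequency and call rate filters described above.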
2) NIDA/MNB samples. European-American and African-American research volunteers, largely non treatment seeking, came to the NIDA research facility in Baltimore, Maryland between 1990 and 2007 in response to advertisements and referrals from other research volunteers. Subjects provided written informed consents, self-reported ethnicity data, drug use histories via the Drug Use Survey and DSMIII-R or IV diagnoses (Diagnostic and Statistical Manual) and were reimbursed for their time as previously described [6,17,21,24]. Genotypes were assessed in DNA pools using Affymetrix 6.0 arrays and methods that we have extensively validated, as previously described [6,7,8,9,21]. Pooling 1) provided us with the maximal ability to protect the genetic confidentiality of subjects who volunteered for study of genetics of illegal behaviors, 2) allowed us to utilize DNAs from individuals who consented to participation in this study during time periods when consents did not explicitly describe studies using high densities of DNA markers, 3) allowed us to use methods that we have developed and validated in this and in previous work and 4) reduced costs. Many of these subjects would thus not have been available for studies that assessed substantial numbers of polymorphisms using individual genotyping. Nominal p values for each SNP were determined based on t tests that compared data from multiple abuser vs control pools that contained DNAs from 680 European-American and 940 African-American individuals who had mean ages of 32.8 and 34.0 and were 69.5 and 58.8% male, respectively, as described [21]. In addition, to provide additional validation for the pooling results for the SNPs that formed the basis of the clusters evaluated herein, we also performed individual genotyping using Affymetrix 6.0 arrays for the 155 African American research volunteers who constituted virtually all of the members of 8 DNA pools and who had consented to unlimited individual genotyping. 
These individual genotyping results all passed Affymetrix quality control standards and resulted in ≥98% call rates.
3) Identification of chromosomal regions containing clusters of SNPs with nominally-significant case vs control differences in single or multiple samples. We performed analyses based on previously-defined criteria using datasets of approximately 1 million SNPs [21]. We identified chromosomal regions of interest in individual samples by seeking regions in which at least 4 clustered SNPs displayed case vs control differences with nominal, p < 0.05 levels of statistical significance. We defined clustering based on separation of each clustered SNP from the nearest nominally-significant SNP by ≤10 kb. We identified similarities between the results obtained from multiple samples by identifying the chromosomal regions that were tagged by such clustered, nominally positive SNPs in each of the samples of individuals from the same racial/ethnic groups. We identified genes for which these chromosomal intervals lay within the exons of the gene and/or in 10 kb of 5' or 3' flanking sequence.
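The clustering criterion can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that a cluster is a run of nominally-significant SNPs on one chromosome in which each consecutive pair is separated by at most 10 kb, and that a run qualifies when it holds at least 4 such SNPs. The function name and the input tuple layout are hypothetical.

```python
from collections import defaultdict

def find_clusters(snps, p_max=0.05, max_gap=10_000, min_snps=4):
    """Find chromosomal regions tagged by clustered nominally-significant SNPs.

    snps: iterable of (chromosome, position_bp, p_value) tuples.
    Returns a list of (chromosome, start_bp, end_bp, n_snps).
    Assumption: clusters are built by chaining neighbours <= max_gap apart.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, p in snps:
        if p < p_max:  # keep only nominally-significant SNPs
            by_chrom[chrom].append(pos)

    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] <= max_gap:
                run.append(pos)  # still within 10 kb of the previous SNP
            else:
                if len(run) >= min_snps:
                    clusters.append((chrom, run[0], run[-1], len(run)))
                run = [pos]  # start a new candidate cluster
        if len(run) >= min_snps:
            clusters.append((chrom, run[0], run[-1], len(run)))
    return clusters

# Made-up example data, not values from the study.
example = [
    ("1", 1_000, 0.01), ("1", 5_000, 0.02), ("1", 12_000, 0.04),
    ("1", 20_000, 0.03), ("1", 15_000, 0.50), ("1", 50_000, 0.01),
    ("2", 100, 0.01),
]
regions = find_clusters(example)
```

Regions found this way in two samples can then be compared by testing whether their intervals overlap, and genes can be assigned by extending each gene's exon span by 10 kb on either side.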

4) Monte Carlo methods for assignment of levels of significance to: a) the extent of clustering in each sample and b) the degree to which clustered nominally-positive SNPs from multiple independent samples identify the same chromosomal regions. Monte Carlo methods were used to assign empirical statistical probabilities to two null hypotheses, starting with the sets of all SNPs and the nominally positive SNPs that displayed p < 0.05 case vs control values.
We first tested the null hypothesis that chromosomal clustering of these nominally positive SNPs occurred at the level expected by chance in these datasets. For each Monte Carlo trial that tested this null hypothesis, we randomly selected a number of ''pseudo positive'' SNPs from each dataset that matched the number that achieved nominal significance in the bona fide dataset. Thus, we constructed a list of autosomal SNPs assayed in each sample and assigned a number to each SNP that corresponded to its position on the list. To select the pseudopositive SNPs for each trial of the European-American datasets, we selected 75,413 random numbers for the NIDA (see below) and 49,843 random numbers for the dbGAP datasets. For the African American datasets, we used 83,330 and 45,325 random numbers, respectively. For each trial, the SNPs identified by the positions on the list that corresponded to these randomly-assigned numbers were then queried for the extent to which their results equaled or exceeded the results obtained for the actual dataset. In 10,000 such trials for each sample, we compared results concerning the extent of chromosomal clustering from these sets of pseudopositive SNPs to those for the true positive SNPs. These empirical Monte Carlo p values thus addressed the null hypothesis that the true positive SNPs from each single sample were randomly arrayed on the chromosomes. Of course, the clustering of SNPs that provided nominally-significant case vs control differences in each individual sample did not allow us to discern whether the haplotypes identified in such a manner were related to a) phenotypic differences or to b) stochastic differences in haplotype frequencies between case and control samples.
Monte Carlo methods were also used to assign empirical statistical probabilities to a second null hypothesis: that the same chromosomal regions were identified by the clustered, nominally positive SNPs in independent samples with the frequencies expected by chance. In 10,000 trials from pairs of independent samples, we compared the extent of overlap between the chromosomal regions identified by the clustered, nominally-positive SNPs in each sample. The Monte Carlo p values that derive from these trials thus addressed the second null hypothesis that the chromosomal regions identified by clusters of nominally positive SNPs in each of multiple samples were identified only on stochastic bases that were unrelated to phenotype.
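The second Monte Carlo test can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: it treats the genome as a single coordinate axis, chains SNPs separated by at most 10 kb into clusters of at least 4, and counts regions from one sample that overlap any region from the other. All function names are hypothetical.

```python
import random

def clusters(positions, max_gap=10_000, min_snps=4):
    """Chain sorted positions whose gaps are <= max_gap; keep runs of >= min_snps."""
    out, run = [], []
    for pos in sorted(positions):
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_snps:
                out.append((run[0], run[-1]))
            run = []
        run.append(pos)
    if len(run) >= min_snps:
        out.append((run[0], run[-1]))
    return out

def count_overlaps(regions_a, regions_b):
    """Number of regions in A that overlap at least one region in B."""
    return sum(
        any(s <= e2 and s2 <= e for s2, e2 in regions_b) for s, e in regions_a
    )

def monte_carlo_overlap_p(all_a, all_b, n_pos_a, n_pos_b, observed,
                          trials=10_000, seed=0):
    """Fraction of trials in which randomly chosen 'pseudo positive' SNPs give
    at least as many overlapping regions as observed in the real data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw as many pseudo-positive SNPs as were nominally positive.
        pseudo_a = rng.sample(all_a, n_pos_a)
        pseudo_b = rng.sample(all_b, n_pos_b)
        if count_overlaps(clusters(pseudo_a), clusters(pseudo_b)) >= observed:
            hits += 1
    return hits / trials
```

When no trial reaches the observed count, the returned value is 0 and the result is conventionally reported as an upper bound, p < 1/trials (p < 0.0001 for 10,000 trials). Because pseudo-positive SNPs are drawn from each platform's own SNP list, platform-specific marker density is preserved in the null distribution.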
Secondary analysis of dbGAP data used permutation approaches as implemented in PLINK (v1.06) (http://pngu.mgh.harvard.edu/purcell/plink/) [25]. We randomized assignment of the phenotypes to data derived from the current SNPs and analyzed the data from 3,000 permutation trials that addressed each of several null hypotheses (see below).
To assess the power of our current approach we used current sample sizes and standard deviations, the power calculator PS v2.1.31 [26,27] and α = 0.05.

Results
As noted elsewhere [21], variation among the allele frequency estimates between pools from individuals of the same phenotype for each racial/ethnic group from the NIDA/MNB samples was +/- 0.02 (standard error of the mean, SEM).

European-American samples
For the dbGAP data from European-Americans, χ² tests displayed p < 0.05 for 49,843 autosomal Illumina SNPs. For the NIDA/MNB European-American samples, 75,413 of the autosomal Affymetrix 6.0 SNPs displayed t values with p < 0.05 in comparisons between data from substance dependent vs control samples [21].

Searches for genome wide significance in each European-American sample
We identified case vs control p values for t test results from NIDA/MNB samples and for χ² results from dbGAP samples from unrelated individuals. Permutation testing for the dbGAP European-American samples revealed p < 0.0003 (3,000 trials) for the number of SNPs with nominal case vs control p values < 0.05. However, virtually none of these p values reached the 10^-8 level deemed necessary for genome wide significance.

Searches for clustering of SNPs with nominally-significant case vs control differences in each European-American sample
We identified 3125 clusters of SNPs that displayed nominally significant, p < 0.05 case vs control differences for χ² results from dbGAP samples and 2931 clusters with nominally significant t test results from NIDA/MNB samples.

Searches for chromosomal regions identified by clustered SNPs with nominally-significant case vs control differences in both European-American samples
Two hundred four chromosomal regions contained clusters of nominally-significant SNPs from both of these two European-American samples.
None of 10,000 Monte Carlo simulation trials that each began with random sets of SNPs selected from each of the datasets identified as many overlapping regions as found in the true dataset. The overall Monte Carlo p < 0.0001 for the overlap noted in the true data thus provides very high levels of confidence that these independently-derived sets of results do not identify the same set of chromosomal regions by chance alone. Thus, the null hypothesis that the chromosomal regions identified by both samples are identified based only on stochastic grounds is rejected by these Monte Carlo data.
In addition, none of 3,000 permutation trials provided data that identified as many chromosomal regions from permuted data as those identified by the real datasets. Thus, the null hypothesis that the chromosomal regions identified by both samples are identified based only on stochastic grounds is also rejected by the permutation testing data. The genes that: a) lie in chromosomal regions identified by data from both European-American samples and b) display the most nominally-significant SNPs are listed in Table 1; the complete list of chromosomal regions identified in this way is listed in Table S1. The fraction of the genome occupied by these results is 210% of the size expected by chance, based on the fractions of the genome occupied by clustered nominally positive results from each of these two European-American samples (data not shown).

Table 1. Chromosomal regions and genes identified by clusters of SNPs that provide nominally-significant differences between individuals dependent on alcohol (dbGAP alcohol dependent v ctl) or at least one illegal substance (NIDA/MNB drug dependent v ctl) in subjects of European-American heritage.

Searches for genome wide significance in each African-American sample
We identified case vs control p values for χ² results from dbGAP samples and for t test results from NIDA/MNB pooled samples [21]. None of these p values approached the 10^-8 level deemed necessary for genome wide significance.

Searches for clustering of SNPs with nominally-significant case vs control differences in each African-American sample
We identified clusters of SNPs that displayed nominally significant, p < 0.05 case vs control differences for p values from χ² results from dbGAP samples and t test results from NIDA/MNB samples (2026 and 3383 clusters, respectively).

Searches for chromosomal regions identified by clustered SNPs with nominally-significant case vs control differences in both African-American samples
One hundred twenty nine chromosomal regions were identified by clustered nominally-positive results from both of the two African-American samples. None of 10,000 Monte Carlo simulation trials that each began with random sets of SNPs selected from each of the datasets identified as many overlapping regions as found in the true dataset; hence Monte Carlo p < 0.0001. Thus, the null hypothesis that the chromosomal regions identified by both African-American samples are found based only on stochastic grounds is rejected by these Monte Carlo data.
However, 199 of 200 permutation trials did provide data that identified as many chromosomal regions from permuted data as those identified by the real datasets. Thus, the null hypothesis that the chromosomal regions identified by both samples are identified based only on stochastic grounds was not rejected by permutation testing. This result may reflect structure in the data, given the known propensity for permutation testing to overestimate false discovery rates in the presence of such structure [28,29]. The genes that: a) lie in chromosomal regions identified by data from both African-American samples and b) display the most nominally-significant SNPs are listed in Table 2; the complete list of chromosomal regions identified in this way is listed in Table S2. The fraction of the genome occupied by these results is about 220% of that expected by chance, based on the fraction of the genome occupied by clustered nominally positive results from each of the African-American samples (data not shown).
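One way to read a "percent of that expected by chance" figure is as a ratio of the observed shared genome fraction to the product of the two per-sample fractions. The sketch below assumes that the chance expectation is computed under independence of the two samples; the paper does not spell out the calculation, the function name is hypothetical, and the input fractions are made up for illustration because the underlying values are not shown.

```python
def overlap_enrichment(frac_a, frac_b, frac_both):
    """Observed shared genome fraction divided by the fraction expected if
    the two samples' clustered regions were placed independently.

    Assumption: expected overlap under chance = frac_a * frac_b.
    """
    expected = frac_a * frac_b
    return frac_both / expected

# Hypothetical fractions, NOT values from the study.
ratio = overlap_enrichment(frac_a=0.02, frac_b=0.025, frac_both=0.0011)
print(f"{ratio:.0%} of the size expected by chance")
```

A ratio above 1 (above 100%) indicates that the two samples share more of the genome than independent placement of their clusters would predict.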

Searches for genes identified by clustered SNPs with nominally-significant case vs control differences in all four samples
The clusters from both of the two African-American samples identified six genes that were also identified by clusters from both of the two European-American samples. CDH13, CSMD1 and DSCAM are three cell adhesion molecules that we have identified in many prior studies of addiction vulnerability and/or abilities to quit smoking (see below), while CADPS, MTMR7 and UBASH3B have been identified in fewer prior studies. This modest overlap contrasts with the larger overall overlap between the Affymetrix datasets for the African-American vs European-American NIDA/MNB samples [21] and the Illumina datasets for the African-American vs European-American dbGAP samples. In the latter case, we can identify 146 chromosomal regions, 88 of which contain 126 genes, in which overlapping results between the two racial/ethnic groups are found in ways not found by chance in 10,000 Monte Carlo simulation trials (data not shown).

Validation of pooling vs individual genotyping for SNPs whose results provided the clusters
We compared individual vs pooled allele frequency estimations for the ca. 500 SNPs that displayed minor allele frequencies >0.1 and provided clustered, nominally positive results in data from the NIDA pooled samples. The results from these SNPs displayed a mean Pearson correlation coefficient of 0.66 between data from pooled and individual genotyping. These correlations were more modest than those identified in validating studies for pooling that used larger ranges of expected allele frequencies (there was an average 0.19 range of expected values for these genotypes vs a 0.9 range for the SNPs and pools used in initial studies that validated pooling with these Affymetrix 6.0 arrays) [21].

Discussion
Genome-wide association data of increasing richness are available for many complex disorders. Several of these GWA datasets contain relatively robust results at “oligogenic” loci that can also be identified, in many cases, by linkage-based approaches [30,31,32,33]. Even moderately secure GWA identification of “polygenic” influences on disease, however, is likely to require replicated data from multiple independent samples.
“Template” analyses seek SNPs that provide “genome wide significance” with the same phase of association in data from multiple independent samples. However, there have been no unanimous criteria for declaring replication of sets of data in circumstances in which no SNP achieves this level of statistical significance in each of multiple samples.
We have focused on identification of statistical significance for sets of chromosomal regions that are each identified by sets of nominally-significant SNPs from several independent samples. This approach identifies chromosomal regions and genes that are very likely, as a group, to display bona fide association with individual differences in vulnerability to develop dependence on an addictive substance. This overall confidence derives from approaches that address distinct sets of null and/or alternative hypotheses to explain the results obtained. First, seeking chromosomal regions in each sample that are identified by at least 4 closely-spaced nominally-positive SNPs addresses the null hypothesis that the results obtained are randomly distributed across chromosomes. This initial process also addresses the alternative hypothesis that the nominally-positive SNPs are identified based on technical problems in correctly assigning allele frequency differences to case vs control sample comparisons (or in correctly identifying the true variances for these values). Of course, we would expect to see clustering of nominally-positive SNPs in each sample in regions in which there was either a) linkage disequilibrium between the SNPs studied and between these SNPs and functional variants that influenced addiction vulnerability or b) linkage disequilibrium between these SNPs and stochastic differences in haplotype frequencies in individual samples of cases vs those in a single sample of controls that are unrelated to the phenotype. The second way in which we seek replication identifies, in independent samples, many of the same chromosomal regions based on their content of clustered, nominally positive SNPs. This comparison addresses the null hypothesis that the clustering observed in each sample derives from stochastic case vs control differences in haplotype frequencies rather than case vs control differences that are truly related to differences in phenotypes. 
This comparison also provides additional support for our ability to reject the null and alternative hypotheses relating to assay noise. We thus identify more chromosomal regions and genes based on the overlap between the chromosomal regions identified by data from each sample than we would expect if the only reason for clustering of nominally positive SNPs in each sample was stochastic variation in the frequencies with which blocks of restricted haplotype diversity are found in cases vs controls that are unrelated to the phenotype. Availability of data from other recently-reported genome wide association studies also provides a third way in which we seek replication, based on identification by the current data of more of the same genes that were identified in other reports from independent samples and different analyses than we would expect by chance. This comparison also provides additional means for us to refute the null hypothesis that the clustering observed in each sample derives from stochastic case vs control differences in haplotype frequencies rather than case vs control differences that are truly related to differences in phenotypes. In replicated samples that compared 500 k allele frequencies in alcohol dependent to population control samples, Treutlein and colleagues [15] have used a mixed analytic strategy to identify nine genes. Products of two of these genes, ADH1C and PECR, are likely to play direct roles in alcohol metabolism and thus provide weak candidates for overlap with data from the NIDA/MNB samples. Our current results identify three of the remaining seven genes: CDH13, ERAP and CAST. Based on chance, we should have identified fewer than one of these genes (0.07 genes on average). We have also recently begun analyses of a 500,000 SNP dataset supplied by these authors.
We have identified chromosomal regions tagged by clusters of at least 3 SNPs which lie within 25 kb of each other that display nominally-significant case vs control differences in this sample, criteria that we have previously used for 500 k datasets. These analyses identify
There are a number of important limitations that come from these samples, these analyses, and from the application of this approach to these datasets. The two distinct null hypotheses both require careful thinking about linkage disequilibrium, since it is easy to confuse data and analyses that bear on linkage disequilibrium among markers that display case vs control differences in single samples, the chromosomal regions that such markers label, and the chromosomal regions labeled by such sets of markers in multiple independent samples. Especial difficulties in clarity may arise since we anticipate true positive results that combine differences between cases and controls that are based on linkage disequilibrium among markers that display case vs control association with disease and between these markers and the functional allelic variants that provide the variation in gene function that influences phenotype. Without dissecting these differences, it is easy to come to the incorrect conclusion that the method described herein is only detecting the linkage disequilibrium structure and not disease association.
There are other limitations. The NIDA/MNB samples, largely of individuals who were not seeking treatment, were recruited at a single site and compare dependent individuals with heavy levels of substance use to controls with modest or no substance use. These features might provide differences from the dbGAP samples, which were recruited at a number of sites from largely treatment-seeking individuals or probands. The dbGAP samples compare alcohol dependent individuals to controls whose levels of illegal substance use do not produce dependence, but might be substantial. To parallel the recently-reported analysis of these data by Bierut and colleagues [16], we have included, in the control group, individuals who smoked significant numbers of cigarettes and/or display DSM or FTND dependence on nicotine. Reanalyses of the data from the dbGAP European-American sample after excluding the 376 individuals with FTND scores >4 and/or DSM nicotine dependence yield overlap with data from NIDA samples that is even stronger than that identified in the main analyses presented here, even though more than ¼ of the “controls” are removed from these analyses (Johnson et al, unpublished observations, 2010). Due to the small number of individuals with Asian or Hispanic racial/ethnic backgrounds in this sample, we have excluded them from the present analyses. This exclusion also renders our comparisons different from those used in the recent report of data from many of these same dbGAP individuals [16]. While Monte Carlo simulation tests weigh strongly against the null hypothesis that chance alone accounts for the degree to which the same genes are identified by data from each of the two samples from individuals of the same racial/ethnic background, permutation tests only reach high levels of statistical significance in rejecting this null hypothesis in the European-American subjects.
Principal components analyses suggest that much of the variance in these data is not due to phenotype or racial/ethnic group (Johnson et al, unpublished observations, 2010); such structure might account for the permutation results from the African-American data [28,29]. Based on statistical considerations, the present analyses are likely to provide many false negative results. The power of each of these samples to detect polygenic influences is moderate. The requirement for convergent identification of the same chromosomal region by data from both samples of the same racial/ethnic background provides a likelihood of even more false negative results. Case vs control allele frequency differences in the NIDA/MNB samples were genotyped using multiple DNA pools and an Affymetrix 6.0 platform, providing t tests that use information about both mean differences and variances. Case vs control differences in the dbGAP samples were assessed using Illumina platform genotyping of individual samples, yielding χ² results without explicit assessment of variance. The requirement that at least 4 nominally-significant SNPs lie within 10 kb of each other cannot be fulfilled in a number of chromosomal regions or in a number of genes in which the density of SNPs is too low to meet this stringent requirement (see Supplement of [14] for list of the genes that cannot be assessed with these criteria using the Affymetrix platform). There are only about ¼ million autosomal SNPs that are shared between the ca. 900 K and 1 M autosomal SNPs evaluated by the Affymetrix and Illumina platforms, respectively, further exacerbating this problem in many genomic regions.
Despite these limitations, there is highly-significant overall convergence between two comparisons of NIDA/MNB and dbGAP GWA data from substance-dependent individuals vs controls: one comparison in European-American subjects and another comparison in African-American subjects. For each of these comparisons, the degree to which clusters of nominally-positive SNPs identify the same chromosomal regions and genes is never found by chance in up to 10,000 Monte Carlo simulation trials.
This evidence for replication, defined in this fashion, also contrasts strikingly with results from attempts to identify replication (and/or generalization) in other ways. For example, analyses that ask how often the same SNPs display nominally-significant associations with the same phase in each of the replicate samples within each racial/ethnic group identify about as many such SNPs as expected by chance (data not shown).
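The same-SNP, same-phase comparison mentioned above can be sketched as follows. The paper does not show these data or describe the calculation, so everything here is an assumption: the hypothetical function names, the nominal threshold of p < 0.05, the use of the sign of an effect estimate to represent phase, and a chance expectation that treats SNPs and samples as independent (ignoring linkage disequilibrium).

```python
def expected_same_phase(n_shared_snps, alpha_a=0.05, alpha_b=0.05):
    """Chance expectation for SNPs nominally significant in both samples
    with the same phase.

    Assumptions: two-sided tests, independent SNPs, independent samples.
    Under the null, a SNP is nominally significant in both samples with
    probability alpha_a * alpha_b, and the two directions agree half the time.
    """
    return n_shared_snps * alpha_a * alpha_b * 0.5


def count_same_phase(sample_a, sample_b, alpha=0.05):
    """Count SNPs with p < alpha in both samples and effects of the same sign.

    sample_a, sample_b: dict mapping SNP id -> (p_value, effect), where the
    sign of `effect` encodes the phase of the association.
    """
    count = 0
    for snp, (p_a, effect_a) in sample_a.items():
        if snp not in sample_b:
            continue  # only SNPs assayed in both samples can be compared
        p_b, effect_b = sample_b[snp]
        if p_a < alpha and p_b < alpha and effect_a * effect_b > 0:
            count += 1
    return count


# Made-up toy data, for illustration only.
toy_a = {"snp1": (0.01, 0.3), "snp2": (0.04, -0.2), "snp3": (0.60, 0.1)}
toy_b = {"snp1": (0.03, 0.5), "snp2": (0.02, 0.4), "snp4": (0.01, 0.2)}
observed = count_same_phase(toy_a, toy_b)
expected = expected_same_phase(n_shared_snps=1000)
```

An observed count close to the expected count would correspond to the "about as many as expected by chance" outcome described in the text.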
We have previously reported the apparent success of “nontemplate” analyses that are similar to those used herein when applied to data from four independent case vs control samples for bipolar disorder [20]. None of these bipolar vs control samples, individually, provided results with genome wide significance. These samples combined data from individual and pooled genotyping using different genotyping platforms. Despite these difficulties, the results of nontemplate analyses provided much more frequent identification of the same genomic regions and genes by clustered, nominally-positive SNPs from multiple independent samples in bipolar disorder than we would anticipate by chance.
Studies that focus on identifying “template” same-phase association with genome wide levels of significance in multiple independent samples appear most likely to succeed when oligogenic genetic architecture confers large association signals in each independent sample, when the same SNP sets are studied in each, when the disease exhibits little allelic or locus heterogeneity and when there are good matches between the fine patterns of linkage disequilibrium of the samples being studied. Apparent replication “failures” using this approach could thus relate to a number of features that include associations of modest magnitude, sample-to-sample differences in fine patterns of linkage disequilibrium, different amounts of information provided by markers with population-specific differences in allele frequencies, allelic heterogeneity and locus heterogeneity.
Monte Carlo methods allow us to test the probabilities of chance clustering of nominally-positive SNPs and the chance of convergence between clusters identified in one sample and clusters identified in other samples. Our Monte Carlo approaches deploy an empirical method that uses the existing dataset as a source for randomly selected SNPs for each Monte Carlo trial. The results of these simulations provide strong overall confidence that these sets of results are not due to chance. By contrast, these approaches alone provide unequivocal identification of few individual SNPs or genes. This lack of unequivocal identification of individual SNPs is consistent with current polygenic/allelic heterogeneity working models for the genetic architecture of vulnerability to substance abuse [14,34]. However, identification of associations at some loci, such as the CDH13 locus, in many independent samples (see below) makes it highly unlikely that this locus does not harbor allelic variants that influence interactions between humans and addictive substances.
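The empirical Monte Carlo idea described above, drawing random SNPs from the existing dataset in each trial and asking how often chance clusters converge across samples, can be sketched in Python. This is an illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the sketch treats a single chromosome, clusters are defined by chaining neighbors at most 10 kb apart, and convergence is counted as clusters from one sample that overlap any cluster from the other.

```python
import random


def find_clusters(positions, max_gap=10_000, min_snps=4):
    """Return (start, end) intervals for runs of at least `min_snps` positions
    in which consecutive positions are at most `max_gap` bp apart (assumed rule)."""
    clusters, run = [], []
    for pos in sorted(positions):
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_snps:
                clusters.append((run[0], run[-1]))
            run = []
        run.append(pos)
    if len(run) >= min_snps:
        clusters.append((run[0], run[-1]))
    return clusters


def count_overlaps(clusters_a, clusters_b):
    """Number of clusters in A that overlap at least one cluster in B."""
    return sum(
        any(a_start <= b_end and b_start <= a_end for b_start, b_end in clusters_b)
        for a_start, a_end in clusters_a
    )


def monte_carlo_p(snps_a, snps_b, n_pos_a, n_pos_b, observed,
                  trials=10_000, seed=0):
    """Fraction of trials whose chance overlap is at least `observed`.

    snps_a, snps_b: lists of all SNP positions assayed in each sample.
    n_pos_a, n_pos_b: numbers of nominally-positive SNPs actually observed.
    Each trial draws that many SNPs at random from the assayed positions,
    so simulated clusters inherit the real marker density.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sim_a = find_clusters(rng.sample(snps_a, n_pos_a))
        sim_b = find_clusters(rng.sample(snps_b, n_pos_b))
        if count_overlaps(sim_a, sim_b) >= observed:
            hits += 1
    return hits / trials


# Made-up marker map: 100 SNPs spaced 20 kb apart, so no chance cluster can form.
sparse_map = list(range(0, 2_000_000, 20_000))
p_sparse = monte_carlo_p(sparse_map, sparse_map, 10, 10, observed=1, trials=50)
```

A result of zero hits, as in "never found by chance in up to 10,000 Monte Carlo simulation trials", is best read as an empirical p-value below 1/trials rather than as exactly zero.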
Previous analyses that have compared the MNB/NIDA European-American to African-American results have identified genomic regions that are labeled by clustered, nominally-positive SNPs from both samples, supporting roles for some allelic variants that are likely to be old in relation to human history [6,7,9,21]. Data from analyses that combine results from individuals with different racial/ethnic backgrounds also provide suggestive results in regions such as the GABA receptor gene cluster on chromosome 4 for evolutionarily-old variants [16,35]. Identification in both studies of SNP markers whose allelic frequencies distinguish controls from addicts of different ethnicities supports “common disease/common allele” genetic architecture [36] for part of the genetics of addiction vulnerability. However, the substantially greater convergence, noted here, for data from the same racial/ethnic groups also points to possibly-substantial roles for variants that have been accumulated more recently in human populations that have been more separate until relatively recently.
Genes identified by this work include those in several classes. When we compare the list of genes identified by these samples to functional classes as annotated in Gene Ontology (GO) using Biobase, we find the greatest (9.2 × 10⁻⁹ to 1.2 × 10⁻⁶) statistical significance for overrepresentation of the genes whose products are involved with the following biological processes: signal transmission (57 observed/28 expected by chance), signaling process (57/28), cell communication (39/16), regulation of cellular process (90/57), regulation of localization (23/7), signaling (68/39), negative regulation of biological process (42/19), regulation of biological process (94/63), biological regulation (100/70) and synaptic transmission (16/4).

CDH13 associations with addiction phenotypes have now been identified both in the four samples studied here and in a number of prior reports. We initially identified associations between substance dependence vulnerability and CDH13 variants in smaller subsets of COGA and MNB samples in studies that utilized earlier microarray types [9,37]. We and others have subsequently identified such associations in several other samples for addiction-related phenotypes that include: a) vulnerability to substance dependence [12], including independent replicated alcohol dependence datasets [15,19,38], b) individual differences in acute responses to alcohol administration [39] and c) abilities to quit smoking [24,40,41]. Allelic variants in CDH13, a glycophosphoinositol-anchored cadherin that is expressed in neurons that lie in interesting brain circuits, are thus very strong candidates to contribute to addiction-related phenotypes.
The findings presented in the current report thus add strong evidence for involvement of variants in several individual genes, add to the ongoing consideration of methods for comparing GWA datasets and enhance understanding of the genetic underpinnings of human addiction. For addictions, as for many complex disorders, such data provide an increasingly rich basis for improved understanding and for personalized prevention and treatment strategies.

Table S1
The complete list of chromosomal regions that a) are identified by data from both European-American samples and b) display the most nominally-significant SNPs (genes listed in Table 1). (XLS)

Table S2
The complete list of chromosomal regions that a) are identified by data from both African-American samples and b) display the most nominally-significant SNPs (genes listed in Table 2). (XLS)