Single Nucleotide Polymorphism (SNP)-Strings: An Alternative Method for Assessing Genetic Associations

Background Genome-wide association studies (GWAS) identify disease-associations for single-nucleotide-polymorphisms (SNPs) from scattered genomic-locations. However, SNPs frequently reside on several different SNP-haplotypes, only some of which may be disease-associated. This circumstance lowers the observed odds-ratio for disease-association. Methodology/Principal Findings Here we develop a method to identify the two SNP-haplotypes, which combine to produce each person’s SNP-genotype over specified chromosomal segments. Two multiple sclerosis (MS)-associated genetic regions were modeled; DRB1 (a Class II molecule of the major histocompatibility complex) and MMEL1 (an endopeptidase that degrades both neuropeptides and β-amyloid). For each locus, we considered sets of eleven adjacent SNPs, surrounding the putative disease-associated gene and spanning ∼200 kb of DNA. The SNP-information was converted into an ordered-set of eleven-numbers (subject-vectors) based on whether a person had zero, one, or two copies of particular SNP-variant at each sequential SNP-location. SNP-strings were defined as those ordered-combinations of eleven-numbers (0 or 1), representing a haplotype, two of which combined to form the observed subject-vector. Subject-vectors were resolved using probabilistic methods. In both regions, only a small number of SNP-strings were present. We compared our method to the SHAPEIT-2 phasing-algorithm. When the SNP-information spanning 200 kb was used, SHAPEIT-2 was inaccurate. When the SHAPEIT-2 window was increased to 2,000 kb, the concordance between the two methods, in both of these eleven-SNP regions, was over 99%, suggesting that, in these regions, both methods were quite accurate. Nevertheless, correspondence was not uniformly high over the entire DNA-span but, rather, was characterized by alternating peaks and valleys of concordance. Moreover, in the valleys of poor-correspondence, SHAPEIT-2 was also inconsistent with itself, suggesting that the SNP-string method is more accurate across the entire region. Conclusions/Significance Accurate haplotype identification will enhance the detection of genetic-associations. The SNP-string method provides a simple means to accomplish this and can be extended to cover larger genomic regions, thereby improving a GWAS’s power, even for those published previously.


Introduction
Multiple sclerosis (MS) has a complex etiological basis, which involves both the genetic makeup of an individual and their environmental experiences [1][2][3][4][5][6][7][8]. With regard to the importance of genetics, it is notable that the life-time risk of disease for an individual from northern Europe or Canada is about 0.1% [2]. The risk in individuals who have an affected family member increases in rough proportion to the amount of shared genetic information between the affected relative and the individual [2][3][4][5][6][7][8]. Thus, third degree relatives (12.5% genetic similarity) such as first cousins, have a risk less than 1%; second degree relatives (25% genetic similarity) such as aunts and uncles have a risk of about 1-2% and first degree relatives (50% genetic similarity) such as siblings, parents, and children of an MS proband have a risk of approximately 2-5%. By contrast, in monozygotic-twins of an MS proband (100% genetic similarity) the risk increases to about 25-30% [2][3][4][5][6][7][8]. From this, it is clear that although environmental events are important for MS pathogenesis, genetic susceptibility is critical. Indeed, because genetic susceptibility seems necessary for MS to develop [9], because the necessary environmental factors seem to be ''population-wide'' exposures [9], and because only about 2% of the population is genetically susceptible to getting MS [9], genetics is, by far, the greatest single determinant of disease. In addition, this genetic susceptibility to multiple sclerosis (MS) seems to involve multiple genetic loci [10][11][12][13][14][15][16]. This fact has become particularly apparent from the very large genome-wide associations studies (GWAS) that have been recently published [14,15,[17][18][19][20][21].
The largest of these was a multicenter, multi-country GWAS involving tens of thousands of cases and controls [14,15], which identified single-nucleotide polymorphisms (SNPs) in about 100 genomic regions that were MS-associated. With notable exception of some SNPs near the DRB1 locus on chromosome 6, however, the odds ratios (OR) for almost all of these associations was quite low -i.e., mostly between OR = 1.1 and OR = 1.2 [14,15]. This could be due to these other genetic factors having a smaller impact on MS-susceptibility compared to the DRB1 locus [14,15]. However, it could also be due to the same SNP being present on more than one haplotype at a particular locus. If so, this may reduce substantially the measured OR for the association of a particular genomic region with the disease [11].
As a result, several haplotype-based approaches have been explored and are thought to improve the statistical power for detecting genetic associations, especially for rare causal alleles [22]. One early approach was a simple algorithm, which identified individuals who had unambiguous haplotypes of DNA sequences because they were either homozygous for every nucleotide in the sequence or because they were heterozygous at only a single site [23,24]. For unresolved individuals, the algorithm then attempted to pair the known haplotypes with ''novel'' haplotypes to produce the individual's likely genotype. If such a pair was possible, the new ''novel'' haplotype was presumed to be present in the population. Despite the simplicity of this method, however, probabilistic approaches (using maximum likelihood estimation or Bayesian analysis) have become the preferred method for identifying likely haplotypes [25][26][27][28]. In part, this has been due to expressed concerns that the simple algorithm yields haplotypes, which depend upon the order of data entry [25][26][27], that it depends upon the presence of homozygotes or single site heterozygotes to get started [24][25][26][27], that it identifies only the minimum number of haplotypes [26,27], that it is more sensitive to departures from Hardy-Weinberg equilibrium (HWE) than other methods [27], and that it results in more errors compared to probabilistic approaches [27].
Nevertheless, the simplicity of the approach is notable and it seems likely that refinements in the method might overcome many of these concerns. It is the purpose of this paper to explore the utility of using an alternative analysis method to define the haplotypes that are present at a particular genetic locus within the population and to test the ability of such specific haplotypes to detect disease-associations.

Study Participants
The study cohort has been described in detail previously [21]. Briefly, the cohort was assembled as a prospective multicenter effort, which began in 2003. Three MS centers participated both in patient recruitment and in the collection of biological specimens. Two of the centers (Vrije Universiteit Medical Center, Amsterdam; and University Hospital Basel) were in Europe and one (University of California, San Francisco) was in the United States (US). This study consisted primarily of patients with a northern-European ancestry. Although all clinical MS-subtypes were included, most had a relapsing-remitting (RR) onset. The diagnosis of RRMS (or other subtypes of MS) was made utilizing internationally recognized criteria [29,30]. All participating centers used identical inclusion and diagnostic criteria. Control subjects were matched with cases by age and gender. The Committee on Human Research at each of the participating centers approved the protocol and informed consent was obtained from each study participant.

Genotyping and Quality Control
The genotyping and quality control methods utilized for the analysis of this cohort have been previously described in detail [21]. Briefly, genotyping was done at the Illumina facilities using the Sentrix HumanHap550 BeadChip. This analysis resulted in genotype information about 551,642 SNPs in 975 cases and 882 controls. DRB1*1501 genotyping was performed using a validated gene-specific TaqMan assay [21] and was only undertaken for participants from the US.

Genetic Loci
Although several of the MS-related genetic loci were screened preliminarily (including an intergenic region), two loci were selected for detailed analysis, both of which had been previously associated with MS in a large GWAS [14]. The first locus was the DRB1 locus on Chromosome 6, for which the susceptibility allele (DRB1*1501) is known [19,[31][32][33]. DRB1 encodes a protein (a Class II molecule of the major histocompatibility complex), which binds foreign peptides derived from extracellular proteins for presentation to thymic derived lymphocytes (T-cells). It is expressed on the surface of antigen presenting cells such as dendritic cells, bone marrow derived lymphocytes (B-cells) and macrophages. This locus was picked because, for the US population, both the DRB1*1501 status and the SNP-information was available. The other locus (MMEL1) on Chromosome 1 was chosen because it was the first non-DRB1 MS-associated gene listed by the International Consortium [14]. MMEL1 is a member of the membrane (M13) metallo-endopeptidase family and is involved in the degradation of both neuropeptides and b-amyloid [34]. As it might potentially relate to MS risk, however, the function of this protein is not well defined.

Statistical Methods
To simplify program development, only eleven SNPs were used from each genomic region. The choice of eleven SNPs was arbitrary. Preliminary exploration demonstrated that the method could define haplotypes using anywhere between 3 and 24 SNPs although, in theory, there is no upper limit to the method as long as there are a sufficient number of homozygotes and single-site heterozygotes in the population. Nevertheless, for the purpose of this study, the eleven SNPs were chosen because they flanked both the most significantly associated SNP and the putative gene of interest ( Figure 1). Each SNP was labeled sequentially from (n1) to (n11) based on its chromosomal location (Table 1; Figure 1). The eleven SNPs, which were analyzed at the DRB1 locus, did not include the four tagging SNPs (rs3129934, rs9267992, rs9271366, and rs3129860) identified previously [21] because these particular SNPs were not available in this dataset. Moreover, the MMEL1 region was not identified as MS-associated by this earlier study [21].
The DRB1 cluster spans a length of 160.7 kilobases (kb) of DNA and the MMEL1 cluster spans 228.6 kb (Table 1). Of note, both clusters have a large gap in SNP-coverage of approximately 137 kb between SNPs (n7) and (n8). For each SNP-cluster, the SNP-information for each individual was converted into an ordered set (the subject-vector) of eleven ternary numbers (0, 1, or 2) based on whether they had zero, one or two copies of a particular SNP-variant for each sequential SNP in the cluster (these SNP-variants were designated according to the number of copies of the minor allele -in the control population -at each location). For example, the 5 th subject in the database had the DRB1 subject-vector of (20001111110), which indicated that he possessed 2 copies of (n1), 0 copies of (n2), 0 copies of (n3), and so forth. Because, essentially, all SNPs in the genome are binary, these subject-vectors are composed of two haplotypes, which either do or don't have a particular SNP variant at each of the SNP locations. For the purpose of the present analysis, SNP-strings were vectors, defined as those specific ordered sets of eleven binary numbers (0 or 1), representing the two haplotypes (over the entire cluster span), which combined (added) to produce each observed subject-vector.
As with Clark's method [23,24], those SNP-strings that were unambiguously present in the population were identified in two ways. The first was to identify all subject-vectors, which consisted entirely of zeros (0s) and twos (2s). These individuals must be homozygous for the same SNP-string. For example, the 6 th subject in the database had a DRB1 subject-vector of (02202200000), which indicated that she possessed two copies of the (01101100000) SNP-string. The second method was to identify all individuals who were single SNP-heterozygotes (i.e., had subject-vectors consisting of all 0s and 2s except for having a 1 at a single SNP location). These individuals must have identical SNPstrings except for the single location where one SNP-string had a 0 and the other had a 1. For example, the 147 th person in the database had a DRB1 subject-vector of (20000022221), which could only arise from the combination of the SNP-strings (10000011110) and (10000011111). In this manner, a list of most common SNP-strings in the population was compiled ( Figure 2). Moreover, the relative frequency of the homozygous representation of each SNP-string in the cases and controls provides an estimate of the underlying SNP-string frequency in each population ( Figure 3).
Once the list of these unambiguous ''identified'' SNP-strings was compiled, the observed subject-vectors were decomposed into those combinations of the unambiguous ''identified'' SNP-strings, which could (potentially) have produced the observed subjectvector. For each person, one of three outcomes was possible. First, it might be the case that there was only one (unique) combination of two unambiguous ''identified'' SNP-strings, which could give rise to the observed subject-vector ( Figure 2). Second, it might be that there was more than one combination of two unambiguous ''identified'' SNP-strings, each of which could give rise to the same observed subject-vector. In this second case, the decomposition of the observed subject-vector was said to be ''conflicted'' ( Figure 2). And third, it might be that there was no combination of unambiguous ''identified'' SNP-strings, which could give rise to the observed subject-vector. In this last case, an additional ''novel'' (not unambiguously identified) SNP-string (or strings) must be present in the population. Because all such ''novel'' SNP-strings are (presumably) low in frequency compared to the unambiguous ''identified'' SNP-strings, it is very likely that, when they occur, they will be heterozygous with one of the unambiguous ''identified'' SNP-strings. In many cases, two or more ''novel'' SNP-strings could, potentially, have combined with an unambiguous ''identified'' SNP-string to produce the same observed subject-vector. In these cases, the ''novel'' SNP-string, which actually underlies the observed subjectvector was said to be in an ''indeterminate'' (conflicted) combination ( Figure 2). In these circumstances, every such ''legitimate'' haplotype (i.e., one which consisted of a string of zeros and ones) was added to the growing list ''possible'' alternative haplotypes. The SNP-strings underlying a subjectvector could also be ''indeterminate'' if the observed subject-vector could only be formed by the combination of two ''novel'' SNPstrings. In some cases, however, only a single possible ''novel'' SNP-string could be combined with an unambiguous ''identified'' SNP-string to produce the observed subject-vector ( Figure 2). These were referred to as ''indeterminate'' (unique) combinations ( Figure 2). In these cases, it is very likely that the ''novel'' SNPstring (so identified) actually occurs in the population. Therefore, after searching each ''indeterminate'' subject-vector for these unique ''novel'' SNP-strings, this set of newly-discovered ''novel'' haplotypes was added to the set of unambiguous ''identified'' SNPstrings to form a new list of ''identified'' SNP-strings. Then the entire set of observed subject-vectors was again decomposed using this new ''identified'' SNP-string list. After compiling this new ''identified'' SNP-string list, the decomposition of the subjectvectors into their constituent SNP-strings was repeated. As expected, compared to the first decomposition using only unambiguous ''identified'' SNP-strings, this decomposition resulted in a greater number of conflicts being found.
Additional rounds of the same procedure were repeated until no further SNP-strings could be added to the ''identified'' SNP-string list or until more than 4 iterations were performed. Generally, however, this process was complete after 2 or 3 cycles. After screening ,25,000 subject-vectors using this procedure, only one individual had a genotype (subject-vector) which could not be decomposed into a pair of haplotypes, in which at least one of the pair was an ''identified'' SNP-string. This single individual, therefore, must either have had a heterozygous state with two unidentified ''novel'' SNP-strings or an error was made in their genotyping.
For the present analysis, whenever the SNP-string identification was conflicted, these conflicts were resolved based on the relative probabilities of the different allelic combinations in the population. These probabilities, in turn, were estimated from the SNP-string frequencies for all non-conflicted identifications ( Figure 2; Table 2). For example, the 7 th person in the database had a DRB1 subjectvector of (11111011110). This could have arisen from the combination of either the (a7-a8) or the (a13-a20) SNP-strings. The probability of the latter combination (determined from the non-conflicted identifications in the whole population), however, involves two very rare SNP-strings compared to the former ( Table 2) and, in this case, there is 50-fold difference in likelihood. Consequently, this particular conflict was resolved in favor of the (a7-a8) combination. In rare cases, there was little difference in likelihood between possible haplotype pairs although, more commonly, the likelihoods differed by an order of magnitude (or more) between pairs. Therefore, for the purposes of the present analysis method, all conflicts were always resolved in favor of the most likely SNP-string combination.
We also generated a so-called ''complete'' SNP-string list. To do this we combined the lists of ''possible'' and ''identified'' SNPstrings, which had been compiled during the above analyses. We then decomposed the subject-vectors using the entire combined list. Again conflicts were resolved using the product of the estimated frequencies derived from the non-conflicted SNP-string identifications. For the ''novel'' SNP-strings, we assigned each the nominal estimated frequency of half the smallest possible frequency for an ''identified'' SNP-string (i.e., half of one divided by the number of unambiguous or non-conflicted SNP-strings in the population). Following the development of this ''complete'' list, the decomposition was repeated. Again, if a particular SNP-string was never observed in a non-conflicted combination, the estimated frequency was taken half the smallest possible frequency. However, for every SNP-string that was observed in a nonconflicted combination, its estimated frequency was taken as that, which had actually been observed. Using this ''complete'' list, 100% of the subject-vectors could be resolved into haplotypes. Following this ''complete'' analysis, those ''novel'' SNP-strings, which accounted for more than 1% of the non-conflicted identifications, were added to the final ''identified'' SNP-string list. Also, using this ''complete'' list, the decomposition was redone iteratively, eliminating those haplotypes that were never selected (i.e., which had a zero-frequency in both cases or controls), until all remaining haplotypes on the ''complete'' list had a non-zero final Figure 2. Graphical depiction of the SNP-string method presented here (see text). Subject-vectors (strings of 0s, 1s, and 2s) are searched for homozygous and single-site heterozygus individuals (A). These ''unambiguous'' combinations are decomposed into their constituent ''unambiguous'' SNP-string combinations and then a list of all the unambiguous SNP-strings that have been ''identified'' is compiled (A). Following this, the entire subject-vector list is decomposed into the possible combination categories (B). The ''indeterminate (unique)'' decompositions are used to ''identify'' additional SNP-strings, which are then added to the ''identified'' SNP-string list (B and C). The entire decomposition is then repeated until either no further SNP-strings can be added to the ''identified'' list or more than 4 decompositions have taken place. The final list are considered ''identified'' SNP-strings (C and D). Unique decompositions (of ''identified'' SNP-strings) and unambiguous combinations are used to calculate the allellic frequencies used for resolving the ''conflicted'' decompositions (E). Following this, a list of all possible addtional ''novel'' SNP-strings is compiled from the ''indeterminate (conflicted)'' decompositions (F). This list is then combined with the ''identified'' list to make the ''complete'' SNPstring list (F) and the decomposition repeated. By definition, using this ''complete'' list, every subject-vector will be either an unambiguous combination or a unique or conflicted decomposition. Allellic frequencies are recalculated from the unique decompositions and unambiguous combinations and these frequencies used to resolve the conflicted decompositions. Those SNP-strings that never selected (i.e., have a zero final frequency) are then dropped from the ''complete'' list and the decomposition repeated until all SNP-strings on the ''complete'' list have a non-zero final frequency (G). In the final step, persons with uncommon alleles (selected in less than 9 individuals) are then reassessed for alternative possible combinaitions with ''novel'' SNP-strings, these novel SNP-strings are added to the ''complete'' list, and the process described above is again repeated until all SNP-strings on the ''complete'' list have a non-zero final frequency and the SNP-string composition of every subject-vector has been selected. doi:10.1371/journal.pone.0090034.g002 frequency and 100% of the subject-vectors could be decomposed into their constituent haplotypes ( Figure 2).
Following, this penultimate decomposition, we reassessed those individuals who carried haplotypes that were found in fewer than 9 individuals. For these individuals, we created a new ''possible'' haplotype list, consisting of all of those ''legitimate'' alternative haplotypes, which could have combined with an ''identified'' haplotype to form the observed subject-vector and, thus, could have substituted for the very rare haplotype, which had been selected by the above procedure. The number nine was chosen because, for haplotypes having this expected number of observations (or more), there is greater than a 99.9% chance (Poisson distribution) that, at least, one example would have been found in the data. Also, the inclusion of a greater number of haplotypes on the ''possible'' list only serves to make the inadvertent exclusion of a rare (but present) haplotype less likely. The new list of ''possible'' haplotypes was added to the final ''complete'' list and, once again, the decomposition was redone iteratively, eliminating those haplotypes that were never selected, until all remaining haplotypes on the final ''complete'' list had a non-zero final frequency and 100% of the subject-vectors were decomposed. For both the DRB1 and MMEL1 loci, this last iterative decomposition made no difference to the final ''complete'' SNP-string list. The final SNPstring frequencies (Table 3) were calculated following this last ''complete'' decomposition and following the final selection process.
We compared our method of phasing to the commonly-used SHAPEIT-2 method, which has been validated on several large datasets [35]. The details of this method are described elsewhere [35,36,37]. Briefly, the SHAPEIT-2 method combines features of SHAPEIT [36] and Impute2 [37] to enhance performance. SHAPEIT uses a Markov model to separate the haplotype-space for a given individual [36] from the set of all possible haplotype pairs consistent with a person's genotype. Transition probabilities are estimated in local windows of a given size with a ''surrogate family'' approach used to select the set of templates with the smallest Hamming distances because those at short distances presumably share recent ancestry with the individual under consideration [37]. The SHAPEIT-2 method has been shown to be superior to several other methods based on its performance using several large-sample, whole-chromosome data sets from a range of SNP genotyping chips [35].
For the purposes of our analysis, we compared the haplotype predictions using the two phasing methods. The actual haplotype frequencies in the population were estimated in three manners. The first (Figure 3), was to determine the most frequent haplotypes based on the number of homozygotes in the sample population. Because both the case and control populations are at HWE, at least with respect to the DRB1 locus (9), the different haplotype frequencies can be estimated as the square root of the homozygotic frequencies [9,38]. This method is independent of which phasing method is used. The second method was to use the haplotype frequencies, estimated from all non-conflicted haplotypes found using the phasing method presented in this study (Tables 2 and 3). These frequencies were estimated jointly, combining cases with controls, although they are presented separately in Table 2. The  Table 2. Estimated SNP-string frequencies using either the non-conflicted or final identifications*.   Table 3. SNP-string frequency in European and US populations*.  third method was to estimate haplotype frequencies as predicted by the SHAPEIT-2 output. All three of these methods provided substantially similar estimates for the relative frequencies of the most common haplotypes (Figure 3; Tables 2 and 3). We analyzed discrepancies between SHAPEIT-2 and our phasing algorithm by using these predicted allele frequencies to estimate the likelihood of individual phasing predictions (defined as the product of frequencies of the two phased haplotypes identified) by each algorithm. Subjects were then grouped into homozygous carriers, heterozygous carriers and non-carriers of a particular SNP-string and tested for differences in the SNP-string distribution between patients and controls. ORs were calculated separately from homozygous and heterozygous frequencies (with non-carrier frequencies as reference) and the significance of any distribution shift assessed by a Chi Square test with 1 degree of freedom. These were compared to similar calculations for the SNPs considered individually and, in the case of the DRB1 locus, to the same calculations for the actual distribution of the DRB1*1501 allele. Because the European and American data were acquired independently and in different geographic regions but were otherwise similar, these two data sets were used separately to assess the replicability of any findings. Only the American data contained both the SNP-status and the DRB1*1501 status so that only this data could be used to correlate SNP-haplotypes with a known susceptibility genotype.

Results
The results of the SNP-string analysis are presented separately for the DRB1 and the MMEL1 loci.
Of the 1857 participants in this study, 54 had missing data in the DRB1 region and, therefore, their subject-vectors could not be constructed. Of the remaining 1803 subjects, 652 (36%) had    Table 7. Association of DRB1 SNP-Strings with MS (all participants)*.   unambiguous SNP-string identifications. Also, using the final ''complete'' SNP-string list for the decomposition, 1,084 of the DRB1 SNP-strings identifications were non-conflicted, whereas 719 were conflicted. Of the conflicted identifications, however, only selected SNP-strings were involved in the conflict. Thus, for example, the (a2) SNP-string was involved in only 256 of the conflicts. All subject-vectors could be explained either as a combination of two of the 22 ''identified'' SNP-strings or as a combination of a ''novel'' SNP-string with one of the 22 ''identified'' SNP-strings. In no circumstance, was there a need to postulate two ''novel'' SNP-strings to explain any subject-vector. Thus, for the 3,606 haplotypes present in the study population, the set of ''identified'' SNP-strings set (see Table 4) was sufficient to explain 3,597 or 99.8% of the haplotypes, which are present in the population. In addition, this set was adequate to explain completely all but 20 (98.9%) of the subject-vectors. The ''complete'' set of SNP-strings was sufficient to explain 100% of the subject-vectors. In addition, the SNP-strings found at the DRB1 locus seemed to be related evolutionarily. Thus, these SNP-strings could be divided into four apparent ''families'' (Table 5), which are related in the sense both that they each share certain structural features in common and that every family member can be derived by the change of a single SNP (0 to 1 or vice versa) from some other family member (i.e., each member had a Hamming distance of 1 from some other family member). For example, family #1 all had zeros for SNPs (n2-n6) and had ones for SNPs (n7-n9). Family #2 all had zeros for SNPs (n1 & n7-n11) and a one for SNP (n2). Family #3 all had zeros for SNPs (n2, n3, n8, n9 & n11). Family #4 consisted of a single member (a13), which could have resulted from a cross-over event between SNPs (n7 & n8) of SNP-strings (a7 & a8) to produce SNP-strings (a13 & a20). If so, this event occurred within the longest stretch of untagged DNA at this locus ( Figure 1) and interconnects the DRB1 SNP-string families.
Genetic associations. The status of the major susceptibility allele (DRB1*1501) was known for the US-population. The OR of disease for having one-copy of this allele was 3.12 (p,0.0001) whereas the OR for having two-copies was 9.24 (Table 6). This is in keeping with the previously reported observation that both the control populations and the MS populations are in HWE with respect to the 1501 allele [9]. Thus, the weighting scheme for homozygous non-carriers, heterozygous carriers, and homozygous carriers of this allele, at least for the US population, is geometric (1, w, w 2 ), as it must be for the cases to be in HWE [9,38]. This type of weighting has also been referred to as co-dominant or allelic [38]. The frequency of the DRB1*1501 allele in controls was 10% (Table 2).
Only SNPs (n2, n3, n5, & n6) were positively correlated with DRB1*1501 status ( Table 6). All other correlations were negative ( Table 6). The highest correlation observed was for (n3) and was (r = 0.78). The other positive correlations ranged from 0.57 to 0.59 ( Table 6). The ORs for possessing 1 copy of these SNPs ranged from 1.72 to 2.41 whereas, for possessing 2 copies of each SNP, the ORs ranged from 2.74 to 4.08 ( Table 6). All of these associations were highly significant although the observed ORs were substantially below those for the actual DRB1*1501 status, particularly for the homozygous association (see above).
By contrast, the SNP-string analysis provided a much closer correspondence to both MS status and DRB1*1501 status ( Table 7). The OR (of disease) for possessing 1 copy of the (a2) SNP-string was 3.12 whereas for 2 copies of this SNP-string, the OR was 6.94 (p,0.0001). The allelic frequency of (a2) in controls was 12% (Table 3). The (a2) status was also highly correlated with DRB1*1501 status (r = 0.979) within the US-population where the DRB1 status was known ( Table 9). As shown in Table 8, the OR observed in the US-population for possessing 2 copies of the (a2) SNP-string was substantially larger (OR = 8.95), in keeping with the known distribution of DRB1*1501 in this population. In Europe it was somewhat less (OR = 5.68) although the difference was not statistically significant and, in the European population, the actual DRB1*1501 distribution is not known. By contrast, the disease risks for heterozygotes from Europe (OR = 3.20) and from the US (OR = 3.05) were quite similar between regions, as were the allelic frequencies in both cases and controls (Table 3).
In addition, from Table 7, it appears that SNP-strings (a1, a3, a7, & a8) may be protective. Nevertheless, this is an illusion. When the (a2) SNP-string is removed from the analysis, the apparent protective effect vanishes. Thus, the protective effect of these SNPstrings lies in the fact that carriers are less likely to also carry the (a2) SNP-string.
Closer examination of the 12 individuals who accounted for the non-perfect correlation of (a2) status with DRB1*1501 status revealed that only five (a2) SNP-string carriers (an a2-a7; an a2-a27; and three a2-a3 heterozygotes) did not also carry the 1501 allele. In addition, one individual (an a2-a21 heterozygote) was homozygous for the 1501 allele, which implies that the (a21) SNPstring can also carry this allele. However, the (a21) SNP-string is very different from the (a2) SNP-string ( Table 5). The (a21) SNPstring is from a separate family and differs from (a2) in 9 out of the 11 SNP-positions (Table 5). It is, therefore, hard to rationalize (short of invoking some double crossover event or exon-exchange) whereby this linkage would be possible. Moreover, for every other (a21)-carrier, this SNP-string either didn't harbor the 1501 allele or the individual was both an (a2-a21) and a DRB1*1501 heterozygote so that, other than in this one instance, it was never necessary to posit that an (a21) SNP-string harbored the 1501 allele. It seemed plausible, therefore, that an error had been made in the typing of this subject's DRB1 status. On this basis, the typing for this individual was rechecked and, on repeat typing this individual was found to be a (1501/0701) heterozygote. This   Table 11. Association of MMEL1 SNP-Strings with MS (all participants). change slightly improved the correlation between (a2)-status and HLA DRB1 status (r = 0.981). The remaining 6 subjects were DRB1*1501 heterozygotes, who lacked the (a2) SNP-string but included (in part) SNP-strings (a3) or (a10), which seem more plausible as 1501 carriers from an evolutionary perspective ( Table 5).
Of the 1,857 participants in this study, 10 had missing data in the MMEL1 region and, therefore, the MMEL1 subject-vectors could not be constructed. Of the remaining 1,847 participants, 536 (29%) had unambiguous SNP-string identifications. Also, using the final ''complete'' SNP-string list for the decomposition, 845 of the SNP-string identifications were non-conflicted whereas 1,002 were conflicted. Again, only selected SNP-strings were involved in each conflict. For example, the (a4) SNP-string was involved in only 271 of the conflicts. All subject-vectors could be explained either as a combination of two of the 24 ''identified'' SNP-strings or as a combination of a ''novel'' SNP-string with one of the 24 ''identified'' SNP-strings. In no circumstance, was there a need to postulate two ''novel'' SNP-strings to explain any subject-vector. Thus, for the 3,694 haplotypes present in the study population, the set of ''identified'' SNP-strings set (see Table 4) was sufficient to explain 3,660 or 99.1% of the haplotypes, which are present in the population. In addition, this set was adequate to explain completely all but 34 (98.2%) of the subject-vectors. Again, the ''complete'' set of SNP-strings was sufficient to explain 100% of the subject-vectors.
As was the case for the DRB1 locus, the MMEL1 locus also seemed to consist of related families of SNP-strings although, in this region, the families are more interconnected and, thus, the distinction of one family from another was less clear-cut (Table 5). Nevertheless, family #1 all had zeros for SNPs (n1, n2, & n5). Family #2 all had ones for SNPs (n1 & n2) and zeros for SNPs (n4 & n5). Family #3 all had zeros for SNPs (n1) and ones for SNPs (n2 & n3). Family #4 only had two members, which differed from each other only at SNP (n11).
Genetic associations. The genetic association for each of the 11 SNPs in the MMEL1 region are presented in Table 10. None of the ORs for either 1 or 2 copies of any SNP in this region were more than marginally significant (Table 10). Nevertheless, using SNP-strings, there was significant association between the (a4) SNP-string and MS, with the OR for possessing 2 copies of this allele being 2.93 (Table 11). Moreover, this was replicated in both independent subpopulations with the OR in Europe being 2.86 and the OR in the US being 2.96. By contrast, the OR for possessing 1 copy of this allele was 0.90. Presumably, therefore, this susceptibility allele acts in an autosomal recessive manner. The allelic frequency of (a4) in the control population was 11% (Table 2).

Comparisons with the SHAPEIT-2 Phasing Method
Running SHAPEIT-2 on the same 11-SNP data used by the SNP-string method and excluding subject-vectors that had missing data, the two methods were only 61% concordant in the DRB1 region and 75% concordant in the MMEL1 region. When the two methods predicted different haplotype combinations to explain a particular subject-vector, in general, the combinations chosen by SHAPEIT-2 had substantially lower likelihoods compared to those predicted by the SNP-string method. Indeed, this was even the case when estimating the haplotype frequencies based the SHAPEIT-2 predictions. In addition, to account for all of the subject-vectors, SHAPEIT-2 predicted the presence of many more ''novel'' haplotypes than the SNP-string method, despite the fact that the smaller set of haplotypes was sufficient to explain 100% of the subject-vectors. Thus, in the DRB1 region the SNP-string method invoked only 32 haplotypes, compared to 62 haplotypes using SHAPEIT-2, in order to explain 100% of the genotype (subject-vector) data. Similarly, in the MMEL1 region, SNP-string method invoked only 41 haplotypes, whereas SHAPEIT-2 invoked 57 haplotypes. In both regions, the additional haplotypes identified by SHAPEIT-2 were never unambiguously present in the subjectvector data. Finally, although a highly significant association of the (a2) haplotype with HLA DRB1*1501 status (r = 0.760) was found using the SHAPEIT-2 method, the strength of this association was significantly less (p,0.0001) compared to that found using the SNP-string method (r = 0.981).
By contrast, when the length of the SNP data included in the SHAPEIT-2 analysis was increased ten fold (to a span of 2,000 kb surrounding the 11-SNP sequences used in the above analyses) the results of the 2 methods were almost identical. Thus, the two methods were concordant in 99.4% of individuals in the DRB1 region and 99.1% of individuals in the MMEL1 region. Many of the discrepancies occurred because the one method included haplotypes, which were not on the other method's list. Also, compared to the SNP-string method, SHAPEIT-2 predicted an additional 2 haplotypes in the DRB1 region and an additional 3 haplotypes in the MMEL1 region. In most of these case there is no way to determine which method is more accurate. In the DRB1 region, for two such individuals, however, there was additional information. In both individuals, the SNP-string method predicted the (a2) allele whereas SHAPEIT-2 did not. Only one of these individuals carried the DRB1*1501 allele. Consequently, provided these individuals were typed correctly, and assuming that DRB1 status is a good surrogate for (a2) status, each method was correct only once. For the remaining discrepancies in the DRB1 region, the two methods chose combinations from the same subset of SNP-strings so that probabilistic comparisons were possible. Using the haplotype frequencies estimated from the SHAPEIT-2 output, the SNP-string method predicted combinations were more probable than the SHAPEIT-2 predictions in every case. In these circumstances, the ratios of the two probabilities ranged from 1.05 to 619 in favor of the SNP-string method.
Nevertheless, such high agreement between the two methods was not uniform throughout the 2,000 kb span of DNA but, rather, was characterized by peaks and valleys of agreement alternating throughout the span. For example, as noted above, in the MMEL1 region, when the 11-SNP sequence of subject vectors was started at SNP-position 119 (rs2234167) the agreement between the two methods was 99.1%. By contrast, when the 11-SNP sequence was begun at SNP-position 100 (rs1129333) the agreement between the two methods fell to 66.6%. Moreover, in this region there was, again, a marked disparity between the two methods in the number of haplotypes needed to explain 100% of the subject-vectors. Thus, in this region, the SNP-string method predicted only 120 haplotypes, whereas SHAPEIT-2 predicted the presence of 173 haplotypes (i.e., using either method, the variability of haplotypes in this region was 3-4 times the variability found in a DNA region only 150 kb away). Because the SNPstring method is based upon only the local 11-tuple subjectvectors, the outcome of this method depends only upon the (known) identity of the subject-vectors in the population. By contrast, because SHAPEIT-2 seemed to perform poorly when the input was limited to the 11-tuple subject-vectors of interest, it also seemed possible that SHAPEIT-2 might perform less well depending upon where (in the genome) the haplotype analysis was begun. Consequently, we ran the SHAPEIT-2 program starting at two different genomic locations. The first SHAPEIT-2 analysis began at SNP-position 1 (rs2031709); the second began at position 25 (rs6603813). The concordance of the predictions from these two SHAPEIT-2 analyses were then compared for each 11tuple across the region ( Figure 4). As can be seen from the figure, the concordance between the two SHAPEIT-2 analyses is very similar to the concordance between the two haplotype methods, being characterized by peaks and valleys of agreement alternating throughout the span of DNA (Figure 4).

Discussion
Currently, many groups throughout the world conduct GWAS studies to identify genomic regions that are associated with complex human diseases [14,15,[17][18][19][20][21]. Typically, gene chips are used to interrogate ,1,000,000 known human SNPs genome-wide in a sample population. If a SNP or several SNPs in a particular region are associated with the disease then it is presumed that some allele of a nearby gene is responsible for the observed association, thus GWAS are designed to identify genomic regions of association. The difficulty with this approach, however, is that the associations are often weak and require thousands of patients to uncover [14,15]. Moreover, at least for the DRB1 locus, each associated SNP has a much greater allelic frequency compared to the underlying susceptibility allele, which is an example of synthetic association [39]. Thus, the allelic frequency for the DRB1*1501 allele in the control population was 10%, whereas for the four most associated SNPs it ranges from 20% to 37% (Table 7). Also, unless the SNP is itself produces the associated genetic abnormality, it may difficult to determine which allele is responsible for the association. Finally, some probabilistic approaches to haplotype identification [25][26][27][28] depend upon correlations with the disease and don't really lead to identifications of the specific haplotypes, which exist at a given genetic location. Rather, these methods search the data for clusters of SNPs, which are (jointly) associated with the disease and, which, therefore, presumably belong to a particular disease-associated haplotype. They don't actually define the haplotype. For example, knowing that the four SNPs (n2, n3, n5 and n6) tag a haplotype, in this case the (a2) SNP-string, which is associated with MS, is not the same as identifying the (a2) haplotype itself and doesn't permit either testing of other haplotypic associations or comparing the genetic makeup of different populations (Table 3). Presumably, many of these potential difficulties could be mitigated if the two haplotypes at a given genetic locus could be identified confidently for each individual.
There is a high degree of confidence in the identity of those predicted haplotype combinations, which include ''identified'' SNP-strings. Thus, the large majority of these ''identified'' SNPstrings are present unambiguously and, for the others, their presence is implicated repeatedly in many different individuals. The degree confidence in the identity of those combinations that include ''novel'' SNP-strings is more tentative because, in most cases, these are present in only a few individuals, they are never identified uniquely, and, for the purposes of their selection, they are ''assigned'' a uniform (arbitrarily low) frequency. Nevertheless, because the vast majority of persons (.98%) have a combination, which includes 2 ''identified'' SNP-strings, the phasing method proposed here identifies, with a high degree of confidence (for the vast majority of the population), the 2 SNP-haplotypes, which make up a person's genotype. Thus, the SNP-string method mitigates many of the potential problems discussed earlier. First, in regions spanning approximately 200 kb of DNA, only a limited number of SNP-strings (SNP-haplotypes) seem to exist within the case-control populations from Europe and the US (Figure 3; Table 3). Indeed, as anticipated because both populations were largely of northern European origin, the frequency distribution of the different SNP-strings was almost identical in the two groups (Table 3). Moreover, the identification of the constituent SNPstring haplotype for each genotype are, for the most part, either unambiguous or the unique combination of two ''identified'' SNPstrings. Consequently, there is little doubt that these SNP-strings (so identified) represent the actual haplotypes, which cover the entire 200 kb segment and, consequently, this method should facilitate comparisons regarding the genetic make-up different human populations in specific genetic regions.
Although the analysis presented here represents only two loci, the same pattern pertained to every locus screened preliminarily (including MS-associated intergenic genomic regions). Naturally, the ''identified'' SNP-strings, likely represent sub-families of related haplotypes, which have not been separately defined. For example, only 22 SNP-strings were ''identified'' at the DRB1 locus (Table 4), whereas there are known to be hundreds of DRB1 alleles [40,41]. Some of these alleles are either very rare or only present in non-European ethnic groups [40,41] and might not be present in our sample. Others, however, likely, share the same 11tuple SNP-string. Including a larger number of SNPs in the SNPstring would permit finer distinctions to be made between alleles. However, this will also probably decrease the number of unambiguous SNP-string identifications. Similarly, reducing the length of DNA covered by an 11-tuple SNP-string will also increase the allelic separation but may also reduce the number of unambiguous identifications and the number of ''identified'' SNPstrings. Each of these changes could be either good or bad, depending on which effects predominate. Unquestionably, therefore, each of these variables will need to be studied systematically in order to determine the optimum number of SNPs and the optimum length of DNA to be included in the analysis.
Second, this SNP-string phasing method seems to overcome many of the objections to Clark's original algorithm [25][26][27]. All ''identified'' SNP-strings are either ''unambiguous'' or they are unique pairings involving a ''novel'' with previously ''identified'' SNP-string. In addition, each of the SNP-string lists here developed (i.e., unambiguous ''identified'', ''identified'', and ''complete'') and the associated haplotype frequencies are derived from repeated deconstructions of the entire dataset. Therefore, these lists (and the estimated haplotype frequencies) are independent of the order of data entry. Although the method does not guarantee the presence of ''unambiguous'' SNP-string identifications in the data, nevertheless, at every locus examined either in detail or preliminarily (including intergenic genomic regions), more than 16% of individuals (oftentimes much more) had ''unambiguous'' identifications, even when the SNP-string length was increased from 11 to 24. Also, these unambiguous identifications, when they are made, are independent of any distribution effects.
Although the SNP-string method does not directly assume that the population is in HWE, it does resolve halpotype conflicts based on the observed frequencies of the different non-conflicted SNPstrings in the population. Therefore, the method does imply the random combination of SNP-strings. Nevertheless, at least in the case of the DRB1 locus, the susceptibility allele is known to be at HWE [9]. Moreover, the fact that the frequency distribution of the different haplotypes is essentially identical in the European and US populations (Tables 2 and 3) suggests that each population is in a similar equilibrium state. However, regardless of the exact distribution, the combination of two rare ''novel'' SNP-strings is still anticipated to be less common than the combination of either two ''identified'' or one ''novel'' plus one ''identified'' SNP-string. Thus, only one subject-vector (out of approximately 25,000 screened preliminarily) could not be explained such a combination, which included, at least, one ''identified'' SNP-string. This suggests that the combination of two ''novel'' SNP strings is extremely rare, as anticipated if the population were composed of random SNP-string combinations. Also, the list of ''identified'' SNP-strings, generated by this method, is sufficient to account for more than 99% of the SNP-strings present in the population. Thus, the ''identified'' list may be short but it is also, largely, complete. Naturally, the ''complete'' SNP-string list accounts for 100% of the subject-vectors, although most of these additional SNP-strings were found in only a single individual. Because the actual phased haplotype information is not available, it is not possible to test the method directly for errors. Nevertheless, most of the identifications were either non-conflicted or, when conflicted, had one particular SNP-string pair, which was far more likely than the others. In addition, it is noteworthy that the haplotypes predicted by the SNP-string method in the DNA region studied here were virtually identical to those predicted by the SHAPEIT-2 algorithm when the span of DNA included in this latter analysis was increased 10-fold. This concordance of independent analysis methods, adds strong support for the notion that the haplotypes have been correctly identified in the vast majority of subjects. However, the agreement of the SNP-string method with SHAPEIT-2 was considerably less in other regions within the 2,000 kb span ( Figure 4). Notably, the regions of discordance between algorithms occurred in regions where SHAPEIT-2 was also substantially discordant with itself ( Figure 4). These regions also had especially high SNP-string variability and it is possible that the SHAPEIT-2 algorithm has difficulty in such circumstances.
Third, the identification of SNP-strings could be expanded to provide haplotype information over larger genomic segments. For example, suppose that the ''identified'' SNP-string combinations from two adjacent 11-tuple SNP-string combinations were known. Thus, suppose that the first combination was (01101100000) plus (01011010000); and the second was (01010000000) plus (10000011111). Suppose, further, that the combination of the overlapping 11-tuple segment, beginning at position 6 of the first 11-tuple, was (10000001010) plus (01000010000). In this case, the only possible 22 SNP-string haplotype configuration is (0110110000010000001010) plus (0101101000010000011111). Using similar (but expanded) logic, such an approach can be extended to provide haplotype information over increasingly large segments of the genome. Naturally, there is a need to optimize the number of SNPs and the length of DNA to be included in each SNP-string analysis and, no doubt, the method will require other refinements. Nevertheless, the haplotype information yielded by such a method would be reproducible and largely accurate over protracted regions of DNA. Naturally, in the future, next generation sequencing techniques (e.g., full exome sequencing) might replace some (or many) of the current phasing methods. However, even if these methods becomes readily available and can overcome their own phasing issues, specific methods might fail to identify disease-associations with intergenic regions, some of which have already been found in the GWAS published by the International Consortium [14,15].
Fourth, and most importantly, this method permits considerably more powerful tests of genetic association. The advantage of this method over the simple SNP-analysis is underscored by two examples. First within this dataset, there was no SNP-association of the MMEL1 locus with the MS disease, even for the SNP that was identified in the much larger GWAS as being highly MSassociated [14]. Moreover, although highly significant, this association only had an OR of 1.16 [14]. By contrast, the SNPstring method, using the same SNP data, was able to detect a significant and replicable association of this locus with MS and the OR was considerably higher (2.93). Second, at the DRB1 locus, although there were highly associated SNPs, the highest correla-