Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies

The availability of high-throughput genomic data has led to several challenges in recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity in statistical analyses. Tackling these problems with a marker-set study such as SNP-set analysis can be an efficient solution. To construct SNP-sets, we first propose a clustering algorithm, which employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs or SNP-sets should be clustered. A dendrogram can then be constructed based on such distance measure, and the number of clusters can be determined. With the resulting SNP-sets, we next develop an association test HDAT to examine susceptibility to the disease of interest. This proposed test assesses, based on Hamming distance, whether the similarity between a diseased and a normal individual differs from the similarity between two individuals of the same disease status. In our proposed methodology, only genotype information is needed. No inference of haplotypes is required, and SNPs under consideration do not need to locate in nearby regions. The proposed clustering algorithm and association test are illustrated with applications and simulation studies. As compared with other existing methods, the clustering algorithm is faster and better at identifying sets containing SNPs exerting a similar effect. In addition, the simulation studies demonstrated that the proposed test works well for SNP-sets containing a large proportion of neutral SNPs. Furthermore, employing the clustering algorithm before testing a large set of data improves the knowledge in confining the genetic regions for susceptible genetic markers.


Introduction
With the rapid advancements made in biotechnology, the volume and types of biological data collected have grown at an accelerated rate. Such an enormous amount of data complicates the analysis of genetic association studies, especially when dealing with complex diseases with single-marker tests, which are often hampered by the problems inherent in multiple testing and by low power [1,2]. To alleviate such problems, current approaches either change the unit of analysis to a group of markers and focus on reduction of data dimension, or select representative markers through several stages of analysis. The former philosophy of analysis is favored by many researchers, mostly because a set of markers may jointly contribute to an effect different from the additive effect of single markers. Existing multiple-marker analyses include candidate multiple-marker tests, haplotype association analysis, SNP-set analysis, and gene-set or pathway analysis [3][4][5].
Candidate multiple-marker tests and haplotype analysis are usually employed when the number of genetic markers is not large. For instance, Hotelling's T 2 statistic and linear or logistic regression models with marker genotypes can include a moderately sized group of candidate markers in one model. These analyses are sensitive to genotype or allele frequencies, and work only for markers in the same, either protective or deleterious, direction [1]. As for haplotype analysis, it can directly take into consideration the possible interplay between individual markers, such as the linkage disequilibrium (LD) between SNPs. Three common measurements for LD are D, D 0 and r 2 . The first two compare the difference between haplotype and allelic frequencies; while the third quantity takes into account the allele frequencies in such differences. Such haplotype analyses, however, may not be straightforward because of the ambiguity in the determination of haplotype phase and the uncertainty in haplotype composition. In addition, computational burden becomes increasingly difficult as the number of markers located in the region of interest increases [6,7].
To explore the association between a wide genetic region and a disease, SNP-set analysis, gene set analysis, and pathway analysis are all more flexible approaches, in the sense that they are useful when analyzing a large pre-specified region based on investigators' prior knowledge or any available biological information [8,9]. These tools therefore can model the joint effect of a group of variants that may be correlated with each other. These methods can be categorized as SNP-set analysis, because, in a broad sense, the unit of analysis in SNP-sets can be any arbitrary cluster of SNPs, located either in neighboring genomic regions or in genes from a known pathway [10].
To construct SNP-sets when no a priori knowledge is available, clustering algorithms may be utilized as an exploratory tool to integrate the information contained in a large number of SNPs. Unfortunately, most clustering algorithms are developed to summarize relationships among quantitative measurements. Only a few studies have discussed how to cluster discrete data like SNPs or to identify subsets of relevant SNPs [11,12]. These studies have applied similarity measures, such as matching coefficients, modified Pearson's coefficients, or Spearman's coefficients, to model the potential relationship between SNPs and epidemiological data. These clustering methods are applied separately on case and control groups and, thus, may result in different clusterings of SNP-sets for the two groups, leading to difficulties in interpretation. The measures such as D 0 and r 2 for linkage disequilibrium can be utilized to group SNP variables as well. These are useful when haplotype information is available and when grouping loci in regions close to each other is of interest.
Another similarity measure for categorical variables is Hamming distance [13], a distance measure commonly employed in information theory to quantify similarity or dissimilarity among discrete data. This metric measures the dissimilarity between two strings of equal length, as a symmetric kernel between two vectors, and has been widely applied [14][15][16]. However, the properties of a clustering dendrogram constructed with the Hamming distance measure have not yet been investigated. Hamming distance calculates the number of dissimilar components between two strings so that all component-wise information is retained and no single marker needs to be removed. To determine if two groups of data should be clustered, Zhang et al. [16] tested if these two groups are equivalent based on the empirical distributions of Hamming distance. Their method was found to be useful as long as data were of a size moderate enough that convergence was be achieved in a reasonable amount of time. Another common algorithm, the k-mode, also tends to require a long running time and is sensitive to initial values [15,17]. In cases where clustering long regions of SNPs is of interest, the two issues of deciding when to merge SNP-sets and what procedure to use to compute distance remain critical.
For association tests, Hamming distance was previously considered as a symmetric kernel in tests proposed by Pinheiro et al. [18] and Wei et al. [19] in a test involving decomposition of the similarity measure between subjects. Pinheiro et al. [18] used the between-group component to define a test statistic and defined their Hamming distance as the proportion of discordant SNPs. Their test statistic is insensitive in detecting differences between groups since the value of the test statistic is often too small to achieve significance by bootstrap re-sampling. Wei et al. [19] used a weighted Hamming distance as the kernel and a ratio of a between-group statistic to a within-group statistic as the test quantity. They improved the nonparametric test proposed by Schaid et al. [20] by taking into account the dissimilarity between individuals from different treatment groups. Their between group statistic was the between group dissimilarity adjusted by a scalar of the within-group dissimilarity. This modification may not be compatible with heterogeneity between subjects of different traits, and thus its use may deserve some reservations. Furthermore, they considered the weight as the negative logarithm of the p-value from a single marker test, which may not be suitable when sample size is large or when minor allele frequency is small.
In regression models, other statistical tests based on different similarity measures have been proposed for use in association studies. Tzeng et al. [21,22] proposed a gene-trait similarity regression to detect associations between genetic variants and disease traits. They first derived a weighted cross product of trait residuals, and then regressed this product on genetic similarity. For a binary disease trait with no covariate information available, this is like comparing the average genetic similarity within a specific group, say the affected, with an average of the genetic similarity within the control group and the genetic similarity across groups. That is, the latter average considers both within-group and cross-group similarity. This latter average considers both the within and cross group similarity. Alternatively, Wessel and Schork [22] proposed regressing a genetic dis-similarity matrix of squared "distance" on a design matrix of phenotypes such as gene expressions to test for associations. They considered seven different distance metrics, including ibs and haplotype similarity, with various weights. For dichotomous phenotypes, this approach compares the average dissimilarity (the distance) within the affected group with the average dissimilarity within the normal group. It does not consider the cross-group dissimilarity. For case-control studies, the dissimilarity between a pair of individuals with different disease traits may exhibit larger deviation and thus reveal greater information about disease susceptibility, especially when evaluated on a causal set of markers. Therefore, such cross-group dissimilarity should be included in analysis, and should be considered separately from the within group dissimilarity.
The aim of our study is to develop a methodology of utilizing the Hamming distance metric to measure the distance between two sets of vectors containing discrete observations, in order to first perform clustering and then to use this clustering to conduct association studies. In the clustering stage, we modify the hierarchical clustering algorithm with average linkage for continuous data and develop a Hamming distance-based algorithm to determine SNP-sets. The procedure we develop is based on the rational that the more individuals carrying the same genotype with respect to two given SNPs (or two SNP-sets), the more similar these two SNPs (or sets) should be considered, which is exactly what the Hamming distance does by assigning them a smaller value. This Hamming distance dissimilarity measure can be considered as a function of ibs, or a special case of rare-allele-weighting similarity proposed by Wessel and Schork [23]. We also extend this idea to evaluate whether two SNP-sets should be considered close enough to be clustered together. In the second stage of our study, in which we investigate whether a given SNP-set obtained in the clustering stage is associated with the disease of interest, we propose a test statistic that compares the "distance" (or difference) between individuals with regard to their SNP profile for the given SNP-set. If the SNP-set is not associated with the disease, then the expected distance between a case and a normal subject should be the same as that between two individuals of the same disease status. Therefore, the focus of the second stage is on analyzing the difference between subjects in the disease and non-disease groups rather than the difference between SNPs. In contrast to the tests used by Pinheiro et al. [18] and Wei et al. [19], our proposed method does not depend on ANOVA decomposition and, instead, we simply compare the "distance" within groups and the "distance" across groups. The spirit of our test is similar to earlier tests in the use of similarity measures such as the "allelematch" kernel [18][19][20][21][22][23], but differs in its focus on comparing within-group and cross-group dissimilarity. Our statistic thus differs from those previously proposed. In addition, we place more emphasis on our statistic's operational characteristics in a variety of settings. We also assess the performance of the proposed method under different settings of sample size, case-to-control ratio, and noise-to-signal ratio (the proportion of the number of causal SNPs to the number of non-disease-related SNPs). Applications are illustrated and simulation studies are conducted to evaluate the performance of the clustering algorithm and the association test.

Clustering SNP-sets with a Similarity Measure
Let C 1 and C 2 be the two SNP-sets under study, where the sets contain K 1 and K 2 SNPs, respectively. For each SNP indexed k, the vector S k denotes the genotypes of N individuals. In other words, S k ¼ ðS 1k ; S 2k ; . . . ; S Nk Þ t is a SNP genotype vector of length N, where S ik is the genotype of the i-th individual at the kth SNP and k = 1, 2, . . ., K (K = K 1 + K 2 ). All genotypes of the K 1 SNPs in cluster C 1 can be expressed in the form of an N × K 1 matrix M 1 with elements S ik . Similarly, the genotype matrix M 2 for cluster C 2 is of dimension N × K 2 . Fig 1 shows a schematic diagram of these two clusters.
Distance (dis-similarity) between SNP-sets. We define d(C 1 , C 2 ) the dissimilarity between two SNP-sets C 1 and C 2 as the average dissimilarity of all possible pairs S l and S m from C 1 and C 2 , respectively, the number of individuals whose genotypes on SNP l and SNP m differ, The subscript H stands for Hamming distance. Note that some may prefer to compute the proportion, instead of the counts, of discordant components. These two definitions however are equivalent. This similarity measure is indicated in Fig 1 as well.
The quantity d(C 1 , C 2 ) measures the average "dissimilarity" between all possible pairs of SNPs with one from each set. A large value indicates a greater degree of dissimilarity between these two sets, whereas a small value implies "resemblance" between two sets and may lead to an action of clustering. In matrix terminology, d(C 1 , C 2 ) evaluates the average distance between a column S l in M 1 and a column S m in M 2 .
Merging Procedure, Handling Ties and Dendrogram Construction. When there are J clusters, {C 1 , C 2 ,. . ., C J }, the distances d(C i , C j ) of all possible combinatorial selections of paired clusters (C J 2 ¼ J Â ðJ À 1Þ=2 pairs in this case) will be computed and the pair with the smallest distance will be merged into a new SNP-set, leading to J−1 clusters.
In case of a tie, the following rules are applied. If more than one pair shares the same minimum distance and these pairs involve non-overlapping sets, then each pair will be combined into a new set. When overlapping sets exist, the grouping strategy will depend on the way they overlap. For instance, if more than a pair from C A , C B and C C show the same minimum distance, then a random pair, say (C A , C B ), will be combined, leaving C C to be grouped in the next stage. However, if C A has the same minimum distance with each of the other two distinct clusters C B and C C , but C B and C C are not close to each other, then a union, either C A [C B or C A [C C , will be selected at random to form a new cluster. This procedure guarantees that the number of clusters will be reduced by at least 1 in each stage. Therefore, if the clustering procedure starts with K SNPs, then this proposed procedure will proceed until all K SNPs are grouped into one final cluster to complete an agglomerative hierarchical clustering dendrogram.
Determination of Clusters. If J is the pre-specified number of clusters that investigators believe to exist for all SNPs under study, then these clusters can be identified by examining the dendrogram in a top-down order. When no knowledge about J is available, we trace for each SNP the clusters it visits sequentially and allocate the SNP to a cluster that shows the largest degree of dissimilarity with others. The steps of the procedure are as follows: 1. Trace History of Clusters: For each column vector SNP S k , trace all n k clusters in the dendrogram that the SNP ever belonged to, and order these clusters from the smallest to largest fC kð1Þ ; C kð2Þ ; . . . ; 2. Calculate Node Heights in Dendrogram: Calculate for S k the "height" H k(i) of each node of its corresponding clusters, H kðn k À1Þ ¼ dðC kðn k À1Þ ; C kðn k Þ À C kðn k À1Þ Þ In the dendrogram, H k(i) is the vertical length of a new node measured from the bottom, when the cluster C k(i) becomes C k(i+1) . Therefore, H kð1Þ < H kð2Þ < Á Á Á < H kðn k À1Þ for SNP S k . This quantity indeed measures the "distance" between two sets. Note that C k(i+1) − C k(i) denotes the difference in SNPs between sets C k(i+1) and 3. Assign Cluster: Compute the maximum "relative height" between successive nodes, where H k(i) is the node height as defined above. This "relative height" is indeed the vertical distance between nodes. If D(S k ) = H k(i) − H k(i−1) for some integer i, then C k(i) is the cluster selected for SNP S k . The result of the procedure is that every SNP will be assigned to a cluster where the size of each cluster may vary, and the number of clusters is now determined. The resulting clusters can then be used in the next stage, i.e. the association test. A hypothetical data set and dendrogram presented in additional file (S1 Text) illustrate calculation of the measures H k(i) and D (S k ).
If one prefers to work on a smaller number of clusters containing, for instance, only 5% of the SNPs for further investigation, then these SNPs and corresponding clusters can be retrieved by selecting SNPs whose corresponding D(S k ) are greater than the 95% quantile of {D(S 1 ), D (S 2 ),. . ., D(S k )}. The resulting number of clusters, say J, for these SNPs can then be used. Alternatively, one can utilize gap statistics or similar quantities [24][25][26] to determine the number of clusters. These statistics have been developed to determine the optimal number of clusters from continuous observations [27].

Association Tests with SNP-sets
Testing whether a SNP-set is associated with a disease of interest is equivalent to testing whether the set can distinguish different disease phenotypes. In other words, the unit of analysis now becomes the subjects, and the Hamming distance between paired subjects across all markers is to be evaluated. That is, the focus now is on the difference with respect to all markers between two paired individuals. If the SNP-set is independent of the disease, then any two individuals should carry similar markers in this SNP-set, regardless of disease status. The expected difference in genetic markers between a case and a control should be similar to that between any two cases or between any two control subjects. In contrast, if the set is composed of susceptible SNPs, then the expected difference between a case and a control would be large, at least larger than that between any two cases or between any two controls. Here the dissimilarity is evaluated by the heterogeneity across all SNPs in this set for any two individuals, and Hamming distance is the metric considered. Notice that now the dissimilarity is evaluated between subjects' SNP strings (the row vectors in Fig 1), not between SNP column vectors.
Notation and Statistics. Let C be the SNP-set of interest where C contains K SNPs. Let x 1 ; x 2 ; . . . ; x N 1 denote the genotype row vectors of the N 1 subjects in the control (normal) group and y 1 ; y 2 ; . . . ; y N 2 denote those of the N 2 subjects in the case (disease) group, where both x i and y i are of the form (S i1 , S i2 ,. . .,S iK ) with the same S ik defined earlier. Then, the SNP genotype matrix R 1 for the normal group is of dimension N 1 ×K, the matrix R 2 for the case group is of dimension N 2 ×K, and the distance is measured between any two row vectors. Note that the stacks of R 1 and R 2 column-wise are the same as the stacks of M 1 and M 2 row-wise when the same SNP-sets are involved, The distance between the case and the control group is defined as the average dissimilarity between these two groups, The function h transforms the original genotype S ik into a binary variable h(S ik ) = s ik = 1 if the subject i carries the minor allele at the k-th SNP, and 0 if not. Now the Hamming distance d H ðhðx i Þ; hðy j ÞÞ ¼ Iðs ik 6 ¼ s jk Þ computes the number of SNPs for which one of the paired individuals i and j carries the minor allele while the other does not. Note that the incorporation of minor alleles in the evaluation of distance follows the rationale discussed in Wessel and Schork [23] that two individuals sharing minor alleles may reveal more similarity in their genomes than do two individuals sharing common alleles. Other functional forms can certainly be assumed for h if some other relationship is preferred.
The average dissimilarity within groups is : Note that a T substantially larger than U implies a substantial difference in SNP-set configuration between the case and the control group. In other words, the difference T−U may provide evidence of disease association. The p-value for this test statistic T−U can be determined via a permutation test.
Clustering with Case and Control Subjects. One issue needs to be addressed when the association test adopts the SNP-sets obtained from the proposed Hamming distance-based clustering algorithm. To identify influential SNP-sets, the clustering algorithm for case-control studies should be carried out on both the case and control groups together, rather than on a single group. Taking the first four SNPs in Table 1 for example, two clusters C 1 = {S 1 , S 2 } and C 2 = {S 3 , S 4 } would be constructed based on 10 individuals (5 cases and 5 controls) since d(S 1 , . This is because the Hamming distance dissimilarity algorithm groups together those SNPs which exhibit a similar pattern in both case and control groups. This includes the neutral SNPs S 1 and S 2 which display an identical pattern (1,1) in all 10 subjects. These two SNPs provide no information of relevance to disease association. In contrast, the other SNP-set {S 3 , S 4 } contains more information, because all cases carry (0,0) for {S 3 , S 4 } while all controls carry (1,1). The pattern in the case group alone is the same as that in the control. In other words, among these two clusters {S 1 , S 2 } and {S 3 , S 4 }, it is the latter that provides information for association. In addition, the latter set is identified only when case and control subjects are all included in the clustering algorithm. If only the control individuals are used for clustering, the cluster {S 1 ,S 2 ,S 3 ,S 4 } would be identified with neutral SNPs included.

Simulation Studies of SNP Clustering
To evaluate the proposed SNP clustering algorithm, we first simulated correlated SNPs and independent SNPs, mixed them, and then examined whether these correlated SNPs would be clustered together by the proposed algorithm. The correlated SNPs were generated with a Under each setting, we performed the clustering algorithm and examined if the 10 correlated SNPs were grouped together. With pairwise correlation set at 0.55, the 10 correlated SNPs always formed a unique cluster under the mixing ratio of 1:1 and 1:10. When the correlation was 0.3, the percentage of this correct clustering pattern was 97.7% and 100% for the mixing ratios of 1:1 and 1:10, respectively. As for the remaining 23 (1000-977 = 23) replications, the 10 correlated SNPs were still clustered together but simultaneously mixed with an average of 3 ± 2 independent SNPs in the same cluster. Fig 2 displays the dendrograms of four typical replications.

Simulation Studies of the SNP Association Test
To demonstrate the performance of the proposed Hamming distance-based SNP-set association test, several conditions were considered in simulation studies, including deleterious, protective, or mixed effects; and various signal-to-noise ratios. We considered a protective SNPset containing six SNPs with odds ratios (OR) all less than 1, a deleterious SNP-set of six SNPs with OR>1, and a mixed set containing three protective and three deleterious SNPs. The values of the ORs and MAFs in the control subjects are listed in Table 2. In addition to the six causal variants (signals), other non-associated SNPs with MAFs in the range (0.05, 0.5) were also generated as neutral variants (noise). The number of neutral SNPs ranged between 6 and 90, leading to signal-to-noise ratios between 1:1 and 1:15. The numbers of cases and controls were either 100 each, or 200 each. The coding for each individual was 1 or 0 for carrying the minor allele or not. Each simulation setting contained 1000 replications with p-values derived from 1000 permutation tests.  [19] and SKAT [29]. In general, all three tests had small type I errors at the nominal level. The weighted U-statistic had a very large type I error in comparison with the unweighted U-statistic, thus we considered only the latter in the following simulation comparisons. HDAT had larger power than the U-statistic and SKAT. The power remained above 80%, even when the signal-to-noise ratio reached 1:15, implying a consistent and reliable performance. The U-statistic and SKAT are more sensitive to the signalto-noise ratio, especially when the number of individuals is not large, only 100 cases and 100 controls (Fig 3F).

Coronary Artery Disease from WTCCC
The data, obtained from a Wellcome Trust Case Control Consortium (WTCCC) study [30], included 1926 subjects with coronary artery disease (CAD) and 2938 controls. Previous studies have shown a strong association between 9p21.3 and CAD [30][31][32], hence the 940 SNPs (with minor allele frequency > 0.001) on the 9p21.3 region were considered as an illustration for the Hamming distance-based clustering algorithm and association test.
First, our proposed clustering algorithm derived 123 clusters containing at least three SNPs, and 49 clusters containing two SNPs. No cluster was composed of a single SNP. The    Table 3. The dendrogram of these 11 SNP-sets is reproduced in Fig 4 with different colors. Note that most clusters contain SNPs with effects of the same direction, either protective or deleterious, as indicated in Table 3, except the sixth and the ninth set, which contain different but weak effects. After performing a Bonferroni correction for multiple tests, the first set with four deleterious SNPs remains significant (p<0.0005 in 5000 permutations). Alternatively, selecting the top 5, or top 10%, relative heights D(S k ) to determine the clusters for the subsequent association test results in the same SNP-set being significant. This significant cluster contains 4 SNPs, rs7865618, rs523096, rs564398, and rs518394, all in the MTAP gene, which has been reported to be involved in homocysteine metabolism [30][31][32]. This SNP-set spans 11.9kb and includes only protective SNPs. Their minor alleles all carry significant protective effects (p-values for single marker tests all < 1e-5). In addition, these variants have been investigated in the laboratory and found to be significantly associated with disease in other GWAS [31].
For the 172 (= 123+49) clusters identified by the Hamming distance based clustering algorithm, we also applied Wei et al.'s [19], SKAT [28], and HDAT with 1000 permutations. Only three sets reached significance with all three association tests. These three sets were the top three SNP-sets identified by HDAT, also listed in the top of Table 3. Because Wei et al.'s method is time consuming, here we only performed 1000 permutations in the test of each SNP-set. This number is not large enough to carry out a Bonferroni correction for this CAD study. The purpose is simply to demonstrate the similar findings reached under the three tests.
We also applied the k-mode [15] and Zhang et al.'s [16] method to cluster these 940 SNPs. It took 5.88 hours for the k-mode to perform clustering, but only 26.5 minutes for our Hamming distance clustering algorithm to complete the computation, while Zhang's method failed to converge because the data size was too large. The first row in Table 4 lists the respective computational times.
HDAT on LD Blocks. Alternatively, with Haploview, we constructed 145 LD blocks containing at least two SNPs based on the same 940 SNPs. These blocks were determined based on the correlation among neighboring loci; where SNPs of deleterious and protective effects can both appear in the same block. All 145 blocks were tested with HDAT. S1 Table lists ten blocks with smaller p-values. These blocks were significant with p<0.05 but none reached significance after Bonferroni correction. Two SNPs (rs523096, rs518394) in the first LD block also appeared in the first SNP-set of the Hamming distance clusters in Table 3.
Hamming Distance Clusters and LD Blocks. For reference, we displayed in S2 Fig the linkage disequilibrium measure r 2 for the Hamming distance clusters. The 4 SNPs in our first Hamming distance cluster, though highly correlated with each other (D 0 >0.96, r 2 >0.87), were separated into two different LD blocks, due to another SNP (rs10758264) locating between them and its weak LD with these 4 SNPs. Another example for the difference between LD blocks and Hamming distance clusters is the SNP-set 6 in Hamming distance clusters. Inside this cluster, SNP rs10966811 is not in close linkage with other SNPs (r 2 in 0.21~0.46, D 0 in 0.51~0.83), and thus not contained in the LD block 2 (The LD block numbered 2 in S1 Table). Only 4 close SNPs (rs1461325, rs3126950, rs4366163, rs3118240) from this Hamming distance cluster are included in the LD block 2. In addition, the spans of these clustered SNP-sets can vary greatly. For instance, the SNP-set 11 ranges 3.4 kilobase pairs; while the SNP-set 6 ranges 183.6 kb. These sets may serve as a complementary option or reference if one decides to combine or prune LD blocks.
In contrast to the test with all SNP-sets, we applied the gap statistic to determine the number K of clusters under every clustering algorithm discussed above. However, based on the gap statistic, no best value for K was selected in the range of 1 to 250. Again, such large data size induces extensive computation and hence cannot be accommodated by most traditional methods.

Simulation Studies for Combined Procedures
The statistical significance of any association test on a large SNP-set can only imply a possible relationship between this whole set and the disease of interest. A finer range or locations of the susceptible genes may still remain unknown. With the Hamming distance clustering algorithm, a larger set can be divided into several smaller SNP-sets whose significance can provide more information as where these susceptible genetic markers may locate. In this simulation, we investigated if the performance of any of the association tests (HDAT, U statistic, and SKAT) can be improved by testing on the Hamming distance clusters.
The same procedures for generating the genotypes in the simulation studies for SNP association tests were adopted. The numbers of cases and controls were 100 each, and the number of replications was 1000. In each replication, the beginning SNP-set containing 6 signal (risk) SNPs and 90 neutral SNPs was constructed and tested with every one of the three tests. The powers were 77%, 63%, and 67%, respectively. We refer to these as the overall powers.
For the purpose of comparison, each one of the large sets underwent the Hamming distance clustering algorithm first, followed by each of the three tests on every one of the resulting clustered SNP-sets. We then examined if any clustered SNP-sets containing signal SNPs were successfully detected with statistical significance, as well as the number of signal SNPs detected in each replication. The solid lines in Fig 5 show that, for any of the three association tests, carrying out the test on Hamming distance clustered sets can reach a power larger than 90% (95% for HDAT, 93% for U statistic, and 96% for SKAT) in identifying at least one susceptible SNP marker, and larger than 70% (82% for HDAT, 73% for U statistic, and 80% for SKAT) in identifying at least three markers. If the overall power was taken as the baseline (dotted line in Fig  5), then testing on the clustered sets can provide more information on genomic range of at least 4 SNPs for each one the three tests.
Note also that, combining the clustering and HDAT (red solid line) has a slightly better performance than combining the clustering and SKAT (green solid line). The Hamming clustering algorithm improved SKAT more (77% vs. 0.82% for HDAT and 67% vs. 80% for SKAT), because SKAT is sensitive to the ratio of noise to signal SNPs (shown in earlier section and Fig 3). The type I errors for the three combination tests were all reasonable (5.2%, 5.5%, and 5.0%, respectively). Table 4. Numbers are the run time (average and standard deviation of 10 repetitions) under the Hamming distance-based clustering algorithm (HD), k-mode, and Zhang's method for the three applications. For the first three applications, the run time was assessed with single-threaded computation; while the time for the last application was under parallel computation and single-threaded computation in R. This experiment implies, in addition to the better performance of HDAT over the other two tests, that the Hamming distance clustering algorithm can help to identify smaller SNP-sets for association tests like U statistic and SKAT, and thus improves the ability to confine or localize genetic regions for future studies.

Discussion
The proposed SNP clustering algorithm based on the Hamming distance dissimilarity measure not only works faster than current existing methods, it is also free from the constraint that SNP-sets are formed by neighboring SNPs. This algorithm can also be used to determine whether several given SNP-sets should be clustered together. Additionally, this clustering algorithm can separate protective from deleterious genetic variants, resulting in SNP-sets containing components with effects of the same direction. The ability to produce such sets would be useful as some association tests work only for SNPs with effects of the same direction. When illustrating this algorithm, we started with original SNP genotypes to gradually form SNP-sets of various sizes. In other words, this proposed procedure can be employed directly on SNP-sets that are pre-determined based on certain functional characteristics. For instance, if different known relationship, such as inhibition or activation, between members is to be specified in this clustering procedures, then the coding should be modified accordingly. One example would be to replace {0, 1, 2} with {1, 2, 3} for tumor suppressor genes (or protective loci) and {-1, -2, -3} for oncogenic markers.
Other applications of this algorithm are the construction of population structures and clustering of multi-categorical variables.

Construction of Population Structures
The Hamming distance-based clustering algorithm can be used as a tool to elucidate the underlying population structure from genomic SNP data. We illustrate this property with data from HapMap ENCODE database [33][34] containing a total of 7538 SNP genotypes from 269 subjects (90 Yorba, 44 Japanese, 45 Han Chinese, and 90 CEU Utah residents with northern and western European ancestry). The clustering procedure was applied on the 7538×269 SNP matrix, and the column variables (individuals) were clustered. Fig 6 displays the results where different colors represent different populations. Three clusters were identified, one for CEU, one for Japanese and Chinese together, and one for Yorba. These three clusters are fairly distinct from each other. However, these SNPs cannot clearly differentiate Japanese from Chinese. The heatmap of the corresponding Hamming dis-similarity matrix is displayed in Fig 7. The pattern between groups does differ from the pattern within any group (squares in diagonal direction in Fig 7). The proposed clustering algorithm took 43 seconds to construct three clusters; while the k-mode took 23 minutes. For Zhang's method, the size was again too large to compute (computation times in the second row of Table 4).

Accommodation of Multi-categorical Variables
In previous sections, when it is of interest to cluster SNPs or SNP-sets, we used the same coding system (1/0 coding indicating with and without minor alleles, or 0/1/2 for the number of minor alleles) for all SNPs. However, if this clustering algorithm is used to cluster variables of different levels, then the Hamming distance-based algorithm can be extended easily. In other words, the algorithm can handle variables with different numbers of categories. The following soybean data include 47 individual soybeans, each being 1 of 4 classes, and 35 variables (attributes) related to the appearance, growing environment and date of bloom of each bean, with the number of levels for each variable ranging from two to seven [35]. Using the original levels for each variable, the clustering algorithm successfully identified the four actual classes of clusters ( Fig 8A). In contrast, two beans would be mis-clustered when the binary coding (0 for the first level and 1 for the rest) was used (Fig 8B).

Computational Burden
In previous sections we have proposed a clustering algorithm and an association test, both based on Hamming distance measure. The clustering algorithm focuses on the similarity between SNP vectors (the column vector S k ), where each vector contains the genotypes of a specific SNP for all subjects. The test, in contrast, examines the similarity between subject vectors (the row vector), where each vector denotes the genotypes of all SNPs for a specific individual. The computational burden, therefore, depends on the number of variables to be compared for similarity; it is the number of SNPs in the case of the clustering algorithm and the number of subjects in the case of the test. The order of the computational complexity is discussed in S2 Text. Here we examined first the run time for Hamming distance clustering algorithm. In Table 4, it is clear that clustering 940 SNPs took longer (26.5 minutes) than clustering the ENCODE 269 subjects (43 seconds). If the number of variables to be clustered increases to 4000, such as the case of clustering 4000 SNPs from 4864 subjects (the last row in Table 4), then the computation takes 52.45 hours for single-threaded computation in an ordinary desktop computer with Intel Core i7-4770 processor (3.40 GHz) and 32 GB RAM. With such volume of data, we then carried out 7-CPU parallel computation with packages doSNOW and foreach in R, leading to 18.27 hours (Table 4). Next, we investigated how sensitive the run time for HDAT is as we change the number of subjects to be compared and the number of permutations in the test. For illustration, we selected randomly 1000 subjects from the WTCCC study, along with their first 5000 SNP genotypes in 9p or all SNPs on chromosome 9, and next performed HDAT with either 1000 or 5000 permutations with parallel computation. Table 5 lists the corresponding run time under each scenario. When all 19948 SNPs on chromosome 9 were included, it took 121.56 seconds for 1000 subjects and 72 minutes for 4864 subjects to complete the test, with the number of permutations set at 1000. The increase in the number of subjects has a greater influence on the computational burden. When handling data of a larger scale like GWAS, either performing parallel computations, adopting C-alike programming languages, or considering graphics processing unit (GPU) computing will surely enhance substantially the computational performance.

Extension to Rare Variants
In addition to the common causal variants, this test may be extended to rare causal variants. When dealing with rare variants, the small frequency of individuals carrying the minor allele greatly impacts the difference between the two statistics T and U. Taking a deleterious causal rare variant for example, since it is rare only a small proportion of individuals in the case group will carry this variant, making the value of the Hamming distance measure U cs in the case group similar to T. To increase the statistical power in this case, inclusion of a larger group of controls would correspond to a within-group Hamming distance U cn that would balance U cs , making U (the weighted combination of U cs and U cn ) move away from T. Fortunately, when the allele represents risk, it becomes easier to find more controls than cases. Such recruitment would not be hard. The influence of the increase in control samples can be observed in the difference in magnitude between the blue lines in Fig 9A and 9B. Similarly, when the causal rare variant is protective, then only a small proportion of subjects in the control group will carry this variant, leading U cn and T to share a similar magnitude. Therefore, following the same argument above, a larger sample size in the case group will enlarge the difference between U and T. This effect on T−U when relatively more cases are involved in the study can be seen by  Testing and Clustering of SNP-Sets comparing the red lines in Fig 9A and 9C. Our team is currently working on this topic with modifications on variant-specific weights.

Conclusions
As discussed in previous sections, our clustering algorithm saved more computational time than existing methods. For large data size, our algorithm easily outperforms the others (first two rows in Table 4). Even for data of smaller size like the soybean data, our algorithm took only 0.1 second, while k-mode took 1 second and Zhang's took 2 seconds (the third row in Table 4). Another advantage of our proposed clustering algorithm is that applying the Hamming distance-based algorithm with both control and case observations before conducting an association test can produce SNP-sets containing SNPs with effects of the same direction. The homogeneity of SNP effects may warrant further biological validation studies, and will ensure statistical power in certain association analyses [36].
Our second proposal, the association test HDAT, can accommodate a large number of SNPs in one test, while allowing for SNPs in the same set to share the same effect or to have mixed effects. If this test is to be used on a pre-defined SNP-set in which the effects may be mixed, this association test can outperform other tests, especially when the majority of SNPs are neutral. When sample size reaches 400, all three tests perform equally well (S3 Fig), regardless of the noise-to-signal ratio. This HDAT can be used on other previously defined SNP sets. For instance, this test can be applied on haplotype blocks that have been constructed based on D 0 or r 2 from linkage disequilibrium analysis. In other words, the test can be employed along with other clustering algorithms. The R code for clustering SNP-sets and testing for disease association is available at http://homepage.ntu.edu.tw/~ckhsiao/HammingDistance/HD.htm.
This research was concerned mainly with common variants. Certain issues remain to be resolved. First, when rare variants are of major interest, we have recommended to include different numbers of cases and controls for the association test. Much of the detailed allocations and adjustment for minor allele frequencies are currently under study. Second, if admixture populations exist in association studies, it is not known whether one should cluster the subjects before or after the association test. Intuitively, handling the population stratification first before conducting HDAT on the case and control groups of the same population should be recommended. However, the stratified populations to be tested may be of smaller sizes, resulting in unstable findings. A balance between the problem of population stratifications and the choice of sample size for statistical power should be evaluated. Third, the association test considered in this article only deals with genetic markers. No covariates have been discussed. If the covariates are available and can be transformed into categorical variables, then they can be considered Table 5. Numbers are the run time (in seconds, s, or minutes, m) of HDAT, using parallel computation in R, under different numbers of subjects, SNPs, and permutations.
as pseudo markers and their similarity can be measured with Hamming distance. Such treatment, however, may not work on continuous explanatory variables. Our next research will extend the Hamming distance measure to covariates, so that effects of the genetic markers as well as other environmental factors can be examined via regression models.