Fig 1.
Schematic diagram of SNP clusters.
Table 1.
Toy data for SNP clustering. For illustration, genotypes are coded as binary here.
Fig 2.
Hierarchical clustering dendrograms based on one simulated dataset.
The data contained 10 correlated SNPs (S1, S2, …, S10) and 10 or 100 independent SNPs (N1, …, N10 or N1, …, N100). (A) Ten correlated SNPs with correlation of 0.55 and ten independent SNPs. (B) Ten correlated SNPs with 0.55 correlation and 100 independent SNPs. (C) Ten correlated SNPs with 0.3 correlation and ten independent SNPs. (D) Ten correlated SNPs with 0.3 correlation and 100 independent SNPs.
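The idea behind these dendrograms can be sketched with a small example: SNPs are compared by Hamming distance (the proportion of subjects at which two binary-coded SNPs differ), and the closest groups are merged step by step. The sketch below is only illustrative. The toy genotypes, the names `hamming` and `average_linkage`, the use of average linkage, and the cutoff of 0.3 are assumptions made here, not the settings used to produce the figure.

```python
from itertools import combinations


def hamming(u, v):
    """Proportion of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def average_linkage(items, cutoff):
    """Agglomerative clustering with average linkage (an assumed choice).

    items  : dict mapping SNP name -> binary genotype vector over subjects
    cutoff : stop merging once the closest pair of clusters is farther apart
    Returns a sorted list of clusters, each a sorted list of SNP names.
    """
    clusters = [[name] for name in items]

    def dist(c1, c2):
        # Mean pairwise Hamming distance between members of two clusters.
        total = sum(hamming(items[a], items[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        if dist(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: 8 subjects, 3 similar SNPs (S*) and 2 unrelated SNPs (N*).
snps = {
    "S1": [1, 1, 1, 1, 0, 0, 0, 0],
    "S2": [1, 1, 1, 0, 0, 0, 0, 0],
    "S3": [1, 1, 1, 1, 1, 0, 0, 0],
    "N1": [0, 1, 0, 1, 0, 1, 0, 1],
    "N2": [0, 0, 1, 1, 0, 0, 1, 1],
}

clusters = average_linkage(snps, cutoff=0.3)
print(clusters)
```

With this toy input, the three similar SNPs end up in one cluster while N1 and N2 remain singletons, which mirrors the separation of correlated and independent SNPs shown in the dendrograms.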
Table 2.
Values of ORs and MAFs in simulation studies.
Fig 3.
Type I error and power in simulations.
Type I error (A, C, E) and power (B, D, F) of the Hamming distance-based association test (HDAT; red line), the U-statistic (blue line), and SKAT (green line) for the SNP-set association test under different noise-to-signal ratios and effect sizes. The X-axis shows the number of neutral SNPs in (A), (C), and (E), and the noise-to-signal ratio in (B), (D), and (F). The effects of causal SNPs are deleterious in (A) and (B), protective in (C) and (D), and a mixture of both in (E) and (F). The simulation included 100 cases and 100 controls.
Fig 4.
Dendrogram for the 11 selected SNP-sets.
Labels indicate the SNP markers and different colors indicate different SNP-sets.
Table 3.
The composition and size of the top 11 SNP-sets with the smallest p-values.
SNPs in boldface showed a protective effect in the single-marker test (OR<1); the rest showed a deleterious effect (OR>1).
Table 4.
Numbers are the run times (average and standard deviation over 10 repetitions) of the Hamming distance-based clustering algorithm (HD), k-mode, and Zhang’s method for the three applications.
For the first three applications, the run time was assessed with single-threaded computation, while the time for the last application was assessed under both parallel and single-threaded computation in R.
Fig 5.
Power calculation for combining the Hamming distance clustering algorithm with the association tests.
Solid lines show clustering+HDAT (red), clustering+U-statistic (blue), and clustering+SKAT (green); dotted lines show the tests on the original complete set (termed the overall power).
Fig 6.
Dendrogram for the HapMap ENCODE database.
Different colors represent different populations: red for CEU, blue for Japanese, green for Chinese, and purple for Yoruba.
Fig 7.
Heatmap of the corresponding Hamming dissimilarity matrix for the HapMap ENCODE database.
Different colors represent different populations: red for CEU, blue for Japanese, green for Chinese, and purple for Yoruba.
Fig 8.
Dendrogram for the soybean data.
The four colors represent the four true classes. (A) The original categories are used as the coding for each variable. (B) Binary coding is used for each variable.
Table 5.
Numbers are the run times (in seconds, s, or minutes, m) of HDAT, using parallel computation in R, under different numbers of subjects, SNPs, and permutations.
Fig 9.
Expected values of the test statistic.
Expected values of the test statistic under different MAFs in controls, ORs, and case-to-control ratios, based on a single SNP. (A) Case-to-control ratio is 1:1 (1000:1000). (B) 1:1.5 (1000:1500). (C) 1.5:1 (1500:1000). Red lines are for protective SNPs (solid for OR = 0.5, dashed for OR = 0.8) and blue lines for deleterious SNPs (solid for OR = 2, dashed for OR = 1.25).
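To see how the control MAF and the OR jointly determine the allele frequency in cases, the following minimal sketch may help. It assumes an allelic odds ratio, so that the odds of the minor allele in cases equal OR times the odds in controls; the function name `case_maf` is hypothetical, and the model actually used for this figure may differ.

```python
def case_maf(control_maf, odds_ratio):
    """Minor allele frequency in cases implied by an allelic odds ratio.

    Assumes p1 / (1 - p1) = OR * p0 / (1 - p0), where p0 is the MAF in
    controls and p1 is the MAF in cases.
    """
    if not 0.0 < control_maf < 1.0:
        raise ValueError("control_maf must lie strictly between 0 and 1")
    if odds_ratio <= 0.0:
        raise ValueError("odds_ratio must be positive")
    return odds_ratio * control_maf / (1.0 + control_maf * (odds_ratio - 1.0))


# The four ORs shown in the figure, evaluated at an illustrative control MAF of 0.2.
for odds_ratio in (0.5, 0.8, 1.25, 2.0):
    print(odds_ratio, round(case_maf(0.2, odds_ratio), 4))
```

Under this assumption, protective SNPs (OR < 1) have a lower MAF in cases than in controls and deleterious SNPs (OR > 1) have a higher one.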