Comparative study for haplotype block partitioning methods – Evidence from chromosome 6 of the North American Rheumatoid Arthritis Consortium (NARAC) dataset

Haplotype-based methods compete with “one-SNP-at-a-time” approaches on being preferred for association studies. Chromosome 6 contains most of the known genetic biomarkers for rheumatoid arthritis (RA) disease. Therefore, chromosome 6 serves as a benchmark for the haplotype methods testing. The aim of this study is to test the North American Rheumatoid Arthritis Consortium (NARAC) dataset to find out if haplotype block methods or single-locus approaches alone can sufficiently provide the significant single nucleotide polymorphisms (SNPs) associated with RA. In addition, could we be satisfied with only one method of the haplotype block methods for partitioning chromosome 6 of the NARAC dataset? In the NARAC dataset, chromosome 6 comprises 35,574 SNPs for 2,062 individuals (868 cases, 1,194 controls). Individual SNP approach and three haplotype block methods were applied to the NARAC dataset to identify the RA biomarkers. We employed three haplotype partitioning methods which are confidence interval test (CIT), four gamete test (FGT), and solid spine of linkage disequilibrium (SSLD). P-values after stringent Bonferroni correction for multiple testing were measured to assess the strength of association between the genetic variants and RA susceptibility. Moreover, the block size (in base pairs (bp) and number of SNPs included), number of blocks, percentage of uncovered SNPs by the block method, percentage of significant blocks from the total number of blocks, number of significant haplotypes and SNPs were used to compare among the three haplotype block methods. Individual SNP, CIT, FGT, and SSLD methods detected 432, 1,086, 1,099, and 1,322 associated SNPs, respectively. Each method identified significant SNPs that were not detected by any other method (Individual SNP: 12, FGT: 37, CIT: 55, and SSLD: 189 SNPs). 916 SNPs were discovered by all the three haplotype block methods. 367 SNPs were discovered by the haplotype block methods and the individual SNP approach. The P-values of these 367 SNPs were lower than those of the SNPs uniquely detected by only one method. The 367 SNPs detected by all the methods represent promising candidates for RA susceptibility. They should be further investigated for the European population. A hybrid technique including the four methods should be applied to detect the significant SNPs associated with RA for chromosome 6 of the NARAC dataset. Moreover, SSLD method may be preferred for its favored benefits in case of selecting only one method.


Introduction
The reasons for excluding the 1,452 SNPs were the biomarker checks: (a) less than 75% genotype percent, (b) less than 0.001 Hardy-Weinberg equilibrium (HWE) P-value or (c) less than 0.001 minor allele frequency (MAF) in the total sample.

Materials
The chromosome 6 data file was extracted from the NARAC data file using the programming language (Perl). The chromosome 6 data file was reformatted to be ready for processing by the program (PLINK) by the statistical package (R 3.1.0). Moreover, the R language was used to extract the chromosome 6 map file from the NARAC map file (SNP ID, physical position, and chromosome number). The chromosome 6 reformatted data and map files were processed by (PLINK 1.07) and (gPLINK 2.05) programs to be ready for processing by the program (Haploview) [14].
The computer program (Haploview 4.2) was used to partition chromosome 6 into successive blocks using CIT, FGT, and SSLD methods and to calculate the corresponding P-value for each haplotype in each block. The default parameters for the three methods were used. In addition, the program (Haploview 4.2) was used to apply the individual SNP approach and to provide the corresponding P-value for each SNP [15]. The distribution of P-values on chromosome 6 for the individual SNP approach was presented using the Integrative Genomics Viewer (IGV 2.3.72) [16,17]. The significant blocks and the associated SNPs were selected using the programming language (Matlab Release 2010a). Fig 1 showed a block diagram for the whole chromosomic association analysis process.

Haplotype block methods
The haplotype blocks have been defined through more than one method. Several algorithms for haplotype partitioning have been proposed, among which the CIT, FGT, and SSLD have been implemented in HaploView 4.2. Each method differs greatly from the others in its scope of the definition of the haplotype blocks. Therefore, an objective has been arisen for comparing the results of these methods which was studied through this research paper [18].
Confidence interval test. A haplotype block is defined by the CIT method as a region over which a very small fraction (� 5%) of the measures among the informative SNP pairs are in weak linkage disequilibrium (LD). The informative SNP pairs are pairs showing strong LD or weak LD. Biological and artefactual forces in addition to recombination events are the reasons for allowing 5% of weak LD in the haplotype block. These forces could be recurrent mutation, gene conversion, or errors of the genome assembly or genotyping.
D´(normalized deviation) is used to measure the LD between a pair of SNPs. CI is used to assess the reliability of the estimate of D´. Strong LD is defined as the upper limit of the 90% CI is (0.98), and the lower limit is (0.7). In contrast, weak LD is defined as the upper limit of the 90% CI is (0.9), as shown in S1 Fig. The thresholds were obtained by Gabriel et al. Short blocks (2)(3)(4)(5) were treated with different thresholds for different populations to select the used thresholds [19].
Four gamete test. The FGT is a haplotype block partitioning method that assumes recombination events are not allowed within each block. When the four gametes are identified, a recombination event has been occurred. The rare gamete must be observed at a frequency greater than 0.01 to count a recombination event. The recombination events are only accepted between blocks. The FGT method differs from other haplotype block definitions that it does not require a threshold for D´.
The recombination events interrupt the continuity of the testing process. When a recombination event is observed between the (k th ) locus and any preceding locus, the locus (k-1) is considered the ending point of the tested block. The block size could be measured as the distance between the start locus and the end locus. In this situation, the locus (k) is considered the starting point to search for a new block [20].
Solid spine of linkage disequilibrium. The SSLD method defines the haplotype block as a region at which a (spine) of strong LD (D´> 0.8) moving from one allele to the next allele along the legs of the triangle in the LD chart, as shown in S2 Fig. In other words, for each block, there must be a strong LD between (the first SNP and the last SNP) and all the intermediate SNPs. However, the intermediate SNPs should not be in strong LD with each other [15]. Table 1 represents the concept of the SSLD method. Five SNPs are tested, having the same results as in S2 Fig. Here, (SNP2 and SNP3), (SNP3 and SNP4), and (SNP2 and SNP4) are not in strong LD. All other combinations of SNPs are in strong LD. Thus, there is a haplotype block extends from SNP1 to SNP5. Table 2 shows a comparison among the three haplotype block definitions.

Testing for associations with disease status
Both the individual SNP associations and the haplotype associations were measured with the aid of the P-values. The statistically significant SNPs were detected using their corresponding P-values after stringent Bonferroni correction for multiple testing (< 0.05/# of tests) [21]. The Bonferroni thresholds were 8.49 × 10 −6 for the CIT method, 7.95 × 10 −6 for the FGT method, 9.41 × 10 −6 for the SSLD method, and 1.47 × 10 −6 for the individual SNP approach where the total number of tests were 5,888 for the CIT method, 6,293 for the FGT method, 5,313 for the SSLD method, and 34,122 for the individual SNP approach.

Results
The testing algorithms were applied on Intel Core i7-4720HQ 2.6 GHz system with 16 GB of RAM. S1 Table provided the processing time for each used program. A total working time was 181 minutes.
The distribution of the observed P-values for all the used models showed evidence for population stratification (Fig 2) [22, 23]. The reason for this stratification might be the selection of chromosome 6 that contains the HLA region where many highly significant associations occur. The distribution of the P-values across chromosome 6 was presented in Fig 3 for the individual SNP approach. The top SNP (rs660895) in the HLA region (32,685,358 bp) had the lowest P-value (1.03 X 10 −113 ) as previously reported in Arya et al. [24].
The results related to the haplotype block methods are shown in Table 3. The associated SNPs properties are included in S1 Spreadsheet for the CIT method, S2 Spreadsheet for the FGT method, S3 Spreadsheet for the SSLD method, and S4 Spreadsheet for the individual SNP approach.
The SSLD significant blocks included more associated SNPs (1,322) than FGT (1,099) and CIT (1,086) blocks. Moreover, the number of the associated SNPs by the individual SNP  Table 4.  Does the method of haplotype partitioning affect the association results of chromosome 6 of the NARAC dataset?
From Table 4, the intergenic region between HLA-DQB1 and HLA-DQA2 -at the p21.32 band of chromosome 6 (MHC class II)-contained blocks with the largest number of RA-associated biomarkers by the individual SNP approach. Within this region, the SSLD method had two blocks with thirteen (P-value = 1.49 X 10 −53 , no. of SNPs = 20) and twelve (P-value = 1.55 X 10 −52 , no. of SNPs = 17) biomarkers. The CIT method was represented by a block (P-value = 3.68 X 10 −54 , no. of SNPs = 31) having twenty-two biomarkers. The FGT block (P-value = 3.26 X 10 −52 , no. of SNPs = 28) was detected containing nineteen biomarkers.

Discussion and conclusions
In this study, 34,122 SNPs were used to examine the association with RA susceptibility in the NARAC dataset. The examined SNPs belonged to chromosome 6. The surveyed SNPs on chromosome 6 of the NARAC dataset were dense enough for the application of the haplotype block methods. Four methods were applied to assign the associations (three haplotype block  Does the method of haplotype partitioning affect the association results of chromosome 6 of the NARAC dataset?
Does the method of haplotype partitioning affect the association results of chromosome 6 of the NARAC dataset?
Does the method of haplotype partitioning affect the association results of chromosome 6 of the NARAC dataset?
methods and individual SNP approach). The three used haplotype block methods were CIT, FGT, and SSLD. The individual SNP analysis was to point to chromosome regions that are genetically linked to the disease. The haplotype block methods were to further expand from the single SNPs with the strongest signal to the actual causal variants [112]. The aim of this study is to test the NARAC dataset to find out if haplotype block methods or single-locus approach alone can sufficiently provide the significant biomarkers associated with RA. Our research failed to select the best method as each method yielded significant results that were not detected using any of the other methods. S2 Table showed the SNP IDs that were uniquely identified by each method. The individual SNP, CIT, FGT, and SSLD methods exclusively detected 12, 55, 37, and 189 SNPs, respectively. Our findings were in line with Shim et al. conclusion (but they didn't test the SSLD method) that both the individual SNP approach and the haplotype block methods should be applied side by side to discover the valuable associations in the NARAC dataset [5].
In addition, the 367 SNPs-that were significantly associated with RA susceptibility by the individual SNP approach and the haplotype block methods-represent potent candidates for further investigations (S6 Spreadsheet). The three haplotype block methods were able to detect 916 associated SNPs in common. The SSLD method detected more significant SNPs (1,322) than CIT (1,086), FGT (1,099), and individual SNP (432) methods. This observation could be understood, as the SSLD does not take into account the LD between the intermediate SNPs. Therefore, the SSLD method is the lowest conservative method in including SNPs inside it's blocks.
The block similarity for the three applied methods were shown in Table 5. The similarity measure represented the intersected SNPs divided by the union SNPs for the two studied Does the method of haplotype partitioning affect the association results of chromosome 6 of the NARAC dataset?
methods. The highest block similarity was between the CIT method and the FGT method. While, the lowest block similarity was between the CIT method and the SSLD method. The results showed that the FGT method had the most similarity with the other methods. Fig 4 demonstrated the overlapping of the significant blocks resulted from the three haplotype block methods. Nearly, all chromosome 6 regions that were associated with RA were in blocks that were detected by more than one method. Most of the haplotype blocks that showed significant associations with RA disease were in the MHC region or near it (+ 3 Mb). Most of the 916 SNPs that were detected by the three block methods were in the MHC region. These outcomes confirmed the strong association between the MHC region and RA susceptibility. For further analysis of the results, a comparison between the properties of the SNPs that were detected by all the methods (intersection) and that were uniquely detected by one method (unique) was shown in Table 6. The properties of all the associated SNPs detected by the corresponding method (reference) were added to Table 6 for more clarification. For all the methods, the P-values of the (intersection) were lower than that of the (reference) and more lower than that of the (unique). This observation confirmed that the 367 SNPs (intersection) represents potent candidates for further investigations. For the CIT and FGT methods, the median block size (bp) of the (unique) was greater than that of the (intersection) and more greater than the (reference). However, this observation might be due to the small number of blocks representing the (unique) in comparison to the (intersection) and the (reference). While, when the number of blocks representing the (unique) (53) was sufficiently comparable to that of the (intersection) (104) and the (reference) (145) in the SSLD method, the median block size (bp) of the (intersection) was greater than that of the (unique) and more greater than that of the (reference).
Some associated SNPs were discovered using all the methods but others were observed by only one method. This finding might be explained by some reasons. For the associations that were observed using individual SNP approach only, it may be that only one SNP represent strong LD with the causal SNP. Therefore, studying haplotypes could decrease the power of association as they consist of several SNPs. For the associations that were observed using the haplotype block methods only, Individual SNP approach required approximately 83% more tests than the haplotype block methods. Consequently, the Bonferroni correction was more severe for the individual SNP approach. Moreover, the haplotype block methods were able to capture the interactions among many causal SNPs. In addition, haplotypes could capture rare alleles that individual SNPs may not detect. This reason could be clarified as the power to observe associations is maximized when the frequencies of the studied biomarker and the causal SNP are similar.
For the associations that were observed using a haplotype block method but not by the other haplotype block methods, each method differs greatly from the others in its scope of the definition of the haplotype blocks. At last, we conclude that the application of the individual SNP approach and the three haplotype block methods altogether on chromosome 6 of the NARAC dataset will in turn maximize the system's ability for discovering crucial associations. In case of selecting one method, the SSLD would be the most appropriate method for the NARAC dataset. The SSLD method has valuable advantages such as the highest genomic coverage, the largest minimum, median, and maximum significant block sizes, the biggest number of significant SNPs included in blocks, and the biggest number of associated SNPs discovered exclusively by a method.
The limitations of this study are as follows: a) the effects of population stratification were not accounted for; b) a replication study in other datasets was not performed.   Table. SNP IDs that were uniquely identified by each method. (DOCX) S1 Spreadsheet. Properties for the associated SNPs using CIT method. Sheet1 "ID" represents SNPs IDs, and each row represents a block. Sheet2 "Bp" represents SNPs physical positions in base pairs, and each row represents a block. Sheet3 "No. of SNPs in Block" represents the number of SNPs in each block. Sheet4 "Start-Stop"-first column represents blocks start physical positions in base pairs. Sheet4 "Start-Stop"-second column represents blocks end physical positions in base pairs. Sheet4 "Start-Stop"-third column represents blocks sizes in base pairs. Sheet5 "Block no." represents the block numbers (positions) from all blocks partitioned by CIT method. Sheet6 "P-values" represents the P-values of the blocks. (XLS) S2 Spreadsheet. Properties for the associated SNPs using FGT method. Sheet1 "ID" represents SNPs IDs, and each row represents a block. Sheet2 "Bp" represents SNPs physical positions in base pairs, and each row represents a block. Sheet3 "No. of SNPs in Block" represents the number of SNPs in each block. Sheet4 "Start-Stop"-first column represents blocks start physical positions in base pairs. Sheet4 "Start-Stop"-second column represents blocks end physical positions in base pairs. Sheet4 "Start-Stop"-third column represents blocks sizes in base pairs. Sheet5 "Block no." represents the block numbers (positions) from all blocks partitioned by FGT method. Sheet6 "P-values" represents the P-values of the blocks. (XLS) S3 Spreadsheet. Properties for the associated SNPs using SSLD method. Sheet1 "ID" represents SNPs IDs, and each row represents a block. Sheet2 "Bp" represents SNPs physical positions in base pairs, and each row represents a block. Sheet3 "No. of SNPs in Block" represents the number of SNPs in each block. Sheet4 "Start-Stop"-first column represents blocks start physical positions in base pairs. Sheet4 "Start-Stop"-second column represents blocks end physical positions in base pairs. Sheet4 "Start-Stop"-third column represents blocks sizes in base pairs. Sheet5 "Block no." represents the block numbers (positions) from all blocks partitioned by SSLD method. Sheet6 "P-values" represents the P-values of the blocks. (XLS) S4 Spreadsheet. Properties for the associated SNPs using individual SNP approach. Sheet1 "ID" represents SNPs IDs. Sheet2 "Bp" represents SNPs physical positions in base pairs. Sheet6 "P-values" represents the P-values of the SNPs. (XLS) S5 Spreadsheet. Intersection between haplotype methods and individual SNP approach. Sheet1 "CIT" represents SNPs IDs detected by both CIT method and Individual SNP Approach. Sheet2 "FGT" represents SNPs IDs detected by both FGT method and Individual SNP Approach. Sheet3 "SSLD" represents SNPs IDs detected by both SSLD method and Individual SNP Approach.