Conceived and designed the experiments: AEH JD KLL. Performed the experiments: AEH. Analyzed the data: AEH. Wrote the paper: AEH. Provided critical feedback and guidance to the overall interpretation and content: JD MG MWL KLL. Edited and revised all versions of the manuscript: AEH JD MG MWL KLL.
The authors have declared that no competing interests exist.
Accurately modeling LD in simulations is essential to correctly evaluate new and existing association methods. At present, there has been minimal research comparing how well existing gene region simulation methods reproduce the LD structure of an existing gene region. Here we compare the ability of three approaches to accurately simulate the LD within a gene region: HapSim (2005), Hapgen (2009), and a minor extension to simple haplotype resampling.
In order to observe the variation and bias for each method, we compare the simulated pairwise LD measures and minor allele frequencies to the original HapMap data in an extensive simulation study. When possible, we also evaluate the effects of changing parameters.
HapSim produces samples of haplotypes with lower LD, on average, compared to the original haplotype set while both our resampling method and Hapgen do not introduce this bias. The variation introduced across the replicates by our resampling method is quite small and may not provide enough sampling variability to make a generalizable simulation study.
We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.
Many new statistical methods and algorithms to detect association between a trait and one or more genetic variants have recently been developed to analyze the abundance of data produced by Genome Wide Association Studies (GWAS). Simulated data are used to verify and compare the type-I error rates and power of these new association methods. The methods are often compared over a variety of gene region, phenotypic, and association simulation scenarios
Genetic data simulation was first developed within population genetics theory. Methods developed from population genetics theory, called forward time and backwards time (or coalescent) methods, often simulate haplotypes without relying on real data, and instead only use parameters to model aspects of population genetics such as recombination, gene conversion, and evolutionary models. More recently, researchers have developed methods that simulate directly from an existing sample of haplotypes. We describe these methods further below.
Simulating directly from a set of existing haplotypes avoids relying exclusively on subjective parameters and is likely to give a representative picture of the complex underlying LD structure in a gene region since the methods start with real data. Further, simulating directly from a gene region is relatively straightforward and is computationally efficient. Therefore, in this paper we focus on methods that simulate from a set of observed haplotypes in a gene region.
In addition to focusing on gene region simulation methods, we further concentrate on methods that appear to or claim to be able to simulate pairwise LD similar to the original sample of haplotypes over at least a 100 Kb chromosomal region, and can take any set of haplotypes as a starting sample. Three methods that meet these criteria are Hapgen
In 2003, Li and Stephens used an approximation to conditional probability to relate a distribution of haplotypes to a recombination rate that varies across a chromosomal region
A) D’ Gene Region 1 (300 Kb), B) D’ Gene Region 2 (1000 Kb), C) r2 Gene Region 1 (300 Kb), D) r2 Gene Region 2 (1000 Kb).
Method | N | Min | Q1 | Median | Q3 | Max | Mean | SD
Hapgen | 14500 | −0.074 | −0.010 | −0.001 | 0.009 | 0.068 | −0.001 | 0.017
Resampling | 14500 | −0.047 | −0.005 | <0.001 | 0.004 | 0.033 | <0.001 | 0.008
HapSim | 14300 | −0.042 | −0.005 | <0.001 | 0.004 | 0.038 | <0.001 | 0.008
Change in MAF from original HapMap MAF for each pair of SNPs in Gene Region 1 (MAFsimulated – MAFHapMap).
HapSim requires that all monoallelic SNPs are removed prior to simulation.
Resampling haplotypes is probably the most straightforward simulation method. It was first described by de Bakker et al. in a paper looking into the efficiency and power of GWAS
In HapSim
Before comparing methods, it is important to establish the desired characteristics of the simulation replicates. As with any sample of simulated replicates, there should be some variation. We believe that too little variation limits the generalizability of the simulation study while too much variation may be unrealistic and might break down the characteristics of the gene region used for simulation. Thus, the ideal simulation method will produce replicates that differ enough to produce sampling variability but not so much that the unique characteristics of the particular gene region are lost. However, the ideal amount of variation is difficult to quantify. Further, we believe that when simulation is used to evaluate association analysis methods for a specific gene region, a desirable characteristic is that the method produces replicates that do not, on average, introduce an overall loss or gain in LD. This is the main characteristic that we evaluate in this article. We also examine the variation and potential bias of MAF.
Several previous reviews of haplotype simulation methods as well as papers describing a new method exist
Here, we compare, through parallel implementation, the ability of Hapgen, HapSim, and resampling to simulate a gene region without introducing an overall loss or gain in LD across the region.
Method | N | Min | Q1 | Median | Q3 | Max | Mean | SD
Hapgen | 44500 | −0.086 | −0.010 | <0.001 | 0.009 | 0.093 | <0.001 | 0.018
Resampling | 44500 | −0.039 | −0.004 | <0.001 | 0.004 | 0.043 | <0.001 | 0.008
HapSim | 37900 | −0.041 | −0.005 | <0.001 | 0.005 | 0.037 | <0.001 | 0.008
Change in MAF from original HapMap MAF for each pair of SNPs in Gene Region 2 (MAFsimulated – MAFHapMap).
HapSim requires that all monoallelic SNPs are removed prior to simulation.
Histograms of the change in simulated LD from original LD for each pair of SNPs in Gene Region 1 (LDsimulated – LDHapMap). A) D’, Resampling (gray) vs Hapgen (dotted); B) D’, Resampling (gray) vs HapSim (dotted); C) r2, Resampling (gray) vs Hapgen (dotted); D) r2, Resampling (gray) vs HapSim (dotted).
To ensure generalizability for our comparisons, we used two diverse gene regions. The first gene region is located on chromosome 4 and was defined as 100 Kb from each end of the longest
Histograms of the change in simulated LD from original LD for each pair of SNPs in Gene Region 2 (LDsimulated – LDHapMap). A) D’, Resampling (gray) vs Hapgen (dotted); B) D’, Resampling (gray) vs HapSim (dotted); C) r2, Resampling (gray) vs Hapgen (dotted); D) r2, Resampling (gray) vs HapSim (dotted).
GR1 Median (Q1, Q3) | GR2 Median (Q1, Q3)
0.7047 (0.5973, 0.8091) | 0.6747 (0.0776, 0.7948)
0.3543 (0.2251, 0.4399) | 0.3574 (0.2221, 0.4084)
0.4249 (0.3168, 0.5427) | 0.3506 (0, 0.4165)

0.7886 (0.5264, 1.0360) | 0.7164 (0.6215, 0.8524)
0.2797 (0.2408, 0.3171) | 0.4311 (0.3610, 0.4827)
0.3504 (0.2624, 0.6330) | 0.4417 (0.3829, 0.4909)
Another region on chromosome 4 has been shown to be associated with Atrial Fibrillation
As described above, we defined the gene regions using two distinct definitions, which produced regions with different lengths and LD patterns. Some researchers may choose an area around a particular gene, which is similar to how we defined gene region 1, while others may simulate a large section of a chromosome based on some other criterion, such as a region surrounding the SNP with the lowest p-value, as we do for gene region 2. While the size of the gene regions being simulated and analyzed by other researchers will depend on the definition used to create the region, we believe our gene regions encompass much of the range that would be seen in other studies.
Heat maps of change in median simulated LD from original LD in Gene Region 1 (median[LDsimulated] – LDHapMap). Upper left D’, lower right r2. Blue indicates a gain in LD; red indicates a loss in LD.
Heat maps of change in median simulated LD from original LD in Gene Region 2 (median[LDsimulated] – LDHapMap). Upper left D’, lower right r2. Blue indicates a gain in LD; red indicates a loss in LD.
Method | N | Min | Q1 | Median | Q3 | Max | Mean | SD
D’
Hapgen | 1011611 | −1.000 | −0.030 | <0.001 | 0.021 | 1.000 | −0.004 | 0.154
Resampling | 1015300 | −0.991 | −0.006 | <0.001 | 0.010 | 0.993 | 0.003 | 0.058
HapSim | 1015300 | −1.000 | −0.309 | −0.106 | 0.000 | 1.000 | −0.161 | 0.288
r2
Hapgen | 1011611 | −1.000 | −0.008 | <0.001 | 0.006 | 0.445 | −0.002 | 0.036
Resampling | 1015300 | −0.254 | −0.003 | <0.001 | 0.003 | 0.259 | <0.001 | 0.014
HapSim | 1015300 | −0.665 | −0.049 | −0.006 | 0.000 | 0.275 | −0.046 | 0.103
Change in simulated LD from original HapMap sample LD for each pair of SNPs in Gene Region 1 (LDsimulated – LDHapMap).
Sum of SNP pairs over all 100 replicates. The number of SNP pairs for Hapgen is not divisible by 100 because monoallelic SNPs were dropped from the LD calculations. Because Hapgen had more variation in MAF, it was more likely that a SNP with a low MAF would become monoallelic in one or more simulation replicates and would thus be dropped from the LD calculations.
Method | N | Min | Q1 | Median | Q3 | Max | Mean | SD
D’
Hapgen | 7002401 | −1.000 | −0.053 | <0.001 | 0.061 | 1.000 | 0.002 | 0.172
Resampling | 7138597 | −0.999 | −0.018 | <0.001 | 0.032 | 1.000 | 0.009 | 0.073
HapSim | 7149516 | −1.000 | −0.102 | <0.001 | 0.070 | 1.000 | −0.050 | 0.224
r2
Hapgen | 7002401 | −0.861 | −0.002 | <0.001 | 0.002 | 0.666 | <0.001 | 0.013
Resampling | 7138597 | −0.250 | −0.001 | <0.001 | 0.001 | 0.359 | <0.001 | 0.005
HapSim | 7149516 | −0.934 | −0.002 | <0.001 | 0.003 | 0.333 | −0.005 | 0.040
Change in simulated LD from original HapMap sample LD for each pair of SNPs in Gene Region 2 (LDsimulated – LDHapMap).
Sum of SNP pairs over all 100 replicates. The number of SNP pairs is not divisible by 100 because monoallelic SNPs were dropped from the LD calculations. Because Hapgen had more variation in MAF, it was more likely that a SNP with a low MAF would become monoallelic in one or more simulation replicates and thus, more SNP pairs were dropped from the LD calculations than for Resampling or HapSim.
To generate simulated replicates from these gene regions, we used the populations with European ancestry, CEU and TSI, from the HapMap data (Phase III in 2009)
We used variable recombination rates across each gene region estimated by the HapMap project using McVean et al.’s coalescent method
Correlation between dichotomized variables compared to the original correlation between two normally distributed variables. Each curve represents an original correlation value (
For each method and variation, we simulated 100 replicates each consisting of 2,000 subjects.
We used Hapgen v1.3.0. Hapgen simulates mosaic haplotypes using a Hidden Markov Model to define the probability of continuing on the current haplotype segment or transitioning to a randomly chosen haplotype segment. The transition probabilities are defined by the variable recombination rate across the region as well as the effective population size
Given the starting set of haplotypes, we sampled two haplotypes with replacement. We then recombined this pair of haplotypes using a variable recombination rate across the region to specify the probability of recombination occurring at a given chromosomal location. We implemented this method in R
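The resampling step described above can be sketched as follows. This is a minimal Python illustration, not the R implementation used in the study. The function name, the `recomb_prob` vector of per-interval crossover probabilities, and the choice to return both recombinant haplotypes are assumptions made for the example.

```python
import numpy as np

def resample_pair_with_recombination(haplotypes, recomb_prob, rng, weight=1.0):
    """Draw two haplotypes with replacement and recombine them along the region.

    haplotypes  : (n_hap, n_snp) array of 0/1 alleles (the starting sample).
    recomb_prob : length n_snp - 1; assumed probability of a crossover in each
                  interval between adjacent SNPs (hypothetical input format).
    weight      : multiplier applied to recomb_prob, clipped to [0, 1].
    """
    haplotypes = np.asarray(haplotypes)
    n_hap, n_snp = haplotypes.shape
    i, j = rng.integers(0, n_hap, size=2)            # sample with replacement
    parents = haplotypes[[i, j]]

    p = np.clip(weight * np.asarray(recomb_prob, dtype=float), 0.0, 1.0)
    crossover = rng.random(n_snp - 1) < p            # crossover in each interval?

    # strand[k] is the parent (0 or 1) copied by the first child at SNP k;
    # it flips every time a crossover occurs.
    strand = np.concatenate(([0], np.cumsum(crossover) % 2))
    cols = np.arange(n_snp)
    child_a = parents[strand, cols]
    child_b = parents[1 - strand, cols]
    return child_a, child_b

# Toy usage with made-up data: 20 haplotypes, 15 SNPs.
rng = np.random.default_rng(0)
toy_haplotypes = rng.integers(0, 2, size=(20, 15))
toy_recomb = np.full(14, 0.05)
child_a, child_b = resample_pair_with_recombination(toy_haplotypes, toy_recomb, rng)
```

Repeating the call until the desired number of haplotypes is reached gives one replicate. With all crossover probabilities set to zero the sketch reduces to plain haplotype resampling.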
We used HapSim v0.2. Starting with the original set of haplotype markers, HapSim calculates a covariance matrix assuming a bivariate normal distribution and using the observed joint probability of each haplotype marker pair. HapSim then simulates random vectors from a multivariate normal distribution centered at zero using the previously calculated covariance matrix. Finally, the program transforms the normally distributed vectors back to vectors of binary values using thresholds defined by the observed allele frequency of each marker.
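The last two steps of this procedure (drawing multivariate normal vectors and thresholding them by allele frequency) can be illustrated with a short Python sketch. It is not HapSim's code: the function name is hypothetical, and the latent correlation matrix `latent_corr` is taken as a given input rather than derived from the observed joint probabilities as HapSim does.

```python
import numpy as np
from statistics import NormalDist

def simulate_by_thresholding(latent_corr, allele_freq, n_haplotypes, rng):
    """Simulate 0/1 haplotypes by dichotomizing multivariate normal draws.

    latent_corr : (n_snp, n_snp) covariance of the latent normal variables
                  (assumed to be supplied; must be positive semi-definite).
    allele_freq : frequency of the allele coded 1 at each SNP, strictly
                  between 0 and 1 (monoallelic SNPs have no finite threshold).
    """
    allele_freq = np.asarray(allele_freq, dtype=float)
    mean = np.zeros(len(allele_freq))
    z = rng.multivariate_normal(mean, latent_corr, size=n_haplotypes)
    # Threshold chosen so that P(z > threshold) equals the allele frequency.
    thresholds = np.array([NormalDist().inv_cdf(1.0 - f) for f in allele_freq])
    return (z > thresholds).astype(int)

# Toy usage with made-up values.
rng = np.random.default_rng(1)
toy_corr = np.array([[1.0, 0.6],
                     [0.6, 1.0]])
toy_haplotypes = simulate_by_thresholding(toy_corr, [0.2, 0.4], 20000, rng)
```

The thresholds are undefined for allele frequencies of exactly 0 or 1, which is in line with the requirement that monoallelic SNPs be removed before running HapSim.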
HapSim uses a multivariate normal distribution and the observed MAF and joint probabilities to produce simulated haplotypes. Thus, the parameters used by HapSim are embedded within these choices and are set by using a multivariate normal distribution with a mean of 0, and a covariance matrix estimated using the observed MAF for each marker and joint probabilities for each pair of markers.
It is important to note that the calculated covariance matrix may not be positive definite, which is necessary for the matrix to be used as the covariance matrix for a multivariate normal distribution. When this is the case, HapSim approximates a positive definite version of the covariance matrix by (1) computing the eigenvalue decomposition of the covariance matrix and (2) raising all eigenvalues below a minimum tolerance threshold up to that threshold.
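The two-step repair can be sketched in Python as below. This is an illustration of the general technique under stated assumptions, not HapSim's implementation: the function name and the tolerance value `tol=1e-6` are hypothetical, and any further rescaling HapSim may apply is not shown.

```python
import numpy as np

def floor_eigenvalues(cov, tol=1e-6):
    """Return a positive definite approximation of a symmetric matrix.

    Eigenvalues below `tol` (an assumed tolerance) are raised to `tol`,
    and the matrix is rebuilt from the adjusted decomposition.
    """
    cov = np.asarray(cov, dtype=float)
    cov = (cov + cov.T) / 2.0                  # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 1: eigendecomposition
    eigvals = np.maximum(eigvals, tol)         # step 2: floor small eigenvalues
    return (eigvecs * eigvals) @ eigvecs.T     # V diag(lambda) V^T

# A symmetric matrix with unit diagonal that is not positive definite.
toy_cov = np.array([[ 1.0, 0.9, -0.9],
                    [ 0.9, 1.0,  0.9],
                    [-0.9, 0.9,  1.0]])
repaired = floor_eigenvalues(toy_cov)
```

Note that the repaired matrix differs from the input in every entry affected by the adjusted eigenvalues, including the diagonal, which is one way the approximation introduces error.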
When comparing methods, we used the following parameters for Hapgen: mutation rate = 0, effective population size = 11,418 (often used for samples of European descent)
D’ and r2 are two common measures of LD. We calculated the D’ and r2 using Haploview
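For reference, both measures can be computed from phased haplotypes with the standard definitions, as in the Python sketch below. The function name is hypothetical, the sketch reports the absolute value of D’, and it is not a substitute for the Haploview calculations used in the study.

```python
import numpy as np

def pairwise_ld(a, b):
    """Return (|D'|, r2) for two biallelic SNPs coded 0/1 across haplotypes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p_a, p_b = a.mean(), b.mean()              # frequencies of the alleles coded 1
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return float("nan"), float("nan")      # LD is undefined for monoallelic SNPs
    d = (a * b).mean() - p_a * p_b             # D = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Toy example: 100 haplotypes, a rare allele (2%) that always occurs
# together with a common allele (30%), so one haplotype is never observed.
snp_a = [1] * 2 + [0] * 98
snp_b = [1] * 2 + [1] * 28 + [0] * 70
d_prime, r2 = pairwise_ld(snp_a, snp_b)
```

In this toy example D’ is at its maximum because one of the four possible haplotypes is absent, while r2 stays below 0.05, which illustrates how differently the two measures respond to low MAF.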
For each replicate, we calculated the LD (D’ or r2) for each pair of SNPs. To look at the bias of the LD (D’ or r2), we compared the pairwise LD values for each method’s replicates to the original pairwise LD values for the entire region, as well as for subsets of SNPs in the region defined by LD, D’ or r2 (≤0.2, between 0.2 and 0.8, and ≥0.8), and by MAF (≤0.1, between 0.1 and 0.3, and ≥0.3). (The equation for bias is shown in Equation 1.) To visualize this comparison we produced: (1) histograms of the difference between the simulated pairwise LD and original pairwise LD for each pair of SNPs over all replicates and (2) heat map plots of the median change in simulated pairwise LD compared to original pairwise LD. In addition to LD, we compared the distributions of the change between the replicates and the original sample for MAF.
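A minimal Python sketch of this comparison is given below, assuming the pairwise LD values are already available as arrays. The function name, the array layout, and the use of NaN to mark SNP pairs dropped from a replicate are assumptions for illustration; only the LD strata are shown, and the difference is taken as simulated minus original, as in the table and figure captions.

```python
import numpy as np

def ld_change_by_stratum(ld_original, ld_replicates):
    """Summarize (LD_simulated - LD_original) within strata of original LD.

    ld_original   : (n_pairs,) LD of each SNP pair in the starting sample.
    ld_replicates : (n_replicates, n_pairs) LD of the same pairs per replicate;
                    NaN marks a pair that could not be computed in a replicate.
    """
    ld_original = np.asarray(ld_original, dtype=float)
    ld_replicates = np.asarray(ld_replicates, dtype=float)
    diff = ld_replicates - ld_original         # broadcast over replicates
    strata = {
        "<=0.2": ld_original <= 0.2,
        "0.2-0.8": (ld_original > 0.2) & (ld_original < 0.8),
        ">=0.8": ld_original >= 0.8,
    }
    summary = {}
    for name, mask in strata.items():
        values = diff[:, mask].ravel()
        values = values[~np.isnan(values)]     # ignore dropped pairs
        if values.size == 0:
            summary[name] = None
            continue
        summary[name] = {
            "n": int(values.size),
            "median": float(np.median(values)),
            "mean": float(values.mean()),
            "sd": float(values.std(ddof=1)),
        }
    return summary

# Toy usage: three SNP pairs, two replicates (made-up values).
toy_summary = ld_change_by_stratum(
    [0.1, 0.5, 0.9],
    [[0.1, 0.4, 0.9],
     [0.2, 0.4, 0.7]],
)
```

A negative mean or median within a stratum indicates a loss in LD relative to the starting sample, and a positive value indicates a gain.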
To gain insight into a possible appropriate amount of variation desired between replicates, we calculated the standard error (SE) of the LD estimate from the original data for each SNP pair and compared those with the replicate standard deviation (SD) of the LD estimate for each SNP pair. We used Zapata et al.’s method to estimate the SE for D’
Finally, we calculated the time necessary to simulate 10 replicates using each method for each gene region.
Where possible and using the haplotype sample for Gene Region 1, we varied simulation parameters to study each parameter’s effect on gene region characteristics. When comparing the effects of varying one particular parameter, we kept all other parameters constant at the values used to compare the methods (Hapgen: mutation rate = 0, effective population size = 11,418, recombination rate weight = 1; Extension to resampling: recombination rate weight = 1).
For Hapgen, we compared the mutation rate by varying θ, the modifiable mutation rate parameter to be 0, 1, 2, or 5. Hapgen uses the following formula to model the probability of a mutation occurring at a given SNP where
Using Equation 3 and k = 410 CEU and TSI haplotypes, the resulting probabilities of a mutation at a given SNP for a given haplotype are 0, 0.0024, 0.0049, and 0.0120 for θ = 0, 1, 2, and 5 respectively.
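The reported probabilities can be checked numerically. The sketch below assumes the per-SNP mutation probability has the form θ / (k + θ); this form is inferred here because it reproduces the four values listed above, and it should be read as an assumption rather than as a restatement of Equation 3.

```python
def mutation_probability(theta, k):
    """Assumed per-SNP, per-haplotype mutation probability: theta / (k + theta)."""
    return theta / (k + theta)

k = 410  # CEU and TSI haplotypes in the starting sample
probabilities = {theta: mutation_probability(theta, k) for theta in (0, 1, 2, 5)}
for theta, prob in probabilities.items():
    print(f"theta = {theta}: {prob:.4f}")
```

Under this form, the probability grows almost linearly in θ while θ is small relative to k, and it shrinks as the number of starting haplotypes increases.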
Also for Hapgen, we modified the location of the starting locus of the Hidden Markov Model (random; 90,770,374; 90,955,029; 91,052,395) and the starting haplotype sample (CEU only, TSI only, CEU & TSI). To explore the effects of changing the effective population size parameter in Hapgen, we used the value most commonly used for populations of European ancestry (11,418) as well as values approximately one tenth and twice as large (1,142 and 22,836, respectively).
For Hapgen and the resampling method, we varied the weight by which we multiplied the variable recombination rate vector (0.1, 0.2, 1, 5, 10). Every element of the recombination rate vector is multiplied by the weight, so a weight above one should increase the level of recombination while a weight below one should decrease it.
As shown in
As displayed in
Even more striking and important, resampling and Hapgen produced little to no bias whereas HapSim appeared to produce a loss in LD across both gene regions as shown in
Although having to approximate a positive definite version of a matrix when the calculated covariance matrix is not positive definite will introduce error, there is no indication that the error produces a consistent bias. Rather, the error will likely increase the variance. Another, more likely explanation for HapSim’s loss in LD is dichotomizing the vectors of normally distributed variables to vectors of binary values to create the haplotypes. It has been previously shown that dichotomizing normally distributed variables into binary variables decreases the correlation between the variables
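The attenuation caused by dichotomization can be illustrated with a small simulation. The Python sketch below draws correlated normal pairs, thresholds them at chosen allele frequencies, and computes the correlation of the resulting binary variables. The function name, the latent correlation of 0.8, and the allele frequencies are arbitrary choices for illustration. For a split at the median, the closed form (2/π)·arcsin(ρ) is a standard result and is computed for comparison.

```python
import numpy as np
from statistics import NormalDist

def dichotomized_correlation(rho, freq_a, freq_b, n, rng):
    """Correlation of two binary variables obtained by thresholding
    bivariate normal draws with latent correlation `rho`."""
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    t_a = NormalDist().inv_cdf(1.0 - freq_a)   # P(z > t_a) = freq_a
    t_b = NormalDist().inv_cdf(1.0 - freq_b)
    x = (z[:, 0] > t_a).astype(float)
    y = (z[:, 1] > t_b).astype(float)
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(2)
rho = 0.8
r_common = dichotomized_correlation(rho, 0.5, 0.5, 200_000, rng)
r_rare = dichotomized_correlation(rho, 0.05, 0.05, 200_000, rng)
median_split_theory = (2.0 / np.pi) * np.arcsin(rho)   # closed form for 0.5 / 0.5
```

In both settings the correlation between the binary variables falls below the latent correlation, and the loss is larger when the thresholds correspond to low allele frequencies.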
Since association analysis often relies on markers’ correlation with the causal SNP, a loss in the LD across the gene region in the simulation replicates, as seen in HapSim, will decrease the power of most methods to detect association. Further, certain methods may be affected more than others depending on the way the method uses or adjusts for the regional correlation. Thus, the relative order of methods being compared may be affected by this reduction in LD. For example, Principal Component Analysis (PCA) transforms the set of genetic markers into a new set of uncorrelated variables (i.e. the principal components). The correlation between the markers will to some degree determine the weight that each marker is given in each principal component. An extreme example would be if all genetic markers were completely independent (i.e. had an LD of zero). Each resulting principal component would then equal one of the genetic markers, with a weight of 1 for that marker and a weight of 0 for all other markers. Another method, LASSO regression, controls for the correlation between variables by further shrinking each variable’s regression estimate. As these methods incorporate the regional correlation differently, we would expect differing LD patterns to have, in turn, different effects on the resulting power and type-I error of these methods.
For Hapgen, there appeared to be an edge effect for the few SNPs on the left side of Gene Region 1 where there was a large decrease in median D’ values of the simulated replicates compared to the original sample (
Finally, across all methods and both gene regions, there was a higher degree of variation and bias seen for D’ than for r2. This was expected because D’ is more sensitive to low MAF and is estimated to be one in the extreme case where one of the haplotypes has an estimated frequency of zero.
As shown in Information S1, Hapgen was more than 10 times faster than the other simulation methods, producing 10 replicates in less than 10 seconds for each gene region. For simulation designs that require tens or hundreds of thousands of replicates, Hapgen would likely require hours while the other methods would likely require days.
Changing the starting sample of haplotypes from both CEU and TSI samples to either CEU only or TSI only samples had very little effect on the variance or bias of the LD distributions although the two smaller samples (CEU only and TSI only) appeared to have slightly more variation compared to the larger sample (both CEU and TSI). This is expected, since every haplotype section is more likely to be drawn from the smaller sample of haplotypes and, thus, the replicate sample more often contains identical haplotype sections, which prevents much decay of LD.
The results seen here were not very sensitive to starting with a different sample of haplotypes. Nonetheless, we recommend using as large a starting sample of haplotypes as possible as long as the samples are representative of the desired population.
Increasing the mutation rate led to a loss in LD for the simulated replicates compared to the original sample (
We recommend taking into consideration the sample size when choosing Hapgen’s mutation rate parameter, θ. θ is equal to the expected number of mutations at each SNP for the sample of haplotypes. As we may expect a larger number of mutations in a larger sample of haplotypes, we may want to increase θ accordingly. In addition, if we have a particular interest in rare variants, we may also want to increase θ to introduce more rare variants.
Hapgen developers recommend using 11418, 17469, and 14269 for samples of European (HapMap CEPH), African (HapMap Yoruban), and Asian (HapMap Japanese and Chinese) descent respectively
The effective population size is the number of mating individuals in a population that will produce the same allele frequency distribution as that observed in the entire population assuming that all individuals in the effective population mate at random and have an equal chance of passing along their genetic information
Using a different locus as the starting point for the Hidden Markov Model in Hapgen did not produce any notable change in variation or bias of the replicates. We recommend using a randomly chosen starting location, as the starting locus should not make any difference when only control or general population haplotypes are simulated.
Changing the recombination rate had a visible effect on the difference in LD between the replicates and the original sample (
Unless the user specifically intends to alter the LD within a gene region, we recommend using the recombination rates estimated by the HapMap project using McVean et al.’s method
As previously stated, since the methods examined use real data they are likely to give a representative picture of the complex underlying LD structure in a gene region. However, it is important to note the sample will only include variation from the particular gene region and population from which the starting sample was gathered. Nonetheless, simulating from a gene region of interest is likely to at least be representative of the particular gene region and is less dependent on simulation parameters used in alternative genetic simulation methods of backwards and forward time.
Recently, many research groups have started to use sequence data to search for genetic associations for variants with low or rare MAFs
We have implemented in parallel three methods for simulating a gene region from a sample of existing haplotypes. We compared these methods using two gene regions that differed in size, LD strength and pattern, and distribution of MAF. Thus, we believe our results and conclusions are applicable to most other gene regions across the genome.
Our goal was to find an adequate simulation method by comparing the LD measures (D’ and r2) and MAF for each of the methods. Producing gene region simulations with a representative LD structure is essential for appropriately comparing genetic association analysis methods, which rely on the LD in the region to find risk signals. Based on our findings, we do not recommend using HapSim, as the simulation program produces samples of haplotypes with lower LD, on average, compared to the original haplotype set, especially for gene regions with moderate to high LD. Further, since HapSim does not incorporate parameters, it is both less subjective and less modifiable. This is an important consideration when simulating gene regions with rare variants, where we may want to introduce additional rare variants by using a mutation rate parameter.
Although our simple resampling method does not introduce bias, the variation introduced across the replicates is quite small and may not provide enough sampling variability between replicates to make a generalizable simulation study. The variability of the resampling method could possibly be increased with further modifications such as completing the resampling process over multiple generations.
Among the gene region simulation methods reviewed here, we recommend using Hapgen. Hapgen provides ample variation between replicates while retaining the LD structure of the gene region and does not introduce an overall loss or gain in LD. In addition, Hapgen is easy to use and provides options for changing additional parameters such as a recombination rate or mutation rate, enabling users to modify the simulation settings to better model a particular population or level of variation in the haplotypes.
We would like to thank Richard H. Myers for his thoughtful guidance and comments on this research and paper.