Linkage disequilibrium and haplotype block patterns in popcorn populations

Linkage disequilibrium (LD) analysis provides information on the evolutionary aspects of populations. Recently, haplotype blocks have been used to increase the power of quantitative trait loci detection in genome-wide association studies and the prediction accuracy of genomic selection. Our objectives were as follows: to compare the degree of LD, LD decay, and LD decay extent in popcorn populations; to characterize the number and length of haplotype blocks in the populations; and to determine whether maize chromosomes also have a pattern of interspaced regions of high and low rates of recombination. We used a biparental population, a synthetic, and a breeding population, genotyped for approximately 75,000 single nucleotide polymorphisms (SNPs). The sample size ranged from 190 to 192 plants. For the whole-genome LD and haplotype block analyses, we assumed a window of 500 kb. To characterize the block and step patterns of LD in the populations, we constructed LD maps by chromosome, defining a cold spot as a chromosome segment including SNPs with the same LDU position. The LD and haplotype block analyses were also performed at the intragenic level, selecting 12 genes related to zein, starch, cellulose, and fatty acid biosynthesis. The populations with the higher and lower frequencies of |D'| values greater than 0.75 were the biparental (65–74%) and the breeding population (26–58%), respectively. There were slight differences between the populations regarding the average distance for SNPs with |D'| values greater than 0.75 (in the range of approximately 207 to 229 kb). The level of LD expressed by the r2 values was low in the populations (0.02, 0.04, and 0.04, on average) but comparable to some non-isolated human populations. The frequency of r2 values greater than 0.75 was lower in the biparental population (0.2–0.5%) and higher in the other populations (0.2–1.6%). The average distance for SNPs with r2 values greater than 0.75 was much higher in the biparental population (approximately 80 to 126 kb). In the other populations, the ranges were approximately 6 to 19 and 6 to 35 kb. The heatmaps for the regions covered by the first 100 SNPs in each chromosome, in each population (1 to 3.3 Mb, approximately), provided evidence that the comparatively few high r2 values (close to 1.0) occurred only for SNPs in close proximity, especially in the synthetic and breeding populations. Due to the reduced number of SNPs in the haplotype blocks (2 to 3) in the populations, it is not expected advantage of a haplotype-based association study as well as genomic selection along generations. The results concerning LD decay (rapid decay after 5–10 kb) and LD decay extent (along up to 300 kb) are in the range observed with maize inbred line panels. The LD maps indicate that maize chromosomes had a pattern of regions of extensive LD interspaced with regions of low LD. However, our simulated LD map provides evidence that this pattern can reflect regions with differences in allele frequencies and LD levels (expressed by |D'|) and not regions with high and low rates of recombination.

Higher LD and slower LD decay was observed in biparental and multiparental maize populations [9]. The number and length of haplotype blocks is also highly variable [4,7].
In several investigations in human populations the structure of LD was described based on LD maps. In an LD map, each SNP has a LD position in LD units (LDUs). One LDU is the distance in kilobases at which disequilibrium (expressed as the Malecot's prediction of association -) declines to approximately 0.37 of its starting value. Assuming unrelated individuals,  equates to the absolute value of D'. The difference between the LD positions of two SNPs divided by the distance in kilobases (d) is the exponential decline of disequilibrium (). LDUs share an inverse relationship with the recombination rate. Thus, regions with extensive disequilibrium have few LDUs (plateaus or blocks) and regions with many LDUs have high levels of recombination rate (steps). Holes in the LD maps are regions where greater marker density is required to provide a full characterization of the block and step patterns of the LD. Holes are identified by a LD map interval of 3, which is an arbitrary value because disequilibrium is indeterminate for d > 3 and of doubtful reliability for d > 2 [10,11].
Because there is no information on LD and structure of haplotype blocks in popcorn populations and no LD maps for maize, the objectives of this study were to compare the degree of LD, the LD decay, the LD decay extent, and the number and length of haplotype blocks in the populations and to elaborate the first LD map for maize, for elucidating if the maize chromosomes also had a pattern of interspaced regions of high and low rates of recombination.

Populations
We used a biparental (F 2 generation) temperate population, a tropical synthetic (Synthetic Biotechnology and Omega Bioservices, Norcross, GA, respectively, using B73 version 4 as the reference genome. After reading the data using the R package vcfR [12], we filtered by missing allele and chromosome. Then, we computed the SNP and genotype call rates and the minor allele frequency (MAF), employing the R package HapEstXXR [13]. After filtering by MAF > 0.01, we imputed based on Beagle [14], using the R package synbreed [15]. The number of SNPs after the   data quality control and imputation were 145,420, 74,773, and 76,055 for the biparental population, Synthetic UFV, and Beija-Flor c4, respectively. To maintain a similar number of SNPs for the populations, we finally performed a random sampling of 75,000 SNPs from the biparental population.

LD and haplotype block analyses
For the Hardy-Weinberg equilibrium analysis by population and chromosome it was adopted the Bonferroni criterion to keep a global level of significance of 1%. To characterize the block and step patterns of LD in the populations we constructed the LD maps by chromosome using the interval method [16]. To evaluate if the LD maps allow inference on the overall degree of LD by chromosome in the populations we also processed a simulated data set, generated with REALbreeding software (available by request). This software has been recently used in studies of population structure [17], QTL mapping [18], genomic selection [19], and genome-wide association studies [20]. We simulated the genotyping of 200 individuals in a population (generation 0) and 200 individuals in the same population after 10 generations of random crossings (generation 10), for 287 SNPs covering 298 cM (density of 1 cM) of a single chromosome.
We then evaluated the degree of LD by chromosome in the populations concerning SNPs separated by up to 500 kb, using a two marker expectation-maximization (EM) algorithm [21]. The LD analyses were based on the D' absolute value (|D'|) and r 2 . The physical distances between SNPs were classified into six intervals of 50 kb (0-50 to 451-500) to study the LD decay and LD decay extent. To define a haplotype block, we adopted the criterion proposed by Gabriel,Schaffner (6).
The haplotypes were estimated using an accelerated EM algorithm with a partition-ligation approach [22], to generate phased haplotypes for population frequency [23].
The LD and haplotype block analyses were also performed at the intragenic level. We choose 12 genes related to zein (one), starch (four), cellulose (five), and fatty acids biosynthesis (two) (S1 Table). With two exceptions, the selected genes had at least five SNPs in each population (maximum of 21). For the intragenic LD decay and LD decay extent analyses we computed the average |D'| and r 2 values defining intervals of 1 kb (0-1 to 10.1-11 kb). All analyses were performed using LDMAP [16] and Haploview [21]. To assess the haplotype blocks information, the haplotype files for each population and chromosome were read by a program (Haplotype blocks summary) developed in REALbasic 2009 by Prof. José Marcelo Soriano Viana.

Results
Versus the physical maize map (available at https://www.maizegdb.org/), the GBS provided a SNP coverage between 99.5 to practically 100.0% of the genomes of all chromosomes, in each population (Table 1)   The LD map from the simulated data evidence that the LD units were lower for the generation with lower LD (generation 10) ( Figure 1). Thus, the LD maps by chromosome reveal that the higher global LD (in LDU) was observed in the synthetic but only for chromosomes 1 to 7 (S3 Figure).    (Table 1) Table). Furthermore, the average distance for SNPs with r 2 values greater than 0.75 are much higher  include two SNPs but the number of haplotype blocks with three or more SNPs is greater in the synthetic and breeding population (S7 Figure). It is important to highlight that the total length of the haplotype blocks represents only 0.01 to 5.13% of the chromosome genomes. The intragenic LD analysis also revealed a higher average |D'| values in the biparental population and the synthetic, compared to the average value observed in the breeding population (0.74 and 0.88 versus 0.67). The biparental population presents an average r 2 value much lower than the average values observed in the other two populations (0.02 versus 0.13 and 0.14) ( Table 4).
Regardless of the population, the maximum intragenic |D'| (1,0) was observed for SNPs separated by up to 10.6 kb while most of the higher intragenic r 2 values (0.7 or greater) were only observed for the closest SNPs (S8 Figure). In regard to the intragenic LD decay, there is evidence of |D'| and r 2 decay in the breeding population and r 2 decay in the synthetic (Figure 3). Concerning the intragenic haplotype blocks structure, the general evidence is of a single block of variable size (0.03 to 8.72 kb) with two SNPs (Table 5). Genes Zm00001d018033 and Zm00001d041972 show population differences regarding block size and number of SNPs.   [25,26]. In general, both |D'| and r 2 have been used [26,27] and because their high level of LD, isolated populations have been recommended for association studies [28]. The statistic r 2 is the most relevant for association mapping because it has a simple inverse relationship with the sample size required to detect association [1]. The use of LD maps and two measures of LD for comparing the popcorn populations provided some contrasting results, but the general evidence is that the synthetic is the population with higher LD. As expected, the lower average |D'| value in the breeding population reflects its recombination history. The synthetic and the biparental population Hyland (25) used LD decay as the distance over which the average LD decreases to half of its maximum value (half-length). They defined LD extent as the distance over which the average LD declines to an asymptotic value. Anderson, Mahan (9) used LD decay as the distance over which the average r 2 dropped below 0.8, and LD extent as the distance over which the average r 2 fell below 0.2. Concerning the LD decay, our results showed differences between LD measures and populations. There were slightly differences between chromosomes, but the higher r 2 decay occurred after 5-10 kb (36 to 73%). Yan, Shah (29)  If there is higher LD between QTLs and haplotypes than with individual SNPs, haplotype blocks can provide substantial statistical power in association studies [6] and increased accuracy of genomic prediction of complex traits [36]. Surprisingly, our results evidenced that the number and length of the haplotype blocks and the number of SNPs per haplotype block were proportional to the average r 2 . The criterion of Gabriel, Schaffner (6)  observed in human populations [6,27]. However, the size of each block varied dramatically in the study of Gabriel,Schaffner (6), from less than one to 173 kb.
Concerning the low intragenic LD and the minimum size of the haplotype blocks observed in the three populations, we believe that the lower LD for the biparental population is due to crossing two genetically similar high-quality inbred lines. Because there is no information on the LD and haplotype block patterns in the base populations Viçosa and Beija-Flor, we cannot infer that the higher average intragenic r 2 values observed in the synthetic and breeding population (for 11 of the 12 genes) are due to selection for quality. In conclusion, the level of LD expressed by the r 2 values in the three popcorn populations with different genetic structures -a biparental population, a synthetic, and a breeding population -is surprisingly low, but comparable to some non-isolated human populations. This does not imply that these populations cannot be used for GWAS because there is a fraction of high r 2 values for SNPs separated by less than 5 kb. The populations are also not excluded for genomic selection because the most important factor affecting this selection process is the relatedness between individuals in the training and validation sets. However, we do not expect a significant advantage from haplotype-based GWAS and genomic selection along generations due to the reduced number of SNPs in the haplotype blocks (2 to 3). The results on LD decay (rapid decay after 5-10 kb) and LD decay extent (along up to 300 kb) are in the range observed with maize inbred line panels. Our most important result is that, similar to the human chromosomes, maize (popcorn is also Zea mays, but ssp. everta) chromosomes also have a pattern of regions with extensive LD (plateaus or cold spots), interspaced with regions of high recombination rate (steps or hot spots). It should be highlighted, however, that our simple simulated LD map provides evidence that this pattern can reflect regions with differences in allele frequencies and LD level (expressed by D') and not regions with high and low rates of recombination as evidenced by Jeffreys, Holloway (41), since the simulation process assumes a rate of recombination that is proportional to the distance in cM.

Acknowledgments
We thank the National Council for Scientific and Technological Development