Study of whole genome linkage disequilibrium patterns of Iranian water buffalo breeds using the Axiom Buffalo Genotyping 90K Array

Accuracy of genome-wide association studies, and the successful implementation of genomic selection depends on the level of linkage disequilibrium (LD) across the genome and also the persistence of LD phase between populations. In the present study LD between adjacent SNPs and LD decay between SNPs was calculated in three Iranian water buffalo populations. Persistence of LD phase was evaluated across these populations and effective population size (Ne) was estimated from corrected r2 information. A set of 404 individuals from three Iranian buffalo populations were genotyped with the Axiom Buffalo Genotyping 90K Array. Average r2 and |D'| between adjacent SNP pairs across all chromosomes was 0.27 and 0.66 for AZI, 0.29 and 0.68 for KHU, and 0.32 and 0.72 for MAZ. The LD between the SNPs decreased with increasing physical distance from 100Kb to 1Mb between markers, from 0.234 to 0.018 for AZI, 0.254 to 0.034 for KHU, and 0.297 to 0.119 for MAZ, respectively. These results indicate that a density of 90K SNP is sufficient for genomic analyses relying on long range LD (e.g. GWAS and genomic selection). The persistence of LD phase decreased with increasing marker distances across all the populations, but remained above 0.8 for AZI and KHU for marker distances up to 100Kb. For multi-breed genomic evaluation, the 90K SNP panel is suitable for AZI and KHU buffalo breeds. Estimated effective population sizes for AZI, KHU and MAZ were 477, 212 and 32, respectively, for recent generations. The estimated effective population sizes indicate that the MAZ is at risk and requires careful management.


Introduction
The water buffalo (B. bubalis) is an important livestock resource in many regions of the world, particularly in tropical and subtropical countries. Water buffalo produce milk and meat, and are used as draught animals in developing countries [1,2]. There are two types of domestic a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Genotyping and data quality control Genomic DNA was extracted from blood by the modified salting out method [37] and from hair samples as described by Alberts et al. [38]. The set of 412 water buffalo samples from AZI (n = 262), KHU (n = 123) and MAZ (n = 27) were genotyped using the Axiom Buffalo Genotyping 90K Array. Genotyping was carried out by Affymetrix (Sana Clara, Ca, USA). SNP genotypes were extracted from raw data using the AffyPipe workflow [39]. The genotypes for each population were filtered for quality separately, using PLINK software [40]. Single nucleotide polymorphisms (SNPs) with minor allele frequencies (MAF) below 0.05, SNPs with call rate below 0.05 or which were not in Hardy-Weinberg Equilibrium (P-value >10e -6 ) were removed. Individuals with more than 5% missing genotypes, were excluded from data set. After quality control for each population, genotypes from the three breeds were merged into a single file, and SNPs that were common across all three populations were retained for further analyses. The Axiom Buffalo Genotyping 90K Array was designed based on alignment of buffalo sequences to the bovine UMD3.1 genome assembly [41], therefore this bovine reference genome sequence and relative bovine positions were used in the present study.

Measures of linkage disequilibrium
The LD between two SNPs was evaluated using the statistics r 2 [20] and the absolute D-value (|D'|) [42], which were calculated as follows: Where SNP pairs had alleles A and a at the first locus and B and b at the second locus, freq A, freq a, freq B and freq.b denote frequencies of alleles A, a, B, and b, respectively, and freq AB denote frequency of haplotype AB in the population. The r 2 and |D'| were calculated between adjacent markers and SNP pairs with physical distances from 0 to 15 Mb for each population, using SnppLD software (Sargolzaei M, University of Guelph, Canada) [25]. The average r 2 and |D'| of adjacent SNPs were estimated for each chromosome. SNP pairs were grouped by their pairwise physical distance, based on their position in the UMD3.1 reference cattle sequence, into intervals of 100 Kb (from 0 to 15 Mb). Average r 2 for SNP pairs in each interval was estimated [32]. The consistency of LD between populations was measured by the correlation of the root of r 2 of adjacent marker pairs on each chromosome [32]. The consistency of LD phase between two populations was measured by persistence of phase. The correlation of LD between two populations A and B for a common set of markers was calculated as [32]: Where r ij is the correlation of phase between r ij(A) in population A and r ij (B) in population B, S A and S B are the standard deviation of r ij(A) and r ij(B) respectively, and r A and r B are the average r ij across all SNP i and j within the common set of markers.

Estimation of historical effective population size
The historical effective population size for AZI, KHU and MAZ was calculated for t generations in the past as follows: where Ne is the effective population size, c is the genetic distance between the SNPs in Morgans. The physical distances between SNPs were converted to genetic distances using the approximation 1 cM~1 Mb for all the chromosomes [19,44]. r 2 is the average corrected r 2 value at a given distance. A sample size correction was carried out for all of the computed r 2 values using the following equation where, n is the number of haplotypes in the sample. It should be noted, the estimated Ne value is infinite at r 2 = 0 and zero at r 2 = 1. Therefore. Only values of r 2 between 0.01 and 0.99 were used to estimate Ne.
The generation of Ne for a given distance was estimated by: where t was calculated for the corresponding genetic distance (c) in intervals of 100 Kb (from 0 to 15 Mb). The historical Ne was investigated at 150 time points from recent to 500 generations in the past.

SNP frequency and distribution
After quality control for each population, a set of 63824, 62667 and 58588 SNPs remained for AZI, KHU and MAZ breeds, respectively. SNPs that were in common across all breeds, were merged in single file. The final data set comprised 57212 SNPs from 396 individuals (253 AZI and 118 KHU and 25 MAZ) which was used for further analyses. After removing SNP with a MAF less than 0.05, the mean MAF observed in the Iranian populations was 0.333, 0.321 and 0.299 for AZI, KHU and MAZ, respectively, for the common SNP set. As the buffalo genome available is highly fragmented SNPs were mapped to the bovine sequence (version Btau UMD3.1). A summary of SNP numbers for each bovine chromosome with MAF in each population is shown in S1 Table. Distribution of SNPs with the distance between adjacent SNPs as mapped to the bovine genome sequence (version Btau UMD3.1), is shown in S2 Table. Distances between 93% of adjacent SNPs were less than 100 Kb, while the distances between 60% of adjacent SNPs were between 20-40Kb for all of the three breeds (S2 Table.).
The average estimated physical distance between adjacent markers in the common set was 46 Kb, and covered 2.65 Gb of total genome length ( Table 1). The highest number of polymorphic SNPs was on BTA1 (N = 3549) and the lowest on BTA 27 (N = 1004). The longest interval between polymorphic SNP was 2461.72 Kb on BTA10 and the shortest was 0.008 Kb on BTA15.

Linkage disequilibrium
Linkage disequilibrium and consistency of LD between adjacent SNP. LD was calculated separately for each of the three Iranian buffalo breeds using r 2 and |D'| statistics. Average r 2 and |D'| between adjacent SNP pairs were 0.27 and 0.66 for AZI, 0.29 and 0.68 for KHU, and 0.32 and 0.72 for MAZ (see Table 1). The proportion of r 2 values higher than 0.2 and 0.3 were 44.7 and 34.8% for AZI, 46.7 and 36% for KHU, 50.8 and 40.1% for MAZ, respectively (S3 Table). The correlation of LD between the AZI and KHU breeds was 0.83 (ranging from 0.78 to 0.88 across all chromosomes) which was higher than the correlation between AZI and KHU (0.61), and KHU and MAZ (0.57) ( Table 1

and S2 Fig).
Linkage disequilibrium decay and persistence of LD phase. The average decay of LD over physical distance was calculated by chromosome (S3-S5 Figs), and the overall genome LD was also calculated for each breed for an averaged interval of 100 Kbp (Fig 1 and S4 Table). Comparing the different breeds, the LD was highest for MAZ and lowest for AZI for all SNP distances. The patterns of LD decay was similar for AZI and KHU but differed in MAZ. As expected, the persistence of LD phase decreased with increasing physical distance between markers for all breeds (S5 Table). This decrease was rapid for distances shorter than 300 Kb, while the reduction in LD for distances of 1Mb to 15Mb was very small. At all intervals, the highest correlation was between AZI and KHU and the lowest was between KHU and MAZ. For distances below 100Kb (with an average of 56.9 Kb), the correlation varied from 0.82 for AZI and KHU to 0.54 for KHU and MAZ. While for distances greater than 1Mb the correlations varied from 0.36 for AZI and KHU to 0.10 for KHU and MAZ (Fig 2 and S5 Table). At all distances between SNP, average LD was highest for MAZ, intermediate for KHU and lowest for AZI.

Effective Population Size (Ne) based on genomic data
In the absence of pedigree information, analysis of LD can be used to estimate the effective population size, Ne [12]. LD between SNPs that are close together reveals historic events, while LD between more distant SNPs can be used to explore more recent population history. Ne in the recent generations was estimated to be 477, 212 and 32 for AZI, KHU and MAZ, respectively. While for 500 generation ago, Ne was estimated as 826, 748 and 632 for AZI, KHU and MAZ, respectively (Fig 3 and S6 Table). However, changes were not linear, and the intensity and direction of changes differed over time for each population. The reduction in Ne in AZI and KHU has been rapid over the last 20 generation. Ne for AZI seems to have increased between 100-40 generation ago.

SNP frequency and distribution
After quality control, at total of 57,212 common SNPs remained across all chromosomes for the three Iranian buffalo breeds, which is comparable with the number of polymorphic SNP found in Brazilian dairy buffaloes [46] but lower than 67,580 polymorphic SNPs, seen in Italian Mediterranean buffalos [47]. As would be expected, the Axiom Buffalo Genotyping 90K Array had more polymorphic SNP in the three Iranian buffalo populations (65-75% polymor-populations with lower diversity have higher average LD between adjacent loci. It should be noted that the level of LD differs across chromosomes e.g. Nelore cattle genotyped using a high-density bovine SNP marker panel gave a wide range of LD estimates across different chromosome regions, ranging from 0.17 to 0.24 for r 2 and from 0.55 to 0.72 for |D'| [51]. These differences can be attributed to variable recombination rates between and within chromosomes, heterozygosity, genetic drift and effects of selection [9]. In designing marker panels for particular populations the density of loci could be varied depending on the level of LD at specific regions of the genome to optimize the information recovered. The consistency of LD is high between the AZI and KHU breeds, indicating the close genetic relation of these breeds, whereas the comparison of AZI with MAZ, and KHU with MAZ show lower preservation of LD. This is consistent with Colli et al [52] who reported the lowest differentiation between AZI and KHU (0.021), moderate for AZI and MAZ (0.038), and highest for KHU and MAZ (0.045).
Linkage disequilibrium decay and persistence of LD phase. The level of LD between markers is important in the design and success of genome wide association studies and genomic selection [53]: genomic breeding estimates are more accurate when the mean r 2 between adjacent SNPs is higher, as the makers are more likely to predict the alleles at adjacent QTLs. Marker spacing giving an r 2 of at least 0.2 is recommended to estimate genomic breeding values [27,53,54] while a r 2 of 0.3 and above has been suggested for genome wide association studies and QTL mapping [27]. The average r 2 values between adjacent markers obtained from using the Axiom Buffalo Genotyping 90K Array were between 0.27 and 0.327 for the populations studied here, which is around the threshold of 0.3. One of the factors impacting on the accuracy of genomic breeding estimates across-populations is the persistence of LD phase, which reflects the genetic relationship between the populations [55]. The maintenance of LD phase for adjacent SNPs and persistence LD phase between AZI and KHU was higher than AZI and MAZ, and KHU and MAZ at all distances. These results support the AZI and KHU being genetically closer than AZI and MAZ, and KHU and MAZ. The persistence of marker phase between populations decreases as the divergence between the populations increases, and hence a higher marker density is required for more divergent breeds [56]. The present study suggests that the AZI and KHU breeds could be treated as a single population for genomic selection when using the Axiom Buffalo Genotyping 90K Array.
Effective population size based on genomic data. Effective populations size (Ne) is related to the history of a population [57] and is a key parameter used in conservation biology [13]. The FAO (1992) reports that with effective population size of 50, the loss of genetic diversity over 10 generations is approximately 10% [58]. LD-based methods with markers spaced at 1Mb tend to overestimate Ne for more than 50 generations ago, while estimates for recent generations are more accurate [12]. Ne estimates presented here are based on corrected r 2 values, which are less sensitive to allele frequency and a small sample size than |D'| [6]. In the present study, Ne was estimated from SNP distance from recent generation (15Mb) to 500 generation ago (100Kb). The results suggest that Ne has been lower in the recent generations compared with more ancient generations for all three breeds, and that the effective population size is currently highest for AZI and lowest for MAZ. The first buffalo were imported to Iran from India in 2000-2500 BC [59]. The higher Ne estimated for the ancient populations may reflect the diversity in the original Iranian population before it separated into breeds [33]. Artificial insemination has not been widely used in the AZI and KHU populations, which is likely to have contributed to the relative high Ne values for both populations, which are above the threshold (Ne = 100) to ensure that a population is viable for long-term survival [60]. The MAZ population has been geographically isolated and managed in a protected natural area, both of which are contributory factors to the low estimated Ne, which is below 100. Therefore, the MAZ population can be considered as endangered and it is essential to monitor the population and develop a breeding program to ensure viability and avoid inbreeding.

Conclusions
The average distance between adjacent SNPs in the current Axiom Buffalo Genotyping 90K Array is 30 Kb, based on alignment with the bovine genome. After filtrating for quality and MAF the between-marker distance for the 57212 common SNPs was 46 Kb. The level of LD in Iranian buffalo using this set is above that recommended for genome wide association studies (r 2 > 0.3) or to estimate genomic breeding values (r 2 > 0.2). The calculated Ne from LD decay indicates that the AZI and KHU have a sufficiently large effective population size to be sustainable, while the MAZ has a low effective population and needs careful management to ensure its survival.