Estimation of linkage disequilibrium levels and allele frequency distribution in crossbred Vrindavani cattle using 50K SNP data

The objective of this study was to calculate the extent and decay of linkage disequilibrium (LD) in 96 crossbred Vrindavani cattle genotyped with Bovine SNP50K Bead Chip. After filtering, 43,821 SNPs were retained for final analysis, across 2500.3 Mb of autosome. A significant percentage of SNPs was having minor allele frequency of less than 0.20. The extent of LD between autosomal SNPs up to 10 Mb apart across the genome was measured using r2 statistic. The mean r2 value was 0.43, if pairwise distance of marker was less than10 kb and it decreased further to 0.21 for 25–50 kb markers distance. Further, the effect of minor allele frequency and sample size on LD estimate was investigated. The LD value decreased with the increase in inter-marker distance, and increased with the increase of minor allelic frequency. The estimated inbreeding coefficient and effective population size were 0.04, and 46 for present generation, which indicated small and unstable population of Vrindavani cattle. These findings suggested that a denser or breed specific SNP panel would be required to cover all genome of Vrindavani cattle for genome wide association studies (GWAS).


Introduction
Linkage disequilibrium (LD), defined as the degree of non-random association of allele between loci or correlation between genotypes of markers, is an important concept in the understanding of gene mapping and its application in genetic studies. The LD between markers provides insight in exploring the level of diversity between different breeds, inferring the frequency of recombination events, investigating the change in effective population size across generations, and identifying genomic regions to improve economically important traits [1][2][3]. The pattern of LD is mainly determined by the physical distance between markers, although several other demographic and evolutionary factors including population stratification, inbreeding, effective population size, genetic bottleneck, genetic drift, migration, mutation, selection, and recombination rate may also influence the extent and pattern of LD [4][5][6].
The declining cost of high throughput genotyping provides a new opportunity to explore genome-scale studies, especially Genome Wide Association Studies (GWAS) and Genomic Selection (GS). The accuracy and precision of the genomic studies rely on the level and pattern of linkage disequilibrium (LD) between markers across genome [7]. The LD between markers has been studied in the genome of several taurine and indicine cattle breeds [2,8]. Results from their study revealed that moderate LD (r 2 = 0.20) were extended to <100 kb, which indicated that 50K SNPs captures most of the LD information necessary for GWAS in taurine breeds [1,2]. However, the extent for of LD (r 2 = 0.20 to 0.34) was observed lower (<30kb) within indicine cattle, which indicated use of High density chip for genomic studies in these cattle [8]. The reduced extent of LD detected in indicine cattle breeds can be attributed to the ascertainment bias of the SNP Chips. Vrindavani is four breed synthetic crossbred cattle strain developed in India by crossing between Bos taurus cattle (Holstein, Brown Swiss, Jersey) and Bos indicus cattle (Hariana) [9]. Recently, the ROH fragments, haplotype blocks, inbreeding coefficient and effective population size in Vrindavani cattle has been evaluated [10]. Further, the admixture pattern and signatures of selection in Vrindavani was identified using 50K SNP chip [9]. However, the extent and pattern of genome-wide LD using 50K SNP Chip in Vrindavani cattle remains unexplored till date. This study was indented to build on the previous work of genetic characterization [9], and gives insight to estimate marker density for genomic studies in Vrindavani cattle. The objective of present study was to investigate allelic frequency distribution, extent of linkage disequilibrium level (r 2 ) and estimate the effective population size in crossbred Vrindavani cattle using a 50K Bovine chip.

Animal sample and genotype quality control
A total of 96 female Vrindavani crossbred cattle were selected from ICAR-Indian Veterinary Research Institute (IVRI), Izatnagar, Bareilly (Uttar Pradesh), India. The experiment was approved by the Institutional Animal Ethics Committee (IAEC) of ICAR-Indian Veterinary Research Institute (IVRI). DNA was extracted from blood samples and genotypic data were generated using the Bovine 50K SNP Bead Chip v2 (Agri Genome Labs Pvt. Ltd., India).
The HiScan images and genotypes were analysed using Genome Studio software (Illumina). The genotyping rate was 0.99, and a total of 53,218 SNPs were identified across the genome at a mean distance of 37.4 kb between markers. Quality control (QC) was performed using the PLINK v1.9 software [11]. The SNPs with call rate of less than 95%, MAF of less than 0.02 and HWE less than 10E x 10 −05 were removed. Only SNPs that were uniquely mapped to autosomes were included in the analysis. A total of 43,821 markers were retained for further analysis. The length of chromosome (Mb), number of markers on individual autosome and the mean distance between markers were estimated using R software [12].

Inbreeding coefficient and effective population size
Inbreeding coefficient (F) was calculated as a function of the expected and observed homozygote difference using PLINK v1.07 [11]. The equation is given as follow: where F i is the estimated inbreeding coefficient of the i ih animal; O i is the number of observed homozygous loci in the i ih animal, E i is the number of expected homozygous loci and L i is the number of genotyped autosomal loci. The Effective population size (Ne) was estimated using the SNeP tool [13], based on the relationship between N e , linkage disequilibrium represented by r 2 , and recombination rate (c). This is given by: where, N T is the effective population size t generations ago, c t is the recombination rate, r 2 adj is the LD value adjusted for sample size, and α is he correction for occurrence of mutation.

Linkage disequilibrium
The square of the correlation coefficient between two loci (r 2 ) was used to evaluate the extent of LD measure, as r 2 was considered is considered as more robust and not influenced by change in allele frequency and population size [14]. The r 2 estimate is an important estimator to measure loci required for genome wide association study and QTL mapping [8]. The equation for LD (r 2 ) estimate is represented as follows: where, P A1 , P A2 , P B1 and P B2 are frequencies of each allele at loci A and B, and P 11 , P 12 , P 21 and P 22 are the frequencies of haplotype A 1 B 1 , A 1 B 2 , A 2 B 1 , A 2 B 2 , respectively. The r 2 between SNP pairs with physical distances between 0 to 10 Mb for all autosome was used to estimate the extent and pattern of LD using the default command of PLINK v1.07 using command-ld-snp-list mysnplist-ld-window-kb 10000 -ld-window 99999 -ld-window r2 0 [11]. The decay of LD value, were then calculated for distances of pair-wise SNPs were interval into eight categories (0-10 kb, 10-25 kb, 25-50 kb, 50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb and 5-10 Mb) along the first 10 Mb of each autosome. LD was also estimated as the mean r 2 value for each autosomal chromosome according to these intervals and plotted against distance range.
Further the effect of minimum allelic frequency (MAF) and sample size on the extent of LD was investigated. The LD was calculated as previously described with four different MAF thresholds (0.05, 0.10, 0.15 and 0.2) for physical distance of 10 Mb of genome. In order to determine the effect of sample size on LD(r 2 ), five different subset of population (N = 10, 25, 50, 75 and, 90) were randomly selected from the total population and the extent of LD was explored for each sub-population. To visualize the effects of MAF and sample size on the genome-wide LD (r 2 ) levels was also plotted accordingly.

Marker statistics
After quality control, a total 43,821 autosomal SNPs were available for downstream analysis. About 6734 SNPs having MAF less than 0.02 were removed, and 723 SNPs were removed on the basis Hardy-Weinberg equilibrium and call rate threshold criteria. The SNPs retained after quality control spanned across total of 2500.30 Mb of region of Vrindavani genome, with a mean chromosome length of 86.21Mb. The BTA1(158.03Mb) was found to be longest, while the BTA25 (42.80Mb) was shortest across the genome. The distribution of SNPs was proportional to length of chromosome; the highest number of SNPs were on BTA1 (2798), and the lowest on BTA25 (792). The average distance between SNPs was 57.24 kb, where the longest distance between SNPs was 3.26 Mb on BTA10 and shortest distance was 0.01 kb on BTA15. The descriptive statistical results of the SNP marker for each autosome are shown in Table 1.

Minor allele frequency
The mean minor allele frequency (MAF) across all autosomes was 0.26 (Table 1). After filtering, 36% of the SNPs (15,805) the MAF was lower than 0.20. The proportion of SNPs on each chromosome at different MAF threshold is presented in Fig 1. While the MAF was lower than 0.10 (0.2-0.10), the chromosome BTA 4, BTA16, BTA21 and BTA24 had a higher SNP proportion, whereas BTA6, BTA7 and BTA27 presented a lower proportion of SNPs.

Effective population size and inbreeding coefficient
The effective population size was estimated over the past 50 generations from the average r 2 ( Table 2). The estimate of Ne decreased from 245 (50 generations ago) to 46 (5 generations

Extent of LD across the genome
The level of LD among SNPs was estimated by r 2 method. The free recombination generally occurs at a physical distance of more than 10 Mb [15] thus, the range between SNPs was set at 0-10 Mb to estimate LD between SNP markers. In order to consider all possible pairs of SNPs with a distance of less than or equal to 10 Mb, combination pairs of 7,378,918 SNPs were obtained across autosomes to estimate LD. The mean values of r 2 of markers for distance of 10 Mb was 0.07. The mean LD (r 2 ) for physical distance intervals is presented in Table 3. The autosomes BTA16, BTA24, and BTA 20 had higher level of r 2 . For the physical distance of less than 10 kb, the average r 2 was 0.43 and it decreased to 0.21 for the distances of 25 to 50 kb. Results revealed the decline in mean r 2 value with the increase in physical distance between markers. The extent of LD (r 2 ) was found different across each autosomes according to the physical distance. The mean r 2 was estimated for markers physical distance separated by intervals of 0-10 kb, 10-25 kb, 50-100kb, 100-500kb, 0.5-1Mb, 1-5Mb, and 5-10 Mb per autosome ( Table 4). The mean r 2 revealed smaller reduction across different autosomes, when the physical distance between markers exceeded 100 kb (Fig 2).

MAF and LD estimates
The effect of MAF on LD extent was estimated using four different threshold level of MAF: >0.05, >0.10, >0.15 and 0.20 for physical distances up to 100kb between the SNP pairs ( Fig  3). Results showed significant effect of MAF threshold on the average r 2 , particularly for less distances. The r 2 between SNPs was smaller when the threshold of MAF was low (0.05), and increases significantly at higher MAF threshold. The value of mean r 2 ranged from 0.45 to 0.06 for MAF > 0.05, 0.49 to 0.07 for MAF> 0.10, 0.53 to 0.07 for MAF> 0.15, and 0.57 to 0.08 for MAF> 0.20, respectively.

Sample size and LD estimates
In this investigation, sample size (n) of 10, 25, 50, 75, and 90 were selected at random from the total population to assess the effect of sample size on LD (r 2 ) extent (Fig 4). The average r 2 was greater for smaller sample size, and this trend was more evident when physical distance intervals were large (more than 50kb). These findings indicated that the size of samples should be at least 50 animals for precise estimation of r 2 .

Discussion
After quality control, a total of 43,821 autosomal SNPs retained, which were distributed across 2500.30 Mb region of Vrindavani genome. The average MAF value was 0.26, which is comparable with the value reported for different cattle breeds [8]. The distribution of MAF directly influenced the LD, as low MAF could correspond to greater difference in the frequency of allelic pairs, resulted in underestimation of LD [16]. Therefore, four different level of MAF were selected to estimate the effect of MAF on LD extent. The result showed that the average LD (r 2 ) between SNPs was high, when the threshold of MAF was high (>0.20), particularly at smaller distance. Similar effect of allele frequency on r 2 estimate was also observed in other cattle breeds [17,18], suggesting that the 50K chip can be used for genetic studies in crossbred Vrindavani cattle. The LD between markers was estimated using r 2 method as it is less influenced by low allele frequency [5] and small sample size [15]. At inter-marker distance of 10kb, the mean r 2 value was 0.43 which was lower as compared to the LD estimates previously reported for taurine breed, Angus (0.46) and Hereford (0.49), but higher than the value reported for indicine breeds of Brahman (0.25), Nellore (0.27) cattle [2,3]. The mean LD between adjacent value was estimated for different intervals across autosomes. The mean r 2 for was highest for BTA16 and BTA24 and lowest BTA18 and BTA15. However, no relationship was observed between chromosomal size and r 2 estimate [19]. To explore the dependency of sample size on the extent of LD (r 2 ), different sample size was used to calculate r 2 . Khatkar et al. [18] has reported that small sample size leads to overestimation of LD. In the current study, a minimal sample size of 55 animal had no influence on r 2 , which is consistent with the previous study [19].
To analyse the LD decay, the physical distance between markers were classified in different interval. The results showed rapid decay of r 2 at distance more than 100kb for all autosomes. Moreover, the LD estimate decreased from 0.43 to 0.21 at 10kb and 50kb marker distance, respectively. Similar level of LD estimates at 40-60kb have been reported for taurine cattle breeds [18,20]. However, low level of LD estimate had previously been reported in indicine cattle, which may be attributed to ascertainment bias of SNPs in the current chip [2].  The reliability and accuracy of genomic association studies and genomic prediction is dependent on the amount of LD between marker. Mckay et al. [1] reported that a total of 28,700 (2.87GB/100kb at r 2 = 0.2, where 2.87GB is the bovine genome size) SNPs were required to capture LD information for genomic studies. In our study, the LD (r 2 ) estimate was 0.21 at distance of 50 kb between SNPs. This indicates that a minimum of 57,400 (2.87 GB/ 50kb at r 2 = 0.2) SNPs is required for genomic association studies in Vrindavani cattle. Previous studies have also indicated the requirement of large number of SNPs to cover the genome for genomic studies when analysing data from indicine, crossbred and multi-breed populations [2,8]. Our data suggest that a higher density SNP array were required for reliable genome-wide association mapping and genomic prediction in crossbred Vrindavani population.
For better understanding of the population structure, the inbreeding coefficient and the effective population size were estimated. The inbreeding coefficient was 0.04, which was comparable to previously reported 3% and 6% inbreeding coefficient in Vrindavani population using ROH and F HOM analysis [10]. The current study observed decrease in effective population size from 5 generations ago to 50 generations ago. Further, the estimate of N e was 40 in Vrindavani, and a decreasing trend of N e was reported for over five generations. The FAO has recommended N e should be at least 50 for maintaining proper breeding plan. In the current study, 5 generations ago the estimated N e (46), was below the recommended number (50) of FAO. The low N e estimate may be attributed to the small well protected population size and fluctuating breed ratio in the admixed Vrindavani population [10].

Conclusion
This study reported the magnitude of LD between markers crossbred Vrindavani cattle using 50k Bovine Chip. The estimated r 2 > 0.2 extended up to 50 kb indicates requirement of high density SNP panel for precise and accurate estimation of whole genome association studies in Vrindavani cattle. Furthermore, declining trend of N e estimates was observed in Vrindavani population indicates the requirement of breeding plan that could maintain the sufficiently large Ne. However, further confirmatory investigating for the extent of LD and effective population is required in larger population using high density array of SNPs.