Genome Scan for Selection in Structured Layer Chicken Populations Exploiting Linkage Disequilibrium Information

An increasing interest is being placed in the detection of genes, or genomic regions, that have been targeted by selection because identifying signatures of selection can lead to a better understanding of genotype-phenotype relationships. A common strategy for the detection of selection signatures is to compare samples from distinct populations and to search for genomic regions with outstanding genetic differentiation. The aim of this study was to detect selective signatures in layer chicken populations using a recently proposed approach, hapFLK, which exploits linkage disequilibrium information while accounting appropriately for the hierarchical structure of populations. We performed the analysis on 70 individuals from three commercial layer breeds (White Leghorn, White Rock and Rhode Island Red), genotyped for approximately 1 million SNPs. We found a total of 41 and 107 regions with outstanding differentiation or similarity using hapFLK and its single SNP counterpart FLK respectively. Annotation of selection signature regions revealed various genes and QTL corresponding to productions traits, for which layer breeds were selected. A number of the detected genes were associated with growth and carcass traits, including IGF-1R, AGRP and STAT5B. We also annotated an interesting gene associated with the dark brown feather color mutational phenotype in chickens (SOX10). We compared FST, FLK and hapFLK and demonstrated that exploiting linkage disequilibrium information and accounting for hierarchical population structure decreased the false detection rate.


Introduction
A local reduction of genetic variation up-and downstream of the beneficial mutation is caused by the rapid fixation of a beneficial mutation, leaving special patterns of DNA behind, which is commonly referred to as a "selective sweep" [1].The study of such signatures of selection can production traits (in the 20th century) [22] fosters the interest in detecting selective sweeps in chicken using statistical methods that account for the strong hierarchical structure between these populations.Therefore, this dataset offers an interesting opportunity to evaluate methods that account for population structure in a setting characterized by a strong past selection pressure, high genetic drift and clear population structure, which has never been done before.
In this study, FLK [12] and hapFLK [15] statistics were applied on the same data as in our previous study on selection signatures in commercial chicken [20], allowing a comparison between F ST , FLK and hapFLK.In contrast to our previous work, the approaches used in the current study have the potential to identify genomic regions which have been selected more recently (e.g.soft sweeps) and are associated with specific layer traits.

Animals, Data collection and filtering
Two sets of samples-commercial egg layers and wild chicken (coded respectively LAY and ANC)-were used in this study.The commercial individuals from Lohmann Tierzucht GmbH originated from three different breeds.One commercial white egg layer breed based on White Leghorn (WL), with three separate lines, and the other two brown egg layer breeds based on White Rock (WR) and Rhode Island Red (RIR), respectively, each with two separate lines per breed.In each of these seven lines, ten individuals were sampled and genotyped.The wild chickens, comprising Red Jungle fowl (Cochin-Chinese) (G.g. gallus) and Red Jungle fowl (Burmese) (G.g. spadiceus) were sampled within the AVIANDIV project.A more detailed list of breeds is presented in Table 1.The ANC group consisted of two subspecies of Gallus gallus that are believed to stem in straight line from wild ancestors of domestic chickens.Data is publicly available (S1 Dataset).
Genotyping was done with three Affymetrix 600K SNP arrays.Overlapping SNPs between the three 600K SNP arrays were removed by the data provider and a total of 1,139,073 SNPs remained.For this study we included only the SNPs that were located on autosomal chromosomes , SNPs that were located on sex chromosomes and linkage groups were removed (62,337 were removed).SNPs with at least one missing value and SNPs with minor allele frequencies lower than 5% (172,344 SNPs) were removed in order to avoid dealing with genotyping errors; this approach was suggested by the data provider.A total of 904,392 SNPs remained after filtering.The entire filtering process was done using the PLINK software (http://pngu.mgh.harvard.edu/purcell/plink/)[23].

Population structure analysis
Using Reynolds' genetic distances [24], a phylogenetic tree was constructed to retrieve the structure of the studied samples.

FLK and hapFLK calculation
To identify regions under selection, FLK and hapFLK were calculated in all LAY breeds, using ANC individuals for rooting the population tree.FLK calculates variation of the inbreeding coefficient and incorporate hierarchical structure by using a population kinship matrix (for details see Bonhomme et al. (2010) [12]).The same matrix is used in hapFLK, but the statistic is computed from haplotype frequencies rather than SNP allele frequencies.Here, the haplotypes considered are in fact latent states extracted from the multipoint linkage disequilibrium model of Scheet and Stephens [25] (for details read Fariello et al. (2013) [15]).To determine the number of underlying latent states we used the fastPHASE [25] cross validation procedure, which indicated that 5 or 10 haplotype clusters were adequate.We found that using either 5 or 10 haplotype clusters gave nearly identical results and therefore present those obtained assuming 5 haplotype clusters.

Assigning signatures of selection to specific population groups
When using differentiation-based approaches, it is sometimes difficult to pinpoint the population(s) that have been the target of selection.Fariello et al. (2013) [15] proposed to decompose the hapFLK statistic by projecting it on principal components (PC) of the population kinship matrix to identify which part of the population tree exhibits an outlying differentiation in a particular genomic region.Here, we employed this approach to look for selection signatures that affected either (i) the whole population set (LAY), (ii) white layer populations or (iii) brown layer populations.For (i) we used the hapFLK statistic, for (ii) and (iii) we considered the projection of the statistic on the subtree corresponding to white (resp.brown) layer populations.In each case we considered that a position lying in the top and bottom 0.05% of the empirical distribution was potentially within a selection signature.
For each selection signature, we then re-estimated the branch lengths of the population tree, using local allele or haplotype clusters frequencies (see Fariello et al. (2013) [15] for details) and identified the branch lengths that seem significantly larger than the branches of whole genome tree to pinpoint selected populations.

Fitting of gamma distribution
As hapFLK statistic does not follow a known distribution under neutrality, the null distribution has to be estimated from the data.As hapFLK is similar to FLK, a good approximation to the asymptotic distribution of hapFLK comes from the gamma distribution family.To estimate pvalues of selection signatures, we fitted a gamma distribution to the hapFLK observed distribution, using the minimum distance estimation method [26,27] which is robust to outliers, which helps to reduce the influence of selection signatures in estimating the null distribution.This was done for false detection rate (FDR) estimation.

Annotation
As explained above, regions with extreme FLK and hapFLK values were considered as candidates for selective sweeps.For all the three groups (all layers, white layers and brown layers) the extreme values (the upper and lower 0.05%) that were within 500 kb of each other were grouped together.For all joined groups gene annotations, QTL annotations and pathway annotations were completed.Gene annotations were done with the biomaRt R package [28] based on the Ensembl database [29] of Gallus_gallus-4.0 assembly.Animal QTL database [30] was used for QTL annotation, KEGG database for pathway annotation [31] and Gene Ontology (GO) database for GO annotation [32].Gene enrichment analysis was done with Fisher's exact test [33] for all annotated genes in all groups (all layers, white layers and brown layers) separately.Pathways and gene ontologies with p 0.05 were identified as being under selection.

Population structure
A phylogenetic tree based on Reynolds' genetic distances with 100,000 randomly selected SNPs (100 replications) was constructed and is shown in

FLK
Based on the FLK values distribution, a total of 107 regions (63 in all layers, 27 in white layers and 17 in brown layers) were detected as signatures of selection (S1 Table ).All these regions were in the upper 0.05% of the distribution which is representative of regions with fixed difference between populations.The genome-wide distribution of FLK values obtained from each group-all, white and brown-are depicted in Fig 2A , 2B and 2C, respectively.Annotation was carried out for all regions with extreme FLK values, i.e. potential selection signatures.The lists of genes in selective sweeps detected with FLK are available in the supplementary tables (S2, S3 and S4 Tables).The annotation list is enriched with genes of biological interest involved in various pathways such as ATP metabolic process (P = 0.023), metal ion binding (P = 0.001), nucleic acid binding (P = 0.008) and metabolic pathways (P<0.001),all of which can be related to production traits under selection in layers.The lists of pathways and gene ontologies under selection are available in supplementary tables (S5, S6 and S7 Tables).We identified three candidate genes which can be related to the breeding goals of chickens.H3F3C and AGRP which are associated with body growth and body weight [34,35], and IL19 which is associated to the immune system in chicken [36].More details about gene locations and study groups are available in Table 2.We also detected several QTL overlapping selection signatures for traits such as breast muscle weight, abdominal fat weight and liver weight, which all are related to the breeding goals of chickens (Table 3). hapFLK Based on the hapFLK values distribution, a total of 41 regions (17 in all layers, 12 in white layers and 12 in brown layers) were detected as selection signatures (S8 Table ).All these regions were in either the upper or the lower 0.05% of the distribution, which represent regions with a fixed difference or fixed similarity between populations, respectively.The genome-wide distribution of hapFLK values with 5 haplotype clusters obtained for each group-all, white and brown-are depicted in Fig 3A , 3B and 3C, respectively.Annotation was carried out for all regions with extreme hapFLK values, i.e. potential selective sweeps.The lists of genes for selective sweeps detected with hapFLK are available in the supplementary tables (S9, S10 and S11 Tables).The annotation list is enriched with genes of biological interest involved in various pathways such as nerve development (p = 0.027), growth factor receptor (p = 0.008), RNA metabolic process (p = 0.042) and skeletal muscle cell differentiation (p = 0.032), all of which could be related to production traits indirectly.The lists of pathways and gene ontologies under which were detected under selection in this study are available in the supplementary tables (S12, S13 and S14 Tables).We identified four genes that were related to the breeding goals of chickens with the hapFLK method.IGF-1R and STAT5B are associated with growth and carcass traits [37,38].BPIFB8 and SOX10, which are associated with egg natural defense [39] and dark brown mutational phenotype [40] respectively (more details is available in Table 2).Several QTL, which were related to the breeding goals of egg-layer chickens were detected as well,  for traits such as drumstick and thigh carcass weight and shank length.A complete list of all QTL with more details is available in Table 4.

Structure analysis and P 0 comparison
Our population structure analyses are largely in agreement with the expected historical origin of the breeds [21] and as expected, they are also similar to the previous study using the same data [20].
One of the issues in the FLK and hapFLK analysis in this study is using only 4 wild chickens for development of the population's kinship matrix.We assessed whether using a different set of outgroup individuals could possibly change our findings by verifying the influence of the outgroup set on the estimation of the ancestral allele frequency (p 0 ).p 0 can be seen as a nuisance parameter in the model that has to be estimated from the data through the kinship matrix.We studied the possible impact of the number of wild chickens used by comparing p 0 when being calculated from 4 wild chickens (our ANC group) vs. 40 wild chickens (consisted of 20 Gallus gallus gallus and 20 Gallus gallus spadiceus which were genotyped with Axiom Genome-Wide Chicken Genotyping Array of Affymetrix and were available only for this comparison).p 0 was calculated for each group (ANC group and 40 wild chickens) for every SNP on the 600K SNP chip.Pairwise comparison of each group's p 0 values along the genome gave an average correlation of 0.95.This high correlation suggests that there is no vital difference in development of population's kinship matrix with 4 or 40 wild chickens.Therefore the kinship matrix calculated based on four wild chickens, which had been genotyped for the complete set Table 3. QTL associated with productive traits in FLK analysis in all three studies.'All' stands for inclusion of all commercial breeds, and 'White' for analysis within white layers.

Fitting of gamma distribution
Although the outlier approach is an effective and widely used method for identification of genes under selection lacking known phenotypes [41], an outlier signal is not necessarily synonymous with regions being under selection [42].Therefore we fitted a gamma distribution to the hapFLK in order to estimate the false discovery rate (FDR).This approach suggested an FDR of 10-20% in our analysis.

F ST , FLK and hapFLK
An overlap exists between the regions that have been determined as regions under selection in a previous study with F ST [20] and the current analysis of FLK and hapFLK as shown by the Venn diagram for the number of SNPs identified as being under selection with either of the methods shown in Fig 4 .Using the same threshold as in our previous work [20] (upper and lower 1%) resulted in detection of a lower number of selection signatures with FLK (73.2%) and much lower with hapFLK (13.4%) compared to the F ST based results reported in our earlier study on the same data (list of regions detected with F ST method is available in S15 Table) [20].
A finding suggested that ten-thousands of polymorphisms respond to selection, which was the case in our earlier work [20], does not appear realistic [43].Many of the outliers detected with F ST must be considered as false positives, which might be partly due to the fact that the method assumes populations to have the same effective size and to have emerged independently from the same ancestral population.Therefore we used a much stricter threshold (upper and lower 0.05%) in the study presented here than in our previous work (upper and lower 1%) [20].Accordingly, the use of a stricter threshold and the application of methods that account for different effective population sizes and hierarchical phylogenies (FLK and hapFLK), resulted in the detection of much lower number of selection signatures.There is also an overlap between regions detected by hapFLK and FLK (44.2%) which is due to the use of same statistic in both methods.The difference between regions detected by FLK and hapFLK can be due to the fact that haplotype and SNPs harbor different information.
As an example, in Fig 5A we demonstrate allele frequencies at SNP positions around the TGFB2 gene (Chr3: 18,690,003-18,753,123) which was detected as a gene under selection by Table 4. QTL associated with productive traits in hapFLK analysis in all three studies.'All', 'White', and 'Brown', stand for inclusion of all the commercial breeds, analysis within white layers and analysis within brown layers, respectively.'s' stands for similarity and 'd' for difference.F ST [20] due to a reduction of diversity within the WL breed.However, since reduction exists only within the WL breed this can also be explained by drift alone.By taking the population tree into consideration, FLK does not detect any signals in this region.Another example is  the region around the H3F3C gene (Chr3: 16,483,487,393) which was detected to be under selection by FLK.Allele frequencies around this region shows that a huge diversity exists between some breeds (Fig 5B).We detect an outlier with FLK in particular because WR1 and WR2 show very different patterns of allele frequencies in this region although they are closely related in the population tree.However F ST was not able to detect any signal here, since F ST treats each population as an independent evidence for sweep detection and does not consider the huge difference between WL, RIR and WR breeds.There are as well cases in which all three methods (F ST , FLK and hapFLK) were able to detect the region under selection.An example is a 60Kb region on chromosome 10 (6,799,776-6,738,610).A complete hard sweep is expected to be large [44], while a soft sweep is more likely to have smaller size [45].In the current study we detected smaller sweeps (bp length) compared to our F ST study, which may be due to the fact that hapFLK has greater power in detection of soft sweeps.Nevertheless we should as well take into account the false positive rate of our F ST study.A boxplot of sweep size with F ST and hapFLK method is shown in S2 Fig.
A vast majority of differentiated polymorphisms in our data set could be caused by genetic drift.Genetic drift is high when the (effective) population size is small [46] which is the case in commercial laying breeds [47].Since regions differentiated by selection and regions differentiated by drift alone may overlap, there is a lack of power in our analysis.This could be solved by using a larger number of populations to minimize the risk that a systematic pattern of differentiation in many breeds (say, several white layers vs. several brown layers) is created at random by drift alone.Other obstacles in this study are the use of only 10 animals per sample and filtering for minor allele frequencies; these two issues might have an effect on the estimation of allele frequencies, comparison of rarer alleles and identification of all haplotypes.In a recent simulation study [48] it was shown that the power of most selection signature tests is more dependent on marker density than on sample size, and that with a marker density similar to the one used in the present study a high power and positional resolution was achieved with 15 sampled individuals per population.We detected several genes related to the breeding goals of egg-layer chickens, such as low body weight, high reproduction performance and good feed conversion [49], both with FLK and hapFLK.For instance, with the FLK method we detected several QTL associated to disease-related traits and breast muscle weight, as well as AGRP (agouti related protein homolog), which is associated with breast muscle water loss rate, chest width, body weight, slaughter rate and semi-evisceration weight [35].
In the hapFLK analysis, we also detected several genes, which are associated with growth and carcass traits, such as IGF-1R and STAT5B.STAT5B (signal transducer and activator of transcription 5B) is associated with growth and reproduction traits [38].IGF-1R (insulin-like growth factor 1) is similar to IGF2 [50], which was detected in our previous work [20].IGF-1R is associated with chicken early growth and carcass traits [37].We additionally detected several QTL associated to carcass weight, drumstick weight and shank length.QTL associated with meat production, as well as both IGF-1R and STAT5B, were located in regions that were similar between brown layers.Supporting results were found in our previous study [20], where we detected genes associated to meat quality and production in brown layers, which reflects the fact that brown egg-layers were originally a dual-purpose breed [21].
Bonhomme et al. (2010) [12] and Fariello et al. (2013) [15] showed with simulation that using FLK or hapFLK method to detect selection signatures in comparison to other F ST -like approaches greatly increases the detection power.Specifically, hapFLK statistic has more power in detecting sweeps occurring in several populations.Due to this, we were able to detect SOX10 with hapFLK which was not detected by F ST or FLK method.SOX10 is a gene on chromosome one underlying the dark brown mutational phenotype in chickens plumage [40].SOX10 was detected in regions that were different between brown layers.Re-estimation of the local tree using haplotype clusters frequencies (Fig 6A ) and haplotype frequencies (Fig 6B ) for the region surrounding SOX10 revealed selection in the RIR breeds in this region.RIR is the only breed with dark brown feather in our data set [51], which is in great agreement with our selection signature detection.

Conclusions
In conclusion we were able to identify several putative selection signature regions with genes corresponding to the traits associated to growth and reproduction traits.Some of these annotated genes were similar (or had similar functions) to our findings in our previous work [20].However, several of the detected regions were not associated with any genes related to production traits, which could be due to insufficient knowledge about these regions [52].We did not identify selection signatures that were reported in other studies on chicken [17,53] which could be due to lack of diversity in our data compared to their data set.By detection of SOX10 as a gene under selection, we demonstrated that the use of haplotype frequencies and consideration of hierarchical structure can improve the power of detection of soft sweep in our data set.S14 Table .Lists of pathways and gene ontologies under selection with hapFLK with 0.05% threshold in brown layers.(PDF) S15 Table .Regions detected as putative selective sweeps detected with F ST with upper (U) and lower (L) 1% threshold.
LG stands for studies between commercial-layers and non-commercial chickens and BW stands for studies between brown and white layers, respectively.(PDF) S1 Dataset.Compress file of genotyped data in plink format.(RAR)

Fig 1 .
As Fig1shows, commercial white egg-layer breeds were separated from brown egg-layers and grouped in one sub-tree.In the sub-trees, the two white-layer lines WL2 and WL3 as well as the two brown-layer lines WR1 and WR2 form a separate sub-cluster, respectively.The population specific fixation indices of all populations, also shown in Fig1, are extremely high (ranging from 0.45 to 0.75), reflecting the very strong effect of genetic drift in these populations, with the three White Leghorn populations notably more inbred than the Brown layer populations.

Fig 1 .
Fig 1. Reynolds' genetic distances population tree of seven commercial breeds and histogram of fixation index for each line.doi:10.1371/journal.pone.0130497.g001

Fig 2 .
Fig 2. Manhattan plot of FLK analysis over the entire genome.Blue line indicates the upper 0.05% of FLK distribution, for (A) within all breeds, (B) within white breeds, and, (C) within brown breeds.doi:10.1371/journal.pone.0130497.g002

Table 1 .
Name, abbreviation, number of individuals and the egg color for each breed used in this study.

Table 2 .
Genes associated with productive traits in FLK and hapFLK analysis in all three studies.'All', 'White', and 'Brown' stand for inclusion of all the commercial breeds, analysis within white layers and analysis within brown layers, respectively.'s' stands for similarity and 'd' for difference. doi:10.1371/journal.pone.0130497.t002