Sardinians Genetic Background Explained by Runs of Homozygosity and Genomic Regions under Positive Selection

The peculiar position of Sardinia in the Mediterranean Sea has rendered its population an interesting biogeographical isolate. The aim of this study was to investigate the genetic population structure, and to estimate Runs of Homozygosity (RoHs) and regions under positive selection, using about 1.2 million single nucleotide polymorphisms genotyped in 1077 Sardinian individuals. Using four different methods (fixation index, inflation factor, principal component analysis and ancestry estimation) we were able to highlight, as expected for a genetic isolate, the high internal homogeneity of the island. Sardinians showed a higher percentage of genome covered by RoHs > 0.5 Mb (FRoH%0.5) when compared to peninsular Italians, with the only exception of the area surrounding Alghero. We furthermore identified 9 genomic regions showing signs of positive selection, and we re-captured many previously inferred signals. Other regions harbor novel candidate genes for positive selection, like TMEM252, or contain long non-coding RNA. With the present study we confirmed the high genetic homogeneity of Sardinia, which may be explained by shared ancestry combined with the action of evolutionary forces.


Introduction
Due to the geographic isolation of Sardinia in the Mediterranean Sea, the Sardinian population can be considered a genetic isolate. Faunal and floral endemism underline this peculiarity, which is also reflected in the genetic and cultural structure of the human population. For these reasons, Sardinians have been the object of numerous investigations in the fields of anthropology and population genetics [1,2,3,4].
Several studies have shown that the genome of present-day inhabitants of Sardinia still contains signatures of a long history of isolation. These features make this genetic isolate an ideal population for association studies [5,6,7]. However, much remains to be discovered about the genomic regions that were inherited from common ancestors, such as the short Runs of Homozygosity (RoHs), or the portions of the genome that have been subject to positive selective sweeps.
In the present study, we have analyzed the genetic structure of the Sardinian population by using 1.2 million single nucleotide polymorphisms (SNPs) from 1077 Sardinians previously included in a genome-wide association study (GWAS) [8], and from 79 healthy individuals from peninsular Italy. The aims of the study were the following: (i) reconfirming, through the use of autosomal genome-wide data, the homogeneity of the Sardinian population at the interregional level; (ii) inferring, through the use of RoHs, the population genetic history by estimating the background level of shared ancestry within the island and by comparing it with peninsular Italy; (iii) identifying signals of positive selection.

Data sets
Genotypic data from 1077 healthy subjects from Sardinia were used as the primary data-set. These samples were collected within the framework of an international consortium for GWAS on hypertension (HyperGene) and are described elsewhere [8]. Subjects were clustered according to birth place, dividing Sardinia on the basis of the language spoken, as suggested by Contini and coworkers [9,10]. In the present work, a simplification of this approach was used by dividing the island into six main macro-areas, as displayed in Figure 1: Gallurese (n = 77), Nuorese (n = 88), Logudorese (n = 385), Sassarese (n = 342), Alghero (n = 87) and Campidanese (n = 98). Part of the samples (n = 250) has already been analyzed in a previous work [11].
An additional group consisting of 79 Italian individuals was included in the study to compare the Sardinian genetic background with that of the Italian mainland. The peninsular Italian subjects were genotyped in our laboratory for more than 1M SNPs (HumanOmni1-QUAD v1.0 BeadChip, Illumina Inc, San Diego, CA, USA). To compare Sardinia and Italy, only SNPs common to both data-sets were considered (~520 k markers).
All samples were collected with informed consent and analyzed anonymously. Their use for population genetics studies was approved by the ethics committee of the Human Genetics Foundation (HuGeF) in Turin.

Quality Assessment and Control Procedure
Stringent quality control procedures were applied in the SNP genotyping analysis. Samples with an individual call rate lower than 98% were excluded. SNPs with minor allele frequency (MAF) less than 0.01 were excluded, as well as those that failed the Hardy-Weinberg equilibrium test (p < 1 × 10⁻³). For the estimation of the individual number of RoHs, SNP markers on sex chromosomes were excluded. After quality control procedures, the Sardinian data-set contained a total of 946,970 SNPs.
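The three filters described above can be illustrated with a short Python sketch. The 0/1/2 genotype coding (None for a missing call), the function names and the use of a 1-df chi-square approximation for the Hardy-Weinberg test are assumptions made for illustration only; the study does not state which test implementation was used, and an exact test is often preferred in practice.

```python
import math

MIN_CALL_RATE = 0.98  # samples below this call rate are excluded


def sample_call_rate(genotypes):
    """Fraction of non-missing genotype calls for one sample (None = missing)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF for one SNP from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def hwe_chi2_pvalue(genotypes):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(0), called.count(1), called.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def snp_passes_qc(genotypes, maf_min=0.01, hwe_p_min=1e-3):
    """Keep a SNP only if MAF >= 0.01 and the HWE p-value is >= 1e-3."""
    return (minor_allele_frequency(genotypes) >= maf_min
            and hwe_chi2_pvalue(genotypes) >= hwe_p_min)
```

For example, a SNP with genotype counts 25/50/25 passes both SNP filters, whereas a SNP that is heterozygous in every sample fails the Hardy-Weinberg filter.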

Statistical Data Analyses
Analysis was performed at different levels. The first one was to assess the genetic structure within Sardinia. A second level was aimed at reconstructing the genetic population history through RoHs analysis, and the identification of genomic regions under positive selection.

Sardinian population structure
Principal Component Analysis (PCA) was performed using the complete set of markers, with the algorithm implemented in the SNPRelate package [13] for R [12]. The PCA values of each individual sample were plotted in the space defined by the first 2 eigenvectors: subjects from the same linguistic macro-area or the same geographic area are displayed with the same color (Figure 2A and B).
We used the first four principal components (PCs) as predictors in a multinomial logistic regression using the linguistic macro-area as the dependent outcome. We then evaluated the prediction accuracy of the described model: for each sample the most probable linguistic macro-area estimated by the model was compared to the real one (10,000 iterations).
Pairwise inflation factors (λGC) [14] between the six macro-areas were computed with the PLINK software [15], simulating a case-control study between each pair of macro-areas (--adjust option).
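As a rough illustration of what the inflation factor measures, the sketch below computes λGC as the median of per-SNP 1-df chi-square statistics divided by the median of the chi-square distribution with one degree of freedom (about 0.4549). The function names and the allelic 2x2 test are illustrative assumptions; this is not the PLINK implementation.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def allelic_chi2(a, b, c, d):
    """1-df chi-square for a 2x2 allele-count table.

    a, b: counts of allele 1 and allele 2 in the first macro-area ("cases");
    c, d: the same counts in the second macro-area ("controls").
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


def genomic_inflation_factor(chi2_stats):
    """Lambda GC: observed median test statistic over its expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value close to 1 indicates that the test statistics behave as expected under the null hypothesis, i.e. that the two groups being compared show no detectable stratification.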
We used two different estimators of Fst. Pairwise genetic Fst between the six macro-areas was estimated with the inbreeding-corrected estimator suggested in Reich et al. [16], which is intended to produce estimates on data with significant inbreeding (such as the six macro-areas). Fst between the Sardinian and peninsular Italian populations was estimated using the Hudson estimator for genome-wide data [17], as suggested in Bhatia et al. [18]. The R code to compute both estimators is available in Text S1.
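The authors' R code is provided in Text S1 and is not reproduced here. As a minimal Python sketch of the Hudson estimator in the form discussed by Bhatia et al., the function below combines SNPs as a ratio of averages. The function name, the argument layout and the interpretation of n1 and n2 as the number of sampled chromosomes in each population are assumptions made for this illustration.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson Fst across SNPs, combined as a ratio of averages.

    freqs1, freqs2: per-SNP sample allele frequencies in populations 1 and 2.
    n1, n2: number of sampled chromosomes in each population (assumed
    constant across SNPs in this sketch).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n1 - 1)
                      - p2 * (1 - p2) / (n2 - 1))
        # Probability that two alleles drawn from different populations differ.
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator
```

Because of the sample-size correction, the estimate can be slightly negative when the two populations have nearly identical allele frequencies.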
Mean inbreeding coefficients were estimated on the basis of the observed versus expected number of homozygous genotypes over the whole genome, using the data set that also contained the peninsular Italian individuals (PLINK software, --het option). Differences between the Sardinian population and peninsular Italians were evaluated using a t test. The software ADMIXTURE [19] was used to estimate the ancestry of each individual in the Sardinian population and in the peninsular Italian subjects. A cross-validation error-based method was applied to detect the number of clusters (K) after 20 runs.
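The observed-versus-expected comparison can be written as F = (O - E) / (N - E), where O is the observed number of homozygous genotypes, E the number expected under Hardy-Weinberg proportions and N the number of non-missing genotypes. The Python sketch below illustrates this for one individual; the function name and the 0/1/2 genotype coding are assumptions, and the expected homozygosity is taken as 1 - 2pq without any small-sample correction, so the values may differ slightly from those reported by PLINK.

```python
def inbreeding_coefficient(genotypes, allele_freqs):
    """Method-of-moments F for one individual.

    genotypes: per-SNP allele counts (0, 1, 2) or None if missing.
    allele_freqs: per-SNP population frequency of the counted allele.
    """
    observed_hom = 0.0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, allele_freqs):
        if g is None:
            continue  # missing calls do not contribute
        n_called += 1
        observed_hom += (g != 1)            # 0 and 2 are homozygous
        expected_hom += 1 - 2 * p * (1 - p)  # Hardy-Weinberg expectation
    # Undefined if every SNP is monomorphic (n_called == expected_hom).
    return (observed_hom - expected_hom) / (n_called - expected_hom)
```

An individual with exactly the expected number of homozygous genotypes has F = 0, while an excess of homozygosity gives a positive value.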

Runs of Homozygosity Analysis
RoHs were estimated separately for Sardinians and peninsular Italians (PLINK software, --homozyg option). The following parameters were used for the estimation algorithm: 1) a sliding window of 5000 kb, with a minimum of 50 SNPs that must be present in the region considered; 2) for a given window, a maximum of one heterozygous and a maximum of five missing calls allowed; 3) each SNP was considered to be part of a homozygous segment when the proportion of homozygous windows overlapping that position was above the threshold value of 0.05.
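The window-based logic of parameters 2) and 3) can be sketched in Python as follows. This is a deliberately simplified illustration and not PLINK's implementation: windows are defined by a fixed number of SNPs only (the kb span is ignored), no minimum segment length or SNP density is enforced, and all function and parameter names are hypothetical.

```python
def homozygous_windows(genotypes, window_snps, max_het=1, max_missing=5):
    """One flag per window start: True if the window counts as homozygous.

    genotypes: 0/1/2 allele counts along one chromosome (1 = heterozygous),
    None for a missing call.
    """
    flags = []
    for start in range(len(genotypes) - window_snps + 1):
        window = genotypes[start:start + window_snps]
        hets = sum(g == 1 for g in window)
        missing = sum(g is None for g in window)
        flags.append(hets <= max_het and missing <= max_missing)
    return flags


def snps_in_segments(genotypes, window_snps, threshold=0.05,
                     max_het=1, max_missing=5):
    """True for each SNP whose share of homozygous overlapping windows
    is above `threshold`."""
    flags = homozygous_windows(genotypes, window_snps, max_het, max_missing)
    calls = []
    for i in range(len(genotypes)):
        first = max(0, i - window_snps + 1)   # first window covering SNP i
        last = min(i, len(flags) - 1)         # last window covering SNP i
        overlapping = flags[first:last + 1]
        share = sum(overlapping) / len(overlapping) if overlapping else 0.0
        calls.append(share > threshold)
    return calls
```

Consecutive SNPs flagged True would then be merged into candidate homozygous segments. With a low threshold such as 0.05, a SNP is included as soon as a small share of the windows covering it is homozygous, which makes the calls tolerant of isolated heterozygous or missing genotypes.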
We identified 6 RoH categories based on the length of the genomic region of homozygosity (0.5-1 Mb, 1-2 Mb, 2-4 Mb, 4-8 Mb, 8-16 Mb, >16 Mb), and estimated the proportion of individuals with RoHs of each size class in each Sardinian macro-area. Differences between Sardinian macro-areas and peninsular Italy were evaluated using a t test. We also estimated the proportion of the genome covered by regions of homozygosity (FRoH%) according to McQuillan et al. [20]. Two classes of RoHs were considered in this analysis: RoH ≥ 0.5 Mb and RoH ≥ 5 Mb. For each class and for each macro-area we computed the average FRoH% over all individuals, as well as the average sum of the lengths of all RoHs in the same class. A t test was performed, for each of the two RoH classes, to evaluate the differences between each macro-area and Italy.
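The two summaries described above (length classes and FRoH%) can be sketched in Python as follows. The function names are hypothetical, the assignment of values falling exactly on a class boundary is an arbitrary choice made here, and the autosomal genome length used as the denominator is left as a parameter because the value used in the study is not stated in this text.

```python
import bisect

CLASS_EDGES_MB = [0.5, 1, 2, 4, 8, 16]
CLASS_LABELS = ["0.5-1", "1-2", "2-4", "4-8", "8-16", ">16"]


def roh_class(length_mb):
    """Length class of one RoH, or None if it is shorter than 0.5 Mb.

    Lower class boundaries are treated as inclusive in this sketch.
    """
    idx = bisect.bisect_right(CLASS_EDGES_MB, length_mb) - 1
    return CLASS_LABELS[idx] if idx >= 0 else None


def froh_percent(roh_lengths_mb, min_length_mb, autosome_length_mb):
    """Percentage of the autosomal genome covered by RoHs of at least
    `min_length_mb` in one individual."""
    covered = sum(l for l in roh_lengths_mb if l >= min_length_mb)
    return 100 * covered / autosome_length_mb
```

Calling `froh_percent` with `min_length_mb=0.5` and `min_length_mb=5` on the same individual gives the two quantities compared between macro-areas and peninsular Italy.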

Extended haplotype homozygosity (EHH) and related tests
The fastPHASE software [21] was used for haplotype phase estimation. The estimated haplotypes were subsequently used to detect footprints of selection from haplotype structure.
For each SNP we computed the EHH statistic [22] of both alleles (ancestral and derived), as well as the integrated haplotype score (iHS) [23]. The algorithm is implemented in the R package rehh [24]. For this specific analysis we employed a total of ~900 k markers for which information about the ancestral allele was available in public databases [25]. Lastly, we searched for chromosomal regions that showed enrichment of SNPs with |iHS| > 4, using the approach suggested by Voight et al. [23]. Permutation-based correction for multiple comparisons was applied.
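To make the EHH statistic concrete, the following Python sketch computes it for one allele of a core SNP as the probability that two randomly chosen carrier haplotypes are identical over the interval from the core SNP to a given marker. The data layout (each phased haplotype as a sequence of 0/1 alleles) and the function names are assumptions made for illustration; the integration of EHH over distance and the standardization of iHS within allele-frequency bins, as performed by packages such as rehh, are not shown.

```python
import math
from collections import Counter


def ehh(haplotypes, core, allele, upto):
    """EHH of `allele` at SNP index `core`, extended to SNP index `upto`."""
    lo, hi = sorted((core, upto))
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # homozygosity is undefined with fewer than two carriers
    counts = Counter(carriers)
    # Share of carrier pairs that are identical over the whole interval.
    return sum(math.comb(c, 2) for c in counts.values()) / math.comb(n, 2)


def unstandardized_ihs(ihh_ancestral, ihh_derived):
    """Log ratio of the integrated EHH of the ancestral and derived alleles."""
    return math.log(ihh_ancestral / ihh_derived)
```

EHH equals 1 at the core SNP itself and decays with distance as recombination and mutation break up the carrier haplotypes; an allele whose EHH decays unusually slowly relative to the alternative allele produces an extreme iHS value.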

Results
The multinomial logistic regression model using the first four eigenvectors as predictors of the linguistic macro-areas showed very low accuracy (from a minimum of 0.2044 to a maximum of 0.3201, 10,000 iterations), suggesting a high degree of homogeneity within the Sardinian population. No sub-populations were apparent when projecting the Sardinian samples onto a two-dimensional space (based on the first two eigenvectors) using all autosomal markers (934,288 SNPs), either within the linguistic macro-areas (Figure 2A) or when dividing the island into 3 geographic regions (Figure 2B). The distribution of the first four eigenvectors is shown in Figure S1. All pairwise inbreeding-corrected Fst values between Sardinian linguistic macro-areas were close to zero (Table 1), and we observed an Fst estimate of 0.003 (p-value < 0.0001, 95% C.I. 0.0025-0.0033) when comparing Sardinia to peninsular Italy. Pairwise inflation factors (λGC) were very close to 1 (from 1.01 to 1.05) (Table 1). The ancestry analysis highlighted a common genetic background for all the individuals of the island (Figure 3). The observed shared ancestry made any attempt to cluster individuals on the basis of their place of birth unfeasible. Using the cross-validation error, we identified K = 2 as the number of clusters most compatible with the data. Furthermore, higher values of K did not reveal additional population-specific ancestries.
The percentage of genome covered by RoHs > 0.5 Mb (FRoH%0.5) was higher in Sardinians when compared to peninsular Italians, with the only exception of the area surrounding Alghero (Table 2). No significant difference was observed between Sardinians and Italians when comparing the fraction of the genome covered by RoHs > 5 Mb (FRoH%5) (Table 2).
Significant differences were observed in the mean inbreeding coefficients between Campidanese, Gallurese, Sassarese, and Logudorese macro-areas and peninsular Italy (Table 3).
Since the distribution of different classes of RoH allows the study of different demographic patterns involving a population, we further divided RoHs into 6 different classes, as shown in Table 4. Sardinia had a higher number of RoHs than Italy for 2 classes of RoH: 0.5-1 Mb and 1-2 Mb (p-value < 0.05). Regarding longer RoH classes (8-16 Mb and >16 Mb), no significant difference was found between the two regions, with the exception of Campidanese for the class 8-16 Mb. For RoHs longer than 2 Mb, the Alghero district and Sassarese were not statistically different from peninsular Italy.
To detect possible footprints of positive selection, the decay of standardized EHH (namely, iHS [23]) has been estimated. Nine genomic regions, harboring more than 200 different genes, showed a signal of positive selection (Table 5).

Discussion
In general, Sardinia appears to be characterized by a large internal homogeneity [5,7], like all isolated populations, even though other investigators have suggested the presence of genetically different subpopulations on the island [6,48]. Recently, several genome-wide studies have been performed on the Sardinian population, taking advantage of the genetic homogeneity of the island and also using large cohorts of individuals [49,50,51,52,53].
In the present study, we have reconfirmed the high internal homogeneity of Sardinia using four different methods (PCA, Fst distance, inflation factor parameter (λGC) and ancestry estimation).
The lack of a subpopulation structure seems clear from PCA. In fact, the multinomial logistic regression model showed that the first four PCs are not able to predict the linguistic macro-areas. Moreover, the inbreeding-corrected Fst values ranged from 9.1 × 10⁻⁵ to 1.1 × 10⁻⁴, and the λGC values were all nearly 1, indicating the lack of both population differentiation among different areas and genetic stratification within the island. The ancestry estimation also suggested a remarkable degree of similarity among all the sampled Sardinian subjects and, at the same time, a significant heterogeneity when Sardinians are compared to peninsular Italian subjects.
It is nevertheless worth noting that some Sardinian sub-regions, such as Ogliastra, are actually formed by isolated villages, each of them with a unique demography. Several studies [6,48,54] observed differences in linkage disequilibrium (LD) and population structure among these villages. Unfortunately, the limited number of individuals from Ogliastra in our sample (N = 16) did not allow us to test the hypothesis of genetic substructure at the microgeographic level.
The isolation of the population has also left its mark on the Sardinians' DNA. In fact, a 2-fold increase in mean homozygosity compared with Italy is still detectable. Nevertheless, we still found evidence for a significant decrease of genome homozygosity in the area surrounding Alghero, which is the linguistic macro-area with the lowest signature of isolation in Sardinia.
We focused on RoHs for a more detailed study of the demographic history of the island. RoHs are regions of the genome in which the copies inherited from both parents are identical because both parents inherited them from a common ancestor at some point in the past (identical-by-descent tracts). RoHs are observed in the genome of each individual, and their length is related to their time of origin. RoHs describe different aspects of a population, such as consanguinity, endogamy and demographic events such as bottlenecks. We therefore evaluated the average percentage of the genome covered by RoH > 0.5 Mb and RoH > 5 Mb (FRoH%0.5 and FRoH%5, respectively) within each Sardinian sub-population compared with those from the Italian peninsula. The FRoH%0.5 describes the global trend of homozygosity within the sub-populations, while the FRoH%5 provides information on other phenomena, such as endogamy or recent inbreeding. The average FRoH%0.5 and the mean sum of the lengths of these segments in Sardinia were higher than in Italy (mean sum of the lengths of RoHs > 0.5 Mb: from 72.77 to 82.55 Mb in Sardinia, 67.55 Mb in Italy). These observations are consistent with a small ancestral effective population size (Ne) in Sardinia and a deeper level of shared ancestry. Once again, the Alghero area contrasts with these observations, showing an FRoH%0.5 similar to that of peninsular Italy. However, we were not able to observe a similar trend for FRoH%5 for any of the six macro-areas.
To obtain a more detailed picture, we divided RoHs into six different classes. On average, in Sardinia the mean sum of the shortest RoHs (0.5-1 Mb and 1-2 Mb) was significantly longer than in Italy. This phenomenon can be explained as the result of common extended haplotypes, probably inherited from both parents, that are frequent in isolates and small communities [55].
Other macro-areas, such as Campidanese and Gallurese (for RoHs from 2 to 8 Mb) and Logudorese and Nuorese (RoHs of 2-4 Mb), still retain traces of endogamy when compared to peninsular Italy.
Again, in the Alghero area, RoHs above the threshold of 2 Mb were shorter and less common than in the other Sardinian populations; this finding indicates a significantly lower degree of endogamy and consanguinity in this subpopulation. It should be noted that the north-western town of Alghero is a Catalan-speaking community, and this language is a remarkable exception among all the Sardinian dialect varieties. The Alghero dialect derives from historical events that affected the city in the Middle Ages, when the population was swelled by the arrival of Catalan-speaking colonists [56].
To our knowledge, only one study has previously assessed genome-wide patterns of homozygosity in the Sardinian population [5]. Although the criteria used for the identification of RoHs are slightly different between the present study and that of Pardo and colleagues, the results of the two studies are consistent.
Additionally, we searched for footprints of positive selection in the Sardinian genome by using extended haplotype homozygosity and the iHS test. Our results identified some genomic regions not previously described as being under positive selection, which may be considered novel candidates worthy of investigation for positive selection in the Sardinian population. Among them are the region containing the TMEM252 (ID 169693) and PGM5 (ID 5239) genes, and a region on chromosome 19 containing a long non-coding RNA (LINC00662). As expected, we re-captured many of the previously described signals of recent positive selection. Specifically, these include the PRLH gene (ID 51052) and the MLPH gene (ID 79083), both located on the long arm of chromosome 2, which are under selection in Middle Eastern and European populations [57], the SH3BP5L gene (ID 80851) [58], and a region on chromosome 11 containing several olfactory-related genes [59]. As reported in the literature, the region of the human leukocyte antigen (HLA) system is under positive selection in European, Middle Eastern and South Asian populations [57]. In our study we did not find the lactase gene (LCT, ID 3938) among the regions under positive selection, as also reported by other studies [57,60].

Conclusion
Although the main limitation of our study is that the information on the Sardinian individuals' origins was based only on their birth place, our study reconfirmed, by using different approaches, the high degree of internal genetic homogeneity in Sardinia. We have shown that the genome of Sardinians has mean inbreeding coefficients that are higher than those of mainland Italians. Furthermore, the Sardinian genome still preserves traces of the elaborate demographic history of the island. Among the macro-areas analyzed, the area surrounding Alghero shows less inbreeding than the others, in accordance with its peculiar history, which is also reflected in the local dialect. Several genomic regions showing signals of positive selection were identified, some of them not previously described and as such worthy of further investigation. In the near future, our results could be confirmed by resequencing the genes/regions showing signatures of positive selection and by identifying potentially functional SNPs/haplotypes.

Supporting Information
Figure S1 Box plot distribution of the first four eigenvectors in the 6 macro-areas. (JPG)
Text S1 The R code used to compute 1) the Hudson estimator [17], as suggested in Bhatia et al. [18]; 2) the inbreeding-corrected Fst estimator, as suggested in Reich et al. [16]. (DOC)