Exome Sequencing of Only Seven Qataris Identifies Potentially Deleterious Variants in the Qatari Population

The Qatari population, located at the Arabian migration crossroads of Africa and Eurasia, comprises Bedouin, Persian and African genetic subgroups. By deep exome sequencing of only 7 Qataris, including individuals from each subgroup, we identified 2,750 nonsynonymous SNPs predicted to be deleterious, many of which are linked to human health or lie in genes linked to human health. Many of these SNPs were at significantly elevated deleterious allele frequency in Qataris compared to other populations worldwide. Despite the small sample size, SNP allele frequency was highly correlated with that of a larger Qatari sample. Together, the data demonstrate that exome sequencing of only a small number of individuals can reveal genetic variations with potential health consequences in understudied populations.


Introduction
Exome sequencing, in which the coding sequences of the genome are selected from fragmented DNA and analyzed using next-generation sequencing, has led to remarkable insights into the incidence and frequency of polymorphisms within the coding regions of the human genome [1][2][3]. Importantly, exome sequencing has led to the identification of novel genetic variations with potentially deleterious effects on protein structure and function, which are thus of possible importance in health risks [4]. Sampling a diverse global set of populations by exome sequencing can identify variants that distinguish these populations and can yield insights of medical relevance [5,6]. However, there are large segments of the world population that have yet to be analyzed by massively parallel sequencing, despite large-scale efforts such as the 1000 Genomes Project (referred to below as 1000 G or 1000 Genomes) [7], the Environmental Genomes Project (http://www.niehs.nih.gov/research/supported/programs/egp/), the Human Genome Diversity Project (http://hagsc.org/hgdp/files.html), and the NHLBI Exome Project (http://www.nhlbi.nih.gov/resources/exome.htm). Identification of specific genetic factors associated with disparate disease risk, incidence, or severity among populations is important for increasing access to genomic medicine [8].
We carried out exome sequencing of 7 randomly chosen Qataris from the 3 Qatari genetic subpopulations [9], representing a region of the Arabian peninsula that is not at present part of any major sequencing effort. The focus was to identify potentially deleterious health-related alleles in Qataris, and to determine whether the prevalence of these alleles is significantly different in Qataris compared to other populations. From these data, potentially deleterious nonsynonymous missense coding polymorphisms were identified. We confirmed that the allelic representation of potentially deleterious missense polymorphisms of health-related interest in this small sample mirrored the frequency of alleles in a larger validation group of Qataris assessed using Affymetrix Genome-Wide SNP Array 5.0 (Affymetrix Inc., Santa Clara, CA) or TaqMan SNP Genotyping Assay (Life Technologies Corp., Carlsbad, CA). The representation of SNP alleles in our small sample was compared to the allele frequencies estimated using data from the 1000 Genomes Project (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/).
Overall, the data identified numerous potentially deleterious SNPs at significantly elevated frequency in the Qatari population compared to Europeans, Asians, Africans, and recently admixed Americans. These data demonstrate the power of exome sequencing, where sequencing of only a small number of individuals can be used to identify some of the potentially deleterious health-related alleles in a population and to identify allelic variation that differentiates this population from African, European, Asian, and American populations. The results provide insight into this population that, despite its importance in the story of human genetic variation [10,11], has not been represented in the major genome consortium sequencing efforts [5][6][7].
Potentially deleterious SNPs where the gene and SNP have been previously linked to human health. Of the 650 potentially deleterious missense coding SNPs identified in ≥1 of 14 QE7 alleles, 131 were on the Affymetrix 5.0 microarray, and thus could be confirmed in the QA149 Qatari validation set (Table 1, Figure 2A). Of these 131 potentially deleterious SNPs, 49 had been previously linked to human health. Of these, we selected 10 of potential medical relevance in Qatar to highlight (Table 2; see Web Resource S1 for all potentially deleterious SNPs). Among these 10 SNPs, several are relevant to disorders common in Qataris, including type 2 diabetes (PPARG, PPP1R3A), cardiovascular disease (PON2, MTR), hypertension (ULK4), pulmonary disease (CDC6) and neurologic disease (BDNF).
Potentially deleterious SNPs where the gene but not the SNP has been previously linked to human health. Of the 650 potentially deleterious missense coding SNPs identified in ≥1 of 14 QE7 alleles where the SNP was in a gene previously linked to human health, but the potentially deleterious SNP was different from that previously reported, 33 were on the Affymetrix 5.0 microarray and could be validated in the QA149 Qataris (Web Resource S1). Of these 33, the genes containing 13 were described in the OMIM database of Mendelian Disorders [12]. Of these SNPs, 10 of potential medical relevance in Qatar were further assessed (Table 3). Two of these genes have been related to ophthalmologic disorders, including age-related macular degeneration (HMCN1) and keratoconus (VSX1), and several have been related to neurologic disorders (IKBKAP, SGCG, SACS, ARHGEF10, and CACNA1S).

Correlation between QE7 and Larger Qatari Sample Allele Frequency
One of the important observations in the present study is the remarkable correlation of the allele frequency of potentially deleterious nonsynonymous SNPs described by sequencing only 7 Qatari exomes with a larger validation population of Qataris (Figure 2). While exome sequencing of only 7 individuals in a population only provides a snapshot of some of the potentially deleterious SNPs in the population, we asked the question: would sequencing of only 7 subjects in each of the continental populations also provide a valid snapshot of at least some of the potentially deleterious SNPs in these populations? To answer this question, using 1000 Genomes Project genotypes at 20,381 potentially deleterious SNPs genotyped in the QE7, we calculated the allele frequency correlation between a sample size of 7 vs a larger Qatari sample of n = 48, n = 174 or n = 193, depending on the number of available samples, using Pearson's product-moment method (Table S4). For comparison, we also compared the allele frequency correlation for seven individuals selected from 1000 Genomes populations, continents, and "World mix 1 and 2" of 3 European, 2 Asian and 2 African individuals with a larger sample of 193 from a random selection of the larger populations. These mixes of 3 European, 2 Asian, and 2 African were similar to a mix of 3 Q1, 2 Q2 and 2 Q3 based on the observation that SNPs used to classify Qatari into ancestry groups are effective at classifying 1000 Genomes individuals into European, Asian, and African clusters using both STRUCTURE [13] quantification of genomewide admixture proportions (Figure S2) and SMARTPCA [14] principal component analysis (Figure 3).
The "World mix 1 and 2" had the lowest allele frequency correlation observed out of all groups tested (0.85 and 0.86). The highest allele frequency correlation between small and larger samples was for Asians (0.91 and 0.92) and Europeans (0.92), for each population and each continent. The allele frequency correlation for European and Asian populations was, in general, higher than that of African and American populations. By comparison, the Qatari allele frequency correlation was 0.89, higher than "World mix 2" but lower than a European or Asian population. The observed correlation was similar for 131 potentially deleterious SNPs validated by Affymetrix 5.0 genotyping in 149 Qataris (Pearson correlation of 0.89; p < 2×10⁻¹⁶, 95% confidence interval 0.85 to 0.92, Figure 4A) and for 10 potentially deleterious SNPs validated by TaqMan PCR in 86 Qataris (Pearson's correlation 0.76, p < 0.012, 95% confidence interval 0.25 to 0.94; Figure 4B).
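As an illustration of the comparison described above, the following minimal Python sketch computes a Pearson product-moment correlation between allele frequencies in a small and a larger sample, together with an approximate 95% confidence interval. The Fisher z-transformation used for the interval is an assumption; the text does not state how the reported intervals were derived. The function names and the toy frequencies are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a correlation r estimated from n pairs.

    Assumption: Fisher z-transformation with standard error 1/sqrt(n - 3).
    """
    if n < 4:
        raise ValueError("need at least 4 pairs")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Toy data (hypothetical): alternate allele frequencies for 5 SNPs
# in a 14-allele sample and in a larger validation sample.
small_sample = [1 / 14, 2 / 14, 6 / 14, 7 / 14, 3 / 14]
large_sample = [0.05, 0.12, 0.40, 0.55, 0.20]

r = pearson_r(small_sample, large_sample)
low, high = fisher_ci(r, len(small_sample))
print(f"r = {r:.3f}, approximate 95% CI = ({low:.3f}, {high:.3f})")
```

Under this approximation, `fisher_ci(0.89, 131)` gives roughly 0.85 to 0.92 and `fisher_ci(0.76, 10)` gives roughly 0.25 to 0.94, which agrees with the intervals reported for the Affymetrix and TaqMan validation sets.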

Comparison of the Prevalence of QE7 SNPs to Prevalence of These SNPs in 1000 Genomes
SNPs at significantly different allele frequency that make Qataris distinct from Europeans, Asians or Africans were identified with an Fst threshold of population differentiation of Fst > 0.25 and a Storey q-value multiple testing corrected p value of q < 0.05 for the probability of observing x alternate alleles in 14 trials, using the 1000 Genomes continental allele frequency as a probability of success (Figure 3). Out of 25,803 SNPs at significantly different allele frequency in Qatar, 77% (19,940) had higher alternate allele frequency in QE7 in the four comparisons (QE7 vs Europeans, QE7 vs Asians, QE7 vs Africans, QE7 vs Americans) and significantly higher vs at least one, including 14,347 (56%) significantly higher in QE7 compared to four continents (Web Resource S1). Of the remainder, 2 (<1%) were significantly lower than four continents (see Web Resource S1 for full list of all combinations assessed). These were two intronic SNPs in the gamma aminobutyric acid B receptor 1 (GABBR1) gene linked to nicotine dependence [15], predicted to influence alternative splicing [16]. These SNPs are common in the 1000G, with overall continental allele frequencies > 0.64 for both SNPs in all continents (Web Resource S1).
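The probability of observing x alternate alleles in 14 trials, with the continental allele frequency as the probability of success, is a binomial calculation. The sketch below shows one way to compute the lower and upper tail probabilities in Python. It is an illustration only: the text does not state whether one-sided or two-sided tails were used, and the Storey q-value correction is not shown. The function names and example values are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def tail_probabilities(x, n, p):
    """Return (P[X <= x], P[X >= x]) for X ~ Binomial(n, p).

    x: alternate alleles observed in the small sample
    n: number of sampled alleles (14 for seven diploid exomes)
    p: allele frequency in the comparison population
    """
    lower = sum(binom_pmf(k, n, p) for k in range(0, x + 1))
    upper = sum(binom_pmf(k, n, p) for k in range(x, n + 1))
    return lower, upper


# Hypothetical example: 6 of 14 alternate alleles observed where the
# comparison population has an alternate allele frequency of 0.05.
lower, upper = tail_probabilities(x=6, n=14, p=0.05)
print(f"P(X <= 6) = {lower:.6g}, P(X >= 6) = {upper:.6g}")
```

A small upper tail suggests enrichment in the small sample relative to the comparison population, and a small lower tail suggests depletion. The resulting p values would then be corrected for multiple testing across all SNPs.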
Classification of the potentially deleterious SNPs significantly higher or lower in QE7. Out of 25,803 potentially deleterious SNPs significantly higher or lower than the overall frequency in at least one continent, 1,853 were missense nonsynonymous SNPs predicted to be potentially deleterious by SIFT or PolyPhen2 (Figure 1C). Of these potentially deleterious SNPs, 1,841 (99% of 1,853) were observed in ≥1 of the 14 QE7 alleles (Table 4), and 135 (7% of 1,853) were observed in ≥6 of the 14 QE7 alleles. The total potentially deleterious missense coding SNPs identified in QE7 that are predicted significantly different in allele frequencies in Qataris compared to world populations were subclassified using databases of disease and metabolism as: (1) not previously associated with nor within a health-related gene (1,336, 72% of 1,853 in ≥1 of the 14 QE7 alleles; 96, 5% of 1,853 in ≥6 of QE7); (2) in a gene previously linked to human health, but a different SNP than previously reported (442, 24% of 1,853 in ≥1/14; 30, 2% of 1,853 in ≥6/14 alleles); and (3) the gene and SNP have previously been linked to human health (63, 3% of 1,853 in ≥1/14; 9, 0.5% of 1,853 in ≥6/14 alleles). An additional 5 potentially deleterious missense SNPs were not observed to be segregating in Qatar but were observed in other continental populations a significantly higher number of times, for a total of 510 potentially deleterious missense coding SNPs in a gene previously linked to human health (Figure 1D). A further subset of these SNPs was validated by Affymetrix 5.0 genotyping in n = 149 Qataris and by TaqMan genotyping in n = 86 Qataris.
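The three-way subclassification above can be sketched as a simple lookup in which a SNP-level link takes precedence over a gene-level link, so that no SNP is counted in two categories. The Python below is a minimal illustration with hypothetical identifiers and hypothetical database contents; it does not reproduce the database queries used in the study.

```python
from collections import Counter


def classify_snp(rsid, gene, health_snps, health_genes):
    """Assign a SNP to one of three mutually exclusive categories.

    health_snps:  set of SNP identifiers with a reported health link
    health_genes: set of gene symbols with a reported health link
    A SNP-level link takes precedence over a gene-level link.
    """
    if rsid is not None and rsid in health_snps:
        return "gene_and_snp_linked"
    if gene in health_genes:
        return "gene_linked_snp_not"
    return "not_linked"


# Hypothetical annotation sets and SNP calls, for illustration only.
health_snps = {"rsA"}
health_genes = {"GENE1", "GENE2"}
calls = [
    ("rsA", "GENE1"),   # SNP itself has a reported link
    ("rsB", "GENE1"),   # gene is linked, this SNP is not
    (None, "GENE2"),    # novel SNP (no identifier) in a linked gene
    ("rsC", "GENE3"),   # neither SNP nor gene is linked
]

counts = Counter(
    classify_snp(rsid, gene, health_snps, health_genes) for rsid, gene in calls
)
print(dict(counts))
```

With categories defined this way, the per-category counts sum to the total, as they do for the counts reported above (1,336 + 442 + 63 = 1,841).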
Variants of potential medical relevance in Qatar. The main objective of this study is to identify variants for further study in Qatar. For this purpose, a set of variants was selected to be highlighted in Table 4 based on known prevalence as recorded in the medical records of Hamad Medical Corporation. Of the 30 potentially deleterious missense coding SNPs identified in ≥6 of 14 QE7 alleles as significantly higher in QE7 compared to one or more overall continental frequencies, where the gene and SNP have been previously linked to human health, we selected 7 of medical relevance in Qatar to highlight (Table 4; BMP4, ZNF229, ULK4, AKAP13, FMO2, COL4A3 and UTS2). Among these SNPs, several are relevant to disorders common in Qataris, including type 2 diabetes (UTS2), breast cancer (AKAP13), hypertension (ULK4), and nicotine dependence (FMO2) [17]. Interestingly, ZNF229 is associated with tuberculosis resistance, and the prevalence of tuberculosis is low among Qataris.
Of the 9 potentially deleterious missense coding SNPs identified in ≥6 of 14 QE7 alleles as significantly higher in QE7 compared to one or more continents, where the SNP was in a gene previously linked to human health but the potentially deleterious SNP was different from that previously reported, we selected 3 of potential medical relevance in Qatar to highlight (Table 4; ACAT2, TTC37 and PDZRN4). Of these 3 genes, ACAT2 is relevant to plasma lipid levels, a finding consistent with high cholesterol levels that are common in Qataris.

Figure 1. ... [33] and classified using databases of SNP function (NCBI dbSNP build 134, SIFT online webserver [34], and GATK VariantAnnotator function [33]). Shown are bar plots of: A. SNPs observed in ≥1 of 1,099 exomes [QE7 and 1000 G]; B. SNPs identified in ≥1 of 14 QE7 alleles; C. SNPs significantly higher or lower in QE7 vs at least one population; and D. Subset of significantly higher or lower SNPs in genes with a health-related role (OMIM [12], HGMD [37], PharmGKB [38] or HUGE [39]). In the four plots, the x-axis lists the functional categories (noncoding, coding, silent, missense, splice, nonsense) and the y-axis the number of SNPs. There were 20,857 (52%) missense SNPs predicted deleterious by SIFT [34] or PolyPhen2 [35] polymorphic in 1,099 exomes (QE7 and 1000 G), of which a subset of 2,750 were polymorphic in QE7 with ≥1 of 14 alleles (Table 1). There were 1,853 significantly higher or lower missense SNPs predicted deleterious by SIFT [34] or PolyPhen2 [35] polymorphic in 1,099 exomes (QE7+1000 G), and a subset of 510 relevant to health (see Table 2). Red = predicted deleterious SNPs. doi:10.1371/journal.pone.0047614.g001

Discussion
The inhabitants of Qatar include approximately 300 thousand Qataris and 1.4 million expatriates [18], half of whom arrived in the past decade [19]. The Qataris comprise 3 distinct genetic subgroups, Bedouin (Q1), Persian/South Asian (Q2) and African (Q3) [9]. As in the neighboring countries in the Middle East with closely related populations, cardiovascular disease, type 2 diabetes, obesity, lung disease, breast cancer, congenital malformations, and neurologic disorders are common in the Qatari population [19][20][21][22][23][24]. The Qatari population is unique in that the population is small and the rate of consanguineous marriage is high [25], leading to a high probability of shared ancestry between randomly selected individuals and longer runs of homozygosity, particularly in the Q1 [9]. Using exome sequencing, this study provides the first genome-wide insight into potentially deleterious coding, nonsynonymous SNPs in the Qatari population that may be relevant for this health profile. Our analysis focused on missense SNPs likely to have a potentially deleterious effect on protein function, as these are more likely to be functional and directly involved in the disease mechanism. Of the exome SNPs identified in at least 1 of the 14 QE7 alleles compared to the GRCh37 reference allele, 2,750 (37%) were predicted to be potentially deleterious to the expressed protein. Confirmation of 131 of these SNPs by microarray in a validation group of 149 Qataris showed a remarkable correlation, i.e., potentially deleterious missense SNPs identified by exome sequencing of only 14 alleles provided a snapshot of at least some of the potentially deleterious SNPs in the Qatari population. Among these, 2 sets of missense potentially deleterious SNPs were assessed in detail: those in a gene where both the gene and SNP have been previously linked to human health, and those in a gene previously linked to human health, but a different SNP than previously reported. Remarkably, despite the fact that we sequenced the exomes of only 14 Qatari alleles, among the 30 most frequent potentially deleterious SNPs observed in Qataris that had been observed in genes in other populations, 57% (17 of 30) were in genes linked to disorders common in Qatar.
Among the Qatari potentially deleterious nonsynonymous SNPs uncovered by the QE7 exome sequencing were SNPs that were novel or were significantly different (at a higher or lower frequency) than the overall frequency in at least one continent. To identify these SNPs, we used a combination of a statistical test robust to the small effects of sample size together with an Fst statistic; the latter has been successfully applied to identify missense SNPs for adaptation to altitude in Tibetans [6] and disease SNPs in individuals of Aboriginal ancestry in array genotypes [26]. This combined approach yielded a low empirical false discovery rate, as assessed for a subset of SNPs in a larger Qatari sample, indicating that the approach was successful at identifying variation that differentiates the Qatari population from worldwide populations.
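To make the Fst component of this combined approach concrete, the sketch below computes a simple two-population Fst for a single biallelic SNP from the two alternate allele frequencies. The estimator is an assumption: it is Wright's definition, (H_T − H_S)/H_T, with both populations weighted equally and no sample-size correction, and the text does not specify which Fst estimator was used. The function name and example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Wright's Fst for one biallelic SNP from two allele frequencies.

    Assumption: equal weighting of the two populations and no correction
    for sample size.
    H_T is the expected heterozygosity of the pooled population and
    H_S is the mean expected heterozygosity within populations.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The SNP is monomorphic in both populations: no differentiation.
        return 0.0
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical example: 7 of 14 alternate alleles in the small sample
# versus an alternate allele frequency of 0.05 in a comparison population.
fst = fst_two_populations(7 / 14, 0.05)
print(f"Fst = {fst:.3f}; exceeds 0.25 threshold: {fst > 0.25}")
```

Because a 14-allele sample gives a coarse frequency estimate, an Fst threshold alone would be noisy, which is presumably why it was paired with a test of the observed allele count.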
Among the 392 missense potentially deleterious SNPs not in dbSNP nor observed in 1000 Genomes and hence potentially novel in the Qatari population, one was observed in all seven Qataris, 5 were observed at least once in the three population clusters (Q1, Q2, Q3), and 6 were observed at least once in two of the three population clusters. The remaining 380 novel potentially deleterious coding nonsynonymous missense SNPs were observed in only one of the QE7. Five of these novel potentially deleterious SNPs observed in two or more Qatari populations are within genes previously linked to disease, including a Phe374Leu mutation in the copper transporter protein CUTC involved in copper metabolism [27], homozygous for the alternate allele in all seven Qataris.
Among the missense potentially deleterious SNPs with lower or higher frequency in Qataris, there were 36 SNPs lower in allele frequency than all 1000 Genomes Project overall continental frequencies (Europe, Asia, Africa, and the Americas) and significantly lower vs at least one, including 1 SNP significantly lower than 3 continents, 3 significantly lower than 2 continents and 32 significantly lower than 1 continent. This list includes 2 SNPs previously linked to pigmentation of skin (OCA2 His591Arg;

In order to identify potentially deleterious missense health-linked SNPs in Qatar, genotypes of the 2,750 predicted to be potentially deleterious alternate alleles observed in QE7 were subdivided by frequency [≥1/14 or ≥6/14 alternate allele frequency] and by functional category.
2 In order to identify potentially deleterious SNPs of medical interest in Qatar, the 2,750 predicted to be potentially deleterious SNPs were subclassified into 3 groups based on prior link of the gene or SNP to a health-related phenotype using four major databases of disease and metabolism SNPs (OMIM [12], HGMD [37], PharmGKB [38] and HUGE [39]). 1st row: total number of potentially deleterious SNPs; 2nd row: number of potentially deleterious SNPs where no SNP in the gene has been previously associated with a phenotype relevant to human health; 3rd row: SNPs in genes linked to human health, but the SNP has not been previously tested for phenotypic effect; and 4th row: number of potentially deleterious SNPs where the specific SNP and gene has been reported to be health-linked. SNPs in the fourth row (previously identified) are not counted in the third row (in a gene, but not SNP, previously linked).

An additional 8 potentially deleterious coding nonsynonymous missense SNPs were located in genes previously linked to disease, including rs1509309, a Cys206Ser mutation in serine protease 1 (PRSS1), a gene linked to hereditary pancreatitis [30]. Of the missense potentially deleterious SNPs higher in prevalence in Qataris compared to at least 1 continent and validated by TaqMan PCR in a larger Qatari population, several were relevant to diseases of high prevalence among Qataris, including genes associated with plasma lipid levels (ACAT2) and diastolic blood pressure (ULK4), as well as a specific SNP associated with type 2 diabetes (UTS2) and a specific SNP associated with nicotine dependence (FMO2). For further details and additional references regarding these SNPs, see Details S1.
By identifying potentially deleterious polymorphisms from exome sequences of 7 randomly chosen individuals of Qatari ancestry, and by validating these polymorphisms in a larger population of Qataris using genome-wide SNP array and TaqMan analysis, we have demonstrated that at least some coding sequence variations of potential medical importance within a population can be uncovered by exome sequencing of only a small number of subjects. This strategy can be used as a screening approach to identify, with reasonable confidence, common genetic variations of potential medical importance within a population. While this cannot substitute for assessing large numbers within any given population, this study demonstrates that next-generation genomics can be used, with reasonable confidence and a small sample, to identify enriched SNPs of potential medical importance that are common in a population. In this context, a significant subset of the health-relevant variation in large segments of world populations that have yet to be analyzed could be surveyed at minimal cost. Likewise, for global-scale genomics initiatives, many small samples that cast a wide net can complement the current approach of studying large samples of fewer populations in understanding global genomic diversity.

Population Classification
Human subjects were recruited and written informed consent obtained at Hamad Medical Corporation (HMC), Doha, Qatar under protocols approved by the Medical Research Center & Research Committee and the Institutional Review Board of Weill Cornell Medical College in Qatar. In a previous study, we identified the three major components of ancestry in Qatar (Bedouin, Persian/South Asian, African) in a sample of 156 unrelated Qataris genotyped using the Affymetrix Genome-Wide SNP Array 5.0 (Affymetrix Inc., Santa Clara, CA), and classified individuals as members of Q1 (Bedouin), Q2 (Persian/South Asian) or Q3 (African) [9] based on the major ancestry group as determined by the STRUCTURE [13] algorithm with k = 3 (Figure S1). We added to this sample an additional 2 Q2 and 3 Q3 individuals who were classified into genetic subgroups using TaqMan SNP Genotyping Assay (Life Technologies Corp., Carlsbad, CA) for a panel of 48 ancestry-informative SNPs (Methods S1, Table S1, and Figure S1). Thus, the complete sample of Qataris included 102 Q1, 39 Q2 and 20 Q3. This sample was used to choose the 7 Qataris for exome sequencing, where microarray and/or TaqMan genotype data collected for the remainder were used for validation purposes.

(n = 14 alleles) for deleterious SNPs in Table 4, the QE7 allele frequency observed in at least 1 of 14 (7%) QE7 alleles was compared to the allele frequency in QT86 (n = 172 alleles) generated using TaqMan. Shown is the QE7 allele frequency along the x-axis and the QT86 allele frequency along the y-axis. doi:10.1371/journal.pone.0047614.g002

Gene symbol and name obtained from the Consensus Coding Sequence (CCDS) NCBI database [32], amino acid substitution position and residues obtained from dbSNP when available, otherwise SIFT online webserver [34].
Transcript position and amino acid substitution were verified to be consistent with the literature.
Phenotype information from OMIM [12], HGMD [37], PharmGKB [38] or HUGE [39] database.
5 For more details and references, see Details S1.
6 Shown is the alternate allele frequency determined by exome sequencing in QE7 individuals.
7 Shown is the risk allele frequency in the validation set of QA149 individuals (n = 149 Qatari, 298 alleles). Failed genotypes are accounted for in the allele frequency. For statistical comparisons of the QE7 and QA149 allele frequencies, see Figure 2. doi:10.1371/journal.pone.0047614.t002

Analysis of the exomes in the QE7 14 alleles identified 650 missense coding SNPs where the gene has been previously identified as linked to human health, but the missense SNP is different from that previously reported (Table 1, row 3). To validate this observation in a larger group of Qataris, the Affymetrix Genome-Wide SNP array 5.0 was used to assess an independent group of 149 Qataris (QA149, 298 alleles). Of the 2,750 missense potentially deleterious SNPs identified in at least 1 of the QE7 14 alleles, 131 were on the microarray. Of these, 49 were in genes linked to human health, including 33 where the gene is linked to human health, but the reported link was for a different SNP. Of these 33, listed are 10 chosen as examples of missense SNPs linked to human health that are extensively documented in the OMIM database (3).
2 Gene symbol and name obtained from the Consensus Coding Sequence (CCDS) NCBI database [32], amino acid substitution position and residues obtained from dbSNP when available; otherwise SIFT online webserver [34].
Transcript position and amino acid substitution were verified to be consistent with the literature.
3 SNP information includes amino acid substitution, dbSNP build 134 rsID if available, chromosome, position in GRCh37 human reference genome assembly, reference and alternate allele in QE7. Ref = reference; alt = alternative.
5 See Details S1.
6 Shown is the alternate allele frequency determined by exome sequencing in QE7 individuals.
7 Shown is the alternate allele frequency in the validation set of QA149 individuals (n = 149 Qatari, 298 alleles). Failed genotypes are accounted for in the allele frequency. For statistical comparisons of the QE7 and QA149 allele frequencies, see Figure 2. doi:10.1371/journal.pone.0047614.t003

Exome Enrichment and Sequencing
A single exome capture library was prepared for each of the QE7 individuals using standard protocols, enriched using the Agilent SureSelect Human All Exon Kit (Agilent Technologies, Inc., Santa Clara, CA) [31], and sequenced on an Illumina GAIIx (Illumina Inc., San Diego, CA) using a full lane for each exome. Reads were mapped to GRCh37 using BWA v0.5.9, and mean coverage depth was verified to be >30x for all samples. The breadth of coverage of the target exome ±500 bp was high, with >96% of target bases covered by at least one read.
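The two coverage measures quoted above, mean depth and breadth of coverage, can be computed from per-base read depths over the target region. The following Python sketch illustrates the definitions on a toy depth vector. It is not the pipeline used in the study; the function name and input format are hypothetical.

```python
def coverage_summary(depths, min_depth=1):
    """Return (mean depth, breadth of coverage) for per-base read depths.

    depths:    read depth at each target base
    min_depth: depth at or above which a base counts as covered
    """
    if not depths:
        raise ValueError("no target bases supplied")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    return mean_depth, covered / len(depths)


# Toy example (hypothetical): four target bases, one with no reads.
mean_depth, breadth = coverage_summary([0, 10, 50, 60])
print(f"mean depth = {mean_depth:.1f}x, breadth = {breadth:.0%}")
```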

Identification of SNPs
SNP genotypes were called in coding exons [32] and flanking regions using the GATK [33] framework, as outlined in the GATK "Best Practices for Variant Detection V2" Wiki for all autosomal chromosomes (see Methods S1). After application of stringent quality filters, the discordance rate between the Affymetrix 5.0 genotyping array and sequencing genotypes was 0.0055, and principal component analysis of population structure was confirmed to replicate our prior analysis using array genotypes [9] (Figure 3).
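A genotype discordance rate of this kind is the fraction of genotype calls that disagree between two platforms, among sites called on both. The sketch below is a minimal Python illustration. Genotypes are assumed to be coded as alternate allele counts (0, 1 or 2) with `None` for a failed call, and excluding failed calls from the denominator is an assumption rather than a detail given in the text.

```python
def discordance_rate(array_genotypes, sequencing_genotypes):
    """Fraction of discordant genotype calls among sites called on both platforms.

    Genotypes are alternate allele counts (0, 1, 2); None marks a failed call.
    Assumption: sites with a failed call on either platform are skipped.
    """
    compared = 0
    mismatched = 0
    for array_gt, seq_gt in zip(array_genotypes, sequencing_genotypes):
        if array_gt is None or seq_gt is None:
            continue
        compared += 1
        if array_gt != seq_gt:
            mismatched += 1
    if compared == 0:
        raise ValueError("no sites called on both platforms")
    return mismatched / compared


# Toy example (hypothetical genotypes at five sites for one individual).
array_calls = [0, 1, 2, None, 1]
sequencing_calls = [0, 1, 1, 2, 1]
print(discordance_rate(array_calls, sequencing_calls))  # 1 mismatch in 4 compared sites
```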

SNP Function Annotation
Available databases, including dbSNP, GATK VariantAnnotator [33], SIFT [34] and PolyPhen2 [35], were used to collect functional annotation for SNPs, including dbSNP rsIDs, genes and transcripts overlapping SNPs, coding function, amino acid substitution, and prediction of nonsynonymous missense coding SNPs. In order to maximize the number of potentially deleterious SNPs identified, SIFT and PolyPhen2 [34,35] classifications were combined, such that a missense SNP predicted to be potentially deleterious by either algorithm was considered potentially deleterious, a more liberal classification than CONDEL [36] scores. Disease and drug metabolism annotation of known and novel SNPs was conducted using a SQL database combining public versions of OMIM [12], HGMD [37], PharmGKB [38] and HUGE [39].
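The rule for combining SIFT and PolyPhen2 is a logical union: a missense SNP is flagged if either predictor calls it deleterious. A minimal Python sketch of that rule follows. The label strings, and which PolyPhen2 labels count as deleterious, are assumptions made for illustration and are not taken from the text.

```python
# Assumed label sets; the study's exact labels and cutoffs are not stated here.
SIFT_DELETERIOUS = {"deleterious"}
POLYPHEN2_DELETERIOUS = {"probably damaging", "possibly damaging"}


def is_potentially_deleterious(sift_label, polyphen2_label):
    """Union rule: deleterious if either predictor flags the substitution.

    A missing prediction (None) from one tool does not block a positive
    call from the other.
    """
    sift_hit = sift_label is not None and sift_label.lower() in SIFT_DELETERIOUS
    polyphen_hit = (
        polyphen2_label is not None
        and polyphen2_label.lower() in POLYPHEN2_DELETERIOUS
    )
    return sift_hit or polyphen_hit


# Hypothetical predictions for three missense SNPs.
print(is_potentially_deleterious("tolerated", "probably damaging"))  # True
print(is_potentially_deleterious("deleterious", None))               # True
print(is_potentially_deleterious("tolerated", "benign"))             # False
```

A union maximizes sensitivity at the cost of specificity, which matches the stated aim of identifying as many candidate SNPs as possible for later validation.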
SNPs were annotated in two ways: by SNP and by gene. First, SNPs involved in disease or drug metabolism were identified by querying each database with the dbSNP rsID. For OMIM SNPs, where there are nonsynonymous missense SNPs not linked to rsIDs, the database was queried using the gene name and amino acid substitution (e.g., gene "CFTR" and substitution "Arg117His"). Second, coding SNPs within genes involved in disease or drug metabolism were identified by querying each database with the gene that contained the coding SNP.

Figure 3. In order to verify the overall quality of the genotyping call set, the seven Qatari exomes were compared to 1,092 individuals from four continents (1000 Genomes Project October 2011 Integrated Phase 1 Variant Set Release) at 18,865 SNPs segregating in both QE7 and 1000 Genomes that are present in dbSNP build 134 using SMARTPCA [14]. Plotted is PCA 1 (x-axis) vs PCA 2 (y-axis). Individuals are color-coded by continent of origin (European = red, Asian = green, African = blue, American = grey, Qatar = orange). Clustering of the Qatari individuals was verified to be consistent with our prior report [9], where Q1 cluster near Europeans, Q2 in between Q1 and Asians, and Q3 between Q1 and Africans. doi:10.1371/journal.pone.0047614.g003

Affymetrix and TaqMan Validation of Identified Qatari Potentially Deleterious SNPs
Affymetrix 5.0 genotypes for the 156 Qataris (102 Q1, 37 Q2, 17 Q3; Figure S1) were used to validate Qatari population allele frequency of potentially deleterious SNPs identified in the QE7 population sample (≥1 of 14 alternate alleles). Array genotypes for the QE7 individuals were 100% consistent with the exome genotypes at the 131 potentially deleterious nonsynonymous missense coding SNPs. Next, the QE7 individuals were excluded from the 156 Qataris and allele frequency was calculated for the 131 SNPs in the remaining 149 Qataris (QA149 = 99 Q1 + 35 Q2 + 15 Q3). In order to determine the correlation between QE7 and QA149 allele frequency, SNP allele frequency was plotted and fitted to a linear regression model, where correlations and 95% confidence interval for the allele frequency comparison were calculated. Whereas the Affymetrix 5.0 array was used to validate the 131 potentially deleterious SNPs in 149 Qataris, TaqMan PCR genotyping was used to validate 10 significantly higher SNPs predicted to be potentially deleterious by SIFT or PolyPhen2 with a known role in human health in a population sample of 86 Qataris.
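The allele frequencies compared in this validation can be derived from diploid genotype calls as the alternate allele count divided by twice the number of called individuals. The Python sketch below illustrates this with hypothetical genotypes. Dropping failed calls from the denominator is an assumption; the table notes state only that failed genotypes are accounted for in the allele frequency.

```python
def alt_allele_frequency(genotypes):
    """Alternate allele frequency from diploid genotype calls.

    genotypes: alternate allele counts per individual (0, 1 or 2),
               with None for a failed genotype.
    Assumption: failed genotypes are excluded from the denominator.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(called) / (2 * len(called))


# Hypothetical genotypes at one SNP: seven exomes and a larger array sample.
qe7_genotypes = [0, 1, 0, 0, 2, 0, 1]                 # 4 alternate alleles of 14
validation_genotypes = [0, 1, 0, None, 1, 2, 0, 0, 1]

print(alt_allele_frequency(qe7_genotypes))            # 4/14
print(alt_allele_frequency(validation_genotypes))     # 5/16
```

Repeating this for every validated SNP yields the paired frequencies that are then plotted and correlated.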

Identification of SNPs at Significantly Different Allele Frequency in Qatari Compared to Continental Populations
To identify known SNPs that have a significantly different allele frequency in the Qatari population, all known SNPs were tested for cases where the alternate allele was at a higher or lower count than expected in the 7 Qatari exomes (n = 14 alleles), when compared to the 1000 Genomes allele frequencies across populations within Europe, Asia, Africa or America. While the approach of comparing to the overall frequency in populations within a continent is not expected to identify differences between Qatari and all populations within a continent, particularly for alleles with high variance across a continent, this approach is expected to identify differences between Qatari and at least a subset of continental populations, a strategy that balances comparison to a larger sample number while flagging alleles of potential interest in the Qatari population. The allele frequency estimates used in the frequency enrichment test were

Analysis of the QE7 exomes (n = 14 alleles) identified 1,841 predicted deleterious SNPs observed in at least 1 of 14 QE7 alleles and significantly different in prevalence compared to overall continental populations as represented by the 1000 Genomes. Of these, 135 SNPs were significantly different in prevalence compared to the 1000 Genomes and observed in at least 6 of 14 QE7 alleles. Of these, 39 were either in a gene previously identified within a health-linked gene (n = 9) or in a gene previously linked to human health, but a different SNP than previously reported (n = 30). Listed in this table are 10 examples, linked to diseases relevant to Qatar, of SNPs from this list for which there is literature supporting a link to human health; the first 7 are from the category of "previously identified within a health-linked gene" and the last 3 are from the category of "a gene previously linked to human health, but a different SNP than previously reported." These 10 genes were validated by TaqMan PCR in an independent group of 86 Qataris (QT86, 172 alleles, including 82 Qatari overlapping with the QA149 and 4 non-overlapping).
2 Gene symbol and name obtained from the Consensus Coding Sequence (CCDS) NCBI database [45], amino acid substitution position and residues obtained from dbSNP when available, otherwise SIFT online webserver [46].

3 SNP information includes chromosome amino acid substitution, dbSNP build 134 rsID if available, chromosome, position in GRCh37 human reference genome assembly, reference and alternate allele in QE7. Ref = references; alt = alternative; * = risk allele could not be determined.
4 Phenotype information from OMIM [12], HGMD [37], PharmGKB [38] or HUGE [39] database. Shown is the risk allele, the health-associated phenotype and the reference(s).
5 For more details and references see Details S1.
6 Shown is the alternate allele frequency determined by TaqMan in QE7 individuals; no genotypes discordant with the QE7 exome sequences were observed.
7 Shown is the alternate allele frequency in the validation set of QT86 individuals (n = 86 Qatari, 172 alleles). Failed genotypes are accounted for in the allele frequency. For statistical comparisons of the QE7 and QT86 allele frequencies, see

Figure 2. Validation of allele frequency for potentially deleterious nonsynonymous missense SNPs observed in n = 7 Qatari exomes using Affymetrix 5.0 array genotyping of n = 149 Qataris or TaqMan genotyping of n = 86 Qataris (n = 82 overlapping). A. To confirm the allele frequency estimates for the Qatari population based on the number of alleles observed in QE7 (n = 14 alleles) for potentially deleterious SNPs, the allele frequency of SNPs observed in at least 1 of 14 (7%) QE7 alleles was compared to the allele frequency in QA149 (n = 298 alleles) generated using Affymetrix 5.0 SNP microarrays. Of the 2,750 potentially deleterious nonsynonymous SNPs identified in QE7, 149 probes were on the Affymetrix 5.0 array. Shown is the QE7 allele frequency along the x-axis and the QA149 allele frequency along the y-axis for 131 SNPs, excluding 18 Affymetrix 5.0 SNPs where the QE7 allele frequency could not be validated due to partial missing genotypes. B. Validation of allele frequency for potentially deleterious nonsynonymous missense SNPs significantly higher or lower in Qatari exomes using TaqMan genotyping of n = 86 Qataris. To confirm the allele frequency estimates for the Qatari population based on the number of alleles observed in QE7

3 SNP information includes chromosome amino acid substitution, dbSNP build 134 rsID if available, chromosome, position in GRCh37 human reference genome assembly, reference and alternate allele in QE7. Ref = references; alt = alternative.

Figure 3. Principal component analysis (PCA) validation of exome genotypes for the QE7 individuals. To verify the overall quality of the genotyping call set, the seven Qatari exomes were compared to 1,092 individuals from four continents (1000 Genomes Project October 2011 Integrated Phase 1 Variant Set Release) at 18,865 SNPs segregating in both QE7 and 1000 Genomes that are present in dbSNP build 134, using SMARTPCA [14]. Plotted is PCA 1 (x-axis) vs PCA 2 (y-axis). Individuals are color-coded by continent of origin (European = red, Asian = green, African = blue, American = grey, Qatar = orange). Clustering of the Qatari individuals was verified to be consistent with our prior report [9], where Q1 clusters near Europeans, Q2 between Q1 and Asians, and Q3 between Q1 and Africans. doi:10.1371/journal.pone.0047614.g003

Figure 4. Identification of autosomal exome SNPs in the QE7 Qatari individuals with an allele frequency distinct from at least one continent (Europe, Asia, Africa, the Americas) as estimated from exomes vs 1000 Genomes. A. Illustration of threshold selection. Fixation index (Fst; x-axis) and -log10(q-value) (y-axis) for a binomial test for all SNPs assessed versus each continent (red = QE7 vs Europeans, green = QE7 vs Asians, blue = QE7 vs Africans, tan = QE7 vs Americans). Shown is the threshold selected for identifying enriched SNPs: Fst > 0.25 [43] and FDR < 0.05 [44]. B-E. Heat maps of the false discovery rate [44] for enrichment of a higher or lower than expected number of alternative alleles, tested on 126,924 exome SNPs. Shown are the allele counts for the 7 Qatari exomes (y-axis) vs the alternative allele frequency in 1000 Genomes continental populations (x-axis). The map shows combined FDR and Fst thresholds for all SNPs (enriched = red = Fst > 0.25 and FDR < 0.05; not enriched = blue = Fst < 0.25 or FDR > 0.05; white = no observations). B. Qataris vs Europeans (EUR). C. Qataris vs Asians (ASN). D. Qataris vs Africans (AFR). E. Qataris vs Americans (AMR). F. Venn diagram of 25,803 SNPs enriched in Qataris vs at least one continent. doi:10.1371/journal.pone.0047614.g004
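The combined threshold described in this caption can be illustrated with a short sketch. The text specifies a binomial test on the 14 QE7 alleles against the continental allele frequency, with Fst > 0.25 and FDR < 0.05 as cutoffs, but this section does not state the exact Fst estimator [43] or FDR procedure [44]. The code below therefore assumes a two-sided exact binomial test, a simple two-population Wright-style Fst with equal weights, and Benjamini-Hochberg q-values; the function names and the example SNPs are hypothetical and are not taken from the study.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k alternate alleles among n under frequency p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value (assumed test form): total
    probability of all outcomes no more likely than the observed count."""
    observed = binom_pmf(k, n, p)
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


def fst_two_pop(p1, p2):
    """Wright-style Fst for one biallelic SNP in two equally weighted
    populations (an assumed estimator, not necessarily the one in [43])."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg q-values (assumed FDR procedure for [44])."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


def enriched_snps(snps, n_alleles=14, fst_min=0.25, fdr_max=0.05):
    """snps: list of (name, alt_allele_count_in_QE7, continental_alt_freq).
    Returns names passing both the Fst and the FDR threshold."""
    pvals = [binom_two_sided_p(k, n_alleles, p) for _, k, p in snps]
    qvals = benjamini_hochberg(pvals)
    hits = []
    for (name, k, p), q in zip(snps, qvals):
        if fst_two_pop(k / n_alleles, p) > fst_min and q < fdr_max:
            hits.append(name)
    return hits


# Hypothetical SNPs: (name, alternate allele count out of 14, continental frequency)
example = [("snpA", 12, 0.05), ("snpB", 1, 0.07), ("snpC", 7, 0.50)]
print(enriched_snps(example))
```

In this toy input only the first SNP, which is common in the 14 alleles but rare in the comparison population, passes both cutoffs. Requiring both conditions means a SNP must show a large frequency difference (Fst) as well as statistical support (FDR) given the small number of alleles sampled.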

Table 1. Potentially Deleterious Missense Coding SNPs in the Qatari Genome Identified by Exome Sequencing 1.

Table 2. Affymetrix Microarray Validation of Qatari Exome Potentially Deleterious SNPs Where the SNP and Gene Have Been Previously Linked to Human Health 1.
Analysis of the exomes in the QE7 14 alleles identified 131 missense coding SNPs where the SNP and gene have been previously identified as linked to human health (Table 1, 4th row). To validate this observation in a larger group of Qataris, the Affymetrix Genome-Wide SNP Array 5.0 was used to assess an independent group of 149 Qataris (QA149, 298 alleles). Of the 2,750 missense potentially deleterious SNPs identified in at least 1 of the QE7 14 alleles, 131 were on the microarray. Of these, 49 were in genes linked to human health, including 16 where both the gene and the SNP are linked to human health. Of these 16, listed are 10 chosen as examples of missense SNPs linked to human health.

Table 3. Affymetrix Microarray Validation of Qatari Exome Predicted Deleterious SNPs in Genes Linked to Human Health, but a Different SNP than Previously Reported 1.

Table 4. Predicted Deleterious SNPs in Known Health-associated Genes Enriched in Qatari Exomes Compared to Worldwide Populations and Validated by TaqMan PCR in a Larger Qatari Population 1.