A Genome-Wide Search for Greek and Jewish Admixture in the Kashmiri Population

The Kashmiri population is an ethno-linguistic group that resides in the Kashmir Valley in northern India. A longstanding hypothesis is that this population derives ancestry from Jewish and/or Greek sources. There is historical and archaeological evidence of ancient Greek presence in India and Kashmir. Further, some historical accounts suggest ancient Hebrew ancestry as well. To date, it has not been determined whether signatures of Greek or Jewish admixture can be detected in the Kashmiri population. Using genome-wide genotyping and admixture detection methods, we determined there are no significant or substantial signs of Greek or Jewish admixture in modern-day Kashmiris. The ancestry of Kashmiri Tibetans was also determined, which showed signs of admixture with populations from northern India and west Eurasia. These results contribute to our understanding of the existing population structure in northern India and its surrounding geographical areas.


Introduction
The Kashmiri population is an Indo-European ethno-linguistic group from Jammu and Kashmir state in northern India. The precise origins of the Kashmiri population are unknown. It has been suggested that they are descendants of one of the "lost tribes" of Israel who were exiled in 722 BCE [1]. They are believed to have traveled along the Silk Road into the countries of the Middle East, Persia, and Afghanistan until they reached the Kashmir Valley and settled there [2]. The claim of Israelite ancestry is widespread among Kashmiris, who cite historical records and the similarity of geographical names and cultural and social traditions.
It has also been proposed that many of the rural tribes of Kashmir are of Greek descent as a result of the conquests of Alexander the Great [3]. It is thought that many of Alexander's conscripts and soldiers settled in parts of India, including Kashmir, and intermixed with the local population once his conquests ended in India. This hypothesis has been supported by archeological evidence of ancient Greek presence in Kashmir [3]. For instance, a number of ancient Greek coins that date from shortly after the time of Alexander the Great's presence in India have been found in Kashmir [3]. Furthermore, there is evidence that a number of Grecian words have been adopted into the local Kashmiri vernacular [3,4]. There were also multiple Indo-Greek kingdoms, which were the remnants of further Hellenistic conquests into India, that ruled in what is now modern-day northern India, Pakistan, and Afghanistan [5]. The rule of the Indo-Greek kingdoms began in the Second Century BC and continued until the early First Century AD, furthering the notion of a substantial ancient Greek influence in Kashmir. It is thus reasonable to hypothesize that the current Kashmiri population may possess significant Greek ancestry.
To date, no genome-wide analyses have been performed to determine the degree to which Jewish or Greek genetic contributions exist in the Kashmiri population. The identification of such genetic admixture in the people of Kashmir would help to elucidate the population history of the region. It would also help to shed light on the history of nearby populations, such as the Pathans of Afghanistan and Pakistan, which have also been suggested to be of Jewish or Greek origin [1,6]. One of the primary goals of this study is to determine, using genome-wide genotyping assays, whether Greek or Sephardic Jewish genetic admixture is detectable in individuals of Kashmiri descent. To our knowledge, this is the first attempt to answer this question using genome-wide genotype data.

Methods
DNA was collected from 15 Kashmiri individuals from the Kashmir Valley who provided written consent. The collection and study of DNA from these individuals was approved by institutional review boards at the Sher-i-Kashmir Institute of Medical Sciences and the University of Utah School of Medicine (IRB 00017665). The families of these individuals have resided in the Kashmir Valley for at least three generations and have no history of marriages outside of the valley or to a non-Kashmiri. These individuals were then genotyped for single nucleotide polymorphisms (SNPs) using the Affymetrix Genome-Wide Human SNP Array 6.0. Sixteen Kashmiri Tibetans from Srinagar, India, who have Tibetan ancestry and now practice Islam, were also genotyped. We also genotyped a previously undescribed population of 32 first- and second-generation Tibetan exiles in McLeod Ganj, which included two individuals from the Tibetan Children's Village, to serve as population references. Genotype data from 573 persons of Jewish descent who represent 16 populations, including two Sephardi populations, were collected from the Jewish Hapmap Project [7] and served as a Jewish ancestry reference group. SNP genotypes from 471 previously studied individuals of Ashkenazi Jewish descent [8] were also analyzed. A total of 423 HapMap samples (62 CEPH (Utah residents with ancestry from northern and western Europe; CEU); 45 Han Chinese in Beijing, China (CHB); 90 Chinese in metropolitan Denver, Colorado (CHD); 90 Gujarati Indians in Houston, Texas (GIH); 46 Japanese in Tokyo, Japan (JPT); 90 Toscani in Italy (TSI)) were included in the analysis to provide a broader assessment of Eurasian population genetic structure. Previously collected genotypes from 25 Buryats [9], 24 Kurds [9], 25 Kyrgyzstanis [9], 94 Tibetans [10,11], 42 Mongolians (Qinghai) [12], and 26 Slovenians [9] were also analyzed.
In total, Affymetrix SNP 6.0 array data were assembled from 1,750 individuals (Table 1).
Contrast quality-control (QC) measurements, which measure how well homozygote and heterozygote genotype calls can be differentiated on the genotyping chip, were performed on each sample array using the Affymetrix Genotyping Console Software. Sample arrays that had a poor contrast QC measurement (<0.40) were removed from the analysis. For arrays that passed QC, genotypes were called using the Birdseed-v2 algorithm [13] in the Affymetrix Genotyping Console Software. Only autosomal SNP genotypes were retained for analysis. Samples with outlying low call rates (<95% genotyping rate) were removed from further analysis. SNPs that had a minor allele frequency less than 0.05 and a genotyping call rate less than 0.05 were filtered from the dataset. The method of moments identity by descent (IBD) procedure in the PLINK software package [14,15] was used to identify pairs of individuals related at the first cousin level or closer; one member of each pair was removed from analysis. Additional samples that were not assayed on the Affymetrix 6.0 array were collected in order to assess finer genetic structure in Eurasian populations, specifically that of Greeks and of populations from the Indian subcontinent. We included 180 samples representing 36 populations from the Human Genome Diversity Project [16], which had been SNP genotyped on the Affymetrix GeneChip Human Mapping 500 K Array Set. Genotypes in this dataset were converted from hg18 to hg19 genomic coordinates using the UCSC liftOver tool [17]; unmappable SNPs were removed. A further 348 previously studied individuals from 49 populations in the Indian subcontinent [18] were used to better compare the Kashmiri population with surrounding populations. Lastly, a large number of previously studied individuals (2,457) from 20 European countries [19], including Greek and other Balkan populations, were included to test whether the Kashmiri population contains Greek admixture.
SNPs that had a minor allele frequency less than 0.05 and a genotyping call rate less than 0.05 were omitted from each respective dataset. The three aforementioned datasets were merged with the Affymetrix SNP 6.0 dataset after removing ambiguous SNPs to avoid strand orientation errors. Genotype filtering and dataset merging were performed using the PLINK [14] and PLINK 1.9 [20] software packages and custom scripts.
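The removal of ambiguous SNPs before merging can be illustrated with a minimal sketch. It assumes that "ambiguous" refers to A/T and C/G SNPs, whose two alleles are complements of each other, so that a strand flip cannot be recognized from the alleles alone. The function names and SNP identifiers are hypothetical and are not taken from the custom scripts used in this study.

```python
# Complement of each nucleotide on the opposite DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T and C/G SNPs (the alleles are complements)."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def drop_ambiguous(snps):
    """Keep only SNPs whose strand can be inferred from their alleles.

    `snps` maps a SNP identifier to its (allele1, allele2) pair.
    """
    return {
        snp_id: alleles
        for snp_id, alleles in snps.items()
        if not is_strand_ambiguous(*alleles)
    }


# Hypothetical identifiers and alleles, for illustration only.
example = {
    "snp1": ("A", "G"),
    "snp2": ("A", "T"),  # ambiguous: A and T are complements
    "snp3": ("C", "G"),  # ambiguous: C and G are complements
    "snp4": ("T", "C"),
}
kept = drop_ambiguous(example)
```

In this sketch only the A/G and T/C SNPs survive, and these can then be strand-aligned between datasets by comparing allele pairs.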
Several methods were used to test for evidence of global Greek or Jewish ancestry in the Kashmiri population and to determine the ancestry of Tibetans from McLeod Ganj. First, a principal components analysis (PCA) was performed on the SNP genotype dataset to determine whether genetic outliers existed and to identify inter-individual genetic similarity. PCA was performed using the Smartpca program in the EIGENSOFT package version 6.0.1 [21,22]. Samples that appeared to be extreme outliers (likely due to genotyping error or strong genetic dissimilarity from the rest of the dataset) were removed. Extreme genetic outliers were defined as samples that were 6.0 standard deviations away from the mean principal component value for any of the first 10 principal components. F_ST values, which measure population differentiation, were also calculated (Smartpca) in a pairwise fashion to determine the amount of differentiation between each pair of populations. Next, individual ancestry estimation was performed using the ADMIXTURE software tool [23]. ADMIXTURE is a model-based global ancestry estimation approach that uses genotype data to estimate, for each studied individual, the proportion of ancestry derived from each of K ancestral populations. In order to control for linkage disequilibrium (LD), which can skew ADMIXTURE results, markers found to be in LD were removed from the dataset specifically for ADMIXTURE analysis using the PLINK 1.9 'indep-pairwise' parameter [20]. A sliding window of 50 SNPs was used, removing SNP pairs with an r^2 value above 0.1. ADMIXTURE was run with the assumed number of ancestral populations (K) ranging from 1 to 10. The cross-validation error was recorded for each ADMIXTURE run to determine the most likely model.
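The outlier criterion described above can be sketched as follows. This is a simplified, single-pass illustration, not the Smartpca implementation, which may repeat outlier removal over several iterations. The function name and input layout are assumptions made for this example.

```python
from statistics import mean, pstdev


def find_outliers(pcs, n_components=10, threshold=6.0):
    """Return indices of samples lying more than `threshold` standard
    deviations from the mean on any of the first `n_components` PCs.

    `pcs` is a list with one entry per sample; each entry is a list of
    that sample's principal component coordinates.
    """
    n_components = min(n_components, len(pcs[0]))
    outliers = set()
    for k in range(n_components):
        column = [sample[k] for sample in pcs]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:
            continue  # no variation on this component, nothing to flag
        for i, value in enumerate(column):
            if abs(value - mu) > threshold * sd:
                outliers.add(i)
    return sorted(outliers)


# Hypothetical coordinates: 99 identical samples and one distant sample.
pcs = [[0.0, 0.0] for _ in range(99)] + [[1.0, 0.0]]
flagged = find_outliers(pcs)
```

In the hypothetical input, only the last sample (index 99) exceeds the 6.0 standard deviation threshold on the first component.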
Lastly, the f3 test [24] was used to determine whether the Kashmiri population derives a significant amount of ancestry from two selected ancestral populations, thus indicating an admixture event. Every pairwise combination (8,001 total) of the populations in this study (excluding the Kashmiri population) was used as the two ancestral populations in the f3 test. Any f3 test with a negative score and a significant z-score (z < -1.64) was considered to be evidence of admixture in the Kashmiri population arising from the two selected 'ancestral' populations.
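The logic of this test can be sketched under stated assumptions. The count of 127 candidate source populations is inferred here from the fact that 127 choose 2 equals 8,001; it is not stated directly in the text. The f3 function below is a naive estimator computed from allele frequencies; it omits the finite-sample bias correction and the block-jackknife standard error that a full f3 test uses to obtain z-scores. All function names and allele frequencies are hypothetical.

```python
from itertools import combinations


def count_source_pairs(populations):
    """Number of unordered pairs of candidate source populations."""
    return sum(1 for _ in combinations(populations, 2))


def naive_f3(target, source1, source2):
    """Mean over SNPs of (c - a) * (c - b), where c, a, and b are the
    allele frequencies in the target and the two source populations.

    Naive estimator: no bias correction and no standard error.
    """
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


def is_admixture_signal(f3, z, z_threshold=-1.64):
    """Apply the decision rule: negative f3 and z below the threshold."""
    return f3 < 0 and z < z_threshold


# Hypothetical allele frequencies: the target is an equal mixture of the
# two sources, so each per-SNP product is negative or zero.
source_a = [0.1, 0.8, 0.5]
source_b = [0.9, 0.2, 0.3]
target = [(a + b) / 2 for a, b in zip(source_a, source_b)]
f3_value = naive_f3(target, source_a, source_b)
n_pairs = count_source_pairs(range(127))
```

For a target whose allele frequencies are intermediate between those of the two sources, the statistic is negative, which is the signature that the test looks for.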

Results
In total, 38 sample arrays failed the contrast QC step and were removed from the analysis (Table 1), and an additional 147 samples (including 3 Kashmiri samples) were removed due to low (<95%) genotyping rates (Table 1). 145 subjects were removed because they were related at the first cousin (or closer) level to another study subject (Table 1). 93,666 autosomal SNP genotypes were retained for these samples after genotype filtering and merging each dataset into one unified PLINK file. PCA identified 54 samples as genetic outliers that were subsequently discarded from further consideration (Table 1). 1,437 samples that were genotyped by the Affymetrix SNP 6.0 array passed all QC standards. 4,351 samples in total, including 12 Kashmiris, were used for analysis. The PCA demonstrates clear genetic patterning of populations in European, South Asian, and East Asian subcontinental groups (first two principal components, shown in Fig 1). A number of populations fall between the subcontinental genetic clusters (North African, West Asian, South Asian, and some East Asian populations). The Kashmiri samples are grouped near other previously studied groups from northern India and Pakistan, which indicates similar genetic ancestry. Further, the mean principal component 1 and 2 coordinate of a previously studied population in Kashmir (15 Kashmiri Pandits) is found to plot in the center of the collected Kashmiri cluster, indicating similar heritage and validating the quality of the Kashmiri genotypes. A number of populations residing in nearby Pakistan also show genetic similarity to the Kashmiris, including the Burusho, Balochi, Brahui, Sindhi, and Kalash. Principal components 1 and 2 also show that the Indo-European ethno-linguistic populations from northern India lie on a cline between western European populations and Dravidian ethno-linguistic groups of southern India. 
This pattern suggests that Indo-European ethno-linguistic groups of northern India, including the Kashmiris, share a complex ancestral history with both west Eurasian and Indian populations. None of the Kashmiris clustered near the northern Greek or Sephardic Jewish (Greece and Turkey) populations in principal components 1 and 2. A similar pattern is observed in principal components 3 and 4 (Fig A in S1 Appendix).
The PCA also showed evidence of genetic heterogeneity and admixture in the Kashmiri Tibetan samples. The Kashmiri Tibetans plotted more closely to the populations of South Asia than to the Tibetan reference populations, which include Tibetans from McLeod Ganj (Fig 1). Furthermore, the standard deviation of the Kashmiri Tibetans' first principal component is much higher (0.0048) than that of other Tibetans (0.00078), a group that includes Tibetans in McLeod Ganj and Qinghai. This pattern is suggestive of recent Tibetan admixture with populations from northern India or west Asia. F_ST estimates also show that the Kashmiris are very genetically similar to the 15 previously studied Kashmiri Pandits, as expected if the Kashmiri samples represent individuals from the region (Table A in S1 Appendix). F_ST analysis also shows that the Kashmiris are very genetically similar to other nearby South Asian Indo-European linguistic populations in northern India and Pakistan; the 10 most closely related populations are of northern Indian or Pakistani origin (Table A in S1 Appendix). Larger genetic distances with these nearby populations would be expected if Kashmiris had substantial amounts of Greek or Sephardic Jewish ancestry. The northern Greek population had a genetic distance of 0.021 with the Kashmiri population, while the Turkish Sephardic Jews and Greek Sephardic Jews had distances of 0.021 and 0.022, respectively. These genetic distances are approximately the same as those between persons from Italy and persons of Finnish descent [25]. The mean genetic distance of European populations with the Kashmiris was 0.024. The European similarity seen in Kashmiris does not appear to reflect a contribution from a specific European population, because a majority of the European F_ST values are near this mean value.
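To make the notion of pairwise genetic distance concrete, the following sketch computes F_ST between two populations from allele frequencies using a Hudson-type ratio-of-averages estimator. This particular estimator is an assumption chosen for illustration; the text does not specify which estimator Smartpca applied. Here n1 and n2 are taken to be the numbers of sampled chromosomes in each population, and all names and values are hypothetical.

```python
def hudson_fst(p1, p2, n1, n2):
    """Ratio-of-averages F_ST estimate between two populations.

    p1, p2: per-SNP frequencies of the same allele in each population.
    n1, n2: numbers of sampled chromosomes (assumed equal across SNPs).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling noise.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles, one from each population, differ.
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator


# Hypothetical frequencies: a fixed difference gives the maximum value,
# and identical frequencies give a value close to zero.
fst_fixed = hudson_fst([1.0], [0.0], 20, 20)
fst_same = hudson_fst([0.3, 0.5], [0.3, 0.5], 1000, 1000)
```

Values near zero, such as the distances of roughly 0.02 reported above, indicate that only a small fraction of allele frequency variation lies between the two populations.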
While PCA and F_ST can robustly determine how genetically similar distinct populations or individuals are to each other, they do not directly estimate individual ancestry proportions. Model-based approaches, such as ADMIXTURE [23], attempt to estimate the relative contributions of each of K ancestral populations to each individual's genetic makeup. A K value of 7 had the smallest level of cross-validation error (cross-validation error = 0.56024; Fig D in S1 Appendix), indicating it is the most likely ancestral population model. However, the small difference in cross-validation error ranging from K = 6 to K = 10 implies that other models could potentially explain the data as well. While ADMIXTURE ancestry proportions should not be considered direct estimates of admixture levels [26], the overall pattern of ancestry sharing can be used to determine if populations are genetically similar. The ADMIXTURE profile of Kashmiris shows a high degree of similarity with other geographically nearby populations. Specifically, the Kashmiris in the K = 7 (Fig 2) ADMIXTURE model possess ancestry profiles similar to those of Indo-European and Dravidian ethno-linguistic populations in India and Pakistan. Further, none of the ancestral components of the Kashmiris appear to be derived specifically from Greeks or Sephardic Jews, which would make the Kashmiris appear dissimilar from nearby populations. The overall ADMIXTURE profiles of these Indian and Pakistani populations are distinct from those of west Eurasian populations and especially those of East Asian descent. These general patterns largely hold true for K = 5 through K = 10 (Figs H-L in S1 Appendix). Of note, the ancestry profile found in the Kashmiris is virtually indistinguishable from that of the 15 previously studied Kashmiri Pandit samples.
Interestingly, the ancestry profile of the Kashmiri Tibetans is somewhat dissimilar from those found in Tibetan reference populations as it possesses ancestry components found largely in west Eurasian and South Asian populations.
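The selection of K by minimum cross-validation error can be sketched as below. The sketch assumes that each ADMIXTURE run reports its error on a line of the form "CV error (K=7): 0.56024"; this log format and the function name are assumptions. Only the K = 7 value is taken from the text; the other values are placeholders for illustration and are not results of this study.

```python
import re

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_lines):
    """Return (K, error) for the run with the smallest CV error."""
    errors = {}
    for line in log_lines:
        match = CV_LINE.search(line)
        if match:
            errors[int(match.group(1))] = float(match.group(2))
    k = min(errors, key=errors.get)
    return k, errors[k]


# K=7 is the reported value; K=6 and K=8 are placeholder values.
log_lines = [
    "CV error (K=6): 0.56100",
    "CV error (K=7): 0.56024",
    "CV error (K=8): 0.56050",
]
selected = best_k(log_lines)
```

When the errors for neighboring values of K differ only slightly, as noted above, the minimum should be read as a guide rather than as a decisive choice of model.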
While model-based analysis provides estimates of ancestry proportions, it is not a formal statistical test of admixture. The f3 statistic [24] provides a formal statistical test of the hypothesis that a pair of 'ancestral' populations contributed to the ancestry of a present-day population. Of the 8,001 total pairs, 3,366 showed evidence of jointly contributing to the Kashmiri population. Sixty-nine of the significant pairwise tests involved the northern Greek population (Table 2). These 69 significant tests included populations from South Asia (50) and from Central and East Asia (17) (Table 2). The results could suggest that an admixture event took place between Greeks and an ancestral South or East Asian population, contributing to the Kashmiri population. However, European and west Eurasian populations show almost identical f3 results to those of the northern Greek population. For instance, the z-scores of the f3 tests that involve the northern Greek population are nearly perfectly correlated with those of populations of European, North African, and West Asian ancestry (r = 0.994 on average) (Fig 3). The Sephardic Jewish populations from Greece and Turkey both have nearly identical f3 results that were as suggestive of Kashmiri admixture as those of the northern Greeks (Table 2). The f3 results of both Sephardic Jewish populations are also highly correlated with those of European and West Asian populations, as are those of the northern Greeks. Together, these results suggest that Kashmiris generally share ancestry with west Eurasian and South Asian populations, as opposed to having directly received significant genetic contributions from Greek or Jewish populations. The full list of f3 results can be found in S1 Dataset.

Discussion
Our results do not support the hypothesis of a substantial genome-wide Greek or Sephardic ancestral contribution to the Kashmiri population. This finding is consistent with previous literature that found no evidence of Jewish admixture in Kashmiris using Y chromosome haplotype data [27]. In addition, we found no evidence of substantial Greek admixture in the Burusho, Kalash, or Pathans, which have all been suggested to have ancestry derived from Alexander the Great's armies and the Greeks [6,27-30]. The PCA and ADMIXTURE results suggest that the Kashmiris are very similar genetically to other geographically proximate populations. The f3 results suggest that the Kashmiris derive a significant amount of ancestry from Indian/South Asian and western Eurasian sources. These results, together with those of previous investigations, suggest that substantial Greek or Jewish admixture did not occur specifically in the Kashmiri population. Instead, the results suggest that the Kashmiri population, and nearby surrounding populations, share genetic ancestry broadly with west Eurasian and South Asian populations.
There are, however, a number of possible reasons why recent Greek or Jewish admixture might be undetected in these analyses. It is possible that more cryptic admixture, in the form of specific Greek or Jewish autosomal haplotypes, exists. Tests such as rolloff [24], ALDER [31], and GLOBETROTTER [32] can detect admixture by utilizing linkage disequilibrium and haplotype data. However, this study did not have sufficient SNP density (93,666 autosomal SNPs) to capture linkage disequilibrium and haplotype structure. High-density genotyping array or next-generation whole genome sequencing, applied widely in diverse populations, would provide these data.
Another potential explanation for the lack of Greek and Jewish ancestry in the Kashmiris is that the Kashmiris sampled here are not representative of those who lived when the supposed admixture event took place more than 2,000 years ago. The same is true of the putative Greek and Jewish ancestral populations. As previously discussed, there is archeological evidence to suggest that the ancient Greeks were in the Kashmir region [3]. Another limitation of this study is the small Kashmiri sample size. It is reassuring, however, that this sample is genetically very similar to the 15 previously studied Kashmiri Pandits.
It is also possible that the Southern European and Mediterranean admixture seen in the Kashmiri individuals represents Greek or Sephardic Jewish ancestry. However, these patterns are not Kashmiri-specific and are seen in a number of nearby Indo-European ethno-linguistic populations in northern India and Pakistan. Taken together, these findings suggest strongly that the Kashmiri population is genetically similar to nearby populations and does not have a distinctly different ancestral origin.
Lastly, we noted that the Kashmiri Tibetans displayed ancestry from both Tibetan and South/West Asian sources. The Kashmiri Tibetans show ancestry deriving from the various populations of India, Pakistan, and western Asia (Fig 2). The degree of ancestry deriving from these populations in the Kashmiri Tibetans is also highly variable, which is a pattern consistent with recent admixture. Ancestry from these populations could suggest that admixture took place between Tibetans and Arabic-speaking peoples in the past. Such events are thought to have occurred as early as the eighth century A.D. when Islam was first introduced to Tibet [33]. In addition, some Kashmiri Tibetans claim to have originated from Kashmir, migrated to Lhasa, and returned to India after the incorporation of Tibet into the People's Republic of China. These migratory events could have resulted in additional admixture.
Our results also show that the Tibetans in McLeod Ganj are very genetically similar to previously studied Tibetan populations found on the Tibetan Plateau. As a result, studies of this population could be useful in elucidating the genetic and physiological mechanisms by which Tibetans are able to adapt and survive in high altitude and hypoxic conditions.

Supporting Information
S1 Appendix. Supplementary Appendix.
Supporting Figures:
Fig A) A principal components plot of principal components 3 and 4 representing the studied genotypic data.
Fig B) A principal components plot of principal components 1 and 2 representing the studied genotypic data including 60 unrelated Yoruban individuals genotyped on the Affymetrix SNP 6.0 array.
Fig C) A principal components plot of principal components 3 and 4 representing the studied genotypic data including 60 unrelated Yoruban individuals genotyped on the Affymetrix SNP 6.0 array.
Fig D) A plot of the cross-validation error vs. varying levels of K in the ADMIXTURE analysis.
Fig E) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 2) contributes to each studied population.
Fig F) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 3) contributes to each studied population.
Fig G) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 4) contributes to each studied population.
Fig H) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 5) contributes to each studied population.
Fig I) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 6) contributes to each studied population.
Fig J) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 8) contributes to each studied population.
Fig K) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 9) contributes to each studied population.
Fig L) An ADMIXTURE plot showing the proportion of ancestry each hypothetical ancestral population (K = 10) contributes to each studied population.
Supporting Tables. Table A