Novel Population Specific Autosomal Copy Number Variation and Its Functional Analysis amongst Negritos from Peninsular Malaysia

Copy number variation (CNV) has been recognized as a major contributor to human genome diversity. It plays an important role in determining phenotypes and has been associated with a number of common and complex diseases. However CNV data from diverse populations is still limited. Here we report the first investigation of CNV in the indigenous populations from Peninsular Malaysia. We genotyped 34 Negrito genomes from Peninsular Malaysia using the Affymetrix SNP 6.0 microarray and identified 48 putative novel CNVs, consisting of 24 gains and 24 losses, of which 5 were identified in at least 2 unrelated samples. These CNVs appear unique to the Negrito population and were absent in the DGV, HapMap3 and Singapore Genome Variation Project (SGVP) datasets. Analysis of gene ontology revealed that genes within these CNVs were enriched in the immune system (GO:0002376), response to stimulus mechanisms (GO:0050896), the metabolic pathways (GO:0001852), as well as regulation of transcription (GO:0006355). Copy number gains in CNV regions (CNVRs) enriched with genes were significantly higher than the losses (P value <0.001). In view of the small population size, relative isolation and semi-nomadic lifestyles of this community, we speculate that these CNVs may be attributed to recent local adaptation of Negritos from Peninsular Malaysia.


Introduction
Southeast Asia is believed to be one of the earliest regions of Homo genus habitation recorded outside Africa. This may have occurred nearly 2 million years ago, following the arrival of the ancient Javanian, known as Homo erectus [1]. The Negrito people are believed to be direct descendants of humans who arrived in Peninsular Malaysia more than 60,000 years ago [2][3][4]. Ancestral Homo sapiens who originated from Africa [5] migrated into Asia along the coastal route [6]. The Negritos from Peninsular Malaysia are of Austroasiatic origin [7] and thought to be related to the Philippine Aeta and Andaman Islanders as well as the Melanesians, Tasmanians, and certain tropical Australian rainforest foragers based on superficial anatomical features and foraging lifestyles [8]. In Malaysia, Negritos are divided into six tribes based on linguistics, socio-cultural practices, and geographical region inhabited namely, Bateq, Mendriq, Jehai, Kensiu, Lanoh and Kintak, numbering approximately 0.15% of the total population [9]. Studies on various genetic markers including autosomal microsatellite markers and mitochondrial DNA suggest that these tribes are genetically similar and may have experienced high levels of genetic drift [8,10,11]. It is believed that they may have adapted to the environmental changes throughout the centuries to cope with limited food resources and the tropical rainforest environment. Currently, the number of Negritos is dwindling rapidly as Malaysia becomes more developed and forests are cleared. Characterizing the genetic variation of the isolated populations such as Negritos provides valuable information to the gene mapping of complex diseases [12]. Thus it is crucial to unveil their genetic makeup in order to better understand how genetic variation contributes to the well-being and health of human populations especially in the Southeast Asian region.
Copy number variations (CNVs) typically range from 1 kb to several megabases in size [13] and are acknowledged as a major contributor to genetic diversity. This variability plays an important role in determining phenotypes such as physical features and in conferring susceptibility to a number of common and complex diseases, including HIV, psoriasis, and a number of neuropsychiatric diseases [14][15][16][17]. It does so by potentially altering gene expression levels and influencing gene dosage [18,19]. CNVs account for a significant proportion of the genome [13,20], are highly variable, and often harbor genes sensitive to environmental stimulation, such as those involved in immunity, metabolism, and olfactory reception [21][22][23]. Because CNVs are distributed non-randomly across the genome, it is believed that they may have been subject to selection [21,24].
Most genetic diversity data in indigenous populations have been based on single nucleotide polymorphisms (SNPs)/single nucleotide variations (SNVs) [6,25,26] and maternally inherited mitochondrial DNA [4,8], except for a handful of studies [27][28][29][30][31][32][33]. To date, CNVs in indigenous populations of Peninsular Malaysia have not been reported. As a complement to the existing SNP data, we constructed the first CNV map of Negrito individuals from Peninsular Malaysia and report the distribution of novel and population-specific CNVs. Our findings may provide fundamental insights into the genetic architecture of the Negritos, which can be translated to aid biomedical and evolutionary investigations.

Materials and Methods
Prior to sample collection, the headman of the tribe and/or the community members were first consulted in a customary courtesy visit and their consent was obtained. During sampling, all participants were interviewed, and informed written consent was obtained. The interview and consent process was conducted in the Malay language and witnessed by an officer from JAKOA. Only Negrito participants aged ≥18 years who gave consent were selected. We collected 10 ml of peripheral blood from 34 unrelated individuals (17 males and 17 females) after obtaining informed consent. The samples comprised both males and females from the sub-tribes Jahai, Bateq, Mendriq and Kensiu. DNA was extracted using the Qiagen Blood Extraction Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol.

Microarray Genotyping
Genotyping was performed using the Affymetrix SNP6.0 Array platform according to the manufacturer's instructions. Briefly, 250 ng of genomic DNA was digested and ligated. The ligated products were then PCR amplified. Amplicons were electrophoresed, purified and quantified to ensure that the samples passed quality control (QC) measures before proceeding. The products were then fragmented, hybridized onto the Affymetrix SNP6.0 chips and stained. Chips were scanned and raw data were generated using Affymetrix Genotyping Console Software (GTC) version 3.0.2 with default settings.
Table 1 footnote. WDR4 and QCNV4 showed normal copy number and were therefore considered false positives. QCNV2 was detected as a CN gain by microarray, which was inconsistent with the qPCR validation, and was therefore considered a false positive. Values in parentheses are unrounded copy number values calculated using relative quantification, with standard deviation. doi:10.1371/journal.pone.0100371.t001

Copy Number Variation Analysis and Validation
CNVs were called independently using three algorithms, Affymetrix GTC, Birdsuite and iPattern (TCAG), as described previously [35]. We applied stringent filtering criteria: a CNV had to be a minimum of 1 kb in length, span a minimum of 5 consecutive probes, and be detected by at least 2 of the 3 algorithms. In addition, we excluded CNVs on the X and Y chromosomes and those within approximately 300 kb of the centromeres and telomeres. To define a set of rare CNVs, we excluded known polymorphic loci (i.e. copy number polymorphisms, CNPs, targeted by the array) and CNVs with more than 50% reciprocal overlap with those reported in DGV.
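The size, probe-count and concordance filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CnvCall` record, its field names and the example calls are hypothetical, calls from different algorithms are assumed to have been matched to one another beforehand, and the centromere/telomere and DGV exclusions are not shown.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is illustrative.
MIN_LENGTH_BP = 1_000
MIN_PROBES = 5
MIN_ALGORITHMS = 2
AUTOSOMES = {str(i) for i in range(1, 23)}  # excludes X and Y


@dataclass(frozen=True)
class CnvCall:
    """Hypothetical record for one CNV call (coordinates are 1-based, inclusive)."""
    chrom: str
    start: int
    end: int
    n_probes: int
    callers: frozenset  # names of the algorithms that detected this CNV


def passes_filters(call: CnvCall) -> bool:
    """Return True if the call is autosomal, large enough and concordant."""
    length = call.end - call.start + 1
    return (
        call.chrom in AUTOSOMES
        and length >= MIN_LENGTH_BP
        and call.n_probes >= MIN_PROBES
        and len(call.callers) >= MIN_ALGORITHMS
    )


# Made-up example calls, not data from the study.
calls = [
    CnvCall("3", 1_000_000, 1_020_000, 8, frozenset({"GTC", "Birdsuite"})),
    CnvCall("X", 1_000_000, 1_050_000, 30, frozenset({"GTC", "Birdsuite", "iPattern"})),
    CnvCall("15", 5_000_000, 5_000_500, 6, frozenset({"GTC", "iPattern"})),  # under 1 kb
    CnvCall("8", 2_000_000, 2_040_000, 12, frozenset({"iPattern"})),  # single caller
]
kept = [c for c in calls if passes_filters(c)]
```

In this toy input only the chromosome 3 call survives; the others fail on chromosome, length and caller concordance respectively.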
The filtered CNV calls were then compared with the HapMap3 dataset and subsequently with the Singapore Genome Variation Project (SGVP) (http://www.statgen.nus.edu.sg/~SGVP/), to further identify CNVs unique to the Peninsular Malaysia Negritos. We defined a CNV as putative novel and unique to Negritos (denoted as population-specific CNVs) when it was not present in any of the HapMap3 and SGVP samples (defined as <50% reciprocal overlap with HapMap3 and SGVP CNVs).
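The reciprocal-overlap criterion can be made concrete with a short sketch. The function names and the tuple layout `(chrom, start, end)` are assumptions made for illustration, and coordinates are taken to be 1-based and inclusive; the study's actual comparison pipeline is not described in enough detail to reproduce here.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions between intervals A and B (0.0 if disjoint)."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(overlap / len_a, overlap / len_b)


def is_population_specific(cnv, reference_cnvs, threshold=0.5):
    """True if the CNV has < threshold reciprocal overlap with every reference CNV."""
    chrom, start, end = cnv
    return all(
        ref_chrom != chrom
        or reciprocal_overlap(start, end, ref_start, ref_end) < threshold
        for ref_chrom, ref_start, ref_end in reference_cnvs
    )


# Made-up example: a 50 kb call compared against two hypothetical reference CNVs.
query = ("1", 100_001, 150_000)
large_reference = [("1", 120_001, 400_000)]  # covers 60% of the query, but the query covers ~11% of it
similar_reference = [("1", 105_001, 155_000)]  # 90% overlap in both directions

novel_vs_large = is_population_specific(query, large_reference)
novel_vs_similar = is_population_specific(query, similar_reference)
```

Using the minimum of the two fractions is what makes the measure reciprocal: a small call nested inside a much larger reference CNV is not treated as the same variant.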
Annotated CNVs unique to the Negrito samples studied with underlying genes were validated with a qPCR SYBR Green assay as previously described [34]. A total of 50 ng (10 ng/µl) of genomic DNA was amplified in a reaction mixture containing 12.5 µl iQ SYBR Green Supermix (Bio-Rad) and 1 µl (7 µM/µl) each of the respective forward and reverse primers, topped up to a total volume of 25 µl with ddH2O. Cycling conditions were 95°C for 3 min, followed by 40 cycles of 95°C for 30 s, the respective annealing temperature of each locus for 15 s, and 72°C for 30 s.
A melting curve analysis was performed to check the specificity of the assay. Assay efficiency was assessed by generating a standard curve from serial five-fold dilutions of a single genomic DNA sample, from a top standard of 50 ng/µl down to 0.08 ng/µl (10 ng to 0.016 ng). All reactions were run in triplicate, except for a few that were run in duplicate because the genomic DNA was insufficient. Normalization to the control gene Forkhead Box P2 (FOXP2) (primers 5'-TGACATGCCAGCTTATCTGTTT-3' and 5'-GAGAAAAGCAATTTTCACAGTCC-3') was used to estimate copy number. The reproducibility of the qPCR assay for each sample was assessed from the within-sample variation, measured as the coefficient of variation (C.V. % = 100 × [standard deviation]/mean). The copy number of the target sequence in each test sample was determined using the comparative CT method (2^-ΔΔCT).
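The comparative CT calculation and the coefficient of variation can be illustrated as follows. This is a minimal sketch, not the authors' analysis script: the function names and CT values are invented, the sample standard deviation (n - 1) is assumed for the C.V., and converting the relative quantity to an absolute copy number by multiplying by 2 assumes a calibrator sample carrying two copies of the target, which the text does not state.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """C.V. % = 100 * standard deviation / mean (sample SD assumed)."""
    return 100 * stdev(values) / mean(values)


def relative_quantity(ct_target_test, ct_ref_test, ct_target_cal, ct_ref_cal):
    """Comparative CT method: 2 ** -(delta delta CT), using replicate means."""
    delta_ct_test = mean(ct_target_test) - mean(ct_ref_test)  # test sample
    delta_ct_cal = mean(ct_target_cal) - mean(ct_ref_cal)  # calibrator sample
    return 2 ** -(delta_ct_test - delta_ct_cal)


def estimated_copy_number(rq, calibrator_copies=2):
    """Scale the relative quantity by the calibrator's copy number (assumed 2)."""
    return calibrator_copies * rq


# Invented triplicate CT values; the reference gene stands in for FOXP2.
target_test = [26.0, 26.1, 25.9]
reference_test = [25.0, 25.0, 25.0]
target_calibrator = [25.0, 25.0, 25.0]
reference_calibrator = [25.0, 25.0, 25.0]

rq = relative_quantity(target_test, reference_test, target_calibrator, reference_calibrator)
copies = estimated_copy_number(rq)  # delta delta CT of about 1 gives roughly one copy (a loss)
cv_percent = coefficient_of_variation(target_test)
```

A target that amplifies one cycle later than expected relative to the reference gene corresponds to about half the template, hence an estimated single copy in this example.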
Eight of the 12 CNVs tested (66.7%) were true positives: 8 of the 9 CNVs >10 kb in length were validated, whereas all 3 CNVs smaller than 10 kb failed to validate. Given the low validation rate of the small CNVs, we removed CNVs <10 kb from further analysis. The primer sequences and the copy numbers measured for the candidate CNVs are listed in Table 1.
The microarray dataset has been submitted to NCBI dbGaP. The accession number assigned is: phs000664.v1.p1.

General Characteristics of CNV and CNVR
We identified 1,333 autosomal CNVs using Genotyping Console (Affymetrix), an average of 39.2 CNVs per genome; the total numbers of CNVs called by Birdsuite and iPattern are given in Table 2. After applying the stringent filtering criteria, 1,111 overlapping CNVs were successfully merged, an average of 32.7 CNVs per genome (range 19-54 CNV calls per genome), corresponding to 105,909,572 bp of the total autosomal genome (Figure 1). These corresponded to 263 CNVRs comprising 161 losses, 94 gains and 8 multi-allelic sites. Figure 2 shows the length distribution of CNVRs in this study.
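The step from individual CNV calls to CNV regions can be sketched as an interval merge. The text does not state the exact rule used to define a CNVR, so the sketch below assumes the common convention that any CNVs overlapping by at least one base on the same chromosome are merged into one region; the function name and example coordinates are hypothetical, and gain/loss status is not tracked.

```python
from collections import defaultdict


def merge_into_cnvrs(cnvs):
    """Merge overlapping (chrom, start, end) calls into CNV regions.

    Assumes 1-based inclusive coordinates and an any-overlap merging rule.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnvs:
        by_chrom[chrom].append((start, end))

    cnvrs = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the region being built
                cur_end = max(cur_end, end)
            else:  # gap: close the current region and start a new one
                cnvrs.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        cnvrs.append((chrom, cur_start, cur_end))
    return cnvrs


# Made-up calls pooled across samples.
pooled_calls = [
    ("1", 100, 200),
    ("1", 150, 300),
    ("1", 400, 500),
    ("2", 100, 200),
]
regions = merge_into_cnvrs(pooled_calls)

# Per-genome average quoted in the text: 1,111 merged CNVs across 34 genomes.
average_per_genome = round(1111 / 34, 1)
```

Because several samples can carry calls in the same region, the number of CNVRs (263 in this study) is smaller than the number of CNV calls (1,111).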

Comparison of Common CNVs
We first compared the diversity of common CNVs with that of the HapMap3 dataset, derived from 10 populations (comprising 1,072 samples). CNVs that showed significant differences in allele frequency are listed in Table 3. Notably, CNV losses at chromosome 3p22.2 (37,957,108-37,961,932) were observed in 56% of the Negrito samples in this study, as compared to the HapMap3 populations (Table 3; Figure 3). The gene CTDSPL involved in this CNV has been reported to be associated with prostate cancer (https://www.genome.gov/26525384). The CNV in chromosome 15q13.3 was another region of interest. The frequency of this CNV was higher (0.44) than in the HapMap3 samples (ranging from 0.09 to 0.21). The gene CHRNA7 involved in this CNV has been reported to be associated with schizophrenia and epilepsy [35,36].

Population Specific CNVs
Our dataset was further compared with the HapMap3 dataset. The analysis revealed 62 CNVs (corresponding to 36 CNVRs) unique to our Negrito samples. However, because of the high false discovery rate, CNVs <10 kb were excluded from further analysis, leaving 48 CNVs (24 gains; 24 losses), of which 32 were singletons (Table 4). The length distribution of the CNVRs specific to Negritos is shown in Figure 4.
To confirm the uniqueness of these CNVs in Negritos, we further compared our dataset with the metropolitan Chinese, Indians and Malays from SGVP. Seven CNVRs were covered in SGVP but none of these putative CNVs we found had been previously reported.

Gene Ontology and Pathway Analyses
To understand the putative functional implications of these CNVs, we performed Gene Ontology (GO) and pathway analyses on the gene set within the Negrito-specific CNVs using PANTHER and DAVID (Figure 5). Of the 48 CNVs specific to Negritos, 29 carried annotated genes while the remainder lay in gene-poor regions (Table 4). Among the CNVRs enriched with genes, copy number gains were significantly more frequent than losses (15 gains versus 6 losses) (P<0.001). GO analysis by PANTHER revealed fourteen genes involved in immune system function and regulation, response to stimulus and metabolic pathways, whereas DAVID revealed transcription regulation and regulation of RNA metabolic processes to be the most significant GO terms. The genes involved in the major biological processes are listed in Table 5.

Discussion
It is estimated that approximately 96% of the current genomewide association studies were conducted on individuals of European ancestry [37]. There is a growing need to unveil the spectrum of human genetic diversity by investigating minority populations, for instance the aboriginal populations in Southeast Asia (SEA) countries. The Negrito populations from Peninsular Malaysia are of interest, as they are known to be the descendants of earliest migrants to Southeast Asia. Due to their relatively long period of isolation and semi-nomadic lifestyles, they have had less exposure to urbanization. Their genomes are therefore perceived to be considerably less diverse owing to genetic drift and possibly founder effects. This makes them ideal for investigating genetic forces acting in human evolution, which provides fundamental knowledge to inform disease-based genetic studies as well as gene mapping.
In this study, we identified 263 CNVRs in 34 Negrito subjects from Peninsular Malaysia, of which 27 we believe are novel and unique to Negritos. After excluding the small CNVs, an average of 23 CNVs was observed per Negrito genome. There were more losses (72.6%) than gains, in line with most reported studies [12,29,32]. The overall size of the CNVs observed also corresponded well. Approximately 58% of the CNVs found in Negritos were <30 kb, in line with reports by Yim et al. [27] on Korean genomes and Ku et al. [32], but relatively higher than that reported by Zhang et al. [30] and McElroy et al. [33]. The average number of CNVs detected in the HapMap3 dataset (average CNV calls per genome = 102.2) (data not shown) and in the Chinese populations (average CNV calls per genome = 140.9) [29] was much higher. The number of novel CNVRs identified in Negritos was also lower (0.85 per genome) than those previously reported [29][30][32][33]. This is expected, as we excluded all small CNVs <10 kb from our analyses in this study (these comprised ~30.8% of the total CNVs identified). Moreover, as more populations are genotyped, the CNV map becomes more saturated and consequently fewer novel variants are observed. Collectively, we observed fewer CNVRs but more alleles (CNVs) in the Negrito genomes. In general, however, the CNV profile of Negrito genomes looks similar to those reported, especially by Ku et al. [32] in three other SEA populations, except for the X chromosome, which was not considered in our study.
The variation in the number of CNVs detected could be attributed to several factors: (i) the technology applied for CNV detection and its resolution; (ii) the level of stringency applied when calling CNVs; (iii) the algorithms applied when calling CNVs; and (iv) our exclusion of X-chromosome, telomeric and centromeric CNVs. The application of three independent CNV algorithms should minimize the false positive rate, as evidenced by Pinto et al. [38]. The poor validation rate for the small CNVs (<10 kb) could be attributed to several reasons: (i) a poor signal-to-noise ratio of the samples, leading to false positive calls by the algorithms; and (ii) inaccurate estimation of breakpoints for the small CNVs due to the limited probe density, leading to inaccuracy when identifying a precise CNV during qPCR validation. Therefore, precautions should be taken when analysing small CNVs. Collectively, our approach increases confidence in the quality of the calls, at the expense of fewer positives being called. Although the number of CNVs was relatively lower than in previous studies, this report is considerably more stringent and has a higher confidence level. We believe more novel CNVRs unique to Negritos could be identified if larger sample sizes were investigated. Interestingly, the CNVRs enriched with genes showed significantly more copy number gains. In addition, these genes are known to be involved in immunity and response to stimuli, as well as metabolic pathways. We speculate that the Negritos may have undergone processes of local adaptation and positive selection, which necessitated their expansions and eventual settlement in forest habitats. This hypothesis is supported by several previously reported studies [24,30,39]. However, the possibility of other processes, such as genetic drift due to random duplications or deletions, should not be ruled out [40]. Further investigations should be carried out to confirm the findings.

Figure 5. Gene Ontology and pathway analyses on the gene set within the Negrito-specific CNVs using PANTHER and DAVID. (a) PANTHER analysis suggests a major involvement of the genes harboring the population-specific CNVs in the immune system process and response to stimulus, as well as the metabolic process; (b) DAVID analysis suggests the involvement of the genes harboring the population-specific CNVs in the transcription and regulation of RNA metabolic processes. doi:10.1371/journal.pone.0100371.g005

Table 5. Pathways and biological processes of the genes underlying the population-specific CNVs in Negritos from Peninsular Malaysia.
Pathways/Biological functions | GO Term | Genes
Immune systems and processes | GO:0002376 | TNFRSF8, CSMD1, SH2D4B, TNFRSF1B, LRRC30
The health of Negritos has not been studied comprehensively for several decades and there are few recent publications [9]. However, early studies indicated that Negritos were under various medical stresses, especially a high prevalence of communicable diseases including malaria, tuberculosis, leptospirosis and various intestinal infections [41]. This could be attributed to their lifestyle in the early days, when hunting and gathering exposed them to a variety of transmissible diseases. Malnutrition has been reported to be common amongst aboriginal communities, especially among women [42]. Although there are no specific recent reports on the nutritional status of Negritos, our observations and direct communications with the Negrito tribes lead us to believe that the majority are undernourished. Although we cannot provide unequivocal evidence, it is conceivable that the biomedical stresses they experienced over the years resulted in the enrichment of selected genes in these Negrito-specific CNVs. This is the first study of genome-wide CNVs in the Negrito population from Peninsular Malaysia. We identified putative novel CNVs unique to the Negrito populations from Peninsular Malaysia. Although the small sample size does not allow us to perform functional and statistical analysis, our data were analyzed with stringent QC criteria and were then compared with a number of datasets including DGV, HapMap3 and the SGVP. As such, we consider our data to be highly reliable.
Population studies to catalogue the patterns, frequencies and distribution of CNVs in non-disease-based cohorts are crucial for a fundamental understanding of their impact on phenotypic diversity and disease susceptibility. Hence, characterization of more diverse populations is needed to improve the saturation of the CNV map of the human genome. To this end, we are continuing our investigations amongst the indigenous populations in Malaysia.

Summary
Our findings provide fundamental knowledge, different perspectives and insights to the genetic diversity of Negritos of Peninsular Malaysia. This can inform studies of local adaptation, natural selection and also potentially influence health programmes in the near future.