Genome-Wide Profiling of Structural Genomic Variations in Korean HapMap Individuals

Background Structural genomic variation study, along with microarray technology development has provided many genomic resources related with architecture of human genome, and led to the fact that human genome structure is a lot more complicated than previously thought. Methodology/Principal Findings In the case of International HapMap Project, Epstein-Barr various immortalized cell lines were preferably used over blood in order to get a larger number of genomic DNA. However, genomic aberration stemming from immortalization process, biased representation of the donor tissue, and culture process may influence the accuracy of SNP genotypes. In order to identify chromosome aberrations including loss of heterozygosity (LOH), large-scale and small-scale copy number variations, we used Illumina HumanHap500 BeadChip (555,352 markers) on Korean HapMap individuals (n = 90) to obtain Log R ratio and B allele frequency information, and then utilized the data with various programs including Illumina ChromoZone, cnvParition and PennCNV. As a result, we identified 28 LOHs (>3 mb) and 35 large-scale CNVs (>1 mb), with 4 samples having completely duplicated chromosome. In addition, after checking the sample quality (standard deviation of log R ratio <0.30), we selected 79 samples and used both signal intensity and B allele frequency simultaneously for identification of small-scale CNVs (<1 mb) to discover 4,989 small-scale CNVs. Identified CNVs in this study were successfully validated using visual examination of the genoplot images, overlapping analysis with previously reported CNVs in DGV, and quantitative PCR. Conclusion/Significance In this study, we describe the result of the identified chromosome aberrations in Korean HapMap individuals, and expect that these findings will provide more meaningful information on the human genome.


Introduction
A study involving analysis on the structure of a genome is very important because it can describe individual genetic differences and discover genetic factors that can affect phenotypes such as disease susceptibility. With the recent discovery of copy number variation [1,2], scientists have realized that individual genetic differences are a lot more varied and complicated than previously expected, and studies on how these variations affect the risk of disease have been ongoing [2,3,4,5]. Recently, with the advancement of next generation sequencer (NGS) and the fast development of computational analysis technologies such as massive paired-end mapping and Pindel [6,7,8], CNV research has shown big progress. Also, a very accurate high-resolution mapping of human genomic variations is expected to be completed when the 1000 Genomes Project (www.1000genomes.org) is finished. In the case of microarray used in CNV study, high-density chip is being developed continually, and consequently, the numbers of CNVs discovered are increasing [9]. Although the use of NGS provides exact CNV breakpoint and abundant genomic resources, it is still very expensive to use and analysis of huge raw data is very complicated. On the other hand, with microarray, genomic resources from a huge number of individuals can be attained at a relatively cheap price, which also benefits from various programs developed for analysis and are currently in use. Chromosome aberration is defined by a change in the chromosome structure or number including duplication, loss of heterozygosity (LOH), inversion and translocation. Chromosome aberration is usually found in cancer cells and is reported to be related with many other diseases [10,11,12].
The goal of International HapMap Project is to discover and provide all the variations in human genome, so as to promote the genetic study of complex diseases. Although Japan (JPN) and China (CHB) population are often combined into a single Asian sample in International HapMap Project, a recent study reported on the diverse characteristics between the two populations based on autosomal and Y-chromosomal markers [13]. He et al. claimed that conclusions derived from studies conducted on the combined Asian (ASN) population, JPN and CHB should be interpreted with caution. In addition, the authors discussed that the Korean population is distinct from an Asian HapMap sample [13]. Despite having different regions, Yamaguchi-Kabata et al reported on a genetic differentiation observed in Japan [14]. Although Korea is geographically located between both China and Japan, Korean population does not necessarily share the same genetic background with either population. Korean government initiated the Korean HapMap project with goals of finding genetic variation unique to Koreans and genomic studying of Koreans. With international HapMap Project being such a large scale, a lot of genomic DNAs are required, and in order to meet this requirement, Epstein-Barr virus immortalized lymphocytes [lymphoblast cell line (LCLs)] are used instead of blood, which could result in genomic aberration [15]. Because of this, Redon et al., investigated chromosome aberration on cell lines using CGH array and SNP array [2] before undergoing human CNV discovery research. However, such aberrations may lead to wrong SNP genotype call which is a problem for the present project in achieving its goal. Also, when blood-extracted genomic DNA and LCL-extracted samples were comparatively analyzed for genomic alteration, similar variations were found in blood samples but more aberrations were observed in LCLs [15]. These results suggest that studies using LCLs sample need to undergo advance investigation for chromosome aberration. Especially in stem cell studies wherein chromosome aberrations need to be discovered fast and conveniently before being applied clinically [15].
In this study, we analyzed chromosome aberrations on Korean HapMap individuals using high-density SNP array, and we expected that the results would provide meaningful information for understanding human genome.

Results
In this study, we used Illumina HumanHap500 BeadChip (555,352 markers) and three types of program to identify chromosome aberrations including loss of heterozygosity (LOH), large-scale, and small-scale copy number variations in Korean HapMap individuals. In order to identify loss of heterozygosity (LOH), we used ChromoZone program on 90 samples and found 28 LOH regions from 22 samples with a total length of 141.7 mb and an average length of 1.6 mb. Among the 28 regions, only chr4:58000000-89000000 region from KOBB060838 sample was a copy-loss LOH, whereas the others were copy-neutral LOH (cnLOH) ( Table 1). We used cnvPartition, a program often used for large-scale aberration discovery to identify large-scale copy number variations including duplication and deletion over 1 mb, and found 35 aberrations from 17 samples. From the 35 chromosome aberrations, KOBB060876, KOBB060886, KOBB060889 and KOBB060890 showed complete duplication of chromosome 12, and KOBB060886 showed complete duplication of chromosome 5 (Table S1). Chromosome 12 repeatedly showed an entire duplication pattern, which was quite peculiar. Figure 1 displays identified loss of heterozygosity and large-scale copy number variation (.1 mb) plotted against B allele frequency (allelic intensity) and log R ratio (signal intensity). Heterozygosity disappeared in cnLOH without change in copy number ( Figure 1A). Entire duplication that appeared in 4 samples showed heterozygosity from plot of B allele frequency split into two lines, which increased all the signal intensity ( Figure 1B). KOBB060838 sample had a large-scale deletion (Size: 31.0 mb) in chr4:58000000-89000000 region. Also, in the plot of B allele frequency of the sample, LOH was found in the same region which decreased all signal intensity ( Figure 1C). Figure 1D shows large-scale duplication in chr8:3674807-5938053 (Size: 2.3 mb) of KOBB060879 sample, with split of heterozygosity clearly displayed in plot of B allele frequency and increase of signal intensity. Figure 2 displays the map of chromosome aberration including LOH, large-scale deletion, and duplication found in this study.
To find small-scale copy number variations, we simultaneously used both signal intensity and B allele frequency. In addition, we selected 79 samples after quality check and removed identified chromosome aberration regions for detection of reliable smallscale CNVs. As a result, we identified a total of 4,989 individual small-scale CNVs and 1,324 small-scale CNV regions (Table 2;  Table S2). Included in our findings were a total of 123 common small-scale CNVRs (frequency .1.0%) and 2,493 genes (Table  S3). Figure S1 shows the size distribution of identified CNVs and aggregated CNV regions in this study. Most of the identified CNVs and CNVRs were distributed within 1-50 kb in range. Figure 3 displays genetic map of the identified CNVs with one hundred and twenty three common small-scale CNVs represented by green triangles. Identified small-scale CNVs in this study were compared with previously reported CNVs from database of genomic variants (http://projects.tcag.ca/variation/), and were found to overlap with 3,741 of 4,989 small-scale CNVs (75.0%). After running 50% reciprocal overlapping analysis on 3,741 CNVs using alternative scan script in pennCNV, we found that 1,026 CNVs (27.4%) overlapped with CNVs in DGV. In order to validate the identified CNVs, we performed quantitative PCR for four randomly selected CNVs. Quantitative measurement values of all samples in four regions (chr4:162083343-162146977 and chr12:11411688-11434269 for duplication, chr1:147305744-147427061 and chr4:10006425-1000925 for deletion) matched the expected copy number through estimation by visual examination ( Figure S2; Figure S3).

Discussion
Chromosome aberration can lead to a variety of genetic disorders such as Down syndrome, Jacobsen syndrome, Hodgkin disease and Turner syndrome [16,17,18]. Such large-scale aberrations have been usually analyzed using low resolution cytogenetic method. With recent advancements on SNP array technology, it is now possible to detect previously elusive genetic variations with high resolution. Originally, SNP array was developed to obtain individual SNP genotype, but now, it has also been considered as a tool for analyzing extended homozygosity and copy number change [15,19]. This is because SNP array provides not only signal intensity for copy number but also B allele frequency, which is required for detecting LOH. We used Illumina HumanHap500 BeadChip (555,352 markers) on Korean HapMap individuals to investigate genome-wide profiling of structural genomic variations. As a result, we found 28 LOH regions from 22 samples, and 33 large-scale aberration including duplication and deletions from 19 samples. We also found 4,989 small-scale copy number variations with sizes ranging between 1 kb and 1 mb. One unusual finding was the fact that LOHs with sizes larger than 3 mb were mostly copy-neutral LOH (cnLOH). The loss of heterozygosity in cnLOH occurs without copy number change. In addition, cnLOH has been known to be developed by mitotic recombination or deletion and reduplication, and in fact, a previous study reported that it can affect the risk of tumorigenesis of MUTHY-associated polyposis (MAP) [20]. However, the underlying mechanisms of how cnLOH affects complex diseases are yet to be discovered.
Large-scale deletion and duplication is rarely found in healthy individuals. Simon-Sanchez et al. found 11 duplications and 2 deletions by blood sample analyses (n = 275) [15]. Except for one individual, the sizes of the duplications were smaller than 2.6 mb. On the other hand, 11 variations not initially detected in blood samples were found using LCLs analyses, with 4 variations occurring on an entire chromosome and other variations detected with larger sizes. In the present study, we were also able to discover variations from LCLs with larger sizes, such as 5 entire chromosome duplication, and similar to the findings of Simon-Sanchez et al, we were able to detect 10 variations with sizes larger than 3 mb. Some of the factors as to why such aberrations occur more in LCLs are suggested to be faulty use of donor tissue, the artifact of EBV immortalization process, creation during culture, etc [15]. Variations occurred using LCLs do not affect SNP genotype with huge samples, but may influence the accuracy of SNP genotype taking into account small sample size. Therefore, detecting abnormal individuals having chromosome aberration regions before SNP genotyping would be a good method to increase accuracy of SNP genotype. The present study took these problems into consideration and subsequently got rid of samples with aberrations larger than 1 mb. With these samples, we identified small-scale CNVs, and then removed abnormal regions from identified small-scale CNVs. As a result, we found 4,989 small-scale CNVs which overlapped at 75% when compared with previously found small-scale CNVs in Database of Genomic Variants (DGV). After comparing our results with two previous studies [21,22] that discovered structural variants from Koreans by whole-genome sequencing, 44.9% (106/236 variants) [21] and 22.4% (820/3,662 variants) [22] of our CNVs appeared to overlap. The previous studies ran high-resolution scanning on one human sample, and are very significant in that it suggested the direction of future CNV research. As to why overlapping rate between the two papers and our CNVs was not high, we considered the differences in samples and platforms used. The ideal study method would be to use NGS (Next Generation Sequencer) on CNV studies, but obtaining genomic resource from more than a thousand samples by NGS would require an astronomical amount of fund. Since genomic resources of many samples are required in genetic epidemiological study, chip-based strategy is still going to be used as a main study tool. After comparing our identified CNVs and segmental duplications (SDs), we found that many CNVs were located in SDs region (43.5%). This observation is similar to the results of other studies [2,23,24] wherein CNVs discovered were closely related with segmental duplication (SD). Also, result from our gene ontology (GO) analysis showed that cell adhesion displayed similar distributions from previous studies. However, sensory perception of smell and neurophysiological processes did not have a highly enriched region in the GO category (Table S4). In our study, we suggest a faster and more accurate method that is, the detection of chromosome aberrations of LCLs using advance studies and increase accuracy of acquired genomic resource information. Previously, array CGH and SNP array were usually used to detect chromosome aberration. However, a new method that uses Genome Viewer of BeadStudio software of Illumina HumanHap300 BeadChip (317K) to find large-scale CNVs including cnLOH was recently introduced [15]. Despite the accuracy and efficiency of this method, it requires visual examination from the user which may take longer time to analyze the data and may result to decreased efficiency. Based on these studies other programs, some of which have been utilized by the present study, that can automatically detect various chromosome aberrations, to suggest a fast and accurate method of discovering chromosome aberration have been developed. We expect this method will help find abnormal chromosomal regions rapidly and accurately for studies using stem cells or LCLs.
In summary, we genotyped Korean HapMap individuals using Illumina HumanHap500 BeadChip (555K), and used three programs to identify cnLOH, large-scale CNV and small-scale CNV. Findings from this study can be used for faster and more accurate detection of chromosome aberration, and the newly found genomic alterations will serve as important materials for understanding human genomic structure.

Subject collection and whole-genome SNP genotyping
All individuals included in this study were of Korean ethnic origin. The samples were obtained from Korean HapMap project  [25,26]. All donated samples for genetic tests after signing the informed consent form that was approved by the relevant institutional review board (The Korean HapMap Project Consortium). We used the Illumina Human-Hap500 BeadChip containing 555,352 markers (Illumina, Inc., USA) to obtain log R ratio (signal intensity) and B allele frequency (allelic frequency). Approximately 750 ng of genomic DNA, which was extracted from Epstein-Barr virus immortalized cell lines, was used to genotype each sample. The assay procedure has been described in our previous study [27]. Raw intensity and allelic data were normalized through five processes (outlier removal, translation correction, rotational correction, shear correction and scaling correction) using Illumina BeadStudio 3.2 software [19]. Also, in this study, a default cluster provided by Illumina encountered a problem where the value was different depending on ethnicity and experiment time, so we normalized all markers by automatic clustering. For an accurate CNV identification, we excluded samples with LRR_SD (standard deviation of log R ratio) value over 0.30 or samples with unusually high number of identified CNVs. The overall SNP genotyping call rate was 99.69%, which indicated that high-quality data was extracted for this study.
Identification of chromosome aberration including loss of heterozygosity (.3 mb) and large-scale copy number variation (.1 mb) In order to identify chromosome aberrations including loss of heterozygosity (.3 mb) and large-scale copy number variation (.1 mb), we used ChromoZone (Illumina, USA) and cnvPartition (Illumina, USA) programs. ChromoZone program was used to  Sample selection and identification of small-scale copy number variation (,1 mb) We used high quality samples to identify reliable small-scale copy number variations with call rate .99.0%, number of smallscale CNVs ,177 [(Average number of identified small-scale CNVs per sample (87.6) +1 standard deviation (89.3)], and standard deviation (SD) of log R ratio (LRR) ,0.30 being the criteria. However, we did not use the wave-adjust function for identifying CNV in pennCNV because the selected samples had good waviness factor values (WF value) (average of WF 6 standard deviation = 0.0160.03). In order to identify authentic small-scale CNVs, we used multiple factors including log R ratio (LRR), B allele frequency (BAF), marker distance, and population frequency of the B allele. After these parameters were extracted using Illumina BeadStudio 3.2 (Illumina, USA), individual small-scale CNVs were identified by pennCNV [28,29]. We downloaded all scripts and library files including hhall.hg18.pfb from the website (http://www.openbioinformatics.org/penncnv), and we performed the identification of small-scale CNVs by the described procedure in Kai et al [28].

Validation by quantitative PCR
To validate the existence of both the CNV region, we performed TaqMan real-time PCR on an ABI Prism 7900 sequence detection system (Applied Biosystems, Foster City, CA) for four randomly selected CNVs among the identified CNVs in this study. Specific probes were generated by Primer Express 2.0 (Table S5). The RNaseP gene (part number: 4403328) was co-amplified with the marker which was then used as an internal standard. Amplification reactions (10 ul) were carried out using 10 ng of template DNA, 16 TaqMan Universal Master Mix buffer (Applied Biosystems, Foster City, CA), 900 nM of each primer, and 250 nM of each fluorogenic probe. Thermal cycling was initiated with 2 min incubation at 50uC, followed by a first denaturation step of 10 min at 95uC, and then 40 cycles of 15 sec at 95uC and 1 min at 60uC. Three replicate reactions were performed for similar primer pairs and each copy number of individuals was calculated by Copy Caller v1.0 (Applied Biosystems, Foster City, CA) using comparative C T method.

CNV mapping
We used the karyoview of Ensemble for drawing the maps of the identified genomic variations including CNVs (http://apr2006. archive.ensembl.org/Homo_sapiens/karyoview). The input data of each variation were prepared following this format; chromosome, start position, end position and status (optional). The input data of all variants were fixed to ''paste file content'' menu before the map was drawn. Figure S1 Size distribution of identified copy number variations and aggregated copy number variation regions.