SL, GS, PCN, LF, ALH, BPW, NA, JH, EFK, GD, VB, KAR, YHR, MEF, SWS, RLS, and JCV conceived and designed the experiments. SL, GS, PCN, LF, ALH, BPW, NA, JH, EFK, GD, VB, YL, JRM, AWCP, MS, AT, DAB, KYB, TCM, JG, JB, and YHR performed the experiments. All authors analyzed the data. NA, GD, YL, TBS, and VB contributed analysis tools. SL, GS, PCN, LF, ALH, BPW, JH, EFK, GD, TBS, AT, YHR, MEF, SWS, RLS, and JCV wrote the paper.
The authors have declared that no competing interests exist.
Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
We have generated an independently assembled diploid human genomic DNA sequence from both chromosomes of a single individual (J. Craig Venter). Our approach, based on whole-genome shotgun sequencing and using enhanced genome assembly strategies and software, generated an assembled genome over half of which is represented in large diploid segments (>200 kilobases), enabling study of the diploid genome. Comparison with previous reference human genome sequences, which were composites comprising multiple humans, revealed that the majority of genomic alterations are the well-studied class of variants based on single nucleotides (SNPs). However, the results also reveal that lesser-studied genomic variants, insertions and deletions, while comprising a minority (22%) of genomic variation events, actually account for almost 74% of variant nucleotides. Inclusion of insertion and deletion genetic variation into our estimates of interchromosomal difference reveals that only 99.5% similarity exists between the two chromosomal copies of an individual and that genetic variation between two individuals is as much as five times higher than previously estimated. The existence of a well-characterized diploid human genome sequence provides a starting point for future individual genome comparisons and enables the emerging era of individualized genomic information.
Comparison of the DNA sequence of an individual human with the reference sequence reveals a surprising amount of difference.
Each of our genomes is typically composed of DNA packaged into two sets of 23 chromosomes; one set inherited from each parent whose own DNA is a mosaic of preceding ancestors. As such, the human genome functions as a diploid entity with phenotypes arising due to the sometimes complex interplay of alleles of genes and/or their noncoding functional regulatory elements.
The diploid nature of the human genome was first observed as unbanded and banded chromosomes over 40 years ago [
Over the past decade, with the development of high-throughput DNA sequencing protocols and advanced computational analysis methods, it has been possible to generate assemblies of sequences encompassing the majority of the human genome [
The ongoing analyses of these DNA sequence resources have offered an unprecedented glimpse into the genetic contribution to human biology. The simplification of our collective genetic ancestry to a linear sequence of nucleotide bases has permitted the identification of functional sequences to be made primarily through sequence-based searching alignment tools. This revealed an unexpected paucity of protein coding genes (20,000–25,000) residing in less than 2% of the DNA examined, suggesting that alternative transcription and splicing of genes are equally important in development and differentiation [
Building on the existing genome assemblies, numerous initiatives have explored variation at the population level, in particular to generate markers and maps as a means of understanding how sequence variation evolves and can contribute to phenotype. The initial drafts of the two human genomes provided an excess of 2.4 million SNPs [
The ability to generate a diploid genome structure via haplotype phasing for the HapMap samples is limited by the SNPs that were genotyped and their spacing. By using LD measures, it was possible to identify diploid blocks of DNA averaging 16.3 kb for Caucasians (CEU), 7.3 kb for Yorubans (YRI), and 13.2 kb for grouped Han Chinese and Japanese (CHB+JPT) [
To understand fully the nature of genetic variation in development and disease, indeed the ideal experiment would be to generate complete diploid genome sequences from numerous controls and cases. Here we report our endeavor to fully sequence a diploid human genome. We used an experimental design based on very high quality Sanger-based whole-genome shotgun sequencing, allowing us to maximize coverage of the genome and to catalogue the vast majority of variation within it. We discovered some 4.1 million variants in this genome, 30% of which were not described previously, furthering our understanding of genetic individuality. These variants include SNPs, indels, inversions, segmental duplications, and more complex forms of DNA variation. We used the variant set coupled with the sequence read information and mate pairs to build long-range haplotypes, the boundaries of which provide coverage of 11,250 genes (58% of all genes). In this manner we achieved our goal of the construction of a diploid genome, which we hope will serve as a basis for future comparison as more individual genomes are produced.
The individual whose genome is described in this report is J. Craig Venter, who was born on 14 October 1946, a self-identified Caucasian male. The DNA donor gave full consent to provide his DNA for study via sequencing methods and to disclose publicly his genomic data in totality. The collection of DNA from blood with attendant personal, medical, and phenotypic trait data was performed on an ongoing basis. Ethical review of the study protocol was performed annually. Additionally, we provide here an initial foray into individualized genomics by correlating genotype with family history and phenotype; however, a more extensive analysis will be presented elsewhere.
The donor's three-generation pedigree is shown in
(A) Three-generation pedigree showing the relation of ancestors to the study DNA sample. The donor is identified in red. (B) Cluster analysis based on genotype information from 750 SNPs to infer the ancestry of the HuRef donor. The figure shows the proportion of membership of the HuRef donor (yellow) to three pre-defined HapMap populations (CEU = Northern and Western Europe, YRI = Yoruban, Ibadan, Nigeria, and JPT+CHB = Japanese, Tokyo, and Han Chinese, Beijing). The results indicate that the HuRef donor clusters with 99.5% similarity to the samples of northern and western European ancestry.
(A) HuRef donor G-banded karyotype. (B) Spectral karyotype analysis.
The assembly, herein referred to as HuRef, was derived from approximately 32 million sequence reads (
Clone Insert Library Types and Reads Used for HuRef Genome Assembly
HuRef is a high-quality draft genome sequence as evidenced from the contiguity statistics (
Summary of HuRef Assembly Statistics and Comparison to the Human NCBI Genome
Genomic variation was observed by two approaches. First, we identified heterozygous alleles within the HuRef sequence. This variation represents differences in the maternal and paternal chromosomes. In addition, a comparison between HuRef and the National Center for Biotechnology Information (NCBI) version 36 human genome reference assembly, herein referred to as a one-to-one mapping, also served as a source for the identification of genomic variation. These comparisons identified a large number of putative SNPs as well as small, medium, and large insertion/deletion events and some major rearrangements described below. For the most part, the one-to-one mapping showed that both sequences are highly congruent with very large regions of contiguous alignment of high fidelity thus enabling the facile detection of DNA variation (
The one-to-one mapping to NCBI version 36 (hereafter NCBI) was also used to organize HuRef scaffolds into chromosomes. HuRef scaffolds were only mapped to HuRef chromosomes if they had at least 3,000 bp that mapped and the scaffold was mostly not contained within a larger scaffold. With the exception of 12 chimeric joins, all scaffolds were placed in their entirety with no rearrangement onto HuRef chromosomes. The 12 chimeric regions represent the misjoining of a small number of chimeric scaffold/contigs by the Celera Assembly [
The NCBI autosomes are on average 98.3% and 97.1% represented by runs and matches, respectively, in the one-to-one mapping to HuRef scaffolds (
The Y chromosome is only 59% covered by the one-to-one mapping because comparisons between repeat-rich chromosomes are difficult to produce. In addition, the Y chromosome is more poorly covered because of the difficulties in assembling complex regions with a sequencing depth of coverage only half that of the autosomal portion of the genome. The X chromosome coverage with HuRef scaffolds is 95.2%, which is typical of the coverage level of autosomes (mean 98.3% using runs). However, it is clear that the X chromosome has more gaps, as evidenced by the coverage with matches (89.4%) compared with the mean coverage of autosomes using matches (97.1%). The overall effects of lower sequence coverage on chromosomes X and Y are clearly evident as a sharp increase in the number of gaps per unit length and shorter scaffolds compared to the autosomes (
Note that the autosomes have more contiguous sequence with fewer gaps than chromosomes X and Y, probably because the sex chromosomes have half the read depth of the autosomes and share extensive sequence similarity with each other.
Since NCBI, WGSA, and HuRef are all incomplete assemblies with sequence anomalies, assembly-to-assembly mappings also reflect issues of completeness and correctness. We compared three sets of chromosome sequences to evaluate this issue (see
The HuRef assembly and the one-to-one mapping between the HuRef genome and the NCBI reference genome resulted in the identification of 5,061,599 putative SNPs, heterozygous indels, and a variety of multi-nucleotide variation events (see
HuRef consensus sequence (in red) with underlying sequence reads (in blue). Homozygous variants are identified by comparing the HuRef assembly with NCBI reference assembly. Heterozygous variants are identified by base differences between sequence reads. SNP = single nucleotide polymorphism; MNP = multi-nucleotide polymorphism, which contains contiguous mismatches.
The Application of Distinct, Independent, Filtering Methods on the Detection Rate of SNPs, Heterozygous Indels, and Complex Variants Identified from the HuRef Assembly
Identification of Variants Found within the HuRef-NCBI One-to-One Assembly Map (Internal HuRef-NCBI map) and Those Variants in HuRef Sequence Not Aligned to NCBI (External HuRef-NCBI Map)
The one-to-one mapping of HuRef to NCBI produced approximately 150 Mb of unaligned HuRef sequence inclusive of partially mapped and nonmapped HuRef scaffolds. Within this unaligned HuRef sequence, we identified 233,796 heterozygous variants including SNPs, indels, and complex variants after application of the same filters described above (see
In addition to the aforementioned filtering approach, a small fraction (∼1%) of the 693,941 putative homozygous insertion/deletion variants were subsequently characterized as heterozygous variants. This was accomplished by finding exact matches of 100-bp sequence 5′ and 3′ of the insertion point sequence and the deletion sequence in both HuRef scaffolds and unassembled reads. This fraction of heterozygotes is likely to be a conservative estimate of the total number of true heterozygotes (see below). The alternate alleles of these heterozygous variants were primarily found (96% of the time) in scaffolds less than 5,000 bp long or in unassembled reads. This highlights the value of small scaffolds and unassembled reads in defining the variant set in an assembled genome and suggests that these elements are a rich source of genomic variation. Therefore, subsequent to the removal of the variants by read-based filtering (19% mentioned above) and the recategorization as heterozygous variants (1% above), the remaining variants included approximately equal numbers of insertion (275,512) and deletion (283,961) alleles and 90 inversions as outlined in
In summary, using the combined identification and filtering approaches, it was possible to identify an initial “raw” set of 5,775,540 variants, from which we generated a higher-confidence set of 4,118,889 variants, of which 1,288,319 variants are novel relative to current databases (dbSNP).
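The counts quoted in this report can be cross-checked with simple arithmetic. The following is a minimal sketch, using only figures stated in the text; the dictionary keys and variable names are illustrative and are not taken from the analysis pipeline.

```python
# Variant classes in the higher-confidence HuRef set, as quoted in the text.
variant_counts = {
    "SNP": 3_213_401,
    "block substitution": 53_823,
    "heterozygous indel": 292_102,
    "homozygous indel": 275_512 + 283_961,  # insertion + deletion alleles
    "inversion": 90,
}

total = sum(variant_counts.values())          # higher-confidence set size
non_snp_events = total - variant_counts["SNP"]
non_snp_share = non_snp_events / total        # share of events that are not SNPs
retained_share = total / 5_775_540            # fraction of the raw set retained

print(f"total variants: {total:,}")
print(f"non-SNP share of events: {non_snp_share:.1%}")
print(f"raw set retained after filtering: {retained_share:.1%}")
```

The class counts sum to the 4,118,889 variants of the higher-confidence set, and the non-SNP share of events rounds to the 22% reported.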
To examine sequence diversity in the genome, we estimated nucleotide diversity using the population mutation parameter θ [
Across all autosomal chromosomes, the observed diversity values for SNPs and indels are 6.15 × 10⁻⁴ and 0.84 × 10⁻⁴, respectively. When restricted to coding regions only, θSNP = 3.59 × 10⁻⁴ and θindel = 0.07 × 10⁻⁴, indicating that 42% of SNPs and 91% of indels have been eliminated by selection in coding regions. The strong selection against coding indels is not surprising, because most will introduce a frameshift and produce a nonfunctional protein. Our observed θSNP falls within the range of 5.4 × 10⁻⁴ to 8.3 × 10⁻⁴ that has been previously reported by other groups [
Our observed θindel (0.84 × 10⁻⁴) is approximately 2-fold higher than the diversity value of 0.41 × 10⁻⁴ that was reported from SeattleSNPs (
Values of θindel are consistent among the chromosomes (
This is most likely an under-estimate of the true diversity, because a fraction of real heterozygotes were missed due to insufficient read coverage.
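As a worked check of the coding-region comparison, the fraction of variation removed by selection can be taken as one minus the ratio of coding to genome-wide diversity. The sketch below uses only the rounded θ values quoted in the text, and the function name is illustrative. With rounded inputs the indel figure comes out near 92% rather than the reported 91%, presumably because the published figure was computed from unrounded estimates.

```python
def fraction_eliminated(theta_genome: float, theta_coding: float) -> float:
    """Fraction of variants removed in coding regions relative to the genome-wide rate."""
    return 1.0 - theta_coding / theta_genome


# Rounded diversity values quoted in the text (autosomes vs. coding regions).
snp_eliminated = fraction_eliminated(6.15e-4, 3.59e-4)
indel_eliminated = fraction_eliminated(0.84e-4, 0.07e-4)

print(f"SNPs eliminated in coding regions:   {snp_eliminated:.1%}")
print(f"indels eliminated in coding regions: {indel_eliminated:.1%}")
```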
The SNP variants identified in the HuRef genome include more homozygous variants than would be expected from population-based studies (compare ratios of heterozygous SNP:homozygous SNP in
Modeling the Occurrence of Heterozygous to Homozygous Variant in a Shotgun Assembly
We also modeled the inter/intraindividual genome comparison using directed resequencing data from SeattleSNPs data (see
In an attempt to explain the discrepancy in the heterozygous to homozygous indel ratio (
A further explanation for the overabundance of homozygous indels is the error-prone nature of repeat regions. Using a subset of genes (55) completely sequenced by SeattleSNPs, we found that 28% of the potential 92 HuRef homozygous indels overlap with indels in these genes, as opposed to the 75% confirmation rate for homozygous SNPs described earlier. When one categorizes the repeat status of a homozygous indel, a higher confirmation rate (46%) is seen for indels excluded from regions identified by RepeatMasker or TandemRepeatFinder. The confirmation rate for an indel in a transposon or tandem repeat region is much lower at 16%. Therefore, indels in nonrepetitive loci have a higher probability of authenticity than indels in repeat regions.
The ratio of SNPs to indels is lower in the HuRef assembly than what is observed by the SeattleSNPs data (
Summary of Variant Types Identified in the HuRef Genome Assembly
We identified in the HuRef assembly 263,923 heterozygous indels spanning 635,314 bp, with size ranges from 1 to 321 bp. The characteristics of the indels we detected, their distribution of sizes <5 bp, and the inverse relationship of the number of indels to length are similar to previous observations [
Distributions of indel lengths for heterozygous (A, C) and homozygous (B, D) indels, shown for 1–100 bp (A and B) and in greater detail for 1–20 bp (C and D). Note that heterozygous indels range from 1–321 bp and homozygous indels from 1–82,711 bp; however, for both polymorphism types more than 47% of indel events are single-base. Even-length indels also appear to be overrepresented.
There are 6,535 homozygous indels that are at least 100 bases in length for which both flanks of the indel can be located precisely on HuRef and NCBI assemblies. These comprise 3,431 insertions uniquely occurring on HuRef, totaling 2.13 Mb, and 3,104 deletions, totaling 1.82 Mb, found only on NCBI (
Note that the numbers of insertion and deletion events are similar but that there are more long insertions than long deletions.
Repetitive Elements in the Complete HuRef Assembly, Homozygous Insertions and Deletions Were Identified Using RepeatMasker
To evaluate the accuracy and validity of SNP calling from the sequencing reads, the donor DNA was interrogated using hybridization-based SNP microarrays: the Affymetrix Mapping 500K Array Set, which targets 500,566 SNP markers, and the Illumina HumanHap650Y Genotyping BeadChip, which targets 655,362 SNPs. The Affymetrix array experiment was performed twice to provide a technical replicate for genotyping error estimation, and 0.12% of genotype calls were discordant. Of the 92,144 assays with an annotation in dbSNP that overlap between the two different platforms, 99.87% were concordant (0.13% discordant). Thus, the discordance rate between platforms was similar to that between Affymetrix technical replicates. Genotype calls that were discordant between technical replicates or between the Affymetrix and Illumina platforms were excluded from further analysis. This resulted in 1,029,688 nonredundant SNP calls from the two genotyping platforms, which were then compared to the HuRef assembly and to the single nucleotide variants extracted from the sequencing data. Of these, 943,531 genotypes (91.63%) were concordant between the genotyping platforms and the HuRef assembly (
Concordancy in SNP Genotyping Validation Comparing Independent Genotype Calls Using Affymetrix 500K, Illumina HumanHap650Y in Comparison with Sequence from the HuRef Assembly
Discordant Calls in SNP Genotyping Validation Using Affymetrix 500K, Illumina HumanHap650Y in Comparison with Sequence from the HuRef Assembly
Model of the false-negative rate of heterozygous SNP detection found on Affymetrix or Illumina genotyping platforms in relation to the number of supporting reads found in the HuRef assembly at these loci. The observed false-negative rate of detected heterozygous SNPs in the HuRef assembly closely follows the modeled rate given a Poisson model. The predicted false-negative error is based on thresholds requiring that at least 20% of the reads, and a minimum of two reads, support the minor allele. The increase in false-negative error at 11× coverage arises because three reads are required to call the minor allele at 11×–15× coverage, whereas two reads suffice at 4×–10× coverage. The additional read changes the binomial distribution and increases the false-negative error (See
Distribution plot of number of underlying reads (average number of reads = 8.8) in HuRef heterozygous SNPs confirmed by the Affymetrix and Illumina genotyping platforms. This is compared to a distribution (average number of reads = 5.2) for SNP detected by the platforms but missed in the HuRef assembly.
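The effect of the minor-allele thresholds on the false-negative rate can be illustrated with a small binomial calculation. This is a sketch under stated assumptions, not the model used in the study: it conditions on a fixed read depth, assumes each read samples either allele with probability 0.5, and ignores sequencing error. The function names are hypothetical.

```python
from math import comb


def min_minor_reads(depth: int) -> int:
    """Reads needed for the minor allele: at least 20% of reads, two reads minimum."""
    return max(2, (depth + 4) // 5)  # integer form of ceil(depth / 5)


def false_negative_rate(depth: int) -> float:
    """Probability that a true heterozygote fails the threshold at a given depth.

    Assumes the count of reads from one allele is Binomial(depth, 0.5); the site
    is called only if both alleles are seen in at least `min_minor_reads` reads.
    """
    k = min_minor_reads(depth)
    called = sum(comb(depth, x) for x in range(k, depth - k + 1))
    return 1.0 - called / 2**depth


for depth in (4, 10, 11, 15):
    print(depth, min_minor_reads(depth), round(false_negative_rate(depth), 4))
```

Under these assumptions the rate falls steadily from 4× to 10× coverage and then rises at 11×, where the required number of minor-allele reads steps from two to three, which is the behavior noted in the caption.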
Another possible form of error would be to erroneously call a truly homozygous position a heterozygous variant. Of the 65,337 homozygote calls that were concordant between the Affymetrix and Illumina platforms, none were called as heterozygous in the HuRef assembly. Therefore, the upper bound for the false-positive rate is 0.0046% (one-tailed 95% confidence interval), and one would expect false-positive heterozygote calls approximately once every 22 kb from the upper bound of this confidence interval. However, this estimate may be lower than the genome-wide false-positive error, because it is based on the positions chosen by the microarray platforms, which tend to be biased away from repetitive, duplicated, and homopolymeric regions. Approximately three-quarters of the novel heterozygous SNPs (73%) and novel heterozygous indels (75%) are in a region identified by RepeatMasker, TandemRepeatFinder, or a segmental duplication. Therefore, approximately three-quarters of the novel heterozygous variants are in regions that are most likely underrepresented in the microarrays. Consequently, we cannot readily extrapolate the false-positive error determined from the microarrays to be the discovery rate of the HuRef variant set. The repetitive regions are likely to have a higher false-positive rate due to sequencing error and misassembly. Further, they are not represented in the current estimate of the false-positive rate. However, they also exhibit a higher rate of authentic variation.
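The quoted upper bound can be reproduced from the observation of zero false positives among 65,337 concordant homozygote calls. The sketch below assumes the exact one-sided binomial bound for zero observed events, which is the standard calculation for this situation; the text does not state which formula was used, and the function name is illustrative.

```python
def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate when 0 events are seen in n_trials.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


upper = zero_event_upper_bound(65_337)
bases_per_false_positive = 1.0 / upper

print(f"upper bound on false-positive rate: {upper:.4%}")
print(f"about one false positive per {bases_per_false_positive / 1000:.0f} kb")
```

The bound is close to the familiar approximation of 3/n for zero observed events, and its reciprocal gives the spacing of roughly one false-positive call per 22 kb.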
Homozygous and heterozygous insertions and deletions identified in the HuRef assembly were computationally validated by comparison to previously published datasets. As indicated in
Comparison of HuRef Heterozygous Indels to Indel Variants Identified from Other Studies
Comparison of HuRef Homozygous Indels to Indel Variants Identified from Other Studies
Next, the HuRef homozygous deletions were compared to three other sets of previously identified deletion polymorphisms [
We sought further evidence in support of the longest indels identified by the one-to-one HuRef–NCBI mapping. We focused on the 20 longest insertions (9–83 kb) and the 20 longest deletions (7–20 kb) and examined the presence of these large indels in the genomes of eight other individuals by identifying fosmid clones that map to these 40 loci (
We selected 19 non-genic heterozygous indels in a nonrandom manner, ranging in length from 1 to 16 bp, for experimental validation using PCR coupled with PAGE detection of allelic forms. We ensured that the read depth coverage was in an acceptable range (not greater than 15 reads), suggesting that these loci were not in segmental duplications and would therefore not produce spurious PCR amplification. Three Coriell DNA samples and HuRef donor DNA were examined, and 15 out of 19 PCR assays assessed generated results consistent with the positive and negative controls. The indels that yielded experimental data ranged from 1 to 8 bp in length. In four out of 15 indels, the heterozygote variant was identified in all four DNA samples, and in three out of 15, it was found only in the HuRef donor DNA. For the remaining eight out of 15 cases, the indels were differentially observed among the four DNA samples (
We selected 51 putative homozygous HuRef insertions in a nonrandom manner for validation in 93 Coriell DNA samples based on their proximity to annotated genes, their size range of 100–1,000 bp, the absence of transposon repeat or tandem repeat sequence, uniqueness in the HuRef genome, and the absence of any similarity to chimpanzee sequence. The experimental results (
In 22 (61%) confirmed experiments, the HuRef donor bears homozygous insertions in agreement with our computational analyses. There are four insertions in this set, among the 22, where the HuRef donor and all 93 Coriell DNA donors tested were homozygous for insertions. This suggests that these sequences were either not assembled in the NCBI human genome assembly or that the NCBI donor DNA sequenced had a rare deletion in these regions.
For the remaining 14 insertions (39%), the HuRef donor was heterozygous for the insertion instead of homozygous as was predicted by our indel detection pipeline. We searched for these alternative shorter alleles in the HuRef assembly and observed that two of the alternative alleles matched degenerate scaffolds and two matched singleton unassembled reads. These are sequence elements that are typically small or unassembled elements respectively, signifying that the assembly process selected one allele.
We note that many of the insertions tested (84%) are polymorphic in the Coriell panel tested, and although many are intronic, there are instances of UTR and exonic insertions whose impact on function may be more directly ascertained.
It has previously been shown that extended regions of high sequence identity complicate de novo genome assembly [
Copy number variants (CNVs) have been identified to be a common feature in the human genome [
Copy Number Variants Identified on the HuRef Sample
Numerous HuRef sequences that span the entire or partial scaffolds did not have a matching sequence in the NCBI genome. Some had putative chromosomal location assignments (e.g., sequences extending into NCBI gaps), whereas others were unanchored scaffolds with no mapping information. We selected sequences >40 kb in length with no match to the NCBI genome and identified fosmids (derived from the Coriell DNA NA18552) mapping to these sequences based on clone end-sequence data. The fosmids were then used as FISH probes with the aim of confirming annotated locations for anchored sequences and assigning chromosomal locations to unanchored scaffolds. Fosmids were hybridized to metaphase spreads from two different cell lines. At least 10 metaphases were scored for each probe, and a differentially labeled control fosmid was included for each hybridization. For 23 regions, there was no mapping information available from mate-pair data or the one-to-one mapping comparison. Of the remaining 26 regions, 24 had a specific chromosomal location assigned at the nucleotide level (
Sequences from the HuRef donor that had no match based on the one-to-one mapping or BLAST when compared to the NCBI Human reference genome were tested by FISH. Fosmids were used as probes and the experiments were run, using Coriell DNA, to confirm the localization of the contigs or to map contigs with no prior mapping information. Shown here are four representative results. (A) An insertion at 7q22 where the FISH confirmed the HuRef mapping, (B) FISH result confirming the mapping of a sequence extending into a gap at 1p21. (C) Localization of a contig with no prior mapping information to chromosomal band 1q42. (D) An example of euchromatic-like sequence with no prior mapping information, which hybridizes to multiple centromeric locations.
Haplotypes have more power than individual variants in the context of association studies and predicting disease risk [
The set of autosomal heterozygous variants described above (
The distribution of the number of other variants to which a given variant can be linked using sequencing reads only or using mated reads as well is shown. Linkage of variants based on individual sequencing reads is limited, regardless of sequence coverage beyond a modest level, but is substantially increased by the incorporation of mate pairing information. The size of the effect is considerably more than simply doubling read length, due to variation in insert size; consequently, benefits of increasing sequencing coverage drop off much more slowly.
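The gain from mate pairing can be illustrated with a toy calculation. The sketch below is an illustration only and not the haplotype assembly method used here: it assumes each read is represented by the set of heterozygous variants it covers, that a mate pair links every variant on either of its reads, and all identifiers (`r1`, `v1`, the function names) are hypothetical.

```python
from collections import defaultdict


def merge_mates(reads, mate_pairs):
    """Combine the variant sets of mated reads into one fragment per clone."""
    fragments, mated = [], set()
    for first, second in mate_pairs:
        fragments.append(reads[first] | reads[second])
        mated.update((first, second))
    # Reads without a mate remain as single-read fragments.
    fragments.extend(v for name, v in reads.items() if name not in mated)
    return fragments


def linked_partners(fragments):
    """For each variant, collect the other variants seen on the same fragment."""
    partners = defaultdict(set)
    for variants in fragments:
        for variant in variants:
            partners[variant] |= variants - {variant}
    return dict(partners)


# Toy data: three reads, two of which are mates from the same clone.
reads = {"r1": {"v1", "v2"}, "r2": {"v5"}, "r3": {"v2", "v3"}}
by_reads = linked_partners(reads.values())
by_mates = linked_partners(merge_mates(reads, [("r1", "r2")]))

print(sorted(by_reads["v2"]), sorted(by_mates["v2"]))
```

In this toy case the variant `v5` has no partners when reads are considered individually, but is linked to two other variants once the mate pair is taken into account.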
Using this dataset, haplotype assembly was performed as described in
(A) Reverse cumulative distribution of haplotype spans (bp) (N50 ∼ 350 kb). (B) Reverse cumulative distribution of variants per haplotype (N50 ∼ 400 variants).
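For readers unfamiliar with the N50 statistic quoted in this caption, the sketch below computes it under the conventional definition, which is assumed here: the length such that items of that length or longer account for at least half of the total. The function name and the example values are illustrative and are not HuRef data.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths (spans, scaffolds, or variant counts)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Illustrative haplotype spans in kb (made-up values).
example_spans_kb = [10, 20, 30, 40]
print(n50(example_spans_kb))
```

The same function applies to either panel: spans in base pairs for panel A, or variants per haplotype for panel B.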
Both internal consistency checks and comparison to HapMap data indicate that the HuRef haplotypes are highly accurate. Comparing individual clones against the haplotypes to which they are assigned, 97.4% of variant calls were consistent with the assigned haplotype. Moreover, the HuRef haplotypes were strongly consistent with those inferred as part of the HapMap project [
We accessed the 120 phased CEU haplotypes from HapMap and identified the subset of heterozygous HuRef SNP variants that also coincided with the HapMap data. For adjacent pairs of such variants that were in strong LD (
Haplotypes inferred from the HuRef data are strongly consistent with HapMap haplotypes. The probability in the HapMap CEU panel of the observed genotypes being phased as per the HuRef haplotypes is high for variants in strong LD (as measured either by D′ or
The restriction to variants in strong LD has no clear selection bias with respect to our inferred haplotypes. On the other hand, it provides only weaker confirmation for the HapMap phasing, since it is restricted to the easiest cases for phasing using population data—namely only those pairs of variants in strong linkage disequilibrium.
The lengths and densities of the inferred HuRef haplotypes described above are possible due to the use of paired end reads from a variety of insert sizes. Given the relatively simple means that were used for separating haplotypes, the high accuracy of phasing is likewise due to the quality of the underlying sequence data, the genome assembly, and the set of identified variants. The rate of conflict with HapMap with regard to variants in high LD can be further decreased by filtering the variants more aggressively (particularly excluding indels; unpublished data), although at the expense of decreasing haplotype size and density. It is also possible to improve the consistency measures described above by using more sophisticated methods for haplotype separation. One possibility we have explored is to use the solutions described above as a starting point in a Markov chain Monte Carlo (MCMC) algorithm. This produces solutions for which the fraction of high LD conflicts with HapMap is reduced by ∼30%. This approach has other advantages as well: MCMC sampling provides a natural way to assess the confidence of a partial haplotype assignment. Assessment of this and other measures of confidence is a topic for future investigation.
We used the generated haplotypes to view how well they span the current gene annotation. We were able to identify 84% (19,407 out of 23,224 protein coding genes) of Ensembl version 41 genes partially contained within a haplotype block and 58% of protein coding genes completely contained within a haplotype block. We note that in population-based haplotypes, denser sampling of SNPs in regions of low LD leads to reduction in the size of the average haplotype block [
The sequencing, assembly, and cataloguing of the variant set and the corresponding haplotypes of the HuRef donor provided unprecedented opportunity to study gene-based variation using the vast body of scientific literature and extensively curated databases like OMIM [
(A) The distribution of the OMIM genes in Ensembl version 41 protein coding genes that contain one or more SNP or indel in their coding and/or UTR regions. (B) A similar distribution for the variants found in coding and/or UTR regions for all Ensembl version 41 genes.
Understanding potential genotype-to-phenotype relationships will require many more extensive population-based studies. However, the complexities of assessing genotype–phenotype relationships begin to emerge even from a very preliminary glimpse of an individual human genome (
Genotypes for Some Traits in the HuRef Donor
In our preliminary analysis of the HuRef genome, we also identified some genetic changes related to known disease risks for the donor. For example, approximately 50% of the Caucasian population is heterozygous for the
We also found some novel changes in the HuRef genome for which the biological consequences are as yet unknown. For example, we found a 4-bp novel heterozygous deletion in
We have also been able to detect inconsistencies between detected genotypes in the donor's DNA and the expected phenotype based on the literature given the known phenotype of the HuRef donor. For example, the donor's
We describe the sequencing, de novo assembly, and preliminary analysis of an individual diploid human genome. In the course of our study, we have developed an experimental framework that can serve as a model for the emerging field of en masse personalized genomics [
The most significant technical challenge has been to develop an assembly process (points ii–v) that faithfully maintains the integrity of the allelic contribution from an underlying set of reads originating from a diploid DNA source. As far as we know, the approach we developed is unique and is central to the identification of the large number of indels less than 400 bp in length. We attempted de novo recruitment of sequence reads to the NCBI human reference genome, using mate pairing and clone insert size to guide the accurate placement of reads [
The genome assembly approach with allelic separation allows the detection of heterozygous variants present in the individual genome with no further comparison. The one-to-one mapping of our HuRef assembly against a nearly completed reference genome permits the detection of the remaining variants. These variants arise from sequence differences found within and also outside the mapped regions, where the precision of the compared regions is being provided by the genome-to-genome comparison [
While several new approaches for DNA sequencing are available or being developed [
We have been able to categorize a significant amount of DNA variation in the genome of a single human. Of great interest is the fact that 44% of annotated genes have at least one, and often more, alterations within them. The vast majority—3,213,401 events (78%) of the 4.1 million variants detected in the HuRef donor—are SNPs. However, the remaining 22% of non-SNP variants constitute the vast majority, about 9 Mb or 74%, of variant bases in the donor. Using microarray-based methods, we also detected another 62 copy number variable regions in HuRef, estimated to add some 10 Mb of additional heterogeneity. Given these potential sources of measured DNA variation, we can, for the first time, make a conservative estimate that a minimum of 0.5% variation exists between two haploid genomes (all heterozygous bases, i.e., SNP, multi-nucleotide polymorphisms [MNP], indels, [complex variants + putative alternate alleles + CNV]/genome size; [2,894,929 + 939,799 + 10,000,000]/2,809,547,336) namely those that make up the diploid DNA of the HuRef assembly. We also note that there will be significantly more DNA variation discovered in heterochromatic regions of the genome [
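The 0.5% figure can be checked directly from the three counts quoted in the bracketed expression above. The short Python sketch below simply repeats that arithmetic; the variable names are illustrative labels chosen here (the grouping of variant classes into the three terms is read off the text, not taken from any released analysis code).

```python
# Counts copied from the bracketed expression in the text. The variable
# names are illustrative labels, not terms defined by the authors.
first_term_bases = 2_894_929      # first count in the bracketed sum
second_term_bases = 939_799       # second count in the bracketed sum
cnv_bases = 10_000_000            # approximate CNV contribution (microarray estimate)
genome_size = 2_809_547_336       # denominator used in the text

variant_bases = first_term_bases + second_term_bases + cnv_bases
fraction = variant_bases / genome_size

print(f"{variant_bases:,} variant bases / {genome_size:,} bases = {fraction:.2%}")
```

The sum is 13,834,728 bases, or roughly 0.49% of the genome size used as the denominator, which rounds to the stated 0.5%.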
We had mixed success when attempting to find support for the experimentally determined CNVs in the HuRef assembly itself or the data from which it was derived. More than 50% of the CNVs overlapped segmental duplications, and these regions are underrepresented in HuRef, which complicated the analysis. We attempted to map the sequence reads onto the NCBI human genome and then identify CNVs by detecting regions with significant changes in read depth. However, we found significant local fluctuations in read depth across the genome, limiting the ability for comparison and suggesting that a higher coverage of reads may be required to use this approach effectively.
As we have emphasized throughout, a major difference of the genomic assembly we have described is our approach to maintain, wherever possible, the diploid nature of the genome. This is in contrast to both the NCBI and WGSA genomes, which are each consensus sequences and, therefore, a mosaic of haplotypes that do not accurately display the relationships of variants on either of the autosomal pairs. For BAC-based genome assemblies such as the NCBI genome assembly, the mosaic fragments are generally genomic clone size (e.g., cosmid, PAC, BAC), with each clone providing contiguous sequence for only one of the two haplotypes at any given locus. Moreover, there are substantial differences in the clone composition of different chromosomes due to the historical and hierarchical mapping and sequencing strategies used to generate the NCBI reference assemblies [
In contrast, for WGSA, the reads that underlie most of the consensus sequence are derived from both haplotypes. This can result in very short-range mosaicism, where the consensus of clustered allelic differences does not actually exist in any of the underlying reads. To address this issue, the Celera assembler was modified to consider all variable bases within a given window and to group the sequence forms supporting each allele before incorporation into a consensus sequence (see
Partial haplotypes can be inferred for an individual from laboratory genotype data (e.g., from SNP microarrays) in conjunction with population data or genotypes of family members. However, at least in the absence of sets of related individuals (e.g., family trios), it is difficult to determine haplotypes from genotype data across regions of low LD. We have shown that sequencing with a paired-end sequencing strategy can provide highly accurate haplotype reconstruction that does not share these limitations. The assembled haplotypes are substantially larger than the blocks of SNPs in strong LD within the various populations investigated by the HapMap project. In addition to being larger, haplotypes inferred in our approach can link variants even where LD in a population is weak, and they are not restricted to those variants that have been studied in large population samples (e.g., HapMap variants). We note that in addition to the implications for human genetics, this approach could be applied to separating haplotypes of any organism of interest—without the requirement for a previous reference genome, family data, or population data—so long as polymorphism rates are high enough for an acceptable fraction of reads or mate pairs to link variants.
There are several avenues for extending our inference of haplotypes. As noted, although the naive heuristics used here give highly useful results, other approaches may give even more accurate results, as we have observed with an MCMC algorithm. There are various natural measures of confidence that can be applied to the phasing of two or more variants, including the minimum number of clones that would have to be ignored to unlink two variants, or a measure of the degrees of separation between two variants. The analysis presented here provides phasing only for sites deemed heterozygous, but data from apparently homozygous sites can be phased as well, so we can tell with confidence whether a given site is truly homozygous (i.e., the same allele is present in both haplotypes) or whether the allele at one or even both haplotypes cannot be determined, as occurs as much as 20% of the time with the current dataset. Lastly, it should be possible to combine our approach with typical genotype phasing approaches to infer even larger haplotypes.
Our project developed over a 10-year period and the decisions regarding sample selection, techniques used, and methods of analysis were critical to the current and continued success of the project. We anticipated that beyond mere curiosity, there would be very pragmatic reasons to use a donor sample from a known consented individual. First and foremost, as we show in a preliminary analysis, genome-based correlations to phenotype can be performed. Due to the still rudimentary state of the genotype-phenotype databases, it can be argued that at the present time, DNA sequence comparisons do not reveal much more information than a proper family history. Even when a disease, predisposition, or phenotypically-relevant allele is found, further familial sampling will usually be required to determine the relevance. Eventually, however, populations of genomes will be sequenced, and at some point, a critical mass will dramatically change the value of any individual initiative, providing the potential for proactive rather than reactive personal health care. In a simple analogy, absent a family history, genealogical studies can now be quite accurate in reconstructing ancestral history based purely on marker-frequency comparisons to databases. Here, with a near-unlimited amount of variation data available from the HuRef assembly, we can reconstruct the chromosome Y ethno-geographic lineage (
The HuRef donor Y chromosome haplotype suggests descent from several European/US groups given the Y chromosome ethno-geographic markers. The haplogroup membership is R1b6, which includes individuals from the United Kingdom, Germany, Russia, and the United States, and is consistent with the self-reported family tree provided by the HuRef donor. The thick red line denotes the markers needed to trace the haplotype from the mapping of the chromosome Y markers to the HuRef genome. Data and figure from the Y Chromosome Consortium;
There are always issues regarding the generation and study of genetic data and these may amplify as we move from what are now primarily gene-centric studies to the new era where genome sequences become a standard form of personal information. For example, there are often concerns that individuals should not be informed of their predisposition (or fate) if there is nothing they can do about it. It is possible, however, that many of the concerns for predictive medical information will fall by the wayside as more prevention strategies, treatment options, and indeed cures become realistic. Indeed we believe that as more individuals put their genomic profiles into the public realm, effective research will be facilitated, and strategies to mitigate the untoward effects of certain genes will emerge. The cycle, in fact, should become self-propelling, and reasons to know will soon outweigh reasons to remain uninformed.
Ultimately, as more entire genome sequences and their associated personal characteristics become available, they will facilitate a new era of research into the basis of individuality. A better understanding of the complex interactions among genes, and between these genes and their host's personal environment, will be possible using these datasets composed of many genomes. Eventually, there may be true insight into the relationships between nature and nurture, and the individual will then benefit from the contributions of the community as a whole.
We used the assembled chromosome sequence of the human genome available as NCBI version 36. The gene annotation of this genome was provided by Ensembl (
200-μl aliquots of thawed, whole blood were processed using the MagAttract DNA Blood Mini M48 Kit and the MagAttract DNA Blood >200 μl Blood protocol on the BioRobot M48 Workstation running the GenoM-48 QIAsoft software (version 2.0) (Qiagen;
Phytohemagglutinin-stimulated lymphocytes from peripheral blood were cultured for 72 h with thymidine synchronization. G-banding analysis was performed on metaphase spreads from peripheral blood lymphocytes using standard cytogenetic techniques.
Spectral karyotyping was performed on metaphase spreads from cultured lymphocytes. SkyPaint probes were used according to manufacturer's instructions (Applied Spectral Imaging;
The Celera Assembler Software (
For this project we made specific modifications to the Celera Assembler to enable the grouping of reads into separate alleles when heterozygous variants were encountered. Instead of taking a column-by-column approach to determine the consensus sequence from a set of aligned reads, the region of variation was considered as a whole, defined as that between at least 11 bp nonvariant columns. In practice, variant regions would most frequently be single columns (SNPs), but the new algorithm only applied to longer regions. The reads spanning a variant region were split between alleles. An allele, for this purpose, was one or more spanning reads sharing an identical sequence for the variant region, and was considered confirmed if represented by two or more reads. Each allele was assigned a score equal to the sum of average quality values for the spanning portions of its reads. The highest-scoring confirmed allele was used for the consensus sequence. Alternate confirmed allele sequences were reported separately. As expected, there were usually two confirmed alleles in each region of sequence variation. Regions with more than two apparent confirmed alleles represented either collapsed repetitive sequence or a group of reads with systematic base calling error, rather than true genetic variation.
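The allele-grouping rule described above can be sketched in a few lines of Python. This is an illustration of the stated rule only, not the Celera Assembler code: the function name `call_alleles`, the input layout (one sequence and one list of quality values per spanning read), and the tie-breaking by sequence are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def call_alleles(spanning_reads, min_reads=2):
    """Group spanning reads into alleles for one variant region.

    spanning_reads: list of (region_sequence, quality_values) pairs, one per
    read that spans the whole variant region (hypothetical input layout).
    Returns (consensus_allele, alternate_confirmed_alleles).
    """
    # Reads sharing an identical sequence for the region form one allele.
    groups = defaultdict(list)
    for seq, quals in spanning_reads:
        # Average quality over the spanning portion of the read; an empty
        # portion (assumed possible for a deletion) contributes 0.
        groups[seq].append(mean(quals) if quals else 0.0)

    # An allele is confirmed if represented by two or more reads; its score
    # is the sum of the per-read average quality values.
    scores = {seq: sum(avgs) for seq, avgs in groups.items()
              if len(avgs) >= min_reads}
    if not scores:
        return None, []

    # Highest score first; ties broken by sequence for determinism (assumption).
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[0], ranked[1:]


reads = [
    ("ACGT", [40, 40, 40, 40]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACT", [20, 20, 20]),
    ("ACT", [20, 20, 20]),
    ("AGGT", [50, 50, 50, 50]),  # seen once, so not confirmed
]
consensus, alternates = call_alleles(reads)
print(consensus, alternates)
```

In this toy input the singleton read is ignored despite its high quality, because confirmation is based on read count; more than one entry in `alternates` would correspond to the case the text attributes to collapsed repeats or systematic base-calling error.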
The set of The Institute for Genome Research (TIGR) BAC ends [
We used open-source software (
For each one-to-one mapping we determined three levels: matches, runs, and clumps. A match is a maximal high-identity local alignment, usually terminated by indels or sequence gaps in one of the assemblies. Runs may include indels, and are monotonically increasing or decreasing sets of matches (linear segments of a match dot plot) with no intervening matches from other runs on either axis. Clumps are similar to runs but allow small intervening matches/runs (such as small inversions) to be skipped over. The total number of base pairs in matches is a measurement of how much of the sequence is shared between assemblies. Within a run, the number of base pairs in each assembly is different, because indels are allowed among matches in the run. These could be gaps that are filled in one assembly but not the other, polymorphic insertions or deletions, or artifactual sequence. Runs span regions in both assemblies that have no rearrangements with respect to each other, providing a direct measure of the order and orientation differences between a pair of assemblies. Clumps provide a similar measure of rearrangement but allow for small differences that may be due to noise or polymorphic inversions. Remaining sequence may be unique to one assembly or the other, but some will also be large repetitive regions without good one-to-one mapping but present in some copy number in both assemblies. Apparently unique sequence may also represent some form of contaminant.
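As a rough illustration of how runs relate to matches, the Python sketch below chains matches that are adjacent on both axes and share an orientation. It is a simplified reading of the definition above, not the comparison software used in the study: the `Match` record, its field names, and the use of start coordinates for ordering are assumptions, and clump formation (skipping small intervening matches) is not modeled.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    ref_start: int   # start on the reference-assembly axis (hypothetical field)
    qry_start: int   # start on the other assembly's axis (hypothetical field)
    length: int
    forward: bool    # True if both assemblies are in the same orientation


def group_into_runs(matches: List[Match]) -> List[List[Match]]:
    """Chain matches into runs: consecutive on both axes, same orientation."""
    by_ref = sorted(matches, key=lambda m: m.ref_start)
    # Rank of each match along the query axis (matches assumed distinct).
    qry_rank = {m: i for i, m in enumerate(sorted(matches, key=lambda m: m.qry_start))}

    runs: List[List[Match]] = []
    for m in by_ref:
        if runs:
            prev = runs[-1][-1]
            step = 1 if m.forward else -1  # increasing or decreasing diagonal
            if prev.forward == m.forward and qry_rank[m] - qry_rank[prev] == step:
                runs[-1].append(m)  # no intervening match on either axis
                continue
        runs.append([m])
    return runs


a = Match(0, 0, 100, True)
b = Match(150, 160, 100, True)    # small indel between a and b, same run
c = Match(300, 900, 50, False)    # start of an inverted segment
d = Match(400, 850, 50, False)
e = Match(500, 400, 100, True)    # separated from b by the inversion
runs = group_into_runs([a, b, c, d, e])
print([len(r) for r in runs])
```

Each run corresponds to one linear segment of the dot plot; the differing gaps between `a` and `b` on the two axes show why the base-pair span of a run can differ between assemblies.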
We determined an initial set of potentially chimeric scaffolds by finding those that contained more than one clump of at least 5,000 bp relative to NCBI version 36. By mapping all HuRef and Coriell fosmid mate pairs to NCBI human reference genome and to HuRef, we assessed whether mate pair constraints were violated at the potentially chimeric junctions. Accordingly, we split 12 scaffolds.
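The initial screen described above amounts to counting large clumps per scaffold. A minimal Python sketch follows, assuming clumps are available as (scaffold identifier, clump length) pairs; the function name and input layout are hypothetical, and the subsequent mate-pair check is not modeled.

```python
from collections import Counter


def potentially_chimeric(clumps, min_clump_bp=5000):
    """Return scaffolds having more than one clump of at least min_clump_bp.

    clumps: iterable of (scaffold_id, clump_length_bp) pairs (assumed layout).
    """
    large = Counter(sid for sid, bp in clumps if bp >= min_clump_bp)
    return sorted(sid for sid, n in large.items() if n > 1)


clumps = [
    ("scaf_1", 120_000), ("scaf_1", 7_500),   # two large clumps: flagged
    ("scaf_2", 90_000), ("scaf_2", 1_200),    # second clump too small
    ("scaf_3", 5_000), ("scaf_3", 5_000),     # exactly at the threshold: flagged
]
candidates = potentially_chimeric(clumps)
print(candidates)
```

Scaffolds flagged this way are only candidates; in the procedure described above they were split only where mate pair constraints were also violated at the junction.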
DNA variants were characterized by alignment of sequencing reads in the HuRef assembly and by comparison of regions of difference in the one-to-one HuRef to NCBI reference genome map. The contribution of each sequence read to a single position in the HuRef consensus was evaluated both during and after the assembly process to identify positions that contain more than one allele. This process identified heterozygous SNPs and indel polymorphisms, and typically two or more reads were required for the initial identification of an alternate allele. Homozygous SNPs and MNPs were identified when (respectively) single or multiple contiguous loci differed in the one-to-one mapping, and all underlying HuRef reads supported one allele. Finally, homozygous insertion or deletion loci were identified where the HuRef assembly had or lacked sequence relative to the NCBI assembly, respectively. These were commonly referred to as homozygous indels unless it was relevant for analysis purposes, computational or experimental, to refer to a homozygous insertion or deletion as a way of indicating presence or absence of the sequence, respectively, in the HuRef assembly.
DNA variations were identified by examining the base changes within the HuRef assembly multialignment and between the HuRef assembly and the NCBI reference human genome. A total of 5,061,599 SNPs and heterozygous variants were identified initially, after which filters were applied to eliminate erroneous calls. For a potential SNP, each read supporting that SNP was considered, and if the QV was <15 at the putative SNP position in the read, then the read was considered invalid and was discarded as evidence for that particular variant. We also observed that deletions were overcalled at the beginnings and ends of reads, and insertions were overcalled at the ends of reads (
Subsequent to the quality value and read location filtering the remaining variants were inspected for the percentage, number, and directionality of reads supporting the alternate alleles. Additionally these variants were inspected for the total number of reads in their assembled locus and the repeat sequence status (transposon and tandem repeat). Transposon repeats were identified using the RepeatMasker program (
Manual inspection showed that some neighboring variants identified within the one-to-one mapping of HuRef to the NCBI genome reference would be more precisely represented as one larger variant after realignment. To address this, we identified the problematic regions by clustering SNPs within 2 bp of each other or any non-SNP variants within 10 bp of another variant. For these variable regions, we recalled the variant(s) using the variant calling algorithm developed as part of the consensus sequence generation found in the Celera assembler.
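The clustering rule can be sketched as a single-linkage grouping over sorted variant positions. The Python below is an illustrative sketch under stated assumptions, not the pipeline used in the study: variants are reduced to a (position, kind) pair, distance is measured between start positions only (variant length is ignored), and the function name is hypothetical.

```python
def cluster_variants(variants, snp_gap=2, other_gap=10):
    """Group variants that should be recalled together.

    variants: list of (position, kind) pairs, kind being "SNP" or anything
    else for non-SNP variants (assumed representation).
    Two SNPs are linked if within snp_gap bp; any pair involving a non-SNP
    variant is linked if within other_gap bp. Returns clusters of size > 1.
    """
    variants = sorted(variants)
    parent = list(range(len(variants)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    max_gap = max(snp_gap, other_gap)
    for i, (pos_i, kind_i) in enumerate(variants):
        for j in range(i + 1, len(variants)):
            pos_j, kind_j = variants[j]
            gap = pos_j - pos_i
            if gap > max_gap:
                break  # sorted input: nothing further can be in range
            both_snps = kind_i == "SNP" and kind_j == "SNP"
            if gap <= (snp_gap if both_snps else other_gap):
                parent[find(j)] = find(i)

    clusters = {}
    for i, v in enumerate(variants):
        clusters.setdefault(find(i), []).append(v)
    return [c for c in clusters.values() if len(c) > 1]


example = [(100, "SNP"), (105, "SNP"), (108, "indel"),
           (200, "SNP"), (201, "SNP"), (300, "SNP"), (305, "SNP")]
regions = cluster_variants(example)
print(regions)
```

Note that the two SNPs at 100 and 105 are too far apart to be linked directly but end up in one region because both lie within 10 bp of the indel at 108; the SNPs at 300 and 305 remain separate calls.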
Homozygous insertion/deletions were filtered in the same manner as SNPs and heterozygous variants. All variants that were not confirmed by two or more reads were eliminated, as were those that did not fulfill minimal requirements of at least one spanning mate pair, and that the inserted sequence on the HuRef assembly or deleted sequence on the NCBI assembly not contain any ambiguous bases
We estimate the population mutation parameter (θ) [
Two individuals of European ancestry were randomly selected from the SeattleSNPs data (
We developed a statistical model based on our assembly read coverage in the single diploid genome and on the filtering criteria used for calling high confidence variants. We assumed that chromosomes containing each of the two alleles are equally likely to be sampled and that allele loci are independent. At a given heterozygous locus, the probability of observing both alleles in at least
A number of heterozygous indels between 1 and 20 bp were manually selected for experimental validation by verifying trace quality in the region of the indel, read coverage depth, and repeat sequence status. In order to detect heterozygous indels from the HuRef assembly, we ran PCR-amplified genomic DNA on PAGE to look for homoduplex and heteroduplex bands. Large insertions and deletions were also recognized by this process.
Primers were designed by centering the targeted indel to produce amplicons 150–250 bp in length with the melting temperatures of these amplicons ranging between 70 °C and 86 °C. PCR for polymorphism analysis was carried out in 10-μl volume reactions containing 30 ng of purified genomic DNA, 1× PCR buffer, 20 μM deoxynucleoside triphosphates, 2 mM MgCl2, 8% glycerol, 0.18 μM primers, and 0.0375 U AmpliTaq Gold DNA polymerase. Post-amplification treatment of each sample involved digestion with shrimp alkaline phosphatase (0.5 U) and exonuclease I (1.76 U) for 45 min at 37 °C, 15 min at 50 °C, with heat inactivation for 15 min at 72 °C.
PAGE was carried out at room temperature for 4 h at 650 V (constant) in a standard vertical gel measuring 1 mm thick, 20 cm wide, and 30 cm long (apparatus Model SG-400–20, CBS Scientific Company Inc,
Fifty-one apparent homozygous insertions in the HuRef assembly were selected based on assembly structure (appropriate read depth coverage and supporting mate pair evidence), their proximity to annotated genes, and their size. The insertion sequences were from 100 to 1,200 bp with few repeat sequences, and no detectable alignments to human (NCBI 36) or chimpanzee [
The amplicons were classified according to theoretical melting temperatures (
2.0 μl of PCR product was combined with 5.0 μl of diluted loading dye (Invitrogen) and run on a 2.0% agarose gel, containing ethidium bromide. Gels were run for 45 min at 90 V and imaged using a Gel Doc and Quantity One Software (Bio-Rad Laboratories). Gel images were manually evaluated for the presence or absence of expected products.
Segments of the human genome that were found exclusively in either HuRef or NCBI version 36 represent potential misassemblies or genuine variations. In order to distinguish between these possibilities, we attempted to confirm the existence of the largest one-to-one HuRef–NCBI indels in a collection of fosmid clones, derived from eight individuals (see
Haplotypes of heterozygous variants were inferred using a greedy heuristic with iterative refinement of the initial solution.
The HuRef sample was genotyped in duplicate on each of the GeneChip Human (500K) Mapping
The
The HuRef sample was genotyped using the Sentrix HumanHap650Y Genotyping BeadChip according to the manufacturer's instructions. All chips were scanned using the Sentrix Bead-Array reader and the Sentrix Beadscan software application. The results from the BeadChip were analyzed for CNV content using QuantiSNP as previously described [
The Agilent human genome CGH array contains 244,000 60mer probes on a single slide. The experiment was run using 2.5 μg of genomic DNA for Cy3/Cy5 labeling for each hybridization, with a standard dye-swap experimental design. DNA sample NA10851 was used as a reference. The slides were scanned at 5-μm resolution using the Agilent G2565 Microarray Scanner System (Agilent Technologies;
CGH was performed using the Nimblegen human genome CGH array. The array contains 385,000 isothermal probes yielding a median spacing of 6 kb across the human genome. The experiment was performed as previously described [
FISH analysis was performed to find the location of DNA segments present in the HuRef DNA but either missing or represented by gaps in HuRef assembly. The FISH analysis was performed as previously described [
PAGE detection of an 8-bp indel (GATAAGTG/--------) in three Coriell DNA samples (lane 1 = NA05392, lane 2 = NA05398, lane 3 = NA07752, and lane 4 = HuRef donor DNA). Note the detection of two bands signifying the presence of two allelic forms in individual NA05392 and HuRef and the short and long alleles in individuals NA05398 and NA07752, respectively.
Note the increased occurrence of variants at the beginning and end of reads. The relative positions at which the rate of variant identification increased were used as thresholds, and reads calling a variant outside these thresholds were removed as positive evidence for the presence of that particular variant. This approach led to a significant improvement in the quality of the variant sets and was used as part of the variant filtering process.
The dbSNP distribution indicates which “raw” variants have been previously reported in dbSNP. The intersection point of these two distributions at lower values determines the minimum minor allele percentage threshold with which variants could be filtered to improve their quality using dbSNP as a guide. These threshold values were deemed to be 25% for SNPs and 20% for indels. Indels are referred to separately as insertions and deletions depending on whether the shorter or longer form, respectively, was used in the HuRef consensus sequence. Ultimately these variant loci are all determined heterozygous indels as indicated in
Matches are maximal high identity local aligned segments with no indels. Runs are sequentially adjacent matches with no intervening matches from other genomic sequences with the possibility of indels. Clumps are the same as runs but allow small intervening matches/runs to be skipped over in addition to indels (small is a settable parameter). This allows for example small inversions not to interrupt a longer alignment. All subsequent analyses (i.e., variant detection and analysis) discussed in the manuscript were performed using HuRef scaffolds. N50 is the scaffold length such that 50% of all base pairs are contained in scaffolds of this length or larger.
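The N50 definition given above translates directly into code. The following Python sketch is illustrative only; the function name and the toy lengths are chosen for this example and are not taken from the HuRef assembly statistics.

```python
def n50(lengths):
    """Return the length L such that scaffolds of length >= L hold at least
    50% of all base pairs."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]   # 32 bp in total
print(n50(toy_lengths))
```

In the toy example the two 8-bp scaffolds already account for 16 of the 32 bases, so the N50 is 8.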
Matches are maximal local identical alignments with no indels. Runs are monotonically increasing sets of matches, with indels allowed, that have no intervening matches. The values in the table denote the percentage of the NCBI chromosome found in matches or runs, counting only alignments that contain bases (i.e., excluding ambiguous bases and gaps, Ns).
Fosmid clones were mapped to the sites of large insertions that are predicted to occur uniquely on HuRef (insertions) or NCBI (deletions) as described in the
I/I provides the number of Coriell DNA samples that are homozygous for the inserted sequence, heterozygous (I/N), and homozygous for no insertion (N/N). The Hardy-Weinberg
This genome-wide view attempts to illustrate the wide spectrum of DNA variation in the diploid chromosome set of an individual human, J. Craig Venter. The genome sequence is displayed on a nucleotide scale of approximately 1 Mb/15 mm. The background of the chromosome tracks shows an approximate correspondence of features from the chromosome cytogenetic map. The different data tracks are grouped into two major sections: a representation of a current set of transcription units and a set of summary plots for different variation features at sequence level.
For each DNA strand (forward and reverse), each mapped gene is shown at genomic scale and is color-coded according to the presence of transcript isoforms (see Gene Variants panel on figure key). A total of 54,253 transcript isoforms were mapped. The genes are given a minimum length of 10 kb for display purposes at this level. The largest transcript isoform for each gene that is between 2.5 kb and 250 kb and has at least five exons is shown, in an additional pair of tracks, expanded to a resolution close to 100 kb/15 mm. Due to their high gene density, the resolution is smaller for Chromosomes 17, 19, and 22 at approximately 80 kb/15 mm.
In these expanded tiers, exons are depicted as black boxes and introns are color coded according to a set of Gene Ontology categories (GO,
Below the reverse strand annotations track there is the copy number variation (CNV) track. Here, results from four different experimental platforms (Affymetrix, Illumina, Agilent, and Nimblegen) determine regions where a CNV gain or loss was detected, shown as green or red boxes, respectively. Nonoverlapping haplotype blocks are distributed into nine tracks, using distinct colors to enhance visibility. The longest block at each given locus is drawn as a yellow box with a red outline to distinguish it from the rest. A summary of the variation features defined for all the haplotype blocks is shown as a gray-scale gradient. Alternating color gradient tracks display count densities for homozygous SNPs, heterozygous SNPs, multi-nucleotide polymorphisms (MNPs), insertion/deletion polymorphisms, and complex forms of variation. The last two gradients contain count densities for tandem repeats and transposon repeats, respectively. All these color gradients were produced using a 5-kb sliding window.
The figure was generated with “gff2ps” (
The GenBank (
The authors would like to thank Dr. Roderic Guigo for discussion on generating the HuRef genome poster and Dr. Douglas Smith (Agencourt Bioscience) for producing and making publicly available fosmid end sequence data from six HapMap individuals. Dr. Victor A. McKusick provided many helpful discussions during this study. Mr. Adam Resnick acquired trace files from the trace archive and reprocessed both Celera and Joint Technology Center traces. Mr. Justin Johnson submitted the scaffold and chromosome consensus sequences to GenBank, and Dr. Sarah Shaw Murray provided thoughtful comments on the manuscript. We would like to thank Ms. Jasmine Pollard and Mr. Matthew LaPointe for their assistance in creating the figures and the genome poster, respectively. The authors wish to thank The Centre for Applied Genomics at The Hospital for Sick Children including the Cytogenetics laboratory for technical assistance.
comparative genomic hybridization
grouped Han Chinese and Japanese
copy number variant
Caucasian
bacterial artificial chromosome
fluorescence in situ hybridization
linkage disequilibrium
multi-nucleotide polymorphism
quality value
short interspersed nuclear element
single nucleotide polymorphism
whole-genome shotgun assembly
Yoruban
References 105 and 106 were added at the proof stage and so are cited out of order in the text. During the review process we became aware of a recently published paper on haplotype assembly that deserves mention for its relatedness to our haplotype separation approaches [