The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.
I have read the journal's policy and have the following conflicts: Paul Flicek is married to the deputy editor of
The availability of a high quality human genome assembly has revolutionized biomedical research. Genomics has now entered the realm of clinical genetics, with many groups using either whole genome sequencing
While the human reference assembly is the highest quality mammalian assembly available, it is not without shortcomings. The “finished” assembly
The GRC, which consists of The Genome Institute at Washington University, The Wellcome Trust Sanger Institute, The European Bioinformatics Institute, and The National Center for Biotechnology Information, is an international consortium with expertise in genome mapping, sequencing, and informatics. The goal of the GRC is to provide high quality genome assemblies that will allow a user to place any sequence greater than 500 bp into a chromosome context. While this report focuses largely on recent GRC advances concerning the human reference assembly, the GRC is also responsible for the mouse and zebrafish reference assemblies. Continued improvement of the human reference assembly is critical as we move towards an era of clinical and personal genomics. The reference genomes of mouse and zebrafish are similarly critical in light of their importance as model organisms and the significant investments made in creating community resources such as gene knockout collections.
Two major problems faced the GRC at the outset of this project: the decentralized nature of the Human Genome Project and the lack of a suitable data model for representing complex genomes. Much of the data underlying curation decisions had been neither captured nor standardized. The human reference assembly had never been submitted to the International Nucleotide Sequence Database Collaboration (INSDC)
Initial efforts at assembling the human genome were guided by the concept of “a golden path”
The GRC has addressed these problems by establishing common tools and standard operating procedures (SOPs) so that the genome assembly is now constructed in a regularized fashion. We have developed a single database to store all data underlying the genome assembly. Finally, we have developed a system to track individual regions that are under review. All of these data are made publicly available through our Web site (
Additionally, the GRC has formalized an assembly model (
The top panel shows an ideogram representation of the human genome. Sequences within the assembly are placed within containers known as assembly units. The primary assembly unit contains sequences for the non-redundant haploid assembly; this includes the scaffolds that make up the chromosome sequence as well as unplaced and unlocalized scaffolds that are thought to represent novel sequence (not shown in this figure). Alternate loci and patches are placed in separate assembly units to facilitate annotation. Note that the seven alternate scaffolds in the MHC region are each placed in a different assembly unit, as they are all alternative representations of the same region. Other alternate loci can be added to these assembly units at the next major release if they do not overlap the existing alternates. All patches are placed in the PATCHES assembly unit, and minor releases are cumulative, such that the latest minor release will contain all patches. The red triangles, yellow circles, and blue circles mark regions that contain additional sequences that are not given actual chromosome coordinates, but rather are given a chromosome context via alignment to the primary assembly. The red triangles represent alternate loci; these are sequences that provide an additional tiling path to the one given in the chromosome representation and are essential for representing structurally complex loci. The circles represent patch sequences; these are minor updates made to the assembly outside of the major build cycle. Yellow circles represent “fix” patches: regions of the chromosome assembly that will change with the next major assembly update. Blue circles represent “novel” patches: sequences that will be represented as new alternate loci in the next major assembly update.
Note: a region can point to more than one type of extra chromosomal sequence; for example, a region could point to an alternate locus and to a fix or novel patch.
We have also introduced the concept of a “minor” assembly update, in the form of genome patches. This mechanism provides users with timely access to genome improvements without inducing frequent changes to the coordinate system upon which assembly annotations are based. Because genome patches take the same form as alternate loci, the two forms of data can be managed in the same way.
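The assembly model described above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the GRC's actual schema: the class names (`Scaffold`, `AssemblyUnit`, `Assembly`), their fields, and the scaffold names and coordinates in `toy_assembly` are all hypothetical. It shows alternate loci and patches living in their own assembly units while receiving a chromosome context through a placement on the primary assembly.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ScaffoldKind(Enum):
    PRIMARY = "primary"
    ALT_LOCUS = "alternate locus"
    FIX_PATCH = "fix patch"
    NOVEL_PATCH = "novel patch"


@dataclass
class Scaffold:
    name: str
    kind: ScaffoldKind
    # (chromosome, start, end) on the primary assembly. For alternate loci
    # and patches this stands in for the alignment that gives them a
    # chromosome context; they carry no chromosome coordinates of their own.
    region: Optional[Tuple[str, int, int]] = None


@dataclass
class AssemblyUnit:
    name: str
    scaffolds: List[Scaffold] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: Dict[str, AssemblyUnit] = field(default_factory=dict)

    def add(self, unit_name: str, scaffold: Scaffold) -> None:
        """Place a scaffold in the named assembly unit, creating it if needed."""
        unit = self.units.setdefault(unit_name, AssemblyUnit(unit_name))
        unit.scaffolds.append(scaffold)

    def extra_sequences(self, chrom: str, start: int, end: int) -> List[Scaffold]:
        """Return alternate loci and patches whose region overlaps the query."""
        hits = []
        for unit in self.units.values():
            for scaffold in unit.scaffolds:
                if scaffold.kind is ScaffoldKind.PRIMARY or scaffold.region is None:
                    continue
                s_chrom, s_start, s_end = scaffold.region
                if s_chrom == chrom and s_start <= end and start <= s_end:
                    hits.append(scaffold)
        return hits


def toy_assembly() -> Assembly:
    """Build a toy assembly; every name and coordinate here is made up."""
    asm = Assembly("toy")
    asm.add("Primary Assembly", Scaffold("chr_example", ScaffoldKind.PRIMARY))
    # One region pointing to both an alternate locus and a fix patch.
    asm.add("ALT_UNIT_1", Scaffold("alt_example", ScaffoldKind.ALT_LOCUS, ("chrA", 1000, 5000)))
    asm.add("PATCHES", Scaffold("fix_example", ScaffoldKind.FIX_PATCH, ("chrA", 4000, 6000)))
    asm.add("PATCHES", Scaffold("novel_example", ScaffoldKind.NOVEL_PATCH, ("chrB", 100, 900)))
    return asm
```

Calling `toy_assembly().extra_sequences("chrA", 4500, 4600)` returns both the alternate locus and the fix patch, mirroring the note that a single region can point to more than one type of additional sequence.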
Major assembly updates will not follow a fixed schedule. In order to minimize the need for frequent re-annotation, major assembly updates will occur infrequently, once we have produced at least 100 fix patches or the patches affect >1% of the euchromatic sequence. The GRC will announce planned updates on its Web site at least 6 months in advance of any major assembly release. Additional, detailed information regarding major releases will be publicly announced via the Web site as data freeze dates approach. Minor assembly updates will be made quarterly.
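The release criteria above amount to a simple threshold check, sketched here. The two thresholds (at least 100 fix patches, or more than 1% of the euchromatic sequence affected) come from the text; the function name, its parameters, and the choice to measure "affected" sequence in bases are assumptions made for illustration.

```python
def major_release_warranted(
    fix_patch_count: int,
    euchromatic_bases_affected: int,
    euchromatic_bases_total: int,
) -> bool:
    """Return True if either stated trigger for a major update is met.

    Hypothetical helper: counting affected sequence in bases is an
    assumption, not something the text specifies.
    """
    if euchromatic_bases_total <= 0:
        raise ValueError("euchromatic_bases_total must be positive")
    enough_fix_patches = fix_patch_count >= 100
    # Integer arithmetic avoids floating-point edge cases at exactly 1%.
    over_one_percent = euchromatic_bases_affected * 100 > euchromatic_bases_total
    return enough_fix_patches or over_one_percent
```

Either condition alone is sufficient, so a small number of patches covering a large amount of sequence would trigger a major update just as 100 small fix patches would.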
We have produced a major release of the human reference assembly, GRCh37, which was submitted in June of 2009 to the INSDC (GCA_000001405.1), and four minor assembly updates, with the last patch, GRCh37.p4 (GCA_000001405.5), released in April 2011. Detailed information concerning genome assembly construction is on our Web site (
The top part of
(Top Panel) Issues for GRCh37, GRCh37.p1, and GRCh37.p2, broken down by type. Issue types are: Clone Problem: The issue is contained within a single clone. This may be a single nucleotide difference or a clone mis-assembly. Path Problem: There is evidence that the tiling path within a given region is incorrect and we will need to update the path. GRC Housekeeping: Changes used to help regularize the tiling path. Missing Sequence: Sequence that we can’t yet place on the assembly. Mapping studies are ongoing to help place these sequences. Variation: There is evidence to suggest that complex variation is complicating a region and an alternate allele may need to be produced. Gap: The issue concerns filling a gap. Unknown: Issue is still under investigation for classification. (Bottom Panel) Details for issue HG-2, a Path Problem. The representation in NCBI36 was a mixed haplotype. The tiling paths for NCBI36 and GRCh37 are shown. Blue clones are anchor clones that are in NCBI36, the GRCh37 chr4 path, and the GRCh37 alternate locus path. Red clones represent the UGT2B17 insertion path and dark gray clones represent the UGT2B17 deletion path. The light gray clone was not used in NCBI36, but was used in GRCh37 to complete the alternate locus.
While the model changes described above facilitated our assembly management and reporting, we also wished to investigate whether these updates would allow for improved genome analysis. To investigate this, we first tried to recover sequence identified as novel in a personal genome, the YH1 human assembly
We also wished to investigate the impact on alignment of next generation sequencing reads. We selected two samples from the 1000 Genomes Project
We envision the high quality reference assemblies generated by the GRC having a long-term role in biomedical research because they most accurately capture all forms of human genetic variation and facilitate investigation of human disease in model organisms. With this in mind, we have built a reference assembly infrastructure to support transparent curation and assembly production. We have also updated the assembly model so that it better represents our current understanding of genome structure and diversity. We will use this model to encompass new discoveries and ultimately capture all significant variation in the human population, as discovered through projects such as the 1000 Genomes Project. Additionally, we wish to engage the research and clinical communities to identify regions that require targeted effort and to incorporate information from groups performing detailed work on specific loci. The GRC can only be truly successful with community input. Users can report problems directly to the GRC via our Web page (
It is difficult to overstate the importance of the human reference assembly, even in the age of personal genomics. Given current sequencing and assembly technology, there is a clear need for a high quality reference that can represent structural diversity across all populations. Providing a representation of this diversity is critical for next generation sequence analysis. Even using an assembly with only three regions with alternative alleles, we show improved alignment quality and by extension variation calling, which is the primary product of personal genomics. More genomic alignment tools that can take the alternate representations into account need to be developed.
Understanding how genotype influences phenotype necessitates an accurate and complete picture of all loci in multiple populations. For many genomic regions, this can be denoted by a sequence with annotated SNPs and small indels, but other loci will require multiple sequence instances for complete representation. Some human loci, such as the 1q21 region, which remains misassembled in GRCh37.p2, are sufficiently complex that significant effort is needed to obtain even one correct sequence for the region. Additional work is required to sort out the haplotypes segregating among various populations, many of which contribute to phenotypes associated with multiple developmental disorders
While assemblies using next generation sequencing are beginning to approach the quality of long-read Whole Genome Shotgun assemblies
The GRC would like to acknowledge the following contributors to this project: David C. Schwartz, Jane Rogers, Mario Caccamo, Paul Kitts, Michael DiCuccio, Françoise Thibaud-Nissen, Avi Kimchi, Jonathan Mudge, Richard Clark, Andrew Dearlove, Michelle Smith, Britt Kilian, Karen McLaren, James Gilbert, Laurens Wilming, Darren Ware, Sharmin Begum, Karen Davey, Diana Kidger, Kim Brugger, Tony Gaige, and Jason Walker.
GRC, Genome Reference Consortium
INSDC, International Nucleotide Sequence Database Collaboration