Membership of The Mouse Genome Sequencing Consortium is provided in the Acknowledgments.
The author(s) have made the following declarations about their contributions: Conceived and designed the experiments: DMC LG DCS KLT CPP. Performed the experiments: DMC LG LWH MZ SG XS CJB RA JLC MD WH YK PM DM ZB ACM TG SZ BT KP CC MP JH RR DF JAL ZC TMGSC. Analyzed the data: DMC LG LWH MZ SG XS CJB DCS ZC EEE CPP. Contributed reagents/materials/analysis tools: DMC LG. Wrote the paper: DMC LG LWH MZ KLT EEE CPP.
The authors have declared that no competing interests exist.
A finished clone-based assembly of the mouse genome reveals extensive sequence duplication during recent evolution and rodent-specific expansion of certain gene families. Newly assembled duplications contain protein-coding genes that are mostly involved in reproductive function.
The mouse (
The availability of an accurate genome sequence provides the bedrock upon which modern biomedical research is based. Here we describe a high-quality assembly, Build 36, of the mouse genome. This assembly was put together by aligning overlapping individual clones representing parts of the genome, and it provides a more complete picture than previous assemblies, because it adds much rodent-specific sequence that was previously unavailable. The addition of these sequences provides insight into both the genomic architecture and the gene complement of the mouse. In particular, it highlights recent gene duplications and the expansion of certain gene families during rodent evolution. An improved understanding of the mouse genome and thus mouse biology will enhance the utility of the mouse as a model for human disease.
The mouse (
Given the critical role of the mouse as a model organism, it is particularly important to separate shared ancestral characteristics that have been conserved in the mouse and human since their divergence from derived characteristics that are unique to either lineage. Mouse and human genes whose coding sequences have scarcely changed since the last common ancestor and that have remained unduplicated in each lineage are the most likely to have retained their ancestral functions. In contrast, genes that have duplicated along the rodent lineage may contribute to derived traits that are less relevant to human biology, and are thus less appropriate models of human physiology and disease.
Genomic duplication and divergence is a primary source of functional innovation
In late 2002, we published a draft mouse genome assembly, referred to as the MGSCv3, of a single, inbred strain (C57BL/6J, or “B6”)
A careful cost/benefit analysis must be performed when approaching a genomic sequencing project. If lineage-specific biology is important, clone-based finishing of some form will be required.
Despite the great utility of the initial MGSCv3 assembly, the draft genome contained over 176,000 gaps and included entire regions whose positions and/or orientations in the assembly now appear to have been in error. The most serious limitation of the MGSCv3 is its almost complete lack of highly sequence-similar and recently segmentally duplicated regions
Here we report the completion of this effort and present a high-quality, largely finished clone-based genome assembly of the C57BL/6J strain of mouse, here referred to as Build 36. This new assembly includes 267 Mb of sequence (
The availability of finished sequence for human, and now mouse, enables more-complete surveys of protein-coding genes in both species. We now estimate that mouse and human reference genomes contain 20,210 and 19,042 protein-coding genes, respectively. A total of 2,185 mouse genes had been missing or substantially disrupted in the previous MGSCv3 assembly. The majority of these arise from rodent lineage-specific duplications, often (61%) embedded within segmentally duplicated regions that were recalcitrant to WGSA. Many of these mouse-specific genes may contribute to rodent-specific functions and, with their inclusion in the assembly, are now available for further investigation.
The mouse genome assembly (Build 36;
The mouse genome assembly (Build 36) was produced largely as described previously
To assess the accuracy of Build 36, the genome assembly was compared to several independent sources of data including a linkage map
Chromosomes are drawn to scale, with MGSCv3 to the left (green) and Build 36 to the right (purple). A female mouse provided the DNA for the MGSCv3, so no Y chromosome was available for this assembly.
The chromosomes of human Build 36 are painted with segments of conserved synteny ≥300 kb long with mouse MGSCv3 (left) and Build 36 (right). Colors indicate mouse chromosomes (see legend bottom right), while lines indicate orientation (top left to bottom right is direct, top right to bottom left is inverted). White regions are not covered by alignments forming a segment ≥300 kb. Red triangles are human centromeres. Note that all undirected blocks (regions of identical color) are identical between the two mouse builds except a region at the centromere of human Chromosome 9, which is itself an artifact in the MGSCv3 map. However, several areas of orientation change, some quite small, can be seen.
Parameter | MGSCv3 | Build 36 |
2.685 Gb | 2.661 Gb | |
2.475 Gb | 2.567 Gb | |
103.9 Mb | 17.1 Mb | |
17.8 Mb | 40.3 Mb | |
176,507 | 1,218 | |
<0.1% | 0.0494 | |
1.046 Gb | 1.118 Gb | |
460.1 Mb | 505.3 Mb | |
22,011 | 20,210 | |
n/a | 191 | |
12,845a | 15,187 | |
1.25%a | 1.27% | |
0.48%a | 0.87% |
Values for MGSCv3 protein-coding genes are taken from the gene catalogue used in the draft mouse genome publication
We identified a total of 334 chromosomal breakpoint intervals between human and mouse and refined the breakpoints to an average interval length of 335 kbp. We found that 50% (167/334) of the breakpoint intervals, and 28.7% of their sequence by base pair (32.2/111.9 Mbp), intersected with segmental duplications. This 6-fold enrichment is significant (
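The breakpoint figures quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the authors' analysis. The genome-wide baseline of 4.5–5% segmental duplication is an assumption taken from the Build 36 estimate quoted elsewhere in this paper, because the exact baseline behind the 6-fold figure is not stated here. The function name `fold_enrichment` is illustrative only.

```python
# Figures quoted in the text.
N_BREAKPOINTS = 334            # breakpoint intervals between human and mouse
N_IN_SEGDUPS = 167             # intervals intersecting segmental duplications
MEAN_INTERVAL_KBP = 335        # average refined interval length
SEGDUP_MBP = 32.2              # interval sequence overlapping segmental duplications
TOTAL_INTERVAL_MBP = 111.9     # total interval sequence


def fold_enrichment(observed_fraction: float, baseline_fraction: float) -> float:
    """Ratio of the observed fraction to a genome-wide baseline fraction."""
    return observed_fraction / baseline_fraction


fraction_by_count = N_IN_SEGDUPS / N_BREAKPOINTS
fraction_by_bp = SEGDUP_MBP / TOTAL_INTERVAL_MBP
implied_total_mbp = N_BREAKPOINTS * MEAN_INTERVAL_KBP / 1000

if __name__ == "__main__":
    print(f"by count: {fraction_by_count:.1%}")
    print(f"by base pair: {fraction_by_bp:.1%}")
    print(f"implied total interval length: {implied_total_mbp:.1f} Mbp")
    # ASSUMPTION: genome-wide segmental duplication content of 4.5-5%.
    for baseline in (0.045, 0.05):
        fold = fold_enrichment(fraction_by_bp, baseline)
        print(f"baseline {baseline:.1%}: {fold:.1f}-fold")
```

Under that assumed baseline the ratio falls on either side of 6, which is consistent with the enrichment reported. The by-base-pair fraction computed from the rounded inputs differs slightly in the last digit from the quoted 28.7%, presumably because the published value was derived from unrounded lengths.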
The revised Build 36 assembly contains 139 Mb of sequence that could not be aligned against, and thus appears to have been absent from, the previous MGSCv3 draft assembly. 108 Mb (77%) of this sequence consists of 119,000 repetitive elements (Table S7 in
Eighty percent of sequence added or corrected in the mouse genome assembly consists of segmentally duplicated regions or interspersed repeats. Most of these have now been ordered and oriented on a chromosome (
Interchromosomal (red) and intrachromosomal (blue) duplications (>95% identity and >10 kbp in length) are shown for both genome assemblies, with the requirement that pairwise alignments are shown for only those regions (Build 36) that are also confirmed by the WGS depth of coverage analysis (black vertical bars/ticks). Positions of the centromeres (acrocentric) are shown (purple) for the MGSCv3 build. Initial estimates predicted the amount of segmental duplication to be approximately 1.5–2% of the genome. Calculations performed using Build 36 suggested the amount is much higher, approximately 4.5–5%. In addition, >60% of duplicated sequences were unplaced in the MGSCv3. In Build 36, almost all are assigned to a chromosome
This is partially addressed in
The Build 36 assembly contains many genes that were absent, truncated, incomplete, or misassembled in the initial draft MGSCv3 genome sequence. As we describe below, the vast majority of these genes reside in segmentally duplicated regions. Using gene predictions for human and mouse from both NCBI
This process identified 20,210 high-quality protein-coding gene models in mouse and 19,042 such models in the human genome (
Simple 1∶1 orthologs correspond to genes that have remained intact and unduplicated since the last common ancestor of mouse and human. Using a recently developed phylogenetic approach
Counts of 1∶1 orthologs | 15,187 |
0.057 (0.024–0.11) | |
0.58 (0.46–0.75) | |
0.095 (0.043–0.18) | |
88.2% (79.4%–94.7%) | |
85.3% (80.6%–88.8%) | |
443 (283–706) | |
443 (283–706) | |
434 (276–693) | |
97.4% (99.4%–100%) |
Shown are median values and, in parentheses, lower and upper quartiles.
Only eight mouse genes with 1∶1 orthologs in human were entirely absent from the initial MGSCv3 assembly (see Table S8 in
It is thus clear that while MGSCv3 had provided a largely faithful representation of unduplicated 1∶1 orthologs, Build 36 provides across-the-board improvements to the quality of gene predictions. This greatly improved assembly now permits a more-complete understanding of rodent-specific genes. Of 2,185 Build 36 gene models that were substantially disrupted by missing or misassembled sequence in MGSCv3 (see
Mouse lineage-specific gene duplicates are shown in red, and all other genes are shown in blue. The large number of mouse-specific genes that are entirely missing, truncated, or otherwise disrupted in MGSCv3 underscores the value of the finished Build 36 assembly in understanding rodent-specific biology.
(A) The upper left-hand corner shows a dot-matrix view of the Build 36 Chromosome 5 (horizontal axis) aligned to the MGSCv3 Chromosome 5 (vertical axis). The triangle marks the portion of the chromosome shown in the zoomed-in view. The axes are in the same orientation. 1.5 Mb of sequence that was absent from MGSCv3 has been included in Build 36. This region contains 30
Of the ten gene families that have seen the greatest expansions over the mouse lineage, we find that all but two are associated with reproductive functions (
Gene Families | Mouse Chromosomes | Functional Category | Gene Counts (Mouse) | Gene Counts (Human) | Genes Overlapping MGSCv3 Gaps | Genes Absent from MGSCv3 |
5, 14 | Reproduction | 111 | — | 42 | 14 | |
7 | Reproduction | 90 | — | 52 | 23 | |
4, 5 | Reproduction | 90 | 22 | 55 | 1 | |
2,5,7,8,10,11,12,13,16,17,19 | Transcription regulation | 80 | 5 | 51 | 9 | |
X | Reproduction | 58 | — | 29 | 4 | |
12 | Immunity | 55 | 13 | 5 | 2 | |
5, 7, 10, 13, 14, 17 | Reproduction | 47 | — | 20 | 5 | |
Y | Reproduction | 55 | — | 55 | 55 | |
6 | Reproduction | 37 | — | 1 | 0 | |
13 | Reproduction | 35 | — | 3 | 0 | |
— | — |
Human orthologs for many of the most rapidly expanding mouse gene families cannot be readily identified, either because of gene loss or rapid sequence divergence.
Gene duplicates in the rodent lineage far outnumber those in the primate lineage (3,767 in mouse and 2,941 in human). In general, despite particularly fast rates of protein evolution
Evolutionary time is estimated using
Gene duplications have caused large expansions of gene families in primates as well, several of which were highlighted in the manuscript describing the finishing of the human genome
More rarely, a mouse gene may lack an apparent human ortholog simply because rapid evolution renders any similarity in their sequences undetectable. This is the case with human
The largest rodent-specific expansions have occurred among sperm-associated glutamate (E)-rich (
Members of the preferentially expressed antigen of melanoma (
Extensive duplications within two further gene families have been restricted to X and Y chromosomes (
We found that many genes in the four families described above—namely
The transcribed and functional portion of the mouse genome consists of noncoding as well as protein-coding genes. Hundreds of microRNA loci, for example, have been detected within recent mouse genome assemblies
Evidence for conserved transcription is apparent for only a small proportion of long mouse ncRNA sequences, in contrast to protein-coding genes. Of 3,051 well-documented mouse long ncRNA sequences
Rodent lineage-specific sequence includes regions that are copy number variable among laboratory mouse strains. Indeed, many of the largest rodent-specific gene families are known to be copy number variable among mouse strains
The mouse genome assembly (Build 36) is now of high fidelity and completeness, and its quality is comparable to, or perhaps better than, that of the reference human genome assembly. The finished mouse genome adds over 6% additional euchromatic sequence, much of it repetitive, but includes 1,259 mouse-specific genes that were missing or grossly misassembled in the draft. Improvements to the assembly should enhance many coordinated initiatives that are exploiting the utility of the laboratory mouse for understanding human biology and disease processes. For example, an international effort to establish baseline phenotypic measurements on the 40 most commonly used strains has provided a much needed platform upon which more complex phenotypes can be assessed
The original MGSCv3 mouse draft assembly proved comparatively cheap and easy to produce. A large number of other vertebrate genomes have been sequenced to similarly deep coverage, either as aids to model organism biology or to improve our understanding of the human genome. The cost to take a genome to an equivalent finished state is typically at least four times the cost of generating the draft assemblies using traditional Sanger sequencing. Nonetheless, it is clear from our analysis of the finished mouse genome assembly that draft WGSAs will always poorly reflect lineage-specific biology. This conclusion is also supported by analysis of both the dog
Using next-generation sequencing technology, the cost of generating several-fold coverage of a genome drops by several orders of magnitude; however, especially for large genomes, it is still not possible to generate a de novo assembly from the collection of such reads. While it is likely that de novo assembly of large genomes using next-generation sequencing technologies will be achieved relatively soon, it is unlikely that these assemblies will represent these complex, lineage-specific regions any better than WGSAs generated using traditional Sanger technology. We have seen little evidence from next-generation sequence assemblies of genomes or clones that segmental duplications can be adequately resolved with methods other than capillary sequencing of clones. For example, we recently completed an analysis of 96 clones that contained structural variants and segmental duplications; not surprisingly, the regions that remained unresolved by 454 sequence data were enriched in segmental duplications and large common repeats (Eichler EE, Kidd JM, Fulton RS, Chen L, Graves T, et al. unpublished data). Cost-effectiveness should not be the primary consideration for these regions. Studies of human disease and phenotypes in other organisms show conclusively that the content, copy number, and structure of these regions are important. Short-read, next-generation sequencing technology, while a significant advance, will not comprehensively capture all of this complex sequence structure. Obtaining large-insert clones for these regions is the key, but we need third-generation technology with longer read lengths to assemble these complex regions accurately. Long-read technology developments
The greatest improvements to the mouse assembly have been to regions that are replete with rodent lineage-specific duplicated sequence. Segmental duplications that were previously found at negligible levels now constitute almost 5% of the genome. Many of these duplications harbour multiple rodent-specific genes that show a strong bias towards reproductive function. This suggests a role for either adaptive forces or clonal selection in shaping the mouse genome. The availability of these mouse genes now allows their experimental investigation.
The comparison of two finished mammalian genomes has enabled the revision of comprehensive and reliable human and mouse protein-coding gene catalogues. The 75% of mouse genes that are in 1∶1 orthologous relationships with human genes are the most likely to have maintained ancestral function in both species, and are, therefore, most appropriately targeted as disease models. Phenotype data, mainly from knockouts, are already available for over 5,000 of these 15,187 genes
The shortcomings of the initial draft assembly are readily apparent now that a more-complete genome assembly is available. Undoubtedly these have led to incomplete or inaccurate understanding of some aspects of mouse biology. The availability of high quality genome sequence for the mouse will lead the way in dismissing some commonly held misconceptions and, more importantly, in revealing many previously hidden secrets of mouse biology.
Supplemental material and data for this paper including validated protein-coding and noncoding gene models can be found at:
Ninety-six percent of the clone-based sequence was derived from four centres, The Genome Center at Washington University in St. Louis, The Wellcome Trust Sanger Institute, The Broad Institute of Harvard and MIT, and The Genome Center at the Baylor College of Medicine (Figure S6 in
To ensure that base level quality of the assembled clones was high, we performed a quality assurance exercise. Each sequencing centre provided the assessing centre with the clone-based shotgun traces they had produced. The assessing centre then used their internal protocols to steal reads and assemble the final insert sequences. The two sequences were then aligned, and all differences were manually assessed by an independent third party. Differences found within SSRs were not counted as true differences. The overall base level error rate was determined to be 1 error per 50,000 bp, well below the accepted finishing standard of 1 error in 10,000 bp (Table S6 in
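For readers more used to Phred-scaled quality values, the two error rates quoted above can be converted with the standard relation Q = -10 log10(p). The sketch below only restates the figures from the text on that scale. The function name `phred_quality` is illustrative and is not part of any pipeline described in this paper.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


# Error rates quoted in the text.
OBSERVED_ERROR_RATE = 1 / 50_000    # measured in the quality assurance exercise
FINISHING_STANDARD = 1 / 10_000     # accepted finishing standard

if __name__ == "__main__":
    print(f"observed: Q{phred_quality(OBSERVED_ERROR_RATE):.1f}")
    print(f"standard: Q{phred_quality(FINISHING_STANDARD):.1f}")
```

On this scale the finishing standard corresponds to Q40, and the measured rate sits several quality units above it.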
The genome assembly is driven by a tiling path file (TPF). This provides information concerning clone (component) order as well as the location and characterization of gaps. Two methods were used to obtain clone order: alignment of clone end sequences to the MGSCv3 and clone order as obtained by the mouse fingerprint map
Using the alignments above, AGP files were generated using a program called contig_build (Cherry J, unpublished data). This algorithm takes a tiling path and a set of alignments and generates a contig sequence. It checks for internal consistency with respect to clone order on the TPF and the provided alignments. The generated switch points were selected based on the component overlaps. In a few cases, switch points were manually edited to exclude contaminant sequence or misassembled sequence in one of the components.
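The consistency check described above can be illustrated with a small sketch. This is not the contig_build program, which is unpublished. It assumes, for illustration only, that a tiling path is an ordered list of clone names and that the alignments are available as a set of clone pairs with a confirmed overlap. The function name `unsupported_joins` and the clone names are hypothetical.

```python
from typing import List, Set, Tuple


def unsupported_joins(
    tiling_path: List[str],
    overlaps: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return adjacent clone pairs on the tiling path that lack an overlap alignment.

    ASSUMPTION: an overlap is recorded as an unordered pair of clone names,
    so both (a, b) and (b, a) are accepted as support for a join.
    """
    missing = []
    for left, right in zip(tiling_path, tiling_path[1:]):
        if (left, right) not in overlaps and (right, left) not in overlaps:
            missing.append((left, right))
    return missing


if __name__ == "__main__":
    # Hypothetical clone names, for illustration only.
    path = ["cloneA", "cloneB", "cloneC", "cloneD"]
    aligned = {("cloneA", "cloneB"), ("cloneD", "cloneC")}
    print(unsupported_joins(path, aligned))
```

A join reported by such a check would either need a gap recorded in the tiling path or a revision of the clone order. How switch points are chosen within a confirmed overlap is not specified in the text and is not modelled here.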
To ensure inclusion of as much sequence as possible, the above assembled contigs were compared to the MGSCv3 and a combined assembly was generated essentially as previously described
This was produced essentially as described previously
Both Build 36 and the MGSCv3 were analyzed using RepeatMasker version open-3.1.3 with the following parameters: -w -s -no_is -cutoff 255 -frag 20000 -gff -species mouse
This was produced as described previously
Mouse sequence reads were obtained from the NCBI Trace Archive, quality clipped, and repeat masked using WindowMasker
In order to identify variation based on mate pair violations, the BLAST alignments described above were sorted by best hit. The top scoring hits for either end that were within 200 kb of each other were retained for further analysis. If multiple locations for a clone could be identified, the clone was not kept for the final analysis. We defined a placed read pair as “satisfied” if the calculated insert size was within three standard deviations of the mean. Additional information can be found in
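The two rules stated above, the 200 kb window for retaining end placements and the three-standard-deviation test for a satisfied pair, can be written down as a short sketch. It is a reading of the description rather than the code used in the study. The class `PlacedPair`, the function names, and the library mean and standard deviation in the usage example are all hypothetical.

```python
from dataclasses import dataclass

MAX_END_SEPARATION = 200_000  # 200 kb window quoted in the text


@dataclass(frozen=True)
class PlacedPair:
    """Best-hit positions of the two end reads of one clone on one chromosome."""
    clone_id: str
    left_pos: int
    right_pos: int

    @property
    def insert_size(self) -> int:
        return abs(self.right_pos - self.left_pos)


def is_retained(pair: PlacedPair) -> bool:
    """Keep pairs whose end placements lie within 200 kb of each other."""
    return pair.insert_size <= MAX_END_SEPARATION


def is_satisfied(pair: PlacedPair, mean: float, sd: float, n_sd: float = 3.0) -> bool:
    """A pair is satisfied if its insert size is within n_sd deviations of the mean."""
    return abs(pair.insert_size - mean) <= n_sd * sd


if __name__ == "__main__":
    # ASSUMPTION: illustrative library statistics, not values from the paper.
    library_mean, library_sd = 40_000.0, 2_000.0
    pairs = [
        PlacedPair("clone1", 1_000_000, 1_041_000),
        PlacedPair("clone2", 2_000_000, 2_090_000),
    ]
    for pair in pairs:
        if is_retained(pair):
            print(pair.clone_id, is_satisfied(pair, library_mean, library_sd))
```

Pairs that are retained but not satisfied are the candidates for mate pair violations. Clones with multiple candidate locations would be discarded before this step, as described in the text.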
Mouse and human gene models identified using either the Ensembl pipeline (release 43) or the NCBI pipeline (mouse Build36 v1 and human Build 36 v3) were obtained. Comparison of genomic coordinates allowed for the reconciliation of these two sets into a single gene catalogue (
The reconciled gene lists were quality assessed based on their predicted orthologous relationships as previously described
To determine genes that are missing from the MGSCv3, the Build 36 and the MGSCv3 assemblies were aligned to each other using BLAST
We have chosen not to compare the current and initial draft mouse gene catalogues, because gene annotations have benefitted from the many improvements in the availability of transcriptional evidence, gene prediction algorithms, and the heuristics used to evaluate these data. Instead, we determined the extent to which the initial MGSCv3 assembly could have supported the current mouse gene catalogue. We were thus interested in identifying disrupted Build 36 gene models whose corresponding MGSCv3 sequence was (i) not previously placed on chromosome scaffolds; (ii) previously dispersed among two or more different chromosomes, and/or were placed on both strands of a single chromosome; (iii) interdigitated, on the same strand, with sequence corresponding with an unrelated gene model; and (iv) entirely absent from this early assembly. We describe any such gene model as being “unmatched” in MGSCv3. These four unmatched criteria were applied in order. The remaining “matched” Build 36 gene models are contiguous and their exons do not overlap with other gene models on the same strand in both Build 36 and MGSCv3 assemblies. With few (65) exceptions, these gene models are placed on the same chromosome and strand in both assemblies.
For each Build 36 gene model, we then tabulated its exonic regions according to these four unmatched criteria. This allowed us to estimate the proportion of Build 36 exonic nucleotides that could have been predicted correctly in the early MGSCv3 assembly. A Build 36 gene model was deemed to be “substantially disrupted” (see Main Text: Mouse and Human Protein Coding Gene Repertoires) in MGSCv3 if greater than 25% of its exonic sequence falls into any of these four categories.
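The 25% rule above can be sketched as follows. This is one reading of the criterion, in which the exonic nucleotides assigned to the four unmatched categories are summed before comparison with the threshold. That reading assumes each nucleotide is assigned to at most one category, as the ordered application of the criteria suggests. The category labels and function names are hypothetical.

```python
from typing import Dict

# Hypothetical labels for the four unmatched criteria (i) to (iv).
UNMATCHED_CATEGORIES = ("unplaced", "dispersed", "interdigitated", "absent")
DISRUPTION_THRESHOLD = 0.25  # "greater than 25%" in the text


def disrupted_fraction(exonic_bp: int, bp_by_category: Dict[str, int]) -> float:
    """Fraction of a gene model's exonic nucleotides that are unmatched in MGSCv3."""
    if exonic_bp <= 0:
        raise ValueError("exonic_bp must be positive")
    unmatched_bp = sum(bp_by_category.get(name, 0) for name in UNMATCHED_CATEGORIES)
    return unmatched_bp / exonic_bp


def is_substantially_disrupted(exonic_bp: int, bp_by_category: Dict[str, int]) -> bool:
    """Apply the strict greater-than-25% rule to one gene model."""
    return disrupted_fraction(exonic_bp, bp_by_category) > DISRUPTION_THRESHOLD


if __name__ == "__main__":
    # Illustrative gene model with 1,000 exonic nucleotides.
    print(is_substantially_disrupted(1_000, {"absent": 200, "unplaced": 100}))
    print(is_substantially_disrupted(1_000, {"absent": 250}))
```

Note that the comparison is strict, so a gene model with exactly 25% of its exonic sequence unmatched would not be counted as substantially disrupted under this reading.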
Links to our gene catalogues and further details of these analyses can be found in
We looked for evidence of human transcription for a set of known, mouse long ncRNAs
Supporting figures, tables, and text. All supporting information can be found at the following Website:
The authors would like to thank the extended staff of all of the genome centres involved in this project. Without their hard work and dedication, this project would not have been possible.
The Mouse Genome Sequencing Consortium consists of the following members, displayed with their affiliations:
At the Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, United States of America; and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America: Donna M. Muzny, Shannon Dugan-Rocha, Yan Ding, Steven E. Scherer, Christian J. Buhay, Andrew Cree, Judith Hernandez, Michael Holder, Jennifer Hume, Laronda R. Jackson, Christie Kovar, Sandra L. Lee, Lora R. Lewis, Michael L. Metzker, Lynne V. Narareth, Aniko Sabo, Erica Sodergren, and Richard A. Gibbs.
At The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America: Michael C. Zody, Michael FitzGerald, April Cook, David B. Jaffe, Manuel Garber, Andrew R. Zimmer, Mono Pirun, Lyndsey Russell, Ted Sharpe, Michael Kamal-Kabir Chaturvedi, Jane Wilkinson, Kurt LaButti, Xiaoping Yang, Daniel Bessette, Nicole R. Allen, Cindy Nguyen, Thu Nguyen, Chelsea Dunbar, Rakela Lubonja, Charles Matthews, Xiaohong Liu, Mostafa Benamara, Tamrat Negash, Tashi Lokyitsang, Karin Decktor, Bruno Piqani, Glen Munson, Pema Tenzin, Sabrina Stone, Pendexter Macdonald, Harindra Arachchi, Amr Abouelleil, Annie Lui, Margaret Priest, Gary Gearin, Adam Brown, Lynne Aftuck, Terrance Shea, Sean Sykes, Aaron Berlin, Jeff Chu, Kathleen Dooley, Daniel Hagopian, Jennifer Hall, Nabil Hafez, Cherylyn L Smith, Peter Olandt, Karen Miller, Vijay Ventkataraman, Anthony Rachupka, Lester Dorris, III, Laura Ayotte, Richard Mabbitt, Jeffrey Erickson, Andrea Horn, Peter An, Jerome W. Naylor, Sampath Settipalli, The Broad Institute Genome Sequencing Platform, Broad Institute Genome Assembly Team, Eric S. Lander, and Kerstin Lindblad-Toh.
At The Genome Center at Washington University, St. Louis, Missouri, United States of America: Richard K. Wilson, Tina A. Graves, Robert S. Fulton, Susan M. Rock, LaDeana W. Hillier, Asif T. Chinwalla, Kelly Bernard, Laura P. Courtney, Catrina Fronick, Lucinda L. Fulton, Michelle O'Laughlin, Colin L. Kremitzki, Patrick J. Minx, Joanne O. Nelson, Kyriena L. Schatzkamer, Cynthia Strong, Aye M. Wollam, George M. Weinstock, and Shiaw-Pyng Yang.
At The Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom: Jane Rogers, Darren Grafham, Sean Humphray, Christine Nicholson, Christine Bird, Andrew J. Brown, John Burton, Chris Clee, Adrienne Hunt, Matt C. Jones, Christine Lloyd, Lucy Matthews, Karen Mclaren, Stuart Mclaren, Kirsten McLay, Sophie A Palmer, Robert Plumb, Ratna Shownkeen, Sarah Sims, Mike A Quail, Siobhan L. Whitehead, and David L. Willey.
Other sequence producers include the following:
At the University of Oklahoma Advanced Center for Genome Technology, Norman, Oklahoma, United States of America: Stephane Deschamps, Steven Kenton, Lin Song, Trang Do, and Bruce Roe.
At the National Institutes of Health Intramural Sequencing Center and Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America: NISC Comparative Sequencing Program, Gerard G. Bouffard, Robert W. Blakesley, and Eric D. Green.
At the Harvard Medical School Partners Healthcare Center for Genetics and Genomics, Boston, Massachusetts, United States of America: Raju Kucherlapati, George Grills, Li Li, and Kate T. Montgomery.
At the Lita Annenberg Hazen Genome Sequencing Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America: Melissa Kramer, Lori Speigel, and W. Richard McCombie.
At the Joint Genome Institute, US Department of Energy, Walnut Creek, California, United States of America: Susan Lucas, Astrid Terry, Laurie Gordon, and Lisa Stubbs.
At Lawrence Livermore National Laboratory, Livermore, California, United States of America: Laurie Gordon and Lisa Stubbs. Lisa Stubbs' current address is: Institute for Genomic Biology, University of Illinois, Urbana, Illinois, United States of America.
At the Medical Research Council Harwell, Mammalian Genetics Unit, Oxfordshire, United Kingdom: Paul Denny, Steve D. M. Brown, and Anne-Marie Mallon.
At the Medical Research Council Rosalind Franklin Centre for Genomics Research, Hinxton Genome Campus, United Kingdom: R. Duncan Campbell and Marc R. M. Botherby.
At the Medical Research Council Human Genetics Unit, Edinburgh, United Kingdom: Ian J. Jackson.
At Agencourt Bioscience Corp, Beverly, Massachusetts, United States of America: Marc J. Rubenfield, Andrea M. Rogosin, and Douglas R. Smith.
EST: expressed sequence tag
ncRNA: noncoding RNA
SSR: simple sequence repeat
TPF: tiling path file
vomeronasal receptors
WGSA: Whole Genome Sequence and Assembly