Transcriptome Analysis of Mouse Stem Cells and Early Embryos

Understanding and harnessing cellular potency are fundamental in biology and are also critical to the future therapeutic use of stem cells. Transcriptome analysis of these pluripotent cells is a first step towards such goals. Starting with sources that include oocytes, blastocysts, and embryonic and adult stem cells, we obtained 249,200 high-quality EST sequences and clustered them with public sequences to produce an index of approximately 30,000 total mouse genes that includes 977 previously unidentified genes. Analysis of gene expression levels by EST frequency identifies genes that characterize preimplantation embryos, embryonic stem cells, and adult stem cells, thus providing potential markers as well as clues to the functional features of these cells. Principal component analysis identified a set of 88 genes whose average expression levels decrease from oocytes to blastocysts, stem cells, postimplantation embryos, and finally to newborn tissues. This can be a first step towards a possible definition of a molecular scale of cellular potency. The sequences and cDNA clones recovered in this work provide a comprehensive resource for genes functioning in early mouse embryos and stem cells. The nonrestricted community access to the resource can accelerate a wide range of research, particularly in reproductive and regenerative medicine.


Introduction
With the derivation of pluripotent human embryonic stem (ES) (Thomson et al. 1998) and embryonic germ (EG) (Shamblott et al. 1998) cells that can differentiate into many different cell types, excitement has increased for the prospect of replacing dysfunctional or failing cells and organs. Very little is known, however, about critical molecular mechanisms that can harness or manipulate the potential of cells to foster therapeutic applications targeted to specific tissues.
A related fundamental problem is the molecular definition of developmental potential. Traditionally, potential has been operationally defined as "the total of all fates of a cell or tissue region which can be achieved by any environmental manipulation" (Slack 1991). Developmental potential has thus been likened to potential energy, represented by Waddington's epigenetic landscape (Waddington 1957), as development naturally progresses from "totipotent" fertilized eggs with unlimited differentiation potential to terminally differentiated cells, analogous to a ball moving from high to low points on a slope. Converting differentiated cells to pluripotent cells, a key problem for the future of any stem cell-based therapy, would thus be an "up-hill battle," opposite the usual direction of cell differentiation. The only current way to do this is by nuclear transplantation into enucleated oocytes, but the success rate gradually decreases according to developmental stages of donor cells, providing yet another operational definition of developmental potential (Hochedlinger and Jaenisch 2002; Yanagimachi 2002).
What molecular determinants underlie or accompany the potential of cells? Can the differential activities of genes provide the distinction between totipotent cells, pluripotent cells, and terminally differentiated cells? Systematic genomic methodologies (Ko 2001) provide a powerful approach to these questions. One of these methods, cDNA microarray/chip technology, is providing useful information (Ivanova et al. 2002; Ramalho-Santos et al. 2002; Tanaka et al. 2002), although analyses have been restricted to a limited number of genes and cell types. To obtain a broader understanding of these problems, it is important to analyze all transcripts/genes in a wide selection of cell types, including totipotent fertilized eggs, pluripotent embryonic cells, a variety of ES and adult stem cells, and terminally differentiated cells. Despite the collection of a large number of expressed sequence tags (ESTs) (Adams et al. 1991; Marra et al. 1999) and full-insert cDNA sequences (Okazaki et al. 2002), systematic collection of ESTs on these hard-to-obtain cells and tissues has been done previously only on a limited scale (Sasaki et al. 1998; Ko et al. 2000; Solter et al. 2002).
Accordingly, we have attempted to (i) complement other public collections of mouse gene catalogs and cDNA clones by obtaining and indexing the transcriptome of mouse early embryos and stem cells and (ii) search for molecular differences among these cell types and infer features of the nature of developmental potential by analyzing their repertoire and frequency of ESTs. Here we report the collection of approximately 250,000 ESTs, enriched for long-insert cDNAs, and signature genes associated with the potential of cells, various types of stem cells, and preimplantation embryos.
Of 29,810 mouse genes identified in our gene index (Figure 1; Dataset S2; Dataset S3), 977 were not present as either known or predicted transcripts in other major transcriptome databases, such as RefSeq (Pruitt and Maglott 2001), Ensembl (Hubbard et al. 2002), and RIKEN (Okazaki et al. 2002) (see Dataset S3 for details and Dataset S4 for sequences). These genes represent possible novel mouse genes, as they either encode open reading frames (ORFs) greater than 100 amino acids or have multiple exons. In particular, 554 of the 977 genes remained novel with high confidence even after more thorough searches against GenBank and other databases. Comparisons of these 977 genes against all National Center for Biotechnology Information (NCBI) UniGene representative sequences showed that 377 genes did not match even fragmentary ESTs and are therefore unique to the National Institute on Aging (NIA) cDNA collection (see Dataset S3). A random subset of 19 cDNA clones representing these genes was sequenced completely to confirm their novelty (Figure 2). Protein domain searches using InterPro (Mulder et al. 2003) revealed that one of them, U004160, is an orthologue of human gene Midasin (MDN1), but the remaining 18 genes do not encode any known protein motifs. However, they were split into multiple exons in the alignment to the mouse genome sequences, and we therefore considered them genes. As these sequences are mainly derived from early embryos and stem cells, they most likely represent new candidates for genes specific to particular types of stem cells. RT-PCR analysis revealed that they are expressed in specific cell types (Figure 2; Dataset S5). For example, the expression of gene U035352 was unique to ES cells, expression of U004912 unique to ES and TS cells, and expression of U001905 unique to ES and EG cells. In addition, one gene showed apparent specific expression in several stem cells and is thus a potential pan-stem cell marker (U029765). Taken together, these data suggest that most of the putative genes represented only in the NIA cDNA collection are bona fide genes that have not been previously identified.
Figure 1. [...] (Pertea et al. 2003), 249,200 ESTs were clustered, generating 58,713 consensuses and singletons. NIA consensuses and singletons were further clustered with Ensembl transcripts (Hubbard et al. 2002), RIKEN transcripts (Okazaki et al. 2002), and RefSeq transcripts and transcript predictions (Pruitt and Maglott 2001). Alignments of these sequences to the mouse genome (UCSC February 2002 freeze data, available from ftp://genome.cse.ucsc.edu/goldenPath/mmFeb2002) (Waterston et al. 2002) using BLAT (Kent 2002) helped to avoid false clustering of similar sequences at nonmatching genome locations. Erroneous clusters were reassembled based on the analysis of genome alignment. A total of 94,039 putative transcripts were thus generated and then grouped into 39,678 putative genes based on their overlap in the genome on the same chromosome strand and on clone-linking information. Using criteria of an ORF greater than 100 amino acids or of multiple exons (excluding sequences that are potentially located on the wrong strand), 29,810 mouse genes were identified. Finally, 977 genes unique to the NIA database were identified. DOI: 10.1371/journal/pbio.0000074.g001
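The grouping step named in the Figure 1 legend (transcripts merged into putative genes when they overlap on the same chromosome strand) can be illustrated with a minimal sketch. It is not the pipeline used in this work: the tuple layout, the function name `group_transcripts`, and the closed-interval overlap test are assumptions made for illustration, and the clone-linking information mentioned in the legend is not modeled.

```python
def group_transcripts(transcripts):
    """Group transcripts whose genomic spans overlap on the same strand.

    transcripts: list of (name, chrom, strand, start, end) tuples
    (hypothetical layout; coordinates treated as closed intervals).
    Returns a list of groups, each a list of transcript names.
    """
    by_strand = {}
    for name, chrom, strand, start, end in transcripts:
        by_strand.setdefault((chrom, strand), []).append((start, end, name))

    genes = []
    for key in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right, extending the current group while spans overlap.
        for start, end, name in sorted(by_strand[key]):
            if current and start <= current_end:
                current.append(name)
                current_end = max(current_end, end)
            else:
                if current:
                    genes.append(current)
                current, current_end = [name], end
        if current:
            genes.append(current)
    return genes


# Hypothetical alignments: t1 and t2 overlap on the plus strand,
# t3 lies in the same region but on the minus strand, t4 is far away.
demo_transcripts = [
    ("t1", "chr1", "+", 100, 500),
    ("t2", "chr1", "+", 400, 900),
    ("t3", "chr1", "-", 450, 800),
    ("t4", "chr1", "+", 2000, 2500),
]
demo_genes = group_transcripts(demo_transcripts)
```

In this toy input, t3 stays separate from t1 and t2 despite overlapping coordinates, because it lies on the opposite strand.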

Signature Genes That Characterize Preimplantation Embryos and Stem Cells
To identify genes that were consistently overrepresented in a given set of cDNA libraries when compared with other libraries, we performed a correlation analysis of log-transformed EST frequency combined with the false discovery rate (FDR) method (Benjamini and Hochberg 1995). First, we analyzed various combinations of preimplantation stages and identified the following genes: (i) 196 genes specific to unfertilized eggs (oocytes) and fertilized eggs (Group A in Figure 3), (ii) 122 genes specific to two- to four-cell embryos (Group B in Figure 3), (iii) 119 genes specific to eight-cell embryos, morula, and blastocyst (Group C in Figure 3), (iv) 81 genes specific to all preimplantation embryos (Group D in Figure 3), and (v) 143 genes specific to all preimplantation embryos except for blastocysts (Group E in Figure 3) (see also Dataset S7). Blastocyst EST frequencies are unique even among preimplantation embryos, most likely reflecting the switch of the transcriptome from the maternal genetic program to the zygotic genetic program (Latham and Schultz 2001; Solter et al. 2002) or to the differentiation of the trophectoderm. At least 35 out of 196 genes in the egg signature gene list (Group A in Figure 3) have ATP-related protein domains. Genes in the following categories were also enriched in this gene list: the ubiquitin-proteasome pathway, the energy pathway, cell signaling (kinase and membrane) proteins, ribosomal proteins, and zinc finger proteins. Two SWI/SNF-related genes (5930405J04Rik, the homologue of human SMARCC2, and Smarcf1) and two Polycomb genes (Scmh1 and Sfmbt) overrepresented in eggs may be candidate genes for the strong chromatin remodeling activity of eggs during nuclear transplantation of somatic cell nuclei.
Addition of ES and EG cells to preimplantation embryos (143 genes; Group E in Figure 3) yielded only 54 signature genes (Group F in Figure 3). Addition of adult stem cells, MS and NS, or MS, NS, and HS (Lin−, Kit+, Sca1+ and Lin−, Kit−, Sca1+) cells further reduced the number of signature genes to five and one, in Groups G and H, respectively (Dataset S7). Taken together, these results seem to indicate that preimplantation embryos, particularly totipotent fertilized eggs and highly pluripotent cells (ES and EG cells), have quite distinct genetic programs, but that less pluripotent adult stem cells (MS, NS, and HS) have even more specialized genetic programs. This supports the notion of a gradual decrease of developmental potential from preimplantation embryos to stem cells to differentiated cells.
Figure 2. Examples of NIA-Only cDNA Clones and RT-PCR Results. Expression pattern of 19 novel cDNA clones in 16 different cell lines or tissues: unfertilized egg, E3.5 blastocyst, E7.5 whole embryo (embryo plus placenta), E12.5 male mesonephros (gonad plus mesonephros), newborn brain, newborn ovary, newborn kidney, embryonic germ (EG) cell, embryonic stem (ES) cell (maintained as undifferentiated in the presence of LIF), trophoblast stem (TS) cell, mesenchymal stem (MS) cell, osteoblast, neural stem/progenitor (NS) cell, NS differentiated (differentiated neural stem/progenitor cells), and hematopoietic stem/progenitor (HS) cells. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was used as a control. A U number is assigned to each gene in the gene index (see Dataset S2). The exon number was predicted from alignment with the mouse genome sequence, and the amino acid sequence was predicted with the ORF finder from NCBI. DOI: 10.1371/journal/pbio.0000074.g002
Additional analysis was done to determine genes that are enriched in stem cells, but not in preimplantation embryos and other tissues (see Figure 3; Dataset S6; Dataset S7). In this analysis, 140 genes were identified as signature genes for pluripotent stem cells (ES, EG, NS, and MS in Group I in Figure 3), whereas 93 genes were identified as signature genes for these stem cells and their differentiated forms (cultured cells in Group J in Figure 3). Similarly, 75 and 39 genes, respectively, were identified as ES- and TS-specific (Group K in Figure 3), whereas 44 genes were identified as signature genes for adult stem cells (NS, MS, and HS in Group M in Figure 3). Lists of these genes showed that distinctive sets of genes are responsible for cell specificity (Figure 3). FDR analysis revealed that 113 genes were specifically expressed in ES and EG cells in Group O (the most pluripotent stem cells), but not in all other cell types examined (Figure 3; Dataset S7). The most abundant group of these genes was transcription regulatory factors (about 30% of all specific genes), most of which were members of the zinc finger family, including Mtf2, Ing5, Mkrn1, Hic2, and the KRAB box zinc finger. Other abundant genes specifically expressed in ES and EG cells included matrix/cytoskeleton/membrane structural proteins such as Itga3, Dstn, Smtn, Dctn1, and Col18a1 and the DNA remodeling proteins such as Rcc1, Kars-ps1, Pola2, Mov10, and Rad54l. These two groups of genes may be associated with the unique feature of ES/EG cell cycle structure, where greater than 70% of the cell population are in S phase (Savatier et al. 1996).
Previous studies have identified genes specific to particular stem cells or genes common to a group of stem cells, although there was little agreement about which transcripts are commonly enriched in these studies (e.g., Anisimov et al. 2002; Ivanova et al. 2002; Ramalho-Santos et al. 2002; Tanaka et al. 2002). The difference in the method and platform used could be a major reason for the difficulty in identifying a common gene set. The analysis of a limited number of cell types could also contribute to differences in the resulting gene lists, because genes that appeared specific to certain cell types may also be expressed in other cells that were not included in the analysis. In contrast, the current study has analyzed a large number of different stem cells, preimplantation embryos, and newborn organs from our own EST collections as well as all publicly available ESTs that were derived from a few hundred cell types. Combined with stringent FDR statistics (see Materials and Methods), the analysis of this large number of cell types may provide broader perspectives on this issue. Comparison between the gene lists of the present study and the gene lists from the previously published studies identified areas of agreement (common genes), but also revealed that many genes previously reported as specifically expressed in one cell type or group of cells are actually expressed in other cell types and thus are not specific (see the details in Dataset S8). The signature genes identified in this study distinguish different stem cells, and this gene list may provide a way to recognize or purify specific stem cell types and provide insights into stem cell-specific functions.

Principal Component Analysis Identified Clusters of Cells/Tissues with Similar EST Frequency
The global expression patterns of 2,812 relatively abundant genes (see Materials and Methods; Dataset S9) were further analyzed by principal component analysis (PCA), which reduces high-dimensionality data into a limited number of principal components. The first principal component (PC1) captures the largest contributing factor of variation, which in this case corresponds to the average EST frequency in all tissues, and subsequent principal components correspond to other factors with smaller effects, which characterize the differential expression of genes. As we were interested in the differential gene expression component, we plotted the position of each cell type against the PC2, PC3, and PC4 axes in three-dimensional (3D) space by using virtual reality modeling language (VRML) (Figure 4A; Video S1; a full interactive view is available at http://lgsun.grc.nia.nih.gov/Supplemental-Information). Genes were also plotted in the same 3D space (a version of PCA called a biplot) (Chapman et al. 2002) to see their association with cell/tissue types. Close examination of the 3D model identified PC2 and PC3 as the most representative views of the 3D model (Figure 4B). A two-dimensional (2D) plot of PC2 and PC3 is therefore used for the following discussion, with references to the 3D model. It is important to keep in mind that the distance between cell types along principal components has a substantial error associated with randomness of clone counts in EST libraries. The estimated error range (2*SE) in the PC3 scale is about 7%-9% based on the Poisson distribution (Figure 4B). Nonetheless, PCA identifies major trends and clusters in gene expression among these cell types.
The most conspicuous trend was that cells that differ in their developmental potential appeared well separated along the PC3 axis. In Figure 4A and 4B, preimplantation embryos (unfertilized egg, fertilized egg, two-cell, four-cell, eight-cell, morula, and blastocyst) are positioned at the top of the PC3 axis; embryos and extraembryonic tissues from early- to mid-gestation stages, such as E6.5, E7.5, E8.5, and E9.5, are positioned at the middle; and cells and tissues mostly from terminally differentiated cells (newborn ovary, newborn heart, and newborn brain) are positioned at the bottom. PCA is unsupervised (performed without using knowledge of the developmental stage of each cell type), and so this ordering along the PC3 axis seems to reflect the structures of global gene expression patterns among the cells. The PC2 axis provided an additional dimension to separate cells into developmental stages, functional groups, or both. The correlation of the PC2 axis to known biological stages, functions, or both, however, remains unclear.
Interestingly, both ES cells and adult stem cells are positioned at the middle of the PC3 axis together with whole-embryo libraries from early- to mid-gestation stages (Figure 4B). ES and EG cells were derived from embryos, and thus their positions matched their developmental timing. Although NS, MS, and HS cells were all derived from adult organs (brain, bone marrow, and bone marrow, respectively), their position along the PC3 axis corresponded to early embryonic tissues and embryo-derived stem cells (ES and EG). The results are consistent with the notion that adult stem cells acquire or retain pluripotency, with characteristics of less-differentiated cell types. This also suggests that the PC3 axis does not represent just developmental timing, but also indicates the developmental potential of cells, with totipotent eggs at the top, pluripotent embryonic cells and stem cells at the middle, and terminally differentiated cells at the bottom.
This hypothesis seems to be consistent with another interesting observation that the differentiated forms of stem cells were always positioned lower than their stem cell counterparts (undifferentiated forms) on the PC3 axis (Figure 4A and 4B). For example, the position of NS (differentiated) cells, a mixture of neurons and glia obtained after culturing NS cells in the differentiation conditions, was lower and nearer to the terminally differentiated cells than were NS cells. Osteoblast cells, which are more differentiated than the MS cells from which they are derived, were again positioned lower than the MS cells. The same holds true for ES (LIF−) cells (lower PC3 position), which were obtained by culturing ES cells in the absence of leukemia inhibitory factor (LIF), allowing ES cells to differentiate into many different cell types, and ES (LIF+) cells (higher PC3 position), which were maintained as highly pluripotent by culturing them in the presence of LIF. For HS cells, all four cell types were selected first as lineage marker-negative cells, and thus they were all relatively undifferentiated cells. These cells were then sorted by c-Kit and Sca1 expression into four separate fractions. The most pluripotent cells (Lin−, c-Kit+, Sca1+) were again positioned higher than the other three cell types on the PC3 axis. Finally, TS cells were positioned at the least-potent place among stem cells, which seemed to fit their known characteristics. It has previously been shown that TS cells are already committed to the extraembryonic lineage and are less pluripotent than ES and EG cells, because TS cells injected back into mouse blastocysts only differentiate into extraembryonic trophoblast lineages (Tanaka et al. 1998). The microarray analysis of TS cells also shows that they already express many placenta-specific genes, which is a sign of lineage-committed cells (Tanaka et al. 2002).
Finally, it is interesting to note that EG cells were positioned close to E8.5 whole embryos and E9.5 whole embryos, whereas ES cells were positioned close to blastocysts, E6.5, and E7.5 whole embryos (Figure 4). Because ES cells are derived from E3.5 blastocysts and EG cells are derived from primordial germ cells (PGCs) of E8.5 (in this particular line), these results indicate that the expression patterns of relatively abundant genes in ES and EG cells reflect their developmental stages of origin. Although ES and EG cells were established from different sources, EG cells are often considered to be ES cells and the distinction of their origin is ignored. However, the result here suggests potentially significant differences between the genetic programs of EG cells and ES cells.

Genes Correlated with the Developmental Potential of Cells
To identify a group of genes associated with the PC3 axis, we first fixed the coordinate of each cell type on PC3 and searched for genes whose log-transformed frequencies correlated with this coordinate in each cell type. Correlation analysis combined with the FDR method (FDR = 0.1) revealed 88 genes whose expression levels were significantly associated with PC3 (Dataset S10). To test how well these genes represent PC3, we plotted the sum of log-transformed EST frequencies for these 88 genes versus PC3 projections of the same cell types (Figure 5). Most cells were positioned diagonally relative to the original PC3 coordinates, indicating that the average expression levels of these 88 genes can roughly represent cell type position along the PC3 coordinate. Because the PC3 axis does not have a unit and cannot be directly translated to variables measured by molecular biological techniques, the possible use of these 88 genes as a surrogate for the PC3 axis will help to test this working hypothesis in the future.
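The surrogate score described above, the sum of log-transformed EST frequencies over the marker genes for one library, can be sketched as follows. This is an illustration only: the clone counts and library sizes are hypothetical, only three of the 88 genes are used, the function names are invented here, and base 10 is assumed for the logarithm because the base is not stated in the text.

```python
import math


def log_frequency(clone_count, library_total):
    """Log-transformed clone frequency, log(1000*n/N + 0.05).

    Base 10 is an assumption; the text does not state the base.
    """
    return math.log10(1000.0 * clone_count / library_total + 0.05)


def potency_score(clone_counts, library_total, marker_genes):
    """Sum of log-transformed frequencies over the marker genes for one library.

    Genes absent from the library count as zero clones.
    """
    return sum(
        log_frequency(clone_counts.get(gene, 0), library_total)
        for gene in marker_genes
    )


# Three of the 88 genes named in the text; all counts below are hypothetical.
markers = ["Gdf9", "Bmp15", "Rfpl4"]
egg_counts = {"Gdf9": 12, "Bmp15": 8, "Rfpl4": 5}
brain_counts = {"Gdf9": 0, "Bmp15": 0, "Rfpl4": 1}

egg_score = potency_score(egg_counts, 10000, markers)
brain_score = potency_score(brain_counts, 10000, markers)
```

With these invented counts the egg library scores higher than the newborn brain library, which is the direction of the trend reported for PC3.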
Although all 88 genes shared the general trend of continuous decrease of expression levels from eggs to terminally differentiated tissues, these genes can be further subdivided by their expression patterns. First, 53 genes were those identified as preimplantation specific, particularly unfertilized and fertilized egg-specific genes, which include already well-known genes for their functions in oogenesis and zygotic gene activation, such as Gdf9, Bmp15, Rfpl4, Fmn2, Tcl1, Obox5, and Oosp1. Second, ten genes were represented as ESTs in both preimplantation embryos and postimplantation embryos, including Cyp11a and D7Ertd784e. Third, 25 genes were represented well as ESTs in preimplantation embryos, postimplantation embryos, and stem cells, including Mitc1, actin-binding Kelch family protein, Dtx2, Cdc25a, Spin, Rgs2, Prkab1, and Birc2. The seemingly continuous decrease of the expression of these genes is therefore not caused by passive dilution of transcripts that are abundant in oocytes, but is most likely caused by a specific mechanism that actively regulates the expression levels of these genes.

Concluding Remarks
The sequence information and cDNA clones collected in this work provide the most comprehensive database and resources for genes functioning in early mouse embryos and stem cells. All cDNA clones developed in this project have been made available through the American Type Culture Collection (ATCC). A subset of these cDNA clones has been rearrayed into condensed clone sets, the NIA Mouse 15K cDNA Clone Set (Tanaka et al. 2000; Kargul et al. 2001) and the 7.4K cDNA Clone Set (VanBuren et al. 2002), which have been made available through designated academic distribution centers. Many genes that are uniquely or predominantly expressed in mouse early embryos and stem cells have been recently incorporated into a 60mer oligonucleotide microarray (Carter et al. 2003). Sequence information has been made available at public sequence databases (e.g., dbEST [Boguski et al. 1993]). Finally, all the information discussed here, as well as the graphical interfaces of the Mouse Gene Index, is available on our Web site at http://lgsun.grc.nia.nih.gov/cDNA/cDNA.html.
Although the full appreciation of these resources is yet to be realized, the initial assessment of the first comprehensive transcriptome of early mouse embryos and stem cells has already provided three major points presented in this report.
First, approximately 1,000 putative genes that were newly identified using our cDNA collection most likely represent mouse genes unidentified previously, as they either encode ORFs greater than 100 amino acids or have multiple exons. The RT-PCR analysis of 19 selected genes confirmed the notion that novel cDNAs from our libraries tend to be expressed specifically in cells and tissues that we used in this project. These gene candidates will be a rich source of genes that are expressed at low levels, but play major roles in ES cells and adult stem cells as well as in early embryos.
Second, the analysis provided lists of genes specific to particular embryonic stages or stem cells and not expressed in other cell types. For example, we have identified signature genes for the individual preimplantation stages, all preimplantation stages, ES cells, and adult stem cells.
Finally, the PCA of 2,812 genes with relatively abundant expression revealed 88 genes with average expression levels that correlate well to the developmental potentials of cells. These genes may provide the first scale to characterize the developmental potential of cells and tissues at the molecular level.
The developmental potential of cells is a fundamental concept in developmental biology, providing a conceptual framework of sequential transition from totipotent fertilized eggs to pluripotent embryonic cells and stem cells to terminally differentiated cells. It is worth noting that genes associated with developmental potential can be identified only by simultaneous analysis of preimplantation embryos and a variety of stem cells. The analyses of stem cells alone could not provide these broader perspectives (Ivanova et al. 2002; Ramalho-Santos et al. 2002; Tanaka et al. 2002). The 88 genes we have identified here may provide a set of marker genes for scaling the potential of cells. It is important to note that this scale is an operational construct. As such, further studies of the genes in the list will be required to test whether they provide critical clues to resolve the classic problem of the relation of stem cells to development. But the list could have immediate practical utility in assessing the effectiveness of treatments, gene manipulation, or both to convert differentiated cells such as fibroblasts into more potent cells such as ES cells, one of the most important goals required to achieve stem cell-based therapy.
Figure 5. [...] Bmp15, Btg4, Cdc25a, Cyp11a, Dtx2, E2f1, Fmn2, Folr4, Gdf9, Krt2-16, Mitc1, Oas1d, Oas1e, Obox3, Prkab1, Rfpl4, Rgs2, Rnf35, Rnpc1, Slc21a11, Spin, Tcl1, Tcl1b1, Tcl1b3, 1810015H18Rik, 2210021E03Rik, 2410003C07Rik, 2610005B21Rik, 2610005H11Rik, 3230401D17Rik, 4833422F24Rik, 4921528E07Rik, 4933428G09Rik, 5730419I09Rik, A030007L17Rik, A930014I12Rik, E130301L11Rik, AA617276, Bcl2l10, MGC32471, MGC38133, MGC38960, D7Ertd784e, and 44 genes with only NIA U numbers (see Dataset S10). DOI: 10.1371/journal/pbio.0000074.g005

Materials and Methods
cDNA library construction, clone handling, and sequencing. Sources of tissue materials and RNA extraction methods are available as associated documents in the GenBank DNA sequence records (see also http://lgsun.grc.nia.nih.gov/cDNA/cDNA.html). cDNA libraries were constructed as described elsewhere (Piao et al. 2001). More details are available in Protocol S1.
Assembling of a gene index. See description in the legend to Figure 1 and in Protocol S1.
Analysis of 19 cDNA clones. Sequencing of full-length cDNA clones and RT-PCR analysis were done by the standard methods. More details are available as Protocol S1.
Identification of differentially expressed genes. Most methods for selecting differentially expressed genes from EST frequencies are based on the assumption that each cDNA clone is a random sample from the mRNA pool in the cell and hence that EST frequencies correspond to the Poisson distribution (Audic and Claverie 1997). Real EST libraries, however, do not satisfy this assumption because even small changes in experimental conditions may affect the stability of particular species of mRNA, which in turn will cause a bias in EST frequency. Thus, a reliable detection of differentially expressed genes requires either library replications or comparison of classes of libraries. Because our EST libraries do not have true replications, we selected the latter approach, which yields genes that are specifically expressed in one class of tissues/stages and are not expressed in other tissues/stages. Some cDNA clones were represented by a 5′ EST, some by a 3′ EST, and some by both 5′ and 3′ ESTs. To avoid counting the same cDNA clone twice through its 5′ and 3′ ESTs, all EST frequency analysis was done at the cDNA clone level.
To detect genes specific to a particular group of libraries, we first estimated the correlation between log-transformed clone frequencies, $\log(1000\,n_i/N + 0.05)$, where $n_i$ is the abundance of clone $i$ in the library and $N$ is the total number of clones, and membership (coded 0 or 1) in a particular group (see Dataset S6). The first three group classifications are targeted on oocytes. The next two classifications include all preimplantation stages with and without blastocysts. There are four classifications attempting to differentiate between pluripotent cells and other tissues. The final nine classifications capture various groups of stem cells. Results of these analyses are given in Dataset S7 and a subset of the data is shown in Figure 3. We analyzed only positive correlations because we were interested in genes that are overexpressed in tissues of interest, and P-values were estimated using a one-tailed t-test. Because P-values cannot be used for simultaneous assessment of multiple hypotheses, we determined significant genes using the FDR method (Benjamini and Hochberg 1995). The FDR was set to 0.1, which corresponds to an average proportion of false positives equal to 10%.
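The correlation step described above can be sketched for a single gene. This is a sketch under stated assumptions rather than the code used in this work: the clone counts, library sizes, and function names are hypothetical, base 10 is assumed for the logarithm, and the test statistic uses the standard relation between a Pearson correlation r over n libraries and a t value with n - 2 degrees of freedom. Converting t into a one-tailed P-value requires a t-distribution function from a statistics library and is omitted here.

```python
import math


def log_frequency(clone_count, library_total):
    """Log-transformed clone frequency, log(1000*n_i/N + 0.05); base 10 assumed."""
    return math.log10(1000.0 * clone_count / library_total + 0.05)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for a correlation r over n libraries (n - 2 degrees of freedom)."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical data for one gene across six libraries of 10,000 clones each.
counts = [20, 15, 0, 1, 0, 0]
totals = [10000] * 6
membership = [1, 1, 0, 0, 0, 0]  # 1 = library belongs to the group of interest

log_freqs = [log_frequency(c, t) for c, t in zip(counts, totals)]
r = pearson(log_freqs, membership)
t = t_statistic(r, len(counts))
```

Only positive correlations would be carried forward, matching the one-tailed design described in the text.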
As this study is focused on embryo- and stem cell-specific genes, we analyzed EST frequencies in public databases (Boguski et al. 1993) to exclude those genes that are predominantly expressed in adult tissues. A total of 3,338,847 public ESTs have been grouped into the following categories: NIA Collection, Preimplantation, Embryo, Embryonic Stem Cells, Fetus, Neonate, Adult, Adult Gonad, Adult Stem Cells, Adult Tumor, and Unclassified/Pooled Tissues (Dataset S11). Of 29,810 mouse genes, 5,425 genes were not represented by ESTs, 11,574 genes were expressed predominantly in adult tissues (EST frequency in adult tissues exceeds one-third of the maximum EST frequencies in all tissues), and 12,811 were genes expressed in embryos or in gonads, tumors, and stem cells. By removing 2,055 gonad-specific and 56 tumor-specific genes (20 times more ESTs in gonad or tumors than in other tissues), we obtained 10,700 genes that are predominantly expressed in embryos and stem cells (Dataset S12). Only ESTs matching to these genes were analyzed for differential expression.
PCA of clone frequencies. For the PCA shown in Figure 4, we selected 2,812 genes that had transcript frequencies of greater than or equal to 0.1% in at least one library (see Dataset S9). Clone/EST frequencies were log-transformed as $\log(1000\,n_i/N + 0.05)$, where $n_i$ is the number of clones in U-cluster $i$ in the library, and $N$ is the total number of all clones in this library.
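The adult-predominance rule and the gene counts quoted above can be checked with a short sketch. The function name, the dictionary layout, and the example frequencies are hypothetical; only the one-third threshold and the gene counts come from the text.

```python
def is_adult_predominant(freq_by_category, adult_key="Adult"):
    """True if the adult EST frequency exceeds one-third of the maximum
    EST frequency observed across all tissue categories."""
    peak = max(freq_by_category.values())
    return peak > 0 and freq_by_category.get(adult_key, 0.0) > peak / 3.0


# Hypothetical EST frequencies for two genes.
embryo_gene = {"Preimplantation": 9.0, "Embryo": 4.0, "Adult": 2.0}
adult_gene = {"Preimplantation": 3.0, "Embryo": 1.0, "Adult": 2.0}

# Gene counts reported in the text.
total_genes = 29810
not_in_ests = 5425
adult_predominant = 11574
remaining = total_genes - not_in_ests - adult_predominant  # embryo, gonad, tumor, stem cell
embryo_and_stem = remaining - 2055 - 56  # after removing gonad- and tumor-specific genes
```

The two subtractions reproduce the 12,811 and 10,700 genes stated in the text.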
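A minimal PCA sketch on a genes-by-libraries matrix of log-transformed frequencies is given below. It is not the software used for Figure 4. Several choices are assumptions made for illustration: each library (column) is centered on its mean across genes, the principal axes are the eigenvectors of the library-by-library covariance matrix, and those eigenvectors are found by power iteration with deflation so that only the standard library is needed. The matrix values are invented. Under these assumptions, each entry of a component gives the position of one library on that axis.

```python
import math


def column_center(matrix):
    """Subtract each column (library) mean from a genes x libraries matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]


def covariance(centered):
    """Library x library covariance matrix of a column-centered matrix."""
    n_rows, n_cols = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n_rows - 1) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def power_iteration(sym, iters=2000):
    """Dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(sym)
    norm0 = math.sqrt(sum((i + 1.0) ** 2 for i in range(n)))
    vec = [(i + 1.0) / norm0 for i in range(n)]  # fixed, asymmetric start vector
    for _ in range(iters):
        nxt = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sym[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec


def principal_components(matrix, k):
    """First k (variance, library loadings) pairs, found by deflation."""
    cov = covariance(column_center(matrix))
    n = len(cov)
    components = []
    for _ in range(k):
        value, vec = power_iteration(cov)
        components.append((value, vec))
        # Remove the component just found so the next one becomes dominant.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return components


# Invented log-frequencies: 5 genes (rows) x 3 libraries (columns).
# Genes differ mainly in overall abundance; the last gene also declines
# from the first library to the third.
demo = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
    [4.5, 4.0, 3.5],
]
(var1, pc1), (var2, pc2) = principal_components(demo, 2)
```

In this toy matrix, the first component weights all libraries with the same sign, consistent with PC1 reflecting average abundance, while the second component places the first and third libraries on opposite sides, reflecting the gene whose frequency declines across libraries.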
Statistical significance of gene contribution to PC3 (see Figure 5) was evaluated using correlation between log-transformed clone frequencies in various libraries and library position on the PC3 axis. P-values, estimated using a one-tailed t-distribution, characterize the significance of correlation for a single clone. To control the proportion of false positives, we used FDR, which was set to 0.1.
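The FDR control cited above (Benjamini and Hochberg 1995) is a step-up procedure: P-values are ranked, and all hypotheses up to the largest rank k satisfying p(k) <= FDR * k / m are declared significant, where m is the number of tests. A minimal sketch follows; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return the indices of hypotheses declared significant by the
    Benjamini-Hochberg step-up procedure at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose P-value lies under its threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical one-tailed P-values for five genes.
example_p = [0.001, 0.04, 0.03, 0.5, 0.9]
significant = benjamini_hochberg(example_p, fdr=0.1)
```

Because the procedure is step-up, a gene whose P-value misses its own threshold is still selected if a gene ranked after it passes; for example, with P-values 0.03, 0.04, 0.9, and 0.95 at FDR = 0.1, the first two are both selected even though 0.03 exceeds the rank-1 threshold of 0.025.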

Supporting Information
To view this Supporting Information with dynamic Web links, see http://lgsun.grc.nia.nih.gov/Supplemental-Information/. The NIA Mouse Gene Index has recently been made available to the public (http://lgsun.grc.nia.nih.gov/geneindex/). The Web interface provides a view of transcripts and genes on the mouse genome sequence. Unique IDs (U plus 6 digits, e.g., U018631) have been assigned to individual genes in the gene index. "U numbers" in the following datasets have direct links to corresponding genes in the NIA Mouse Gene Index. Clicking the "U number" in the datasets will lead to a Web page of the NIA public Web site.