A Resource for Transcriptomic Analysis in the Mouse Brain

Background The transcriptome of the cerebral cortex is remarkably homogeneous, with variations being stronger between individuals than between areas. It is thought that due to the presence of many distinct cell types, differences within one cell population will be averaged with the noise from others. Studies of sorted cells expressing the same transgene have shown that cell populations can be distinguished according to their transcriptional profile. Methodology We have prepared a low-redundancy set of 16,209 full-length cDNA clones which represents the transcriptome of the mouse visual cortex in its coding and non-coding aspects. Using an independent tag-based approach, CAGE, we confirmed the cortical expression of 72% of the clones. Clones were amplified by PCR and spotted on glass slides, and we interrogated the microarrays with RNA from flow-sorted fluorescent cells from the cerebral cortex of parvalbumin-egfp transgenic mice. Conclusions We provide an annotated cDNA clone collection which is particularly suitable for transcriptomic analysis in the mouse brain. Spotting it on microarrays, we compared the transcriptome of EGFP positive and negative cells in a parvalbumin-egfp transgenic background and showed that more than 30% of clones are differentially expressed. Our clone collection will be a useful resource for the study of the transcriptome of single cell types in the cerebral cortex.


Introduction
Despite the genome sequence of the human and several other mammalian species having been determined [1][2], it has become evident that the transcriptional output of the genome is much more complex than the predicted protein coding genes. Although a mammalian genome contains about 20,000 protein coding genes, there are an additional 23,000 transcriptional units (TUs) comprised of RNA without protein coding potential [3], which originate from more than 181,000 different transcripts in the mouse. As detected by whole genome tiling arrays, the human genome also shows comparable complexity [4]; reviewed in [5]). EST analysis has shown that RNA expression is frequently restricted to specific tissues. For instance, more than 110,000 39 end cDNA clusters, out of a total of 171,000 39 end EST clusters prepared during the Mouse cDNA Encyclopedia Project, were identified from only one library [6], largely from the brain [7]. These data suggest that to comprehensively identify the RNAs involved in specific biological processes, or at least the tissue-specific RNA/mRNA isoforms, it is essential to prepare novel EST collections, preferably using normalization and subtraction to identify rarely expressed mRNAs.
Apart from the importance for expression profiling studies, fulllength cDNAs are essential for the understanding of the structure of the protein encoded in specific tissues. Full-length cDNAs also constitute an invaluable resource for future functional studies such as ectopic expression with lentiviral vectors [8]. Additionally, they provide specific information for gene start and termination sites [9]. In particular, 59 -ends help to identify tissue-specific promoter elements to connect them to TUs, and at the same time to novel datasets that allow for mRNA expression profiling based on measuring transcriptional activity at the start site (or mRNA cap). Accordingly, we have previously developed the cap-analysis gene expression (CAGE) method ( [10][11]), which is based upon the production of short (20-21 nt) tags corresponding to the 59 end of capped RNA transcripts. After high-throughput profiling, these are aligned onto the genome to identify their promoter elements, and the specific activity at a certain start site is measured as the frequency of CAGE tags. Full-length cDNA collections, and 59 ESTs are very beneficial to map CAGE tags to TUs, which exhibit alternative promoter usage in at least 50% of the cases ( [12]).
Here, we have developed a full-length cDNA and microarray resource for studies with particular emphasis on the mouse visual cortex. Due to tissue restriction of mRNA expression [6], we tested the resource for differential gene expression across different cell types by comparing the transcriptome of parvalbumin (pvalb) cells versus the rest of cortex. The pvalb-cell network has recently been identified as an important trigger for critical period plasticity in the postnatal visual cortex [13]. Sensory deprivation typically between postnatal day P20-P40 in mice leads to a functional and structural rewiring of visual cortical circuits that underlies an enduring loss of visual acuity (amblyopia). Direct manipulation of postnatal GABA function can delay or accelerate plasticity onset [13], carrying potentially broad implications for therapies in adulthood as well as a deeper mechanistic understanding of cognitive developmental disorders.

Results and Discussion
We previously prepared four full-length cDNA libraries from whole extracts of wild-type mouse visual cortex at P18, P24, P28 and P55 using the CAP trapper method [14]. The libraries were subjected to subtraction and normalization [15]. A total of 76,999 59 ends and 75,757 39 ends were sequenced, and a cluster analysis showed that the resulting libraries had an average redundancy of 1.7 [15]. These libraries were used as a starting point for the preparation of a non-redundant clone collection representing the transcriptome of the visual cortex across its postnatal development.
We pooled the 59 ESTs from all four libraries and grouped them into 6,818 clusters and 9,391 singletons using the stackPACK software [16]. Thus, the redundancy increased to 4.75 after in silico pooling. We counted the number of 59 ends per cluster, and identified the genes with the most imbalanced expression patterns, in order to detect potential genes of interest for the study of the maturation of visual cortex. We used the method developed by Stekel et al. [17] for comparing digital expression patterns, and calculated a R statistic for each cluster, using an in-house program available upon request. This method did not allow a quantitative prediction of false discovery rate. Hence, the upper interval of R values which cannot be fitted to an exponentially decreasing distribution were kept as potentially significant, as suggested in Stekel et al. [17]. We then arbitrarily selected 11 as a cut-off value (Figure 1), which highlighted fourteen protein-coding genes ( Table 1).
One representative clone was selected for each cluster. Together with the singletons, they constituted 16,209 different clones that were selected for rearray and PCR amplification in order to set up a microarray platform. 400 cDNAs were full-length sequenced and annotated as FANTOM 3 clones [3]. Of these, 179 have been annotated as non-coding by the FANTOM 3 consortium. This suggests that our whole collection is well enriched in non-coding cDNAs. To annotate the clones for which only partial sequence was available, we compared their 59 ends to Transcription Units (TUs) of FANTOM3's Representative Transcript Set (RTS) that includes ESTs (ftp://fantom.gsc.riken.jp/RTPS/fantom3_ mouse/primary_est_rtps). TUs are clusters of full-length transcripts that contain a common core of genetic information [18]. 13,189 clones whose 59 EST was co-located to a TU inherited their annotation and identification number. The full-length cDNAs that define the TUs originate mostly from libraries made totally or partially from brain tissues (brain, whole head, whole embryo), from libraries related to the immune system (immune cells are present in most tissues and therefore contribute to their transcriptome), and from a testis library. The 3,020 clones whose 59 end did not match a FANTOM3 TU are those whose sequence satisfied our quality criteria for clustering, but were discarded by the more stringent filters of the FANTOM3 RTS. There are 580 FANTOM 3 ESTs that are unique to our visual cortex libraries according to the FANTOM3 representative transcript set, of which 292 have been included in our spotted clone collection.
To confirm the expression of our clones in the visual cortex, we prepared CAGE libraries [10][11] from the visual cortex of mice at four developmental stages (P21, P26, P54 and P71) and compared the overlap of the cDNA and CAGE libraries for gene detection. We mapped the CAGE tags to the mouse genome in order to assign them to TUs [12] and pooled them in one virtual library of 180,984 tags. 16,194 TUs were identified by this technique as being expressed in the visual cortex. Of these, 6,874 had a counterpart in our clone collection. Interestingly, 7,005 TUs of our collection also had a counterpart in a CAGE library of similar sequencing depth (211,541 tags) made from cerebellar tissue [12]. This suggests that our collection can also be useful in the analysis of non-cortical brain tissues ( Figure 2).
Cerebral cortex is composed of many cellular and neuronal types. Our spotted clone collection represents their combined transcriptomes. We sought to investigate the complexity of a homogeneous subset of cortical cells. Parvalbumin (pvalb) is a calcium-binding protein which is used as a molecular signature of some GABAergic interneuron cell types [19][20]. Using a BAC transgenic line expressing the green fluorescent protein (EGFP) under the control of the pvalb regulatory sequences [21], we dissociated and sorted fluorescent and non-fluorescent cells from mouse cerebral cortex at the peak of the visual critical period (P28-29) and compared their transcriptional profiles on microarrays made with the first 4,512 clones of our collection, each spotted three times. Cy5-and Cy3-labeled cDNAs were prepared from the total RNA of 15,000-20,000 cells using the method of [22], and compared five samples using a dye-swap strategy. The data discussed in this publication have been deposited in NCBIs Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/ geo/) and are accessible through GEO Series accession number GSE8968.
The slides were scanned with the photomultiplier set to maximize signal intensity while avoiding excess saturation. The fluorescence intensity values were normalized using LOWESS normalization. To ensure data quality, we first investigated the overall quality of each slide and decided to discard the data from one slide. Second, we examined data consistency using dye-swap information within each sample and excluded clones that did not meet criterion from later statistical analysis, lowering the total number of included clones to 4,250. We ranked them by p-values based on a t-test. To address multiple comparisons, we chose qvalues as a false discovery rate-based measure of significance [23]. By setting the significance criterion at p,0.05, 1,400 of the 4,250 clones were differentially expressed (p,0.05) (Table 2, Figure 3A). The q-value corresponding to criterion was 0.066, so the number of false positives was estimated to be at most 92 (92 = q6n).
As a control for the quality of cell sorting, we checked for genes known to be absent from pvalb-expressing neurons. Both GABAsynthetic enzymes, GAD65 and GAD67, were included as control genes, however, they are not ideal, as pvalb-cells constitute only a subset of GABAergic neuron [13]. Thus, the non-pvalb fraction of our homogenates would include a substantial amount of GAD gene expression. Instead, we used pvalb itself as a control. Moreover, the glutamate receptor Gria2 (GluR2) was significantly underrepresented in the sample prepared from fluorescent cells (p = 0.00048, M = 20.77), as expected for pvalb-cells that express little or no GluR2 subunits [24]. Instead, Gria1, another receptor from the same family that is expressed in pvalb-expressing cells, was overrepresented in the fluorescent samples (p = 0.0042, M = 1.2).
We inspected the significantly over-or under-represented genes exhibiting the strongest expression differences between EGFPpositive and negative samples. Table 3 lists the 24 strongest variations in both directions. Prompted by the co-occurrence in Table 3 of genes implicated in ATP metabolism, such as subunits of the mitochondrial ATPase or components of the oxidative chain, we examined all other genes from these families, and found that they are particularly abundant among over-represented genes ( Figure 3B, Table 4). This may reflect the intense metabolic activity of the fast-spiking pvalb-expressing cells [25].
To confirm that the fluorescent cells consisted only of neurons, we checked for the expression of Myelin Basic Protein (Mbp). Our slides contained two different clones coding for the Mbp gene, both showing a strong underrepresentation (M<22.1 each). Despite this strong value, it was not prominently ranked since the p-value was slightly lower than 5%. These two Mbp clones exemplify a class of genes which display a strong variability in their differential expression across the analyzed samples ( Figure 3C). Among these, we could also find another myelin-related gene, plp1, and a component of the MAP kinase pathway, Map2k6. In the case of the Mbp gene, we hypothesize that the source of the noise affecting the p-value was the quantity of oligodendrocytes included amongst the non-fluorescent cells, which can vary significantly with each dissection.  We then sought to investigate the differential expression of noncoding transcripts. Since we do not have the full-length sequence of most of these clones, we relied on their annotation to identify those that are potentially non-coding. We reasoned that most protein-coding genes are given an original gene symbol, whereas the clones whose function was unknown keep their transient MGI (Mouse Genome Informatics) name for a long time. In our collection, these clones typically have a symbol constructed by adding ''Rik'' to the clone identifier. Using the FANTOM 3 genome browser, we inspected the mapping of the the top 100 clones (ranked by their p-value) which inherited a gene symbol including ''Rik''. Only seven of these were likely to be non-coding ( Figure 4).
This low occurrence of non-coding clones in the differentially regulated set can be explained as follows. First, non-coding RNAs are often weakly expressed, which makes the detection of variation in expression harder. Second, it has been reported that more than one third of all transcribed sequences exist in polyadenylated and non-poly-adenylated forms [4]. Since we used oligo dT primers during the reverse-transcription, this aggravates low expression levels. Importantly, because the function of the non-coding transcriptome is still poorly characterized, it remains unknown whether technical reasons alone or biological considerations explain the low occurrence of differentially expressed non-coding transcripts. From this perspective, our results suggest that the noncoding transcriptome is homogeneous at the level of different mature cortical populations.
We created a collection of 16,209 RIKEN clones by subtracting, normalizing and clustering cDNA libraries from the mouse visual cortex. This clone collection is representative of the brain transcriptome as confirmed by an independent transcriptional profiling using the CAGE technology. Comparison of a homogeneous interneuron population to a whole-tissue control with our cDNA microarray platform identified more than 30% of differentially expressed RNAs even taking a conservative approach. Besides full-length cDNA arrays, and the investment in equipment and resources that these technologies require, other alternatives strategies can be conceived. Possible choices are the oligonucleotide arrays. However, due to prevalence of alternative promoters, exons and polyadenylation site, one would ideally have to prepare multiple oligonucleotides for each coding and noncoding gene. Our clone collection aims at maximizing the possibility of implicating novel non-coding cDNAs in a neural function, and its advantage is that when a candidate is found, a full-length cDNA is available. For analysis focused on transcriptional control, we would rather recommend CAGE when the technology to make CAGE libraries from limited amount of tissue will be available.
The large heterogeneity of gene expression between cells strongly suggests that future studies will require the separation of single neuron populations. This is particularly true for cells that are rare in a tissue, such as the pvalb-expressing neurons, whose specific transcriptome variations during development and brain plasticity would be masked by the large majority of unrelated cells. In general, these data emphasize once more the importance of dissecting neuronal sub-types when expression profiling to study specific brain regions. In summary, these findings show that our clone collection, assembled with an unbiased and hypothesis-free strategy, is suitable for probing the transcriptome of purified cell populations and is broad enough to provide control genes. Our publicly available cDNAs are thus a proven tool for the profiling and comparison of homogeneous cell types within the brain. This will help progress toward better defining these cell types at a transcriptomic level [26][27].

Annotation
The annotation of full-length cDNAs which were sequenced after completion of the FANTOM3 project was made in a pipeline containing five filters (Figures S1 and S2). First, the FANTOM3 representative transcript set [3] and refSeq [31] were searched for identical DNA sequences (. = 98% similarity over . = 100 bp) using the fasta34 program from the fasta3 package [32] with the parameters -Q -H -n -d0 -m9 -E100. Clones with no identical match were searched for open reading frames identical or similar to the entries of Swiss-Prot [33], TrEMBL [33] or refSeq using the fasty34 program from the fasta3 package, with the parameters -Q -H -n -d0 -m9 -E100. cDNA clones with no match were scanned for potential open reading frames (ESTs were skipped) using the DECODER [34], geneid 1.2 [35] and orfind (Tatusov, unpublished) programs. ESTs were compared to the UniGene database [36] using fasta34 with the same parameters. Clones which were not caught by any of these filters were mapped on the mouse genome (release UCSC mm5 [2]) using the sim4 [37] program.

CAGE analysis
Cage libraries were prepared as in [10], using the visual cortex of mice at age P21, P26, P54 and P71. The libraries were sequenced and analyzed as in [12], and were deposited in DDBJ under the accession numbers AAAAN0000001-AAAAN0074253, AAAAI0000001-AAAAI0019356, AAAAM0000001-AAAAM0054286 and AAAAK0000001-AAAAK0041805 respectively. The cerebellar CAGE library is available under the accession numbers AAAAA0000001-AAAA0240780.

Rearray and clone amplification
cDNA clones were rearrayed with a Q-bot instrument (Genetix) from 20% glycerol/LB ampicillin stocks that were kept at 280uC. A growth check was performed on all glycerol stocks. For the growth check, 384-well plates were translated into four 96-well plates per 384-well plate and one bacterial culture per glycerol stock was inoculated. All plates were analyzed after 18 h of growth at 37uC by visual inspection. We performed one Plasmid DNA Preparation per Glycerol Stock using a Montage Purification Kit (Millipore) and our robotic systems. One quality control was

Animals and tissue preparation
We used C57Bl/6 BAC transgenic mice expressing the enhanced green fluorescent protein (EGFP) under the control of the parvalbumin regulatory sequences [21]. Animals were maintained on a 12 h light/dark cycle with access to food and water ad libitum. All procedures involving animals and their care were carried out in accordance with the directives of RIKEN's Institutional Animal Care and Use Committee. All experimental groups were sacrificed at a similar time of day to avoid possible circadian effects. Primary visual cortex (area V1) from anesthetized mice was dissected and homogenized by sonication.
Total RNA was extracted from 15,000-20,000 FACS-sorted cells (Wagatsuma et al, in prep.) using the RNEasy Mini kit from Qiagen according to the manufacturer's instructions, and stored at 280uC until use.
The following protocol is for one slide, and was scaled accordingly if more than one slide was to be hybridized. Total RNA samples were mixed with 2 ml of SPIKE Mix (GE Healthcare Lucidea universal scorecard), and probes were prepared following the method of [22], except that 3 mg of cDNA were used per slide, and that the probes were dried and resuspended in 6 ml of water. The probes were then denatured at 95uC for 2 min and then mixed with 7.5 ml 46 Microarray hybridization solution version 2 (Amersham, Cat. No RPK0325), 15 ml formamide, and 145 ml of hybridization buffer (2.5 ng/ml oligo(dT) 12-18 primers (Invitrogen), 500 ng/ml mouse Cot-1 DNA (Invitrogen), 0.56Microarray hybridization buffer version 2 (Amersham, Cat. No RPK0325, stock concentration: 46), 25% formamide (Nacalai Tesque), 50% ULTRAhyb buffer (Ambion)), at 50uC.
The mixture of Cy3-and Cy5-labeled probes was applied at 75uC onto arrays, and hybridized for 3 h at 65uC, 3 h at 55uC and 12 h at 50uC in the dark, using an automatic hybridizer, GeneMachines HybStation (Genomic Solutions). After hybridization, slides were washed once in 26saline-sodium citrate (SSC) and 0.2% SDS at 53uC, in 16SSC and 0.2% SDS at 53uC, in 26SSC at 24uC, and in 0.26SSC for 5 min at 24uC using one cycle of 10 s flow and 1 s hold, and 15 (first wash) or 13 (other washes) cycles of 1 s flow and 10 s hold, and then dried by spinning for 1 min.
The scanning was done using an Axon GenPix 4000B scanner (Axon Instruments, Union City, CA) at 10 mm resolution. PMT voltage settings were varied to obtain maximum signal intensities with ,5% probe saturation. TIFF images were captured and analyzed with GenePix 6.0 (Axon Ins.) software and Acuity 4.0 (Axon Ins.). We used the Autoflag function to exclude spots for which the signal intensity was not higher than the local background. In addition, we also flagged the spots for which the scanning area was lower than 100 pixels. We call ''standard'' the experiments in which the probes prepared from the fluorescent cells were labelled with Cy5, and ''swapped'' the ones where the label was Cy3.

Statistical analysis
Lowess (locally weighted scatter plot for smoother) analysis was used for normalizing microarray data. Within-print-tip-group normalization was conducted by using the loess.smooth function in the statistical software package R [38]. In short, this lowess normalization approach is to use the dependence of M = log 2 (Cy5/ Cy3) on A = log 2 (Cy56Cy3) for estimating its normalizing curve, which is a function of A [39][40]. Due to its estimation procedure, its normalization curve is considered as estimated mostly based on unexpressed clones (which should be the majority) in each small segment of A [40] (we used 50 segments). We tried to further ensure this point in our lowess normalization; for each segment of A we defined a window that included 40% of the points that are nearest. Inside this window, the standard deviation of M values was calculated and lowess normalization curve was estimated by using only data points whose M values fell into a range within three times the standard deviation in each A segment. Similar results could be obtained using smaller cutoff values, e.g. 1 s.d., suggesting that this three s.d. is stringent enough (data not shown). Once the normalization curve was estimated, data points out of the range were also normalized.
To ensure data quality, we then excluded some clones from later statistical analysis by using the following two criteria, called global filtering and individual filtering. For global filtering, overall quality of each sample was examined using 3 repeated spottings (called replicates and denoted by rep1, rep2, rep3 below). For each standard or swap dataset (of each sample), the correlation of M values was calculated between replicates (e.g. the correlation between rep1 and rep2) and averaged over three cases (i.e. rep1 vs. rep2, rep2 vs. rep3, and rep3 vs. rep1). In all standard and swap datasets except the sample 2 standard dataset, the averaged correlation was larger than 0.899, whereas it was only 0.663 in the sample 2 standard dataset. We inspected the sample 2 standard dataset by creating the color image of M values in the format of the original microarray plate, and found blurred traces largely stretched over the plate. Given these, we decided to discard the sample 2 standard dataset from later statistical analysis.
Next, for individual filtering, we examined the clone-wise consistency between standard and swap datasets (within each sample). We used the lowess-adjusted M values for this purpose and first obtained the mean of the adjusted M values over replicates for standard and swap datasets, denoted by M