Equivalent high-resolution identification of neuronal cell types with single-nucleus and single-cell RNA-sequencing

Transcriptional profiling of complex tissues by RNA-sequencing of single nuclei presents some advantages over whole cell analysis. It enables unbiased cellular coverage, lack of cell isolation-based transcriptional effects, and application to archived frozen specimens. Using a well-matched pair of single-nucleus RNA-seq (snRNA-seq) and single-cell RNA-seq (scRNA-seq) SMART-Seq v4 datasets from mouse visual cortex, we demonstrate that similarly high-resolution clustering of closely related neuronal types can be achieved with both methods if intronic sequences are included in nuclear RNA-seq analysis. More transcripts are detected in individual whole cells (~11,000 genes) than nuclei (~7,000 genes), but the majority of genes have similar detection across cells and nuclei. We estimate that the nuclear proportion of total cellular mRNA varies from 20% to over 50% for large and small pyramidal neurons, respectively. Together, these results illustrate the high information content of nuclear RNA for characterization of cellular diversity in brain tissues.


11
Understanding neural circuits requires characterization of their cellular components. Cell types in mam-( Figure S5B and Table S5). We measured the nuclear and cell sizes in situ, and calculated the nuclear 148 proportion of each cell as the ratio of nuclear to soma volume ( Figure S5C). We found that the average 149 nuclear proportion was significantly lower for layer 5 cells compared to layer 4 cells, as predicted based on 150 RNA-seq data ( Figure 5D). 151 In addition, nuclear proportion estimates based on in situ size measurements were systematically higher than 152 predicted for layer 5 but not layer 4 neurons. This could be the result of under-estimating the soma volume 153 based on cross-sectional area measurements of these large non-spherical (pyramidal) neurons. Alternatively, 154 layer 5 neuronal nuclei may have a lower density of nuclear transcripts or there may be cell type-specific biases 155 in our RNA-seq based estimates. We then performed an unbiased survey of nuclear proportions across the 156 full depth of cortex to test whether layer 4 or layer 5 neurons were exceptional compared to neurons in other 157 layers. We found that layer 5 neurons tend to be larger and have proportionally smaller nuclei ( Figure S5D) 158 than other cortical neurons, as was recently reported in rat primary visual cortex (öckner2017?). 159 Next, we determined the nuclear versus cytoplasmic distribution of transcripts for individual genes. The 160 nuclear proportion of 11,932 transcripts was estimated by the ratio of nuclear to whole cell expression mul-161 tiplied by the overall nuclear fraction of each cell type and averaged across cell types (Table S6). Different 162 functional classes of genes had strikingly different nuclear proportions ( Figure 5E). Many non-coding trans-163 cripts were localized in the nucleus, but some were abundantly expressed in the cytoplasm, such as the long 164 non-coding RNA (lncRNA) Tunar that is highly enriched in the brain, is conserved across vertebrates, and 165 has been associated with striatal pathology in Huntington's disease (Lin et al., 2014). Most protein-coding 166 transcripts were expressed in both the nucleus and cytoplasm with a small number restricted to the nucleus, 167 including the Parkinson's risk gene Park2. We found that pseudogenes were almost exclusively cytoplasmic 168 and were highly enriched for house-keeping functions. 169 We compared our estimates of nuclear enrichment in cortex to mouse liver and pancreas based on data 176 Surprisingly, non-coding genes and pseudogenes are better markers of cell types, on average, than protein-177 coding genes ( Figure 5F). lncRNAs are known to have more specific expression among diverse human cell 178 lines (Djebali et al., 2012), and we show that this is also true for neuronal types in the mouse cortex. Many 179 pseudogene transcripts, most of which are enriched in the cytoplasm, were selectively depleted in the two 180 cell types, L4 Arf5 and L5a Hsd11b1. This is consistent with our previous analysis that showed that neurons 181 of these types have relatively less cytoplasm. We also find that nucleus-enriched transcripts are slightly 182 better cell-type markers than cytoplasm-enriched transcripts, although this is highly variable across genes 183 ( Figure 5G). 184 Finally, we compared our estimates of nuclear localization of transcripts for three genes -Calb1, Grik1, 185 and Pvalb -to relative counts of transcripts in nuclei and cytoplasm using multiplex RNA fluorescence in 186 situ hybridization (mFISH). We found that the relative nuclear proportions estimated by scRNA-seq and 187 mFISH were consistent although the absolute levels were quite variable ( Figure 5H). Both methods confirmed 188 that Pvalb transcripts were mostly excluded from the nucleus, and this explained why 2 out of 35 nuclei Unlike scRNA-seq, snRNA-seq enables transcriptomic profiling of tissues that are refractory to whole cell 193 dissociation and archived frozen specimens. snRNA-seq is also less susceptible to perturbations in gene 194 expression that occur during cell isolation, such as increased expression of immediate early genes that can 195 obscure transcriptional signatures of neuronal activity (Lacar et al., 2016). However, these advantages come 196 at the cost of profiling less mRNA, and until this study, it was unclear if the nucleus contained sufficient 197 number and diversity of transcripts to distinguish highly related cell types.

198
To directly address this question, we profiled a well-matched set of 463 nuclei and 463 cells from layer 5 of 199 mouse primary visual cortex and identified 11 matching neuronal types: 2 interneuron types and 9 similar 200 excitatory neuron types. Including intronic reads in gene expression quantification was necessary to achieve 201 high-resolution cell type identification from single nuclei. Intronic reads substantially increased gene detection 202 to 7000 genes per nucleus. In addition, intronic reads were more frequently derived from long genes that 203 are known to have brain-specific expression (Gabel et al., 2015) and that help define neuronal connectivity 204 and signaling. Intronic reads may also reflect other cell-type specific features, such as retained introns or 205 alternative isoforms. For example, intron retention provides a mechanism for the nuclear storage and rapid 206 translation of long transcripts in response to neuronal activity (Mauger et al., 2016). 207 We found that nuclei contain at least 20% of all cellular transcripts, and this percentage varies among cell 208 types. Two small pyramidal neuron types have large nuclei relative to cell size that contain more than half 209 of all transcripts. We detect 4000 more genes in single cells than single nuclei, but the majority of genes are 210 detected equally well in both. Cytoplasm-enriched transcripts are missed by profiling single nuclei but include 211 mostly house-keeping genes and pseudogenes, which are not related to neuronal identity. Nucleus-enriched 212 transcripts include protein-coding and non-coding genes that are more likely to be cell-type markers than 213 cytoplasmic transcripts. Overall, single cells do provide somewhat better detection of cell-type marker genes, 214 thereby resulting in slightly better cluster separation for two pairs of highly similar cell types. Therefore, as 215 more nuclei and cells are profiled, it is possible that finer discrimination of cell types may require single cell 216 profiling. However, the benefits of profiling single nuclei may outweigh potential loss in the finest cell type 217 resolution. 218 snRNA-seq is well suited for large-scale surveys of cellular diversity in various tissues and has the potential to 219 be less cell-type biased. For example, single cell profiling of adult human cortex isolated more interneurons Dropout weighted Pearson correlations were calculated between all pairs of nuclei and cells using 42,003 358 genes expressed in at least one nucleus and one cell. The cell with the highest correlation to any nucleus 359 was selected as the best match, and this cell and nucleus were removed from further analysis. This process was repeated until 463 best matching cells were selected, and the expression correlations were compared to 361 correlations of the best matching pairs of nuclei ( Figure 1B). The Cre-lines and dissected cortical layers of 362 origin of the best matching cells were summarized as bar plots ( Figure S1). Unweighted Pearson correlations 363 were also calculated between all pairs of nuclei and cells to test the effect of accounting for dropouts on 364 sample similarities ( Figure 2B).

392
For each gene, log 2 (CPM + 1) expression was centered and scaled across samples. Noise models were used to 393 select significantly variable genes (adjusted variance > 1.25). Dimensionality reduction was performed with 394 principal components analysis (PCA) on variable genes, and the covariance matrix was adjusted to account 395 for gene dropouts using the product of dropout weights across genes for each pair of samples. A maximum 396 of 20 principal components (PCs) were retained for which more variance was explained than the broken stick 397 null distribution, a conservative method of PC retention (Jackson, 1993).

398
Nearest-neighbor distances between all samples were calculated using the "nn2" function of the R pack-399 age RANN, and Jaccard similarity coefficients between nearest-neighbor sets were computed. Jaccard coeffi-  Clustering was applied iteratively to each sub-cluster until the occurrence of one of four stop criteria: 1) 416 fewer than six samples (due to a minimum cluster size of three); 2) no significantly variable genes; 3) no 417 significantly variable PCs; 4) no significant clusters.

418
To assess the robustness of clusters, the iterative clustering procedure described above was repeated 100 times 419 for random sets of 80% of samples. A co-clustering matrix was generated that represented the proportion of 420 clustering iterations that each pair of samples were assigned to the same cluster. Average-linkage hierarchical 421 clustering was applied to this matrix followed by dynamic branch cutting using "cutreeHybrid" in the R and maximum between cluster co-clustering) was calculated for all clusters. Marker genes were defined for 426 all cluster pairs as described above, and clusters were merged if they had a co-clustering separation <0.25 427 or either cluster lacked at least one marker gene.

428
Scoring marker genes based on cluster specificity 429 Many genes were expressed in the majority of nuclei or cells in a subset of clusters. A marker score (beta) 430 was defined for all genes to measure how binary expression was among clusters, independent of the number 431 of clusters labeled. First, the proportion (x i ) of samples in each cluster that expressed a gene above back-432 ground level (CPM > 1) was calculated. Then, scores were defined as the squared differences in proportions 433 normalized by the sum of absolute differences plus a small constant (ε) to avoid division by zero. Scores 434 ranged from 0 to 1, and a perfectly binary marker had a score equal to 1.
expression level of the top 1200 marker genes (i.e. highest beta scores) was calculated for each cluster.

438
A correlation-based distance matrix (D xy = 1−ρ(x,y) 2 ) was calculated, and complete-linkage hierarchical 439 clustering was performed using the "hclust" R function with default parameters. The resulting dendrogram 440 branches were reordered to show inhibitory clusters followed by excitatory clusters, with larger clusters first, 441 while retaining the tree structure. Note that this measure of cluster similarity is complementary to the 442 co-clustering separation described above. For example, two clusters with similar gene expression patterns 443 but a few binary marker genes may be close on the tree but highly distinct based on co-clustering.   Next, nuclei and cell clusters were directly compared using the above analysis. All 11 clusters had reciprocal 454 best matches that were consistent with cluster labels assigned based on similarity to published types. The 455 most highly conserved marker genes of matching clusters were identified by selecting genes expressed in a 456 single cluster (>50% of samples with CPM > 1) and with the highest minimum beta score between nuclei 457 and cell clusters. Two additional marker genes were identified that discriminated two closely related clusters.

458
Violin plots of marker gene expression were constructed with each gene on an independent, linear scale.

459
Nuclei and cell clusters were also compared by calculating average cluster expression based only on intronic 460 or exonic reads and calculating a correlation-based distance using the top 1200 marker genes as described 461 above. Hierarchical clustering was applied to all clusters quantified using the two sets of reads. In addition,  In situ hybridization data for mouse cortex was from the Allen Mouse Brain Atlas (Lein et al., 2007). All data 503 is publicly accessible through www.brain-map.org. Data was generated using a semiautomated technology 504 platform as described in Lein et al. (2007). Mouse ISH data shown is from primary visual cortex (VISp) in with 4% PFA at 4°C for 60 minutes and the protease treatment step was shortened to 15 minutes at room 511 temperature. Probes used to identify nuclear and cytoplasmic enriched transcripts were designed antisense 512 to the following mouse genes: Calb1, Grik1, and Pvalb. Following hybridization and amplification, stained 513 sections were imaged using a 60X oil immersion lens on a Nikon TiE epifluorescence microscope.

514
To determine if spots fell within the nucleus or cytoplasm, a boundary was drawn around the nucleus to 515 delineate its border using measurement tools within Nikon Elements software. To delineate the cytoplasmic 516 boundary of each cell, a circle with a diameter of 15um was drawn and centered over the cell (Fig. 5). RNA 517 spots in each channel were quantified manually using counting tools available in the Nikon Elements software.

518
Spots that fell fully within the interior boundary of the nucleus were classified as nuclear transcripts. Spots 519 that fell outside of the nucleus but within the circle that defined the cytoplasmic boundary were classified 520 as cytoplasmic transcripts. Additionally, if spots intersected the exterior boundary of the nucleus they were 521 classified as cytoplasmic transcripts. To prevent double counting of spots and ambiguities in assigning spots 522 to particular cells, labeled cells whose boundaries intersected at any point along the circumference of the 523 circle delineating their cytoplasmic boundary were excluded from the analysis. A linear regression was fit to 524 nuclear versus soma probe counts, and the slope was used to estimate the nuclear proportion.

528
Maximum intensity projections from six confocal stacks of 1-μm intervals were processed for analysis. Initial in Layer 5 from Rbp4-Cre KL100;Ai14 mice. A linear regression was fit to nuclear versus soma area to 534 highlight the differences between Cre-lines.

535
For measurements of nucleus and soma size agnostic to Cre driver, we used 16 μm-tissue sections from P56 536 mouse brain. To label nuclei, DAPI was applied to the tissue sections at a final concentration of 1mg/ml.

537
To label cell somata, tissue sections were stained with Neurotrace 500/525 fluorescent Nissl stain (Ther-538 moFisher Scientific) at a dilution of 1:100 in 1X PBS for 5 minutes, followed by brief washing in 1X PBS.

539
Sections were coverslipped with Fluoromount-G (Southern Biotech) and visualized on a Nikon TiE epiflu-540 orescence microscope using a 40x oil objective. Soma and nuclei area measurements were taken by tracing 541 the boundaries of the Nissl-stained soma or DAPI-stained nucleus, respectively, using cell measurement tools 542 available in the Nikon TiE microscope software. All cells with a complete nucleus clearly present within the 543 section were measured, except that we excluded glial cells which had very small nuclei and scant cytoplasm.

544
Measurements were taken within a 40x field of view across an entire cortical column encompassing layers 545 1-6, and the laminar position of each cell (measured as depth from the pial surface) was tracked along with 546 the nucleus and soma area measurements for each cell.

547
For each cell in the experiments above, the nuclear proportion was estimated as the ratio of nucleus and soma 548 area raised to the 3/2 power. This transformation was required to convert area to volume measurements 549 and assumed that the 3-dimensional geometries of soma and nuclei were reflected by their cross-sectional 550 profiles. This is true for approximately symmetrical shapes such as most nuclei and some somata, but will 551 lead to under-or over-estimates of nuclear proportions for asymmetrical cells. Therefore, the estimated 552 nuclear proportion of any individual cell may be inaccurate, but the average nuclear proportion for many 553 cells should be relatively unbiased.   Co-clustering heatmaps show the proportion of 100 clustering iterations that each pair of nuclei were assigned to the same cluster. Clustering was performed using gene expression quantified with exonic reads or intronic plus exonic reads for two key clustering steps: selecting significantly differentially expressed (DE) genes and calculating pairwise similarities between nuclei. Co-clustering heatmaps were generated for each combination of gene expression values, and blue boxes highlight 11 clusters of nuclei that consistently co-clustered using introns and exons (upper left heatmap) and were overlaid on the remaining heatmaps. The row and column order of nuclei is the same for all heatmaps. (B) Co-clustering heatmaps were generated for cells as described for nuclei in (A), and blue boxes highlight 11 clusters of cells. (C) Cluster cohesion (average within cluster co-clustering) and separation (difference between within cluster co-clustering and maximum between cluster co-clustering) are plotted for nuclei and cells and all combinations of reads. Including introns in gene expression quantification dramatically increases cohesion and separation of nuclei but not cell clusters.  Figure S4). (B) Pairwise correlations between nuclear and cell clusters using average cluster expression of the top 490 shared marker genes. (C) Violin plots of cell type specific marker genes expressed in matching nuclear and cell clusters. Plots are on a linear scale, max CPM indicates the maximum expression of each gene, and black dots indicate median expression. (D) Hierarchical clustering of nuclear and cell clusters using the top 1200 marker genes with expression quantified by intronic or exonic reads. Intronic reads group nine matching nuclear and cell clusters together at the leaves, while two closely related deep layer 5 excitatory neuron types group by sample type. In contrast, exonic reads completely segregate clusters by sample type. (E) Box plots of cluster separations for all samples in matched nuclear and cell clusters. Clusters are equally well separated for all but two cell types, L4 Arf5 and L5b Cdh13, that are moderately but significantly (Wilcoxon signed rank unpaired tests; Bonferroni corrected P-value < 0.05) more distinct with cells than nuclei. (F) Cell type marker genes are consistently detected in nuclei and cells, although marker scores (see Methods) were on average 15% higher for cells. The nuclear fraction of transcripts in cell types was estimated with two methods: the ratio of intronic read percentages in cells compared to nuclei; and the average ratio of expression in cells compared to nuclei of three highly expressed genes (Snhg11, Meg3, and Malat1 ) that are localized to the nucleus. The relative ranking of nuclear fractions was consistent (Spearman rank correlation = 0.84), although estimates based on the intronic read ratio were consistently 50% higher. (D) Estimated nuclear proportion (ratio of nucleus and soma volume) of neurons labeled by three mouse Cre-lines in Layers 4 and 5 (see Supplementary Figure S5D). Single neuron measurements (grey points) were summarized as violin plots, and average nuclear proportions (black points) were compared to the range of estimated proportions (blue lines) based on intronic read ratios and nuclear gene expression. (E) Histograms of nuclear fraction estimates for 11,932 genes expressed (CPM > 1) in at least one nuclear or cell cluster and grouped by type of gene. (F) Violin plots of marker score distributions with median and inter-quartile intervals. Non-coding genes and pseudogenes are on average better markers of cell types than protein-coding genes. Kruskal-Wallis rank sum test, post hoc Wilcoxon signed rank unpaired tests: *P < 1 x 10 -50 (Bonferronicorrected), NS, not significant. (G) Box plots of cell type marker scores for genes grouped by estimated nuclear enrichment. Nucleus-enriched genes have significantly higher marker scores (linear regression; P = 2.3 x 10 -8 ). (H) Validation of the estimated nuclear proportion of transcripts for Calb1, Grik1, and Pvalb using multiplex fluorescent in situ hybridization (mFISH). Top: For each gene, transcripts were labeled with fluorescent probes and counted in the nucleus (white) and soma (yellow). Bottom: Probe counts in the nucleus and soma across all cells with linear regression fits to estimate nuclear transcript proportions for each gene. Estimated proportions based on mFISH and RNA-seq data are summarized on the right.    Heatmaps show remarkably similar correlation patterns, supporting the existence of a well matched set of nuclear and cell clusters. Nuclear and cell clusters were annotated based on the reciprocal best matching published cluster name and mapped to two interneuron types and five of eight layer 5 excitatory neuron types. (B) Comparisons of the proportion of nuclei or cells expressing marker genes (CPM > 1) for matched pairs of clusters. Correlations are reported at the top of each scatter plot, and cell type specific markers are labeled. As expected based on Figure 2C, gene detection is consistently higher in cells than nuclei. (C) Matched clusters have similar proportions of nuclei and cells (except for two closely related cell types, L5a Hsd11b1 and L5 Batf3), which supports the accuracy of the initial correlation based mapping of single nuclei to cells. (D) Average gene expression quantified based on intronic reads is more highly correlated between cells and nuclei than expression quantified based on exonic reads, particularly for highly expressed genes. Malat1, Meg3, and Snhg11 are the three highest expressing genes in nuclei and have consistently lower expression in cells, as expected based on their reported nuclear localization. proportions are highly similar with slightly higher reported cytoplasmic enrichment for reported genes. Note that the matched set of genes includes 99% protein-coding genes so the distributions more closely resemble those genes in Figure 5D.