Inference of Cell Type Composition from Human Brain Transcriptomic Datasets Illuminates the Effects of Age, Manner of Death, Dissection, and Psychiatric Diagnosis

Most neuroscientists would agree that psychiatric illness is unlikely to arise from pathological changes that occur uniformly across all cells in a given brain region. Despite this fact, the majority of transcriptomic analyses of the human brain to date are conducted using macro-dissected tissue due to the difficulty of conducting single-cell level analyses on donated post-mortem brains. To address this issue statistically, we compiled a database of several thousand transcripts that were specifically-enriched in one of 10 primary brain cell types identified in published single cell type transcriptomic experiments. Using this database, we predicted the relative cell type composition for 157 human dorsolateral prefrontal cortex samples using Affymetrix microarray data collected by the Pritzker Neuropsychiatric Consortium, as well as for 841 samples spanning 160 brain regions included in an Agilent microarray dataset collected by the Allen Brain Atlas. These predictions were generated by averaging normalized expression levels across the transcripts specific to each primary cell type to create a “cell type index”. Using this method, we determined that the expression of cell type specific transcripts identified by different experiments, methodologies, and species clustered into three main cell type groups: neurons, oligodendrocytes, and astrocytes/support cells. Overall, the principal components of variation in the data were largely explained by the neuron to glia ratio of the samples. When comparing across brain regions, we were able to statistically identify canonical cell type signatures – increased endothelial cells and vasculature in the choroid plexus, oligodendrocytes in the corpus callosum, astrocytes in the central glial substance, neurons and immature cells in the dentate gyrus, and oligodendrocytes and interneurons in the globus pallidus. The relative balance of these cell types was influenced by a variety of demographic, pre‐ and post-mortem variables. Age and prolonged hypoxia around the time of death were associated with decreased neuronal content and increased astrocytic and endothelial content in the tissue, replicating the known higher vulnerability of neurons to adverse conditions and illustrating the proliferation of vasculature in a hypoxic environment. We also found that the red blood cell content was reduced in individuals who died in a manner that involved systemic blood loss. Finally, statistically accounting for cell type improved both the sensitivity and interpretability of diagnosis effects within the data. We were able to observe a decrease in astrocytic content in subjects with Major Depressive Disorder, mirroring what had been previously observed morphometrically. By including a set of “cell type indices” in a larger model examining the relationship between gene expression and neuropsychiatric illness, we were able to successfully detect almost twice as many genes with previously-identified relationships to bipolar disorder and schizophrenia than using more traditional analysis methods.


Introduction 1
The human brain is a remarkable mosaic of diverse cell types stratified into 2 rolling cortical layers, arching white matter highways, and interlocking deep nuclei. In 3 the past decade, we have come to recognize the importance of this cellular diversity in 4 even the most basic neural circuits. At the same time, we have developed the capability 5 to comprehensively measure the thousands of molecules essential for cell function. 6 These insights have provided conflicting priorities within the study of psychiatric illness: 7 do we carefully examine individual molecules within their cellular and anatomical 8 context or do we dissect larger tissue samples in order to extract sufficient transcript or 9 protein to perform full unbiased transcriptomic or proteomic analyses? In rodent 10 models, researchers have escaped this dilemma by a boon of new technology: single 11 cell laser capture, cell culture, and cell-sorting techniques that can provide sufficient 12 extract for transcriptomic and proteomic analyses. However, single cell analyses of the 13 human brain are far more challenging (1-3) - live tissue is only available in the rarest of 14 circumstances (such as temporal lobe resection) and high quality post-mortem tissue is 15 precious, especially tissue donated by the families of individuals with rare psychiatric or 16 neurological disorders. 17 Therefore, to date, the vast majority of unbiased transcriptomic analyses of the 18 human brain have been conducted using macro-dissected, cell-type heterogeneous 19 tissue. They have provided us with novel hypotheses (e.g., (4,5)), but researchers who 20 work with the data often report frustration with the relatively small number of candidate 21 molecules that survive analyses using their painstakingly-collected samples, as well as 22 the overwhelming challenge of interpreting molecular results in isolation from their 23 respective cellular context. At the core of this issue is the inability to differentiate 24 between (1) alterations in gene expression that reflect an overall disturbance in the 25 relative ratio of the different cell types comprising the tissue sample, and (2) intrinsic 26 dysregulation of one or more cell types, indicating perturbed biological function. 27 In this manuscript, we present results from an easily accessible solution to this 28 problem that allows researchers to statistically estimate the relative number or 29 transcriptional activity of particular cell types in macro-dissected human brain 30 microarray data by tracking the collective rise and fall of previously identified cell type 31 specific transcripts. Similar techniques have been used to successfully predict cell type 32 content in human blood samples (6-9), as well as diseased and aged brain samples 33 (10-12). Our method was specifically designed for application to large, highly-34 normalized human brain transcriptional profiling datasets, such as those commonly 35 used by neuroscientific research bodies such as the Pritzker Neuropsychiatric Research 36 Consortium and the Allen Brain Institute. 37 We took advantage of a series of newly available data sources depicting the 38 transcriptome of known cell types, and applied them to infer the relative balance of cell 39 types in our tissue samples in a semi-supervised fashion. We draw from seven large 40 studies detailing cell-type specific gene expression in a wide variety of cells in the 41 forebrain and cortex (2,13-18). Our analyses include all major categories of cortical cell 42 types (17), including two overarching categories of neurons that have been implicated in 43 psychiatric illness (19): projection neurons, which are large, pyramidal, and 44 predominantly excitatory, and interneurons, which are small and predominantly 45 inhibitory (20). These are accompanied by the three prevalent forms of glia that make up the majority of cells in the brain: oligodendrocytes, which provide the insulating 47 myelin sheath that enhances electrical transmission in axons (21), astrocytes, which 48 help create the blood-brain barrier and provide structural and metabolic support for 49 neurons, including extracellular chemical and electrical homeostasis, signal 50 propagation, and response to injury (21), and microglia, which serve as the brain's 51 resident macrophages and provide an active immune response (21). We also 52 incorporate structural and vascular cell types: endothelial cells, which line the interior 53 surface of blood vessels, and mural cells (smooth muscle cells and pericytes), which 54 regulate blood flow (22). Progenitor cells may be less prevalent in the aging human 55 brain, but are widely regarded as important for the pathogenesis of mood disorders (23), 56 and thus were also included in our analysis. Within the cortex, these cells mostly take 57 the form of immature oligodendrocytes (17). Finally, the primary cells found in blood, 58 erythrocytes or red blood cells (RBCs), carry essential oxygen throughout the brain. 59 These cells do not contain a cell nucleus and do not generate new RNA, but still contain 60 an existing, highly-specialized transcriptome (24). The relative presence of these cells 61 could arguably represent overall blood flow, the functional marker of regional neural 62 activity traditionally used in human imaging studies. 63 To characterize the balance of these cell types in psychiatric samples, we first 64 compare the predictive value of cell type specific transcripts identified by diverse data 65 sources and then summarize their collective predictions of relative cell type balance into 66 covariates that can be used in larger linear regression models. We demonstrate that 67 statistically estimating the relative cell type balance of samples can explain a large 68 percentage of the variation in human brain microarray datasets. We also find that the nuclei or the cerebellum. In addition, we artificially generated a list of 17 transcripts 93 specific to erythrocytes (red blood cells or RBC) by searching Gene Card for erythrocyte 94 and hemoglobin-related genes (http://www.genecards.org/). In all, we curated gene 95 expression signatures for 10 cell types expected to account for most of the cells in the 96

brain. 97
Most of the cell-type specific transcripts were derived from microarray 98 experiments using cDNA extracted from laboratory mice, therefore in order to use this 99 information for the analysis of human microarray data it was necessary to identify the 100 respective orthologs for the cell type specific transcripts in humans using HCOP:

Microarray Data from Macro-Dissected Human Dorsolateral Prefrontal Cortex 112
Tissue 113 Next, we examined the collective variation in the levels of cell type specific 114 transcripts in an Affymetrix microarray dataset from 157 high-quality human post-115 mortem dorsolateral prefrontal cortex samples (Suppl. Table  2), including tissue from 116 subjects without a psychiatric or neurological diagnosis ("Controls", n=71), or diagnosed 117 with Major Depressive Disorder ("MDD", n=40), Bipolar Disorder ("BP", n=24), or 118 Schizophrenia ("Schiz", n= 22). The severity and duration of physiological stress at the 119 time of death was estimated by calculating an agonal factor score for each subject 120 (ranging from 0-4, with 4 representing severe physiological stress;; (25,26)). Additionally, 121 we measured the pH of cerebellar tissue as an indicator of the extent of oxygen 122 deprivation experienced around the time of death (25,26) and calculated the interval 123 between the estimated time of death and the freezing of the brain tissue (the 124 postmortem interval or PMI) using coroner records. 125 To predict the relative cell content in each of the samples, we used a technique 126 validated using datasets from purified cell types and artificial cell mixtures 127 (Supplementary Methods and Results, Suppl. Figs 1-4). We identified 2678 gene 128 probe sets in the Affymetrix dataset that were found in our curated database of cell type 129 specific transcripts as matched by official gene symbol. We centered and scaled the 130 expression level of each gene probeset across samples (mean=0, sd=1) to prevent Figure 2. Predicting the relative cell type balance in human brain samples using 139 genes previously-identified as having cell type specific expression. A. We 140 compiled a database of genes that were identified in previous publications as 141 specifically enriched in particular forebrain or cortical cell types. Within these 142 publications, researchers purified particular brain cell types or individual brain cells and 143 transcriptionally profiled them using microarray or RNA-Seq. They then performed 144 statistical comparisons across the purified brain cell types to determine which 145 transcripts showed relatively elevated expression in each cell type. B. Within the 146 Pritzker brain tissue samples, we expected variable cell type balance that would 147 influence the pattern of gene expression measured by microarray. To estimate this 148 variability, we pulled out the microarray data for probe sets representing genes that had 149 been previously identified as having cell type specific expression and then averaged 150 Relative'Predictions'of Sample'Cell'Type'Balance across the transcripts identified in each publication ( Figure  1) as specific to a particular 151 cell type to create 38 different "Cell Type Indices" that predicted relative cell content in 152 each of the cortical samples. 153

Type Specific Transcripts Originating from Different Publications 155
We found that the predicted cell content of the prefrontal cortex samples was 156 relatively similar regardless of the origin of the cell type specific gene lists used to 157 create the predictions. When comparing the pattern of correlations between the 38 cell 158 type indices, they clearly cluster into three large umbrella categories: Neurons, 159 Oligodendrocytes, and Support Cells (Astrocytes, Microglia, and Neurovasculature) 160 even when the cell type signatures were derived from cell type specific gene lists from 161 different source publications, species, and methodologies. This clustering was clear 162 using visual inspection of the correlation matrix (Figure  3), hierarchical clustering, or 163 consensus clustering (Suppl. Figure  5;; ConsensusClusterPlus: (27)). Moreover, the 164 clustering was not due to the different publications identifying a similar subset of cell-165 type specific genes, because the clustering persisted in a follow-up analysis in which 166 data from genes identified as cell type specific in multiple publications (e.g., 167 Cahoy_Astrocyte and Zhang_Astrocyte) were removed list wise from the dataset 168 (Suppl. Figure  6  &  7). Clustering was not able to reliably discern neuronal 169 subcategories (interneurons, projection neurons) or support cell subcategories.

Sample Variability in Microarray Data from Macro-Dissected Cortical Tissue 198
For further analyses, individual cell type indices were averaged within each of ten 199 primary categories: astrocytes, endothelial cells, mural cells, microglia, immature and 200 mature oligodendrocytes, red blood cells, interneurons, projection neurons, and indices 201 derived from neurons in general, with any transcripts that overlapped between 202 categories removed (Suppl. Figure  8). This led to ten consolidated primary cell-type 203 indices for each sample. Using these consolidated cell type indices and principal 204 components analysis, we found that the first principal component, which encompassed noticeably superior to the indices that were averaged by primary cell type for predicting 216 the principal components of variation in the dataset, although the variation in PC1 was 217 slightly better accounted for by the general neuron index derived from ((13), r-218 squared=0.62) and the variation in PC2 was best accounted for by the cortical 219 pyramidal neuron index (r-squared=0.65) and oligodendrocyte index (r-squared=0.57) 220 derived from (17). Human-derived indices did not outperform mouse-derived indices, 221 and indices derived from studies using stricter definitions of cell type specificity (fold 222 enrichment cut-off in Figure  1, e.g., (13) vs. (17)) did not outperform less strict indices. 223 To investigate whether the strong relationship between the top principal 224 components of variation in our dataset and cell type composition indices originated 225 artificially due to cell type specific genes representing a large percentage of the most 226 highly variable transcripts in the dataset, we performed principal components analysis 227 after excluding all cell type specific transcripts from the dataset and still found these 228 strong correlations (Suppl. Figure  9). Indeed, individual cell type indices better 229 accounted for the main principal components of variation in the microarray data than all 230 other major subject variables combined (pH, Agonal Factor, PMI, Age, Gender, 231 Diagnosis, Suicide;; PC1: R-squared=0.4272, PC2: R-squared=0.2176). When 232 examining the dataset as a whole, the six subject variables accounted for an average of 233 only 12% of the variation for any particular probe (R-squared, Adj.R-squared=0.0715), 234 whereas just the astrocyte and projection neuron indices alone were able to account for 235 17% (R-squared, Adj.R-squared=0.1601) and all 10 cell types accounted for an average 236 of 31% (R-squared, Adj.R-squared=0.263), almost one third of the variation present in 237 the data for any particular probe (Suppl. Figure  10). These results suggested that 238 accounting for cell type balance was highly important for the interpretation of microarray 239 data and could improve the signal-to-noise ratio in analyses aimed at identifying 240 psychiatric risk genes. 241 242 243 C. D.

Cell Type Indices Predict Other Genes Known to Be Cell Type Enriched 255
To identify other transcripts important to cell type specific functions in the human 256 cortex, we ran a linear model on the signal from each gene probeset in the Pritzker 257 prefontal cortex microarray dataset that included each of the ten consolidated primary 258 cell type indices as well as six co-variates traditionally included in the analysis of human 259 brain gene expression data (pH, Agonal Factor, PMI, Age, Gender, Diagnosis;; Equation 260 1 in Figure  5). On average, this model explained 35% of the variation in the data (R 2 ). 261 Shown in Figure  6 are the most significant 10 gene probe sets positively associated 262 with each cell type while controlling for the other cell types and co-variates within the 263 model. Additional gene probe sets and statistical details can be found in Suppl. Table  264 266    Table  4. 277 Many of the top gene probesets that we found to be related to each of the cell 278 type indices are already known to be associated with that cell type in previous 279 publications, validating our methodology. Importantly, this is true even when the genes 280 were not included in the original list of cell type specific genes used to generate the 281 index. For example, we found that HLA-E (Major Histocompatibility Complex, Class I, E) 282 and EPAS1 (endothelial PAS domain protein 1) were both strongly associated with our Member 21B) is a kinesin that has been found in the dendrites of pyramidal neurons 296 (34). We also rediscovered probesets representing genes that were listed as alternative 297 orthologs to those included in our original cell type specific gene lists (oligodendrocytes: 298 EVI2A vs.CTD-2370N5.3, microglia: LAIR1 vs. LAIR2, mural cells: COL18A1 vs. 299 COL15A1, ACTA2 vs. ACTG1). Altogether, these results suggest that our cell type 300 indices were associated with the variability of transcripts in the cortex that represented 301 particular cell types and could re-identify known cell type specific markers. 302 303

Data for >840 Samples from 160 Human Brain Regions 305
For validation, we decided to also apply our cell type analysis to a large Agilent 306 microarray dataset (841 samples) spanning 160 cortical and subcortical brain regions 307 from the Allen Brain Atlas (Suppl. Table  3;; (35)). This dataset included high-quality 308 tissue (absence of neuropathology, pH>6.7, PMI<31 hrs, RIN>5.5) from 6 human 309 subjects (36). The tissue samples were collected using a mixture of block dissection 310 and laser capture microscopy guided by adjacent tissue sections histologically stained 311 to identify traditional anatomical boundaries (37). 312 The 30,000 probes mapped onto 18,787 unique genes (as determined by gene 313 symbol). We found that 1608 of these genes were identified as having cell type specific 314 expression within our database. Then, using a procedure similar to that used for the 315 Pritzker prefrontal cortex dataset, we averaged the data from the cell-type specific 316 genes derived from each publication to predict the relative content of each of the 10 317 primary cell types in each sample. 318 319

Predicted Cell Content Accurately Reflects Regional Differences in Cell Type 320
Balance 321 To explore the generalizability of our method to non-cortical samples, we 322 examined the relative balance of each of the 10 primary cell types in all 160 brain 323 regions included in the Allen Brain Atlas microarray dataset. To do this, we used violin 324 plots, which are preferable for visualizing trends in data with small sample sizes (1-6 subjects per region). The results clearly indicated that our cell type analyses could 326 identify well-established differences in cell type balance across brain regions (Figure  7). 327 Within the choroid plexus, which is a villous structure located in the ventricles made up 328 of support cells (epithelium) and an extensive capillary network (38), there is an 329 enrichment of cells related to vasculature (endothelial cells, mural cells) and immunity 330 (microglia). In the corpus callosum, which is the primary myelinated fiber tract 331 connecting the cerebral hemispheres (38), there is an enrichment of oligodendrocytes 332 and microglia. The central glial substance is enriched with glia and support cells, with a 333 particular emphasis on astrocytes. The dentate gyrus, which is one of the only 334 neurogenic regions in the adult brain (39) and which contains the predominantly 335 glutamatergic granule cells projecting into the mossy fibre pathway (40), has an 336 enrichment of both immature-like cells and projection neurons. The internal segment of 337 the globus pallidus, which is highly GABA-ergic and named after its white matter 338 intrusions (38), was enriched for oligodendrocytes, astrocytes, and microglia, as well as 339 a prominent subset of interneurons. The relative cell content predictions for the other 340 brain regions can be found in Suppl. Table  5. Even though this analysis was based on 341 cell type specific genes identified in the forebrain and cortex, these results provide 342 fundamental validation that each of primary consolidated cell type indices is generally 343 tracking their respective cell type in subcortical structures. 344 345 Figure  7. Cell content predictions derived from microarray data capture canonical 347 regional differences in cell type balance. As an additional form of validation, we 348 applied our cell type analysis to a large Agilent microarray dataset (841 samples) 349 spanning 160 brain regions from the Allen Brain Atlas. Depicted below is the relative 350 average cell type specific expression for each subject (dot) for each of the 10 primary 351 cell type indices. A color-coded region encompasses the data cloud for each cell type. 352 Within the example brain regions, we are able to clearly identify canonical differences in 353 cell type balance, with black arrows indicating cell types that are known to be enriched 354 in that respective region. A. Within the choroid plexus there is an enrichment of cells 355     interneurons. The relative cell content predictions for the other brain regions can be 362 found in Suppl. Table  5. 363 Similar to the Pritzker dataset, we outputted a table of the top genes associated 364 with each cell type (as assessed using the model in Equation 4, Figure  5). We found 365 that the results included a mixture of well-known cell type markers and novel findings 366 (Suppl. Figure  11;; Suppl. Table  6). When this model was applied to the principal 367 sometimes differed from what was seen in the prefrontal cortex (Suppl. Figure  12). 373 Overall, these results indicate that our method for statistically predicting cell content can 374 be a useful addition to the analysis of non-cortical as well as cortical data sets. 375 376

Content 379
We next set out to observe the relationship between the predicted cell content of  However, in some regions of the prefrontal cortex, age-related decreases in grey matter 451 are primarily driven by synaptic atrophy instead of decreased cell number (49). This raised the question of whether the decline in our neuronal cell indices with age was 453 being largely driven by the enrichment of genes related to synaptic function in the index. 454 To explore this possibility, we first evaluated the relationship between age and gene 455 expression while controlling for likely confounds using the signal data for all probesets in 456 the dataset (Equation 3, Figure  5). We used "DAVID: Functional Annotation Tool" 457 (//david.ncifcrf.gov/summary.jsp, (50,51) to identify the functional clusters that were 458 overrepresented by the genes included in our neuronal cell type indices (using the full 459 HT-U133A chip as background), and then determined the average effect of age (beta) 460 for the genes included in each of the 240 functional clusters (Suppl. Table  7). The vast 461 majority of these functional clusters showed a negative relationship with age on average 462 (Suppl. Figure  13). However, these functional clusters overrepresented 463 dendritic/axonal related functions, so we blindly chose 29 functional clusters that were 464 clearly related to dendritic/axonal functions and 41 functional clusters that seemed 465 distinctly unrelated to dendritic/axonal functions (Suppl. Table  7). Using this approach, 466 we found that transcripts from both classifications of functional clusters showed an 467 average decrease in expression with age (T(28)=-4.5612, p= 9.197e-05, T(40)=-2.7566, 468 p=0.008756, respectively), but the decrease was larger for transcripts associated with 469 dendritic/axonal-related functions (T(50.082)=2.3385, p= 0.02339, Suppl. Figure  13). 470 Based on this analysis, we conclude that synaptic atrophy could be partially driving age-471 related effects on neuronal cell type indices in the human prefrontal cortex dataset but 472 are unlikely to fully explain the relationship.

Cell Type Balance Changes in Response to Psychiatric Diagnosis 495
Of most interest to us were potential changes in cell type balance in relation to 496 psychiatric illness. In previous post-mortem morphometric studies, there was evidence 497 of glial loss in the prefrontal cortex of subjects with Major Depressive Disorder, Bipolar Disorder, and Schizophrenia (reviewed in (53)). This decrease in glia, and particularly 499 astrocytes, was replicated experimentally in animals exposed to chronic stress (54), and 500 when induced pharmacologically, was capable of driving animals into a depressive-like 501 condition (54). Replicating the results of (46), we observed a moderate decrease in 502 astrocyte index in the prefrontal cortex of subjects with Major Depressive Disorder (β = -503 0.1326572, p= 0.0118), but did not see similar changes in the brains of subjects with 504 Bipolar Disorder or Schizophrenia (Figure  8f). We did not see significant changes in 505 any of the other cell type indices in relationship to diagnosis. 506 507

Improves the Detection of Diagnosis-Related Genes 509
Over the years, many researchers have been concerned that transcriptomic and 510 genomic analyses of psychiatric disease often produce non-replicable or contradictory 511 results and, perhaps more disturbingly, are typically unable to replicate well-512 documented effects detected by other methods. We posited that this lack of sensitivity 513 and replicability might be partially due to cell type variability in the samples, especially 514 since such a large percentage of the principal components of variation in our samples 515 are explained by neuron to glia ratio. Therefore, we compiled a list of genes that had 516 previously documented relationships with psychiatric illness in particular cell types in the 517 human prefrontal cortex, as detected using in situ hybridization, immunocytochemistry, 518 or single-cell laser capture microscopy. These included several genes with a well-519  (19);; (60)) and HTR2A, which encodes a protein 526 increased in projection neurons in subjects who committed suicide (61). As further 527 validation, it seemed prudent to include genes that were known to have differential 528 expression in relationship with non-psychiatric variables in specific cells within the 529 prefrontal cortex as well. These included CALB1 and CALB2, both of which encode 530 proteins in neurons that decrease with age (62).

531
We then examined our ability to detect these known relationships using models   Figure  5). 541 We found that including predictions of cell type balance in our models assessing 542 the effect of diagnosis or age on the expression of our validation genes dramatically 543 improved model fit as assessed by Akaike's Information Criterion (AIC) or Bayesian Information Criterion (BIC), and a 27% reduction in residual standard error (Figure  9, 545 Suppl. Figure  15). These improvements were largest with the addition of the five most 546 prevalent cell types to the model;; the addition of less common cell types produced 547 smaller gains. We also tried replacing the diagnosis term in our models with a more 548 general term representing presence or absence of a psychiatric condition because we 549 had found in the past that many of the genes that were associated with diagnosis in our 550 samples were altered across diagnostic categories. This replacement slightly improved 551 model fit in all versions of the analysis (Eq. 1, 3, 5-7). 552 Overall we found that adding predictions of cell type balance to our models 553 improved our ability to detect previously-documented relationships with diagnosis in the 554 Pritzker dataset (Figure  9). Prior to the addition of cell type to the model, we found that 555 only one of 32 genes with a previously documented relationship to diagnosis in 556 individual cells in the prefrontal cortex showed that relationship with a nominal p<0.05 in 557 our dataset (Eq. 5: HTR2A, Eq.1: SLC6A1). After including cell type balance in the 558 model, the relationship of three genes with diagnosis was now detectable (SLC6A1, 559 SST, COX7B;; Suppl. Figure  16). Overall, the number of validation genes showing the 560 same direction of effect as previously documented increased from 56% (18/32) to 68-561 72% (23/32). Models that included a more general term for presence or absence of a 562 psychiatric condition performed even better (Suppl. Figure  17). When using a basic 563 model (Eq. 5) or when controlling for known confounds (Eq. 3) only one out of 32 564 validation genes were associated with psychiatric illness (SLC6A1). However, once we 565 included predicted cell type balance (Eq. 1) five of the 32 validation genes showed a 566 diagnosis relationship (p<0.05, SST, PVALB, LGALS1, MGST3, ACTR10), and the percentage of validation genes showing the same direction of effect as previously 568 documented increased from 34% (11/32) to 78% (25/32), a significant improvement as 569 indicated by Fisher's exact test (p=0.036). The use of forward/backward stepwise model 570 selection (using the R function stepwise{Rcmdr}, criterion=BIC) drawing from a pool of 571 variables that included diagnosis, general presence of a psychiatric illness, suicide, 572 known confounds, and all 10 cell types, was also successful at detecting several genes 573 in association with psychiatric illness (SST, SLC6A1, MGST3, PCSK1) and suicide 574 (LGALS1), but these results should be viewed more cautiously due to the known 575 presence of overfitting in stepwise procedures producing overly optimistic p-values. 576 Backward/forward stepwise selection was noticeably less sensitive and included 577 multiple false positives (incorrect direction of effect with a p<0.05). Both genes with a 578 previously documented relationship to age (CALB1, CALB2) had such strong age-579 related effects in our dataset (p=4.84E-23, p=8.33E-08, respectively) that model 580 specification had little impact on their results (Suppl. Figure  18). 581 582   We found that adding predictions of cell type balance to our models also 617 improved our ability to detect altered gene expression associated with known genetic 618 risk loci as well as candidate genes identified by convergent functional genomics, and 619 even enhanced our ability to replicate previous findings from macro-dissected 620 microarray. For example, SZGene.org identified 38 top genes associated with genetic 621 risk loci for Schizophrenia (as reported in (63) represented by probesets in our dataset. Only one of these was found to have a 637 significant relationship with psychiatric illness while accounting for confounds detect relationships between gene expression and diagnosis-related genes identified by 662 a variety of techniques (Fisher's exact test: p=0.0221, Figure  9F). 663 664

Content Predictions Pinpoint Known Risk Candidates 666
Although the inclusion of predicted cell type balance in our model improved our 667 ability to detect previously-identified relationships with diagnosis, most relationships still improves the signal to noise ratio of our analyses, then we would expect that the top 676 diagnosis-related genes in our dataset would be more likely to overlap with previous 677 findings. In an attempt to perform this comparison in an unbiased and efficient manner, 678 we limited our search to PubMed, using as search terms only the respective human 679 gene symbol and diagnosis ("Schizophrenia", "Bipolar", or "Depression"). For the genes 680 related to MDD in our dataset, we also expanded the search to include two highly-681 correlated traits that are more quantifiable and likely to have a genetic basis: "Anxiety" 682 and "Suicide". Then we narrowed our results only to studies using human subjects.
We found that only one of the top 10 diagnosis-related genes detected using a 684 model that included diagnosis and known confounds (Equation 3) was previously noted 685 in the human literature (FOS:  (68,69)). The same was true if we replaced diagnosis with 686 a term representing the general presence or absence of a psychiatric illness (ALDH1A1: 687 (64)). In contrast, when we used a model that included diagnosis, known confounds, 688 and predictions for the balance of the five most prevalent cortical cell types (Equation  Table  9, Suppl. 702 Table  11), and 9/10 of the top genes were actually significant with an FDR<0.05 when 703 using permutation based methods (using the R function lmp{lmPerm}, iterations=9999). 704 The top 10 genes associated with psychiatric illness in models selected using In this manuscript, we have demonstrated that the statistical cell type index is a 717 relatively simple manner of interrogating cell-type specific expression in transcriptomic 718 datasets from macro-dissected human brain tissue. We find that statistical estimations 719 of cell type balance almost fully account for the principal components of variation in 720 microarray data derived from macrodissected brain tissue samples, far surpassing the 721 importance of other subject variables (post-mortem interval, hypoxia, age, gender). 722 Indeed, our results suggest that many variables of medical interest are themselves 723 accompanied by strong changes in cell type composition in naturally-observed human 724 brains. We find that within both chronic (age, sex, diagnosis) and acute conditions 725 (agonal, PMI, pH) there is substantial turbulence in the relative representation of 726 different cell types. Thus, accounting for demography at the cellular population level is 727 as important for the interpretation of microarray data as cell-level functional regulation. 728 This form of data deconvolution was particularly useful for identifying the subtler effects of psychiatric illness within our samples, divulging the decrease in astrocytes that is 730 known to occur in Major Depressive Disorder, and doubling the sensitivity of our assay 731 to detect previously-identified diagnosis-related genes. 732 These results touch upon the fundamental question as to whether organ-level 733 degradation, or an overall change in transcription rate. In this regard the index that we calculate does not have a specific interpretation;; rather it is a holistic property of the cell 753 populations, the "neuron-ness" or "microglia-ness" of the sample. Such an abstract 754 index represents the ecological shifts inferred from the pooled transcriptome. That said, 755 unlike principal component scores or other associated techniques of removing 756 unwanted variation from genomic data, our cell type indices do have real biological 757 meaning - they can be interpreted in a known system of cell type taxonomy. When 758 single-cell genomic data uncovers new cell types (e.g., the Allen Brain Atlas cellular 759 taxonomy initiative (78)) or meta-analyses refine the list of genes defined as having cell-760 type specific expression (e.g., (79)), our indices will surely evolve with these new 761 classification frameworks, but the power of our approach will remain, in that we can 762 disentangle the intrinsic changes of individual genes from the population-level shifts of 763 major cell types. The same approach can be extended to studying other structurally 764 complex organs that involve the concerted function of many cell types.

765
Although we generated our method independently to address microarray analysis 766 questions that arose within the Pritzker Neuropsychiatric Consortium, we later 767 discovered that it was quite similar to the technique of population-specific expression 768 analysis (PSEA) introduced by (12) with several notable differences. Similar to our 769 method, PSEA aims to estimate cell type-differentiated disease effects from microarray 770 data derived from brain tissue of heterogeneous composition and approaches this 771 problem by including the averaged, normalized expression of cell type specific markers 772 within a larger linear model that is used to estimate differential expression in microarray 773 data. Likewise, using PSEA, (12) also found that individual variability in neuronal, 774 astrocytic, oligodendrocytic, and microglial cell content was sufficient to account for substantial variability in the vast majority of probe sets, even within non-diseased 776 samples. Most importantly, the PSEA technique has been carefully validated: PSEA 777 was found to successfully predict the content of RNA-mixing experiments (12), cellular 778 expression data from in situ hybridization or laser-capture microdissection experiments 779 (11), and neuron-specific neurodegenerative effects found with laser-capture 780 microdissection (10).The differences between our techniques are mostly due to our 781 access to a large sample size and the recent growth of the literature documenting cell 782 type specific expression in brain cell types. PSEA uses a very small set of markers (4-7) 783 to represent each cell type, and screens these markers for tight co-expression within the 784 dataset of interest, since co-expression networks have been previously demonstrated to 785 often represent cell type signatures in the data (80). This is essential for the analysis of 786 microarray data for brain regions that have not been well characterized for cell type 787 specific expression (e.g., the substantia nigra), but risks the possibility of closely 788 tracking variability in a particular cell function instead of cell content (as described in our 789 results related to aging). Our analysis predominantly focused on the well-studied cortex, 790 thus enabling us to expand our analysis to include hundreds of cell type specific 791 markers derived from a variety of experimental techniques. Likewise, PSEA was 792 designed for use with small microarray datasets, and thus depends on a variety of 793 model selection techniques to minimize the number of terms included in the linear 794 model. Although necessary, this step introduces the risk of mis-assigning effects 795 associated with correlated cell types. Using a large dataset gave us the opportunity to 796 include terms for all major cell types in the analysis, as well as terms representing a 797 number of important identified confounds (age, pH, PMI, gender). Due to these analytical differences, we are able to effectively characterize gene expression 799 associated with less prevalent cell types (e.g., endothelial cells) and compare the utility 800 of cell type specific markers derived from a variety of species and experimental 801 techniques.

802
There was one seemingly-small difference between our method and PSEA that 803 actually turned out to produce a large difference in efficacy: normalization of the original 804 gene expression data using a z-score instead of a ratio of the mean (81). As part of a 805 set of later validation analyses (Suppl. Methods and Results, Suppl. Figures  25-26), 806 we performed a head-to-head comparison of our method and PSEA using a single-cell 807 RNA-Seq dataset and the same database of cell type specific genes. Both methods 808 strongly predicted cell identity, but on average we found that one third of the variation in 809 the predictions of relative cell content derived from PSEA ("population reference signal") 810 were related to the cell identity of the samples versus almost half of the variation in our 811 consolidated cell type indices. We conclude that our method may be a more effective 812 manner of predicting cell type balance in some datasets. 813 Another notable difference between our final analysis methods and those used 814 by PSEA (10-12) was the lack of cell type interaction terms included in our models 815 (e.g., Diagnosis*Astrocyte Index). Theoretically, the addition of cell type interaction 816 terms should allow the researcher to statistically interrogate cell-type differentiated 817 diagnosis effects because samples that contain more of a particular cell type should 818 exhibit more of that cell type's respective diagnosis effect. Versions of this form of 819 analysis have been successful in other investigations (e.g., (11,12,82)) but we were not 820 able to validate the method using our database of previously-documented relationships with diagnosis in prefrontal cell types and a variety of model specifications (e.g., Suppl. 822 Figure  27). Upon consideration, we realized that these negative results were difficult to 823 interpret because significant diagnosis*cell type interactions should only become 824 evident if the effect of diagnosis in a particular cell type is different from what is 825 occurring in all cell types on average. For genes with expression that is reasonably 826 specific to a particular cell type (e.g., GAD1), the overall average diagnosis effect may 827 already largely reflect the effect within that cell type and the respective interaction term 828 will not be significantly different, even though the disease effect is clearly tracking the 829 balance of that cell population. In the end, we decided that the addition of interaction 830 terms to our models was not demonstrably worth the associated decrease in overall 831 model fit and statistical power. 832 One result from our analysis seems particularly worth discussing in greater 833 depth. It has been acknowledged for a long time that exposure to a hypoxic 834 environment prior to death has a huge impact on gene expression in human post-835 mortem brains (e.g., (25,26,83,84)). This impact on gene expression is so large that up 836 until recently the primary principal component of variation (PC1) in our Pritzker data was 837 assumed to represent the degree of hypoxia, and was sometimes even systematically 838 removed before performing diagnosis-related analyses (e.g., (85)). However, the 839 magnitude of the effect of hypoxia was puzzling, especially when compared to the much 840 more moderate effects of post-mortem interval, even when the intervals ranged from 8-841 40+ hrs. Our current analysis provides an explanation for this discrepancy, since it is 842 clear from our results that the brains of our subjects are actively compensating for a 843 hypoxic environment prior to death by altering the balance or overall transcriptional activity of support cells and neurons. Although the differential effects of hypoxia on 845 neurons and glial cells have been studied since the 1960's (86), to our knowledge this is 846 the first time that anyone has related the large effects of hypoxia in post-mortem 847 transcriptomic data to alterations in cell type balance in the samples. This connection is 848 important for understanding why results associating gene expression and psychiatric 849 illness in human post-mortem tissue sometimes do not replicate. If a study contains 850 mostly tissue from individuals who experienced greater hypoxia before death (e.g., 851 hospital care with artificial respiration or drug overdose followed by coma), then the 852 evaluation of the effect of neuropsychiatric illness is likely to inadvertently focus on 853 differential expression in support cell types (astrocytes, endothelial cells), whereas a 854 study that mostly contains tissue from individuals who died a fast death (e.g., car 855 accident or myocardial infarction) will emphasize the effects of neuropsychiatric illness 856 in neurons. 857 Finally, our work drives home the fact that any comprehensive theory of 858 psychiatric illness needs to account for the dichotomy between the health of individual 859 cells and that of their ecosystem. We found that the functional changes accompanying 860 psychiatric illness in the dorsolateral prefrontal cortex occurred both at the level of cell 861 population shifts (decreased astrocytic presence) and at the level of intrinsic gene 862 regulation not explained by population shifts. A similar conclusion regarding the 863 importance of cell type balance in association with psychiatric illness was recently 864 drawn by our collaborators (e.g.,(87)) using a similar technique to analyze RNA-Seq 865 data from the anterior cingulate cortex. In the future, we plan to use our technique to re-866 analyze many of the other large microarray datasets existing within the Pritzker Neuropsychiatric Consortium with the hope of gaining better insight into psychiatric 868 disease effects. This application of our technique seems particularly important in light of 869 recent evidence linking disrupted neuroimmunity (74) and neuroglia (e.g., (46,54,88)) to 870 psychiatric illness, as well as growing evidence that growth factors with cell type specific 871 effects play an important role in depressive illness and emotional regulation (e.g.,  Derived Neurotrophic Factor (BDNF), the Fibroblast Growth Factor (FGF) family, Glial-873 cell derived neurotrophic factor (GDNF), Vascular Endothelial Growth Factor (VEGF);; 874 for a review, see (23,89)). 875 In conclusion, we have found this method to be a valuable addition to traditional 876 functional ontology tools as a manner of improving the interpretation of transcriptomic 877 results as well as removing unwanted noise due to variations in cell content caused by 878 dissection variability. The capability to unravel alterations of cell type composition from 879 modulation of cell state, even just probabilistically, is inherently useful for understanding 880 the higher-level function of the brain as emergent properties of brain activity, such as 881 emotion, cognition, memory, and addiction, usually involve ensembles of many cells. 882 Facilitating the interpretation of gene activity data in macro-dissected tissue in light of 883 both processes provides new opportunities to integrate results with findings from other 884 approaches, such as electrophysiology analysis of brain circuits, brain imaging, 885 optogenetic manipulations, and naturally occurring variation in response to injury and 886 brain diseases.

887
For the benefit of other researchers, we have made our database of brain cell 888 type specific genes (https://sites.google.com/a/umich.edu/megan-hastings-889 hagenauer/home/cell-type-analysis) and R code for conducting cell type analyses relative content of each of the 10 primary cell types in each sample. All of the R script 960 documenting these analyses can be found at 961 https://github.com/hagenaue/CellTypeAnalyses_AllenBrainAtlas. 962