Transcriptomic stratification of late-onset Alzheimer's cases reveals novel genetic modifiers of disease pathology

Late-Onset Alzheimer’s disease (LOAD) is a common, complex genetic disorder well-known for its heterogeneous pathology. The genetic heterogeneity underlying common, complex diseases poses a major challenge for targeted therapies and the identification of novel disease-associated variants. Case-control approaches are often limited to examining a specific outcome in a group of heterogenous patients with different clinical characteristics. Here, we developed a novel approach to define relevant transcriptomic endophenotypes and stratify decedents based on molecular profiles in three independent human LOAD cohorts. By integrating post-mortem brain gene co-expression data from 2114 human samples with LOAD, we developed a novel quantitative, composite phenotype that can better account for the heterogeneity in genetic architecture underlying the disease. We used iterative weighted gene co-expression network analysis (WGCNA) to reduce data dimensionality and to isolate gene sets that are highly co-expressed within disease subtypes and represent specific molecular pathways. We then performed single variant association testing using whole genome-sequencing data for the novel composite phenotype in order to identify genetic loci that contribute to disease heterogeneity. Distinct LOAD subtypes were identified for all three study cohorts (two in ROSMAP, three in Mayo Clinic, and two in Mount Sinai Brain Bank). Single variant association analysis identified a genome-wide significant variant in TMEM106B (p-value < 5×10−8, rs1990620G) in the ROSMAP cohort that confers protection from the inflammatory LOAD subtype. Taken together, our novel approach can be used to stratify LOAD into distinct molecular subtypes based on affected disease pathways.


Introduction
Late-onset Alzheimer's disease (LOAD) is the most common form of dementia in the elderly. The clinical features associated with LOAD are an amnesic type of memory impairment, deterioration of language, and visuospatial deficits. In the later stages of the disease, symptoms may include motor and sensory abnormalities, gait disturbances, and seizures. Without advances in therapy, the number of symptomatic cases in the United States is predicted to rise to 13.2 million by 2050 [1].
Many common, complex diseases such as LOAD present with heterogeneous phenotypes due to interactions between genetic and environmental factors affecting a range of pathways and processes. LOAD has no simple form of inheritance and is governed by a common set of risk alleles across multiple genes that, in combination, have a substantial effect on disease predisposition and age of onset [2]. Genome-Wide Association Studies (GWAS) have become an important tool for identifying variants in complex diseases [3,4]. GWAS for LOAD have identified variants in over 500 genes as potential risk factors with the ε4 variant in APOE as the strongest contributor to overall disease risk [2,5]. LOAD has a strong polygenic component and an estimated heritability of up to 80% [6]. It has been challenging to transition from the identification of associated genetic variants to the molecular mechanisms that lead to the accumulation of amyloid plaques and helical tau filaments [7]. Furthermore, there is mounting evidence that the observed heterogeneity in LOAD is associated with multiple distinct subtypes [8,9].
Gene co-expression modules tend to consist of genes that belong to the same cellular pathways or programs and help explain the global properties of the transcriptome as it relates to disease risk [10]. Networks-based co-expression module approaches have been used to identify causal variants in Late-Onset Alzheimer's disease [7,11]. However, such studies have failed to account for the heterogeneity of mechanisms that lead to complex diseases. Here, we analyze whole genome sequencing (WGS) and whole transcriptome data from three independent human cohorts from the Accelerating Medicines Partnership-Alzheimer's Disease (AMP-AD) Consortium. We use gene co-expression modules to develop quantitative phenotypes that account for the complex genetic architecture and heterogeneity of LOAD to more effectively map associated variants using genome-wide association. Furthermore, the method presented in this paper can be used to identify variants in other complex diseases.

Description of post-mortem transcriptome study populations
To define novel quantitative phenotypes for LOAD, we obtained 26 post-mortem brain coexpression modules (doi.org/10.7303/syn11932957.1) that were harmonized from three independent cohorts of the AMP-AD consortium (Fig 1A, S1 Fig). This included post-mortem brain samples of 623 decedents from the ROSMAP cohort for the dorsolateral prefrontal cortex (DLPFC) brain region, 271 decedents from the Mayo cohort for the temporal cortex (TCX) brain region, and 364 decedents from the MSBB cohort for the frontopolar prefrontal cortex (FP), inferior temporal gyrus (IFG), parahippocampal gyrus (PHG) and superior temporal gyrus (STG) brain regions (Fig 1A). Approximately one-third of the patients were diagnosed  (LOAD). A modified procedure using seven different WGCNA protocols, followed by merging by clustering methods, was performed to obtain four modules based on gene co-expression. (B) Each of the 26 modules was subjected to iterativeWGCNA, a procedure that repeatedly performs WGCNA on expression data to generate highly correlated gene sets and exclude weakly correlated genes. 68 submodules were generated from the 26 modules. (C) The eigengene, or first principal component, was calculated for a subset of 48 submodules from four brain regions (DLPFC, TCX, FP, PHG) and used as a quantitative trait for single-variant association mapping. Furthermore, the eigengene expression for LOAD cases was used to perform cluster analysis and generate subtypes of LOAD cases. The Euclidean distance of each patient from each subtype centroid was used as additional quantitative trait-the subtype specificity metric-in a single-variant association mapping. ROSMAP mapping was used as the baseline, with the Mayo and MSBB cohorts serving as replicates. https://doi.org/10.1371/journal.pgen.1008775.g001

PLOS GENETICS
with LOAD, while two-thirds were considered controls. The control group included elderly with normal cognition, as well as mild cognitive impairment and other forms of dementia. An overview of the post-mortem brain samples used in our analysis pipeline is provided in S1 and S2 Tables. Details on post-mortem brain sample collection, tissue and RNA preparation, sequencing, and sample quality control can be found in published work related to each cohort [12][13][14]. Multiple variables, such as sex, age of death, and sequencing batch effects were used as covariates in the normalization process to remove possible confounding factors across cohorts (Methods) [15].

Refinement of 26 human co-expression modules identifies diseaseassociated transcriptomic signals
We performed an iterative gene list pruning process using the iterativeWGCNA approach [16] to refine the 26 human co-expression modules from the AMP-AD consortium (S3 Table). Each of these 26 modules contains several thousand co-expressed genes, implicated in multiple disease processes across multiple cell types. Therefore, it is often difficult to assign a cell-typespecific role for expression modules linked to a certain brain region. Our approach resulted in 68 distinct subsets, or submodules, of highly correlated genes that were exclusive to each module ( Fig 1B, S4 Table). Genes that were not highly correlated to any submodule were removed since they are less likely to contribute to the overall signal of the submodule and more likely to introduce noise. We then annotated the 68 co-expression submodules to identify molecular pathways and processes that are significantly enriched within submodules across the six brain regions from the three independent LOAD cohorts (S2 Fig). Pathway enrichment analysis was performed using GO terms, KEGG pathways, and Reactome pathway data sets to highlight the biological specificity of co-expression signals captured by the different submodules (S5 Table). We identified multiple functional consensus clusters across the 68 submodules, which showed a significant overlap in functional enrichment for similar biological pathways and processes across the six brain regions (S2 Fig). These functional consensus clusters associated with the 68 submodules revealed gene sets for specific biological pathways, including tau-protein kinase activity, neuroinflammation, myelination, and cytoskeletal reorganization (S2 Fig). Furthermore, incorporating information from previously defined cell-type-specific markers derived from bulk and single cell RNA Sequencing (RNA-Seq) [17] showed that refining the 26 coexpression modules into 68 submodules resulted in multiple novel submodules enriched for cell-type-specific markers (Fig 2, S3 Fig).

Single-variant association mapping of submodule eigengenes across cohorts
To map the genetic drivers of biological disease-associated signals resolved by submodules, we performed single-variant association mapping of submodule eigengenes ( Fig 1C) using wholegenome sequencing data from the AMP-AD knowledge portal (doi.org/10.7303/ syn10901601). We then applied a variance component linear mixed model implemented in the EMMAX software to identify novel genetic loci associated with submodule eigengenes (Methods) [18]. Eigengenes were defined as the first principle component of the gene expression data associated with each of the 68 submodules. They capture most of the variation in gene coexpression and reduce noise associated with the transcriptomic data. We included eigengene expression data from four of the brain regions (TCX, PHG, FP, DLPFC), focusing on tissues from the frontal cortex, temporal cortex, and hippocampus due to their relevance to LOAD neuropathology [19]. QQ plots indicate minimal effects of genomic inflation, and consequently population substructure, on the analyses (S4 Fig). Genome-wide suggestive and significant loci were detected for submodules in the four brain regions (Fig 3, S6-S9 Tables). We identified multiple loci that were replicated across the Eigengene expression for each submodule was used as a quantitative trait when performing single-variant mapping. Multiple submodule eigengenes were associated with SNPs at a genome-wide significance level of p = 5×10 −8 (red dotted line). Loci of interest are annotated with the gene closest to the region. Some SNPs were also detected at a genome-wide suggestive level of p = 1×10 −5 (yellow dotted line). DLPFCblue_3 contains genes related to the TREM2/TYROBP pathway, an important network of genes related to microglial activation during neuroinflammation of the brain. Submodules were associated with both unique and overlapping loci. For example, DLPFCbrown_1 and DLPFCyellow_2 are derived from separate co-expression modules but were both associated with the TMEM106B locus. Similarly, DLPFCyellow_1 and DLPFCyellow_2 were derived from the same co-expression module but were associated with a mix of overlapping and unique loci.

PLOS GENETICS
three cohorts at a genome-wide significant level. For instance, rs1990620 is a known LOADassociated variant in TMEM106B that was identified as genome-wide significant in the DLPFC region from the ROSMAP cohort and was replicated (p < 5×10 −2 ) in the remaining three brain regions from the Mayo and MSBB cohorts. This highlights the usefulness of our newly derived quantitative phenotypes to identify genetic variants associated with specific co-expression submodules driving disease pathology.

Stratification of LOAD cases based on 68 AMP-AD co-expression submodules
We next assessed if submodule composite phenotypes could be used to better account for the observed heterogeneity in the genetic architecture of LOAD by clustering patients based on their co-expression profiles (Fig 1C). Clustering was performed to determine subtypes of LOAD cases for four brain regions (TCX, PHG, FP, DLPFC). The NbClust R package was used to identify the optimal number of clusters for different clustering methods by polling with the majority rule across 30 indices [20]. The NbClust package identified two subtype clusters for the ROSMAP (DLPFC region) and MSBB cohorts (FP, PHG regions), while three clusters were observed for the Mayo cohort (TCX region). An example for the ROSMAP cohort is shown in Fig 4. The number of cases in each identified subtype cluster was balanced across all three cohorts (S10 Table). A comparison of the different methods in terms of cluster assignment indicates that results are not affected substantially by the choice of clustering method in either of the cohorts (S11 Table). Notably, our newly defined molecular subtypes were not enriched for common LOAD-associated covariates, such as sex, APOEε4 genotype, or years of education ( Fig 4C, S5 Fig). Furthermore, eigengene expression profiles for each subtype were used to assess the association of each subtype with molecular and biological pathways associated with submodules ( Fig 4D). We observed no significant enrichment of cognitive or neuropathological measures between the subtypes for the DLPFC region (S5 Fig).

Single-variant association mapping of subtype specificity metric in ROSMAP
In order to determine genetic variants associated with subtype classification, we used the Euclidean distances of each patient from the centroid of each subtype as a quantitative trait for genetic mapping. We performed genome-wide mapping for LOAD subtype association using the 623 patients of the ROSMAP cohort due to the larger sample size when compared to the MSBB and Mayo cohorts.
Genome wide association mapping revealed various significant variants across subtypes in ROSMAP decedents (S6 Fig, S9 Table). Several variants in TMEM106B reached our genomewide significance threshold after multiple testing correction (p < 5×10 −8 ). TMEM106B is a known modifier of neurodegenerative disease and cognitive aging, which has been previously linked with cognitive performance [21]. Among the variants we identified in TMEM106B, one genome-wide suggestive allele was identified for LOAD Subtype B (p < 4×10 −6 , rs1990620 G ) in ROSMAP. This association with the protective rs1990620 G variant reached a genome-wide significant level with three of our previously mapped co-expression submodules from the ROSMAP cohort (Fig 3): DLPFCbrown_2 (p = 3.72x10 -07 ), DLPFCbrown_1 (p = 8.91x10 -11 ), and DLPFCyellow_2 (p = 5.88x10 -14 ). The DLPFCbrown_1 submodule is enriched for genes related to myelination and lysosomal activity (KEGG pathways hsa00600 and hsa04142), while DLPFCyellow_2 is enriched for genes related to endocytosis and potassium channel activity (KEGG pathway hsa04144 and Reactome pathway R-HSA-1296071). We replicated the subtype specific association of the protective rs1990620 G variant in the TCX brain region from Mayo cohort (Subtype B, p = 0.041), while we did not observe a significant association with variants in TMEM106B in the FP and PHG brain regions in the MSBB cohort. To our knowledge, this is the first report associating protective TMEM106B variants with molecular LOAD endophenotypes that link disruption of lysosomal and myelination pathways to disease subtypes. This is in line with results from a study in mice which showed that loss of TMEM106B function rescued lysosomal phenotypes related to frontotemporal dementia [22]. Furthermore, the identified protective allele rs1990620 G disrupts a known CCCTC-binding factor (CTCF) site, which has been shown to modify the inflammatory response in the course of aging [23]. Differential expression analysis of haplotype carriers of the protective rs1990620 G variant in Clustering on eigengene expression in ROSMAP data generates two subtypes. (A) Eigengene expression was used to cluster LOAD cases into subtypes using K-Means clustering for the DLPFC region. The number of clusters were determined by democratizing results across 30 mathematical indices using the NbClust R package. Two clusters with similar number of cases were generated. (B) Silhouette plots were generated by four different clustering methods. The mean silhouette width of a cluster represents how similar objects are to the centroid of the cluster, and the mean silhouette width of all objects represent how well the data have been clustered. In the case of ROSMAP, the K-means method had the highest mean silhouette width across all LOAD cases. We performed a similar analysis for other brain regions. (C) No significant differences in proportion of sex, APOEε4 genotype, and years of education between subtypes. (D) A strong immune and neuronal signal in the scaled eigengene expression profile of the subtypes compared to control decedents (including MCI). https://doi.org/10.1371/journal.pgen.1008775.g004

PLOS GENETICS
TMEM106B showed an up-regulation of neuroactive ligand receptor interactions, while decedents carrying the risk variant showed significant up-regulation for pathways related to neuroinflammation (KEGG pathway hsa04380) (S7 Fig). Besides the association with TMEM106B in Subtype B, protective variants near MTUS2 were identified which are in close vicinity to HMGB1, a locus that has been previously implicated in brain atrophy [24]. In order to provide a better overview of the associated loci, we generated a directed network to visualize loci that were associated with different modules, submodules, subtypes, and diagnostic criteria for the ROSMAP cohort (Methods). Interestingly, we observed that while certain loci were uniquely associated with single modules or submodules, a community of shared loci was associated with modules and submodules annotated for microglia, endothelial cells, astrocytes, and oligodendrocytes ( Fig 5). A separate community of loci was associated with modules and submodules annotated for proteostasis ( Fig 5). Many of the loci associated with diagnostic criterion in the ROSMAP cohort were independent from these two communities ( Fig 5). Only one locus was identified which showed a suggestive association with both Braak stage and Subtype B in our analysis. We created a directed network describing the loci detected from the multiple analyses in this study. Blue nodes represent loci. Red nodes represent phenotypes. An edge from a phenotype to a genetic locus signifies that the locus is associated with that phenotype. Diagnostic phenotypes (red edges; right) were associated with some of the loci detected in this study. The module eigengenes (yellow edges), submodule eigengenes (green edges), and subtypes (blue edges) were associated with both overlapping and unique loci (center and left). A community of loci was associated with multiple submodules associated with microglia, endothelial cells, astrocytes, and oligodendrocytes (center). A small community of loci was associated with submodules related to proteostasis (left). https://doi.org/10.1371/journal.pgen.1008775.g005

Suggestive SNPs in ROSMAP are replicated in the Mayo and MSBB cohorts
To assess the validity of our genetic findings in ROSMAP, we aimed to replicate our results across three brain regions from the Mayo (TCX brain region) and MSBB (PHG, FP brain regions) cohorts ( Fig 1C). In addition, we compared our results to a catalog of recently published GWAS results in order to evaluate the novelty of our findings.
A total of 1326 unique variants representing 163 loci reached a genome-wide suggestive or significant p-value (p < 1×10 −5 ) in the DLPFC region when pooled from all 11 DLPFC eigengenes and two subtype-specific variant mapping analyses (S12 Table). Of these, 645 SNPs were replicated in the PHG analyses, 762 SNPs were replicated in the FP analysis, and 482 SNPs were replicated in the TCX analyses (replication threshold, p < 1×10 −2 ). Overlapping coexpression submodules across brain regions (S2 Fig) were associated with similar loci.
Of the 1326 variants identified in ROSMAP, 29 variants have also been previously reported in the NHGRI-EBI catalog (S13 Table). In each case, the most significant SNP from a prior study was a suggestive SNP in the DLPFC region. Fifteen of these 29 previously reported variants were suggestive SNPs at the TMEM106B locus in the DLPFC region. These 15 variants were previously reported for association with traits such as depression [25][26][27], neuroticism [26,[28][29][30][31], coronary artery disease [32], and frontotemporal dementia [33]. The TMEM106B variant associated with dementia, rs1990620, was replicated with submodule eigengene expression in three out of four brain regions (DLPFC, TCX, PHG) in the AMP-AD cohorts (S12 and S13 Tables). An ITGA2B variant (rs5910), previously associated with Parkinson's disease [34], was replicated in the TCX and PHG regions (S12 and S13 Tables). Three suggestive ROSMAP variants at the LMX1B locus were previously reported for association with glaucoma [35][36][37], and replicated in the TCX and FP regions (S12 and S13 Tables). Taken together, however, a significant number of the 163 loci detected in the ROSMAP cohort implicated novel variants in LOAD processes, many of which were replicated in brain regions from the Mayo and MSBB cohorts.

Molecular LOAD subtypes differ in their inflammatory response
To better understand the underlying molecular differences across the novel LOAD subtypes, we performed differential expression analysis for each subtype against a set of controls in the ROSMAP cohort ( Fig 6A, S8 Fig). The set of controls included 471 decedents who were either cognitively normal or had mild cognitive impairment. Performing this analysis without the mild cognitive impairment cases yielded essentially the same results (S9 Fig). The Venn diagram in Fig 6B depicts the comparison across the two subtypes. Interestingly, we found that cases associated with Subtype A showed a stronger transcriptional response with 127 differentially expressed genes (adjusted p < 0.05, absolute log fold change > 0.5) when compared with controls. Among the most significantly down-regulated genes associated with Subtype A cases was the stress-response mediator corticotropin-releasing hormone (CRH; Fig 6A). Overacting CRH signaling has been implicated in inflammatory disorders and LOAD where it has been proposed as a therapeutic target to reduce the negative effects of chronic stress related to memory function and amyloid beta (Aβ) production [38]. Cases associated with Subtype B had 40 differentially expressed genes (FDR adjusted p < 0.05, absolute log fold change > 0.5), 39 of which were down-regulated when compared to controls. Notably, we found that two key proinflammatory mediators of amyloid deposition (S100A8, S100A9) were among the most significantly down-regulated genes in Subtype B decedents when compared to controls (Fig 6A). Both genes, which are established inflammatory biomarkers, are part of a complex that serves as a critical link between the amyloid cascade and inflammatory events in LOAD [39]. Furthermore, we identified multiple pathways linked to S100A8/9 activation, including IL-10 signaling and complement activation, that were enriched across down-regulated genes in Subtype B but not in Subtype A (Fig 6C). In addition, molecular pathways linked to microglia activation, the immune response, and the stress response were found among the most significant pathways and gene sets that differ across subtypes (S8 Fig, S14 Table). This highlights that our LOAD subtypes differ in their inflammatory response and that known LOAD biomarkers might be used to stratify patients based upon their inflammatory response to the observed disease state. The same analysis in the Mayo and MSBB cohorts revealed that the corresponding subtypes can also be distinguished based on their inflammatory response (S10 Fig). However, the signal derived from the molecular pathway expression profiles in both the Mayo and MSBB cohorts is not as strong as in the ROSMAP cohort, which is likely due to the smaller sample size and

PLOS GENETICS
differences in population structure across both cohorts. Although inflammatory markers were the most differentially expressed, the subtypes are characterized by diametric associations with the eigengenes of multiple submodules annotated for different pathological processes (Fig 4D). Thus, the detection of differentially expressed inflammatory markers between subtypes is likely because inflammation is the strongest post-mortem signal present in transcriptomic data.

Discussion
Common, complex diseases such as LOAD are characterized by phenotypic heterogeneity and the presence of multiple common variants affecting disease risk. In this study, we present an analysis that uses transcriptomic co-expression data and whole-genome sequencing from multiple cohorts to dissect phenotypic heterogeneity and identify potential genetic drivers of complex trait pathology in LOAD.
Here, we used an iterative pruning approach based on 26 human post-mortem co-expression modules to generate 68 novel submodules that contained genes associated with LOAD specific biological pathways and molecular processes. Indeed, we observed that genes in the novel submodules are enriched for functional terms that were specific to pathways associated with LOAD, such as lipid modification, the TREM2/TYROBP signaling axis, and tau-protein kinase activity. Furthermore, submodules from distinct brain regions clustered independently, suggesting that the genes captured in each submodule represented signals that were associated with LOAD pathology rather than cohort-or tissue-specific factors. Notably, our pruning approach identified submodules which were much more specific for markers of different brain cell types when compared to the initial co-expression modules. This is in line with recent studies showing that different cell types in the brain play specific roles at different stages in the pathogenesis of LOAD [40]. Taken together, our results demonstrate that the novel human co-expression submodules identified in this study capture cell-type-specific pathways associated with LOAD pathogenesis in the brain.
Mapping the eigengene expression for individual submodules represents a pathway-or process-level alternative to expression quantitative trait locus (eQTL) mapping for each individual transcript. Since the human co-expression submodules represented pathological, cell-type-specific pathways in LOAD brain tissue, mapping eigengene expression for decedents was expected to identify genetic drivers of LOAD pathology. RNA-Seq data from post-mortem brain tissue in human cohorts contains a strong immune signal, as evidenced by repeated identification of genetic loci related to microglial response in meta-analyses with increasingly large cohorts [5,41]. Using submodule eigengenes as quantitative traits for single-variant association provided an opportunity to identify genetic drivers of biological processes that are known to be drivers of early LOAD pathogenesis, such as astrogliosis, neuronal plasticity, myelination, and vascular blood brain barrier interactions [40]. Suggestive variants identified were unique to subsets of submodules. For instance, the TMEM106B locus was associated at a genome-wide significant level with the DLPFCbrown_1 and DLPFCyellow_2 eigengenes (Fig  3), representing processes related to oligodendrocytic myelination, lysosomal activity, endocytosis, and potassium channel activity. This novel association between protective variants in TMEM106B with molecular LOAD endophenotypes linked to lysosomal and myelination dysfunction is potentially of great interest. The TMEM106B locus has been implicated in cognitive aging, with functional consequences in frontotemporal dementia related to lysosomal activity [21][22][23]. A recent transcriptome study implicated protective TMEM106B variants in differences in neuronal proportions across LOAD patients, supporting the idea that impaired lysosomal function may lead to a toxic buildup of waste in the cell, a common process among many neurodegenerative disorders [42]. Therefore, the presence of TMEM106B variants in combination with other risk factors might alter the course and severity of neurodegeneration across patient subtypes. Furthermore, we identified multiple loci associated with a module linked to microglia function (DLPFCblue_3). This co-expression module contains members of the TREM2/TYROBP signaling pathway, an important mediator of neuroinflammation. Variants in the FAM110A (rs1014897), the CNTNAP5 (rs76854344) and NTM (rs1040103) genes associated with this inflammatory module have been previously linked to posterior cortical atrophy, LOAD [43], and white blood cell count [44]. Taken together, we show that quantitative trait mapping using submodule eigengene expression can identify novel genetic variants impacting relevant disease pathways.
Eigengenes represent a dimensional reduction of transcriptomic data onto axes of pathological relevance. Thus, we expected that clustering on the eigengene expression of LOAD cases would generate pathway-level profiles of putative molecular LOAD subtypes based on case heterogeneity. We observed that average eigengene expression of different subtypes was enriched for different submodules in the four brain regions for which subtype analysis was performed, an example of which is presented for the DLPFC region in Fig 4. Similar diametric enrichment patterns were identified in the remaining brain regions (S10 Fig). These results suggest that the biological programs identified by submodules in this study align themselves along the heterogeneity of transcriptomic data present in LOAD cases across multiple cohorts rather than differentiating solely based on cases and controls. Furthermore, the stratification of patients based on submodule expression profiles demonstrated that there is significant variation in immune response in post-mortem brain tissue (Fig 6, S7 Fig), a process that is considered a hallmark of LOAD pathogenesis. Variants associated with the subtype specificity metric overlapped with the variants associated with individual submodule eigengenes (Fig 5). This suggests that the genetic factors influencing subtypes can be dissected into loci driving specific submodules. Such dissection of genetic loci can provide the basis for more targeted treatment of dysfunctional pathways that contribute to different subtypes of LOAD.
Our subtypes in the DLPFC brain region of the ROSMAP cohort represent differences in transcriptomic profiles of LOAD cases derived from post-mortem RNA-Seq data. A lack of temporal data makes it challenging to decisively interpret these profiles derived from postmortem brain samples. The identified subtypes may represent distinct LOAD endpoints, differences in disease severity, environmental effects, or phases of molecular pathology. Neither of our novel subtypes was associated with cognitive or neuropathological outcome (S5 Fig). Furthermore, covariates such as sex, APOE genotype, and years of education were not significantly enriched in any given subtype (Fig 4C). This suggests that the transcriptomic profiles do not represent transitions in disease severity and that there are overall risk factors not reflected in transcriptomic subtypes. Furthermore, both subtypes are associated with unique loci that belong to the same community of loci detected by submodule mapping (Fig 5), indicating that the subtypes capture various combinations of genetic elements that lead to LOAD pathology. While suggestive, these transcriptomic LOAD subtypes will require further validation in cohorts that adequately account for disease progression.
The methodology presented in this study is not limited to RNA-Seq data and can be performed on other omics datasets, such as proteomics or metabolomics. As such data become available for the decedents in these cohorts, this analysis can be expanded across these additional informative dimensions.

Whole genome sequencing and RNA sequencing data
We obtained whole-genome sequencing and RNA-Seq data from Synapse (https://www. synapse.org/) for three cohorts from the AMP-AD consortium: the Mayo Clinic, Mount Sinai Brain Bank, and Rush University. The Mayo Clinic (Mayo) cohort consists of 276 temporal cortex (TCX) samples from 312 North American Caucasian subjects consisting of cases characterized with LOAD, pathological aging (PA), progressive supranuclear palsy (PSP), or elderly controls [13] (doi.org/10.7303/syn5550404). The Mount Sinai Brain Bank (MSBB) cohort consists of 214 frontopolar prefrontal cortex (FP), 187 inferior temporal gyrus (IFG), 160 parahippocampal gyrus (PHG), and 187 superior temporal gyrus (STG) samples characterized with LOAD, elderly control, or mild cognitive impairment (MCI) (doi.org/10.7303/syn3159438). The Rush University's Religious Orders Study and Memory and Aging Project (ROSMAP) cohort consists of 623 dorsolateral prefrontal cortex (DLPFC) samples of individuals from 40 groups of religious orders from across the United States (ROS) and older adults in retirement communities in the Chicago area (MAP), characterized with LOAD, elderly control, or MCI [7,45] (doi.org/10.7303/syn3219045). A summary of samples from each of the cohorts is provided in S1 and S2 Tables. Sex, age of death, and batch were used as covariates for normalization in the ROSMAP and Mayo data. Sex, age of death, race, and batch were used as covariates for normalization in the MSBB data.

Co-expression modules and iterativeWGCNA
Data on human AMP-AD co-expression modules were obtained for the three cohorts from Synapse (doi.org/10.7303/syn11932957.1). A recent study has identified these modules as part of a transcriptome wide LOAD meta-analysis [15]. In brief, a modified procedure using five different co-expression analysis protocols followed by graph clustering methods was performed to obtain 30 modules across all three cohorts (doi.org/10.7303/syn2580853), 26 of which corresponded to the six tissue regions used in this study. A summary of these modules is provided in S3 Table. We focused on tissues from the frontal cortex, temporal cortex, and hippocampus due to their relevance to LOAD neuropathology [19]. These modules are generally large, containing thousands of genes that represent multiple functions [15]. In order to construct more functionally-specific submodules from these AMP-AD co-expression modules, we subjected them to a repeated pruning process called iterativeWGCNA [16], which includes performing WGCNA on each AMP-AD co-expression module independently. The gene sets produced by this process were then pruned to ensure that only highly correlated genes remained by evaluating the connectivity of the genes to the gene set eigengene. The resulting gene sets, containing highly correlated genes, were combined and the process was repeated until the gene sets converged. The algorithm then attempted to reclassify genes from the residual gene set into submodules. We specified a soft-threshold power of six, a minimum eigengene connectivity of 0.6, and a required module size of 100 to promote the generation of submodules that capture pathway-level signals. The final set of 68 submodules consisted of highly correlated and cell-type-specific genes. The submodules were mutually exclusive for a given cohort but overlapped with submodules from other cohorts. A summary of these submodules is provided in S4 Table. An eigengene for a given submodule is defined as the first principle component of gene expression data within each submodule.

Stratification of LOAD cases based on clustering of human co-expression submodules
Eigengene expression data for TCX, PHG, FP, and DLPFC regions was used to stratify LOAD cases into subtypes based on each brain region separately. We used the NbClust R package to determine the optimal number of clusters across different clustering methods by polling with the majority rule across 30 indices [20]. We tested agglomerative hierarchical approaches (Ward, UPGMA, WPGMA) and a reallocation approach (K-means) on the eigengene expression data and evaluated the within-cluster similarity of cases using silhouettes. The silhouette score of a given object (data point) is a measure that simultaneously assesses how similar this object is to its cluster and how different it is from all the other clusters [46]. A simulation study suggests that no one clustering method outperforms the other consistently and that mean silhouette widths can be used to pick the ideal clustering method and compare clustering across datasets [47]. The silhouette plots revealed that different methods were required for the different regions to generate clusters with the largest average silhouette widths. We determined that K-means was an optimal approach for DLPFC, Ward was optimal for PHG and TCX, and UPGMA was optimal for FP after analyzing silhouette plots of clusters generated by each method for each region. An example of silhouettes used to determine the ideal clustering method for the DLPFC region is shown in Fig 4. A summary of the clusters for each brain region, considered case subtypes, is provided in S10 Table. In the subtypes generated for the DLPFC region from the ROSMAP cohort, we assessed each subtype for enrichment of cognitive and pathological measures. We used Braak stages as a measure of neurofibrillary tangle burden and CERAD scores as a measure of neuritic plaque burden [48,49]. We also assessed the rate of decline in memory, executive function, visuospatial function, and language across the subtypes. Definitions, collection, and standardization of these decline measures can be found in previously published work [50].

Differential expression analysis of case subtypes
For differential expression analysis, control decedents were defined as cognitively normal and MCI decedents for PHG, FP, and DLPFC. In the case of TCX, control decedents included cognitively normal, PSP, and PA decedents. For each of the regions used to stratify LOAD cases (TCX, PHG, FP, and DLPFC), we performed differential expression analysis to compare gene expression in LOAD subtypes with control decedents as described above. We repeated this analysis excluding MCI, PSP, and PA decedents from the control group and got essentially the same results (S9 Fig). We used the limma R package to perform the differential expression analysis between subtype and control decedents [51]. We used the clusterProfiler R package to perform KEGG and Reactome pathway analysis on differentially expressed genes to determine the signal captured by clustering on eigengene expression data [52].

Single-variant association of eigengene expression and subtype specificity
We used EMMAX [18], a variance component linear mixed model, to perform single-variant association, using each submodule eigengene as a quantitative trait. In addition, we developed a subtype specificity metric for each brain region by calculating the Euclidean distance between the eigengene expression profile of each decedent and the centroid of each subtype cluster. This resulted in a vector of scores for each subtype that was mapped as a separate trait. All quantitative trait mapping results had a genomic inflation factor near one, indicating that there was no significant population substructure effect on the mapping. QQ plot analysis on the p-values showed no evidence of population substructure or confounding effects (S4 Fig).

Replication of suggestive and significant SNPs in other cohorts
The ROSMAP cohort represented the most adequately powered cohort in the study and was therefore used as our baseline, while the other cohorts were utilized for assessing replication of suggestive and significant SNPs. SNPs were considered suggestive with a p-value smaller than 1×10 −5 and genome-wide significant with a p-value smaller than 5×10 −8 , which are standard cutoffs for GWAS. Suggestive and significant SNPs from the DLPFC region in ROSMAP were considered replicated in the TCX, FP, and PHG regions if the SNPs were associated with the submodule eigengenes or subtype specificity metric of the given region at a p-value of 0.05. In addition, we compared the ROSMAP loci to prior association studies using summary statistics obtained from the NHGRI-EBI catalog [53]. Loci were considered replicated in this case if suggestive and significant SNPs from the ROSMAP cohort were reported in these studies at a pvalue smaller than 5×10 −8 (Fig 1C and phase 3 in S1 Fig).

Network of loci and associated quantitative phenotypes in ROSMAP
We built a directed network of quantitative phenotypes and associated loci to better visualize the communities of loci that were associated with our newly derived quantitative phenotypes in the ROSMAP cohort (the subtype specificity metric and submodule eigengenes), including both suggestive and significant loci. We included in this network results of single-variant association of diagnostic criteria for other relevant traits, including module eigengenes, Braak stage, CERAD scores, cognitive diagnosis, and case-control diagnosis (using EMMAX [18] as for the other traits). The network was built in Cytoscape version 3.7 (https://cytoscape.org/) [54] and the nodes were organized using the "Circular Layout" option. The color of the edge was used to distinguish the type of association (red for diagnostic criteria, blue for subtype, yellow for module, and green for submodule).   Table. Significant SNP associations from FP region analyses. Significant SNPs that were associated at a genome-wide suggestive level with either a submodule eigengene or the subtype specificity metric are reported. RefSNP IDs are provided if available for the positions, which are aligned to the hg19 human genome build (attached Excel workbook). (XLSX) S9 Table. Significant SNP associations from DLPFC region analyses. Significant SNPs that were associated at a genome-wide suggestive level with either a submodule eigengene or the subtype specificity metric are reported. RefSNP IDs are provided if available for the positions, which are aligned to the hg19 human genome build (attached Excel workbook). (XLSX) S10 Table. LOAD case subtypes for selected brain regions. Subtypes were generated for the TCX, FP, PHG, and DLPFC regions. 3 clusters were generated for TCX and 2 clusters were generated for the rest. The number of cases in each subtype are reported. (XLSX) S11 Table. Comparison of clustering algorithms in ROSMAP, MSBB, and Mayo. Assignment of decedents based on different clustering algorithms was compared using a Pearson's Chi-squared test with Yates' continuity correction in R. The assignment of decedents was comparable across the four algorithms tested in all three cohorts (attached Excel workbook). (XLSX) S12 Table. Genome-Wide suggestive SNPs in DLPFC Replicated in TCX, FP, and PHG. SNPs that were found to be genome-wide suggestive in the DLPFC analyses were assessed for replication in the analyses run for the TCX, FP, and PHG regions. A p-value cutoff of 0.05 was used for the TCX, FP, and PHG regions. The analysis that generated the highest p-value for the SNP in each region are reported, along with the p-values from each. (attached Excel workbook). (XLSX) S13 Table. Genome-Wide suggestive SNPs in DLPFC Replicated in the NHGRI-EBI Catalog. Genome-wide suggestive SNPs from the DLPFC region were assessed for replication in the summary SNPs provided by the NHGRI-EBI catalog (attached Excel workbook). (XLSX) S14 Table. KEGG and reactome pathway annotations of differentially expressed genes in ROSMAP subtypes. Enrichment of KEGG pathway annotations was assessed for differentially expressed genes between controls and each subtype using the clusterProfiler R package. Enrichment of Reactome pathway annotations was similarly assessed using the ReactomePA R package. Pathways and associated scores are reported (attached Excel workbook). Data used in this publication are available via the AD Knowledge Portal (https:// adknowledgeportal.synapse.org). The AD Knowledge Portal is a platform for accessing data, analyses, and tools generated by the Accelerating Medicines Partnership (AMP-AD) Target Discovery Program and other National Institute on Aging (NIA)-supported programs to enable open-science practices and accelerate translational learning. The data, analyses and tools are shared early in the research cycle without a publication embargo on secondary use. Data is available for general research use according to the following requirements for data access and data attribution (https://adknowledgeportal.synapse.org/DataAccess/Instructions).