A transcriptome-based approach to identify functional modules within and across primary human immune cells

Genome-wide transcriptomic analyses have provided valuable insight into fundamental biology and disease pathophysiology. Many studies have taken advantage of the correlation in the expression patterns of the transcriptome to infer a potential biologic function of uncharacterized genes, and multiple groups have examined the relationship between co-expression, co-regulation, and gene function on a broader scale. Given the unique characteristics of immune cells circulating in the blood, we were interested in determining whether it was possible to identify functional co-expression modules in human immune cells. Specifically, we sequenced the transcriptome of nine immune cell types from peripheral blood cells of healthy donors and, using a combination of global and targeted analyses of genes within co-expression modules, we were able to determine functions for these modules that were cell lineage-specific or shared among multiple cell lineages. In addition, our analyses identified transcription factors likely important for immune cell lineage commitment and/or maintenance.


Introduction
The human immune system protects us from microbes (bacteria, viruses, fungi, and parasites) that penetrate the physical and chemical barriers of the body. It fulfills surveillance functions in order to detect and eliminate aberrant cells that result from infection, cancer, and senescence. In contrast, the immune system is also at the heart of pathogenic syndromes and chronic diseases that involve virtually all organ systems (e.g. cardiovascular, gastrointestinal, respiratory, musculoskeletal), and are associated with significant morbidity and mortality [1][2][3][4][5].
Given the central importance of the immune system in human health and disease, extensive work by numerous research groups over the past 100+ years has provided a detailed functional understanding of the cells and organs within the immune system. This includes how the a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 multiple different cell populations, which make up the complex immune system, differentiate from a common pluripotent progenitor cell, how different cells communicate through an extensive system of cytokines, chemokines, and associated receptors, and how signaling pathways can regulate cell response to various stimuli [6][7][8]. This functional understanding has been made possible by a combination of human and animal model systems, sensitive and specific reagents, such as antibodies directed against cell surface and intracellular proteins, and the development of technological platforms to isolate and study-specific cell populations based on multi-parameter flow cytometry, fluorescence-activated cell sorting, and mass cytometry [5,9].
Genome-wide transcriptomic-based approaches have been instrumental in the identification of molecular classifiers of leukemias, interferon and granulopoiesis signatures in the blood of autoimmune patients, cytokine response pathways of T cells in psoriasis and the activation of macrophages, to name a few [10][11][12]. Transcriptomic analyses have also been employed to perform surveys of the different cell populations within the myeloid and lymphoid lineages of the mouse immune system in different contexts (e.g. age, differentiation and activation states, tissue location) [13,14]. More recently, the use of single-cell RNA sequencing to evaluate the transcriptome has enabled the identification of previously unrecognized rare cell subsets of the human immune system [15][16][17][18]. Together, these studies suggest that genome-wide transcriptomics is an important tool for studying how the human immune system responds to different conditions (e.g. stimuli, genotypes, health status, response to therapy).
The identification of gene sets that robustly represent different conditions has the potential to become powerful biomarkers, even without fully understanding the impact of the observed transcriptomic changes. The functional interpretation of changes in transcriptome between different conditions, however, remains an important challenge. A common approach to addressing this problem is to first perform gene co-expression analyses, for identifying gene sets that potentially are involved in common biological functions, followed by annotationbased analyses (e.g. enrichment of annotation terms, guilt-by-association) for attributing specific functions to the observed gene sets [19][20][21]. The success of this approach depends on numerous factors such as the complexity of the samples being tested, the true extent of the differences in the transcriptomes between the different conditions, the degree of experimental variability in the collection, processing, and sequencing of the samples [19][20][21].
In the current study, we were interested in testing this approach in the context of circulating human immune cells because of the well-characterized and distinctive nature of the different immune cells in terms of cell-specific markers (and thus genes expressed) and functions, as well as the relative ease by which highly purified immune cell populations can be isolated from peripheral blood. Specifically, we analyzed the transcriptomes of nine distinct immune cell populations from the peripheral blood of twelve healthy volunteers using a combination of global and targeted approaches to functionally annotate modules of co-expressed genes. We were able to attribute functions to most of the modules that were either cell population-specific or shared across more than one cell population. In addition, we identified TFs associated with lymphoid vs. myeloid lineages, T and B cell fates, many of which were previously known to be involved in lineage differentiation, as well as others that we propose as new candidates for having a role in lineage commitment and/or lineage maintenance.

Subject selection criteria, ethics committee approval, and blood sample procedures
Volunteers were then solicited via postings across the Montreal Heart Institute (MHI) and directed to the MHI Biobank Center for screening, informed consent form completion, and venipuncture. To be selected, the following criteria were used: (1) adult males with no history of (or ongoing) chronic inflammatory disease; (2) no current use of anticoagulants; (3) no use of oral steroids or cyclosporine in previous four weeks; (4) no use of NSAIDs in previous three days. After signing the informed consent form, volunteers were assigned a randomly generated ID number by the MHI Biobank Center; they were then called for blood withdrawal according to the availability of the cell sorter platform. The assigned ID number was provided with the blood samples to ensure anonymity. We recruited 12 individuals over a period of nine months. Researchers can request access to the individual-level data from the current study by contacting the Montreal Heart Institute ethics committee at the following institutional email address: cer.icm@icm-mhi.org.

Blood collection and cellular isolation
For lymphocytes and monocytes isolation, blood was collected in Vacutainer CPT Mononuclear Cell Preparation Tube-Sodium Citrate (BD Biosciences, San Jose, CA). Centrifugation was performed according to the manufacturer's instructions. Mononuclear cells were collected and washed three times in saline buffer, then filtered through a 70um mesh cell strainer. The cells were further purified with anti-CD14 coated microbeads using LS columns and the Quad-roMACS separator stand (Miltenyi, Cambridge, MA), according to the manufacturer's instructions. Non-specific binding of antibodies was prevented on both CD14+ and CD14enriched fractions by using FcR blocking reagent (Miltenyi). A portion of each fraction was used for immunophenotyping.
For neutrophil isolation, blood was collected in Vacutainer1 EDTA collection tubes (BD Biosciences). CD15+ fraction was enriched with StraightFrom™ Whole Blood CD15 MicroBeads using Whole Blood columns with the QuadroMACS separator (Miltenyi), according to the manufacturer's instructions. Non-specific binding of antibodies was prevented on the enriched CD15+ fraction by using FcR blocking reagent (Miltenyi). Cells were stained for CD14 (clone M5E2) and CD16 (clone 3G8) (Biolegend). Neutrophil population (CD14-CD16+) was sorted on a FACSAria™ III cell sorter (BD Biosciences) based on cell size, granularity, doublet exclusion, live/dead marker and surface expression of targeted molecules. Purity was > 99%.

Macrophage differentiation and activation
CD14+ enriched mononuclear cells were frozen in the vapor phase of liquid nitrogen until the day of culture. After thawing, the cells were cultured at 37˚C 5% CO 2 in RPMI 1640 medium with GlutaMAX™, 10% heat-inactivated fetal bovine serum, 100 units/mL penicillin, 100 μg/ mL streptomycin (ThermoFisher Scientific, Mississauga, Canada) and 50ng/ml M-CSF (Millipore, Etobicoke, Canada) for a total of 8 days, with media renewal every 2 days, to obtain macrophages. Activated macrophages were also generated by stimulating the macrophage cultures with 1ug/ml LPS (Sigma, Oakville, Canada) during the last 24h of culture.

Immunophenotyping
Following FcR blocking reagent (Miltenyi) step, CD14+ and CD14-enriched mononuclear cells were stained with antibodies described in S1 Table. In vitro generated macrophages were blocked with FcR blocking reagent (Miltenyi) and stained with antibodies listed in S2 Table. Cells were analyzed on a BD™ LSR II flow cytometer (BD Biosciences). Final immunophenotyping results were generated using FlowJo software (FlowJo LLC, Ashland, OR) according to size, granularity, live/dead marker and surface expression of targeted molecules. Results are summarized in S1 Fig.

RNA extraction and sequencing
The total number of samples obtained for each population from each individual is summarized in S3 Table. Total RNA from all cell samples was isolated using a Qiagen RNeasy Plus Mini kit (Qiagen, Toronto, Canada) with an additional step of RNase-free DNase set (Qiagen) DNaseI digestion; quantification, as well as quality control, were performed on Agilent BioAnalyzer 2100 (Agilent, Santa Clara, CA) using an Agilent RNA 6000 Nano Kit, and aliquots for sequencing were prepared and kept at -80˚C until further processing.
RNA sequencing was performed at McGill/Genome-Québec Innovation Center. Total RNA was extracted and all, but one sample had RNA integrity number (RIN) greater than 8 (RIN ranged between 8.4 and 10; the one sample with RIN<8 was excluded). RNA samples were transformed into barcoded DNA libraries using Truseq Stranded mRNA library preparation kits (Illumina, San Diego, CA). Paired-end sequencing, generating 2x100bp reads, was performed on an Illumina HiSeq2000 with raw FASTQ sequences downloaded from the platform's server for local pre-processing and analysis.

Bioinformatic processing of sequence files
Raw FASTQ formatted sequence files were collected from the McGill/Genome-Québec Innovation Center sequence service. Primary QC analysis was performed using FastQC (v0.10.1) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc). Following a manual inspection of all FastQC result files to check for low-quality samples, raw sequence files were trimmed of low-quality bases from reads and removing low-quality reads altogether as well as removing potential adapter contamination with Trimmomatic (v0.33) [22] using a modified adapter FASTA file with the following parameters: ILLUMINACLIP with seed mismatches set to 2, palindrome clip threshold set to 30 and simple clip threshold set to 15, TRAILING set to 20 and MINLEN set to 50. Quality-and adapter-trimmed FASTQ files were inspected again with FastQC to confirm global quality.
Alignments were performed with TopHat 2 (v2.1.0) [23] using iGenome index and annotation files built from build 38 of the human genome (Illumina). Subsequent filtering of the aligned reads was done using SAMTools (v1.6) [24] to keep only properly-paired and singlemapped reads in the final alignments. Summary statistics from all steps for each cell type are shown in S3 Table. For gene expression analysis, R/Bioconductor was used with a variety of additional analysis modules. Alignment files were processed with easyRNASeq (v1. 8.8) [25] to measure gene abundance and create the gene expression matrix of raw read counts that was further processed using edgeR (v3.12.1) [26, 27] and Limma (v3.26.9) [28]. edgeR package was used to generate the expression count matrix and annotations with DGEList function and to filter out non-and low-expressed genes based on a cpm threshold. edgeR was also used to obtain per sample normalization factors. The function voom from the Limma package was used to normalize data from gene expression for library sizes scale with TMM factors and then to stabilize variance in the data. The various cell types collected in the study have expression profiles (e.g. neutrophils) with too much dissimilarity to allow for a normalization approach based on the hypothesis of relatively similar expression, for example, the normalization by quantile (S2 Fig) [29].

Principal component and hierarchical clustering analyses
Analyses of transcriptomic data using Principal Component Analysis (PCA) and hierarchical clustering approaches were performed to provide a global view of the transcriptomic data. These approaches enabled us to identify similarities between samples and structure in the dataset. This allowed a first assessment of the underlying principal sources of variation. After filtration and normalization, principal components were computed on all detectable transcripts with prcomp in R thus representing multiplicative differences between samples. The first few PCs [30-32] were investigated because they represented large axes of variation, using graphical display (Fig 1). In complementary, unsupervised hierarchical clustering was used to investigate the grouping of all samples. For this analysis on the entire normalized transcriptome, Pearson correlation was chosen to evaluate the distance between samples with R base functions, cor, and dist. Hierarchical clustering was used to build our data dendrogram with hclust R function (Fig 2A). Finally, ANOVA was used to determine the percent of gene expression variance explained by cell types. The expression for each gene was centered and scaled to a mean and variance of respectively 0 and 1. Analysis of variance was performed on each gene to obtain the sum of squares (SS) explained by cell types. Multivariate variance explained was estimated by adding explained SS from all genes and then dividing by the total SS. from our nine primary immune cell types. The objective of this tool is to cluster genes based on similar expression profiles between samples [33,34]. For building co-expression networks with WGCNA, we choose a signed topology to compute gene co-expression similarity. Signed topology makes it easier for interpretation purposes or relevant biology, such as making distinction gene repression and activation [19]. A connectivity matrix between genes was then computed that describes how strongly genes are connected to all others. For that purpose, a "soft thresholding" parameter β must be chosen [34]. This parameter is an exponent to the correlation matrix that determines the emphasis put on higher vs lower correlations. The pickSoft-Threshold function was used to help in choosing β. After analysis of network topology for various β values, a value of 12 was selected as a good trade-off between scale-free topology and connectivity (S3 Fig). Therefore, to compute dissimilarities between genes, we used the WGCNA functions adjacency and TomsimilarityFromExpr (Topology Overlap Matrix Similarity From Expression) with the following parameters: Pearson correlation, signed topology and a β of 12. Based on the TOM dissimilarity measure, hclust function was used to build a hierarchical clustering of the genes. Finally, cutreeDynamic function with minClusterSize and deepSplit parameters to 20 and 2, respectively [35], was used to cut the hierarchical tree into gene modules (S4 Fig). To merge closely related modules among our 63 modules, MergeCloseModules function was applied with cutHeight parameter of 0.05. Each module was represented by its first principal component using the function moduleEigengenes. Correlation and corresponding p-values of association between a given module and a cell type were then evaluated from these principal components using cor and corPvalueStudent functions from the WGCNA package.

Global and targeted functional annotation
We used a combination of global and targeted analyses of gene annotations to identify potential biological functions represented in each module. For a global analysis of genes within each module, we used DAVID (Database for Annotation, Visualization, and Integrated Discovery) [36] to test for enrichment of functional annotations, specifically using the annotation from Gene Ontology (GO) terms for Biological Processes, Molecular Functions, and Cellular Components, Up-Keywords, Interpro and KEGG, [37][38][39][40]. In S4 Table, we report all annotations for which the enrichment was � two-fold and a P <0.05.
For the targeted analyses, we undertook a more detailed approach taking advantage of the information within GeneCards [41], PubMed [42] and Google Scholar [43] databases, using gene symbols as query parameters. This approach was applied to two gene lists per module; the first list consisted of the set of genes most highly expressed in each module, with the hypothesis that the highly expressed genes may provide insight to important functions within each module. Specifically, we examined the known functions of the top 2% expressing genes (corresponding to the 98 th percentile) which we called these the Top Expressing Genes (TEGs). The second list consisted of the genes that had the most important impact on the first principal component, regardless of their level of expression. Specifically, we ranked genes based on their contribution to the first principal component of the module in question with the objective of examining the known functions for the genes that accounted for 20% of PC1, or the top 20 genes if this exceeded 20 genes, we called these the Module Representative Genes (MRGs). For modules 22, 26, 38, 41, we also performed this detailed functional annotation to the entire set of genes within the modules in order to compare with the results obtained from these targeted analyses.

Analyses of transcription factors and transcription factor binding sites in B cells
Genes within modules were annotated as TFs based on the catalog of human TFs published by Lambert and colleagues [44]. In this resource, based on a collection of databases including TRANSFAC, HT-SELEX, UniProbe, and CisBP to identify human TFs and their binding site motifs [45-54], a TF was defined as a protein capable of a) binding DNA in a sequence-specific manner and b) regulating transcription. In total, they identified 1,639 human TFs, and binding motifs for two-thirds of these (see http://humantfs.ccbr.utoronto.ca/index.php) Transcriptional control of genes within each module was analyzed in the following fashion: (1) the transcription start site (TSS) was located for each gene using the GRCh38 Ensembl database using the biomaRt tool [55]; (2) promoter regions were then defined as -1000 to +500 base pairs around these TSS; while somewhat arbitrary, this is a commonly used definition of the proximal promoter [56,57]; and (3) these promoter regions were examined for the presence of TFBS as defined by ChIP-Seq data from the ENCODE (Encyclopedia of DNA Elements) Project [58]. As this latter data was referenced to the human genome build GRCh37, TFBSs were converted to GRCh38 version using liftOver [59]. This ENCODE dataset consisted of ChIP--Seq data for a total of 161 TFs in 80 immortalized cell lines. Of the nine immune cell populations within the current study, only the B lymphocyte population was represented in this dataset. For the modules associated with B lymphocytes, we restricted our analyses to the 76 TFs that were studied in the GM12878 immortalized B cell line. Enrichment of TFBS in the promoter regions within each of these modules was calculated with the hypergeometric distribution. Enrichment p-value threshold was 0.05.

Transcription factors potentially associated with cell lineages and states
We examined all 1600 human TFs [44] for expression patterns across the cell types that were consistent with having a role in cell fate decisions, independent of co-expression modules described above. Following the differentiation scheme depicted in S5 Fig, we considered each branch point like a cell fate decision and looked for TFs that appeared to be specific for a given fate (e.g. myeloid vs. lymphoid; neutrophil vs. monocyte; B cell vs. T cell; etc.). Several TFs were very well known to be implicated in cell fate decisions (e.g. PAX5 for B cell differentiation; EOMES for differentiation of cytotoxic cells; FOXP3 for the differentiation of CD4+ T cells). Based on the expression pattern of these well-known TFs involved in differentiation of immune cells, we defined the following algorithm for prioritizing TFs potentially having a role in cell fate decisions (S5 Fig and S5 Table): (1) cell types resulting from a cell fate decision (e.g. lymphoid) should have an average number of read counts for the given TF at least 15-fold greater than the average number of read counts for cells from the same branch point that do not have the same fate (e.g. myeloid); (2) each cell type resulting from a cell fate decision should have an average number of read counts >100 for the given TF in order to reduce artifacts resulting from the greater variance at low read counts; (3) the cell type resulting from a cell fate decision with the lowest read count for the given TF should be at least four-fold greater than the cell type with the highest read count from the same branch point that does not have the same fate (e.g. IKZF3 in CD4+ T cells vs LPS-activated macrophage). Finally, we identify which of these candidate TFs are known to be implicated in associated cell fate decisions based on a search of the PubMed, GeneCards, Entrez, UniProt, SwissProt.

Data generation: Isolation of immune cells and RNA-Seq analysis
For the current study, we isolated neutrophils, B cells, CD4+ and CD8+ T cells, NK cells, γδ T cells and monocytes from the peripheral blood of 12 healthy volunteers. In addition, we differentiated in vitro the monocytes into mature macrophages, as well as stimulated these macrophages with LPS, thus generating a total of nine different cell populations. We obtained highly pure immune cell populations, and immunophenotyping analyses confirmed the representation of subpopulations that can be expected in the circulation of healthy adults (S1 Fig). A transcriptomic analysis of the total RNA extracted from these cells was performed using a paired-end sequencing approach, that resulted in a very good depth of coverage (2.48x10 7 read counts per sample on average; S3 Table), and thus can be expected to provide quantitative measures of transcript abundances even for less common transcripts [60]

Structure of transcriptomic data is a reflection of the known differentiation scheme for immune cell subsets
Prior to embarking upon in-depth analyses of this RNA-Seq data, we wanted to assess whether the patterns of gene expression could potentially be correlated to differences in functions between cell populations. Principal component analysis (PCA) revealed a clear separation of the cell types on the first two components, representing the nine populations into four very distinct clusters (neutrophils, lymphocytes, monocytes, and macrophages), with all individuals clustering tightly within each cell population (Fig 1A). This is concordant with the observation that 86% of the total multivariate variance is between the cell types and the first two components capture a large proportion of the total variance (43% and 25% respectively). We also applied a similar PCA only on lymphoid cell types, which resulted in five distinct clusters representing the five different lymphoid populations: B cells, CD4+ and CD8+ T cells, NK cells and γδ T cells (Fig 1B). Hence gene expression in our data set clearly represents more differences between immune cell populations than between individuals.
Using an unsupervised hierarchical clustering we found that the dendrogram reflected the differentiation scheme of pluripotent stem cells into myeloid and lymphoid cell types, except for B and NK cells, which were interchanged in our hierarchical clustering (Fig 2). Given these observations, we propose that there is a great potential to identify differences in gene expression between immune cell populations that will reflect the functional biology of these populations.

Defining potential functional groups using a combination of gene coexpression and functional annotation
Knowing that there is a significant correlation structure in the transcriptome captured by the RNA expression data sets, and those co-expressed genes are often related functionally [61], we hypothesized that defining modules of co-expressed genes would be a relevant starting point for exploring potential functional units relevant to these immune cells. We, therefore, used a weighted gene co-expression network analysis (WGCNA) approach to define modules of coexpressed genes [33]. Using this approach, we obtained 45 modules with an average of 318 genes per module (ranging from 26 to 1,945 genes; Table 1). As can be seen in Fig 3, each module could be associated with one or more immune cell populations. For example, modules 22 and 38 were highly associated with B cells (r = 1.00, P = 1.97 x10 -106 ; r = 0.74, P = 8.05x10 -
To gain a better understanding of the biology underlying these co-expression modules that were highly associated with one or more specific cell types, we used a combination of global and targeted analyses of gene annotations and identified potential biological functions represented in each module. For the global analysis, we tested for the enrichment of annotation terms within the Gene Ontology (GO), UniProt, InterPro and KEGG databases [37,40]. For the targeted analysis, we undertook two independent approaches. The first was to identify the set of genes most highly expressed in each module, with the hypothesis that the highly expressed genes may provide insight into important functions within each module. Specifically, we examined the known functions of the genes that are within the top 2% of genes expressed within the cell types associated with a given module. We call these the Top Expressing Genes (TEGs). The second targeted approach entailed the identification of the genes with the greatest impact on the first principal component for each module, with the hypothesis that these genes better represent the modules, regardless of the expression level. Specifically, we examined the known functions of the genes that can account for 20% of the variance captured by the first principal component; We call these the Module Representative Genes (MRGs). To assess whether these approaches could identify functions that are relevant to the broader set of genes within each module, we also examined the known function for the entire set of genes within modules 22, 38 and 41.

B cell-specific modules; cell activation and BCR engagements
As mentioned, modules 22 and 38, are highly associated with B cells (Fig 3) and contain 195 and 45 genes, respectively ( Table 1). In our global analysis of module 22, there were eight annotation terms that were below 5% False Discovery Rate (FDR), and included "B-cell activation", "Primary Immunodeficiency", and "Immunoglobulin domain", with a total of 17 shared genes being associated with one or more of the annotation terms mentioned previously (S4 Table). For module 38, 16 annotation terms with P � 0.05 were detected, but one was below 5% FDR, and none appeared to indicate functions specific to B cells (S4 Table).
Regarding TEGs in B cells, there were 12 TEGs within module 22 and three TEGs within module 38 (S6 Table). Querying of public databases (GeneCards, PubMed and Google Scholar) for the known functions of the TEGs within module 22 revealed that many are involved in B cell receptor (BCR) structure and/or signaling (e.g. IGLL5, CD79A, BANK1, MS4A1, FCRL1) [62]. There were also two genes encoding transcription factors (TF), POU2AF1 and PAX5, associated with B cell biology [63,64]. Interestingly, PAX5 also regulates the expression of EBF1, a gene within module 38, a TF crucial for B cell differentiation [65]. In the case of module 38, two of the three TEGs have known functions in B cells: CD22 inhibits BCR activation and FCRLA is involved in immunoglobulin assembly [66,67] (Fig 4).
Regarding the MRGs, there were 28 within module 22 and seven within module 38 (S6 Table), with some overlap between the MRG and TEG lists: the IGJ gene (alias JCHAIN) encoding the immunoglobulin J chain in module 22, as well as the CD22 and FCRLA genes in module 38 (S6 Table). Among the MRGs from module 22 are the TNFRSF13B (alias B cellactivating factor, BAFF) and TNFRSF17 (alias B Cell Maturation Antigen, BCMA) genes, which play an important role in B cell activation. The remainder of these MRGs is poorly-characterized genes or non-protein coding RNAs (9 of 28). Of the poorly characterized genes, some are expressed in B cells (e.g. HTR3A [68]) or involved in B cell development/maturation (e.g. KLHL14 and VPREB3 [69,70]). Others, such as CHL1 and TCL1A have no known links to B cell function and merit further functional studies. Specifically, CHL1 may be involved in cell  adhesion and migration, as these functions are ascribed to its homolog L1CAM [71] and TCL1A is involved in AKT activation, which is important for BCR signaling [72]. There were five MRGs from module 38 that were also not TEGs: BLNK, DENND5B, DIRAS1, RASGRF1, and LOC102724714. BLNK encodes a critical adaptor protein that bridges BCR-associated kinase activity with downstream signaling events [73,74]. DENND5B is a poorly characterized gene containing a MAP kinase activating death (MADD/DENN) domain. While the function of DIRAS1 and RASGRF1 are not well understood, it is likely that these RAS family proteins are involved in B cell signaling.

A module shared by B cells and monocytes; a gene expression program enabling antigen processing and presentation
While modules 22 and 38 are specifics to B cells, module 41 was strongly associated with both B cells and monocytes, two antigen-presenting cells ( Table 1). Global analysis of the 39 genes within this module revealed multiple annotation terms (N = 66) that were significantly enriched, with the three most significant (%FDR<10 −8 ) being "endosome", "MHC classes I/IIlike antigen recognition protein", and "MHC class II protein complex" (S4 Table). In fact, many of the annotation terms found to be enriched in this module can be explained by a core Regarding the genes that are most representative of module 41, there were six MRGs (CD74, SIDT2, FGD2, CD1A, MMP17, and HLA-DRA), three of which are included in the core set of genes described above (CD74, CD1A, and HLA-DRA). Additionally, SIDT2 and FGD2 contribute to intracellular trafficking relevant to antigen processing [77,78]. MMP17 is an endopeptidase that may be involved in the activation of membrane-bound precursors of growth factors or inflammatory mediators, such as TNF-α [79], which enhances antigen presentation. The combination of the global and targeted analyses clearly supports module 41 having a role in antigen presentation in both B cells and monocytes. Fig 3 shows that module 26 is associated with NK, γδ T and CD8+ T cells, strongly suggesting that it contains genes encoding for functions shared by these cell types ( Table 1). The global analyses of the 129 genes within this module identified multiple significantly enriched annotation terms, including "natural killer cell-mediated cytotoxicity", "cellular defense response" and "regulation of immune response", all with FDR<10 −7 (S4 Table). Genes enriched for these functions are located in the KIR locus (KIR2DL1, KIR2DL3, KIR3DL1, KIR3DL2, and KIR2DS4) within the human leukocyte receptor gene complex (LRC), in the NK gene complex (NKC) (KLRC1, KLRC2, KLRC3, KLRC4, KLRD1, KLRF1, and KLRK1), encoding proteins found in cytolytic granules (PRF1, GZMB, and GNLY) and cell surface proteins (CD160, NCR1, and FASLG) that mediate and/or potentiate cell-cell interaction. The primary functions that NK, γδ T, and CD8+ T cells have in common are the recognition and killing of abnormal cells (e.g. infected with viruses or other intracellular pathogens, cancerous), and the genes noted above are essential to these functions (Fig 5). It should be noted that all of the genes have equivalent expression levels across these three cell types, except for the KIR genes which had a differential expression pattern (NK > γδ T > CD8+ T).

A shared module associated with NK, γδ T, and CD8+ T cells; a role in cytotoxic granule and non-self-recognition
When examining the 18 TEGs in these cell types, many of the same genes are highlighted (GNLY, GZMB, KLRD1, KLRF1, KLRK1, and PRF1), although other TEGs, such as GZMA and GZMH, can also be attributed to cytolytic granules. NKG7 (alias GMP-17), another TEG, may also be contained within these granules in all three cell types, since it is found in the membrane of NK cell granules [80]. Many of the other TEGs within this module are involved in regulating the cell activation state, such as GPR56, an inhibitory receptor on NK cells [81], NCR1, that displays either a stimulatory or inhibitory effect depending on the ligand [82], and SH2D1B that regulates signal transduction from cell-surface SLAM receptors and leads to granule polarization to cell-cell synapses [83]. In addition, CD160 is potentially an activating receptor encoded outside the LRC and NKC complexes [84,85].
Also included in this list of TEGs are a number of genes whose functions are poorly characterized in NK cells, γδ and CD8+ T cells, including CST7, CTSW, and FGFBP2. CST7 encodes cystatin F, which may be important for the processing and activation of granule-associated serine proteases, in particular granzymes A and B [86]. CTSW encodes Cathepsin W, which is a putative cysteine protease that is believed to be released during target cell killing, although apparently not essential to the process of cytotoxicity [87]. FGFBP2 is highly expressed in NK cells, γδ T and CD8+ T cells and essentially absent in the other immune cell types studied herein, suggesting that it plays a specific function in these cells.
Regarding the genes that are representative of module 26, there were 21 MRGs, with three (GZMB, KLRD1, and S1PR5) also being in the list of TEGs (S6 Table). Five of the MRGs are encoded within the LRC or NKC complexes (KLRD1, KIR3DL1, KIR2DL3, KLRC3, and KLRC1). This MRG list also contained FASLG, important in T lymphocyte induced cell death, as well as B3GAT1 which encodes a key enzyme in the biosynthesis of the carbohydrate epitope HNK-1 (human natural killer-1, alias CD57). Moreover, this list contains the genes that encode the XCL1 and XCL2 cytokines-two ligands for the chemokine receptor XCR1. Although still poorly characterized, this chemokine likely mediates the chemotactic activity of cDC towards cytotoxic cells [88].
In terms of transcriptional control of genes within this module, it is interesting to note that DUSP2, a TEG of this module, encodes a dual-specificity phosphatase that negatively regulates the activity of the signal transducer and transcriptional activator STAT3, which in turn regulates the transcription of the KLRK1 gene within the NK gene complex [89,90]. The TF RUNX3 is also a TEG and regulates genes implicated in lymphocyte activation, proliferation, cytotoxicity, migration and cytokine production in CD8+ T and NK cells, like NCR1 and
Taken together, these analyses suggest that module 26 has two predominant functions, the first being cytotoxic granule composition, and the other being non-self-recognition, very consistent with the role of NK, γδ T, and CD8+ T cells.

Modules associated with LPS-activated macrophages
In addition to identifying co-regulated genes associated with specific cell types, we also explored our ability to use this experimental approach to examine the transcriptional impact of modifying the activation state of a cell. Specifically, we cultured CD14+ enriched mononuclear cells from human peripheral blood with M-CSF for a total of 8 days to obtain macrophages [93,94] and activated macrophages were generated by stimulating the macrophages with LPS during the last 24h of culture [95]. As can be seen in Fig 3, module 29 stands out as being highly associated with response to LPS cells ( Table 1). A global analysis of the 93 genes within this module identified 29 significantly enriched annotation terms (FDR<5%) that fell within two groups. The first group, including "cellular response to interferon-gamma", "cellular response to interleukin-1"and "cellular response to tumor necrosis factor", is driven by a common set of genes (CCL1, ASS1, CCL20, EDN1, CCL8, CCL19, IRG1 (alias ACOD1), CCL15 and CCL17), many of which are chemokines (CCL or CXCL genes) acting as chemoattractants for immune cells. EDN1 is expressed by endothelial cells, although it can also be produced by macrophages co-cultured with activated T cells [96]. The second group includes annotations related to "metallothionein" and is also driven by a common set of genes (MT1G, MT1H, MT1L, MT1M, and MT2A) known as metallothioneins; they shape the macrophage zinc pool in response to inflammatory and infectious stimuli [97].
Module 29, strongly associated with LPS-activated macrophages, contains five TEGs; CCL8 and MT2A, both highlighted by the global analysis, complement factor B, known to be LPSinduced [98], IDO1, which catalyzes the first and rate-limiting step of the catabolism of the essential amino acid tryptophan along the kynurenine pathway [99], and, finally, ST3GAL1, which is highly expressed in all cell types examined, and thus is likely, not informative to this module. In addition to the TEGs, 14 MRGs were found in module 29 including three metallothioneins (MT1G, MT1H, and MT1M) and one chemokine (CCL15) also highlighted by the global analysis. Among the other MRGs, we found two cytokines, CSF3 and IL36G, as well as PDGFRL, whose molecular function is not established but may be involved in cell proliferation [100]. Taken together, these analyses support an antimicrobial function for this module, with an important signature of chemokine production and zinc metabolism.

The neutrophil transcriptome is distinctive from other immune cell types
Examining the distribution of reads, before and after normalization, we observed that neutrophils have a very different pattern when compared to the other cell types (S2 Fig). Neutrophils have fewer expressed genes (11,626 genes for neutrophil and between 13,161 and 13,614 for other cell populations) and the spread in expression is larger than the other cell types. Not surprisingly, this was also observed in the PCA and hierarchical clustering analyses, representing global differences in the transcriptome of neutrophils as compared to all other cell types examined herein (Fig 1A). In addition, there were 15 transcripts with a greater than the 100-fold difference between the average number of read counts in neutrophils as compared to the average number of read counts across all other cell types combined (see S2 Text).
Global analysis of the 13 neutrophil modules identified general functions such as "transcription" and "phosphoprotein"; however, a number of more specific functions relevant to neutrophils/myeloid cells were also identified, such as "lysosome", "extracellular exosome", "movement of cell or subcellular component", and "superoxide-generating NADPH oxidase activator activity" (S4 Table). In terms of targeted analyses for these 13 neutrophil-associated modules, there were multiple TEGs (n = 349) and MRGs (n = 628). These analyses highlighted neutrophil polarity, lipid metabolism, and chemoattraction as functions related to module 4; calprotectin, azurophil granules, cytoskeleton dynamics, chemoattraction for module 7; actin remodeling & membrane trafficking, nicotinamide metabolism for module 10; neutrophil polarization, trafficking and exocytosis for module 21; and endocytosis and membrane trafficking for module 36.
Moreover, given that the molecular aspects of many neutrophil functions have previously been characterized, we looked for the presence of the transcripts relevant to these functions within our modules. Specifically, we looked for 126 transcripts relevant to neutrophil granules (e.g. azurophil, gelatinase, secretory), antimicrobial proteins, reactive oxygen species (ROS), Neutrophil Extracellular Traps (NETs), cell cross-talk and resolution of inflammation (apoptosis and clearance, lipid mediator class switch) [101][102][103][104]. One hundred 11 transcripts were expressed in our dataset, with 80 of them found within our neutrophil-associated modules. Using this approach, we found that module 11 was associated with the presence of transcripts for azurophil granules, module 15 with defensin-specific granules, module 21 with ROS and NET proteins, and modules 16 with NET-associated Histone Cluster 1 family members. Extending this approach to the neutrophil modules also associated with other myeloid cells, we observe multiple transcripts relevant to neutrophil functions, either because the functions are shared between these cell types (e.g. ROS production, vesicle trafficking) or because the gene products are involved in different functions (e.g. actins are involved in NETs and other cytoskeleton-related functions).

Transcriptional control of genes within B cell modules
In order to gain an understanding of the transcriptional control of genes within each module, we were interested in examining whether there was an enrichment of TF binding sites (TFBS) within the promoters of the genes within its module. In order to do so, we used the empirical ChIP-Seq data from the ENCODE (Encyclopedia of DNA Elements) Project [58]. Given that B lymphocytes are well represented in the ENCODE dataset-76 TFs were studied in the GM12878 immortalized B cell line-we focused our initial analyses on the three modules most associated with B lymphocytes and described above (modules 22, 38 and 41). Specifically, these analyses identified that the TFBS for EZH2 was significantly enriched in the proximal promoters of the 195 genes within module 22 and the TFBS for IKZF1 was enriched in the promoters of the 45 genes within module 38 (S7 Table). Analysis of the promoters for genes within module 41 associated with B cells and monocytes revealed enrichment for nine different TFBSs (BCL11A, CEBPB, EP300, EZH2, IKZF1, MTA3, NFATC1, RFX5, and SPI1) (S7 Table). Interestingly, the enrichment was the greatest for IKZF1, with over four-and five-fold enrichment in modules 38, and 41, respectively. The transcript for IKZF1 is found in module 17, which while associated with lymphoid cells (S6 Fig), has also strong expression across all immune cells (S7 Table). Interestingly, all of the TFs enriched in this analysis have average-to-high levels of expression in B cells. Among the 76 TFs that were evaluated, only two (MYC and NFATC1) have their TFBS enriched in their own modules (modules 3 and 23; respectively). Specifically, the empirical data provided evidence of a TFBS for MYC in the promoter region of 180 genes of the 619 genes from module 3. It has been shown that MYC is necessary to stimulate both proliferation and inhibited differentiation in mature B cells induced by BCR signal [119, 120]. BCR signaling is also known to increase the transcription of the NFACT1 gene, which in turn plays an important role in controlling plasmablast/plasma cell formation [121,122].

Transcription factors potentially associated with cell lineages and states
As described above for the analysis of the B cell-specific modules, the expression of TFs playing a key role in cell fate can be maintained in mature cells. In fact, with the exception of BACH2, TCF3 and IKZF1, all of the key TFs known to be involved in the multiple steps of B-cell-lineage differentiation from common lymphoid progenitor to mature B cell are within modules 22 and 38, despite their being identified from the transcriptome of mature B cells (S8 Table) [ [123][124][125][126]. BACH2, TCF3, and IKZF1 were not identified in these analyses as, while they are necessary to B cell differentiation, their expression is not unique to B cells and in fact, they are ubiquitous in the primary immune cell types studied herein. We were therefore interested in identifying additional TFs that were highly expressed in a given cell type or cell lineage. In order to do so, we examined the 1600 known human TFs [44] for expression patterns that were consistent with the differentiation scheme depicted in Fig 2B. Specifically, we considered each branch point as representing a cell fate decision and looked for TFs that appeared to be specific for a given fate (e.g. myeloid vs. lymphoid; neutrophil vs. monocyte; B cell vs. T cell). For example, we looked for TFs that were highly expressed in lymphoid cells and absent or only very weakly in myeloid cells and identified 13 TFs that are candidates for being involved in lymphoid differentiation and/or function (S5 Fig and S5 Table). This list includes IKZF3, which is a known regulator of lymphoid development and BACH2, a known inhibitor of myeloid differentiation [127,128]. In addition, when looking for TFs potentially involved in the differentiation step leading to monocytes, we identified SMAD1, which is known to be implicated in monocyte differentiation, polarization and inflammation [129,130] and MITF which has not previously been associated with monocyte differentiation but is known to be a phagocyte-restricted TF in macrophage [131]. Altogether, we identified 74 TFs differentially expressed in one cell type or lineage (S5 Fig and S5 Table), with over a third (22/74) having confirmatory evidence in the literature as being involved in the relevant differentiation step. While not definitive proof, this certainly indicates that these TFs are strong candidates for playing a role in cell fate decisions and/or maintenance.

Discussion
Cells within the immune system are generated through successive differentiation steps from a common pluripotent hematopoietic stem cell progenitor. As in all differentiation processes, certain functions are gained and others are lost along the way, such that each immune cell subset has different functions, some unique to a given cell type while others are shared. For decades, such differences have been exploited in flow cytometry-based experimentation, with specific cell surface markers (e.g. CD3, CD4, CD8, and CD14) used to identify, qualify and quantify immune cell populations within different experimental contexts. Sequencing the transcriptome (RNA-seq) of these immune cells can provide a complementary approach to understanding the differences between immune cell types or between different experimental conditions, disease states. RNA-seq based approaches certainly come with the challenge of analyzing and interpreting the data on thousands of transcripts per sample. We, therefore, set out to generate, analyze and interpret RNA-seq data of primary human immune cells from healthy individuals in order to evaluate the feasibility and usefulness to understand some of the biology underlying these cells and could play a major role in many immune-mediated diseases.
Starting with the premise that genes with correlated expression patterns are more likely to have shared biological functions than if they have disparate expression patterns, we aimed to identify functional modules within and across primary human immune cells by defining sets of co-expressed genes. Indeed, this unsupervised approach identified co-expression modules that were either highly correlated to a single cell type or to a group of cell types (Fig 3). We then assigned potential functions to these modules using a combination of global and targeted annotation approaches. From these analyses, it was clear that modules associated with a single cell type were linked to known functions of those cell types; for example BCR signaling pathway genes in the B cell-specific module or genes encoding proteins linked to functions of specific granules, ROS and NET production in different neutrophil-associated modules. As importantly, modules shared across different cell types also had biologic meaning; for example, MHC Class II antigen processing and presentation in B cells and monocytes, or cytotoxic granule composition and non-self-cell identification in NK, γδ T, and CD8+ T cells. These results confirm that this approach can reveal co-expressed genes that do share functions.
While not the primary objective of the current study, these co-expression modules also provide an opportunity to use a "guilt by association" approach in order to identify additional genes involved in the specific functions revealed by our analyses. For example, a number of genes in module 41 encode proteins implicated in vesicular traffic (BLOC1S6, LAMP5, TBC1D5, and TRAK1), potentially revealing novel players in the trafficking of endosomal vesicles involved in antigen processing and presentation. Additional examples are: (1) two RAS family proteins, DIRAS1 and RASGRF1, that are potentially involved in B cell signaling [72]; (2) FGFBP2 potentially having a role in cytotoxic functions of NK, γδ T and CD8+ T cells [88]; (3) NECAB2 potentially being an important regulator of neutrophil apoptosis, autophagy and NETs; and (4) the enzyme-couple channel TRPM6 involved in neutrophil chemotaxis. While experimental validation is required to confirm these proposed functions, nonetheless, this illustrates how additional functional hypotheses can be generated from the information contained within the co-expression modules.
Also found within these modules are genes encoding TFs and transcriptional regulators; multiple TFs with known roles within the immune system were found within the relevant modules. Prime examples are the B cell-specific TFs PAX5 and EBF1 that are respectively within the B cell-specific modules 22 and 38, CIITA is known to regulate MHC class II genes found within module 41, and DUSP2, in module 26, is a dual-specificity phosphatase that influences indirectly the expression of killer receptors within the NK gene complex. In order to assess whether TFs controlling the expression of genes within their own co-expression modules or in distinct modules, we focused our analyses on B cell modules, given the extensive ENCODE Chip-seq data for the 76 TFs in cell lines from the B cell lineage. While we detected an enrichment signal for 45 TFBS in 12 B cell-associated modules. Only two TFs (MYC and NFATC1) have their TFBS enriched in their own modules [119][120][121][122], supporting a network model of TFs regulating the transcription of genes in distinct co-expressed gene sets, which is consistent with current literature [19][20][21]132].
While the focus of our analyses was to identify functional subgroups of genes, we did perform an analysis of TFs that were independent of the co-expression modules described above. Specifically, we searched for TFs that differentially expressed in a given cell type or lineage. Indeed, this approach identified many known lineage-specific TFs that are likely involved in maintaining lineage commitment and key functions associated with those commitments. Prime examples were IKZF3 known to be involved in lymphocyte development and homeostasis [127], PAX5 in differentiation and function of mature B cells [133], and FOXP3 in CD4+ T cell differentiation [134][135][136]. This analysis also highlighted a few TFs previously unknown to be associated with specific lineages, such as ZNF860 in B cell differentiation, ZNF385A in monocyte differentiation, CREB5 in the myeloid lineage, TFCP2L1 in γδ T lineage, and MSC in the activation of macrophages in response to LPS (S5 Fig and S5 Table). While the expression patterns were certainly consistent with a role in cell fate decision and/or maintenance, with many having published confirmatory evidence (18 of 49), future studies will be required to confirm the role of the remaining candidates.
In the future, the approach described herein could be used for the analysis of circulating immune cells in a variety of contexts, such as in cross-sectional comparisons between different immune-mediated diseases, in longitudinal studies of disease progression or response to therapy, or in a comparison of circulating immune cells versus immune cells taken from peripheral inflammatory sites. It is clear that comparisons in such study designs can be complex, with different expression patterns being a result of changes in cell populations, in addition to gene expression within cells, and therefore it will be important to rely on careful immunophenotyping and cell sorting strategies. Alternatively or in complementarity, with an inevitable decrease in the costs associated with single-cell RNA sequencing (scRNA-seq), this may become a better option, with the analysis strategy presented herein also being relevant to this experimental approach.