Lung gene expression and single cell analyses reveal two subsets of idiopathic pulmonary fibrosis (IPF) patients associated with different pathogenic mechanisms

Idiopathic pulmonary fibrosis is a progressive and debilitating lung disease with large unmet medical need and few treatment options. We describe an analysis connecting single cell gene expression with bulk gene expression-based subsetting of patient cohorts to identify IPF patient subsets with different underlying pathogenesis and cellular changes. We reproduced earlier findings indicating the existence of two major subsets in IPF and showed that these subsets display different alterations in cellular composition of the lung. We developed classifiers based on the cellular changes in disease to distinguish subsets. Specifically, we showed that one subset of IPF patients had significant increases in gene signature scores for myeloid cells versus a second subset that had significantly increased gene signature scores for ciliated epithelial cells, suggesting a differential pathogenesis among IPF subsets. Ligand-receptor analyses suggested there was a monocyte-macrophage chemoattractant axis (including potentially CCL2-CCR2 and CCL17-CCR4) among the myeloid-enriched IPF subset and a ciliated epithelium-derived chemokine axis (e.g. CCL15) among the ciliated epithelium-enriched IPF subset. We also found that these IPF subsets had differential expression of pirfenidone-responsive genes suggesting that our findings may provide an approach to identify patients with differential responses to pirfenidone and other drugs. We believe this work is an important step towards targeted therapies and biomarkers of response.


Introduction
Idiopathic pulmonary fibrosis (IPF) is a chronic and progressive fibrosing disease of the lung with a median survival time of <5 years after diagnosis [1,2]. IPF is characterized histologically by a pattern of usual interstitial pneumonia and the appearance of honeycombing cysts and fibroblastic foci [1,2]. Although the disease is associated with infiltration and accumulation of inflammatory cells, IPF patients typically do not improve with anti-inflammatory therapy and the only approved IPF therapies, nintedanib and pirfenidone, are anti-fibrotic and not curative [3,4]. Despite recent advances in genome-wide association studies (GWAS) [5][6][7], the mechanisms connecting genetic susceptibility, environmental factors and molecular and pathological changes in IPF are incompletely understood. IPF is a heterogeneous disease with differences in clinical outcome and rates of disease worsening, suggesting that there are subsets of IPF patients with different molecular mechanisms of pathogenesis [8,9]. As such, a better understanding of IPF pathogenesis and subset heterogeneity is essential to advance new therapies for this devastating disease. A previous attempt by Yang et al. [10] to understand the molecular basis of IPF heterogeneity identified two subsets of IPF patients that were primarily differentiated on the basis of high and low expression of genes from ciliated epithelium; the former was associated with greater pulmonary honeycombing. However, this finding has not been replicated in another study and the pathophysiologic correlates of both subsets have not been explored, including the interactions of ciliated epithelium with other cell types. Cell phenotype-based studies of IPF patient blood and lung samples have shown that increases in plasma cells and mast cells and decreases in T cells were respectively associated with mild versus severe disease [11][12][13]. Importantly, the overlap between subsets identified using different data sources such as gene expression, and cell phenotypes have not been investigated. The development of molecular classifiers to reliably detect and separate subsets of IPF patients using machine learning has not been attempted.
In the current study, we found that the subsets described by Yang et al. (GSE32537, referred to henceforth as 'Schwartz-Univ of Colorado bulk expression cohort') [10] were replicated in our analysis of a new overlapping cohort of IPF patients from a study by Kaminski and colleagues (GSE47460, referred to henceforth as 'Kaminski-LGRC bulk expression cohort') [14][15][16][17] and in our analysis of a non-overlapping independent cohort of IPF patients (GSE134692 (BMS bulk RNA-seq cohort) [18]). We characterized the cellular changes associated with each subset of IPF patients using cell type signatures derived from recently published single cell RNA sequencing (scRNAseq) data obtained from IPF patients and healthy lungs including GSE132771 (i.e. 'Sheppard-UCSF single cell cohort'), GSE135893 ('Kropski-Vanderbilt Univ single cell cohort') and GSE136831 ('Kaminski-Yale Univ single cell cohort') [19,20,21]. Importantly, we identified coordinated changes in genes associated with different cell types in each subset of IPF patients that have important implications for the molecular mechanisms driving disease. Finally, we developed molecular classifiers using machine learning approaches to reliably distinguish subsets of patients.

Processing of single cell RNA datasets, development of cell signature scores and application of cell signature scores to GSE47460 (Kaminski-LGRC bulk expression cohort) and GSE134692 (BMS bulk RNA-seq cohort)
Single cell data from Tsukui et al. (GSE132771, Sheppard-UCSF single cell cohort) [19], Habermann et al. (GSE135893, Kropski-Vanderbilt Univ single cell cohort) [24] and Adams et al. GSE136831 (Kaminski-Yale Univ single cell cohort) [20] were either processed with filtered sparse matrix output from 'cellranger' (10x Genomics) (GSE132771 (Sheppard-UCSF single cell cohort) [19], GSE136831 (Kaminski-Yale Univ single cell cohort) [20] or we used the analyzed data provided by the authors (GSE135893, Kropski-Vanderbilt Univ single cell cohort) [24]. Sparse matrices were processed using the R package 'Seurat' [32]. Cell cluster signatures were determined using differential gene expression with the R package 'MAST' (https://github.com/RGLab/MAST/) and by maximizing the power of the gene signature to GSE136831 Kaminski-Yale Univ single cell cohort Single cell RNA-seq 32 28 [20] https://doi.org/10.1371/journal.pone.0248889.t001 discriminate a particular cell type from the other cell types using an AUROC-based metric (see 'Code availability' and reference [33]). To summarize, we first annotated cell clusters in the scRNAseq data based on canonical markers. We calculated differentially expressed genes (DEGs) for each cluster by comparing the cluster to all other cells in the dataset using the 'FindMarkers' function in the R package 'Seurat' [32]. We then ranked DEGs in decreasing order according to their effect sizes and performed a step-wise search to identify the smallest gene signature that accurately classified a given cell type from every other cell type in the scRNAseq dataset [33]. This included the following steps: (1) estimation of the classification accuracy of the first gene on the list using the area under receiver operating characteristic (AUROC) curve; (2) incremental addition of one gene at a time based on the gene's rank and re-computation of the AUROC corresponding to the new gene set, and 3.) repetition of this process until we identified the minimal gene set that produced an AUROC proximal to the maximum (with ε = 0.005), requiring a minimum of 5 genes per signature. We performed this strategy on each cell type across the scRNAseq dataset [33]. This method focused on finding the best performing gene set that distinguished a given cell type from the rest of the cell types in the dataset and therefore resulted in cell type signatures with partial overlaps with gene signatures from other cell types in the dataset. We created two sets of gene expression signatures for GSE132771 (Sheppard-UCSF single cell cohort) [19]: one for total lung cell suspension samples (sample identifiers GSM3891621, GSM3891623, GSM38916215, GSM3891627, GSM3891629, GSM3891631) and one for 'Lineage-sorted samples' (GSM3891620, GSM3891622, GSM3891624, GSM3891626, GSM3891628, GSM3891630). These two separate sets of gene signatures were created to achieve better resolution of mesenchymal cell types as previously described [19]. We created one set of gene expression signatures each for GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] as this dataset included only total lung suspension samples. We calculated correlation matrices for each signature score derived from GSE47460 so its performance could be assessed (S3 and S5 Figs) using R package 'ggcorrplot' (https://github.com/kassambara/ggcorrplot).
Gene signature scores from bulk IPF RNA microarray (GSE47460, Kaminski-LGRC bulk expression cohort, [14][15][16][17]22]) and RNA-seq (GSE134692, BMS bulk RNA-seq cohort, [18]) were calculated using normalized, batch-corrected gene expression data using the 'GSVA' R package with 'method = 'gsva" setting to calculate gene signature enrichment scores using gene sets derived from the single cell RNA sequencing data. We used the Gene Set Variation Analysis (GSVA) method as described [34]. The GSVA method has several advantages over previously published gene set enrichment methods such as combined z-score, PLAGE and ssGSEA [35][36][37] since GSVA calculated sample-wise gene enrichment scores as a function of genes inside and outside the gene set specified and estimated variation of gene set enrichment over samples independent of any class label in a non-parametric, unsupervised manner [34]. GSVA also alleviated the issue of partially overlapping signatures in the cell type signatures as it relies on the ranking of the entire gene set used as input and used efficient normalization and outlier removal methods so a given cell type-specific signature was not driven by outlying genes but was driven instead by the entire signature [34]. We used changes in gene signature scores to estimate changes in cell type composition in bulk RNA microarray and RNA-seq data.

Determination of ligand-receptor interactions using single cell RNA datasets
We calculated cell type percentages in GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] by dividing the number of cells in each cluster by the total number of cells in the data for the purposes of creating patient subsets in GSE135893 [24]. Subsequently, ligandreceptor interactions from single cell RNA sequencing data were inferred using the PyMINEr [38] and NicheNet [39] R packages. PyMINEr was implemented as an R package and used ligand-receptor network inference from single cell data (Clarivate Analytics; Philadelphia, PA). Ligand-receptor interactions were obtained from Ramilowski et al. [40]. Chord diagrams were created using the R package 'circlize' [41].

Development of pirfenidone response signature
We used the combination of genes significantly downregulated by pirfenidone as reported in Supplementary Table 1 of reference [42]. We combined genes downregulated in response to pirfenidone in lung homogenates only (labeled 'LH only' in [42]) and downregulated in both lung homogenates and isolated fibroblasts (labeled 'both down' in [42]) using a log fold change cutoff of 1.41 and p value cutoff of 0.05. Pirfenidone signature score was calculated using the GSVA method as outlined above for scRNAseq signature scores.

Consensus clustering results using data from GSE47460 (Kaminski-LGRC bulk expression cohort)
A previous attempt to identify subsets of IPF patients based on total lung gene expression identified subsets with large differences in cilia-related gene expression and MUC5B gene expression levels (GSE32537, Schwartz-Univ of Colorado bulk expression cohort) [10]. Therefore, we initially repeated and confirmed the identification of the same two IPF patient subsets in GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10] and used GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17]22] as a replication cohort. We applied the same consensus clustering approach on both datasets for the sake of consistency in processing the data instead of the subsetting method used in the original publication of GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10]. Importantly, this approach had the advantage that it analyzed a partially independent patient cohort (see below; GSE47460, Kaminski-LGRC bulk expression cohort) [14][15][16][17] that measured gene expression on a platform (Agilent) different from that used in the original GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) study (Affymetrix 1.0ST) [10]. We used a data-driven, hypothesis-free approach of consensus clustering of the data obtained from IPF patients in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17], and elected to use k = 2 as the number of consensus clusters ('consensusclasses') in both the GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10] and the GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] based on the consensus clustering results (Fig 1A and S1 Fig). We calculated the Proportion of Ambiguous Clustering (PAC) score [26] to assess performance of the consensus clustering process for a range of possible cluster numbers. PAC scores are regarded as the best  [14][15][16][17] (when comparing Subset 1 and Subset 2 (x axis) using the 75 samples not appearing in GSE32537 and compared to GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10] (y axis). Genes with reported absolute log fold change of larger than 0.58 and adjusted p value < 0.05 were used in this analysis from both datasets. F. Consensus clustering of IPF patients in GSE134692 (BMS bulk RNA-seq cohort) [18] based on the 5,000 most variable genes in IPF patients showing distribution of samples based on k = 2 consensus clusters. G. Hierarchical clustering of IPF samples from GSE134692 (BMS bulk RNA-seq cohort) [18] using top 5,000 most variable genes. x axis represents individual patients, y axis represents genes. Subsets are indicated in x axis color bar and legend of heatmap and correspond to classes shown in Fig 1F. H. PCA of IPF and Normal samples from GSE134692 (BMS bulk RNA-seq cohort) [18] with IPF subsets and Normal indicated. Subsets are indicated in legend of PCA plot and correspond to classes shown in Fig 1F. I. Expression of cilium-related genes from GSE134692 (BMS bulk RNA-seq cohort) [18] previously identified by Yang et al. [10]. Subsets are indicated on x axis of box plots and correspond to classes shown in Fig 1F. Adjusted p values determined by ANOVA and post-hoc Dunn's test are reported on plots. J. Correlation plot of log fold changes calculated in GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10] (when comparing Subset 1 and Subset 2 (x axis, logFC_GSE32537) compared to GSE134692 (BMS bulk RNA-seq cohort) [18] subsets (y axis, logFC_GSE134692). Genes with reported absolute log fold change of larger than 0.58 and adjusted p value < 0.05 were used in this analysis from both datasets. current metric for assessing clustering performance (the lower, the better performance of clustering) [26]. In our consensus clustering results of GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10], k = 2 produced the highest PAC scores. In our consensus clustering analysis of GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17], the PAC score was lower for k = 2 than for k = 3 or k = 4 and minimally higher than k = 5 (S1A Fig). We elected to use k = 2 clusters for subsequent analyses to balance good performance of clustering and reasonable sample numbers for achieving adequate statistical power. With k = 2 clusters, 53% of patients were in consensus class 1 and 47% of patients were in consensus class 2 in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (Fig 1A), thereby presenting a well-balanced dataset. Hierarchical clustering based on the 5,000 most variable genes in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] showed the distribution and relative gene expression of the two IPF subsets (Fig 1B). We will subsequently refer to consensus class 1 as 'Subset 1' and consensus class 2 as 'Subset 2'.
We performed a PCA of the GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] dataset to identify the main features contributing to the separation of the subsets. We included normal control samples in the PCA to understand how the two subsets of IPF patients separated from each other and from healthy control samples. As shown in Fig 1C, the first principal component had the strongest association with the subject cohort (IPF patient or healthy subject). The PCA analysis of IPF samples indicated that there was no correlation with any of the clinical characteristics reported by the authors. We used differentially expressed genes in Subset 1 and Subset 2 as compared to healthy controls to conduct pathway enrichment using the Ingenuity Pathway Analysis tool. With the same data, we conducted a gene set enrichment analysis using the Reactome pathway database ( Table 2). Both Subset 1 and 2 showed an enrichment in extracellular matrix-related processes (Table 2). Importantly, in IPA analyses, only Subset 1 showed an enrichment in 'Role of Macrophages Fibroblasts and Endothelial Cells in Rheumatoid Arthritis' and only Subset 2 showed an enrichment in cilium biologyrelated Reactome pathways ( Table 2).
To further substantiate our results, we analyzed an additional independent cohort of IPF patients (GSE134692 (BMS bulk RNA-seq cohort) [18] to validate our findings. To the best of our knowledge, GSE134692 (BMS bulk RNA-seq cohort) [18] used a completely non-overlapping set of samples with LGRC. Consensus clustering results using GSE134692 (BMS bulk RNA-seq cohort) [18] also revealed two main subsets of IPF patients (Fig 1F-1J and S1B Fig).

Markers of fibrosis and differences in clinical data in IPF subsets
We next determined whether markers of fibrosis and clinical data associated with the severity of IPF were different between the two subsets identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17]. We did not detect changes in the expression of fibrotic genes (including collagen expression, tenascin C (TNC) and IL-11 mRNA levels) between Subset 1 and 2 in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (Fig 2A); these results were consistent with results from the study by Yang et al. (GSE32537, Schwartz-Univ of Colorado bulk expression cohort) [10]. We found no significant differences in percent diffusing capacity of carbon monoxide (%DLCO), forced expiratory volume in 1 second (FEV1) or forced vital capacity (FVC) between the two subsets of patients in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (Fig 2B); these findings were mostly consistent with the reported differences in honeycombing only in IPF patients in the study by Yang et al. (GSE32537, Schwartz-Univ of Colorado bulk expression cohort) [10], in which patients with prominent ciliated epithelial gene expression had worse honeycombing. Overall, this indicated that analysis of clinical parameters, by including additional samples from GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17], did not change the original findings and conclusions from GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10] related to the IPF patient subsets. Interestingly, 3 out of the 4 markers (GREM1, MMP7, CTHRC1 and FHL2) identified by Kaminski and colleagues [43] as having a significant negative correlation with %DLCO and as markers that separate IPF patients by disease severity were expressed at a higher level in Subset 2 of GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (S2  Fig), which suggested a differential prognosis for the two subsets of IPF patients. MMP7 and FHL2 were also reported to be different between subsets in the study by Yang et al. (GSE32537, Schwartz-Univ of Colorado bulk expression cohort) [10]. We did not detect differences in age, sex and smoking history between Subset 1 and 2, which is in agreement with the earlier findings by Yang et al. (GSE32537, Schwartz-Univ of Colorado bulk expression cohort) [10].

Cellular changes in IPF subsets
The pathological process in IPF leads to marked changes in cellular composition of lung tissue and is associated with alterations in both hematopoietic and non-hematopoietic cell populations [44]. Advances in single cell RNA sequencing (scRNAseq) have enabled the quantification of these cellular changes at an unprecedented resolution. To this end, we developed a pipeline based on differential expression of genes in clusters of cells identified in IPF lung samples and created gene signatures from recently published scRNAseq data [9][10][11][12]. This approach bridged the problems inherent to low sample sizes in scRNAseq datasets that precluded reliable consensus clustering. For this analysis, we first used scRNAseq results from Tsukui et al. (GSE132771, Sheppard-UCSF single cell cohort) [19] to develop cellular signatures by re-analyzing the raw data from GSE132771 (Sheppard-UCSF single cell cohort) [19]. Clusters identified through the reanalysis of published scRNAseq data matched well with the clusters published by Tsukui et al. (S3 Fig). We determined cell type-specific gene sets using the methods described in [33]. Cell signatures for each cell type are listed in S1 Table. We applied the cellular signatures derived from Tsukui et al. GSE132771 (Sheppard-UCSF single cell cohort) [19] to the data from Kaminski and colleagues (GSE47460, Kaminski-LGRC bulk expression cohort) [14][15][16][17]. First, we assessed the overlap of genes and the correlation of gene signature scores derived from GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] Next, we calculated gene signature scores using Gene Set Variation Analysis (GSVA) as described in 'Methods' and observed strong, coordinated changes in gene signature scores for epithelial and endothelial cell populations that differed significantly between Subsets 1 and 2 (Fig 3). Our workflow is detailed in Fig 3A. We detected the most significant differences in gene signature scores for ACKR1 negative endothelial cells and ciliated epithelial cells (Fig 3B  and 3D). The latter result matched well with the finding that ciliated epithelium-related gene expression was significantly higher in Subset 2 of IPF patients as previously shown in Fig 1D. Among hematopoietic cell populations, gene expression scores also differed significantly between Subsets 1 and 2 (Fig 4). Specifically, Subset 2 had higher levels of B cells/ plasma cells and lower levels of T cells compared to Subset 1 (Fig 4A and 4B). Gene signature scores for monocytes and macrophages were increased in Subset 1 versus Subset 2 ( Fig 4C). This conclusion was also supported by the expression of individual marker genes across patient subsets (examples are shown in S4 Fig).
To confirm the findings using cellular signatures developed from the data of Tsukui et al. (GSE132771, Sheppard-UCSF single cell cohort) [19], we used more recent scRNAseq datasets Additionally, to further validate this approach in additional IPF cohorts with bulk transcriptomic data, we repeated consensus clustering and cell type signature analysis of an additional IPF cohort in which bulk RNAseq data was available (from lungs removed from IPF and control patients, GSE134692 (BMS bulk RNA-seq cohort)) [18]. This analysis showed similar trends to those seen in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17]; there were two main subsets of patients in GSE134692 (BMS bulk RNA-seq cohort) with one subset expressing high levels of ciliated epithelium-related genes and another subset enriched in macrophage gene signatures (S7 Fig).

Changes in fibroblast, pericyte and smooth muscle populations in IPF subsets
The scRNAseq data allowed identification of novel subtypes of fibroblasts in IPF as reported by Tsukui et al. [19]. Importantly, these authors identified a subset of disease-specific fibroblasts in IPF characterized by high level expression of the CTHRC1 gene and pro-fibrotic mediators including type I and III collagen. Using the gene expression signatures and the same methods, we evaluated changes in fibroblast subpopulations in the two subsets identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (Fig 5). Gene signature scores for many of the fibroblast, pericyte and smooth muscle subpopulations were similarly enriched in IPF patient Subsets 1 and 2. However, gene signature scores for alveolar fibroblast populations were more enriched in Subset 1, whereas gene signature scores for peri-bronchial and adventitial fibroblasts were more enriched in Subset 2, indicating important differences in fibroblast biology between subsets of IPF patients (Fig 5A). Gene signature scores for CTHRC1 + fibroblasts were not different between the two subsets. Taken together, this data indicated that the two subsets identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17], based on gene expression, also had concomitant differences in mesenchymal cell biology. Table 3 provides a short summary of cell type and representative gene expression changes associated with Subsets 1 and 2 in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17]. Based on the cell population changes described in Table 3, we will refer to Subset 1 as 'Myeloid-enriched IPF' and Subset 2 as 'Ciliated epithelium-enriched IPF'.

Differential ligand-receptor networks as potential drivers of cell recruitment in IPF subsets
We hypothesized that the differential cellular make-up in the Myeloid-enriched IPF subset as compared to the Ciliated epithelium-enriched IPF subset was likely due to the differential activation of chemokine and chemokine receptor networks. To test this hypothesis, first, we compared the expression patterns of chemokine ligands across the two subsets identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17]. As shown in Fig 6A, this analysis of GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] showed clustering of samples and chemokine ligands differentially expressed between the two IPF patient subsets. The Myeloid-enriched IPF subset (Subset 1) had increased expression of XCL1, CCL17, CCL5, CXCL9, CXCL10 and CXCL11, whereas the Ciliated epithelium-enriched IPF subset (Subset 2) had increased expression of CCL15, CXCL1, CXCL6, CCL7, CXCL17, CXCL13, CCL14 and CXCL14. We next examined the cellular origin of these chemokines using single cell data from  Table 3. Summary of cellular and gene expression changes in patient subsets in GSE47460.

Subset in GSE47460
Associated cluster-specific cell type changes Example gene expression changes Subset 1. Myeloid cell populations", mast cells", CTHRC1+ fibroblasts", pericytes", Alveolar fibroblasts subtypes" CCR2", CD11b", PPBP"" Subset 2. B cells/Plasma cells"", Ciliated epithelium"", Peribronchial fibroblasts", CTHRC1+ fibroblasts" MZB1", POU2AF1", FOXJ1"" Column 2: ", moderately increased, "", strongly increased; Column 3: ", moderately upregulated over healthy (1.5-2x); "", strongly upregulated over healthy (>2x). https://doi.org/10.1371/journal.pone.0248889.t003 Habermann et al. (GSE135893, Kropski-Vanderbilt Univ single cell cohort) [24]. Not all chemokines were detectable in scRNAseq data. Fig 6B shows detectable chemokine genes with differential expression between subsets of IPF patients. We found that CCL17 was primarily produced by MHC class II high macrophages and was higher in the Myeloid-enriched IPF subset. CCR4 is the receptor for CCL17 and is expressed by helper T cells, which may account for the increased numbers of T helper cells in the Myeloid-enriched IPF subset of patients ( Fig  4B). CCL15 was primarily produced by ciliated epithelial cells, and was accordingly higher in the Ciliated epithelium-enriched IPF subset ( Fig 6B). Additionally, we also conducted a transcriptome-wide analysis of single cell RNA seq data from Habermann et al. (GSE135893, Kropski-Vanderbilt Univ single cell cohort) [24] using the recently published 'PyMiner' approach for differential ligand-receptor expression [39]. This analysis confirmed expression of chemokine and chemokine receptor pairs using single cell data subsets and confirmed IPF subsetting into Myeloid-enriched and Ciliated epitheliumenriched IPF subsets and extended our analysis to additional ligand-receptor pairs active in subsets of IPF patients. The GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] data set contained 19 IPF subjects and we separated these subjects into Myeloid-enriched and Ciliated epithelium-enriched IPF subsets using the percentage of all ciliated epithelial cells (i.e. sum of Ciliated_1, Ciliated_3, Diff_ciliated_15, Ciliated_28 populations in S8 Fig) from each subject (Fig 7A) to separate IPF subjects into 'Ciliated_low' (< 20% of cells are ciliated epithelial cells, analogous to the Myeloid-enriched IPF subset in GSE47460, Kaminski-LGRC bulk expression cohort) [14][15][16][17] and 'Ciliated_high' (> 20% of cells are ciliated epithelial cells, analogous to the Ciliated epithelium-enriched IPF subset in GSE47460, Kaminski-LGRC bulk expression cohort) [14][15][16][17] (S8 Fig) subsets. Our subsetting of the 19 donors in GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] based on 'Ciliated_low' and 'high' criteria was validated based on significant differences between subsets in the percentages of myeloid populations, with the most significant differences between subsets in macrophages (S3 Table), thereby matching subset data in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (Fig 1). Pathway enrichment analysis based on differential gene expression between  Table). Additionally, we confirmed that percentages of relevant cell types followed the expected changes between subsets of IPF patients in GSE135893 (Kropski-Vanderbilt Univ single cell cohort, S8B Fig) [24]. Indeed, we detected significant differences in myeloid cells and a trend in endothelial cell percentages between subsets (with myeloid cells being enriched in the 'Ciliated_low' subset) similarly to the bulk RNA results shown above (S8 Fig). Full breakdown of individual donorlevel percentages of all cell populations in GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] between 'Ciliated_high' and 'Ciliated_low' IPF subsets is provided in S3 Table. We highlighted cell populations significantly different between 'Ciliated_high' and 'Ciliate-d_low' subsets in GSE135893 (Kropski-Vanderbilt Univ single cell cohort) [24] in S3 Table. In our ligand-receptor analysis using PyMiner, we focused on differential ligands produced by cell populations that were significantly different in percentage between 'Ciliated_low' and 'Ciliated_high' subsets (specifically macrophage populations and ciliated epithelial cell populations, respectively). Fig 7B depicts circular diagrams with ligands produced by macrophages in 'Ciliated_low' subjects and Fig 7C depicts circular diagrams with ligands produced by macrophages in 'Ciliated_high' donors ( Fig 7C). In Fig 7D, we depicted ligands produced by ciliated epithelium in 'Ciliated_low' donors, and in Fig 7E we depicted ligands produced by ciliated epithelium in 'Ciliated_high' donors ( Fig 7E). We examined the top 10 th percentile of ligandreceptor pairs from macrophages and epithelial cells (based on z score output provided by PyMiner) in each subset and determined the expression of matching receptors from each condition for visualization purposes. As expected, significant differences in the profiles of inferred active ligand-receptor networks were detected. We confirmed these results using another ligand-receptor network approach, NicheNet [39] (S9 Fig). Overall, differential gene expression results derived from single cell and bulk RNA expression data suggested that a monocytemacrophage chemoattractant axis (including potentially CCL2-CCR2 and CCL17-CCR4) was highly activated in 'Ciliated_low' (Myeloid-enriched IPF subset) patients and was possibly responsible for recruiting inflammatory macrophages in this subset, whereas ciliated epithelium-derived chemokine production (e.g. CCL15) may play an important role in cell recruitment in the Ciliated epithelium-enriched IPF subset of patients.

Development of machine learning-based classifiers for distinguishing the Myeloid-enriched IPF subset versus the Ciliated epithelium-enriched subset
Because the two IPF patient subsets we identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] may have a differential pathogenesis, our findings may have implications for treatment and disease progression. Therefore, we developed models to identify key features to distinguish the subsets. We used machine learning models with recursive feature elimination to 1.) identify the cell types that best distinguished the subsets using the gene expression signature scores described in Figs 3-5 and 2) identify gene expression that can be used to classify IPF patients into subsets. The two approaches offer distinct advantages for predicting subset membership: the first approach identifies biopsy histological features that may distinguish subsets, whereas the second approach permits development of RNA-based assays to distinguish subsets by measuring transcript levels from biopsy samples.
We used support vector machines with a linear kernel, elastic net and a gradient boosting machine to create models for cell type-based classifiers and gene expression values. We trained our models on a randomly selected 70% of IPF patients in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] and used the remaining 30% as a validation set with 5-fold crossvalidation. All three methods (linear kernel, elastic net and gradient boosting) produced high accuracy models with AUROC values > 0.95 (Fig 8A for cell signature scores and 8C for gene expression values). The cell signature approach identified ciliated epithelium, plasma cells, cytotoxic T cells and ACKR1 negative endothelium as the most important features separating the two subsets of IPF (Fig 8B). Recursive feature elimination using gene expression values identified FOXJ1, NELL2, SCGB3A1, LRRC34 and MYL3 as the top five most predictive genes (Fig 8D).

Differences in pirfenidone response signature between subsets of IPF patients
Finally, we asked whether the two subsets of IPF patients would respond differentially to approved IPF therapies. Currently, there are two FDA-approved therapies available for IPF patients, pirfenidone and nintedanib [3,4]. We developed a lung pirfenidone signature using genes downregulated in response to pirfenidone in lung homogenates [42] (Fig 9A). Applying this gene signature to the two subsets we identified in GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17], we found that pirfenidone-responsive genes were upregulated in both subsets of IPF patients; however, pirfenidone-responsive genes were more significantly upregulated in the Ciliated epithelium-enriched IPF subset (Fig 9B). These results suggested that the Ciliated epithelium-enriched IPF subset may be more responsive to pirfenidone as compared to the Myeloid-enriched IPF subset.

Discussion
We used a data-driven, unsupervised clustering of RNA expression data from IPF patient lung samples, that was reproducible across patient cohorts and was associated with changes in the cellular composition of the lungs in IPF. We believe this study provides novel ideas on differential mechanisms of pathogenesis in this heterogeneous disease. We used single cell RNA sequencing data to uncover subpopulations of both mesenchymal and hematopoietic cell populations associated with disease pathogenesis [19,6,[45][46][47]. The throughput of scRNAseq studies limits sample sizes; therefore, we devised an analysis pipeline to bridge this gap by applying single cell RNA-based signature analysis to gene expression data derived from whole tissue gene expression. Through this analysis, we identified key alterations in cellular compositions and molecular mechanisms specific for 2 subsets of IPF patients. Furthermore, we identified potential biomarkers through the use of well-established machine learning techniques to develop classifiers based on both cell type signatures and gene expression values. Overall, we believe this body of work will help with the development of IPF diagnostics and deepen our understanding of cell types involved in the pathogenesis of IPF in different subsets of patients.
Prior studies suggested that both plasma cells and T cells are associated with disease progression [11][12][13], and B cells and plasma cells are enriched in IPF lung tissue and plasma cell gene expression is associated with faster disease progression and poorer survival [13]. These findings were confirmed using additional patient cohorts and different methodologies [48]. However, the nature of plasma cell and IPF-specific autoantibody involvement in IPF is unclear and current studies do not provide a mechanism to connect plasma cell and autoantibody increases to IPF pathogenesis [49,50]. The subset associated with the strongest B cell/ plasma cell signature was the same subset (Ciliated epithelium-enriched IPF subset; Subset 2) associated with a decrease in cytotoxic T cells and helper T cells. Similar to increases in B cells/ plasma cells, decreases in T cell responses have been shown to be associated with a poor prognosis in IPF [11,51]. Evaluation of the expression of genes suggested to be prognostic in IPF [43] suggested that the Ciliated epithelium-enriched IPF subset represented the subset of patients with more severe disease. Gene expression across the two subsets did not suggest an association with acute exacerbations of IPF; for example, some published markers of acute IPF exacerbations (MMP1, MMP7) were higher in Subset 2, while others (AGER, DEFA3) were lower in Subset 2 or not significantly different (COL1A2, CCNA2) [52]. Additionally, we found that 3 out of the 4 markers (GREM1, MMP7, CTHRC1 and FHL2) identified by Kaminski and colleagues [43] as having a significant negative correlation with %DLCO and as markers that separate IPF patients by disease severity and predicted progression were expressed at higher levels in 'Ciliated epithelium-enriched' patients of GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] (S2 Fig), potentially indicating a differential prognosis for the two subsets of IPF patients. Clinical follow up data will be valuable to determine whether the Ciliated epithelium-enriched IPF subset is associated with a worse prognosis and more likely to have acute IPF exacerbations.
A potential concern related to our study is that the subsets detected in our analysis were associated with differential sampling of lung tissue in each study. There are several lines of evidence disputing this conclusion including the fact that the pattern of IPF patient subsets we detected were observed across multiple independent patient cohorts and using different technologies (bulk RNA sequencing and scRNAseq, references [10,12]). Additionally, we re-analyzed samples from GSE124685 (Kaminski-Yale Univ bulk progression RNA cohort) [12], a study that analyzed various stages of IPF lungs by bulk RNA-sequencing. This study analyzed a small (n = 10) number of IPF donors and sampled their lungs in various anatomical locations to obtain transcriptional profiles of IPF lungs in various stages of fibrosis. Key markers identified in this study such as CCR2, ITGAM, FOXJ1 and SNTN did show significant changes by location of the lung samples [12]. It is also possible that the differences between subsets were due to differences in the stage of disease when tissue was sampled. Although we currently have no way of unequivocally determining whether the subsets were at different stages of disease (i.e. Subset 1 progresses into Subset 2), we think this is unlikely because the Myeloid-enriched IPF subset had a higher overall fibroblast gene expression signature compared to the Ciliated epithelium-enriched IPF subset (Fig 3). CTHRC1 + fibroblasts were recently shown to be present in fibrotic lungs [19] and were suggested to be pathogenic based on the high expression of several well-known pro-fibrotic mediators and extracellular matrix components. However, CTHRC1 + fibroblasts were not differentially expressed in the two subsets (Figs 3-5). Besides CTHRC1 + fibroblasts, several well-known profibrotic genes were similarly expressed between the two subsets (Fig 2). Additionally, Subset 1 patients expressed lower levels of genes shown to be prognostic of disease progression [43], and the Myeloid-enriched IPF subset was associated with increased mast cells compared to the Ciliated epithelium-enriched IPF subset (S6 Fig, 'MC_27'). Mast cells have been shown to be associated with a subset of IPF patients with a milder prognosis [53], further suggesting that the subsets we identified may have a different pathogenesis and prognosis. Taken together, these results suggested that Subsets 1 and 2 are not different stages of IPF but rather are subsets of IPF with different underlying pathologies and disease severity. This question can only be definitively answered using longitudinal gene expression data.
The Myeloid-enriched IPF subset was characterized by the presence of increased myeloid cell gene expression. Myeloid cells have been shown to be significant contributors to the development of fibrosis [54][55][56], and increases in SPP1-producing monocytes and macrophages were shown to be a hallmark of IPF pathogenesis [47]. Our analysis indicated that SPP1-producing monocytes along with other subsets of CD14 + cells were increased as compared to control samples in the Myeloid-enriched IPF subset but not in the Ciliated epithelium-enriched IPF subset (Fig 4C). This difference in macrophage numbers was associated with a significant, albeit small, increase in CCL2-expressing alveolar fibroblasts in the Myeloid-enriched IPF subset ( Fig 5B) and increased expression of CCR2, the receptor for CCL2. We analyzed receptorligand interactions in IPF single cell data between myeloid cells and fibroblasts and found that myeloid cells potentially provide important ligands for the activation of fibroblasts and vice versa (Figs 6 and 7, see below). With the accumulation of more scRNAseq data, an essential question to ask will be if different ligand-receptor interactions contribute to the pathogenesis of IPF in the Myeloid-enriched IPF subset.
We extended the receptor-ligand analysis to better understand all potential ligand-receptor changes between subsets of patients; this confirmed the differential activation of the CCL2-CCR2 ligand-receptor pair in the Myeloid-enriched IPF subset and revealed additional major changes in active receptor-ligand interactions between Subsets 1 and 2. Notable examples included the predicted activation of the EREG-EGFR ligand-receptor in the Myeloidenriched IPF subset and the predicted activation of CTGF signaling and the high level expression of CXCL13 in the Ciliated epithelium-enriched IPF subset (Fig 7B-7E). EGFR overexpression has been shown to be a hallmark of IPF [57] and EGFR activation has been shown to contribute to fibrosis in the bleomycin model of IPF [58]. As such, EGFR inhibition may represent an attractive therapeutic strategy for the Myeloid-enriched IPF subpopulation of IPF patients. Other studies have shown that CTGF is a key contributor to fibroblast activation and IPF [59]. The CTGF-blocking antibody pamrevlumab was beneficial in a recently completed phase II clinical trial in IPF [60]. A key question that our results may address is whether the Ciliated epithelium-enriched IPF subset responds better to pamrevlumab treatment. CXCL13 is a major chemokine responsible for the recruitment of antibody-producing cells and formation of germinal centers [61]. B cells and plasma cells are enriched in IPF lung tissue, plasma cell gene expression is associated with faster disease progression and poorer survival [13], and some IPF patients are responsive to B cell depletion with rituximab [62]. Therefore, it would be of interest to see whether rituximab responsiveness is associated with the subset of patients with high levels of B cells/plasma cells in the Ciliated epithelium-enriched IPF subset.
One of the major differences between Myeloid-enriched and Ciliated epithelium-enriched subsets of IPF patients was the expression of genes associated with ciliated epithelium as well as the increased expression of MUC5B in Subset 2. One study has suggested that ciliated epithelium is an important driver of IPF pathogenesis [63]. Additionally, MUC5B is a reproducible susceptibility locus identified in IPF genome-wide association studies (GWAS) [5][6][7].
Polymorphisms in the MUC5B promoter were shown to be associated with different levels of MUC5B expression [64][65][66]. Our analysis indicated that MUC5B upregulation was not a uniform feature of all IPF patients and was associated with ciliated epithelium abnormalities. An interesting question that arises from this analysis is whether patients in the Ciliated epithelium-enriched IPF subset are enriched in MUC5B polymorphisms associated with the pronounced upregulation of MUC5B mRNA. The potential connection between MUC5B upregulation and increased CTGF production found in our study is largely unexplored in the literature. Differential MUC5B production may also be valuable as a biomarker since MUC5B is routinely measured from sputum [67,68]. Exploring the feasibility and value of MUC5B as a biomarker for differentiating the two subsets of IPF patients is worth considering.
We also developed biomarkers to distinguish IPF patient subsets based on either cellular alterations or changes in gene expression. Through this work, we generated a list of potential biomarkers to separate IPF subsets with high accuracy (Fig 8). Using the methods described herein, we found that several of the cell populations different across subsets may also be used as accurate predictors of IPF patient subset (Fig 8). Additionally, we were also able to find a gene set that may function as a predictor of IPF patient subset (Fig 8D). After further validation of our results, it will be essential to develop markers that reliably identify what subset individual patients belong to, i.e. to stratify them into the Myeloid-enriched (Subset 1) or Ciliated epithelium-enriched (Subset 2) subsets. Some of the top 5 genes in our classifier (Fig 8D;  FOXJ1, LRRC34, MYL3, NELL2, SCGB3A1) have known relevance to the biology of ciliated epithelium (e.g. FOXJ1 is a key transcription factor in the development of cilia [69] and LRRC34 is a candidate causative gene in Mendelian disorders of cilium development [70]).
Additionally, we presented a patient stratification hypothesis for one of the currently FDAapproved treatments for IPF, pirfenidone. We showed that our Ciliated epithelium-enriched subset presented with significantly higher pirfenidone-responsive gene expression (Fig 9). These findings may lead to new hypotheses about differential patient treatment of IPF with pirfenidone and suggest a similar approach to the development of biomarkers for other approved therapies for IPF, such as nintedanib.
Our study has several strengths, including connecting alterations in cellular composition to gene expression and offering hypotheses on the differential pathogenesis underlying subsets of patients in IPF; however, it also has several limitations. First, although we used the best available method to assess consensus clustering performance (Proportion of Ambiguous Clustering (PAC) score, [26]), determining the optimal number of clusters from consensus clustering methods has known limitations [26]. It is possible that there is hidden sub-structure in the clusters detected and with a larger number of samples additional subsets could be discovered. Second, the number of donors in single cell RNA sequencing studies used to generate gene expression reference matrices for deconvolution of bulk data are small. Third, there are cell populations not reflected in the single cell RNA sequencing data (such as neutrophil granulocytes) that cannot be estimated in the bulk gene expression data.
Also, we used scores determined by gene set enrichment to estimate levels of cell type enrichment; as such, the scores calculated represent cell type-specific signatures and are not a direct measurement of each cell type. Despite this potential limitation, we believe that we used the most relevant GSVA method to determine cell type-specific gene signature scores; GSVA offers distinct advantages over calculating gene signature scores due to its efficient ranking and outlier smoothing algorithms [33,34].
In addition, we believe that the conclusions we made using gene signatures developed from the scRNAseq datasets are also supported by the observation that the correlation between GSVA-signature scores calculated from total lung suspension datasets GSE132771 (Sheppard-UCSF single cell cohort) and GSE136893 (Kropski-Vanderbilt Univ single cell cohort) are low across the dataset (S3C and S5B Figs). Although we observed higher correlation values using the 'Lineage sorted' dataset from GSE132771 (Sheppard-UCSF single cell cohort), we believe this does not change the main conclusions derived from the results. For example, in Fig 5B there was a high correlation and overlap between THY1high_alv_fib_0 (Alveolar_0) and THY1neg_alv_fib_5 (Alveolar_5) fibroblasts and gene signature scores for these populations (along with the other two alveolar fibroblast populations) across subsets. Therefore, using this example, we found that Subset 1 (Myeloid cell-enriched subset) showed an enrichment in gene signature scores for all alveolar fibroblast populations (Fig 5B) as compared to Control/Subset 2 (Ciliated epithelium-enriched subset) but a decrease in adventitial and peribronchial fibroblasts (Fig 5A).
Another limitation of our study is that it represents 'hypothesis generation' and lacks experimental validation. Unfortunately, we were unable to link and validate our findings to histopathological and longitudinal clinical and gene expression data. Future datasets may answer the question of whether the changes we observed based on gene expression are reflected in cellular changes observable by other methods and if gene expression differences are relevant to prediction of clinical disease course in IPF. However, we believe that our hypotheses have generated valuable insights despite this shortcoming as our results provide testable ideas with suggested associated biomarkers.

Conclusions
In conclusion, we developed an analysis pipeline to subset IPF patients in a data-driven, unsupervised manner and demonstrated an association of cellular changes with gene expression in the two identified subsets. We believe this work provides novel insights into the pathogenesis of IPF and provides testable hypothesesabout differential alterations of cellular composition of the lung in subsets of IPF patients in this difficult-to-treat disease.  [14][15][16][17]. B. PAC scores as a function of number of clusters (k) calculated based on consensus clustering results in GSE134692 (BMS bulk RNA-seq cohort) [18]. C. Distribution of patient subsets from Fig  1A across IPF samples overlapping or non-overlapping between GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17] and GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) [10]. D. PAC scores as a function of number of clusters (k) calculated based on consensus clustering results using the 75 unique samples (not overlapping with GSE32537) from GSE47460 (Kaminski-LGRC bulk expression cohort) [14][15][16][17].  Table. (XLSX) S2 Table. (XLSX) S3 Table. (XLSX)