Meta-analysis of Gene Expression Microarray Datasets in Chronic Obstructive Pulmonary Disease

Chronic obstructive pulmonary disease (COPD) was classified by the Centers for Disease Control and Prevention in 2014 as the 3rd leading cause of death in the United States (US). The main cause of COPD is exposure to tobacco smoke and air pollutants. Problems associated with COPD include under-diagnosis of the disease and an increase in the number of smokers worldwide. The goal of our study is to identify disease variability in the gene expression profiles of COPD subjects compared to controls. We used pre-existing, publicly available microarray expression datasets to conduct a meta-analysis. Our inclusion criteria for microarray datasets selected for smoking status, age and sex of blood donors reported. Our datasets used Affymetrix, Agilent microarray platforms (7 datasets, 1,262 samples). We re-analyzed the curated raw microarray expression data using R packages, and used Box-Cox power transformations to normalize datasets. To identify significant differentially expressed genes we ran an analysis of variance with a linear model with disease state, age, sex, smoking status and study as effects that also included binary interactions. We found 1,513 statistically significant (Benjamini-Hochberg-adjusted p-value <0.05) differentially expressed genes with respect to disease state (COPD or control). We further filtered these genes for biological effect using results from a Tukey test post-hoc analysis (Benjamini-Hochberg-adjusted p-value <0.05 and 10% two-tailed quantiles of mean differences between COPD and control), to identify 304 genes. Through analysis of disease, sex, age, and also smoking status and disease interactions we identified differentially expressed genes involved in a variety of immune responses and cell processes in COPD. We also trained a logistic regression model using the 304 genes as features, which enabled prediction of disease status with 84% accuracy. Our results give potential for improving the diagnosis of COPD through blood and highlight novel gene expression disease signatures.

Introduction 1 bronchial biopsies highlighted the presence of increased concentrations of inflammatory 22 cells throughout the lungs [11,12]. Studies have also suggested that COPD acts like an 23 autoimmune disease due to persistent inflammation even after smoking has 24 ceased [12][13][14]. In addition to environmental pollutants, there is also also a genetic 25 deficiency, alpha-1 antitrypsin deficiency (AATD), that is associated with COPD [8]. 26 AATD protects the lungs, and without it the lungs become vulnerable to COPD. The 27 prevalence of COPD is expected to rise due to increasing smoking rates and larger 28 populations of elderly individuals in many countries [5]. 29 COPD is often underdiagnosed and despite tobacco exposure being the highest risk 30 factor, not all smokers get COPD, and non-smokers can also develop COPD. Previous 31 work has been done to identify biomarkers for earlier diagnosis of COPD in blood, a 32 non-invasive approach. Bahr et al., compared expression profiles of smokers with COPD 33 and smokers without COPD [15]. They used multiple linear regression to identify 34 candidate genes and pathways. Their results highlighted pathways involved in the 35 immune system and inflammatory response [15]. Another study of blood gene expression 36 in COPD explored using pre-existing gene interaction networks to perform unsupervised 37 clustering to identify COPD disease sub-types [16]. More recently, Reinhold et al., took 38 a different approach by conducting a meta-analysis that identified groups of genes 39 associated with COPD by using consensus modules of gene co-expression. They built 40 networks of genes that were co-expressed and associated with COPD phenotypes [17]. 41 In our meta-analysis, the objective was to identify the effects of age, sex, and 42 smoking status on gene expression in COPD. We investigated gene expression changes 43 in blood for 1,262 samples (574 healthy samples and 688 COPD samples) to identify 44 genes and their associated pathways in COPD. Our study is the largest meta-analysis 45 on blood expression for COPD to date, to the best of our knowledge, and our results 46 offer prospective gene and pathway associations that may be targeted for improving 47 COPD diagnosis and treatment. Our meta-analysis also highlighted disease genes that 48 interact with smoking status, and these genes can be used to further characterize the 49 effects of smoking on COPD development. 50 Materials and Methods 51 We used seven publicly available COPD microarray gene expression datasets in our 52 meta-analysis to evaluate variation in gene expression across samples due to disease 53 status, sex, age and smoking status ( Table 1). The 7 expression datasets were from 3 54 June 14, 2019 2/26 different microarray platforms: Affymetrix GeneChip Human Genome U133 Plus 2.0, 55 Affymetrix Human Gene 1.1 ST Array and Agilent Whole Human Genome Microarray 56 4x44K. Our current meta-analysis pipeline (similar to Brooks et al. [18]), included 5 57 main steps (Fig 1): (1) data curation; (2) pre-processing of raw expression data; (3) 58 analysis of variance (ANOVA) on our linear model which compared gene expression 59 changes due to disease state, smoking status, sex and age group; (4) post-hoc analysis 60 using Tukey Honest Significance Difference test (TukeyHSD) for biological significance; 61 and (5) Gene ontology (GO) and pathway enrichment analysis of the differentially 62 expressed and biologically significant genes. 63  To gather the datasets for our meta-analysis, we searched the National Center for 66 Biotechnology Information (NCBI)'s data repository, Gene Expression Omnibus 67 (GEO) [19], and the European Bioinformatics Institute (EMBL-EBI)'s data repository, 68 Array Express (AE) [20] for microarray expression data. We used the following 69 keywords to search the repositories: COPD, Homo sapiens, blood (whole blood and 70 peripheral blood mononuclear cells) and expression profiling by array ( Fig 1B). The 71 search results were further filtered to include datasets where the age, sex and smoking 72 status of the samples were reported (Fig 1B). We found 3 datasets from GEO 73 (GSE42057 [21], GSE71220 [22], GSE54837 [23]) and 1 from AE (E-MTAB-5278 [24]) 74 that met our search criteria (Table 1 and Fig 1B). We conducted an additional search 75 on GEO and AE to find healthy subjects with their smoking history reported to balance 76 our control subjects with our COPD subjects. The search keywords included: Homo 77 sapiens, blood, smoking and expression profiling by array. We also filtered these search 78 results for datasets that reported the age, sex and smoking status of subjects. With this 79 additional search, we added 3 more datasets: GSE87072 [25], GSE47415 [26], and 80 E-MTAB-5279 [24] which helped improve the balance between COPD and control 81 subjects (Table 1 and DF1 of online data files).

82
After selecting the datasets for our meta-analysis, we retrieved the raw microarray 83 expression data for each dataset, and created a demographics file per study, which 84 included sample characteristics using e-utils in Mathematica [27] ( Table 2). The 85 demographics files were further filtered to eliminate samples that did not fit our 86 inclusion criteria. For example, GSE71220 included subjects that were using statin 87 drugs [22], and hence we excluded all samples that were receiving treatment from our 88 analysis. For GSE87072, we removed the samples that were moist snuff consumers [25] 89 and only used smokers and non-smokers in our analysis. In our additional search for 90 controls with smoking status reported, we filtered the selected datasets (GSE87072,

91
GSE47415 and E-MTAB-5279) and only used the healthy samples for our analysis. In 92 addition to this, we excluded the subjects in GSE23515 [28] from our analysis because   probes were annotated and summarized ( Fig 1C). For the Affymetrix Human Genome 107 Plus 2.0 expression data files, the expresso function was used to pre-process the files 108 with the following parameters: background correction with robust multi-array analysis 109 (RMA), correcting the perfect-match (PM) probes, and 'avdiff' to calculate expression 110 values [31]. Subsequently, the avereps function from limma was used to summarize the 111 probes and remove replicates [34]. The Affymetrix Human Gene 1.1 ST data files were 112 also background corrected using RMA, and the probes were summarized and replicates 113 removed using the avereps function. As for the Agilent data files, background 114 correction was performed using the backgroundCorrect function with NormExp

115
Background Correction as the method from the limma package [35]. The probes for 116 both Affymetrix Human Gene 1.1 ST and Agilent were also summarized and replicates 117 were removed using the avereps function from limma. Once pre-processing was 118 completed, the 8 datasets (Table 1) were merged by common gene symbols into a single 119 matrix file. Using the ApplyBoxCoxTransform function and the StandardizeExtended 120 function from the MathIOmica (version 1.1.3) package [27,36] in Mathematica, we As we also previously described [18], 130 the study factor is directly related to the microarray platform type. To address this, the 131 ComBat function in the sva package was used to correct for variation in the data due to 132 the study factor [38,39]. PCA plots were used to visualize variation in expression data 133 before and after batch correction with ComBat [40] (Fig 2 and S1 Fig),  impact on gene expression in COPD, we modeled (see linear model below) our merged 139 expression matrix (DF2 of online supplementary data files) and then ran ANOVA to 140 identify differentially expressed genes ( Fig 1C) using aov and anova from base R's 141 stats package (as previously described [18]

163
To determine the biological effect of the ANOVA statistically significant genes 164 (disease status factor) and calculate relative expression (difference in means) to 165 determine up-or down-regulation of genes, we conducted a post-hoc analysis with 166 TukeyHSD function in the stats package in base R using our linear model outlined 167 above. We added an additional column to the TukeyHSD results which contained 168 BH-adjusted TukeyHSD p-values, and all GO terms and pathways with a BH-adjusted 169 p-value <0.05 were considered significant. To find genes that were significantly up-and 170 down-regulated, we further filtered the gene list by difference in means by using the 171 two-tailed 10 and 90% quantile. With these results we carried out GO and pathway 172 enrichment to identify which biological processes and pathways the genes were enriched. 173 We used the disease genes and explored sex, smoking status and aging effects on their 174 gene expression.

175
Machine Learning with COPD

176
Machine learning classification was carried out in Mathematica using the Classify 177 function [45], with the Method parameter set to "LogisticRegression". We first trained 178 on all 1262 samples, using the statistically significant disease genes, filtered with a two 179 tailed 10 and 90% quantile selection for effect size as features (304 genes). We also

184
Our meta-analysis selection criteria for data curation ( Fig 1B)  Prior to designing our linear model, we wanted to visualize variation introduced into the 192 data due to batch effects, and how the variation changes when the data is adjusted with 193 ComBat for batch effects. We used ComBat in R to adjust for the study effect on the 194 data and generated PCA plots before and after batch correction (Fig 2). In Fig 2A, 195 before running ComBat, the data separates into four major clusters with a variance of 196 49.9% in PC1 and 15.7% in PC2. After running ComBat, the clustering of the data is 197 removed, and variance reduced to 17.7% in PC1 and 4.4% in PC2 ( Fig 2B). We also   Of the 1,513 disease genes we further filtered our ANOVA results (see DF4 of online 230 supplementary data files) to identify genes with statistically significant interactions with 231 smoking status (disease:smoking status, BH-adjusted p-value < 0.05). We found 39 232 genes that had a statistically significant pairwise interaction between disease status and 233 smoking status (see DF14 of online supplementary data files). Using the 39 interacting 234 genes, we calculated the row means across the different pairings of smoking status and 235 disease status to compare expression (Fig 7). We used the row means of the  To assess biological effect and determine factorial differences in gene expression we ran 245 TukeyHSD on our 1,513 statistically significant disease genes. We first focused on 246 COPD and control gene expression differences and used BH-adjusted p-value <0.05 to 247 determine significance. We also filtered further by using a 10% two-tailed quantile 248 cutoff to identify significantly up-and down-regulated genes. Once we filtered by 249 p-value, we calculated to 10 and 90% quantiles using differences in group means. For 250 the COPD-control TukeyHSD comparisons we found 304 statistically significant genes 251 that we classified as up-regulated (mean differences -0.0260) and down-regulated     We 276 did not find any statistically significant interacting genes between disease status and sex 277 from our ANOVA results.

278
To determine the age effect on our DEG associated with COPD (304 genes), we 279 focused on our TukeyHSD results where the age group <50 was the baseline. We 280 selected for significance (BH-adjusted p-value <0.05) and two-tailed 10% (up-regulated 281 0.421 and down-regulated -0.193) on the difference in means results to find 282 significant age-group effects. We identified 304 significant age-group comparisons across 283 95 unique genes (see DF13 of online supplementary data files). We plotted the relative 284 expression (difference in means) across all age comparisons with <50 as the baseline. 285 We identified two clear clusters of the genes by expression which indicated that there 286 are significant differences in expression profiles due to aging ( Figure 8). However, we 287 did not find any statistically significant genes with an interaction between disease status 288 and age from our ANOVA results. Mathematica for predicting whether a profile belongs to the control or COPD group.

322
Our ANOVA highlighted 1,513 statistically significant (BH-adjusted p-value <0.05; 323 disease status factor) disease genes (see DF4 of online supplementary data files). One of 324 our genes, FAM13A, has previously been associated with COPD susceptibility [8,49]. 325 Other genes such as GPR15, CLEC4D and MPO have also been associated with COPD 326 and inflammation within the lungs. Our GO and pathway enrichment results highlight 327 some immune pathways (Table 3) and GO terms such as innate immune response, 328 adaptive immune response and inflammation (DF7 of online supplementary data files) 329 that have previously been associated with COPD. For example, primary 330 immunodeficiency (weakened immune system due to deficiencies in immune cell 331 production) is linked to recurrent infections in subjects with COPD [50]. This 332 recurrence in infections due to a weakened immune system also causes chronic 333 inflammation and airway remodelling and obstruction [50]. Humoral deficiencies and 334 inadequate antibody production and responses to infections also reduce the effectiveness 335 of vaccinations such as the influenza and pneumococcal vaccines [50,51]. Studies have 336 suggested that antibody replacement therapy can help to reduce the recurrence of 337 bacterial and viral infections in COPD subjects [50,52]. In our results, we highlighted 338 the primary immunodeficiency KEGG pathway to determine how our genes are 339 regulated in the pathway (Fig 3). Of the 8 genes highlighted in the T cell maturation differentiation was down-regulated in COPD subjects (Fig 3). IL2RG is also associated 343 with severe combined immunodeficiency [53]. Previous findings suggest that 344 down-regulation of the soluble common gamma chain is a mechanism to reduce 345 inflammation by T cells in response to cigarette smoke in a COPD mouse model [54].

346
Up-regulated γC promotes interferon-γ production and inflammation in the respiratory 347 tract [54]. IL7-Ra and JAK3 are also linked to severe combined immunodeficiency.

DCLRE1C (Artemis) and CD3E are both involved in Pro-T to Pre-T cell differentiation 349
and were up-and down-regulated respectively in COPD subjects (Fig 3). Genes such as 350 LCK ZAP70 and RFXAP are involved in T cell differentiation into CD8+ and CD4+ 351 cells and were found to be down-regulated in COPD (Fig 3). In B-cell differentiation, 352 our gene hits, BTK (B-cell development) and IKBKG (alias IKKγ ) were up-regulated 353 in COPD while Igα was down-regulated (Fig 3). Reduced Igα or deficiencies in Igα 354 promote reoccurring infections and disease exacerbation in COPD subjects [50,55].  4). Cytokines play a major role in the inflammatory response observed in COPD 359 subjects. For instance, CCR8 (chemokine) was up-regulated in COPD subjects (Fig 4). 360 Increased levels of CCR8 has been previously observed in allergic asthmatics [56] and 361 has a functional role in macrophage processes and release of cytokines in the lungs [57]. 362 We also visualized our up-and down-regulated gene hits in the other enriched 363 KEGG pathways (Table 3 and  cancer and it leads to 1% of cancer cases each year [58]. Furthermore, there is a five-fold 366 increase to developing lung cancer in patients with COPD compared to individuals with 367 normal pulmonary function [58]. Some of our highlighted genes are involved in smoke. Additionally, dysregulation of the lysosomal pathway has also been previously 375 described in COPD patients [59]. 376 We observed some down-regulated genes in the adherens junction pathway for modules for each cohort [17]. Out of our 304 genes, 97 of them overlapped with their 398 findings while 207 of our genes were unique. We used BINGO in Cytoscape v.3.7.1 for 399 GO analysis on our 207 unique genes (Fig S9 and S3 File) [62,63]. Our BINGO results 400 (BH-adjusted p-value < 0.05) include GO terms such as defense response, response to 401 bacterium, response to stress, response to wounding, immune response, cell adhesion, 402 and inflammatory response (Fig S9 and S3 File).

421
To assess the effect of smoking status on gene expression, we focused on the 422 biologically significant genes with a significant interaction between disease status and 423 smoking status. We identified 39 disease genes that significantly interacted with 424 smoking status (Fig 7). The baseline in Fig 7 was  apoptosis and helps control cancer cells [66] and POU2AF1 is a regulator of host 434 defenses but cigarette smoke suppresses its gene expression [67] (Fig 7). In our analysis 435 there was only 1 COPD non-smoker which was excluded from this analysis. observed altered expression of elastin and collagen in COPD compared to controls, and 445 the stage/severity of COPD affected extracellular matrix remodeling [68,69]. Studies on 446 COPD and sex, previously suggested higher prevalence in males due to them having 447 higher smoking rates [70,71]. However, currently with larger numbers of women 448 smoking the prevalence of COPD in women is on the rise. Studies have shown that 449 women are 50% more susceptible to COPD than males and why this is the case is still 450 an on going debate [70,71]. Some reasons include, smaller airways so larger 451 concentrations of tobacco smoke in the lungs and hormonal effects [70,71]. Of the 44 452 genes with a sex effect, we did not find any genes with a significant interaction between 453 disease status and sex.

454
Aging trends were visualized on the biologically significant disease genes. 95 genes 455 showed significant aging trends compared to our baseline (<50) (DF13 of online 456 supplementary data files). Symptoms for COPD can be detected between ages 40 and 457 50 [72], and because of this we used our subjects grouped as <50 as our baseline. The 458 data clustered into two distinct groups with similar gene expression patterns (Fig 8).

459
Group 1 genes were significantly up-regulated for all age groups compared to the 460 baseline (Fig 8). Pathway enrichment analysis indicated that the genes in group 1 are 461 involved in the Neutrophil degranulation pathway (p-value 1.06e-04 and FDR 0.029) 462 which has previously been described above as being up-regulated in COPD subjects.

463
Genes within group 2 displayed an opposite trend with most genes being down-regulated 464 with increasing age (Fig 8). These genes did not result in any statistically significant 465 enrichment. However, genes in group 2 include C11orf74 (involved in transcription 466 regulation) [53], CD163 (previously found to be over expressed in lungs of individuals 467 with severe COPD) [73], TCF7 (natural killer and lymphoid cell development) [53], 468 CYP1B1 (previously shown to be up-regulated in COPD and smokers) [74] and SASH1 469 (involved in TLR4 signaling and can promote cytokine production) [53]. In addition to 470 this, we did not find any significant interacting genes between disease status and age.

471
To test the possibility of using blood expression data from micro-arrays to predict 472 disease status, we performed machine learning with a logistic regression model using the 473 304 disease genes. This resulted in an average accuracy of 84.2% (Fig 9). These results 474 are promising despite using aggregate expression versus cell-type specific expression.

475
Previous studies explored using computed tomography (CT) images COPD patients and 476 controls for disease classification [75]. Some studies also used patient reported data 477 (such as heart rate, respiratory rate) to predict disease exacerbation and resulted in an 478 ROC of 0.87 [76] and another with 70% sensitivity and 71% specificity [77].

479
Conducting a meta-analysis with microarray expression data limits our findings to 480 annotated genes, and hinders us from discovering novel genes and looking at the entire 481 transcriptome. Additionally, using publicly available data limits us to specific factors we 482 can explore in our analysis due to subject characteristics not being reported uniformly 483 across datasets (see S1 File). For example, all studies did not report ethnicity and 484 therefore we could not investigate the effect of ethnicity on gene expression in COPD. 485 This would be a good factor to explore due to over 90% of COPD cases occurring in 486 low-middle class communities [5,10]. We also did not have consistently reported disease 487 severity information to factor into our analysis and findings. Our selection criteria for 488 the publicly available data limits our sample size (Fig 1). In addition to this, the