Differential Gene Expression in Colon Tissue Associated With Diet, Lifestyle, and Related Oxidative Stress

Several diet and lifestyle factors may impact health by influencing oxidative stress levels. We hypothesize that level of cigarette smoking, alcohol, anti-inflammatory drugs, and diet alter gene expression. We analyzed RNA-seq data from 144 colon cancer patients who had information on recent cigarette smoking, recent alcohol consumption, diet, and recent aspirin/non-steroidal anti-inflammatory use. Using a false discovery rate of 0.1, we evaluated gene differential expression between high and low levels of exposure using DESeq2. Ingenuity Pathway Analysis (IPA) was used to determine networks associated with de-regulated genes in our data. We identified 46 deregulated genes associated with recent cigarette use; these genes enriched causal networks regulated by TEK and MAP2K3. Different differentially expressed genes were associated with type of alcohol intake; five genes were associated with total alcohol, six were associated with beer intake, six were associated with wine intake, and four were associated with liquor consumption. Recent use of aspirin and/or ibuprofen was associated with differential expression of TMC06, ST8SIA4, and STEAP3 while a summary oxidative balance score (OBS) was associated with SYCP3, HDX, and NRG4 (all up-regulated with greater oxidative balance). Of the dietary antioxidants and carotenoids evaluated only intake of beta carotene (1 gene), Lutein/Zeaxanthine (5 genes), and Vitamin E (4 genes) were associated with differential gene expression. There were similarities in biological function of de-regulated genes associated with various dietary and lifestyle factors. Our data support the hypothesis that diet and lifestyle factors associated with oxidative stress can alter gene expression. However genes altered were unique to type of alcohol and type of antioxidant. Because of potential differences in associations observed between platforms these findings need replication in other populations.


Introduction
Diet and lifestyle factors have been associated with several chronic diseases including colon cancer. Several of these diet and lifestyle factors may influence risk through their impact on inflammation and oxidative stress. Cigarette smoking has been associated with colon cancer and other chronic disease and has been linked to inflammation through increased oxidative stress [1][2][3][4][5][6][7][8]. Likewise, alcohol intake has been associated with risk of colon cancer and may act through several mechanisms, including increasing levels of oxidative stress [9][10][11][12]. One of the factors most consistently inversely associated with colon cancer risk is use of aspirin and/or non-steroidal anti-inflammatory drugs (NSAIDs). Dietary antioxidants likewise have been associated with colon cancer risk [13][14][15][16]. We and others have shown that the oxidative balance score (OBS) is associated with colon and rectal cancer [17][18][19], with risk lowest among those with a score that is higher in antioxidants and low in pro-oxidant factors. Components of this score are cigarette smoking, use of aspirin or non-steroidal anti-inflammatory drugs (NSAIDs), and dietary pro-and anti-oxidants. While it has been hypothesized that these factors act as anti-oxidants or pro-oxidants, it is unknown if they directly influence gene expression that may alter these and other pathways. However, changes in gene expression were observed among men in an intervention study of diet and lifestyle modification [20], supporting the hypothesis that diet and lifestyle factors influence gene expression. Others also have suggested that oxidative stress may modulate gene expression [21].
In this study we examine how dietary antioxidants, cigarette smoking, alcohol consumption, NSAID use, and the OBS influence gene expression in non-tumor colonic tissue. We hypothesize that differences in gene expression levels with be associated with level of diet and lifestyle factors that influence inflammation and oxidative stress. We hope that information on how these diet and lifestyle factors alter gene expression will provide direct insight into how they influence cancer and other chronic diseases.

Methods
Total RNA was available from colonic non-tumor tissue for 175 colon cancer cases who were part of the Diet, Activity, and Lifestyle study, an incident, population-based, case-control study of colon cancer from Utah and the Kaiser Permanente Medical Research Program (KPMRP), and the Twin Cities Metropolitan area. Cases had tumor registry verification of a first primary adenocarcinoma of the colon and were diagnosed between October 1991 and September 1994. Tumor tissue blocks were obtained for 97% of all Utah cases and for 85% of all KPMRP cases [22] and included those who signed informed consent and those retrieved by local tumor registries and sent to study investigators without personal identifiers. Individuals with known adenomatous polyposis coli (APC), Crohn's disease, or inflammatory bowel disease were not eligible for the study. Individuals with MSI high tumors were sequenced for inherited mutations in mismatch repair genes and excluded from the study if such mutations existed [23]. The study was approved by the Institutional Review Board of the University of Utah and at KPMRP. In this study we use colonic non-tumor tissue that was from the same location as the tumor to evaluate how recent cigarette smoking, recent alcohol use, recent use of aspirin or non-steroidal anti-inflammatory agents (NSAIDs), and recent dietary intake of antioxidants influence gene expression.

Diet and Lifestyle Data
Data were collected by trained and certified interviewers using laptop computers. All interviews were audio-taped as previously described and reviewed for quality control purposes [24]. The referent period for the study was two years prior to diagnosis for cases and selection for controls. Dietary information was obtained for the referent year from an extensive diet history questionnaire adapted from the validated CARDIA diet history [25]. As part of the study questionnaire, information was collected on regular use and current use of aspirin and non-steroidal anti-inflammatory drugs and cigarette smoking history including start and stop dates for smoking. Alcohol intake of beer, wine, and hard liquor was ascertained for the referent year as well as for 10 and 20 years prior to the referent year.

RNA processing
RNA was extracted from formalin-fixed paraffin embedded tissues. We assessed slides and tumor blocks that were prepared over the duration of the study prior to the time of RNA isolation to determine their suitability. Older slides produced comparable RNA quality as more recent slides and was not correlated with time lapse between slide preparation and RNA preparation. The study pathologist reviewed slides to delineate tumor and non-tumor tissue. Cells were dissected from 1-4 sequential sections on aniline blue stained slides using an H&E slide for reference. Total RNA was extracted, isolated, and purified using the RecoverAll Total Nucleic Acid isolation kit (Ambion), RNA yields were determined using a NanoDrop spectrophotometer.

Sequencing Library Preparation
Library construction was performed using the Illumina TruSeq Stranded Total RNA Sample Preparation Kit with Ribo-Zero. Briefly, Ribosomal RNA was removed from 100 ng total RNA using biotinylated Ribo-Zero oligos attached to magnetic beads that are complimentary to cytoplasmic rRNA. Following purification, the rRNA-depleted sample is fragmented with divalent cations under elevated temperatures and primed with random hexamers in preparation for cDNA synthesis. First strand reverse transcription is accomplished using Superscript II Reverse Transcriptase (Invitrogen). Second strand cDNA synthesis is accomplished using DNA polymerase I and Rnase H under conditions in which dUTP is substituted for dTTP, yielding blunt-ended cDNA fragments in which the second strand contains dUTP. An A-base is added to the blunt ends as a means to prepare the cDNA fragments for adapter ligation and block concatamer formation during the ligation step. Adapters containing a T-base overhang were ligated to the A-tailed DNA fragments. Ligated fragments were PCR-amplified (13 cycles) under conditions in which the PCR reaction enables amplification of the first strand cDNA product, whereas attempted amplification of the second strand product stalls at dUTP bases and therefore is not represented in the amplified library. The PCR-amplified library was purified using Agencourt AMPure XP beads (Beckman Coulter Genomics). The concentration of the amplified library was measured with a NanoDrop spectrophotometer and an aliquot of the library is resolved on an Agilent 2200 Tape Station to define the size distribution of the sequencing library.

Sequencing and Data Processing
Sequencing libraries (18 pM) were chemically denatured and applied to an Illumina TruSeq v3 single read flow cell using an Illumina cBot. Hybridized molecules were clonally amplified and annealed to sequencing primers with reagents from an Illumina TruSeq SR Cluster Kit v3-cBot-HS. Following transfer of the flowcell to an Illumina HiSeq instrument, a 50 cycle single-read sequence run was performed using TruSeq SBS v3 sequencing reagents. The singleend 50-base reads from the Illumina HiSeq2500 were aligned to a sequence database containing the human genome (build GRCh37 / hg19, February 2009, from genome.ucsc.edu) plus all splice junctions generated using the USeq MakeTranscriptome application (version 8.8.1, available here: http://useq.sourceforge.net/). Alignment was performed using novoalign version 2.08.01 available from novocraft.com, which also trimmed any adapter sequence. Following alignment, genome alignments to splice junctions were translated back to genomic coordinates using the USeq SamTranscriptomeParser application. The resulting alignments were sorted and indexed using the Picard SortSam application (version 1.100, available here: http:// broadinstitute.github.io/picard/). Aligned read counts for each gene were calculated using pysam (https://code.google.com/p/pysam/) and samtools (http://samtools.sourceforge.net/). A python script using the pysam library was given a list of the genome coordinates for each gene, and counts to the exons and UTRs of those genes were calculated. Gene coordinates were downloaded from http://genome.ucsc.edu.
We compared our data to a gene table with 51,041 molecular features. We dropped features that were not expressed in our data or for which the expression was unavailable for the majority of samples. Using the BioMart tool on the Ensembl website (http://www.ensembl.org), we created a list of known regions linked to protein-coding genes from the human GRCh38 gene annotation dataset. Non-protein coding genes were also dropped from our analysis. Our final analysis included 17,462 protein-coding features.

Statistical Methods
Of the 197 initial tumor/non-tumor tissue pairs, five subjects failed quality control (QC) based on low number of sequence counts for both tumor and non-tumor tissue, and 17 were dropped because the non-tumor colonic tissue failed QC, leaving 175 subjects with high quality expression data. Of these 144 had questionnaire data for diet and lifestyle data for inclusion in the analysis. In terms of specific dietary factors, we focused on carotenoids and antioxidant nutrients, including beta carotene, vitamin C, total alpha tocopherol (vitamin E), lycopene, and lutein + zeaxanthin. We also considered alcohol consumption, recent use of NSAIDs, and current smoking. For each dietary and lifestyle factor, our analysis centered on contrasting gene expression levels of individuals with lower intake or exposure levels to those of individuals with higher intake or exposure levels. Each individual's intake or exposure was assigned a category based on their dietary and lifestyle data. Dietary data were categorized into tertiles [i.e. low (T1), moderate (T2), or high (T3)] based on the empirical distributions in the population. Cigarette smoking was categorized as never, former, or current smoker. Alcohol was categorized into non-drinker, low intake, or high intake for each type of alcohol. Use of NSAIDs (which included aspirin and/or non-steroidal anti-inflammatory drugs) was categorized as either being a recent user (i.e. using NSAIDs during the referent period) or a non-user. To summarize risk associated with multiple exposures, we developed an oxidative balance score (OBS) that consisted of 13 diet and lifestyle factors that were pro-oxidants (dietary iron and polyunsaturated fat and cigarette smoking) and anti-oxidants (vitamin C, vitamin E, selenium, beta carotene, lycopene, lutein+zeaxanthin, vitamin D, calcium, and folic acid and NSAID use) [17]. To create the OBS, these diet and lifestyle factors were assigned values of 2 for low levels of exposure for each pro-oxidants or high exposure to anti-oxidants (low-risk), one for intermediate levels of exposure, and zero for high levels of exposure to pro-oxidants and low exposure to anti-oxidants (high-risk). The individual scores for the 13 variables were then combined to obtain the OBS. Higher summary score corresponded to greater oxidative balance; individual's OBSs were categorized as low, intermediate, or high based on tertiles associated with the empirical distribution of the OBSs.
For each variable of interest (specific dietary factors, NSAIDS use, smoking and OBS), we assessed which genes displayed statistically significant differential expression between low and high categories using the Bioconductor package DESeq2 written for the R statistical programming environment. DESeq2 assumes the RNA-seq counts are distributed according to negative binomial distributions. It utilizes generalized linear modeling to test individual null hypotheses of zero log2 fold changes between high and low categories (i.e. no differential expression) for each gene and it employs both an independent-filtering method and the Benjamini and Hochberg [26] procedure to improve power and control the false discovery rate (FDR). For further details regarding DESeq2, see Love et al. [27]. In identifying genes with significant differential expression, an FDR of 0.10 was used.
To help describe the data, we report the average DESeq2-adjusted gene expression levels (adjusted counts) among individuals in the high and low categories of the dietary or lifestyle variables of interest for each differentially expressed gene and include fold change calculations associated with these genes. Included as a descriptive detail rather than reflecting direct DESeq2 output, fold change was calculated as the ratio of a gene's mean expression among individuals in the high category of a dietary or lifestyle variable to its mean expression among individuals in the low category; a fold change greater than one indicates a positive differential expression (i.e. up-regulated) while a fold change between zero and one indicates a negative differential expression (i.e. down-regulated).
To visualize differential gene expressions between individuals in high and low categories of related diet and lifestyle variable groups, we created heat maps. Each heat map features the log2 transformation of the fold changes, calculated as described above, associated with genes identified as significantly differentially expressed between high and low categories of the diet and lifestyle variables considered for the specific heat map. Our heat maps were created using the heatmap.2 program in the 'gplots' package of R (http://cran.r-project.org). Distance between two vectors of log2 transformed fold changes was measured via the Euclidean metric and median linkage was selected for this programs' agglomerative hierarchical clustering algorithm.
Bioinformatics analysis was performed on the list of Ensemble IDs associated with genes identified as differentially expressed with QIAGEN's Ingenuity Pathway Analysis (IPA) [28]. We used genes from Ingenuity Knowledge Base and considered both indirect and direct relationships. Networks were limited to 35 molecules and 25 networks per analysis and included both causal and interaction networks. We included all data sources in our IPA assessment, but did not restrict to species or specific tissue when compiling networks. We applied the Benjamini-Hochberg (B-H) multiple testing correction to assess pathways in IPA.

Results
The majority of study participants were male (Table 1). A large percentage of people never drank alcohol and were not current smokers. The mean age of the population was 64.5 years and 37.3% of the population reported recent use of NSAIDs.
Forty-six genes were differentially expressed between current smokers and non-smokers (S1 Table); the majority of these genes were up-regulated among current smokers. These genes enriched three networks as indicated by the focus molecules (those that are deregulated in our data) and were involved in cell morphology, cellular development, endocrine system development and function, cellular functions and maintenance, lipid metabolism, and vitamin and mineral metabolism ( Table 2). Upstream regulators to which significant number of our deregulated genes among smokers were linked included MAPkinase genes TEK and MAP2K3 and PEBP4 (Fig 1A-1C).
Five genes were differentially expressed between high recent alcohol consumers and nonconsumers of alcohol (Table 3). Six genes were differentially expressed for high beer consumers versus non-consumers, six genes were differentially expressed for high consumers of wine, and four genes were differentially expressed among liquor consumers versus non-consumers. Both up-and down-regulated genes were observed, although overall alcohol consumption and beer consumption were more likely to be associated with upregulated genes and liquor consumption was more likely to be associated with down-regulated genes. As illustrated in Fig 2, there were uniquely de-regulated genes based on specific type of alcohol consumption (p values corresponding to de-regulated genes by alcohol type can be found in Table 3). Beta carotene (one gene), total alpha tocopherol or Vitamin E (four genes) and lutein+ zeaxanthin (five genes) were associated with differentially expressed genes between high and low nutrient intakes. Vitamin C and lycopene intake were not associated with any differentially expressed genes when the FDR was controlled at <0.1 (Table 4). With few exceptions high intake of antioxidants and carotenoids were associated with higher gene expression than low nutrient intake. Genes differentially expressed by high intake versus low intake of dietary antioxidants and carotenoids are shown in Fig 3. Three genes were significantly differentially expressed based on recent NSAIDs (Table 4), while three genes showed differential expression between high and low oxidative balance score ( Table 4). Two of the three genes associated with NSAIDs were up-regulated among users while all of these genes associated with OBS were upregulated among those with a higher OBS (lower oxidative stress).

Discussion
We observed that several genes were differentially expressed between high and low categories of diet and lifestyle variables. Genes that were differentially expressed among recent cigarette  Genes in bold were upregulated in our data while genes underlined were down-regulated in our dataset smokers and alcohol consumers were usually down-regulated. Likewise, individuals with greater intakes of antioxidants and carotenoids were more likely to have down-regulated genes compared to low intakes of these nutrients. We observed that specific type of alcohol consumed influenced specific genes, with virtually no overlap between differentially expressed genes by alcohol type, suggesting that different types of alcohol have specific biological actions beyond that of alcohol in general.
In assessing the data, it is important to keep in mind several features of gene expression data. First, gene expression profile is relevant to current exposure of diet and lifestyle variables. Thus, we assessed current smokers, recent consumers of alcohol, current NSAID users, and diet close to the time of diagnosis, when the tissue would have been biopsied. Thus this represents gene expression at the time of diagnosis. For some variables such as NSAID or specific dietary factors, recent use may have been too far removed to accurately correlate with gene expression. This could explain why few gene were de-regulated for NSAIDs specifically or for dietary antioxidants and alcohol. While lack of findings could indicate too dissonant timing between exposure and tissue sample ascertainment, finding associations would imply that the exposure is recent enough to alter the expression. It is also important to note that the data being analyzed are gene expression data and does not reflect protein expression. Additionally, we have utilized colonic non-tumor tissue, so genes would have to be expressed in colon tissue for detection. Thus, these diet and lifestyle factors could influence other genes in other tissue sources. Utilizing colonic tissue that is located close to a tumor could alter gene expression, however, it would alter it is a manner that was not dependent on diet or lifestyle exposure.
We focused on networks where several of the genes in our data were de-regulated based on lifestyle factors that could result in oxidative imbalance or stress. We utilized IPA to identify networks and upstream regulators to gain insight into biological mechanisms associated with the observed gene expression changes for recent cigarette smoking since that exposure was  associated with enough de-regulated genes to assess networks. To control for type I errors we set the estimated FDR to 0.1 to generate more complete networks. Causal networks in IPA are based on the Ingenuity Knowledge Base, which is contains information from the literature as well as from other databases [29]. IPA defines upstream regulators as any gene or small molecular that has been observed to affect gene expression, either directly or indirectly [29]. Up-stream regulators are defined as those that are likely connected to our dataset genes through a set of direct or indirect relationships. The causal network analysis in IPA builds on the upstream regulator analysis and builds a more comprehensive picture of possible root causes linked to the gene expression profiles. Causal network analysis includes intermediaries which are intermediate regulators that link de-regulated genes to upstream regulators. In our data, de-regulated genes were generally linked to the upstream regulator through intermediaries. While there were several networks in which genes in our data were involved, we focused on those networks that contained several genes that were importantly associated with the upstream regulators. Recent cigarette smoking was the only lifestyle factor evaluated that had a sufficient number of genes de-regulated to evaluate in IPA. In our data, genes de-regulated by recent cigarette smoking were linked to causal networks involving TEK (endothelial tyrosine kinase also known as TIE2), MAP2K3, and PNBP4. The literature supports up-regulation of these pathways by cigarette smoking [30][31][32]. The angiogenic growth factors ANG1 and ANG2 act through TEK. It is involved in vascular integrity and is mainly expressed in endothelial cells. TEK has been shown to be down-regulated among smokers and individuals with chronic obstructive pulmonary disease [30]. Using IPA and the upregulated genes identified with an FDR of <0.1 the TEK network was significantly enriched and TEK was predicted to be up-regulated. However in our data, current smokers had lower levels of TEK expression than never smokers (0.55 fold change; unadjusted p value of 0.046) as previously reported. Since TEK was not significant using an FDR <0.1 it was not uploaded into IPA or identified as significantly differentially expressed. It is likely that other intervening features mediated the predicted and observed association. Few studies have evaluated smoking and gene expression. One study conducted by Nielsen and colleagues studied 28 individuals with Crohn's disease who smoked and 14 who did not [33]. They observed three differentially expressed genes after replication by RT PCR, RNF138, MT2A, and STEAP3. While these genes were not differentially expressed in our sample of individuals without Crohn's Disease, we did observe that STEAP3 was associated with recent aspirin/NSAID use. Oxidative stress can lead to activation of several MAPK signaling cascades [34]. MAP2K3 is activated by cytokines and environmental stress and is involved in regulation of p38. MAPK pathways have been linked to cigarette smoking [35]. Our data support the literature that suggest MAPK signaling being influenced by tobacco given the number of up-regulated genes in the MAP2K3 network. Many of the genes in the TEK and MAP2K3 network were also central to the PNBP4 network. Given the heavy overlap in number of enriched genes in these networks supports a strong role of cigarette smoking influencing MAPK signaling and inflammation-related genes.
Few genes differentially expressed by overall alcohol consumption and by subtype. Only one gene was differentially expressed in more than one category; CACNA1C was up-regulated among those who consumed high levels of liquor as well as those who had high overall alcohol levels. Disrupted genes varied by type of alcohol consumed, however there were similarities in function for specific types of alcohol. For instance, two of the five genes de-regulated for alcohol were involved in biological mechanisms of calcium and two were involved in MAPK signaling. Similarly, two of the six de-regulated genes influenced by beer consumption were involved in AKT signaling, two were involved in immune response and inflammation and two had functions related to apoptosis. Two of the five genes de-regulated by wine consumption were associated with biological oxidation pathway and response to wound healing as was one of the four deregulated genes associated with liquor consumption.
Previous studies have shown that diet modification can alter gene expression [20]. In controlled setting, beta carotene has been shown to exhibit anti-inflammatory properties by inhibiting NFκB activation [36]. Lutein, a carotenoid with anti-inflammatory properties, has previously been shown to influence cell cycle progression; cells treated with lutein showed increased expression of IGF1R, EGFR, BRCA1, CDK5, KLK14, and PCA3 and decreased expression of GSTP1 and RASSF1 [37]. However, few of the dietary antioxidants and carotenoids we evaluated altered gene expression and did not alter expression of those previously reported. We evaluated over 17,000, and restricted those that were considered differentially expressed based on FDR. Other studies of few genes did not have make these adjustments. In our data lutein+zeaxanthin altered five genes, vitamin E altered expression of four genes, and beta carotene altered expression of one gene. The genes that were altered were unique by antioxidants and carotenoids and were involved in biologically relevant functions including cell cycle processes, transcription factors response to hypoxia, inflammation, metabolic and apoptotic processes.
We also assessed the OBS, a composite of the individual variables assessed and only observed altered expression in three genes which were associated with cell cycle process, transcription and growth factor activity. The OBS has previously been associated with reduced risk of colon cancer [17].
We used RNAseq to determine differential gene expression by lifestyle factor given its widespread use, high level of repeatability, the cost, and amount of sample required to generate the data. As RNAseq produces global gene expression data for each RNA sample, it is an ideal method to undertake a discovery study as we have done here [38,39]. However, it is recognized that different platforms carry different technical weaknesses that can effect cross platform replication such as library construction methods and number of amplification cycles required for data generation. Comparative studies estimate that around 50% of results may replicate between platforms [40,41]. Given the number of associations identified and the limited amount of sample available from the FFPE tissue, it is not reasonable or feasible to replicate all findings. To obtain a crude estimate of replication by qPCR methods, we analyzed data from two markers, HELZ and DUT. All results showed the same direction of differential expression as observed with RNAseq, however the magnitude and level of significance varied by test. DUT was significantly associated with smoking (p = 0.01), HELZ was of marginal significance with alcohol (p = 0.08), and although smokers had lower HELZ expression than never smokers as with RNAseq, the difference was not statistically significant (p = 0.34). Thus, it is essential to validate these findings in other populations using the methods we used here as well as other platforms to better understand associations between lifestyle factors and gene expression.
Our data support the hypothesis that levels of diet and lifestyle factors associated with oxidative balance and stress alter gene expression profiles. Cigarette smoking had the most influence on expression levels. Many of the pathways and networks associated with de-regulated genes involved cell cycle and oxidative stress. Also of interest is the observation that different types of alcohol affected expression of different genes. Confirmation of these findings by other similarly designed studies is needed.
Supporting Information S1 Table. Summary of genes showing differential expression at the FDR level of 0.1 with recent cigarette smoking. (DOCX)