A Pilot Proteogenomic Study with Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung Adenocarcinoma

We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.


Introduction
Recent developments in liquid chromatography and high-resolution mass spectrometry (MS) have allowed for highly precise, comprehensive characterization of proteomes [1]. Proteogenomics, which combines discovery proteomics with genomic technologies, has emerged as an important tool in the understanding of the complexities of the tumor genome and how they are integrated into a functional proteome [2]. Groups such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) are taking advantage of these advances and setting the bar high with thorough proteogenomic characterizations of a large numbers of tumor samples (i.e., cohorts of~100 patients) [3]. There are several computational challenges to performing proteogenomic studies [4]. New tools and pipelines have been developed to tackle these problems, but a one-size-fits-all solution remains elusive [5][6][7][8][9].
Sample throughput is an issue in MS-based discovery proteomics because the necessary biological and technical replicates easily generate dozens of samples. These numbers can expand significantly when analyzing large groups of patient samples in order to develop specific biomarkers or molecular classification schemes, and these numbers can expand even further if samples are fractionated using chromatographic methods in order to increase the depth of proteome coverage. One solution to this increasing sample number problem is to utilize multiplexing methods with isobaric chemical labeling reagents [e.g., tandem mass tag (TMT) or isobaric tags for relative and absolute quantitation (iTRAQ)]. These methods provide a means for parallelization by running many samples simultaneously in a single mass spectrometer [10].
Proteomic methods are often being used to comprehend the underlying biology of lung cancer and to identify potential biomarkers [11], and proteomic profiling is showing promise for clinical use [12,13]. Proteomic comparisons of two prominent non-small cell lung cancer (NSCLC) subtypes, adenocarcinoma (ADC) and squamous cell carcinoma (SCC), have been previously reported using label-free quantification and SILAC [14][15][16][17], and some proteogenomic studies of ADC and SCC have also been undertaken [18,19]. Our first goal was to understand how a TMT-based proteogenomics study would perform in our hands experimentally and to determine whether special considerations would arise while analyzing TMT-derived data. If our results were promising, then our next goal was to use a TMT-based proteogenomic approach, moving forward with a larger study with more patient samples. Here, we undertook a pilot proteogenomic study of three ADC and three SCC tissue samples using 6-plex TMT and a custom Affymetrix microarray [Gene Expression Omnibus (GEO) no. GPL15048]. Max-Quant was used to identify peptides and to extract and quantify reporter ion intensities for comparison of ADC and SCC tissues [20]. We then compared our results to previously published studies that used different discovery proteomic strategies in order to identify prevalent candidate biomarkers (see Table 1).

Sample Preparation for Liquid Chromatography-Mass Spectrometry
This study was approved by Liberty IRB Inc., an independent review board company, under Institutional Review Board approval number 12.11.0023. Tissue samples for this analysis were from patients who had provided prospective written informed consent to be included in Moffitt's Total Cancer Care 1 (TCC) institutional protocol. Quantitative proteomic analysis was performed on tumor tissues isolated from six individual patients: three diagnosed with lung ADC and three with SCC (S1 Table). To aid tissue lysis, frozen samples were macerated in a tissue pulverizer. The pulverizer was then rinsed with 2 mL water, and each sample transferred to a 2-mL Eppendorf tube to lyse erythrocytes. Samples were centrifuged at 5,000g for 2 min, and the supernatant was removed. Lysis buffer (1.0 mL of 50 mM HEPES, pH 8.5, supplemented with 2% SDS) was added, and the samples were incubated for 20 min at room temperature. The samples were then heated for 5 min at 99°C and sonicated (Covaris) to reduce viscosity. After centrifugation at 16,000g for 15 min, the supernatant was collected. The protein amount in each sample was determined by using the bicinchoninic protein assay (Pierce Biotechnology/ ThermoFisher Scientific, Waltham, MA) followed by an adaptation of the filter-aided sample preparation method [21,22]. Three samples each of ADC and SCC (100 μg of lung tissue lysate) were reduced with 100 mM dithiothreitol at 99°C for 5 min and transferred into VIVACON 500 filter units (Vivaproducts Inc., Littleton, MA). SDS-containing buffer was removed from the sample by centrifugation and exchanged with 8 M urea in 100 mM Tris-HCl buffer. Proteins were alkylated with 50 mM iodoacetamide and washed with 50 mM triethyl ammonium bicarbonate. Porcine trypsin (Promega Corp., Madison, WI) was used to digest the proteins in an enzyme-to-protein ratio of 1:100 (wt/wt).
For relative protein quantitation, the six samples were separately derivatized with 6-plex TMT reagents (ThermoFisher Scientific), according to the instructions provided by the manufacturer. The ADC and SCC samples were labeled with TMT 126, 127, 128 and TMT 129, 130, 131, respectively. The labeled tryptic digests were pooled and concentrated by solid-phase extraction (MacroSpin columns 30-300 μg capacity; The Nest Group Inc., Southborough, MA). Samples were basified with 20 mM ammonium formate prior to injection onto a reversed-phase column (150×2.0 mm Gemini 1 NX-C18 3 μm 110Å; Phenomenex, Torrance, CA) using an Agilent 1200 series high-performance liquid chromatography (HPLC; Agilent Biotechnologies, Palo Alto, CA) with ultraviolet detection at 214 nm with solvent A (20 mM ammonium formate, pH 10, in 5% acetonitrile) and solvent B (20 mM ammonium formate, pH 10, in 90% acetonitrile). Peptides were separated at a flow rate of 100 μL/min and eluted from the column with a linear gradient from 0% to 70% solvent B. Seventy-two time-based fractions were collected into a 96-well plate, with the first 18 and last 7 fractions pooled into 3 and together with all other fractions transferred to a total of 50 HPLC vials. Samples were acidified with 5% formic acid, organic solvent was removed in a vacuum concentrator at 45°C, and peptides were resolubilized and diluted in 100 to 400 μL of 5% formic acid, depending on the intensities of the individual ultraviolet traces [23]. Individual fractions were analyzed at pH 2.4 on an Agilent 1200 nano-HPLC system (Agilent Biotechnologies, Palo Alto, CA) coupled to a hybrid linear trap quadrupole Orbitrap Velos mass spectrometer (ThermoFisher Scientific) utilizing the Xcalibur software version 2.1 for data acquisition. Single fractions were loaded onto a trap column (Zorbax 300SB-C18 5 μm, 5 × 0.3 mm; Agilent Biotechnologies, Palo Alto, CA) with a binary pump at a flow rate of 45 μL/min. Loading and washing solvents were composed of 0.1% trifluoroacetic acid in water (solvent A) and 0.1% trifluoroacetic acid in 70% methanol and 20% isopropanol (solvent B). The peptides were eluted by back-flushing from the trap column onto a 16-cm fused silica analytical column with an inner diameter of 50 μm packed with C18 reversed-phase material (ReproSil-Pur 120 C18-AQ, 3 μm, Dr. Maisch GmbH, Ammerbuch-Entringen, Germany). LC-MS solvents were composed of 0.4% formic acid in water (solvent A) and 0.4% formic acid in 70% methanol and 20% isopropanol (solvent B). Elution was achieved with a 27-min gradient ranging from 3% to 30% solvent B, followed by a 25-min gradient from 30% to 70% solvent B and, finally, a 7-min gradient from 70% to 100% solvent B at a constant flow rate of 100 nL/min. The analysis was performed in data-dependent acquisition mode. The 10 most intense ions were isolated and fragmented by higher-energy collision-induced dissociation (HCD) for peptide identification and relative quantitation of TMT reporter ions. Dynamic exclusion for selected ions was 60 sec and a single lock mass at m/z 445.120024 (Si(CH 3 ) 2 O) 6 ) was used for internal mass calibration. [24] The maximally allowed ion accumulation time was set to 500 and 200 ms for MS 1 and MS 2 scans, respectively. Overfilling of ion traps was prevented by automatic gain control set to 10 6 ions for a full Fourier transform MS scan and 5×10 5 ions for MS 2 HCD scans. Intact peptides were detected in the Orbitrap mass analyzer at a resolution of 30,000 with a signal threshold of 2,000 counts for triggering an MS/MS event. HCD-MS 2 spectra were acquired with 1 microscan at a resolution of 7,500. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD002622 [25].

MaxQuant
Raw files from the Kikuchi et al. study were obtained directly from the authors, and raw files from the Li et al. study were obtained from ProteomeXchange (http://www.proteomexchange. org/) [25]. Raw files were grouped and analyzed by study using MaxQuant version 1.5.1.2. Peaks were searched against the UniProt human database (20,193 sequences; released August 2014; http://www.uniprot.org) with the Andromeda search engine [26]. When possible, all raw files were processed using similar parameters; however, workflow-specific changes were needed: the TMT data were processed by setting the "type" parameter to "reporter ion" and selecting the relevant TMT 6-plex settings. At least seven amino acids per peptide were required, and as many as two missed cleavages were allowed. A false discovery rate of 0.01 was used for both peptides and proteins. The "match between runs" option was selected using a time window of 4 min. N-terminal acetylations or methionine oxidations were both modifications allowed in protein quantification. MaxQuant results were manually spot-checked using the MS browser, OpenChrom [27].

Affymetrix Gene Microarray
Tissues were processed and assessed for RNA quality according to the TCC protocol prior to expression analysis. The six lung tumor samples were profiled using a custom Affymetrix Gen-eChip that measures the expression of 60,607 distinct transcripts (GEO GPL15048). All probes with standard deviation > 1 (11,008 or 18% of total probes) were used for the gene expression heat map.

Peptide and Protein Data Processing
Peptide intensities were extracted from MaxQuant output and input into Libaffy [28]. Libaffy consists of a set of routines for accessing the various file types and post-processing them using a variety of algorithms, including iterative rank-order normalization (IRON) [28][29][30]. Peptide intensities were normalized using IRON with proteomic parameters and then input into R/RStudio with corresponding metadata [31,32]. Peptides were filtered for posterior error probability > 0.1, reverse sequences, non-human contaminant peptides, and missingness. Peptides were excluded from further analysis if there were more missing values than the larger of the two NSCLC sample sizes. Remaining peptide intensities were log 2 -transformed and mapped to possible protein matches.
In the event that peptides mapped to multiple proteins, called the protein inference problem, MaxQuant will choose the protein from a set of peptides based on which protein has the most matching peptides and would thus be perceived to be the most abundant [33]. Constituent peptide intensities are then summed in MaxQuant to give the resulting protein intensity. The major drawback to this approach is that the next most abundant protein or proteins, even if they are very close to the abundance of the first protein, will not be included in the results. Additionally, the authors of MaxQuant point out that selecting the protein with the most peptides might not always be the best decision [34]. Furthermore, the process of summing the peptide intensities to obtain the protein intensity can artificially lower the final protein intensity in the case of missing data, and this can be a pervasive issue since proteomic data are rife with missingness [35]. We implemented a more liberal peptide-to-protein mapping methodology to address these issues. We began by mapping peptides to all possible proteins, and we then took the weighted mean (also known as Tukey's biweight; implemented in the R Bioconductor package "affy") to obtain a protein intensity that is more robust than the summation of intensities [36][37][38]. The weighted mean penalizes outlier values in its calculation so it will more closely reflect the true intensity values for a particular protein. A Welch two-sample T-test was used on the resulting protein intensities to ascertain differentially expressed proteins between ADC and SCC.
We generated mapping tables using data from International Protein Index, UniProt, and Genbank to match the microarray probes to proteins. If a protein had multiple matching probes, then we chose the probe with the highest average intensity to use as the gene expression. Because our goal was to compare the gene expression to the protein expression, we filtered to only include results that were in both the protein expression dataset and the gene expression dataset.

Statistical Analyses
Concordance of gene and protein expression was measured using both Pearson correlation (herein referred to as R) and Spearman correlation (herein referred to as ρ). Clustering was performed using Pearson correlation as a similarity measure and complete linkage to create the dendrogram for the heat maps. Welch two-sample, two-sided T-tests were used to compare expression of genes and proteins. In general, a significance level of 0.05 was used for statistical testing, and we reported the P value or significance level any time a statistical test was performed. Unless otherwise noted, P values were not adjusted for multiple hypothesis testing.
We used KM Plotter (http://www.kmplot.com) to perform survival analysis of MCT1 and GLUT1 in ADC and SCC [39]. This tool facilitates combined survival analysis across multiple microarray datasets including The Cancer Genome Atlas (TCGA), GEO, and caArray. Data was normalized using the MAS5 algorithm and a second scaling normalization was used to reduce batch effects. Only probes that are in common across all datasets were used. KM Plotter provides a user-friendly interface to the R programming language ("survival" Bioconductor package) to perform univariate and multivariate Cox regression (including estimating hazard ratios), generate Kaplan-Meier curves, and calculate logrank P values. KM Plotter was accessed March 2015 for the univariate analyses and October 2015 for the multivariate analyses.

Quantitative Proteomics Identifies Differences Between NSCLC Subtypes
Quantitative proteomic analysis was performed on three lung ADC and three lung SCC tumor samples. MaxQuant reporter ion intensities for peptide expression were compared across samples, and on average we were able to identify 27.24% of the MS/MS spectra across the sample fractions or 51,001 unique peptides with posterior error probability < 0.1 and false discovery rate < 0.01. We then used Tukey's biweight to combine these peptides into 7,241 unique proteins (see Methods for details). Unsupervised, hierarchical clustering of these proteins identified the two histologic subtypes (ADC and SCC; Fig 1A). We found 279 of 7,241 (5%) proteins to be differentially expressed between ADC and SCC (P < 0.05 and fold-change > 1.5 or < 0.6667). Differentially expressed proteins included p63, cytokeratin 5 and cytokeratin 6, and PKP1, which are consistent with the SCC phenotype (S2 Table) [40][41][42][43]. Enriched pathways (S3 Table) were identified using GeneGO Metacore and included cytoskeleton remodeling and cell adhesion pathways. These pathways contained p63, cytokeratins, and PKP1, again consistent with the SCC subtype.

Comparisons of Labeled and Label-Free Proteomic Studies of NSCLC
We then compared our quantitative results to published proteomic studies of ADC and SCC. To reduce analytical variability in our cross-study comparison, we used MaxQuant and the same analysis pipeline to process publically available ADC and SCC lung tissue data from Kikuchi 18,198 unique peptides. There were 6,372 peptides in common between the three datasets (Fig 2A). Peptides that overlapped between our results and the other studies showed low correlation (Fig 3A and 3B). Our data had 395 missing reporter ion intensities (0.1% of all peptide intensities from this     Fig 2C). Noticeably, the Kikuchi et al. data had 6.46 times more unique differentially expressed proteins than our data. A majority of the overlapping, differentially expressed proteins also shared the same direction of change ( Fig  2D). Nine differentially expressed proteins were observed in all three datasets (Table 2), and seven of these nine differentially expressed proteins were more than two-fold higher in SCC compared to ADC. LMO7, a zinc-binding protein thought to have tumor suppressor activity in lung cancer, was the only commonly differentially expressed protein higher in ADC [44]. Both MCT1 (a lactate transporter; SLC16A1) and GLUT1 (a glucose transporter; SLC2A1) were significantly higher in SCC at the protein level (P < 0.05).

Combining Differentially Expressed Proteins Across Studies Yields a Number of Significantly Enriched Pathways in ADC and SCC
Differentially expressed proteins from each study were simultaneously searched against Gen-eGO (Table 3, S4 Table). Eighty-four pathways were found enriched in SCC and 44 were enriched in ADC (P < 0.01). We found Wnt-related pathways enriched in the SCC subtype, consistent with previous findings [41]. Proteins related to glycolysis, including MCT1, were downregulated when Wnt signaling was inhibited in colon cancer cells [45], so the enrichment of Wnt signaling and the expression of glycolytic enzymes like MCT1 and GLUT1 may be similarly linked in SCC. We also found a glutathione metabolism pathway enriched in SCC (S4 Table). This is intriguing since there are known links between glutathione and cancer as well as drug resistance in lung cancer [46,47], and it has been shown that inhibiting MCT1 disables glutathione synthesis and glycolysis in cancer cells [48].

Gene Expression Recapitulates Known Differences Between NSCLC Subtypes and Demonstrates Minor Protein-Gene Expression Correlation
An Affymetrix microarray was used to characterize the same cohort of 3 ADC tissue and 3 SCC tissues from our quantitative proteomic study (S5 Table). We first clustered gene expression to determine whether it could recapitulate the ADC and SCC subtypes in a small sample set ( Fig 1B). We then identified 6,373 genes with matching protein expression for further analysis. The pairwise correlation between all protein expressions and gene expressions was low (R = 0.13, P < 2.2E-16; ρ = 0.16, P < 2.2E-16) (Fig 4A), but correlations grouped by protein and corresponding gene were higher by Pearson's method (mean R = 0.34, Fig 4B) and Spearman's method (mean ρ = 0.31). These results are comparable with the recent CPTAC colon cancer proteogenomic study (ρ = 0.23) as well as results from Wilhelm et al. (ρ = 0.31 to 0.56) [3,49]. We found 980 gene/protein observations to be significantly correlated by Pearson's method (P < 0.05 and R > 0.5; Table 4) and 582 by Spearman's method (P < 0.05 and  Pilot Proteogenomics and Data Analysis of NSCLC ρ > 0.5). Of the 6,373 matched genes, 629 were differentially expressed (P < 0.05) and included genes consistent with the ADC and SCC subtypes (e.g., p63 expression, cytokeratin expression). Similarly, 493 of the 6,373 matched proteins were significant (P < 0.05). Looking at the intersection of the datasets, 119 were differentially expressed at both the gene and protein levels. As expected, the mean correlation between these 119 significant gene and protein pairs was higher (R = 0.79; ρ = 0.76), and 113 pairs (95%) changed in the same direction between ADC and SCC. Six of the nine commonly differentially expressed proteins in Table 2 were also differentially expressed at the gene expression level. This included both the MCT1 gene expression that was 9.75 times higher in SCC than in ADC (P = 0.06; S5 Table) and the GLUT1 gene expression that was 4.78 times higher in SCC compared to ADC (P = 0.02; S5 Table).
Interestingly, Eilertsen et al. found MCT1 to be an independent prognostic marker for survival in NSCLC patients [53]. Furthermore, this same study shows that high expression of both MCT1 and GLUT1 has a negative impact on survival in NSCLC patients. Eilertsen's patient cohort included both ADC and SCC, and, although a majority of patients had the SCC subtype, it is unclear whether this finding would hold for just one subtype. To address this, we performed survival analysis to further investigate the subtype-specific role of MCT1 and GLUT1 in NSCLC. Intriguingly, high expression of MCT1 was not associated with poor survival in SCC patients (P = 0.  Fig  5D). Our findings suggest that MCT1 and GLUT1 are both commonly overexpressed at the gene and protein levels in SCC and thus not prognostic. However, they may serve as potential histological markers or targets for this subtype since they are overexpressed in SCC tumors compared to normal tissue. On the other hand, there appears to be a subpopulation of ADC samples with overexpression of MCT1 and GLUT1, and we suggest that MCT1 and GLUT1 could serve as prognostic markers in the ADC subtype since their overexpression is associated with poor survival. MCT1 and GLUT1 were associated with poor survival in ADC (MCT1 P = 0.0016; GLUT1 P = 0.0005) even after accounting for stage, gender, and smoking history in  Table). We note that an MCT1 inhibitor is being evaluated for its efficacy in small cell lung cancer in the United Kingdom [57].

Discussion
ADC and SCC are very heterogeneous diseases, and this statement is emphasized by the existence of several subtypes within ADC and SCC [58,59]. Making sense of the variations between ADC and SCC can be a difficult task, but alterations in the genomes of these cancers ultimately get integrated and produce a cancer proteome that can be analyzed using modern proteomic tools. However, it is already known that the agreement between ADC gene expression and protein expression is weak, underscoring the need for both genomic and proteomic tools in order to identify differences and ultimately biomarkers and drug targets in NSCLC [60,61]. Here, we undertook a pilot proteogenomic to understand how TMT would perform in our hands experimentally as well as whether special considerations would arise while analyzing TMT-derived data. Protein expression from our experiments shared some overlap with the gene expression, and both protein and gene expression were able to group tissues by ADC and SCC histology (Fig 1). Our gene-protein correlation was similar to previously published results [3,49], and our findings were better than a previous study that had shown poor agreement of gene and protein expression in ADC (ρ = −0.025) [60].
We then compared our results to previously published studies. We reduced variability in our analysis by processing all raw discovery proteomics data through MaxQuant. However, experimental differences, including patient sample differences, were sources of uncontrollable variability. The number of samples and sample fractions, as well as the LC-MS gradients and MS analysis time, certainly influenced the number of identified peptides from each study. For example, we divided our samples into 50 fractions in order to minimize any co-isolation events during precursor ion isolation prior to MS 2 (Table 1), and we believe that this large fractionation directly led to the large number of uniquely identified peptides observed in our study (  (Table 1).
Despite the differences in the content of the discovery proteomics experiments, there was a benefit in having these previous results to compare and analyze with our own. We were able to identify nine differentially expressed proteins in common with the other datasets that also had the same increase or decrease in expression relative to ADC and SCC (Fig 2C; Table 2), and we were able to identify a number of pathways by simultaneously searching all differentially expressed proteins from each study against GeneGO (Table 3, S4 Table). Of the nine common differentially expressed proteins, four were known to be highly expressed in SCC: three isoforms of keratin 6 as well as PKP1 [40][41][42][43]. LMO7 was the only commonly differentially expressed protein higher in ADC. CO7A1, a collagen, and ABCF3, an ATP-binding cassette family member, were also consistently differentially expressed. However, there does not appear to be an obvious link between these two proteins and the ADC or SCC subtypes, and we note that the ABCF3 expression in our dataset was only modestly higher in our TMT SCC results (0.73 log 2 fold-change).
The most interesting of the common differentially expressed proteins were the lactate transporter MCT1 and glucose transporter GLUT1. Because cancer cells rely on aerobic glycolysis and produce excess lactate via the Warburg effect, we wished to understand how these transporters behaved in a larger cohort as well as if they would be differentially expressed compared to normal tissues. Data analysis of publically available datasets (Fig 5A and 5B) showed that MCT1 and GLUT1 were constitutively overexpressed in SCC and therefore may make promising drug targets or histological markers for this subtype. Survival analysis of publically available datasets (Fig 5C and 5D) showed that higher expression of MCT1 and GLUT1 are associated with poor survival in ADC patients and therefore may make promising prognostic markers. Our findings suggest a larger role for MCT1 and GLUT1 in ADC and SCC than what is currently understood, and further work should be undertaken to elucidate their function in these NSCLC subtypes.

Conclusions
These results suggest that our approach using quantitative TMT is a valid approach for future and more expansive proteogenomic studies of cancer. Further, recent studies have found that TMTs were more sensitive than iTRAQ [62]. Additional improvements in throughput can be gained by reducing the number of fractions since other studies have reported large inventories with less fractionation and instrument time [3,63,64]. Finally, switching from microarrays to next generation sequencing technology (e.g., RNA sequencing) will enable the generation of personalized peptide libraries that are likely to increase proteome coverage through variant detection [65]. Improvements in genome and proteome technology, when coupled with high quality cohorts with clinical data and well annotated tumor tissues, are likely to enable new discoveries on proteins and pathways important in cancer.