Circulating MicroRNAs as Non-Invasive Biomarkers for Early Detection of Non-Small-Cell Lung Cancer

Background: Detection of lung cancer at an early stage by sensitive screening tests could be an important strategy for improving prognosis. Our objective was to identify a panel of circulating microRNAs in plasma that will contribute to early detection of lung cancer. Material and Methods: Plasma samples from 100 early stage (I to IIIA) non-small-cell lung cancer (NSCLC) patients and 100 non-cancer controls were screened for 754 circulating microRNAs via qRT-PCR, using TaqMan MicroRNA Arrays. Logistic regression with a lasso penalty was used to select a panel of microRNAs that discriminate between cases and controls. Internal validation of model discrimination was conducted by calculating the bootstrap optimism-corrected AUC for the selected model. Results: We identified a panel of 24 microRNAs with optimum classification performance. The combination of these 24 microRNAs alone could discriminate lung cancer cases from non-cancer controls with an AUC of 0.92 (95% CI: 0.87-0.95). This classification improved to an AUC of 0.94 (95% CI: 0.90-0.97) following addition of sex, age and smoking status to the model. Internal validation of the model suggests that the discriminatory power of the panel will be high when applied to independent samples with a corrected AUC of 0.78 for the 24-miRNA panel alone. Conclusion: Our 24-microRNA predictor improves lung cancer prediction beyond that of known risk factors.


The mean follow-up time was 2.48 years. During this follow-up period, 64 of the 100 cases died and 36 were still alive at censoring.

Micro-RNA profiles in plasma
To evaluate whether specific miRNA signatures are detectable in plasma samples of patients with early stages of NSCLC, we performed a high-throughput screen of 754 miRNAs using TaqMan Human MicroRNA Arrays. On average, 235 and 115 miRNAs were detected on Card A and Card B, respectively (present in at least 80 out of 200 samples in either cases or controls) (Fig 1). The miRNA expression profiles in plasma were quantile normalized and subjected to differential expression analyses between NSCLC samples and controls. Of the 350 miRNAs detected in plasma, 61 were significantly differentially expressed between lung cancer cases and controls (35 on Card A, 26 on Card B), including 33 upregulated and 28 downregulated miRNAs (p-value < 0.05). Of these, 21 miRNAs remained significantly differentially expressed after adjustment of p-values for multiple testing (Table 2). Despite the

Prediction models
To select a panel of miRNAs discriminating between cases and controls, a logistic regression model with a lasso penalty was fitted. This analysis identified a panel of 24 plasma miRNAs with optimum classification performance (S2 and S3 Figs). Table 3 shows the complete list of selected miRNAs within the panel. Multiple logistic regression analysis showed that the combination of the 24 miRNAs alone could discriminate lung cancer cases from controls with a very high AUC of 0.92 (95%CI: 0.87-0.95). In the logistic model including the 24-miRNA panel and adjusted for the main lung cancer risk factors, namely sex, age at interview and smoking status, the classification improved to an AUC of 0.94 (95%CI: 0.90-0.97) (Fig 2). The improvement in discrimination attributable to the 24-miRNA panel is substantial compared with that of the main lung cancer risk factors alone: the AUCs for the full logistic models with and without the 24-miRNA panel were 0.94 (95%CI: 0.90-0.97) and 0.72 (95%CI: 0.65-0.78), respectively (Fig 3). Internal validation of the selected 24-miRNA panel model (i.e., validation accounting for variability due to parameter estimation) using the bootstrap optimism-corrected AUC suggests that the discriminatory power of the panel will be high when applied to independent samples (corrected AUC of 0.86 for the 24-miRNA panel alone, 0.87 for the model including sex, age and smoking status and 0.89 for the model with the pack-years variable). The bootstrap optimism-corrected AUC that took into account the entire miRNA classifier selection process (penalized lasso logistic regression) was 0.78.
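The way a lasso penalty selects a panel can be illustrated with a minimal sketch. It is not the code used in this study: the function names (soft_threshold, fit_lasso_logistic), the proximal-gradient solver, the step size and the toy data are all assumptions made for illustration, and tuning of the penalty by cross-validation is not shown. The sketch minimises the mean logistic loss plus a penalty lam times the sum of absolute coefficients; coefficients whose gradient never exceeds the penalty stay exactly at zero, which is what drops uninformative miRNAs from the panel.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimise mean logistic loss + lam * sum(|beta_j|) by proximal gradient descent.

    X: list of samples, each a list of feature values (ideally standardised).
    y: list of 0/1 labels. The intercept is not penalised.
    The step size must be small enough for the data at hand (assumption: 0.1
    is adequate for a handful of roughly unit-scale features).
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Gradient of the mean logistic loss at the current parameters.
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            err = sigmoid(z) - label
            grad0 += err / n
            for j in range(p):
                grad[j] += err * row[j] / n
        # Plain gradient step for the intercept, soft-thresholded step for the rest.
        intercept -= step * grad0
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


# Hypothetical toy data: feature 0 separates the classes, feature 1 is weak noise.
X_toy = [[-2.0, 0.3], [-1.0, -0.2], [-1.5, 0.1], [1.0, 0.2], [2.0, -0.3], [1.5, -0.1]]
y_toy = [0, 0, 0, 1, 1, 1]
_, coefs = fit_lasso_logistic(X_toy, y_toy, lam=0.05)
print(coefs)  # the noise feature is expected to be shrunk to zero
```

With a very large penalty every coefficient is shrunk to zero, and as the penalty decreases more features enter the model; in practice the penalty is chosen by cross-validation, as described in the statistical analysis.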

Discussion
Despite extensive advancements of imaging and combined treatment modalities, the 5-year survival rate of lung cancer has improved only marginally over recent decades [2]. A non-invasive, biomarker-driven stratification of early-stage lung cancer could therefore complement LDCT screening and improve therapy management. We have developed a panel of 24 plasma miRNAs capable of discriminating early stage NSCLC cases from controls with a high AUC of 0.92 after screening 754 miRNAs in 100 NSCLC patients and 100 non-cancer individuals. This set of miRNAs was identified through very stringent statistical methods. To our knowledge this is also one of the largest exploratory studies of miRNA in plasma. Tests such as these demonstrate several advantages in the clinical setting including no requirement for invasive sample collection ("liquid biopsy"), low cost when compared to imaging techniques, straightforward laboratory procedures and inclusion of a panel of biomarkers instead of a single miRNA.
To date, several studies have reported miRNA profiles in plasma and serum developed for the diagnosis of NSCLCs [9][10][11][13][14][15][16]. Despite very promising results, these studies have shown a rather small overlap between identified miRNA signatures, and were based on limited sample sizes or pools of samples, which did not allow for assessment of the contribution by a single patient to the genetic pool, or used a candidate miRNA approach for discovery of the miRNA panel. The variation in pre-analytical factors, such as sample preparation procedures, and different normalization strategies makes comparison between studies difficult. The differences in miRNA profiles between serum and plasma [14] may also account for some of the lack of correlation of miRNA expression levels between previous studies. Lastly, the described signatures could differ because of the inherent multitude of miRNA targets and their potential redundancy. Given that our study evaluated the highest number of miRNA profiles, we were able to assess the predictive value of previously reported miRNA signatures in our data. Interestingly, despite the small overlap of miRNAs between predictors and the differences in patient populations, sample processing, extraction protocols and normalization procedures compared with earlier studies [9,10,13], a relatively high predictive value (AUC: 0.68-0.78) of these miRNA panels was observed in our study (S1-S4 Tables, S4 Fig), thus highlighting the potential of circulating miRNAs as biomarkers for lung cancer detection. In our data, the best discrimination between NSCLC cases and controls was found for the 34-miRNA signature reported by Bianchi and colleagues [9], showing an AUC of 0.78 (95%CI: 0.72-0.84).
Lower predictive value was observed when using the signatures of risk (developed in pre-diagnostic samples) and diagnosis (developed in a case-control study) described by Boeri and colleagues [10] and recently validated by the same group [11], showing an AUC of 0.71 (95%CI: 0.65-0.78) and an AUC of 0.70 (95%CI: 0.63-0.76), respectively. Finally, the 10-miRNA panel defined by Chen and colleagues yielded an AUC of 0.68 (95%CI: 0.61-0.74) when fitted using our data [13]. Previous reports raised the question of whether miRNAs found in circulation originate from tumours. In theory, miRNA in plasma or serum can originate from the tumour or from inflammatory host responses. In the absence of miRNA data from tumour tissue in our set of samples, we compared the panel of 24 miRNAs in our predictor to differentially expressed miRNAs from adenocarcinoma (AC) and squamous cell carcinoma (SCC) of the lung obtained via analysis of The Cancer Genome Atlas (TCGA) miRNA sequencing data (https://tcga-data.nci.nih.gov/tcga/). Twelve of the miRNAs included in our predictor were also altered in TCGA data in either histology. However, the direction of association was consistent only for let-7c and miR-218. This suggests that the predictive role of plasma miRNAs is independent of tissue, in line with previous findings of Boeri and colleagues [10].
Recent studies have also suggested that significant variations in abundance of microRNA biomarkers reported in the literature might be a result of the inclusion of haemolyzed samples [17][18][19]. To minimize the effect of haemolysis, the samples used in our study followed standardized protocols and were processed within 2 hours from the time of blood collection. In addition, a QC step was implemented to assess potential haemolysis by evaluation of expression levels of 10 previously reported haemolysis-related miRNAs, namely miR-451, miR-16, miR-15b, miR-486-3p, miR-532-3p, miR-886-5p, miR-636, miR-1255B, RNU48 and miR-92a [17,19], among all miRNAs detected in our study. We did not observe significant differences in the haemolysis-related miRNAs between lung cancer cases and controls in our series except for miR-15b (p-value = 0.024) (S5 Table). However, alterations of miR-15b have been previously reported in tumour tissues [20,21] and miR-15b has been proposed as a serum biomarker for detection of NSCLC [22]. Moreover, none of the haemolysis-related miRNAs were present in our 24-miRNA predictor.
Our study has several noteworthy strengths, including: careful selection of patients with all clinical data available, standardized and uniform processing of blood samples within 2 hours from blood collection, an ultracentrifugation step following thawing to remove cryoprecipitates and cell debris, large sample size, large number of miRNAs analyzed and very rigorous statistical assessment of the miRNA predictor. Known predictors of lung cancer were forced into the models regardless of statistical significance to truly test the incremental value of our 24-miRNA panel (Fig 3).
A limitation is that, due to the case-control design of our study, blood samples were collected at the time of diagnosis. To partially address this constraint we did restrict our analysis to early stage NSCLC. Also our study lacks an external validation series. To tackle this concern we performed internal validation using a bootstrapping method which indicated that the predictive value of the panel will remain high when applied to an independent series of samples.
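The bootstrap internal validation mentioned above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names and the simple centroid-based scoring model (fit_centroid_score) are hypothetical stand-ins for the lasso logistic regression, and the toy data are random. The idea is that the model is refitted on each bootstrap resample, its AUC on the resample is compared with its AUC on the original data, and the mean difference (the optimism) is subtracted from the apparent AUC.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of case/control pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_score(X, y):
    """Stand-in model (NOT the study's lasso): score = x . (mean_cases - mean_controls)."""
    p = len(X[0])
    cases = [row for row, lab in zip(X, y) if lab == 1]
    ctrls = [row for row, lab in zip(X, y) if lab == 0]
    w = [sum(r[j] for r in cases) / len(cases) - sum(r[j] for r in ctrls) / len(ctrls)
         for j in range(p)]
    return lambda row: sum(wj * xj for wj, xj in zip(w, row))


def optimism_corrected_auc(X, y, fit, n_boot=200, seed=0):
    """Return (apparent AUC, apparent AUC minus the mean bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(y)
    model = fit(X, y)
    apparent = auc([model(r) for r in X], y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # both classes are needed to fit and to compute an AUC
            continue
        Xb = [X[i] for i in idx]
        mb = fit(Xb, yb)  # the whole fitting procedure is repeated on the resample
        auc_boot = auc([mb(r) for r in Xb], yb)  # performance on its own training data
        auc_orig = auc([mb(r) for r in X], y)    # performance on the original data
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / len(optimism)


# Hypothetical toy data: 30 samples, 5 pure-noise features, alternating labels.
rng = random.Random(1)
X_toy = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(30)]
y_toy = [i % 2 for i in range(30)]
print(optimism_corrected_auc(X_toy, y_toy, fit_centroid_score, n_boot=50))
```

What is passed as the fitting function determines which source of overfitting is corrected. Refitting only the coefficients of an already selected panel corresponds to the correction for parameter estimation (0.86 for the 24-miRNA panel alone), whereas repeating the variable selection inside each resample corresponds to the correction for the entire selection process (0.78).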
The small size of mature miRNAs and their sequence homology to precursor miRNA require sensitive methods for quantitative analysis. We used the current "gold standard" method for measurement of circulating miRNAs. TaqMan MiRNA Cards use a target-specific, stem-loop reverse transcription primer to address the challenge of the short length [23]. Despite the high accuracy and specificity of the qRT-PCR technique, each miRNA expression level measured can be influenced by both systematic experimental bias and technical variations including differences in sample procurement, stabilization, RNA extraction, and sample differences. As a result, data normalization is critical to minimize "noise", obtain biologically meaningful data and develop miRNA-based biomarkers. In the absence of stably expressed endogenous circulating miRNAs to function as normalization controls, and to verify the sensitivity of our results to the choice of normalization, several normalization strategies were tested in our study, including quantile normalization, rank normalization, geometric mean normalization, normalization to the endogenous U6 snRNA control and normalization to the ath-miR-159a spike-in exogenous control. Based on exploratory data analysis plots, the instability of the endogenous control and the difference in abundance between spike-in and sample miRNAs, quantile normalization was chosen for the analysis. In addition, to reduce confounding from technical variation such as plate-to-plate variation and variation due to purification, we distributed samples such that diagnostic variables were balanced with respect to day of analysis or plate number and randomized within each day and plate.
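For readers unfamiliar with quantile normalization, the following minimal sketch shows the idea. It is a generic illustration and not the implementation used in this study: the function name quantile_normalize is hypothetical, ties are broken by position rather than averaged, and missing (undetermined) Ct values are assumed to have been filtered or imputed beforehand.

```python
def quantile_normalize(samples):
    """Quantile-normalise a list of samples (each a list of Ct values, same length).

    Every sample ends up with the same empirical distribution: the value at rank r
    in each sample is replaced by the mean of the rank-r values across all samples.
    Ties are broken by position, a simplification of averaging tied ranks.
    """
    n_features = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(s[r] for s in sorted_samples) / len(samples)
                 for r in range(n_features)]
    normalized = []
    for s in samples:
        # Indices of this sample's values in ascending order.
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on three miRNAs.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# -> [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Because the method relies only on the ranks within each sample, it does not require a stably expressed endogenous control, which is consistent with the rationale given above for preferring it here.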
In summary, our study demonstrates that the 24-miRNA panel is significantly and independently associated with lung cancer following analysis of a liquid biopsy and adds to lung cancer prediction beyond that contributed by established risk factors. Our study should therefore be seen as an exploratory study providing a strong and highly predictive miRNA panel identified through rigorous statistical approaches for further validation in well-designed large prospective cohorts and screening trials. If the current findings are validated, they are expected to make important contributions to clinical and public health practice and may lead to more efficient lung cancer screening by improving enrolment criteria for identifying those who would benefit from further screening.

Study population
Lung cancer patients and controls were recruited through an IARC case-control study coordinated in Moscow from 2006 to 2012. Cases were incident cancer patients collected from the Russian N.N.Blokhin Cancer Research Centre and Moscow City Clinical Oncology Dispensary serving Moscow and the surrounding regions. Controls were recruited from individuals visiting two Moscow general hospitals for disorders unrelated to lung cancer and to associated risk factors. All study participants provided written informed consent and were interviewed.
Peripheral blood was collected in EDTA tubes at the time of interview and processed as rapidly as possible (generally within 2 hours). For cases, blood draw was performed before surgery and any adjuvant treatment. Plasma samples were isolated by centrifugation of whole blood at 2000 x g for 10 minutes at room temperature. Samples were stored at −80°C. All specimens were obtained in accordance with the Declaration of Helsinki guidelines, and the study was approved by the local Institutional Review Board and the IARC Ethics Committee. A total of 100 lung cancer cases and 100 controls were included (Table 1).

RNA isolation
After thawing, plasma samples were centrifuged at 16,000 x g for 5 minutes to remove cryoprecipitates and cell debris. Total RNA was isolated from 300 μL of plasma using the NucleoSpin miRNA Plasma kit (Macherey-Nagel, Düren, Germany) according to the manufacturer's protocol, with Proteinase K digest, addition of 2 μg of glycogen carrier and DNase digest steps. All samples were spiked with 10 pmol of synthetic Arabidopsis thaliana miR-159a (synthesized by Eurofins MWG Operon, Ebersberg, Germany) to control for variations in the RNA preparation step. Purified RNA was kept at −80°C before being used for reverse transcription.

Profiling by TaqMan Human MicroRNA Arrays
Expression levels of 754 miRNAs (Sanger miRBase v14) were quantified using the TaqMan Human MicroRNA Array A + B Card Set v3.0 (Applied Biosystems, Foster City, CA) as per the manufacturer's instructions (including pre-amplification) (S1 Methods). Quantitative miRNA expression data were acquired by ABI 7900HT SDS software v2.4. Cycle threshold (Ct, the cycle in which there is the first detectable significant increase in fluorescence) values were set using ExpressionSuite software (Applied Biosystems) on the first 60 samples (automatic baseline and threshold) and these thresholds for Ct were used for the remaining series. The TaqMan Human MicroRNA Array experiments are MIAME compliant and have been deposited at the NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo) under accession GSE64591.

Statistical analysis
Descriptive comparisons of study variables between cases and controls used the Chi-squared test for categorical data, and the Student's t-test for continuous data. The data were analysed with the HTqPCR package [24] using R Bioconductor [25]. miRNAs with undetermined Ct values in more than 120 samples were filtered out. Data were quantile normalized. Following normalization, miRNAs whose Ct values had an interquartile range (IQR) of less than 1.5, as well as endogenous controls, were removed from subsequent analysis, and limma analysis was performed to identify differentially regulated miRNAs between cases and controls. With limma, a one-factorial linear model is fitted for each miRNA and the standard errors (SE) are moderated using an empirical Bayes model, resulting in moderated t-statistics for each miRNA [26]. P-values of less than 0.05 were considered statistically significant. We also report adjusted p-values corrected for multiple testing using the Benjamini-Hochberg method to control the false discovery rate. Logistic regression with a lasso penalty (with penalty parameter tuning conducted by 20-fold cross-validation) was used to select a panel of miRNAs for discriminating between cases and controls. Logistic regression models were used to evaluate whether the 24-miRNA panel was associated with lung cancer after adjustment for known risk factors for lung cancer (age, sex and smoking status). The area under the receiver operating characteristic curve (AUC) was calculated to assess the discriminatory power of the model. Internal validation of the selected model was conducted by calculating the bootstrap optimism-corrected AUC. This correction accounts for overfitting of the model parameters for the selected model. Additionally, an optimism-corrected AUC was calculated based on bootstrapping the entire model selection process. This accounts for overfitting in both model selection and parameter estimation.
All analyses were performed by using STATA v11 (STATA, College Station, TX) and R [25]. All presented p-values are two-sided.
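The adjustment of p-values for multiple testing described in this section can be illustrated with a short sketch, assuming the Benjamini-Hochberg step-up procedure, which is the default adjustment in limma-based workflows. The function name benjamini_hochberg and the example p-values are hypothetical and do not come from the study data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four miRNAs.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
# -> [0.02, 0.04, 0.04, 0.02]
```

A miRNA is then reported as significant after correction when its adjusted p-value falls below the chosen threshold (0.05 in this study).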