Unlocking Biomarker Discovery: Large Scale Application of Aptamer Proteomic Technology for Early Detection of Lung Cancer

  • Rachel M. Ostroff,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • William L. Bigbee,

    Affiliation Department of Pathology, University of Pittsburgh School of Medicine, University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America

  • Wilbur Franklin,

    Affiliation University of Colorado Cancer Center, University of Colorado at Denver, Anschutz Medical Campus, Aurora, Colorado, United States of America

  • Larry Gold,

    Affiliations SomaLogic, Boulder, Colorado, United States of America, Department of Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder, Colorado, United States of America

  • Mike Mehan,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • York E. Miller,

    Affiliations University of Colorado Cancer Center, University of Colorado at Denver, Anschutz Medical Campus, Aurora, Colorado, United States of America, Denver Veterans Affairs Medical Center, Denver, Colorado, United States of America

  • Harvey I. Pass,

    Affiliation Langone Medical Center and Cancer Center, New York University School of Medicine, New York, New York, United States of America

  • William N. Rom,

    Affiliation Division of Pulmonary, and Critical Care, and Sleep Medicine, New York University School of Medicine, New York, New York, United States of America

  • Jill M. Siegfried,

    Affiliation Department of Pharmacology and Chemical Biology, University of Pittsburgh School of Medicine, University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America

  • Alex Stewart,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • Jeffrey J. Walker,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • Joel L. Weissfeld,

    Affiliation Department of Epidemiology, University of Pittsburgh Graduate School of Public Health, University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America

  • Stephen Williams,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • Dom Zichi,

    Affiliation SomaLogic, Boulder, Colorado, United States of America

  • Edward N. Brody

    Affiliation SomaLogic, Boulder, Colorado, United States of America


Background

Lung cancer is the leading cause of cancer deaths worldwide. New diagnostics are needed to detect early stage lung cancer because it may be cured with surgery. However, most cases are diagnosed too late for curative surgery. Here we present a comprehensive clinical biomarker study of lung cancer and the first large-scale clinical application of a new aptamer-based proteomic technology to discover blood protein biomarkers in disease.

Methodology/Principal Findings

We conducted a multi-center case-control study in archived serum samples from 1,326 subjects from four independent studies of non-small cell lung cancer (NSCLC) in long-term tobacco-exposed populations. Sera were collected and processed under uniform protocols. Case sera were collected from 291 patients within 8 weeks of the first biopsy-proven lung cancer and prior to tumor removal by surgery. Control sera were collected from 1,035 asymptomatic study participants with ≥10 pack-years of cigarette smoking. We measured 813 proteins in each sample with a new aptamer-based proteomic technology, identified 44 candidate biomarkers, and developed a 12-protein panel (cadherin-1, CD30 ligand, endostatin, HSP90α, LRIG3, MIP-4, pleiotrophin, PRKCI, RGM-C, SCF-sR, sL-selectin, and YES) that discriminates NSCLC from controls with 91% sensitivity and 84% specificity in cross-validated training and 89% sensitivity and 83% specificity in a separate verification set, with similar performance for early and late stage NSCLC.

Conclusions/Significance

This study is a significant advance in clinical proteomics in an area of high unmet clinical need. Our analysis exceeds the breadth and dynamic range of the proteome interrogated in previously published clinical studies of broad serum proteome profiling platforms, including mass spectrometry, antibody arrays, and autoantibody arrays. The sensitivity and specificity of our 12-biomarker panel improve upon published protein and gene expression panels. Separate verification of classifier performance provides evidence against over-fitting and is encouraging for the next development phase, independent validation. This careful study provides a solid foundation to develop tests sorely needed to identify early stage lung cancer.

Introduction

Lung cancer is the leading cause of cancer deaths, because ∼84% of cases are diagnosed at an advanced stage [1]–[3]. Worldwide in 2008, ∼1.5 million people were diagnosed and ∼1.3 million died [4] – a survival rate unchanged since 1960. However, patients who are diagnosed at an early stage and have surgery experience an 86% overall 5-year survival [2], [3]. New diagnostics are therefore needed to identify early stage lung cancer.

Over the past decade the clinical utility of low-dose CT has been evaluated [5]–[8] with the hope that high-resolution imaging can help detect lung cancer earlier and improve patient outcomes, much as screening has done for breast and colorectal cancers [9]. Definitive conclusions about CT screening and lung cancer mortality await results from randomized trials in the US [8] and Europe [10]–[13]. CT can detect small, early-stage lung tumors, but distinguishing rare cancers from common benign conditions is difficult and has led to unnecessary procedures, radiation exposure, anxiety, and cost [6], [14]–[16]. We (J.M.S., J.L.W., and colleagues) recently reported such conclusions for the Pittsburgh Lung Screening Study (PLuSS), the largest single-institution CT screening study reported to date [5].

Other types of biomarkers have also been sought [17]. Proteins are attractive because they are an immediate measure of phenotype, in contrast to DNA which provides genotype, largely a measure of disease risk [18]. Single protein biomarkers are the foundation of molecular diagnostics in the clinic today. It is widely thought that multiple biomarkers could improve the sensitivity and specificity of diagnostic tests, and that complex diseases like cancer change the concentrations of multiple proteins [19]. However, discovering multiple protein biomarkers by measuring many proteins simultaneously (proteomics) in complex samples like blood has proven difficult for reasons of coverage, precision, throughput, preanalytical variability, and cost [20].

To enable biomarker discovery, we developed a new proteomic technology that is based on a new generation of aptamer protein binding reagents and has potentially broad application [18]. The current assay measures 813 diverse human proteins in just 15 µL of blood with low limits of detection (1 pM average and as low as 100 fM), 7 logs of overall dynamic range, and high reproducibility (5% median coefficient of variation) [18]. Here we present the first large scale clinical application of our proteomics technology to discover blood protein biomarkers in a large multi-center case-control study conducted in archived samples from 1,326 subjects from four independent studies of non-small cell lung cancer (NSCLC) in long-term tobacco-exposed populations.

Materials and Methods

Ethics Statement

All samples were collected from study participants after obtaining written informed consent under clinical research protocols approved by the following institutional review boards: The University of Pittsburgh Institutional Review Board (Pitt); The New York University School of Medicine Institutional Review Board (NYU); The Roswell Park Cancer Institute Institutional Review Board (RP); and The Cape Cod Healthcare Institutional Review Board (BS).

Study Design

The objectives of this study were to discover biomarkers that discriminate NSCLC from smokers with ≥10 years of cigarette smoking history, to train and cross-validate a multi-biomarker classifier of NSCLC to meet pre-specified performance criteria, and to verify the performance of this classifier with a separate set of blinded samples. The overall design of the study is shown in Figure 1. We designed and executed this study to current rigorous standards for biomarker clinical studies [21]–[23] with the goals of maximizing biomarker robustness, validity, and reliability at the discovery phase, and minimizing potential effects of preanalytical variability. The study was a discovery-phase, case-control design. Critical study design features include the following. The clinical question and study design were pre-specified prior to identifying and acquiring samples. Samples were acquired from four independent study sites in order to control for potential preanalytical variability. Strict standard operating procedures were followed to ensure sample and data anonymity and blinding at all times (see below). A verification sample set consisting of 25% of all samples in the study was randomly selected and the identification of this set was blinded. The statistical analysis plan was pre-specified and included minimally acceptable performance criteria for sensitivity and specificity.

Figure 1. Study Flow for Algorithm Training and Verification.

Sample Cohort

The sample cohort comprised 1,326 serum samples obtained from four independent biorepositories: New York University (NYU) [24]; Roswell Park Cancer Institute (RPCI) [25]; The University of Pittsburgh (PITT) [5]; and a commercial biorepository (BioServe (BS)) (Table 1). All samples were collected from study participants after obtaining informed consent under institutionally approved clinical research protocols as described [5], [24], [25]. Both case and control serum samples were collected from four study centers. The clinical characteristics of the study cohort for the training and verification sets are shown in Table 2. The staging and histology of NSCLC cases are shown in Table 3. The sample cohort included patients diagnosed with pathologic or clinical stage I-III NSCLC and a high-risk control population with a history of long-term tobacco use, including active and ex-smokers with ≥10 pack-years of cigarette smoking. The control populations were selected randomly within each study to represent the patient population at risk for lung cancer that would be candidates for CT screening, with a ratio of case:control of 1∶3.5. Blood samples for cases were collected from patients within eight weeks of the first biopsy-proven lung cancer diagnosis and prior to removal of the tumor by a surgical procedure. All cases used in this study were confirmed as primary lung cancer by pathology review. NSCLC staging was assigned by pathological staging for 240 subjects and clinical staging for 51 subjects. Benign nodule controls had at least one year of follow-up data and a non-malignant diagnosis. Smoker controls were asymptomatic study participants with ≥10 pack-years of cigarette smoking. Smoker controls from NYU and PITT were nodule free by CT; nodule status is unknown for the smoker controls from RPCI and BS. Demographic data were collected by self-report questionnaires. Additional data for cases were acquired through clinical chart review. Pulmonary function was assessed by spirometry for a subset of the study participants.

Table 2. Clinical characteristics of NSCLC case and control sets for training and verification.

Table 3. Clinical characteristics of NSCLC cases in the training and verification sets.

Serum Collection, Processing, Storage, and Shipment

All serum specimens were collected following uniform protocols recommended by the National Cancer Institute's Early Detection Research Network [22]. Three of the centers (NYU, PITT and RPCI) collected serum in red top Vacutainer tubes (Becton Dickinson, Raritan, NJ) and one center (BS) collected serum in tiger top SST Vacutainer tubes (Becton Dickinson). All samples were allowed to clot and serum was recovered by centrifugation within 2–8 hours of collection and stored at −80°C. HIPAA compliant, de-identified samples were shipped frozen on dry ice to SomaLogic from the study centers and stored at −80°C. Samples were thawed once for aliquoting prior to proteomic analysis.

Sample Blinding

In order to prevent potential bias, this study followed a strict standard operating procedure for sample de-identification and blinding, such that all physical samples and data records were identified exclusively by a unique, unidentifiable barcode number and the key was stored in a secure database accessible only to designated responsible administrators. All sample aliquots run in this study were stored in identical tubes identified only by assigned barcode. The sample blinding code was broken only according to the pre-specified analysis plan for the purposes of classifier training with the training set and classifier verification with the verification set. For the verification sample set, a unique blinding key was generated and provided exclusively to a third party reader (K.C.), unaffiliated with the study centers or SomaLogic, to score and report the final verification results.

Proteomic Analysis

Serum samples were analyzed on our proteomic discovery platform as described in Gold et al. [18]. Briefly, this technology uses novel DNA aptamers that contain chemically modified nucleotides as highly specific protein binding reagents in a unique multiplexed assay that transforms the quantity of each targeted protein into a corresponding quantity of aptamer, which is quantified with a custom hybridization array. Protein quantities are recorded as relative fluorescent units (RFU), which can be converted to concentrations with standard curves. The platform is highly automated [26] and scalable to accommodate a broad range of sample throughput. In this study, 813 protein targets were measured in 15 µL of serum for each subject, and all 1,326 sera were analyzed in a continuous process over a period of eight days. Overall, the results are analogous to a little more than 1,000,000 high quality ELISA measurements (813 proteins × 1,326 samples = 1,078,038 measurements). Samples were processed in multiple 96-well microtiter plates, and all 1,326 samples were distributed randomly and their identities were completely blinded throughout the proteomic analysis process.

Biomarker Selection

Biomarkers were selected with a strategy designed to identify analytes with the highest performance in classifying NSCLC cases from controls across all study sites and that were least affected by preanalytical variables. In the first step of this analysis, we eliminated analytes that exhibited unexpected variation compared to internal controls, due to, for example, sample instability. In this process, we chose a set of analytes that performed well in a total of six naïve Bayes (NB) classifier training analyses. First we divided the training set into two distinct populations to control for possible biological variability between them: (1) all cases and controls with benign nodules identified by CT; and (2) all cases and all other smoker controls (nodule status unknown). For each population, we compared cases to controls in three NB training analyses designed to control for potential preanalytical variability between study sites. Each of the three NB analyses started with a unique set of potential biomarkers based on the following criteria: (1) cases versus controls KS≥0.3 for all comparisons within each of the four study sites; (2) cases versus controls KS≥0.3 for comparing all sites combined; (3) both criteria one and two were met. For each analysis, we used a greedy forward search algorithm to select subsets of potential biomarkers, build NB classifiers (see below), and score their performance for classifying lung cancer and controls using the training set. This meta-heuristic approach efficiently searches classifier space to identify potential biomarkers that perform best in classification. We used a simple measure of diagnostic performance of classifiers, the numerical sum of sensitivity + specificity, and measured the frequency with which potential biomarkers were selected by the greedy algorithm for inclusion in classifier panels with sensitivity + specificity ≥1.7. This step produced a set of potential biomarkers for each of the six parallel analyses. We selected the final set of biomarkers as the union of these six sets.
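The three starting sets described above can be illustrated with a small sketch. It is not the authors' code: the nested-dictionary data layout, the function names, and the toy values are assumptions made for illustration, and only the KS≥0.3 threshold comes from the text.

```python
from bisect import bisect_right

KS_THRESHOLD = 0.3  # threshold quoted in the text


def ks(a, b):
    """Two-sample KS distance between two lists of measurements."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def candidate_sets(data):
    """data: {analyte: {site: (case_values, control_values)}} (hypothetical layout).

    Returns the analytes passing criterion 1 (every site), criterion 2
    (all sites pooled), and criterion 3 (both).
    """
    per_site, pooled = set(), set()
    for analyte, sites in data.items():
        if all(ks(cases, controls) >= KS_THRESHOLD
               for cases, controls in sites.values()):
            per_site.add(analyte)
        all_cases = [v for cases, _ in sites.values() for v in cases]
        all_controls = [v for _, controls in sites.values() for v in controls]
        if ks(all_cases, all_controls) >= KS_THRESHOLD:
            pooled.add(analyte)
    return per_site, pooled, per_site & pooled


# Toy data for two sites: "A" separates everywhere, "B" nowhere,
# "C" only at one site (so it passes the pooled criterion alone).
low, high = [1, 2, 3, 4], [5, 6, 7, 8]
toy = {
    "A": {"site1": (high, low), "site2": (high, low)},
    "B": {"site1": (low, low), "site2": (low, low)},
    "C": {"site1": (high, low), "site2": (low, low)},
}
set1, set2, set3 = candidate_sets(toy)
print(sorted(set1), sorted(set2), sorted(set3))
```

An analyte such as "C" shows why the per-site and pooled criteria can disagree: a difference driven by a single site can still pass the pooled comparison.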

Statistical Methods

The KS statistic is a non-parametric measure of the difference between two distributions. The two-sample KS statistic is $D = \sup_x \left| F_1(x) - F_2(x) \right|$, where $F_1$ and $F_2$ are the empirical cumulative distributions for two populations of values.
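As a minimal sketch of this definition (not the authors' implementation), the two-sample KS distance can be computed by evaluating both empirical cumulative distributions at every observed value. The function name and the sample values below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs only change at observed values, so the largest
    # gap is attained at one of them.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(f_a - f_b))
    return d


cases = [2.1, 2.4, 2.8, 3.0, 3.3]      # hypothetical log RFU values
controls = [1.0, 1.5, 1.9, 2.2, 2.5]   # hypothetical log RFU values
d_example = ks_statistic(cases, controls)
print(d_example)
```

A value of 0 means the two empirical distributions coincide; a value of 1 means they do not overlap at all.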

The naïve Bayes classifier assumes independence between the features (here, the protein measurements) and models the distributions of the training classes to make predictions [27]. We used normal distributions to model our data. However, the features in our data often have heavy-tailed distributions, so maximum likelihood estimation of the distribution parameters performs poorly. Therefore, we modeled our distributions as log-normal distributions and used the Gauss-Newton algorithm to fit the data.

We constructed Bayesian classifiers using sets of potential biomarkers identified as described above. We used a parametric model to capture the underlying protein distribution for a given state. The simplest parametric model for the probability density function (pdf) for a single protein is a normal distribution, completely described by a mean $\mu$ and variance $\sigma^2$ (Eq. 1).

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{1}$$

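Many protein distributions were observed as normal with respect to the logarithm of the concentration. The numeric cdfs can be fit to a normal distribution in log concentrations $x$ (Eq. 2).

$$\mathrm{cdf}(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] \tag{2}$$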

The models fit the data well. More complex models of the probability distribution functions may be used when warranted but the simple model provided a good description of our data.
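One way to read the fitting step is as a least-squares fit of the normal cdf to the empirical cdf of the log concentrations. The sketch below uses a plain Gauss-Newton iteration on the two parameters. It is an illustration under stated assumptions, not the authors' code: the plotting positions (i + 0.5)/n, the starting guess, the stopping rule, and the synthetic data are all choices made here.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_normal_cdf(values, iterations=50, tol=1e-12):
    """Fit (mu, sigma) of a normal cdf to the empirical cdf by Gauss-Newton."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 0.5) / n for i in range(n)]   # empirical cdf (assumed plotting positions)
    mu, sigma = mean(xs), stdev(xs)          # starting guess
    for _ in range(iterations):
        a11 = a12 = a22 = g1 = g2 = 0.0      # entries of J^T J and J^T r
        for x, y in zip(xs, ys):
            z = (x - mu) / sigma
            r = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))) - y   # residual
            phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
            j_mu, j_sigma = -phi / sigma, -z * phi / sigma       # d(cdf)/d(mu), d(cdf)/d(sigma)
            a11 += j_mu * j_mu
            a12 += j_mu * j_sigma
            a22 += j_sigma * j_sigma
            g1 += j_mu * r
            g2 += j_sigma * r
        det = a11 * a22 - a12 * a12
        d_mu = -(a22 * g1 - a12 * g2) / det      # solve (J^T J) delta = -J^T r
        d_sigma = -(a11 * g2 - a12 * g1) / det
        mu += d_mu
        sigma += d_sigma
        if abs(d_mu) + abs(d_sigma) < tol:
            break
    return mu, sigma


# Synthetic log concentrations placed at exact quantiles of N(2.0, 0.5).
log_conc = [NormalDist(2.0, 0.5).inv_cdf((i + 0.5) / 200) for i in range(200)]
mu_hat, sigma_hat = fit_normal_cdf(log_conc)
print(mu_hat, sigma_hat)
```

Because the fit matches the whole cdf rather than moments, a few extreme values move the estimates less than they would move a sample mean and variance, which is consistent with the motivation given for heavy-tailed features.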

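To combine multiple markers, we used a multivariate normal distribution to model the probability density function (pdf) for each class. For n markers, the multivariate pdf is given by the following equation (Eq. 3).

$$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right) \tag{3}$$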

where $x$ is an n-component vector of protein levels, $\mu$ is an n-component vector of mean protein levels, $\Sigma$ is the n × n covariance matrix, and $|\Sigma|$ and $\Sigma^{-1}$ are its determinant and inverse. In its simplest form, we can assume a diagonal representation for $\Sigma$. Such an approximation leads to a naïve Bayes model, which assumes independence between the markers. In this work, we exclusively use the naïve Bayes model for constructing classifiers. The parameter values for $\mu$ and $\Sigma$ used in the naïve Bayes classification were obtained from nonlinear regression analysis as described above.
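With a diagonal covariance matrix the multivariate density factorizes into a product of single-marker normal densities, so the log-likelihood ratio between classes is a sum over markers. The following sketch shows that computation. The function names and parameter values are hypothetical, and it is not presented as the authors' implementation.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log of the univariate normal density (Eq. 1) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def log_likelihood_ratio(x, case_params, control_params):
    """Naive Bayes log-likelihood ratio, case versus control.

    x: log protein levels, one per marker.
    case_params / control_params: (mu, sigma) per marker for each class.
    """
    return sum(
        log_normal_pdf(xi, mu_c, s_c) - log_normal_pdf(xi, mu_k, s_k)
        for xi, (mu_c, s_c), (mu_k, s_k) in zip(x, case_params, control_params)
    )


# Hypothetical two-marker panel: marker 1 is higher in cases, marker 2 is lower.
case_params = [(1.0, 1.0), (-1.0, 1.0)]
control_params = [(0.0, 1.0), (0.0, 1.0)]
llr_case_like = log_likelihood_ratio([1.0, -1.0], case_params, control_params)
llr_midpoint = log_likelihood_ratio([0.5, -0.5], case_params, control_params)
print(llr_case_like, llr_midpoint)
```

A positive ratio favors the case class and a negative ratio favors the control class; a sample lying midway between the class means on every marker gives a ratio of zero when the variances are equal.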

The addition of subsequent markers with good KS distances will, in general, improve the classification performance if the subsequently added markers are independent of the first marker. We searched for optimal marker panels with a “greedy” algorithm, which is any algorithm that follows the problem solving meta-heuristic of making the locally optimal choice at each stage with the hope of finding the global optimum. We used the sensitivity (fraction of true positives) plus specificity (fraction of true negatives) as a classifier score. The algorithm approach used here is described as follows. All single analyte classifiers were generated from a table of potential biomarkers and added to a list. Next, all possible additions of a second analyte to each of the stored single analyte classifiers were performed, saving a predetermined number (10,000 in this case) of the best scoring pairs on a new list. All possible three marker classifiers are explored using this new list of the best two-marker classifiers, again saving the best thousand of these. This process continues until the score either plateaus or begins to deteriorate as additional markers are added.
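The search procedure described above can be sketched as a forward search that keeps a fixed number of the best panels at each size. This is an illustration only: the `score` callback stands in for training a naïve Bayes classifier and returning sensitivity + specificity, and the toy scoring function, `beam_width` default, and `max_size` limit are assumptions.

```python
def greedy_panel_search(markers, score, beam_width=1000, max_size=12):
    """Forward search keeping the `beam_width` best panels of each size."""
    # All single-marker panels are scored and kept.
    beam = sorted(((score((m,)), (m,)) for m in markers), reverse=True)
    best_score, best_panel = beam[0]
    while len(beam[0][1]) < max_size:
        seen, candidates = set(), []
        for _, panel in beam:
            for m in markers:
                if m in panel:
                    continue
                new_panel = tuple(sorted(panel + (m,)))
                if new_panel not in seen:        # avoid scoring the same set twice
                    seen.add(new_panel)
                    candidates.append((score(new_panel), new_panel))
        if not candidates:                       # every marker is already used
            break
        candidates.sort(reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] <= best_score:             # score plateaus or deteriorates
            break
        best_score, best_panel = beam[0]
    return best_panel, best_score


# Toy score: each marker adds some discrimination, each extra marker costs a little.
weights = {"A": 0.5, "B": 0.3, "C": 0.05, "D": 0.0}


def toy_score(panel):
    return 1.0 + sum(weights[m] for m in panel) - 0.03 * len(panel)


best_panel, best_score = greedy_panel_search(list(weights), toy_score)
print(best_panel, best_score)
```

Keeping many panels per step, rather than only the single best, reduces the chance that an early locally optimal choice excludes a better combination later, at the cost of more classifier evaluations.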

Results

We analyzed 1,326 serum samples from four independent biorepositories: New York University (NYU) [24]; Roswell Park Cancer Institute (RPCI) [25]; The University of Pittsburgh (PITT) [5]; and a commercial biorepository (BioServe (BS)) (Table 1). The study included patients diagnosed with pathologic or clinical stage I-III NSCLC and a high-risk control population with a history of long-term tobacco use, including active and ex-smokers with ≥10 pack-years of cigarette smoking (Tables 2 and 3). The control populations were selected randomly within each study to represent the patient population at risk for lung cancer that would be candidates for CT screening, with a ratio of case to control of 1 to 3.5.

Samples were randomly distributed into segregated sets for classifier training and verification (Figure 1) with no significant differences in demographics between these sets (Table 2). More than 45% of NSCLC cases were pathologically confirmed stage IA or IB or clinical stage I with adenocarcinoma representing the major histological diagnosis (Table 3). All lung cancer patients had a biopsy-proven cancer diagnosis.

We measured the quantity of 813 proteins in each of the 1,326 samples with our proteomic discovery platform [18]. We followed a pre-specified two-phase analysis plan to identify biomarkers and develop a classifier to distinguish lung cancer subjects from controls within the training set (training phase) and to verify the classifier performance with the blinded independent verification set (verification phase). The training phase entailed two steps – biomarker selection and algorithm training with cross-validation.

To select biomarkers we performed a systematic analysis that narrowed the potential biomarker field for algorithm training to increase the probability of true discovery, yet still cast a relatively broad net. We used a naïve Bayes (NB) method to systematically assess potential biomarker performance with pre-specified criteria. We applied the NB method to subsets of the training data to broaden our cast for potential biomarkers (see Methods). The results identified a set of 44 potential biomarkers (Table 4) that distinguish lung cancer from controls across a range of comparisons in the training set while minimizing potential preanalytical variability – artifacts introduced by variations in sample collection and storage (see below) [28], [29].

To develop a potential diagnostic to distinguish NSCLC from controls, we trained NB classifiers starting with the 44 potential biomarkers we identified using a “greedy” forward search algorithm and ten-fold stratified cross-validation, starting with three biomarkers and adding one more at each step. We assessed classifier performance with pre-specified performance criteria (Table 5). From this set of 44 potential biomarkers we constructed 45 classifiers of seven to twelve biomarkers that met our performance criteria, which suggests that there is significant redundancy in the information contained within the set of potential biomarkers. Cross-validated classifier performance reached a plateau with twelve biomarkers. Following our analysis plan, we selected from the 45 resulting classifiers the one with the highest overall performance on the pre-specified criteria (Table 5), including discrimination of NSCLC from controls, detection of Stage I disease, and detection of cancer in chronic obstructive pulmonary disease (COPD). In the training set, the classifier achieved 91% sensitivity, 84% specificity, and an area under the curve (AUC) of 0.91 (Figure 2). The results (Table 6) show that sensitivity is maintained for Stage I NSCLC (90% for training set). The classifier performed well on samples from all four study sites (Figure 3).
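For readers unfamiliar with stratified cross-validation, the sketch below shows one simple way to build ten folds that each preserve the overall case:control mix, together with the sensitivity and specificity calculation used as the score. The round-robin assignment, the seed, and the toy label counts are assumptions for illustration and are not taken from the study.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of sample indices with a similar class mix in each."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):      # deal each class out round-robin
            folds[j % k].append(i)
    return folds


def sensitivity_specificity(y_true, y_pred):
    """Fraction of cases called positive and fraction of controls called negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


labels = [1] * 20 + [0] * 70             # toy cohort: 20 cases, 70 controls
folds = stratified_folds(labels)
print([sum(labels[i] for i in fold) for fold in folds])   # cases per fold
print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

In each round, one fold is held out for scoring and the classifier is trained on the other nine, so every sample is scored exactly once by a model that did not see it.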

Figure 2. ROC curve for 12-biomarker naïve Bayes classifier.

Figure 3. ROC curve performance of the 12-biomarker naïve Bayes NSCLC classifier by study site.

Table 5. Criteria for algorithm performance on training and cross-validation.

Table 6. Performance of Bayesian Classifier to distinguish NSCLC cases from controls.

The twelve biomarkers are shown in Table 7. The estimated serum concentrations for these markers span 4 logs (10 pM-100 nM). About half the control group had benign pulmonary nodules detected by CT (Table 2), and the performance of the classifier in that subgroup was similar to that of the whole (Table 6). We also tested the effect of other attributes that could affect classifier performance such as age, smoking history, and COPD, but found little effect (Tables 8 and 9). Age has a moderate effect on the shape of the ROC curve because the probability of cancer increases with age, but this effect can be controlled by adjusting the prior probability of cancer in the Bayes classifier model. The classification performance of the fixed algorithm was tested on the blinded independent verification set and verified by a third party reader to achieve 89% sensitivity and 83% specificity, nearly matching the training set performance.

Table 8. Performance of classifier in demographic subsets.

Table 9. Classifier specificity by level of airflow obstruction.

To determine whether our classification results were affected either by age, smoking status, or smoking history, which are the demographics with significant differences between the case and control populations (Table 2), we compared the classifier performance on subsets of the training set population divided into groups based on the median value of these attributes. The results show similar classifier performance for all subsets (Table 8). To further assess whether our classification results were affected either by age, smoking status, or smoking history, we tested for potential correlation of the twelve biomarkers with these variables. The results showed no correlations except for endostatin, which showed a moderate correlation, increasing with age. This effect can be compensated for by adjusting the prior probability of cancer in the Bayes classifier model. We also assessed the specificity of the classifier for the discrimination of controls known to have airflow obstruction (measured by GOLD score). The results are shown in Table 9. Spirometry data was incomplete for NSCLC cases, so we could not calculate sensitivity.
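The prior adjustment mentioned above follows from Bayes' rule in log-odds form: the posterior log-odds of cancer equal the classifier's log-likelihood ratio plus the prior log-odds. The sketch below illustrates this with hypothetical prior probabilities for two age groups; the numbers are not from the study.

```python
import math


def posterior_from_llr(llr, prior):
    """Posterior probability of cancer from a log-likelihood ratio and a prior."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-(llr + prior_log_odds)))


# Same biomarker evidence, two hypothetical age-dependent priors.
llr = 1.0
p_younger = posterior_from_llr(llr, prior=0.05)
p_older = posterior_from_llr(llr, prior=0.20)
print(p_younger, p_older)
```

With no evidence (a log-likelihood ratio of zero) the posterior equals the prior, so raising the prior for older subjects shifts every score by the same amount in log-odds without changing the biomarker model itself.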

Preanalytical variability underlies common failures to translate candidate biomarkers into clinically useful tests [20], [29]. We assessed preanalytical variability in this study by measuring differences in protein levels within the same disease class (NSCLC or control) between different sites and comparing them to differences observed between NSCLC and control populations. The results (Figure 4) show significant preanalytical variability between sites. However, the proteins most affected are distinct from potential NSCLC biomarkers. Many proteins that exhibit preanalytical variability (Table 10) are known to be susceptible to variations in sample collection and handling [28], [29]. This result confirms that preanalytical variability exists in our study and provides evidence that, as designed, our study largely overcomes this variability to maximize the chances of discovering true, robust biomarkers of NSCLC.

Figure 4. Heat map shows the magnitude of difference for each protein measured (columns) between subject populations for the comparison of NSCLC to controls (top row) and comparisons of cases or controls between study sites (bottom row).

Top row: KS distances for NSCLC versus control distributions. Bottom row: mean KS distances for all 12 pair-wise comparisons, between the four sites, of case and control samples analyzed separately. Proteins were ordered by subtracting the NSCLC KS distance from the mean site KS distance. This revealed groups of NSCLC biomarkers (top right) contrasting with preanalytical markers (bottom left).
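The count of 12 pair-wise comparisons and the ordering rule in this legend can be reproduced with a short sketch. The protein names and KS values are hypothetical, and the ascending sort direction is an assumption, since the legend gives only the quantity used for ordering.

```python
from itertools import combinations

sites = ["NYU", "PITT", "RPCI", "BS"]
site_pairs = list(combinations(sites, 2))   # 6 pairs of sites
n_comparisons = len(site_pairs) * 2         # cases and controls analyzed separately


def order_proteins(nsclc_ks, mean_site_ks):
    """Sort proteins by (mean site KS distance) minus (NSCLC KS distance)."""
    return sorted(nsclc_ks, key=lambda p: mean_site_ks[p] - nsclc_ks[p])


# Hypothetical values: P1 behaves like a disease marker, P2 like a preanalytical marker.
nsclc_ks = {"P1": 0.5, "P2": 0.1, "P3": 0.3}
mean_site_ks = {"P1": 0.1, "P2": 0.6, "P3": 0.3}
ordering = order_proteins(nsclc_ks, mean_site_ks)
print(n_comparisons, ordering)
```

Proteins with a strongly negative difference separate cases from controls more than they separate sites, while a strongly positive difference flags proteins dominated by site-to-site variation.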

Discussion

The primary findings of this study are 44 potential lung cancer biomarkers that discriminate stages I-III NSCLC cases from at-risk heavy smoker controls that can be combined into classifier panels that meet and exceed pre-specified performance criteria. The results of this study are novel in the following: (1) most of the proteins identified in this study have not been identified previously as serum lung cancer biomarkers; (2) we have identified novel protein biomarker panels that distinguish lung cancer cases from appropriate controls with high sensitivity and specificity in an independent, blinded verification set; and (3) this study achieves a new level of evidentiary standard in clinical proteomic biomarker studies as a result of a large sample size, a study design to control preanalytical variability, and the unique capability of this proteomic technology to interrogate the circulating proteome quantitatively with a breadth, sensitivity, and dynamic range unmatched by other broad serum profiling platforms [18], including mass spectrometry [18], antibody arrays [18], and autoantibody arrays [18], [30]–[32]. This study is the first large-scale application of this technology and the largest clinical proteomic biomarker study to date. As such, this study aims to overcome critical confounders and limitations of clinical proteomic biomarker studies that contribute largely to the lack of translation to the clinic due to false discovery [20]. These confounders and limitations include clinical sample integrity, preanalytical variability, and inadequate study design and power.

The best overall performing classifier used 12 of the 44 biomarkers and achieved 91% sensitivity and 84% specificity in cross-validated training and similar performance of 89% sensitivity and 83% specificity in blinded verification. These results provide evidence that these biomarkers are valid and that the classifier was not over-fit to the training data. This performance and the biological plausibility (following) of the 12 biomarkers are encouraging for the next phase of development – validation in an independent clinical study.

The 12 biomarkers identified in this study (Table 7) encompass functions of cell movement, inflammation, and immune monitoring that may contribute to cancer development. Most of the 12 proteins have been associated generally with cancer biology, some have been identified as candidate lung cancer biomarkers, none have been validated as lung cancer biomarkers, and none are used clinically [33], [34]. Four of the 12 proteins have been identified in serum and lung cancer tissue or cell culture as candidate lung cancer biomarkers – cadherin-1 [35], endostatin [36], HSP90 [37], and pleiotrophin [38]. Eight of the 12 proteins, CD30 ligand, LRIG3, MIP-4, PRKCI, RGM-C, SCF-sR, sL-Selectin, and YES, have not been identified previously in serum as lung cancer biomarkers and represent novel findings.

Seven of the 12 proteins (CD30 ligand, endostatin, HSP90, MIP-4, pleiotrophin, PRKCI, and YES) were observed to be up-regulated in lung cancer in this study, consistent with their proposed biological roles in proliferation, invasion, or host inflammatory and immune response to the tumor. CD30 ligand is a member of the TNF ligand superfamily, which stimulates T-cell growth. Up-regulation of this protein correlates with proliferation in hematological malignancies [36]. Endostatin, best known as an inhibitor of angiogenesis, has elevated serum levels in several cancers [39]. Overexpression of endostatin and its parent extracellular matrix protein, collagen XVIII, has been associated with poor prognosis in NSCLC [36].

The chaperone HSP90α is important for the stability and function of a wide range of oncoproteins, including BCR-ABL, ERBB2, EGFR, BRAF, and AKT, among others, and inhibitors of this protein are now in oncology clinical trials, including for NSCLC [40]. HSP90 may also play a role in tumor cell resistance to complement-mediated cytotoxicity [41]. MIP-4 is over-expressed in ovarian and gastric cancers, and may have a role in immunosuppression of the host tumor response [42]. Pleiotrophin is a growth factor with both mitogenic and angiogenic properties, and its levels in the serum of NSCLC patients have been reported to correlate with disease stage and prognosis [38]. PRKCI is an oncogene that is often amplified in NSCLC, and its over-expression in lung tumors correlates with poor prognosis [43]. YES, another protein kinase and member of the src-family of tyrosine kinases, has a role in malignant transformation, and increased protein levels have been reported in early stages of hepatocarcinoma [44].

We observed decreased levels of proteins in the serum of lung cancer patients compared to controls, including cadherin-1, LRIG3, sL-selectin, SCF-sR, ERBB1 and RGM-C. Lower circulating levels of many of these proteins are associated with relief of inhibition of growth and invasion. For example, cadherin-1 is critical for cell adhesion and indirectly affects transcriptional regulation circuits through β-catenin [45]. Consistent with our results, reduced expression has been reported in lung cancer, and loss of cadherin-1 is a key event leading to loss of adherence, tumorigenicity, and metastasis [46]. The LRIG family consists of membrane proteins with soluble leucine-rich repeat domains and immunoglobulin-like domains. Down-regulation of expression of this protein in glioblastoma cell lines resulted in increased proliferation and invasion, decreased apoptosis, and increased EGFR expression, leading to the hypothesis that LRIG is a tumor suppressor [47]. L-selectin plays a role in activation of naïve lymphocytes that participate in immune surveillance and antitumor immunity. It also mediates the adherence of lymphocytes to endothelial cells. Lower expression of L-selectin may be a component of the immune suppression observed in many cancer patients [48].

Some of the proteins described in this study are the soluble domains of membrane receptors, and the function of the circulating form of these proteins may oppose that of their membrane-bound counterparts. Turner et al. [49] proposed that soluble SCF-receptors regulate kit activation. Our results suggest that a low level of SCF-sR fails to titrate SCF, which makes more SCF available for binding cancer cells. Unlike the membrane-bound form, soluble RGM-C inhibits hepcidin expression [50], [51]. We find that RGM-C is down-regulated in NSCLC serum, consistent with increased intracellular iron and proliferative cell growth [52].

The limitations of this study include the following. We did not test cases prior to clinically apparent disease. We did not demonstrate organ-specificity, and many of the markers are known to be elevated in other cancers. However, the markers will be used in combination and in the proper diagnostic context, such as with imaging, smoking history, and symptoms. We did not validate our findings in an independent set of clinical samples. Our multi-center study was designed to minimize the effects of potential preanalytical variability, which is mitigated, but not eliminated, by this design. All of these limitations will be addressed in the next phase of development, which is enabled by the positive results of this study.

The biomarkers that we discovered have several potential clinical applications. The first application is early detection of lung cancer in long-term smokers when it may be cured by surgery. Our results are a significant improvement on the performance of other recently published lung cancer biomarker studies aimed at early diagnosis [17] using mass spectrometry [24], [53], [54] or gene expression [55]. This performance could allow for testing of individuals with increased lung cancer risk, with subsequent CT screening based on the blood test result.

A second potential application is a test for diagnosing lung cancer in subjects with suspicious lung nodules identified by CT, which could help mitigate the problem of morbidity and cost associated with surgical interventions. CT screening reveals suspicious nodules in ∼40% of long-term smokers [5], [56], [57], but ∼97% are likely benign [5], [57], [58]. Protocols for managing these patients balance the risk of “watchful waiting” with definitive and costly invasive procedures. Watchful waiting monitors nodule growth by periodic follow-up CTs, but may miss the opportunity for early surgical cure. Invasive procedures incur the risk of complications and death that arise from biopsy or futile thoracotomy for benign lesions. This risk might be reduced by a new strategy to assess nodule volume doubling time by CT [13]. However, CT radiation itself increases cancer risk [59].
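To make the scale of the nodule problem concrete, the approximate rates quoted above (∼40% of long-term smokers with suspicious nodules, of which ∼97% are likely benign) can be turned into rough counts. The sketch below is illustrative only: the cohort of 1,000 screened smokers is a hypothetical round number, the function name `nodule_counts` is ours, and the rates are the approximate figures cited in the text rather than results of this study.

```python
def nodule_counts(cohort_size, nodule_rate=0.40, benign_fraction=0.97):
    """Rough counts for a hypothetical CT-screened cohort of long-term smokers.

    nodule_rate and benign_fraction are the approximate figures quoted in
    the text (~40% and ~97%); they are assumptions, not measured values.
    """
    with_nodules = round(cohort_size * nodule_rate)
    benign = round(with_nodules * benign_fraction)
    malignant = with_nodules - benign
    return with_nodules, benign, malignant


# Hypothetical cohort of 1,000 screened long-term smokers.
with_nodules, benign, malignant = nodule_counts(1000)
print(with_nodules, benign, malignant)  # prints: 400 388 12
```

Under these assumptions, roughly 400 of 1,000 screened smokers would need follow-up for a suspicious nodule, while only about 12 of those nodules would be expected to be malignant.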

Based on the discoveries reported here, we have initiated clinical validation studies of populations at risk for lung cancer. Our goal is to develop a clinical blood test to enable an earlier diagnosis. This study is the first to be published in a sequence of successful biomarker discovery studies that we have already completed in different cancers and demonstrates the power of our proteomic technology to discover robust biomarkers in important diseases. This general approach can also be applied to discover biomarkers for many more conditions including infectious, inherited, neurological, and metabolic diseases.


Acknowledgments

We thank our SomaLogic colleagues, in particular the assay group (Chris Bock, Evaldas Katilius, Tracy Keeney, Stephan Kraemer, Bridget Lollo, Suzanne Stratford, John Vaught, Alexey Wolfson), for making this work possible and providing valuable input to this manuscript. We thank Karen Copeland for unblinding the verification dataset.

Author Contributions

Designed the study: RMO LG SW DZ ENB. Provided clinical samples and interpreted results: WLB WF HIP WNR JMS JLW. Analyzed the data: MM AS DZ. Wrote the manuscript with input from all others: RMO JJW. All authors evaluated and interpreted the analyzed data, and critically reviewed the manuscript.


References

  1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, et al. (2009) Cancer statistics, 2009. CA Cancer J Clin 59: 225–249.
  2. Okada M, Nishio W, Sakamoto T, Uchino K, Yuki T, et al. (2005) Effect of tumor size on prognosis in patients with non-small cell lung cancer: the role of segmentectomy as a type of lesser resection. J Thorac Cardiovasc Surg 129: 87–93.
  3. Kassis ES, Vaporciyan AA, Swisher SG, Correa AM, Bekele BN, et al. (2009) Application of the revised lung cancer staging system (IASLC Staging Project) to a cancer center population. J Thorac Cardiovasc Surg 138: 412–418 e411–412.
  4. Boyle P, Levin B, editors. (2008) World Cancer Report. Lyon: International Agency for Research on Cancer (IARC).
  5. Wilson DO, Weissfeld JL, Fuhrman CR, Fisher SN, Balogh P, et al. (2008) The Pittsburgh Lung Screening Study (PLuSS): outcomes within 3 years of a first computed tomography scan. Am J Respir Crit Care Med 178: 956–961.
  6. Black WC (2007) Computed tomography screening for lung cancer: review of screening principles and update on current status. Cancer 110: 2370–2384.
  7. Yau G, Lock M, Rodrigues G (2007) Systematic review of baseline low-dose CT lung cancer screening. Lung Cancer 58: 161–170.
  8. NLST (2009) National Lung Screening Trial,
  9. Smith RA, Cokkinides V, Eyre HJ (2007) Cancer screening in the United States, 2007: a review of current guidelines, practices, and prospects. CA Cancer J Clin 57: 90–104.
  10. Blanchon T, Brechot JM, Grenier PA, Ferretti GR, Lemarie E, et al. (2007) Baseline results of the Depiscan study: a French randomized pilot trial of lung cancer screening comparing low dose CT scan (LDCT) and chest X-ray (CXR). Lung Cancer 58: 50–58.
  11. Infante M, Lutman FR, Cavuto S, Brambilla G, Chiesa G, et al. (2008) Lung cancer screening with spiral CT: baseline results of the randomized DANTE trial. Lung Cancer 59: 355–363.
  12. van Iersel CA, de Koning HJ, Draisma G, Mali WP, Scholten ET, et al. (2007) Risk-based selection from the general population in a screening trial: selection criteria, recruitment and power for the Dutch-Belgian randomised lung cancer multi-slice CT screening trial (NELSON). Int J Cancer 120: 868–874.
  13. van Klaveren RJ, Oudkerk M, Prokop M, Scholten ET, Nackaerts K, et al. (2009) Management of lung nodules detected by volume CT scanning. N Engl J Med 361: 2221–2229.
  14. Pinsky PF, Marcus PM, Kramer BS, Freedman M, Nath H, et al. (2005) Diagnostic procedures after a positive spiral computed tomography lung carcinoma screen. Cancer 103: 157–163.
  15. Welch HG, Woloshin S, Schwartz LM, Gordis L, Gotzsche PC, et al. (2007) Overstating the evidence for lung cancer screening: the International Early Lung Cancer Action Program (I-ELCAP) study. Arch Intern Med 167: 2289–2295.
  16. Brenner DJ (2004) Radiation risks potentially associated with low-dose CT screening of adult smokers for lung cancer. Radiology 231: 440–445.
  17. Brower V (2009) Biomarker studies abound for early detection of lung cancer. J Natl Cancer Inst 101: 11–13.
  18. Gold L, Ayers D, Bertino J, Bock A, Bock C, et al. (2010) Aptamer-based multiplexed proteomic technology for biomarker discovery. PLoS ONE 5(11): e15004.
  19. Hartwell L, Mankoff D, Paulovich A, Ramsey S, Swisher E (2006) Cancer biomarkers: a systems approach. Nat Biotechnol 24: 905–908.
  20. Rifai N, Gillette MA, Carr SA (2006) Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 24: 971–983.
  21. Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD (2008) Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Natl Cancer Inst 100: 1432–1438.
  22. Tuck MK, Chan DW, Chia D, Godwin AK, Grizzle WE, et al. (2009) Standard operating procedures for serum and plasma collection: early detection research network consensus statement standard operating procedure integration working group. J Proteome Res 8: 113–117.
  23. Ransohoff DF, Gourlay ML (2010) Sources of bias in specimens for research about molecular markers for cancer. J Clin Oncol 28: 698–704.
  24. Greenberg AK, Rimal B, Felner K, Zafar S, Hung J, et al. (2007) S-adenosylmethionine as a biomarker for the early detection of lung cancer. Chest 132: 1247–1252.
  25. Ambrosone CB, Nesline MK, Davis W (2006) Establishing a cancer center data bank and biorepository for multidisciplinary research. Cancer Epidemiol Biomarkers Prev 15: 1575–1577.
  26. Keeney T, Kraemer S, Walker JJ, Bock C, Vaught J, et al. (2009) Automation of the SomaLogic Proteomics Assay: A Platform for Biomarker Discovery. J Assoc Lab Automat 14: 360–366.
  27. Duda O, Hart PE, Stork DG (2001) Pattern Classification. New York: John Wiley and Sons.
  28. Ostroff R, Foreman T, Keeney TR, Stratford S, Walker JJ, et al. (2009) The stability of the circulating human proteome to variations in sample collection and handling procedures measured with an aptamer-based proteomics array. J Proteomics.
  29. Zhang Z, Chan DW (2005) Cancer proteomics: in pursuit of "true" biomarker discovery. Cancer Epidemiol Biomarkers Prev 14: 2283–2286.
  30. Chen G, Wang X, Yu J, Varambally S, Thomas DG, et al. (2007) Autoantibody profiles reveal ubiquilin 1 as a humoral immune response target in lung adenocarcinoma. Cancer Res 67: 3461–3467.
  31. Zhong L, Hidalgo GE, Stromberg AJ, Khattar NH, Jett JR, et al. (2005) Using protein microarray as a diagnostic assay for non-small cell lung cancer. Am J Respir Crit Care Med 172: 1308–1314.
  32. Gao WM, Kuick R, Orchekowski RP, Misek DE, Qiu J, et al. (2005) Distinctive serum protein profiles involving abundant proteins in lung cancer patients based upon antibody microarray analysis. BMC Cancer 5: 110.
  33. Sung HJ, Cho JY (2008) Biomarkers for the lung cancer diagnosis and their advances in proteomics. BMB Rep 41: 615–625.
  34. Greenberg AK, Lee MS (2007) Biomarkers for lung cancer: clinical uses. Curr Opin Pulm Med 13: 249–255.
  35. Cioffi M, Gazzerro P, Di Finizio B, Vietri MT, Di Macchia C, et al. (1999) Serum-soluble E-cadherin fragments in lung cancer. Tumori 85: 32–34.
  36. Iizasa T, Chang H, Suzuki M, Otsuji M, Yokoi S, et al. (2004) Overexpression of collagen XVIII is associated with poor outcome and elevated levels of circulating serum endostatin in non-small cell lung cancer. Clin Cancer Res 10: 5361–5366.
  37. Xu A, Tian T, Hao J, Liu J, Zhang Z, et al. (2007) Elevation of serum HSP90a correlated with the clinical stage of non-small cell lung cancer. J Cancer Mol 3: 107–112.
  38. Jager R, List B, Knabbe C, Souttou B, Raulais D, et al. (2002) Serum levels of the angiogenic factor pleiotrophin in relation to disease stage in lung cancer patients. Br J Cancer 86: 858–863.
  39. Suzuki M, Iizasa T, Ko E, Baba M, Saitoh Y, et al. (2002) Serum endostatin correlates with progression and prognosis of non-small cell lung cancer. Lung Cancer 35: 29–34.
  40. Banerji U (2009) Heat shock protein 90 as a drug target: some like it hot. Clin Cancer Res 15: 9–14.
  41. Gancz D, Fishelson Z (2009) Cancer resistance to complement-dependent cytotoxicity (CDC): Problem-oriented research and development. Mol Immunol 46: 2794–2800.
  42. Schutyser E, Richmond A, Van Damme J (2005) Involvement of CC chemokine ligand 18 (CCL18) in normal and pathological processes. J Leukoc Biol 78: 14–26.
  43. Erdogan E, Klee EW, Thompson EA, Fields AP (2009) Meta-analysis of oncogenic protein kinase Ciota signaling in lung adenocarcinoma. Clin Cancer Res 15: 1527–1533.
  44. Nonomura T, Masaki T, Morishita A, Jian G, Uchida N, et al. (2007) Identification of c-Yes expression in the nuclei of hepatocellular carcinoma cells: involvement in the early stages of hepatocarcinogenesis. Int J Oncol 30: 105–111.
  45. Ceteci F, Ceteci S, Karreman C, Kramer BW, Asan E, et al. (2007) Disruption of tumor cell adhesion promotes angiogenic switch and progression to micrometastasis in RAF-driven murine lung cancer. Cancer Cell 12: 145–159.
  46. Charalabopoulos K, Gogali A, Kostoula OK, Constantopoulos SH (2004) Cadherin superfamily of adhesion molecules in primary lung cancer. Exp Oncol 26: 256–260.
  47. Cai M, Han L, Chen R, Ye F, Wang B, et al. (2009) Inhibition of LRIG3 gene expression via RNA interference modulates the proliferation, cell cycle, cell apoptosis, adhesion and invasion of glioblastoma cell (GL15). Cancer Lett 278: 104–112.
  48. Hanson EM, Clements VK, Sinha P, Ilkovitch D, Ostrand-Rosenberg S (2009) Myeloid-derived suppressor cells down-regulate L-selectin expression on CD4+ and CD8+ T cells. J Immunol 183: 937–944.
  49. Turner AM, Bennett LG, Lin NL, Wypych J, Bartley TD, et al. (1995) Identification and characterization of a soluble c-kit receptor produced by human hematopoietic cell lines. Blood 85: 2052–2058.
  50. Babitt JL, Huang FW, Wrighting DM, Xia Y, Sidis Y, et al. (2006) Bone morphogenetic protein signaling by hemojuvelin regulates hepcidin expression. Nat Genet 38: 531–539.
  51. Babitt JL, Huang FW, Xia Y, Sidis Y, Andrews NC, et al. (2007) Modulation of bone morphogenetic protein signaling in vivo regulates systemic iron balance. J Clin Invest 117: 1933–1939.
  52. Ward DG, Roberts K, Brookes MJ, Joy H, Martin A, et al. (2008) Increased hepcidin expression in colorectal carcinogenesis. World J Gastroenterol 14: 1339–1345.
  53. Yildiz PB, Shyr Y, Rahman JS, Wardwell NR, Zimmerman LJ, et al. (2007) Diagnostic accuracy of MALDI mass spectrometric analysis of unfractionated serum in lung cancer. J Thorac Oncol 2: 893–901.
  54. Patz EF Jr, Campa MJ, Gottlin EB, Kusmartseva I, Guan XR, et al. (2007) Panel of serum biomarkers for the diagnosis of lung cancer. J Clin Oncol 25: 5578–5583.
  55. Spira A, Beane JE, Shah V, Steiling K, Liu G, et al. (2007) Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat Med 13: 361–366.
  56. Diederich S, Thomas M, Semik M, Lenzen H, Roos N, et al. (2004) Screening for early lung cancer with low-dose spiral computed tomography: results of annual follow-up examinations in asymptomatic smokers. Eur Radiol 14: 691–702.
  57. Swensen SJ, Jett JR, Sloan JA, Midthun DE, Hartman TE, et al. (2002) Screening for lung cancer with low-dose spiral computed tomography. Am J Respir Crit Care Med 165: 508–513.
  58. Croswell JM, Kramer BS, Kreimer AR, Prorok PC, Xu JL, et al. (2009) Cumulative incidence of false-positive results in repeated, multimodal cancer screening. Ann Fam Med 7: 212–222.
  59. Twombly R (2010) Federal oversight of medical radiation is on horizon as experts face off. J Natl Cancer Inst 102: 514–515.
  60. GOLD (2008) The Global Strategy for Diagnosis, Management and Prevention of COPD;