Is symptom-based diagnosis of lung cancer possible? A systematic review and meta-analysis of symptomatic lung cancer prior to diagnosis for comparison with real-time data from routine general practice

Background
Lung cancer is a good example of the potential benefit of symptom-based diagnosis, as it is the commonest cancer worldwide, with the highest mortality from late diagnosis and poor symptom recognition. The diagnosis and risk assessment tools currently available have been shown to require further validation. In this study, we determine the symptoms associated with lung cancer prior to diagnosis. We also demonstrate that, by separating prior risk (based on factors such as smoking history and age) from presenting symptoms and then combining them at the individual patient level, we can make greater use of this knowledge to create a practical framework for the symptomatic diagnosis of individual patients presenting in primary care.

Aim
To provide an evidence-based analysis of symptoms observed in lung cancer patients prior to diagnosis.

Design and setting
Systematic review and meta-analysis of primary and secondary care data.

Method
Seven databases were searched (MEDLINE, Embase, Cumulative Index to Nursing and Allied Health Literature, Health Management Information Consortium, Web of Science, British Nursing Index and Cochrane Library). Thirteen studies were selected based on predetermined eligibility and quality criteria for diagnostic assessment to establish the value of symptom-based diagnosis using the diagnostic odds ratio (DOR) and the summary receiver operating characteristic (SROC) curve. In addition, routinely collated real-time data from a primary care electronic health record (EHR), TransHis, were analysed for comparison with our findings.

Results
Haemoptysis was found to have the greatest diagnostic value for lung cancer, DOR 6.39 (3.32–12.28), followed by dyspnoea 2.73 (1.54–4.85), then cough 2.64 (1.24–5.64) and lastly chest pain 2.02 (0.88–4.60). The ability of symptom-based diagnosis to distinguish lung cancer cases from non-cases was assessed using the SROC curve; the area under the curve (AUC) was consistently above 0.6 for each of the symptoms described, indicating reasonable discriminatory power. The positive predictive value (PPV) of diagnostic symptoms depends on an individual's prior risk of lung cancer, as well as their presenting symptom pattern. For at-risk individuals, we calculated prior risk using validated epidemiological models for risk factors such as age and smoking history, then combined this with the calculated likelihood ratios for each symptom to establish the posterior risk, or PPV.

Conclusion
Our findings show that there is diagnostic value in the clinical symptoms associated with lung cancer, and indicate the potential benefit of characterising these symptoms using routine data studies to identify high-risk patients.


Introduction
Lung cancer has the highest mortality rate of any cancer worldwide and constitutes more than 40% of all new cancer diagnoses [1]. Although survival rates in England have improved in the last 40 years, they remain lower than in comparable European countries. Improving early diagnosis is a key component of relieving the cancer burden [2]. It has been estimated that earlier diagnosis of the four commonest cancers in England (lung, breast, prostate and colorectal) would benefit over 11,000 patients each year [3]. The National Institute for Health and Care Excellence (NICE) 2015 urgent referral guidelines for suspected cancer set the positive predictive value (PPV) threshold of clinical presentations for cancer at 3% [4]. In this study, we aim to determine the validity of symptom-based lung cancer diagnosis, using published studies, routine data from electronic health records and published prior risk models. A recent review of lung cancer diagnosis using 'Risk Assessment Tools' (RATs) found that there was insufficient validation, and that the inclusion of 'epidemiological risk factors' in the models, along with symptoms, created confounders [5]. In this review, we specifically assess symptoms associated with lung cancer diagnosis without epidemiological factors, to avoid confounding. We can then determine the prior risk using epidemiological models and calculate the posterior probability, or PPV, using Bayes' theorem.
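
The Bayesian step described above can be written in odds form. The expression below is the standard statement of Bayes' theorem for a single presenting symptom, not a formula taken from the selected studies. The symbols are notation introduced here for illustration: $p_0$ is the prior risk from an epidemiological model, $\mathrm{LR}_+$ is the positive likelihood ratio of the symptom, and $p_1$ is the posterior risk (the PPV).

```latex
\[
\frac{p_1}{1 - p_1} \;=\; \mathrm{LR}_+ \cdot \frac{p_0}{1 - p_0},
\qquad
p_1 \;=\; \frac{\mathrm{LR}_+\, p_0}{1 - p_0 + \mathrm{LR}_+\, p_0}.
\]
```

Under this formulation, the same symptom yields a different PPV in patients with different prior risk, which is why prior risk and symptoms are assessed separately before being combined.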

Systematic literature search
We performed a systematic review and meta-analysis of studies reporting the sensitivity, specificity, predictive values, odds ratios or likelihood ratios for lung cancer in patients consulting their GP with symptoms prior to diagnosis. Seven databases were searched on 24th September 2017 using search terms specific for lung cancer diagnosis; the selection process is presented as a PRISMA flow chart (Fig 1) [6]. For the PRISMA checklist and the full search terms and outcomes, see S1 and S3 Tables.

Eligibility criteria
Eligible studies were performed in either a primary or secondary care setting. They included male and/or female subjects, 15 years and over, with appropriate demographic information. All cases were diagnosed with primary lung cancer using imaging and/or histological data, assessed by a trained clinician. The presentation of symptoms had to be clearly described in each of the selected studies and recorded prior to the diagnosis of primary lung cancer. Studies that did not include sufficient data for outcome analysis using a 2x2 contingency table were excluded from the meta-analysis.

Data collection from the TransHis primary care electronic health record
The Transition Project "TransHis" is an electronic patient record used by 230 general practices worldwide to collate data in real-time [7]. All patients whose initial consultations were subsequently linked to a diagnosis of lung cancer were assessed [8]. This allowed us to monitor the evolution of an initial presenting symptom to its final diagnosis [9]. Data extraction was performed on 24th September 2017.

Outcome analysis and statistical methods
For diagnostic analysis, we constructed 2x2 contingency tables for each study, using data collated prior to diagnosis [10]. For the meta-analysis, a random effects model for diagnostic accuracy was used to pool the data, as this accounts for differences in index test threshold, based on patient and/or clinical interpretation of presentations. A measure of the discriminatory power of the index test was calculated using diagnostic odds ratios (DOR). Heterogeneity in results across studies was assessed for each presenting symptom as a subgroup using Cochran's Q and I-squared (I²) statistics [11,12]. Summary Receiver Operating Characteristic (SROC) curves for each presenting symptom were plotted from pooled sensitivity against (1 - pooled specificity) using Moses' model (weighted regression, inverse variance). The area under the curve (AUC) was used to measure diagnostic accuracy. Stata version 13 (StataCorp, USA) was used for the statistical analyses.
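
To make the quantities above concrete, the following is a minimal Python sketch of how a DOR with its 95% confidence interval can be derived from one 2x2 table, and how Cochran's Q and I² can be computed across studies on the log-DOR scale. It is not the authors' Stata code. The function names, the 0.5 continuity correction and the fixed-effect inverse-variance weights used for Q are assumptions made for illustration.

```python
import math


def diagnostic_odds_ratio(tp, fp, fn, tn, z=1.96):
    """Return (DOR, lower, upper, variance of log DOR) for one 2x2 table.

    Assumption: a 0.5 continuity correction is added to every cell
    when any cell is zero, so that the log and the variance are defined.
    """
    if min(tp, fp, fn, tn) == 0:
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    log_dor = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    se = math.sqrt(variance)
    lower = math.exp(log_dor - z * se)
    upper = math.exp(log_dor + z * se)
    return math.exp(log_dor), lower, upper, variance


def heterogeneity(log_effects, variances):
    """Return Cochran's Q and I-squared (as a percentage).

    Uses inverse-variance weights around the weighted mean effect.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared


# Hypothetical counts, for illustration only (not data from the review):
# symptom present/absent in cases (tp, fn) and in non-cases (fp, tn).
dor, lower, upper, variance = diagnostic_odds_ratio(tp=20, fp=80, fn=10, tn=190)
```

A confidence interval whose lower bound falls below 1, as reported for chest pain in this review, means the data are compatible with the symptom having no discriminatory value.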

Results
The search strategy shown in Fig 1 produced 13,430 unique references. Screening of titles, then abstracts, against the inclusion criteria resulted in the selection of 34 studies by the first reviewer (GO). A full text review of the 34 studies was performed by the first and second reviewers independently (GO and BD) with good agreement, kappa of 0.85 (0.430-0.938). After discussion, a final thirteen studies were selected. All findings were reported in accordance with PRISMA guidelines.

Study strengths, limitations and bias assessment
The design and protocol used in each of the selected studies were subject to different types of bias (Table 1). The selected studies include six case series, three case-control and four cohort studies, summarised in Tables 2 and 3. Likelihood ratios (LR) are the most clinically useful outcome measures: the positive LR is the probability of a cancer patient having the symptom divided by the probability of a non-cancer patient having that symptom. Table 3 details those studies where likelihood ratios could be calculated. Five of the selected studies included sufficient data to assess the diagnostic accuracy of symptoms associated with lung cancer using a dichotomous test approach. These data were compared with LRs from the TransHis data.
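
The likelihood ratio defined above follows directly from a 2x2 table. The short Python sketch below shows the calculation under the dichotomous test approach (symptom present or absent). The function name and the example counts are hypothetical and are not taken from any of the selected studies.

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-) for a dichotomous symptom.

    tp, fn: cancer patients with / without the symptom.
    fp, tn: non-cancer patients with / without the symptom.
    """
    sensitivity = tp / (tp + fn)          # P(symptom | cancer)
    false_positive_rate = fp / (fp + tn)  # P(symptom | no cancer)
    specificity = tn / (fp + tn)          # P(no symptom | no cancer)
    lr_positive = sensitivity / false_positive_rate
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical counts, for illustration only.
lr_pos, lr_neg = likelihood_ratios(tp=30, fp=10, fn=70, tn=90)
```

Because both LRs are ratios of proportions within cases and within non-cases, they do not depend on the ratio of cases to controls. This is why case-control designs may still yield useful LRs even though their PPVs cannot be generalised.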

Cohort studies.
Retrospective cohort studies typically use data collected in the electronic health record. They usually exclude data collected in the 6–12 months before diagnosis to address the potential bias from including post-diagnosis symptoms and to minimise the influence of GPs preferentially coding possible lung cancer symptoms when considering this as a potential diagnosis. Cohort studies accounted for 31% of the selected studies. Jones and colleagues (2007) used a symptom-based approach to investigate all diagnoses associated with haemoptysis in a large general practice database (Clinical Practice Research Datalink) of 762,325 UK patients. Of the 4,812 new episodes of haemoptysis, 6.3% were subsequently diagnosed with lung cancer. This study also reported PPVs and positive likelihood ratios (LR+) as shown in Table 3 [13]. Hippisley-Cox and colleagues (2011) determined the hazard ratios for lung cancer in a risk assessment model that considered three clinical predictors (haemoptysis, loss of appetite and weight loss) presenting within 12 months prior to a lung cancer diagnosis. Risk of lung cancer was greatest in patients with haemoptysis: hazard ratio 23.9 (20.6-27.6) in females and 21.5 (19.3-23.9) in males, after adjustment for late-stage diagnosis and the associated shorter time-to-diagnosis (the waiting time paradox) [14]. Walter and colleagues (2015) used a prospective cohort study design and interviewed patients who had been referred to a specialist respiratory clinic by their GP. Half of the referred patients (49.3%) reported that they had presented to their GP with a single first symptom. Almost 40% (>37.8%) presented with more than one presenting symptom that worsened over time. Haemoptysis had the strongest association with lung cancer, with an adjusted hazard ratio of 2.17 (1.63-2.89) (P = 0.00) [15].
Case-control studies.
Case-control studies accounted for 23% of the selected studies and are limited in that outcome measures such as PPVs cannot be generalised beyond the study. They are a product of the selection of cases and controls, not reflecting any natural [18].
Case series studies.
Case series studies accounted for 46% of the selected studies, making this the most common study design observed in this review but the least informative in relation to diagnostic value. The diagnostic value of these symptoms cannot be assessed because patients without lung cancer were not included in the study.
TransHis data.
TransHis is an EHR specifically designed to capture the initial consultation as 'Reason for Encounter' (RfE) and maintain the episode of care structure as an ongoing prospective cohort study. The TransHis data were used to determine the relationship between lung cancer diagnosis and RfE, expressed as odds ratios. Cough, followed by haemoptysis, dyspnoea, weight loss, chest pain and voice symptoms, were the most prevalent RfEs in patients subsequently diagnosed with lung cancer (S5 Table). Constitutional symptoms (tiredness, weight loss, anorexia, fever and sweating) were collectively the third most common. As TransHis data are captured from routine care using a primary-care specific classification (ICPC2) and the odds ratios are relative to 'all consulting patients', we compared the outcomes with our selected studies.
When considering all the selected studies, haemoptysis, cough, dyspnoea, chest pain and constitutional symptoms were found to be the most prevalent presentations. In all studies, haemoptysis, dyspnoea and cough were consistently the most predictive symptoms for lung cancer.
Statistical analysis for diagnostic accuracy of clinical presentations associated with lung cancer.
Five studies enabled us to assess the diagnostic accuracy of symptoms associated with lung cancer [16][17][18]. The pooled diagnostic odds ratios (DOR) for haemoptysis, dyspnoea, cough and chest pain were 6.39 (3.32–12.28), 2.73 (1.54–4.85), 2.64 (1.24–5.64) and 2.02 (0.88–4.60), respectively. The limited availability of studies that fit the criteria for diagnostic value, differences in study design and the differing thresholds for recording presence/absence of a symptom, shown in Tables 2 and 3, created the heterogeneity (I²) observed in the SROC curves (Fig 3). We compared the overall diagnostic value of each presentation from the selected studies with measurable outcome data and TransHis data using likelihood ratios, as shown in Table 4.
The symptom most likely to be observed in lung cancer patients relative to non-lung cancer patients is haemoptysis, followed by dyspnoea, cough and finally chest pain.
Staging at diagnosis of lung cancer.
The tumour stage at diagnosis, or its operability, was indicated in only four of the thirteen studies and most of the diagnosed cases were inoperable or at stages IIa and above. Hence, 31% of selected studies described the prognostic benefits of symptom-based early diagnosis by including data on disease stage and operability at diagnosis [14,15,17-19].
In the most common form of lung cancer, non-small cell, the weighted mean percentages of all cases across the studies were as follows: Stage I 10.7%, Stage II 6.9%, Stage III 43.2% and Stage IV 39.2%. These studies found that fewer than 8.2% of the lung cancer patients were amenable to surgery at diagnosis [20].

Summary of findings
We found that haemoptysis had the greatest diagnostic value in both the selected studies and the TransHis database, followed by dyspnoea, cough and chest pain. The review also indicated that most lung cancer patients are diagnosed at a late stage, when there are limited surgical management options and less favourable clinical outcomes. More precise coding and characterisation of symptoms, such as severity, timing and associated features, in electronic health records such as TransHis may provide sufficient evidence for early symptom-based diagnosis of lung cancer. It is hoped that the introduction of a new and global clinical vocabulary for electronic health records, SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms), will also contribute to better utilisation of electronic health records to improve evidence-based research. However, codes will need to be carefully restricted to a classification of symptoms to enable calculation of odds ratios.

Findings within the context of the current literature
To date, this is the only review to include a meta-analysis of clinical symptoms for the diagnosis of lung cancer. A previously published systematic review based on primary care data showed haemoptysis to be a predictor of lung cancer, but there were insufficient data to perform a meta-analysis [21]. We included studies where the index cases were identified in either primary or secondary care, as long as patients were referred by their GP. We made this decision on the basis that referral to a clinic for investigation of respiratory symptoms represents a cohort of people in whom the GP is considering cancer and, in the absence of better data on the evolution of symptoms over time, may yield useful LRs (but not PPVs). Our findings are consistent with previous findings that haemoptysis is predictive of lung cancer, but in addition demonstrate the diagnostic value of dyspnoea, cough and chest pain [15,18,21,22].
Previously published studies suggest that efforts to expedite the diagnosis of symptomatic cancer are likely to benefit patients in terms of improved survival, earlier-stage diagnosis and improved quality of life [19,23-27]. This review clearly identifies a place for symptom-based diagnosis, as the epidemiology of cancer symptoms is becoming better understood. Risk models that assess prior risk factors and then presenting symptoms could identify high-risk patients for early diagnosis [28].

Strengths and weaknesses of the review
All selected studies used routine data sources, a cost-effective and powerful resource for evidence-based research. Although variability in the study designs creates heterogeneity, there were sufficient data to perform a meta-analysis and determine the diagnostic accuracy of clinical presentations associated with lung cancer. Five of the thirteen studies assessed the association of lung cancer with a specific set of symptoms and did not investigate all symptoms reported in lung cancer patients [13,14,16-18]. As a result, their findings may have missed other symptoms not already known to be associated with lung cancer. Each study provided demographic data on age, sex and smoking status; male smokers over 40 years were found to have the greatest incidence of lung cancer. However, routine data sources can also be subject to bias, such as missing data, coding inconsistencies and work-up bias [29,30]. Thus, these studies can miss the complexities of the clinical assessment necessary for cancer diagnosis. For example, weight loss was found to be the fifth most prevalent presentation prior to diagnosis and, in one study, it was observed even in operable disease, indicative of a presentation associated with early diagnosis [31]. In 62% of the selected studies, weight loss was grouped with constitutional symptoms; therefore, specific analysis of weight loss as an isolated symptom was not possible. More data are required for diagnostic assessment of weight loss because it may prove to be a cost-effective predictor of high-risk patients. These patients could be identified for further investigations to facilitate early cancer diagnosis.

Implications for clinical practice and research
Case series studies represent a majority of the studies into symptoms associated with lung cancer, but this design has no diagnostic benefit because there are no controls. This highlights the importance of devising a study design that will produce clinically significant outcomes that will be of patient benefit. Understanding the precise diagnostic value of symptoms is a powerful tool in clinical decision making [28]. Table 5 outlines three case scenarios where symptomatology is considered in combination with prior risk [32] to establish individualised risk and appropriate management. Up to 20% of all chest X-ray requests from primary care in patients subsequently diagnosed with lung cancer are negative [33,34]. If we consider a high posterior risk of lung cancer, as shown in Case C, even with a negative chest X-ray this patient still meets the criteria for urgent referral (PPV>3%), based on epidemiological risk factors and symptomatology combined using Bayes' theorem to give the posterior risk [17,23,35-39].
Over-reliance on chest X-ray findings and ignoring the patient's prior risk could result in a missed diagnosis. This observation is reflected in the most recent NICE guidelines for referral of suspected cancer, which support better primary care access to high-resolution imaging when indicated for high-risk patients [4].
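
The reasoning about a negative chest X-ray can be sketched as a sequential Bayesian update in Python. This is an illustration only and does not reproduce Table 5. The prior risk, the symptom LR and the chest X-ray specificity below are hypothetical values chosen for the example. The chest X-ray sensitivity of 0.80 is an assumption derived from the "up to 20% negative" figure quoted above. Multiplying LRs in sequence also assumes the findings are conditionally independent.

```python
def to_odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


def to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


def posterior_risk(prior_risk, likelihood_ratios):
    """Update a prior risk with one or more likelihood ratios.

    Assumption: the findings are conditionally independent, so their
    likelihood ratios can simply be multiplied on the odds scale.
    """
    odds = to_odds(prior_risk)
    for lr in likelihood_ratios:
        odds *= lr
    return to_probability(odds)


REFERRAL_THRESHOLD = 0.03  # NICE urgent referral threshold (PPV > 3%)

# Hypothetical inputs, for illustration only.
prior = 0.02             # prior risk from an epidemiological model
lr_symptom = 6.0         # positive LR of the presenting symptom
cxr_sensitivity = 0.80   # assumed from "up to 20% negative" in cancer patients
cxr_specificity = 0.90   # hypothetical; not reported in this review

# LR of a negative chest X-ray: P(negative | cancer) / P(negative | no cancer)
lr_negative_cxr = (1 - cxr_sensitivity) / cxr_specificity

risk_after_symptom = posterior_risk(prior, [lr_symptom])
risk_after_negative_cxr = posterior_risk(prior, [lr_symptom, lr_negative_cxr])
still_meets_threshold = risk_after_negative_cxr > REFERRAL_THRESHOLD
```

Whether a patient remains above the referral threshold after a negative chest X-ray depends on how high the risk was before imaging, which is the point made by Case C.
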
Hamilton et al. (2005) investigated first and subsequent presenting symptoms in lung cancer patients. Raw data from this study were utilised in the Bayesian model for risk of lung cancer described in Table 5. Walter et al. (2015) looked at synchronous symptoms, but reported only the frequency of single or synchronous symptoms at first presentation and did not define the specific symptoms.
In this systematic review we provide supporting evidence for four important symptoms for lung cancer diagnosis: haemoptysis, dyspnoea, cough and chest pain. The review also highlights the difficulties with evaluating the diagnostic value of constitutional symptoms. For the diagnosis of relatively rare conditions such as cancer, population-based prospective cohort studies may never be feasible; hence, Walter and colleagues (2015) used selected high-risk patients. As we reach the limit of what can be achieved with routine data in their current form, we must develop more defined and sophisticated criteria for clinical coding of symptoms and routine risk stratification of patients in real-time during clinical decision making [40,41].
Supporting information
S1 Table. Summary of real-time data from routine general practice for the most common presentations associated with lung cancer patients > 6 months before diagnosis (95% confidence intervals)-Netherlands, Malta, Serbia and Japan since 1995 and including 19700 patients.

Acknowledgments
We would like to thank Dr Jean Karl Soler, Mediterranean Institute of Primary Care, for advice on the use of TransHis, and Professor William Hamilton, University of Exeter, for the original data for his study and comments on the manuscript. The authors gratefully acknowledge infrastructure support from the Cancer Research UK Imperial Centre, the Imperial Experimental Cancer Medicine Centre and the National Institute for Health Research Imperial Biomedical Research Centre.

Case A represents a low risk patient based on symptoms alone and therefore would not require further investigation or referral. When we take into account prior risk defined by age, sex, smoking status and intensity, this patient is at greater risk than the moderate risk patient in Case B below and does require further investigation (chest X-ray). Case C represents a high risk patient based on symptoms and, even with a negative chest X-ray, this patient would require further investigation to exclude lung cancer [17], as 20% of all chest X-ray requests from primary care in confirmed lung cancer patients are negative [33,34]. The current cut-off for urgent cancer referrals in the UK is PPV>3%, so this patient would be considered at high risk and should be investigated further, regardless of the chest X-ray findings.