Factors associated with excess all-cause mortality in the first wave of the COVID-19 pandemic in the UK: A time series analysis using the Clinical Practice Research Datalink

Background Excess mortality captures the total effect of the Coronavirus Disease 2019 (COVID-19) pandemic on mortality and is not affected by misspecification of cause of death. We aimed to describe how health and demographic factors were associated with excess mortality during, compared to before, the pandemic.
Methods and findings We analysed a time series dataset including 9,635,613 adults (≥40 years old) registered at United Kingdom general practices contributing to the Clinical Practice Research Datalink. We extracted weekly numbers of deaths and numbers at risk between March 2015 and July 2020, stratified by individual-level factors. Excess mortality during Wave 1 of the UK pandemic (5 March to 27 May 2020) compared to the prepandemic period was estimated using seasonally adjusted negative binomial regression models. Relative rates (RRs) of death for a range of factors were estimated before and during Wave 1 by including interaction terms. We found that all-cause mortality increased by 43% (95% CI 40% to 47%) during Wave 1 compared with prepandemic. Changes to the RR of death associated with most sociodemographic and clinical characteristics were small during Wave 1 compared with prepandemic. However, the mortality RR associated with dementia markedly increased (RR for dementia versus no dementia prepandemic: 3.5, 95% CI 3.4 to 3.5; RR during Wave 1: 5.1, 4.9 to 5.3); a similar pattern was seen for learning disabilities (RR prepandemic: 3.6, 3.4 to 3.5; during Wave 1: 4.8, 4.4 to 5.3), for black or South Asian ethnicity compared to white, and for London compared to other regions. Relative risks for morbidities were stable in multiple sensitivity analyses. However, a limitation of the study is that we cannot assume that the risks observed during Wave 1 would apply to other waves due to changes in population behaviour, virus transmission, and risk perception.
Conclusions The first wave of the UK COVID-19 pandemic appeared to amplify baseline mortality risk to approximately the same relative degree for most population subgroups. However, disproportionate increases in mortality were seen for those with dementia, learning disabilities, non-white ethnicity, or living in London.

Data Access Arrangements
Indicate with an 'X' the method that will be used to access the data for this study:
Study-specific Dataset Agreement
Institutional Multi-study Licence X
Institution Name: LSHTM
Institution Address: Keppel Street, London WC1E 7HT
Will the dataset be extracted by CPRD? Yes X No X
If yes, provide the reference number:
We will extract data through the Define and Refine on-line tools. CPRD will extract the BMI, smoking and serum creatinine data from CPRD Aurum based on code lists that we provide (

Tracking of deaths in the population over time (mortality) is an essential part of monitoring and predicting the likely course of the COVID-19 pandemic. The number of deaths attributed to COVID-19 is problematic, as the likelihood of stating COVID-19 on the death certificate will have changed over time and may have been inconsistently applied across settings. Excess deaths by week from any cause could provide a more objective and comparable way of assessing the scale of the pandemic and formulating lessons to be learned. Excess mortality is constructed by comparing the deaths over time since the start of the pandemic to the number expected based on the experience of previous non-pandemic years. Crucially, this method has all-cause mortality as the outcome, sidestepping the problems associated with attribution of cause of death to COVID-19.
Standalone national mortality data will allow tracking of excess mortality over time, but CPRD uniquely includes timely data on deaths alongside extensive clinical and demographic data, backed up by periodic linkages to national mortality data. This study will exploit this important combination of demographic/clinical and death data to examine how excess mortality during the course of the pandemic differs in individuals with different characteristics (e.g. age, sex, smoking) including pre-existing illnesses (e.g. respiratory disease, heart disease, diabetes). The study will provide key information to inform clinical guidance, resource planning and policies for protecting the most vulnerable.

C. Technical Summary (Max. 300 words)
In recognition of limitations of cause-specific death data, estimates of excess mortality are routinely used to estimate and compare deaths due to seasonal influenza epidemics. These are based on national death registration data which allows for stratification by age, sex and geographic region. We aim to replicate these analyses for the COVID-19 pandemic in the UK with further stratification by COVID-19 effect modifiers that can be measured in primary care data.
Time series methods will be used in our primary analyses of observed mortality during the whole period (pre-COVID-19 and during COVID-19) using generalised linear models with a negative binomial error structure. The main outcome is mortality, measured in primary care data using the CPRD derived death date and the main exposure is the pandemic period (1st March 2020 onwards). The model will initially include exposure, age, sex, and terms to capture seasonality and underlying year-on-year trends. Interaction terms will be used to investigate excess mortality during the pandemic according to demographic, lifestyle-related and comorbidity characteristics. Relative and absolute differences in excess mortality will be described.
We will compare our primary time series approach to national excess mortality estimates. In a secondary analysis to check conclusions, we will compare weekly mortality with expected mortality based on comparable week-of-year mortality in the years prior to the pandemic (standardised mortality ratio (SMR) approach). We will also explore an approach based on individual-level cohort data, with the pandemic period included as a time-updated variable. In secondary analyses we will use updated linked ONS mortality data when available to repeat the analysis using the ONS death date and HES APC data to improve ascertainment of comorbidities.

D. Outcomes to be Measured
All-cause mortality

E. Objectives, Specific Aims and Rationale
Objective and aims: The overall objective is to inform public health policy by providing timely evidence on how key individual-level demographic characteristics and comorbidities affect excess mortality risk over time during the course of the COVID-19 pandemic.
Specific aims (primary)
1. Validate approaches/data by estimating all-cause excess mortality, by age, sex and co-morbidity grouping, over time by week, during the COVID-19 pandemic in the UK, and compare with similar estimates derived from national death registration data.
2. Use time series methods to investigate how all-cause excess mortality varies by key individual-level characteristics (demographics, lifestyle-related, comorbidities).

Specific aims (secondary)
3. Check conclusions from the primary time series analysis using an SMR-based approach that compares observed and expected mortality for given week of the year.
4. Conduct an exploratory cohort analysis using individual-level data to investigate how excess mortality varies by characteristics including specified combinations of comorbidities.

Rationale:
Excess deaths by week from any cause could provide the most objective and comparable way of assessing the net impact of the pandemic as it progresses, and avoids the significant problems associated with attributing individual deaths to COVID-19. Use of alternative techniques in secondary and exploratory analyses will allow us to estimate the robustness of our model and potentially extend our analyses to less common comorbidities and multimorbidity groups.

F. Study Background
Accurate and consistent estimation of key population parameters is required to inform national and global strategies to mitigate and suppress progression of the coronavirus disease 2019 (COVID-19) pandemic. Essential parameters include mortality due to COVID-19. However, variation in testing and in the recording of causes of death in national death registers between countries, over time and by underlying health conditions makes this challenging 1. Excess mortality over time may therefore be a more objective and comparable way of assessing the scale of the pandemic and formulating lessons to be learned. It is constructed by comparing the deaths over time in 2020 to the number expected based on the experience of previous non-pandemic years. Crucially, this method has all-cause mortality as the outcome, sidestepping the problems associated with attribution of cause of death to COVID-19.
It is possible to estimate excess mortality by age, sex and geographic region using national death registration data, although there is a delay in publishing these 1. These methods have been used to measure excess deaths in previous influenza seasons 2 and are being used for the COVID-19 pandemic 3.
Current methods using death registration data alone do not allow stratified estimation of excess deaths for groups of people with clinical risk factors for COVID-19. CPRD uniquely includes timely data on deaths 4 alongside extensive clinical and demographic data, backed up by periodic linkages to national mortality data. This study will exploit this important combination of demographic/clinical and death data to examine how pre-existing illnesses (e.g. respiratory disease, heart disease, diabetes) and risk factors such as BMI and smoking are associated with excess mortality risk during the course of the pandemic. The study will provide key information to inform clinical guidance, resource planning and policies for protecting the most vulnerable.

G. Study Type
Descriptive

H. Study Design
For our primary analysis, we will use time series methods, modelling weekly counts of deaths through regression models that incorporate seasonal parameters and annual trends. Interaction terms with pandemic indicators will be used to estimate excess mortality after the onset of the pandemic and effect modification by patient characteristics. To check the robustness of the results, we will use SMR methods, which split the time-series data into pre-COVID-19 and COVID-19 periods and use weekly aggregated data from the former to predict expected mortality rates in the latter.
In exploratory analyses, we will develop individual cohort designs which may allow us to estimate effects for more granular patient characteristics (e.g. rare comorbidities and multimorbidity groups). Differences between models will be examined and reported on.

I. Feasibility counts
Counts based on April 2020 builds
Columns: CPRD GOLD, CPRD Aurum
Row: Alive and registered with up to standard* practice on 1 January 2020

J. Sample size considerations
We will use all available data in CPRD GOLD and CPRD Aurum to maximise regional coverage and the number and combination of comorbidities that we can study whilst modelling changes on a weekly basis. Data extraction will be minimised by using the CPRD GOLD Define and Refine tools to ascertain covariates of interest only, rather than extracting full clinical data (as discussed with Rebecca Ghosh and Dan Dedman).
Methods for computing sample size and power are not well developed for time series analysis, so we took an empirical approach: we combined recent CPRD GOLD denominator data with a simulated pandemic starting in week 9 of the final year, which doubles death rates and affects patients with a given comorbidity (simulated to have similar prevalence to cancer) twice as badly. Using the time series approach, the IRR (95% CI) was 1.89 (1.74-2.07) for the pandemic and 1.97 (1.85-2.08) for the comorbidity interaction. The confidence intervals included the true values (2.0 and 2.0) with adequate precision. We therefore expect the coverage to be good with this study size, even when estimating interaction terms. In the weekly time series dataset created from these simulated data, 8% of cells had zero denominators, necessitating some collapsing of strata; this proportion would rise further if stratifying by more than one comorbidity. We therefore need the maximum available data (using both GOLD and Aurum) to explore joint associations of combinations of covariates using mutually adjusted models.
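The empirical check described above can be illustrated with a much-simplified sketch. It is not the simulation we ran: it ignores seasonality and trends, assumes constant denominators, and uses crude rate ratios with Wald intervals rather than a negative binomial regression. The weekly death counts, the number of pandemic weeks and all function names are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson count by counting unit-rate exponential arrivals before `mean`."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_deaths(rng, weekly_mean, weeks):
    """Total simulated deaths over `weeks` weeks with a constant weekly mean."""
    return sum(poisson(rng, weekly_mean) for _ in range(weeks))


def rate_ratio(pre_deaths, pre_weeks, pandemic_deaths, pandemic_weeks):
    """Crude pandemic vs pre-pandemic rate ratio and the SE of its log.

    Assumes the population at risk is the same in every week.
    """
    ratio = (pandemic_deaths / pandemic_weeks) / (pre_deaths / pre_weeks)
    se_log = math.sqrt(1 / pre_deaths + 1 / pandemic_deaths)
    return ratio, se_log


def wald_ci(ratio, se_log, z=1.96):
    """95% Wald confidence interval computed on the log scale."""
    return ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)


def run_simulation(seed=2020, pre_weeks=260, pandemic_weeks=8):
    rng = random.Random(seed)
    # Hypothetical pre-pandemic mean weekly deaths in each group.
    baseline = {"no_comorbidity": 190.0, "comorbidity": 50.0}
    # True effects: pandemic doubles rates; comorbidity group hit twice as badly.
    true_rr = {"no_comorbidity": 2.0, "comorbidity": 4.0}
    results = {}
    for group, mean in baseline.items():
        pre = simulate_deaths(rng, mean, pre_weeks)
        during = simulate_deaths(rng, mean * true_rr[group], pandemic_weeks)
        results[group] = rate_ratio(pre, pre_weeks, during, pandemic_weeks)
    rr_ref, se_ref = results["no_comorbidity"]
    rr_com, se_com = results["comorbidity"]
    interaction = rr_com / rr_ref  # ratio of rate ratios
    se_int = math.sqrt(se_ref ** 2 + se_com ** 2)
    return {
        "pandemic": (rr_ref, wald_ci(rr_ref, se_ref)),
        "interaction": (interaction, wald_ci(interaction, se_int)),
    }


if __name__ == "__main__":
    for name, (estimate, (low, high)) in run_simulation().items():
        print(f"{name}: {estimate:.2f} ({low:.2f}-{high:.2f})")
```

Repeating the run over many seeds and counting how often each interval contains 2.0 would give an empirical estimate of coverage.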
We will use the July 2020 build for the initial primary care only analyses because these data are the most timely and updated monthly. Our study period will end on 30 April 2020 as there appears to be a two-month delay in registering deaths in primary care data. Analyses will be updated monthly.

K. Planned use of linked data (if applicable):
Additional analyses will be completed when set 19 of the linkage data is released, assuming that ONS linked data coverage will then include the pandemic period. We request access to linked ONS mortality data so that analyses can be repeated using ONS certified death as the outcome and linked HES admitted patient care data to improve ascertainment of ethnicity and comorbidities for stratification in later analyses.
Linked practice-level Carstairs quintiles are requested to allow stratification by area-based deprivation in the initial analyses using primary care data only. We also request urban-rural data, as COVID-19 infections and mortality will be affected by deprivation, population density and distance from hospital. Patient-level Carstairs quintiles are requested for the linked analysis.
Risk of identification of practice or patient location will be mitigated by restricting users who have access to the raw CPRD and linked data to Helen Strongman and Krishnan Bhaskaran who understand the increased risk of identification with these data and have completed mandatory data protection training at LSHTM. This will be implemented by password protecting folders where data are stored in accordance with LSHTM IT policy. Patient and practice identifiers will be pseudonymised in datasets released to other investigators.
Analyses using linked data will be restricted to English practices and patients who are eligible for linkage, and to the time periods of linked data coverage.
This study will use demographic/clinical and death data available in CPRD primary care and linked data sources to examine how pre-existing illnesses (e.g. respiratory disease, heart disease, diabetes) and risk factors such as BMI and smoking are associated with excess mortality risk during the course of the COVID-19 pandemic in the UK. The findings will provide key information to inform clinical guidance, resource planning and policies for protecting the most vulnerable people in the UK, including in England and Wales, during the pandemic.

L. Definition of the Study population
Data from practices that have contributed to both CPRD GOLD and CPRD Aurum will be removed from the CPRD Aurum cohort to maximise the power of analyses that can only be completed in CPRD GOLD (i.e. stratification by number of people in household).

1) Time series and SMR approaches
The study population will be the adult population of all practices with continuous follow-up from 1st March 2014 and with last data collection dates no more than 2 months before the date of the most recent CPRD build ("time-series cohort").

2) Cohort approach
The population under investigation will be the adult population contributing CPRD follow-up at any time after 1st March 2014.

M. Selection of comparison group(s) or controls
For the time series and SMR approach, the comparison is between observed aggregated mortality before and during the pandemic, investigated using respectively interaction terms with pandemic indicators, or comparisons with expected mortality modelled using pre-COVID-19 weekly aggregated data.
For the cohort approach, the comparison is between mortality during the pandemic period and mortality during the pre-pandemic period where information is kept at the individual-level (as opposed to aggregated).

N. Exposures, Outcomes and Covariates
The main exposure of interest is the COVID-19 pandemic starting from 1st March 2020.
The main outcome is mortality, identified from primary care data using the CPRD derived death date. In secondary analyses we will use updated linked ONS mortality data when available to repeat the analysis using the ONS death date.
Covariates of interest for stratification are:
- Geographical region
- Age (initial time series counts to be generated in 5-year categories, then collapsed further for analysis)
We will use code lists and definitions developed by the LSHTM Health Protection Research Unit on Immunisation where available, and where needed develop new code lists using systematic dictionary keyword searches, manual review, and clinical sign-off; these will be published online and linked in research outputs.
*Current UK government guidance on who should be considered vulnerable has been largely based on policies developed for previous epidemic respiratory viruses, notably influenza. Understanding of COVID-19 risk factors is improving rapidly. We will submit a protocol amendment should new COVID-19 risk factors that can be measured using the available data emerge during the study.

O. Data/ Statistical Analysis
To ensure our conclusions are robust, several approaches will be taken (with different strengths and limitations). We would expect the overall patterns and conclusions of the analyses to be similar. Any discrepancies in the patterns or results will be investigated and reported. In each case we will use data up to two months before the CPRD build date (i.e. data to end April 2020, from the build released in early July 2020) and analyses will be updated monthly.

1) Primary analysis - time series approach
We will derive weekly counts of numbers of deaths in contributing practices (see section L "Study Population") from the start of the study period to the end of the study period. Population denominator sizes (number under active follow-up) will also be extracted. We will extract these counts overall, and stratified by age, sex, region and other factors (see section N "Covariates"). Where these factors are time-varying, counts will reflect changes over time.
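As a minimal sketch of the aggregation step, the code below derives weekly deaths and numbers at risk per stratum from individual follow-up records. The record layout (`start`, `end`, `died`, `stratum`) is hypothetical and is not the CPRD data model. For brevity the stratum is fixed per patient, and a patient counts as at risk in a week only if under follow-up on its first day; time-varying factors would need records split by period.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_counts(patients, study_start, study_end):
    """Return {(week_start, stratum): [deaths, at_risk]} for each study week.

    A patient is at risk in a week if follow-up covers the first day of that
    week. A death is counted in the week containing the patient's end date.
    """
    counts = defaultdict(lambda: [0, 0])
    week_start = study_start
    while week_start <= study_end:
        week_end = week_start + timedelta(days=6)
        for p in patients:
            if p["start"] <= week_start <= p["end"]:
                cell = counts[(week_start, p["stratum"])]
                cell[1] += 1  # denominator: under active follow-up
                if p["died"] and p["end"] <= week_end:
                    cell[0] += 1  # numerator: died during this week
        week_start += timedelta(days=7)
    return dict(counts)


# Hypothetical example records.
patients = [
    {"start": date(2019, 1, 1), "end": date(2020, 3, 10), "died": True, "stratum": "dementia"},
    {"start": date(2019, 1, 1), "end": date(2020, 12, 31), "died": False, "stratum": "dementia"},
    {"start": date(2020, 3, 5), "end": date(2020, 12, 31), "died": False, "stratum": "no dementia"},
]
table = weekly_counts(patients, date(2020, 3, 2), date(2020, 3, 15))
```

Each value holds the death count and the denominator that would enter the regression as an offset.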
We will fit generalised linear models with a negative binomial error structure to model the counts of deaths, taking account of variable denominator sizes [the approach we took to deal with variable denominators in the Matthews et al. media and statins paper, BMJ 2016 5]. The model will initially include the outcome (mortality), exposure (2020 pandemic period vs pre-pandemic), age, sex, interactions between the exposure and age and sex, Fourier terms (a harmonic function of calendar week) to capture seasonality, a continuous function of calendar year (linear, or higher order polynomial if there is evidence of non-linearity) to capture any underlying trends, and, if relevant, dummy indicators to capture extreme flu seasons. When investigating excess mortality within subgroups, further stratification factors will be added individually, and in combinations where there is sufficient information.
Interactions between the exposure and stratification factors will provide estimates of the relative effect of the pandemic in strata.
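One way of writing the model described above is sketched below. The notation is introduced here for illustration and is not taken from the protocol: D_{wg} and N_{wg} are deaths and numbers at risk in week w and stratum g, P_w is the pandemic indicator, x_g is a single binary stratification factor, t_w is the week of the year, y_w is the calendar year, K is the number of harmonic pairs and α is the dispersion parameter. Age and sex terms, their interactions with P_w, and any flu-season indicators are omitted for brevity, and the trend is shown as linear.

```latex
\begin{aligned}
D_{wg} &\sim \mathrm{NegBin}(\mu_{wg}, \alpha), \\
\log \mu_{wg} &= \log N_{wg} + \beta_0 + \beta_P P_w + \beta_x x_g + \beta_{Px} P_w x_g \\
&\quad + \sum_{k=1}^{K} \left[ a_k \sin\!\left(\frac{2\pi k\, t_w}{52}\right)
      + b_k \cos\!\left(\frac{2\pi k\, t_w}{52}\right) \right] + \delta\, y_w .
\end{aligned}
```

Under this parameterisation, exp(β_P) is the relative rate of death during the pandemic in the reference stratum, and exp(β_{Px}) is the ratio of that relative rate in the stratum with x_g = 1 to the reference stratum.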

2) Secondary analysis - SMR-based approach
We will use the stratified weekly counts generated for the time series approach, and collapse years into two periods (all pre-pandemic vs 2020 pandemic). Age/sex/comorbidity-specific expected weekly death rates for the pandemic period will be generated using the pre-pandemic data, and compared with observed weekly deaths in the pandemic period. Risk factors will be analysed one at a time, and combined where numbers allow. Standardised mortality ratios will be calculated as observed / expected and 95% confidence intervals will be generated from a saturated Poisson model.
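The observed / expected calculation can be sketched as follows. The interval here uses Byar's approximation to the exact Poisson limits, which is a swapped-in technique for illustration and not the saturated Poisson model named above. The stratum labels, rates and person-weeks are hypothetical.

```python
import math


def expected_deaths(pre_pandemic_rates, pandemic_person_weeks):
    """Sum of stratum-specific pre-pandemic rates times pandemic person-weeks."""
    return sum(
        pre_pandemic_rates[stratum] * person_weeks
        for stratum, person_weeks in pandemic_person_weeks.items()
    )


def smr_with_ci(observed, expected, z=1.96):
    """SMR = observed / expected, with Byar's approximate 95% limits.

    Requires observed > 0.
    """
    smr = observed / expected
    lower = observed * (1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))) ** 3
    upper = (observed + 1) * (
        1 - 1 / (9 * (observed + 1)) + z / (3 * math.sqrt(observed + 1))
    ) ** 3
    return smr, lower / expected, upper / expected


# Hypothetical weekly death rates per person (pre-pandemic) and person-weeks at risk.
rates = {"40-64": 0.0001, "65+": 0.001}
person_weeks = {"40-64": 100_000, "65+": 50_000}
expected = expected_deaths(rates, person_weeks)  # about 60 deaths
smr, low, high = smr_with_ci(observed=90, expected=expected)
```

An SMR above 1 with a lower limit above 1 would indicate more deaths than expected from the pre-pandemic experience of the same strata.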

3) Secondary analysis - Cohort approach
Using data from 1st March 2014 to the end of the study period, we will fit a proportional hazards model for time to death on an underlying age timescale (with extensions if the proportional hazards assumption is not met). Individuals will enter the analysis 1 year after entry into CPRD to allow classification of comorbidities. Time-updated calendar month and year will be included in the model, alongside a time-updated indicator for the pandemic period, and individual-level covariates. Interactions between the pandemic indicator and covariates will be added one at a time initially and in combinations, where possible, to describe differences in excess mortality between risk factor groups.
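A time-updated pandemic indicator is usually implemented by splitting each individual's follow-up at the pandemic start date. The sketch below shows that step only, using inclusive day ranges; it does not fit the proportional hazards model. The function name and tuple layout are hypothetical.

```python
from datetime import date, timedelta

PANDEMIC_START = date(2020, 3, 1)  # exposure start date used in this protocol


def split_follow_up(entry, exit_, died, pandemic_start=PANDEMIC_START):
    """Split one person's follow-up into pre-pandemic and pandemic intervals.

    Returns a list of (first_day, last_day, pandemic, event) tuples with
    inclusive dates. The death, if any, is attached to the final interval.
    """
    if exit_ < pandemic_start:
        return [(entry, exit_, 0, died)]
    if entry >= pandemic_start:
        return [(entry, exit_, 1, died)]
    day_before = pandemic_start - timedelta(days=1)
    return [
        (entry, day_before, 0, False),       # survived the pre-pandemic part
        (pandemic_start, exit_, 1, died),    # outcome falls in the pandemic part
    ]


# Hypothetical person followed from 2019 who died in April 2020.
rows = split_follow_up(date(2019, 1, 1), date(2020, 4, 10), died=True)
```

The resulting rows can be passed to any survival routine that accepts start-stop data, with the `pandemic` column interacting with covariates as described above.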
Validation of approaches / data
We will compare estimates of all-cause excess age/sex-specific mortality over time by week during the course of the COVID-19 pandemic with similar estimates derived from national death registration data.

P. Plan for addressing confounding
This is not a hypothesis testing/causal modelling study, so confounding per se is not an issue. The role of covariates will be explored by stratification in the main analyses and by adjustment and stratification in the exploratory individual cohort analyses.

Q. Plans for addressing missing data
Analyses stratified by BMI, smoking and ethnicity are likely to have missing data. In our primary analyses, we will create binary yes/no variables for obesity, ever smoked and white ethnicity, assuming that individuals with missing data are not obese, have never smoked and are white. We will conduct sensitivity analyses restricted to the study population with data available, with appropriate assessment of possible selection bias affecting the results. We will also conduct sensitivity analyses in which missing obesity, smoking and ethnicity data are randomly recoded as taking any value other than the reference, to assess the impact of misclassification bias.
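The three codings of a partially missing covariate can be sketched as below. This is one reading of the plan, in which each missing value in the final sensitivity analysis is assigned a randomly chosen non-reference category; the function names and category labels are hypothetical.

```python
import random


def primary_coding(values, reference):
    """Primary analysis: treat missing (None) as the reference category."""
    return [reference if v is None else v for v in values]


def complete_cases(values):
    """Sensitivity analysis 1: keep only records with the covariate recorded."""
    return [v for v in values if v is not None]


def random_recode(values, non_reference, seed=0):
    """Sensitivity analysis 2: assign missing values a random non-reference category."""
    rng = random.Random(seed)
    return [rng.choice(non_reference) if v is None else v for v in values]


# Hypothetical ethnicity records with missing values.
ethnicity = ["white", None, "black", None]
primary = primary_coding(ethnicity, reference="white")
restricted = complete_cases(ethnicity)
recoded = random_recode(ethnicity, non_reference=["black", "south asian"])
```

For a binary variable such as obesity, the only non-reference value is "yes", so the random recode reduces to treating every missing value as obese.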

R. Patient or user group involvement
There is no patient or user group involvement. This is a broad study and does not focus on a specific disease group, and the requirement to proceed on short timescales to inform government policy precludes consultation with user groups.

S. Plans for disseminating and communicating study results, including the presence or absence of any restrictions on the extent and timing of publication
Results will be released at the earliest opportunity in order to inform policy in a timely way, including release of data on a preprint server. Peer review and formal publication will proceed in due course.

Conflict of interest statement:
There are no conflicts of interest to declare.

T. Limitations of the study design, data sources, and analytic methods

Data sources
Our initial analyses will use primary care data only. These are the most timely data available to us, as they are updated monthly and we believe that deaths are recorded within two months. Small lags have been reported between the date of death in primary care records and the date in death registration data. We will therefore validate our approach through comparison with nationally reported excess mortality data.
Measurement of covariates such as ethnicity and conditions treated in hospital will also be limited by the use of primary care data alone. We will acknowledge these limitations and repeat our analyses using linked HES APC and mortality data when available.

Analytic methods
Time series and SMR analyses cannot confirm a causal link between the pandemic and the observed changes in mortality, or disentangle direct impacts of COVID-19 from indirect impacts of the pandemic and control measures. The design accounts for individual-level factors such as smoking and obesity that remain fairly constant or change relatively slowly over time within the population, but it is possible that other external factors played a role in the observed changes (e.g. influenza, air pollution). In the first few months of the pandemic, the impact of these external factors is likely to be minimal compared to the impact of COVID-19.
Through stratification within our time series and SMR analyses, we will be able to show how the effect size differs between people with and without individual risk factors and two-way combinations of comorbidities. We will not be able to fully explore the impact of multimorbidity. However, through exploration of individual cohort methods, we hope to be able to estimate independent effects of risk factors, assuming no residual confounding.