Assessing the importance of demographic risk factors across two waves of SARS-CoV-2 using fine-scale case data

  • Anthony J. Wood,

    Roles Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Roslin Institute, University of Edinburgh, Midlothian, United Kingdom

  • Aeron R. Sanchez,

    Roles Formal analysis

    Affiliation Roslin Institute, University of Edinburgh, Midlothian, United Kingdom

  • Paul R. Bessell,

    Roles Formal analysis

    Affiliation Roslin Institute, University of Edinburgh, Midlothian, United Kingdom

  • Rebecca Wightman,

    Roles Formal analysis

    Affiliation Edinburgh Medical School, University of Edinburgh, Edinburgh, United Kingdom

  • Rowland R. Kao

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Roslin Institute, University of Edinburgh, Midlothian, United Kingdom, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, United Kingdom


For the long term control of an infectious disease such as COVID-19, it is crucial to identify the most likely individuals to become infected and the role that differences in demographic characteristics play in the observed patterns of infection. As high-volume surveillance winds down, testing data from earlier periods are invaluable for studying risk factors for infection in detail. Observed changes in time during these periods may then inform how stable the pattern will be in the long term. To this end we analyse the distribution of cases of COVID-19 across Scotland in 2021, where the location (census areas of order 500–1,000 residents) and reporting date of cases are known. We consider over 450,000 individually recorded cases, in two infection waves triggered by different lineages: B.1.1.529 (“Omicron”) and B.1.617.2 (“Delta”). We use random forests, informed by measures of geography, demography, testing and vaccination. We show that the distributions are only adequately explained when considering multiple explanatory variables, implying that case heterogeneity arose from a combination of individual behaviour, immunity, and testing frequency. Despite differences in virus lineage, time of year, and interventions in place, we find the risk factors remained broadly consistent between the two waves. Many of the observed smaller differences could be reasonably explained by changes in control measures.

Author summary

The COVID-19 pandemic has seen unprecedented amounts of high-quality data collected for a human disease. For longer-term control in the absence of widespread testing, these data are invaluable for understanding who amongst the population is at the highest risk of infection. In this work we fit the detailed distributions of COVID-19 cases over Scotland, across two infection waves driven by different variants, to identify risk factors. These were at a time when Scotland had substantial population immunity from prior infection and vaccination, and strict control measures were being relaxed. Differences across the waves may then indicate how stable the pattern of infection will be in the longer term. Despite Scotland’s high geographic and demographic diversity, we effectively fit the case distribution in both waves, and find only minor variation between the two. Uniquely, our model was informed by the volume of negative COVID-19 lateral flow tests, and we find that a high rate of negative test reporting was a risk factor for a high rate of cases. This, combined with high variability in testing across demographics, leads us to suggest that patterns in reported case data may in fact be quite different to those of all infections, reported and unreported.

1 Introduction

A key challenge in the long term control of an infectious disease is to identify predictable patterns of incidence. The emergence and spread of the SARS-CoV-2 virus saw restrictions imposed globally on everyday life to control the spread of COVID-19 infection, and to protect individuals at highest risk of severe disease. While as of March 2023 few to no restrictions remain in place in Scotland, as in the rest of the UK, randomised testing [1] and hospital admissions [2] indicate continued widespread transmission. The winding down of community testing and other surveillance is making it more difficult to track the transmission patterns of COVID-19 in detail.

Typically, identifying risk factors for infection relies on disease surveillance studies. While these studies can be powerful and provide important insights [3–6], they are often expensive, laborious and time-consuming. “Big Data” in the health sciences offers an opportunity to gain some of the same insights using routinely collected data. The availability of COVID-19 case data at fine spatial scales with detailed metadata enables us to identify important health-related risks, with the data collected during the pandemic being made available to researchers in close to real-time.

In this work we aim to identify risk factors for COVID-19 cases in Scotland, and their change over time, to serve as an indicator for how the longer-term profile of infection may evolve. We fit the case distributions of two different waves of COVID-19, with a machine learning model informed by a range of explanatory variables relating to geography and demographics.

The first COVID-19 case in Scotland was identified on 1st March 2020 [7]. The Scottish Government imposed strict “lockdown” non-pharmaceutical interventions (NPIs) on 23rd March 2020 [8]. While initially applied at the national level, following the initial lockdown period NPIs were adjusted by local authority (administrative areas with populations ranging between 22,540–635,130) through a “levels”-based system [9]. The seeding and rapid spread of the B.1.1.7 lineage (termed the “Alpha” variant) in December 2020 led to a tightening of NPIs and a second lockdown [10, 11]. A mass vaccination programme began in December 2020 [12, 13], prioritising the elderly and healthcare workers, with all adults eventually eligible.

We focus on case data gathered between May 2021 and January 2022, a period that saw the steady relaxation of nearly all NPIs [14]. This period had two major waves of infection: the first from May 2021 triggered by the B.1.617.2 lineage (“Delta”), and a second wave from November 2021 by the B.1.1.529 BA.1 lineage (“Omicron”). The deletion of two specific amino acids in the Omicron sub-variant distinguished it from most co-circulating variants including Delta, in PCR tests that have an accompanying “S-gene” test result [15]. A high-capacity testing programme was in place throughout, with free-of-charge lateral flow testing strongly encouraged, and PCR testing mandated for those with symptoms, or a lateral flow positive.

Earlier work has exploited finely-grained case data to highlight risk factors for cases and severe outcomes including (but not limited to) sex [16–18], population density [19–21], deprivation [22–25], occupation [26–28], and age [29–31]. Similar studies have incorporated movement data [32] to demonstrate the protective impact of NPIs that restrict mobility [21, 33–37]. Many of these studies focus on the “first wave” of infection, during which strict NPIs were imposed and no population immunity had been established. This study focuses on a more advanced period moving away from NPIs, with the conditions for disease spread comparatively less “exceptional”. This is especially the case for the Omicron wave. A unique feature of our model is the inclusion of lateral flow test taking frequency. The proportion of infections that end up reported is likely to depend on testing propensity, and we consider how that may lead to distortions in the case distribution.

Our main finding is that the risk factors for cases remained broadly consistent across both waves. Differences between the two waves either offer relatively small scale changes in demographic risk or are consistent with the impact of changes in approaches to control.

2 Results

The period from November 15th 2021 to January 6th 2022 covers the first outbreak and peak of the B.1.1.529 lineage (BA.1 sublineage, hereafter referred to as the Omicron variant), which has an S-gene “dropout” test signature. Prior to this, the B.1.617.2 lineage (Delta variant), which has an S-gene positive test signature, was dominant. From 15th November 2021, S-gene dropout cases rise consistently, and all subsequent “dropout” cases are assumed to be Omicron. Remaining S-gene positive cases are presumed to be Delta, consistent with nationwide sequence data [38].

2.1 Time evolution and early patterns of spread

We identified 385,558 cases between November 15th 2021 and January 6th 2022, of which 227,286 were likely Omicron. From 1st May 2021 to 7th September 2021 we identified 269,838 cases, of which 229,073 were likely Delta. The remaining cases in these periods (those with no S-gene result, or a different result) are excluded. The start date for each of these periods is the first date from which there are consistent rises in cases that are likely the new variant.

Omicron cases had a doubling time (the time taken for newly reported daily cases to double) of 2.9 days over the first 28 days, compared to 6.2 days for Delta (Fig A in S1 Text). Over half of all data zones (DZs, census areas typically comprising 500–1,000 residents) had reported an Omicron case in the wave within 29 days, whereas for Delta this took 39 days (Fig B in S1 Text).

The reproduction number Rt consistently rose for Omicron, peaking at above 2 for nearly all local authorities (LAs) 28 days into the outbreak, and only consistently falling below 1 after 50 days (Fig C in S1 Text). Reproduction numbers for Delta are less consistent between LAs; while the number generally remains above 1 for most LAs in the period, there is no coherent peak at the start of the wave.

In the intermediate period during which Omicron became dominant and Delta declined, the age distributions by variant differed (Fig D in S1 Text). Taking the mid-points of the five-year age brackets, the mean age of the Delta-type cases was 3.9 years lower than that of the Omicron-type cases (31.8 years compared to 35.7 years). A Student’s t test shows this difference to be statistically significant (t = −52.2, p < 0.001). This was the case from relatively early on when Omicron accounted for at least 5% of cases. However, the median ages are equal (both 32.5 years), as in the Omicron-type cases there is a trough in those aged 0–14, with fewer than 50% of cases in this age group Omicron, but then a peak in the 20–29 age group.

2.2 Case distribution and model fit

Fig 1 shows the distribution of COVID-19 cases for the Omicron and Delta waves broken down by age, sex, prior cases (serving as a proxy for prior immunity from infection), deprivation and health board. Omicron case rates were highest in younger adults, peaking at 90 cases/1,000 in ages 20–24. There was only a small difference in rates between men and women. Case rates were much lower amongst those that had tested positive for COVID-19 previously. Fig 2 shows case rates per DZ. Geographically, case rates fall with increasing rurality, most notably in Orkney, Shetland and the Western Isles (all island communities). The trend with respect to multiple deprivation decile is bimodal, with higher rates towards the highest and lowest deciles.

Fig 1.

Summary of 227,286 Omicron COVID-19 cases in Scotland between November 15th 2021 and January 6th 2022 (blue, filled), and 229,073 Delta cases from 1st May 2021 to 7th September 2021 (green, filled). The full population (N = 5,465,169) is broken down by age range, prior case status (whether a person had previously reported a COVID-19 case prior to that specific wave, and when), deprivation (of place of residence, per the SIMD decile, with 1 the most deprived), rurality (of place of residence, per the census Urban/Rural Classification) and location (at the level of Scottish health board). Cases are given per 1,000 people in that group (with subpopulation N recorded on the axis labels). The corresponding case rates as fit by our models are superimposed. Note that the subpopulations in the prior case status plot change across waves, due to being at different points in time.

Fig 2. COVID-19 cases in Scotland over the Delta period (A) as compared to Omicron (B), with focus on the Greater Glasgow region (C, D).

Each point indicates the population-weighted centroid of a DZ, with the colour representing the number of cases reported. Base maps obtained from Natural Earth [39].

The fit case rates from our random forest regression models are overlaid onto Fig 1. We achieve a good fit to these larger-scale trends. The model slightly under-fits the age ranges 15–24, where case rates were the highest overall. Variable importance outputs are presented in Fig G in S1 Text, with node purity and accuracy loss.

Fig 3A shows model performance at DZ level, comparing observed cases to fit cases. Beginning with Omicron cases, our full model explains 70% (fit: 71%, test: 62%) of local variation in the case distribution (R-squared for case numbers, aggregated at a DZ level), with a poorer fit for cohorts with very high case counts. A “reduced” random forest model informed by population and population density alone explained 59% (fit: 60%, test: 55%) of variation. A model informed by only population/deprivation rank explained 53% (fit: 53%, test: 51%), and one informed by only population/age explained 48% (fit: 48%, test: 51%). Fig 3A shows further deviation of the data-fit slopes away from the diagonal for these “reduced” models.

Fig 3. Performance of different models.

(A) comparing observed cases to fit cases at DZ level. Each point represents a DZ. Points deviating from the diagonal indicate DZs with less accurate fits. The full model is compared with performance of reduced models informed with only population, and one of either age, overall deprivation rank, or population density. Also shown is residual clustering as measured by the Moran’s I statistic, at different physical distances (B) and network-based distances (C). Higher values represent higher autocorrelation between model residuals, when comparing DZs sitting within a given locus. DZs are defined as nearest neighbours of one another if they share a boundary.

Considering now earlier Delta cases from 1st May to 7th September 2021, the geographical distribution (Fig 2) is visually similar, with a concentration of high case rates in the denser “central belt”. Cases skewed slightly younger (Fig 1), with the highest rates within ages 15–19. The distribution with respect to deprivation decile remains bimodal, with higher rates in both the most and least deprived DZs. Model performance was similar, explaining 72% (fit: 73%, test: 61%) of DZ-level variation.

Fig 3B and 3C show that, for both the Delta and Omicron models, the autocorrelation of residuals (as measured by the Moran’s I statistic, Section 4.5) within 1km is 0.35, falling to 0.15 at 5km, and 0.05 at 50km. The reduced models exhibit much higher residual autocorrelation, with the density-only model performing best, but with autocorrelation persisting over larger distances (see Fig F in S1 Text for a map view of residuals).

2.3 Accumulated local effects

Fig 4 shows the accumulated local effects (ALEs) of all explanatory variables in the model (see Section 4.4 for definition).

Fig 4. Accumulated local effects across all explanatory variables.

For each variable, the x-axis represents the range of values of that variable in the data, and the y-axis (note scale differences for population, age, sex and prior case status) is the ALE for that variable value. The overall magnitude of the ALE represents the relative size of the effect.

Population, age, sex, and prior case status have ALEs that follow the empirical distributions observed in Fig 1; ALEs are strongly positive for ages between 15–40, and those that had never reported a case before.

Beyond these variables, Fig 4 shows that features such as low population density, high vaccination uptake, a low mean household size, and a low rate of negative LFD test reporting are protective. We note that for vaccination uptake, the protective value at zero is likely an artefact arising from cohorts with ages 0–9 that were not eligible.

The effects for many variables associated with social deprivation such as the ratio of working age people with no qualifications and the rate of income deprivation (see Section B.2 in S1 Text for full descriptions) are weaker. This is consistent with the small degree of deprivation-level variation seen in Fig 1.

The directionality of the ALEs remains broadly consistent across both waves. Some risk factors were more pronounced in the Delta model, including mean household size, population density and the proportion of individuals belonging to a black or minority ethnicity. Conversely, cohorts with very high student populations were associated more strongly with high case rates in the Omicron fit.

3 Discussion

Scotland’s programme of free community testing was an invaluable tool for tracking the spread of COVID-19 infection up to early 2022. With the ending of detailed surveillance since, it is more difficult to monitor the precise patterns of infection amongst the population and how that will evolve over time, especially with respect to different variants.

The aim of this study was to compare the patterns of cases across two waves of COVID-19 in Scotland in 2021, during which non-pharmaceutical interventions (NPIs) were being relaxed but testing remained mandatory and a mass vaccination rollout was in progress. We analysed the distribution of cases during the B.1.617.2 “Delta” wave from May 2021, and the B.1.1.529 “Omicron” wave from November 2021. We have shown that case heterogeneity was associated with broad factors such as age structure and residual immunity from earlier cases, but also with factors relating to testing, vaccination, geography and demographics. Despite differences in the severity of interventions in place, time of year, vaccination uptake and virus phenotype, these risk factors remain broadly consistent across both waves.

Our models accurately capture the case distributions (Fig 1). However, not all variation is explained, and residual autocorrelation persists at <5km scales (Fig 3). A reason for this may be that our model is not informed by mobility, thus explicit links between communities are not known to the model. We also do not include meteorological data (such as in e.g. [33]). This could have explained further variation as our waves occur in different seasons, where the characteristic routes of transmission may have differed. Last, the fit cases are time-aggregated, and therefore do not account for changes in risk factors during each wave.

The inclusion of the local outbreak duration for each DZ (the time the first case was detected in the DZ’s wider intermediate zone, typically containing 4–6 DZs) accounts in part for local interactions between neighbouring communities, in the absence of explicit mobility data. A weakness of this is that the local outbreak duration correlates with the total number of cases, given the relatively short periods studied. We suspect this is less influential in the Omicron model where geographical spread was more rapid. The regression models applied here may be better suited to scenarios where an infectious disease is already well established in the population. For future analyses on cases at the very beginning of an outbreak with fewer cases, this approach may be adapted to instead fit case rates per day, from when the first case was identified locally.

3.1 Risk factors

We presented the accumulated local effects (Fig 4), revealing broad indicators for higher or lower case rates, and how they changed between waves. It is difficult to fully disentangle whether a difference was caused by a change in control measures, or a change in virus strain. Nonetheless, our analyses provide some important insights.

To begin, high mean household size emerges as a risk factor, consistent with the high secondary attack rates for SARS-CoV-2 [40, 41], and increased risk of within-household transmission relative to contacts outside of the home [42]. That this and high population density are both stronger risk factors for Delta may reflect the stronger NPIs at this time increasing the proportion of within-DZ or within-household transmissions.

High vaccine uptake (amongst those eligible) is also protective, more so with Delta, consistent with higher rates of immune breakthrough with the Omicron variant as compared to Delta [43–45]. We do not know the specific vaccination status of those in the test data, however, and linked data may show a stronger protective effect.

For Delta, a high proportion of individuals of black and minority ethnicity is a stronger risk factor. In the UK, this is also a risk factor for severe COVID-19 outcomes [46–48] but without detailed, linked data, it is difficult to firmly establish drivers for a heightened risk during the Delta wave. Differences may emerge from known variations in vaccination uptake [49] and occupation [50] (thus ability to work from home or to physically distance effectively), and the relative impacts of those factors changing across the two waves.

Finally, living in a deprived community was suggested early on as a risk factor [51], and has since also emerged as a risk factor for severe COVID-19 disease [52–57]. However, the corresponding ALEs for the variables associated with deprivation are small. Deprivation effects may be captured by proxy with other variables that correlate with deprivation such as age [58] and vaccine uptake [59, 60].

3.2 Testing frequency

The low case rate variation with deprivation (Fig 1) contrasts with observed inequalities over severe outcomes [22–25, 61], suggesting that those living in more deprived communities experience a higher inherent case-hospitalisation rate. We suspect that a lower proportion of case ascertainment, however, may also be a factor.

An important and unique variable in our model is the rate at which negative LFD tests were reported throughout the period. We found high rates of negative test reporting to be a risk factor. This suggests a variation in case ascertainment across different demographics, which may in turn lead to skews in the observed case distribution [22, 62, 63].

Further work (Fig H and Table A in S1 Text) shows that up to February 2023, the rate of LFD testing and positivity varied substantially across deprivation (quintile 1: 3.6 tests/person, 4.61% positive; quintile 5: 6.7 tests/person, 3.57% positive) as well as sex (M: 3.7 tests/person, 4.82% positive; F: 7.0 tests/person, 3.30% positive). If demographic differences in testing behaviour correspond to differences in case ascertainment, the profile of all infections may then differ systematically from that of reported cases, and testing rates may be obscuring the true patterns of infection over sex and deprivation.

In addition, the magnitude of the risk factor (as seen in the ALE, Fig 4) plateaus beyond a certain rate (>∼1 test/person in each period). This hints at a deeper relationship between true incidence, the frequency of testing (and who amongst the population is taking those tests), and the proportion of infections that are ascertained.

Our model is unique in including negative test reporting, and has revealed strong differences between different demographics that may bias the profile of cases. Beyond the work presented here, further analyses of reported cases need to be considered with these strong skews in testing behaviour in mind.

3.3 Conclusion

The COVID-19 data studied here are remarkable in terms of volume and resolution, and have allowed us to assess a national-level epidemic at extremely fine scale. However, regardless of resolution, cases only partially represent the full underlying pattern of infection. Variations in testing frequency and known trends in severe outcomes suggest that the distribution of infections may have been very different to that of reported cases. By incorporating trends on cases, testing behaviour, and severe outcomes more closely linked to infection (hospitalisation, ICU admission and mortality), it may be possible to build a much more comprehensive retrospective picture of how infections were distributed amongst the population.

Importantly, while our access to such finely-grained data was exceptional, it can be expected that such data are likely to become more common in the future, and may become available in real time. As such, our demonstration of the utility of such data points the way to an important approach to improving data analysis supporting control policy response to infectious disease emergencies in the future.

4 Data and methods

4.1 Preparation of case data

We use COVID-19 testing data from Public Health Scotland’s electronic Data Research and Innovation Service (eDRIS) system, dated from July 14th 2022. The data include individual tests by type (polymerase chain reaction (PCR) or rapid lateral flow device (LFD)), test result (positive, negative, void, inconclusive), test date, S-gene test result if known (positive, dropout, inconclusive), age, sex, and residing data zone (DZ, a census area typically comprising 500–1,000 individuals). De-identified IDs link repeat tests by the same individual. We reduce the raw test data to cases by removing duplicate tests by the same individual within 60 days (taking the date of the first PCR positive as the case date, or the first LFD in the absence of any PCR). These metadata, in particular the DZ (which specifies location to within an area as small as 0.1 km² in densely populated areas), therefore identify cases at a fine spatio-temporal scale. Data on vaccine administrations are also provided by eDRIS.

This analysis considers the BA.1 sub-variant of the Omicron lineage only. The sub-variant BA.2/B.1.1.529.2 later replaced BA.1, becoming dominant in Scotland from around 25th February 2022. This variant, like Delta, has an S-gene positive test signature. However by the end of the period studied the BA.2 variant was only being identified in fewer than 1% of fully sequenced cases in the UK [64], and here we assume all remaining S-gene positive cases to be Delta.

Prior to January 6th 2022 in Scotland, positive LFD tests (typically taken at home) required PCR confirmation. Approximately 90% of cases in this period have a definitive S-gene result. A policy change then dropped this PCR requirement [65], after which cases with S-gene results fell to about 50% by February 2022 (per eDRIS data).

For Omicron cases, we gather from the data S-gene dropout cases between 15th November 2021 and 6th January 2022, and for the Delta outbreak, S-gene positive cases between 1st May and 7th September 2021 (choosing this end date to have a similar number of cases in each set). We exclude cases that have a different, or no S-gene result.

Using the linked historical tests, we label cases based on whether the individual had either: never tested positive before; had tested positive in the last six months prior to the start of that wave, or; last tested positive over six months prior to the start of that wave. We denote this the prior case status, as a proxy for infection-based immunity.
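As an illustration of this labelling, the sketch below assigns a prior case status from a person's earlier case dates. The label strings are hypothetical, and "six months" is approximated as 183 days, which is an assumption on our part rather than a stated rule.

```python
from datetime import date


def prior_case_status(previous_case_dates, wave_start, recent_days=183):
    """Label a person by their most recent case before the start of a wave.

    Assumption: "six months" is taken as 183 days; label names are illustrative.
    """
    earlier = [d for d in previous_case_dates if d < wave_start]
    if not earlier:
        return "never"
    days_since_last = (wave_start - max(earlier)).days
    return "within_6_months" if days_since_last <= recent_days else "over_6_months"


omicron_start = date(2021, 11, 15)
status_recent = prior_case_status([date(2020, 12, 1), date(2021, 8, 1)], omicron_start)
status_older = prior_case_status([date(2020, 12, 1)], omicron_start)
status_never = prior_case_status([date(2021, 12, 1)], omicron_start)  # case during the wave
```

Cases dated on or after the wave start are ignored, so an individual whose only positive test falls within the wave itself is labelled as never having tested positive before.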

Finally, to prepare the case data to be fit, we group individuals that have the same age range, sex, residing datazone, and prior case status, terming these subsets of individuals cohorts. As an illustrative example, a cohort may be a population of 38 males aged between 50–54 residing in a given datazone “X”, that have never tested positive for COVID-19 before, among whom 9 Omicron COVID-19 cases were identified. This is the highest practical resolution we can achieve using the eDRIS case data, and our model (Section 4.3) fits case counts at this resolution.
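The cohort construction amounts to a grouped count. The sketch below reproduces the illustrative cohort described above from individual-level records; the field names and the five-year band helper are hypothetical, and any open-ended top age band is ignored for brevity.

```python
from collections import Counter


def age_band(age, width=5):
    """Five-year age band label, e.g. 52 -> "50-54" (no open-ended top band here)."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def build_cohorts(people):
    """Group individuals by (age band, sex, datazone, prior case status).

    Returns {cohort_key: {"population": int, "cases": int}}.
    """
    population = Counter()
    cases = Counter()
    for p in people:
        key = (age_band(p["age"]), p["sex"], p["datazone"], p["prior_status"])
        population[key] += 1
        cases[key] += int(p["is_case"])
    return {k: {"population": population[k], "cases": cases[k]} for k in population}


# 38 males aged 50-54 in datazone "X", never previously positive, 9 of them cases.
people = [
    {"age": 50 + (i % 5), "sex": "M", "datazone": "X",
     "prior_status": "never", "is_case": i < 9}
    for i in range(38)
]
cohorts = build_cohorts(people)
```

Each cohort then carries one outcome value (its case count) and the explanatory variables shared by its members.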

4.2 Time series analysis

4.2.1 Time-dependent reproduction number.

The time-dependent reproduction number $R_i$ is the average number of forward infections caused by a person infected on day $t_i$. Define $n_j$ as the number of new infections on day $t_j$. These new infections came from individuals infected on, or prior to, day $t_j$. Define $A_{ij}$ as the number of new infections on day $t_j$ specifically from those infected on day $t_i \le t_j$:

$$A_{ij} = n_j \, \frac{(n_i - \delta_{ij}) \, P(t_j - t_i)}{\sum_{k:\, t_k \le t_j} (n_k - \delta_{kj}) \, P(t_j - t_k)}.$$

$P(\Delta t)$ is the probability of an individual passing on the infection, $\Delta t$ days after being infected. The presence of the Kronecker delta $\delta_{ij}$ excludes the possibility of infected individuals infecting themselves. The reproduction number $R_i$ is then the average total of infections generated over all subsequent days [66]:

$$R_i = \frac{1}{n_i} \sum_{j:\, t_j \ge t_i} A_{ij}.$$

We take $P(\Delta t)$ to be

$$P(\Delta t) \propto e^{-\lambda \Delta t},$$

with $\lambda^{-1}$ the mean infectious period. Individuals are equally infectious throughout the entire infection. In our calculations we estimate $1/\lambda = 6.26$ days, using the posterior mean duration of infectiousness obtained from the SCoVMod compartmental model (for more detail see Reference [57]).

As we estimate the infection reproduction number using the case data, we implicitly assume that case ascertainment does not change over time, and we do not account for the delay between infection and registering a case.

In this work the reproductive number is measured at local authority level, the level at which the Scottish Government monitored and adjusted NPIs.
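A minimal sketch of this calculation for a single daily case series is given below. It assumes a Wallinga–Teunis style attribution in which each new case on day $t_j$ is shared among all earlier (and same-day) cases in proportion to an exponentially decaying infectiousness $e^{-\lambda \Delta t}$, with one case removed from the pool on the same day so that nobody infects themselves. The function name and the handling of days with no possible infector are our own choices for illustration.

```python
import math


def reproduction_numbers(daily_cases, mean_infectious_days=6.26):
    """Case reproduction number R_i for each day of a daily case series.

    Assumptions (illustrative, not a definitive implementation):
      * infectiousness decays as exp(-lambda * dt), with 1/lambda the mean
        infectious period;
      * cases with no possible infector (e.g. the first case) are not attributed.
    Returns a list with None on days that have no cases.
    """
    lam = 1.0 / mean_infectious_days
    n = list(daily_cases)
    days = len(n)
    generated = [0.0] * days  # expected infections generated by day-i cases

    for j in range(days):
        if n[j] == 0:
            continue
        # Weight of day i as a source for day j; remove one case when i == j.
        weights = [(n[i] - (1 if i == j else 0)) * math.exp(-lam * (j - i))
                   for i in range(j + 1)]
        total = sum(weights)
        if total == 0:
            continue
        for i in range(j + 1):
            generated[i] += n[j] * weights[i] / total  # this is A_ij

    return [generated[i] / n[i] if n[i] > 0 else None for i in range(days)]


series = [1, 2, 4, 8]
R = reproduction_numbers(series)
```

Because every attributed case is assigned to exactly one source day, the products $n_i R_i$ sum to the number of attributed cases. Values for the most recent days are underestimates, since infections they will generate have not yet been observed.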

4.2.2 Case doubling time.

At the start of each wave we assume exponential growth of cases:

$$n(t) = n(0) \, e^{rt},$$

where the gradient of a linear regression of $\ln(\text{new cases})$ against $t$ returns the growth rate $r$. The evolution of new cases can also be rewritten in terms of a doubling time $t_D$:

$$n(t) = n(0) \, 2^{t/t_D},$$

where $t_D = \ln 2 / r$.
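The doubling time estimate can be illustrated with an ordinary least-squares fit of log daily cases against time. The sketch below uses synthetic data constructed to double every 2.9 days; the function names are hypothetical, and days with zero cases are simply dropped before taking logarithms, which is an assumption.

```python
import math


def growth_rate(daily_cases):
    """Least-squares slope of ln(new cases) against day index (zero-case days dropped)."""
    points = [(t, math.log(c)) for t, c in enumerate(daily_cases) if c > 0]
    mean_t = sum(t for t, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    return covariance / variance


def doubling_time(rate):
    """Doubling time t_D = ln(2) / r."""
    return math.log(2) / rate


# Synthetic series over 28 days, doubling every 2.9 days by construction.
synthetic = [10 * 2 ** (t / 2.9) for t in range(28)]
estimated_doubling_time = doubling_time(growth_rate(synthetic))
```

On noise-free exponential data the fit recovers the doubling time used to generate the series; with real case counts, day-of-week reporting effects would add scatter around the fitted line.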

4.3 Model

Our statistical model is designed to explain variation in COVID-19 case numbers as prepared in Section 4.1, and identify risk factors amongst a broad range of variables, using random forest regression. We fit models to the distribution of Delta and Omicron cases respectively, allowing for comparison of risk factors across the two waves.

4.3.1 Explanatory variables.

We include demographic factors (population, age, sex, ethnicity, student population), COVID-19 related factors (testing volume, prior case status, vaccination uptake), geography (local population density and transport time to public services to serve as proxies for connectivity and geographic remoteness), as well as deprivation. Data on deprivation are taken from the Scottish Indices of Multiple Deprivation (SIMD) [67]. The SIMD ranks DZs in Scotland by “multiple” deprivation, incorporating measures relating to local health, housing, geographic access, employment, income, crime, and education. In our model we use the raw measures of deprivation as explanatory variables. To account for local spread of infection between neighbourhoods that are geographically close to one another, we include a local outbreak duration parameter, which specifies the date at which the first case of the variant was identified at the intermediate zone (IZ, an administrative area containing of order 4–6 DZs).

A comprehensive description of all individual variables used is given in Section B.2 in S1 Text.

4.3.2 Random forest model.

We use random forest regression [68] on the distribution of COVID-19 cases, as it allows us to fit the distribution without specifying any prior analytical relation between the outcome variable (cases) and any of the explanatory variables, which may themselves be correlated. We fit the time-aggregated case distribution in R (version 4.1.0) [69], using the randomForest package [70] (version 4.6–14).

We fit the outcome variable at cohort level (with a cohort defined in Section 4.1). The fit number of cases at other scales (such as DZ level) is then an aggregation of cases from their constituent cohorts.
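The aggregation step, and the DZ-level variance explained, can be sketched as below. We assume here that the R-squared is the usual coefficient of determination, $1 - SS_{\mathrm{res}}/SS_{\mathrm{tot}}$, computed on DZ totals; the row layout and the toy numbers are invented for illustration.

```python
from collections import defaultdict


def aggregate_to_dz(cohort_rows):
    """Sum observed and fit cases over cohorts within each datazone.

    cohort_rows: iterable of (datazone, observed_cases, fit_cases).
    Returns (observed_totals, fit_totals) in a common datazone order.
    """
    observed = defaultdict(float)
    fitted = defaultdict(float)
    for datazone, obs, fit in cohort_rows:
        observed[datazone] += obs
        fitted[datazone] += fit
    zones = sorted(observed)
    return [observed[z] for z in zones], [fitted[z] for z in zones]


def r_squared(observed, fitted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy cohort-level output: (datazone, observed cases, fit cases).
rows = [("A", 3, 2.5), ("A", 1, 1.5), ("B", 0, 0.5), ("B", 2, 1.0), ("C", 5, 4.0)]
obs_dz, fit_dz = aggregate_to_dz(rows)
r2_dz = r_squared(obs_dz, fit_dz)
```

Errors at cohort level can partly cancel when summed within a DZ, so the DZ-level fit is generally tighter than the cohort-level fit.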

We extract two metrics for variable importance from the randomForest function output: the node purity (a measure of how effective variables are at partitioning cohorts with differing numbers of cases in the tree), and the loss of model accuracy on effective removal of that variable from the model.

Model hyperparameters were chosen manually so as to maximise the variance explained by a subset of the data not used to fit the model. Full hyperparameter specification is included in Section B.1 in S1 Text. The model specifications for fitting the Omicron and Delta waves are identical with one exception: for the Omicron model, third/booster dose uptake is used, whereas for Delta, second dose uptake is used (third/booster doses were only administered later; see Section B.3 in S1 Text for further details).

In addition to the full model, we fit for each of Omicron and Delta three “reduced” models, under equivalent hyperparameters to the full model and the same cohort structure, but informed only by population and one of: age; the relative deprivation of the residing DZ, as defined by the overall SIMD deprivation rank [71]; and population density. These outputs illustrate how effective each of these variables alone is at explaining case variation, relative to our full model.

4.4 Accumulated local effects

To identify risk factors amongst the explanatory variables used to inform the model, we calculate the accumulated local effects (ALEs) of each variable. The ALEs describe how the model fit value changes in response to changing one variable value in isolation, averaged over many different entries in the data [72]. In this context, ALEs indicate whether a variable value is associated with fewer or more cases in general over the data. If the ALE is greater than zero, the fit number of cases is generally higher given that variable value.
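As a rough illustration of how a first-order ALE curve can be computed for a single variable, the Python sketch below bins the variable at empirical quantiles, averages the change in prediction across each bin over the rows that fall in it, accumulates those changes, and centres the result. It is a simplified sketch, not the ALEPlot package's implementation [72]: the centring here weights the accumulated value at each upper bin edge by the bin count, and `accumulated_local_effects`, `toy_predict` and the toy data are hypothetical.

```python
def accumulated_local_effects(predict, rows, column, n_bins=4):
    """Simplified first-order ALE of one column for a generic predict(row) function.

    Returns a list of (upper_bin_edge, centred_ale) pairs.
    """
    values = sorted(row[column] for row in rows)
    n = len(values)
    # Bin edges taken at (approximate) empirical quantiles of the observed values.
    edges = sorted({values[(k * (n - 1)) // n_bins] for k in range(n_bins + 1)})

    accumulated, counts, running = [], [], 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        # Rows in (lower, upper]; the first bin also includes the minimum value.
        members = [
            r for r in rows
            if lower < r[column] <= upper
            or (lower == edges[0] and r[column] == lower)
        ]
        diffs = []
        for r in members:
            at_upper, at_lower = list(r), list(r)
            at_upper[column], at_lower[column] = upper, lower
            diffs.append(predict(at_upper) - predict(at_lower))
        running += sum(diffs) / len(diffs)  # accumulate the mean local effect
        accumulated.append(running)
        counts.append(len(members))

    if not accumulated:  # the column takes a single value: no effect to report
        return []
    # Centre so that the count-weighted mean of the curve is zero.
    centre = sum(c * a for c, a in zip(counts, accumulated)) / sum(counts)
    return [(upper, a - centre) for upper, a in zip(edges[1:], accumulated)]


# Toy model (an assumption for illustration): linear in column 0.
def toy_predict(row):
    return 3.0 * row[0] + row[1]


toy_rows = [[float(i), float((5 * i) % 9)] for i in range(9)]
ale_curve = accumulated_local_effects(toy_predict, toy_rows, column=0)
print(ale_curve)
```

For this linear toy model the curve rises steadily with the variable value, with negative ALE at low values and positive ALE at high values, matching the interpretation that an ALE above zero indicates more fit cases for that variable value.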

4.5 Moran’s I autocorrelation statistic

To probe geographical variation in cases not explained by the model, we measure the Moran’s I autocorrelation [73, 74] on the residuals (the difference between the data and fit value), relating to their physical location. We compare local DZ-aggregated residuals over physical distances (from 1–100km), as well as network distance (number of nearest neighbours apart). For a set of $N$ residuals $y_i$, the Moran’s I is a measure of autocorrelation:

$$I = \frac{N}{\sum_{i}\sum_{j} w_{i,j}} \, \frac{\sum_{i}\sum_{j} w_{i,j}\,(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i} (y_i - \bar{y})^2},$$

with $\bar{y}$ the mean of all residuals, and $w_{i,j}$ an associated weight of the pair of observations $(i, j)$, with $w_{i,i} = 0$. To measure the autocorrelation between residuals within a separation $d$ (either a physical or network-based distance) of one another, we set $w_{i,j} = 1$ if $\mathrm{dist}(i, j) \leq d$, and 0 otherwise. Fully correlated residuals would have $I = 1$, whereas $I = 0$ would indicate no correlation.

This measure characterises how effective our models are at explaining geographical variation, and with different distances d shows over what length scales residual autocorrelation persists.
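A minimal Python sketch of this calculation is given below, using the standard form of Moran’s I with binary weights: pairs of distinct locations within distance d of one another have weight 1, all other pairs have weight 0. It assumes planar coordinates and Euclidean distance for simplicity; the function name `morans_i` and the toy coordinates and residuals are hypothetical, and this is not the authors' code.

```python
import math


def morans_i(points, residuals, d):
    """Moran's I with binary weights: w_ij = 1 if i != j and dist(i, j) <= d, else 0."""
    n = len(residuals)
    mean = sum(residuals) / n
    deviations = [y - mean for y in residuals]

    weight_total = 0.0
    cross_products = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= d:
                weight_total += 1.0
                cross_products += deviations[i] * deviations[j]

    variance_sum = sum(v * v for v in deviations)
    if weight_total == 0 or variance_sum == 0:
        raise ValueError("Moran's I is undefined: no neighbouring pairs or zero variance")
    return (n / weight_total) * cross_products / variance_sum


# Toy example: four locations on a line, 1 unit apart (hypothetical data).
toy_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i(toy_points, [1.0, 1.0, -1.0, -1.0], d=1.0)
alternating = morans_i(toy_points, [1.0, -1.0, 1.0, -1.0], d=1.0)
print(clustered, alternating)
```

In the toy data, residuals that are similar between neighbours give a positive I, while residuals that alternate in sign between neighbours give a negative I. Repeating the call over a range of d values traces how the autocorrelation changes with length scale.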

Supporting information

S1 Text.

A. Supplementary plots for the time evolution of cases across the Delta and Omicron waves. B. Additional methodology details; hyperparameter selection, detailed description of all explanatory variables. C. Map view of population distribution of Scotland, and model residuals for Omicron model. D. Plots for explanatory variable importance; node purity, accuracy loss on variable permutation. E. Additional details on lateral flow testing frequency, broken down by sex and deprivation quintile.



We thank Public Health Scotland’s electronic Data Research and Innovation Service (eDRIS) for the provision of COVID-19 testing, vaccination and severe outcomes data. We also thank the reviewers for their feedback and suggestions, which have led to improvement of the article.


  1. Office for National Statistics. Coronavirus (COVID-19) Infection Survey: Scotland Dataset;. Available from (last accessed 15/08/2023).
  2. Office for National Statistics. Coronavirus (COVID-19) latest insights: Hospitals;. Available from (last accessed 16/08/2023).
  3. Simpson CR, Robertson C, Vasileiou E, McMenamin J, Gunson R, Ritchie LD, et al. Early pandemic evaluation and enhanced surveillance of COVID-19 (EAVE II): protocol for an observational study using linked Scottish national data. BMJ open. 2020;10(6):e039097. pmid:32565483
  4. Sheikh A, Kerr S, Woolhouse M, McMenamin J, Robertson C. Severity of Omicron variant of concern and vaccine effectiveness against symptomatic disease: national cohort with nested test negative design study in Scotland. 2021;.
  5. Canas LS, Sudre CH, Pujol JC, Polidori L, Murray B, Molteni E, et al. Early detection of COVID-19 in the UK using self-reported symptoms: a large-scale, prospective, epidemiological surveillance study. The Lancet Digital Health. 2021;3(9):e587–e598. pmid:34334333
  6. Antonelli M, Penfold RS, Merino J, Sudre CH, Molteni E, Berry S, et al. Risk factors and disease profile of post-vaccination SARS-CoV-2 infection in UK users of the COVID Symptom Study app: a prospective, community-based, nested, case-control study. The Lancet Infectious Diseases. 2022;22(1):43–55. pmid:34480857
  7. The Scottish Government. Coronavirus (COVID-19) confirmed in Scotland;. Available from (last accessed 16/08/2023).
  8. The Scottish Government. Effective ‘lockdown’ to be introduced;. Available from (last accessed 16/08/2023).
  9. The Scottish Government. Coronavirus (COVID-19): protection levels—reviews and evidence;. Available from (last accessed 16/08/2023).
  10. The Scottish Government. New guidance issued for the festive period;. Available from (last accessed 16/08/2023).
  11. The Scottish Government. Scotland in Lockdown;. Available from (last accessed 16/08/2023).
  12. The Scottish Government. First COVID-19 vaccinations in Scotland take place;. Available from (last accessed 16/08/2023).
  13. The Scottish Government. Coronavirus (COVID-19): vaccine deployment plan 2021;. Available from (last accessed 15/08/2023).
  14. Hale T, Angrist N, Kira B, Petherick A, Phillips T, Webster S. Variation in government responses to COVID-19. 2020;.
  15. McMillen T, Jani K, Robilotti EV, Kamboj M, Babady NE. The spike gene target failure (SGTF) genomic signature is highly accurate for the identification of Alpha and Omicron SARS-CoV-2 variants. Scientific reports. 2022;12(1):18968. pmid:36347878
  16. Gebhard C, Regitz-Zagrosek V, Neuhauser HK, Morgan R, Klein SL. Impact of sex and gender on COVID-19 outcomes in Europe. Biology of sex differences. 2020;11:1–13. pmid:32450906
  17. Galbadage T, Peterson BM, Awada J, Buck AS, Ramirez DA, Wilson J, et al. Systematic review and meta-analysis of sex-specific COVID-19 clinical outcomes. Frontiers in medicine. 2020;7:348. pmid:32671082
  18. Peckham H, de Gruijter NM, Raine C, Radziszewska A, Ciurtin C, Wedderburn LR, et al. Male sex identified by global COVID-19 meta-analysis as a risk factor for death and ITU admission. Nature communications. 2020;11(1):6317. pmid:33298944
  19. Sartorius B, Lawson A, Pullan R. Modelling and predicting the spatio-temporal spread of COVID-19, associated deaths and impact of key risk factors in England. Scientific reports. 2021;11(1):5378. pmid:33686125
  20. Diao Y, Kodera S, Anzai D, Gomez-Tames J, Rashed EA, Hirata A. Influence of population density, temperature, and absolute humidity on spread and decay durations of COVID-19: A comparative study of scenarios in China, England, Germany, and Japan. One Health. 2021;12:100203. pmid:33344745
  21. Smith TP, Flaxman S, Gallinat AS, Kinosian SP, Stemkovski M, Unwin HJT, et al. Temperature and population density influence SARS-CoV-2 transmission in the absence of nonpharmaceutical interventions. Proceedings of the National Academy of Sciences. 2021;118(25):e2019284118. pmid:34103391
  22. Green MA, García-Fiñana M, Barr B, Burnside G, Cheyne CP, Hughes D, et al. Evaluating social and spatial inequalities of large scale rapid lateral flow SARS-CoV-2 antigen testing in COVID-19 management: An observational study of Liverpool, UK (November 2020 to January 2021). The Lancet Regional Health-Europe. 2021;6:100107. pmid:34002172
  23. Meurisse M, Lajot A, Devleesschauwer B, Van Cauteren D, Van Oyen H, Van den Borre L, et al. The association between area deprivation and COVID-19 incidence: a municipality-level spatio-temporal study in Belgium, 2020–2021. Archives of Public Health. 2022;80(1):1–10.
  24. KC M, Oral E, Straif-Bourgeois S, Rung AL, Peters ES. The effect of area deprivation on COVID-19 risk in Louisiana. PLoS One. 2020;15(12):e0243028.
  25. Badr HS, Du H, Marshall M, Dong E, Squire MM, Gardner LM. Association between mobility patterns and COVID-19 transmission in the USA: a mathematical modelling study. The Lancet Infectious Diseases. 2020;20(11):1247–1254. pmid:32621869
  26. Reuter M, Rigó M, Formazin M, Liebers F, Latza U, Castell S, et al. Occupation and SARS-CoV-2 infection risk among 108 960 workers during the first pandemic wave in Germany. Scandinavian Journal of Work, Environment & Health. 2022;48(6):446.
  27. Rhodes S, Wilkinson J, Pearce N, Mueller W, Cherrie M, Stocking K, et al. Occupational differences in SARS-CoV-2 infection: analysis of the UK ONS COVID-19 infection survey. J Epidemiol Community Health. 2022;76(10):841–846. pmid:35817467
  28. Zhang M. Estimation of differential occupational risk of COVID-19 by comparing risk factors with case data by occupational group. American journal of industrial medicine. 2021;64(1):39–47. pmid:33210336
  29. Chadeau-Hyam M, Bodinier B, Elliott J, Whitaker MD, Tzoulaki I, Vermeulen R, et al. Risk factors for positive and negative COVID-19 tests: a cautious and in-depth analysis of UK biobank data. International journal of epidemiology. 2020;49(5):1454–1467. pmid:32814959
  30. Lau MS, Grenfell B, Thomas M, Bryan M, Nelson K, Lopman B. Characterizing superspreading events and age-specific infectiousness of SARS-CoV-2 transmission in Georgia, USA. Proceedings of the National Academy of Sciences. 2020;117(36):22430–22435. pmid:32820074
  31. Working group for the surveillance, control of COVID-19 in Spain, group for the surveillance W, control of COVID-19 in Spain, Redondo-Bravo L, Sierra Moros MJ, et al. The first wave of the COVID-19 pandemic in Spain: characterisation of cases and risk factors for severe outcomes, as at 27 April 2020. Eurosurveillance. 2020;25(50):2001431.
  32. Hu T, Wang S, She B, Zhang M, Huang X, Cui Y, et al. Human mobility data in the COVID-19 pandemic: characteristics, applications, and challenges. International Journal of Digital Earth. 2021;14(9):1126–1147.
  33. Ledebur K, Kaleta M, Chen J, Lindner SD, Matzhold C, Weidle F, et al. Meteorological factors and non-pharmaceutical interventions explain local differences in the spread of SARS-CoV-2 in Austria. PLoS computational biology. 2022;18(4):e1009973. pmid:35377873
  34. Jia JS, Lu X, Yuan Y, Xu G, Jia J, Christakis NA. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature. 2020;582(7812):389–394. pmid:32349120
  35. Wang H, Ghosh A, Ding J, Sarkar R, Gao J. Heterogeneous interventions reduce the spread of COVID-19 in simulations on real mobility data. Scientific reports. 2021;11(1):7809. pmid:33833298
  36. Hou X, Gao S, Li Q, Kang Y, Chen N, Chen K, et al. Intracounty modeling of COVID-19 infection with human mobility: Assessing spatial heterogeneity with business traffic, age, and race. Proceedings of the National Academy of Sciences. 2021;118(24):e2020524118. pmid:34049993
  37. Asem N, Ramadan A, Hassany M, Ghazy RM, Abdallah M, Ibrahim M, et al. Pattern and determinants of COVID-19 infection and mortality across countries: An ecological study. Heliyon. 2021;7(7). pmid:34254048
  38. Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34(23):4121–4123. pmid:29790939
  39. Natural Earth. Terms of Use;. Available from (last accessed 19/09/2023).
  40. Jalali N, Brustad HK, Frigessi A, MacDonald EA, Meijerink H, Feruglio SL, et al. Increased household transmission and immune escape of the SARS-CoV-2 Omicron variant compared to the Delta variant: evidence from Norwegian contact tracing and vaccination data. medRxiv. 2022;.
  41. Fonager J, Bennedbæk M, Bager P, Wohlfahrt J, Ellegaard KM, Ingham AC, et al. Molecular epidemiology of the SARS-CoV-2 variant Omicron BA. 2 sub-lineage in Denmark, 29 November 2021 to 2 January 2022. Eurosurveillance. 2022;27(10):2200181. pmid:35272746
  42. Dupraz J, Butty A, Duperrex O, Estoppey S, Faivre V, Thabard J, et al. Prevalence of SARS-CoV-2 in household members and other close contacts of COVID-19 cases: a serologic study in canton of Vaud, Switzerland. In: Open forum infectious diseases. vol. 8. Oxford University Press US; 2021. p. ofab149.
  43. Andrews N, Stowe J, Kirsebom F, Toffa S, Rickeard T, Gallagher E, et al. Covid-19 vaccine effectiveness against the omicron (B. 1.1. 529) variant. New England Journal of Medicine. 2022;. pmid:35249272
  44. Cele S, Jackson L, Khoury DS, Khan K, Moyo-Gwete T, Tegally H, et al. Omicron extensively but incompletely escapes Pfizer BNT162b2 neutralization. Nature. 2022;602(7898):654–656. pmid:35016196
  45. Vasileiou E, Simpson CR, Shi T, Kerr S, Agrawal U, Akbari A, et al. Interim findings from first-dose mass COVID-19 vaccination roll-out and COVID-19 hospital admissions in Scotland: a national prospective cohort study. The Lancet. 2021;397(10285):1646–1657.
  46. Office for National Statistics. Updating ethnic contrasts in deaths involving the coronavirus (COVID-19), England: 8 December 2020 to 1 December 2021;. Available from: (last accessed 15/08/2023).
  47. Platt L, Warwick R. Are some ethnic groups more vulnerable to COVID-19 than others. Institute for fiscal studies. 2020;1(05):2020.
  48. Lo CH, Nguyen LH, Drew DA, Warner ET, Joshi AD, Graham MS, et al. Race, ethnicity, community-level socioeconomic factors, and risk of COVID-19 in the United States and the United Kingdom. EClinicalMedicine. 2021;38. pmid:34308322
  49. Office for National Statistics. Coronavirus and vaccination rates in people aged 18 years and over by socio-demographic characteristic and occupation, England: 8 December 2020 to 31 December 2021;. Available from: (last accessed 15/08/2023).
  50. National Records of Scotland. Census 2011: Release 3I—Detailed characteristics on Labour Market and Education in Scotland;. Available from: (last accessed 15/08/2023).
  51. Khalatbari-Soltani S, Cumming RC, Delpierre C, Kelly-Irving M. Importance of collecting data on socioeconomic determinants from the early stage of the COVID-19 outbreak onwards. J Epidemiol Community Health. 2020;74(8):620–623. pmid:32385126
  52. Lone NI, McPeake J, Stewart NI, Blayney MC, Seem RC, Donaldson L, et al. Influence of socioeconomic deprivation on interventions and outcomes for patients admitted with COVID-19 to critical care units in Scotland: a national cohort study. The Lancet Regional Health-Europe. 2021;1:100005. pmid:34173618
  53. Blundell R, Costa Dias M, Joyce R, Xu X. COVID-19 and Inequalities. Fiscal studies. 2020;41(2):291–319. pmid:32836542
  54. Bambra C, Riordan R, Ford J, Matthews F. The COVID-19 pandemic and health inequalities. J Epidemiol Community Health. 2020;74(11):964–968. pmid:32535550
  55. Baena-Díez JM, Barroso M, Cordeiro-Coelho SI, Díaz JL, Grau M. Impact of COVID-19 outbreak by income: hitting hardest the most deprived. Journal of Public Health. 2020;42(4):698–703. pmid:32776102
  56. McGurnaghan SJ, Weir A, Bishop J, Kennedy S, Blackbourn LA, McAllister DA, et al. Risks of and risk factors for COVID-19 disease in people with diabetes: a cohort study of the total population of Scotland. The lancet Diabetes & endocrinology. 2021;9(2):82–93. pmid:33357491
  57. Banks CJ, Colman E, Doherty T, Tearne O, Arnold M, Atkins KE, et al. SCoVMod–a spatially explicit mobility and deprivation adjusted model of first wave COVID-19 transmission dynamics. Wellcome Open Research. 2022;7(161):161. pmid:35865220
  58. National Records of Scotland. Mid-2021 Small Area Population Estimates, Scotland (Report);. Available from (last accessed 15/08/2023).
  59. Office for National Statistics. Coronavirus (COVID-19) Infection Survey technical article: Analysis of characteristics associated with vaccination uptake;. Available from (last accessed 15/08/2023).
  60. Wood AJ, MacKintosh AM, Stead M, Kao RR. Predicting future spatial patterns in COVID-19 booster vaccine uptake. medRxiv. 2022;p. 2022–08.
  61. Wood AJ, Kao RR. Empirical distributions of time intervals between COVID-19 cases and more severe outcomes in Scotland. PloS one. 2023;18(8):e0287397. pmid:37585389
  62. Colman E, Puspitarani GA, Enright J, Kao RR. Ascertainment rate of SARS-CoV-2 infections from healthcare and community testing in the UK. Journal of Theoretical Biology. 2022;p. 111333. Available from: pmid:36347306
  63. Nightingale ES, Abbott S, Russell TW. The local burden of disease during the first wave of the COVID-19 epidemic in England: estimation using different data sources from changing surveillance practices. BMC public health. 2022;22(1):1–14.
  64. The UK Health Security Agency. SARS-CoV-2 variants of concern and variants under investigation in England: technical briefing 35;. Available from (last accessed 15/08/2023).
  65. The Scottish Government. Self-Isolation and testing changes;. Available from (last accessed 15/08/2023).
  66. Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of epidemiology. 2004;160(6):509–516. pmid:15353409
  67. The Scottish Government. SIMD 2020 Technical Notes;. Available from (last accessed 15/08/2023).
  68. Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
  69. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2022. Available from:
  70. Liaw A, Wiener M, et al. Classification and regression by randomForest. R news. 2002;2(3):18–22.
  71. The Scottish Government. Scottish Index of Multiple Deprivation 2020;. Available from (last accessed 15/08/2023).
  72. Apley D, Apley MD. Package ‘ALEPlot’. 2018;.
  73. Moran PA. Notes on continuous stochastic phenomena. Biometrika. 1950;37(1/2):17–23. pmid:15420245
  74. Gittleman JL, Kot M. Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology. 1990;39(3):227–241.