Factors affecting COVID-19 infected and death rates inform lockdown-related policymaking

Background: After claiming nearly five hundred thousand lives globally, the COVID-19 pandemic is showing no signs of slowing down. While the UK, USA, Brazil and parts of Asia are bracing themselves for the second wave (or the extension of the first wave), it is imperative to identify the primary social, economic, environmental, demographic, ethnic, cultural and health factors contributing towards COVID-19 infection and mortality numbers to facilitate mitigation and control measures.

Methods: We process several open-access datasets on US states to create an integrated dataset of potential factors leading to the pandemic spread. We then apply several supervised machine learning approaches to reach a consensus as well as rank the key factors. We carry out regression analysis to pinpoint the key pre-lockdown factors that affect post-lockdown infection and mortality, informing future lockdown-related policy making.

Findings: Population density, testing numbers and airport traffic emerge as the most discriminatory factors, followed by higher age groups (above 40 and specifically 60+). Post-lockdown infected and death rates are highly influenced by their pre-lockdown counterparts, followed by population density and airport traffic. While healthcare index seems uncorrelated with mortality rate, principal component analysis on the key features shows two groups: states (1) forming early epicenters and (2) experiencing a strong second wave or peaking late in rate of infection and death. Finally, a small case study on New York City shows that days-to-peak for infection of neighboring boroughs correlates better with inter-zone mobility than with inter-zone distance.

Interpretation: States forming the early hotspots are regions with high airport or road traffic resulting in human interaction. US states with high population density and testing tend to exhibit consistently high infected and death numbers. Mortality rate seems to be driven by individual physiology, preexisting condition, age etc., rather than gender, healthcare facility or ethnic predisposition. Finally, policymaking on the timing of lockdowns should primarily consider the pre-lockdown infected numbers along with population density and airport traffic.


... during pre- and post-COVID periods to show that the odds of mortality of whites and blacks are statistically equivalent [23]. Myers et al. analyzed the COVID-19 positive patients in California to investigate its prognosis in the higher age groups and individuals with preexisting conditions [24]. Zoabi et al. applied ML on 51,831 COVID-19 positive patients to understand the effect of gender, age and contact to show that close social interaction is a strong feature for COVID-19 transmissibility [25]. Khan et al. applied regression tree, cluster analysis and principal component analysis on Worldometer infection count data to study the variability and effect of testing in prediction of confirmed cases [26]. Finally, Pan et al. studied the effects of the myriad public health interventions (such as lockdown, traffic restriction, social distancing, home quarantine, centralized quarantine, etc.) on 32,583 COVID-19 patients, with respect to their age, sex, residential location, occupation, and severity [27].
Contributions: While it is evident that factors such as gender, race, age, testing, social contact and distancing have been analyzed in a piecemeal manner, there is no comprehensive study that combines the demographic, economic, epidemiological, ethnic and health indicators for infection and mortality from COVID-19. To address this gap, we carry out a machine learning-based analysis with the following three objectives.
1. We curate a dataset of diverse features (detailed in Sec. 2.1) from the 50 states of the USA. This dataset is somewhat unique since, in addition to the above features, it includes factors such as airport traffic, homeless population and variations in lockdown dates. Also, note that the lockdown was enforced on the US states at around the same time, when each state was at a different stage of the COVID-19 infection cycle.
2. We analyze the variation of COVID-19 infection spread and mortality rates using a set of standard supervised ML methods. We rank the key discriminatory factors based on the importance score calculated from randomized decision trees. We combine the findings to identify the most vulnerable age groups and US states. We also show the effect of testing and lockdowns on the infection spread dynamics.
3. We utilize multiple linear regression to gauge the extent to which the key pre-lockdown factors affect the post-lockdown infected and death numbers. This study assigns weights to features and can thereby drive mitigation efforts and large-scale policymaking.
Our data-driven experiments using supervised methods demonstrate that population density, testing [28] and airport traffic [29] are key factors contributing to infection and mortality rates. Furthermore, populations in high age groups (40 and beyond, and specifically exceeding 60) are more vulnerable. Principal component analysis on the key features shows two groups: highly affected US states (1) forming early epicenters and (2) showing consistent or newly peaking rate of infection and death. Multiple regression analysis shows that the post-lockdown numbers are most influenced by the pre-lockdown infected and death numbers followed by population density and airport activity, while the overall healthcare index of a state does not seem to play a part in the overall death count. Similarly, the race of individuals did not play any significant role in the infection or mortality numbers. Despite increased testing rates, the fraction of individuals tested positive drops approximately three weeks into the lockdown, suggesting that the social distancing measures have had an impact on curbing spread. Finally, we discuss the role of mobility and distance in infection spread. In the absence of large-scale inter-state mobility data, our case study on the boroughs of New York City shows that peaks of infection correlate better with inter-zone mobility than with inter-zone distance.

Materials and methods
All the experiments have been performed using Scikit-learn, which is a popular Machine Learning library in Python [30].

Dataset
Let us discuss the details of the two datasets used in this work.
2.1.1 Data from US states. Our dataset has been carefully curated from several open sources to examine the possible factors that may affect the COVID-19 related infection and death numbers in the 50 states of the USA. The individual open-access data sources as well as the integrated (curated) dataset have been shared on GitHub (https://github.com/satunr/COVID-19/tree/master/US-COVID-Dataset). Below, we discuss a summary of the features and output labels of the integrated dataset.
• Gross Domestic Product (in terms of million US dollars) for US states [31] (filename: source/GDP.xlsx, feature name: GDP).
• Distance from one state to another is measured not in miles but as the Euclidean distance between the latitude-longitude coordinates of the pair of states [32] (filename: source/Data_distance.xlsx, feature name: d(state1, state2)).
• Gender features are the fractions of the total population representing male and female individuals [33] (filename: source/Data_gender.csv, feature name: Male, Female).
• Ethnicity features are the fractions of the total population representing white, black, Hispanic and Asian individuals (we leave out other smaller ethnic groups) [34] (filename: source/Data_ethnic.csv, feature name: White, Black, Hispanic and Asian).
• Healthcare index is measured by Agency for Healthcare Research and Quality (AHRQ) on the basis of (1) type of care (like preventive, chronic), (2) setting of care (like nursing homes, hospitals), and (3) clinical areas (like care for patients with cancer, diabetes) [35] (filename: source/Data_health.xlsx, feature name: Health).
• Homeless feature is the number of homeless individuals of a state [36] (filename: source/Data_homeless.xlsx, feature name: Homeless). The normalized homeless population of each state is the ratio of its homeless population to its total population.
• Total cases (and deaths) of COVID-19 is the number of individuals tested positive and dead [37] (filename: source/Data_covid_total.xlsx, feature name: Total Cases and Total Death). The normalized infected/death is the ratio of the infected/death count to the total population of the given state.
• Infected score and death score are obtained by rounding normalized total cases and deaths to a discrete value between 0 and 6 (feature name: Infected Score, Death Score).
• Death-to-Infected is a feature measuring impact of death in terms of the difference between death and infected scores. It is calculated as max(Death Score -Infected Score, 0).
• Lockdown type is a feature capturing the type of lockdown (shelter in place: 1 and stay at home: 2) in a given state [37,38] (filename: source/Data_lockdown.csv, feature name: Lockdown).
• Day of lockdown captures the number of days from 1st January 2020 to the date of imposition of lockdown in a region [39] (filename: source/Data_lockdown.csv, feature name: Day Lockdown).
• Population density is the ratio between the population and area of a region [40] (filename: source/Data_population.csv, feature name: Population, Area, Population Density).
• Traffic/activity of airport measures the passenger traffic (also normalized by the total traffic across all the states of the USA) [41] (filename: source/Data_airport.xlsx, feature name: Busy airport score, Normalized busy airport).
• Peak infected (and peak death) measures the duration between first date of infection and date of daily infected (and death) peaks [40] (feature name: Peak Infected, Peak Death).
• Pre- and post-lockdown infected and death counts measure the number of individuals infected and dead before and after the lockdown dates (feature name: Testing, Pre-infected count, Pre-death count, Post-infected count, Post-death count).
• Days between first infected and lockdown date (feature name: First-Inf-Lockdown).
The above features, their abbreviations and summary statistics (i.e., mean, standard deviation, maximum and minimum) are listed in Table 1. Note that, for gender and ethnicity, we report the fraction of the total state population falling in each category.

PLOS ONE
• Mobility data (based on traffic volume counts collected by DOT for New York Metropolitan Transportation Council (NYMTC) [43]) shows the number of trips from one borough to another.
• COVID-19 data shows the number of COVID-19 infected and death counts for each borough [44].

US infected and testing data.
We acquire the daily infected and testing counts across the US from January to July 2020 [45]. This dataset is part of the COVID Tracking project, which collects COVID-19 statistics on the numbers of tests, cases, hospitalizations, and patient outcomes from every US state and territory by voluntary public participation.

Data preprocessing and normalization.
We use the Scikit-learn class KBinsDiscretizer to group the continuous feature values into discrete values by creating balanced clusters using the quantile strategy [46].
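The discretization step can be sketched as follows. This is a minimal illustration on synthetic numbers, not the authors' pipeline: the random inputs, the choice of seven bins (to match the 0-6 score range) and the variable names are assumptions, and applying the discretizer to the infected and death scores themselves is also an assumption. The last line applies the Death-to-Infected definition given in the feature list.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Synthetic stand-ins for normalized cases and deaths of 50 states (not the paper's data).
norm_cases = rng.random(50)
norm_deaths = rng.random(50)
X = np.column_stack([norm_cases, norm_deaths])

# Quantile strategy: each bin receives roughly the same number of states.
# Seven bins yield ordinal values 0..6; the bin count is an assumption.
binner = KBinsDiscretizer(n_bins=7, encode="ordinal", strategy="quantile")
scores = binner.fit_transform(X).astype(int)
infected_score, death_score = scores[:, 0], scores[:, 1]

# Death-to-Infected = max(Death Score - Infected Score, 0)
death_to_infected = np.maximum(death_score - infected_score, 0)
```

Because the quantile strategy balances bin sizes, each score value is shared by about seven of the fifty synthetic states.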
2.1.5 Supervised learning methods. Supervised machine learning algorithms learn a function that maps the input training data (i.e., features) to some output labels [47]. In this work, we consider the following supervised learning techniques. (Refer to [48][49][50][51][52][53][54] for the details on these ML approaches.)
• Support Vector Machine (SVM) is used for classification and regression problems and maps the inputs to high-dimensional feature spaces. SVM operates on hyperplanes, i.e., decision boundaries that help classify the data points. The objective is to maximize the separation between the data points and the hyperplane. SVM is memory efficient and effective for datasets with fewer data samples [55].
• Stochastic Gradient Descent (SGD) is an iterative approach that fits the data to an objective function [56]. As the name suggests, it is a stochastic variant of the popular gradient descent (GD) optimization model [57]. In GD, the optimizer starts at a random point in the search space and reaches the lowest point of the function by traversing along the slope. Unlike GD that requires calculating the partial derivative for each feature at each data point, SGD achieves computational efficiency by computing derivatives on randomly chosen data points.
• Nearest Centroid (NC) is a simple classification model that represents each class by the centroid of its members. Subsequently, it assigns each data point to the class whose centroid is the closest to it. NC has no model parameters to tune, but it is best suited to convex classes of similar variance [58].
• Decision Trees (DTs) are a classification and regression technique that assigns target labels based on decision rules inferred from data features [59]. DT maintains the decision rules using a tree. A data point is assigned to a class by starting at the root and repeatedly comparing one of its feature values against the rule at the current node to branch off to a child node, until a leaf is reached.
• Gaussian Naive Bayes (NB) belongs to a class of fast, probabilistic learning techniques that apply Bayes' theorem to assign labels to the data points [60].
While supervised ML approaches generally yield reliable prediction accuracy, they often suffer from overfitting or convergence issues [47,61]. Each of the above approaches has its own advantages and disadvantages. SVM works well when the underlying distribution of the data is not known. However, it is prone to overfitting when the number of features is much greater than the number of samples. SGD needs low convergence time for a large dataset, but it may require tuning a number of hyperparameters. Conversely, DT involves almost no hyperparameters, but often entails slightly higher training time. Unlike DT, NB requires less training time but works on the implicit assumption that all the attributes are mutually independent. Finally, NC is a fast method but is not robust to outliers or missing data. In the context of our work, we intuit that the discriminatory feature(s) will yield a high accuracy irrespective of the underlying supervised ML algorithm used.
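A minimal sketch of how the five classifiers listed above could be compared on the same feature set is given below. The data are synthetic, the default hyperparameters and the 5-fold cross-validation are assumptions (the text does not state the evaluation protocol), and the dictionary keys simply reuse the abbreviations from the list.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestCentroid
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic discretized features for 50 "states"; the label depends on feature 0 only.
X = rng.integers(0, 7, size=(50, 5)).astype(float)
y = (X[:, 0] >= 3).astype(int)

models = {
    "SVM": SVC(),
    "SGD": SGDClassifier(random_state=0),
    "NC": NearestCentroid(),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}

# Mean cross-validated accuracy per model (5 folds is an assumed protocol).
accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```

If a feature is truly discriminatory, as feature 0 is here by construction, most of the models should score well above chance, which is the intuition stated in the paragraph above.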

Metrics
• Accuracy function measures the fraction of matches between the predicted and actual labels in a multi-label classification, i.e., the ratio of correctly predicted observations to the total observations. It can be calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In the above equation, TP, TN, FP, FN denote true positive, true negative, false positive and false negative, respectively.
• Extra trees classifier is an estimator that fits randomized decision trees (called extra-trees) on data samples. The memory and computation overhead of this approach can be controlled by regulating the size of the extra trees. The nodes in each tree are split into sub-trees so as to increase accuracy (i.e., reduce impurity). Thus, feature importance is measured as the total reduction in impurity brought about by that feature [62].
• Multiple regression (MR) is a statistical tool to capture the linear relationship between the independent and the dependent variables x and y of a function y = g(x). In our context, MR generates a linear relationship
$$\hat{y} = \beta_0 + \sum_i \beta_{f_i} f_i + \epsilon$$
where $\beta_{f_i}$ is the coefficient that captures the contribution of feature $f_i$ towards the dependent variable y, while $\beta_0$ and $\epsilon$ are the intercept and error terms, respectively.
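The impurity-based ranking produced by the extra trees classifier in the list above can be sketched as follows. The data are synthetic, the feature labels are hypothetical placeholders borrowed from the paper's abbreviations, and the forest size is an assumption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)

# Hypothetical feature labels; the values are synthetic, not the curated dataset.
feature_names = ["PD", "air", "test", "GDP", "Health"]
X = rng.random((200, 5))
# By construction, only the first two features determine the label.
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importance: normalized total impurity reduction per feature.
importance = forest.feature_importances_
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
```

The importances sum to one, so they can be read as each feature's share of the total impurity reduction across the forest.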

Data correlation, standardization and error estimation
Given any pair of vectors $v$ and $\hat{v}$ ($|v| = |\hat{v}| = n$), we apply the following standard statistical operations:
• Mean centering subtracts the mean μ from each element of a vector v, i.e., v' = v − μ(v). This standardization adjusts the scales of magnitude by making the new mean 0 and helps compare data from varied sources or having different datatypes.
• Mean squared error (MSE) is calculated as $\frac{1}{n}\sum_{i=1}^{n}(v_i - \hat{v}_i)^2$.
• Pearson Correlation Coefficient (PCC) between $v$ and $\hat{v}$ measures the strength of a linear association between two variables, where the value PCC = 1 is a perfect positive correlation and −1 is a perfect negative correlation.
• Positivity rate ρ is the ratio of the number of individuals tested positive to the number of tests performed daily [63].
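The four operations above are simple enough to write directly in NumPy. The sketch below uses toy vectors chosen for easy hand-checking; the function names are illustrative and not taken from the authors' code.

```python
import numpy as np


def mean_center(v):
    """Subtract the mean so the centered vector has mean 0."""
    return v - v.mean()


def mse(v, v_hat):
    """Mean squared error between two equal-length vectors."""
    return np.mean((v - v_hat) ** 2)


def pcc(v, v_hat):
    """Pearson correlation coefficient between two vectors."""
    return np.corrcoef(v, v_hat)[0, 1]


def positivity_rate(positives, tests):
    """Daily positives divided by daily tests performed."""
    return positives / tests


v = np.array([1.0, 2.0, 3.0, 4.0])
v_hat = np.array([2.0, 4.0, 6.0, 8.0])

centered = mean_center(v)          # [-1.5, -0.5, 0.5, 1.5]
error = mse(v, v_hat)              # mean of [1, 4, 9, 16] = 7.5
correlation = pcc(v, v_hat)        # v_hat = 2 * v, so the correlation is 1
rho = positivity_rate(50, 1000)    # 0.05
```

Note that the two toy vectors are perfectly correlated yet have a large MSE, which shows why the two measures answer different questions.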

Results
This section is classified into the following three subsections: (1) and (2) Table 2. Unless otherwise stated, the feature set comprises GDP, gender, ethnicity, health care, homeless, lockdown type, population density, airport activity, and age groups, whereas the output labels consist of infected and death scores on a scale of 0-6.

Identification of discriminatory factors
We apply supervised machine learning (ML) approaches to identify the key factors affecting COVID-19 infected and death counts. For each supervised ML technique, we perform an exhaustive search of all possible combinations of any 5 features and identify the feature subset(s) with the highest accuracy (discussed in Sec. 2.2) as the most important features. Fig 1 shows the scores for different supervised methods. Although proposing a machine learning algorithm that works best on COVID-19 data is not the purpose of this study, it is worth reporting that the decision tree classifier (DT) slightly outperforms the other algorithms for both cases of infected and death scores. We create a pool of all features participating in at least one combination for output labels of infected and death scores. Fig 2 shows a heatmap of the importance I for all such features against each supervised technique. For infected score as output label (top figure), homeless (home), population density (PD), airport activity (air), testing (test), white (wht), etc. have the highest I. For death score as output label, PD, air, test and age groups above 50 years (age50_54 and age80_84) exhibit the highest importance.
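The exhaustive search over 5-feature combinations described above can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the feature names f0..f7 are hypothetical, a decision tree stands in for "each supervised ML technique", and 5-fold cross-validated accuracy is an assumed scoring protocol.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Eight hypothetical features for 50 synthetic "states"; only f2 drives the label.
names = [f"f{i}" for i in range(8)]
X = rng.integers(0, 7, size=(50, 8)).astype(float)
y = (X[:, 2] >= 3).astype(int)


def subset_accuracy(cols):
    """Mean cross-validated accuracy using only the given feature columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, list(cols)], y, cv=5).mean()


# Score every 5-feature combination: C(8, 5) = 56 subsets.
results = {cols: subset_accuracy(cols) for cols in combinations(range(len(names)), 5)}

best = max(results.values())
best_subsets = [cols for cols, acc in results.items() if acc == best]

# Pool of features that participate in at least one top-scoring combination.
pool = sorted({names[i] for cols in best_subsets for i in cols})
```

The number of subsets grows combinatorially with the number of features, so this brute-force approach is practical only for feature sets of modest size.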

Ranking of discriminatory features
We apply the extra trees classifier to generate the impurity-based rank for the features (discussed in Sec. 2.2). Fig 3a shows the top 5 important features corresponding to the infected and death scores, respectively. It is interesting that for both cases, the same set of features, namely, population density, days to peak, airport traffic, testing and high age groups, are identified. Also note that the same features exhibit a very high participation in the 5-feature combinations shown in Fig 2. Next, as a validation exercise, we apply dimension reduction on the

Table 1

Effect on age
We discussed in Sec. 2.1 that our initial dataset groups ages into five-year brackets (0-4, 5-9, and so on). Our results from supervised learning (Sec. 3.1) and extra trees (Sec. 3.2) suggest that high age groups are important factors affecting the infected and death scores of COVID-19. To understand how low and high age groups relate to the COVID-19 infected and death scores, we create two feature sets for population of age ≤40 and >40. Fig 4a shows that for both cases of infected and death, the accuracy (ACC) is higher for higher age groups. We explore this by repeating the above experiment, this time with a feature set of groups 40-60 and >60. Fig 4b depicts that ACC for age group 60+ is marginally higher, suggesting that the elderly are amongst the most vulnerable; however, the difference in mortality rates in this case was not statistically significant.

Feature influence on post-lockdown infection spread
We carry out a study to identify the pre-lockdown factors of any region (US states in our case) that contribute to the overall post-lockdown infection and death numbers. We partition the total infected and death numbers for each state into pre- and post-lockdown infected and death counts. We then create a feature set consisting of population density, airport business, pre-lockdown infected, pre-lockdown death, days between first infected to lockdown and age group above 80. The features represent the set of observable factors for the administrative and health bodies and were already shown to possess high feature significance in the previous section. The output labels are the post-lockdown infected and post-lockdown death numbers. We perform the following experiments:

3.4.1 Identification of discriminating features. We carry out a simple preprocessing step to convert each feature entry to percentile (with respect to the feature vector) and rank the US states in decreasing order of infected and death scores (Fig 5). We calculate the weighted average percentile of features for the top and bottom k = 10 US states using the formula where p(f_i) and ρ(f_i) are the percentile and rank of the i-th feature value, while r is the number of US states (equal to maximum rank). We intuit that the features exhibiting the maximum difference in weighted average percentile for the top and bottom k COVID-19 affected US states are the discriminating ones. Fig 6a shows the percentile difference, suggesting that airport and population density are the most significant, while days between first infected to lockdown and age group of 80+ are the least discriminating.

Feature weights based on multiple regression.
We apply multiple regression (MR) (see Sec. 2.2) to measure the weight of each of the above features in the observed post-lockdown infected (Post_Inf) and post-lockdown death numbers (Post_Dth). We eliminate the days between first infected to lockdown (Fst-Lock) and age group 80+, which are the least discriminating features from the percentile analysis (see Fig 6a). As a prerequisite for MR, we need to eliminate features that are mutually correlated. Fig 6b shows that Pre-inf and Pre-dth are highly correlated, and hence we run two separate batches of MR: (1) population density, airport business, pre-lockdown infected and (2) population density, airport business, pre-lockdown death.
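One batch of the regression described above can be sketched with Scikit-learn as follows. The data and the "true" coefficients are synthetic and chosen only so that the pre-lockdown count dominates; the column labels PD, Airport and Pre_Inf are shorthand assumed for this example. Scikit-learn's LinearRegression does not report p values, so those would have to be obtained with a separate statistical tool.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_states = 50

# Synthetic batch (1): population density, airport business, pre-lockdown infected.
X = rng.random((n_states, 3))
true_coef = np.array([0.2, 0.1, 0.7])  # assumed weights, for illustration only
post_inf = 0.05 + X @ true_coef + rng.normal(0.0, 0.01, n_states)

model = LinearRegression().fit(X, post_inf)

# Fitted coefficient of each feature, plus intercept and goodness of fit.
coef = dict(zip(["PD", "Airport", "Pre_Inf"], model.coef_))
intercept = model.intercept_
r2 = model.score(X, post_inf)
```

Running the second batch only requires swapping the third column for the pre-lockdown death count and the target for the post-lockdown death count.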

Table 3 shows that pre-infected and pre-death, with high coefficients, contribute highly towards the post-lockdown infected and death numbers, followed by population density and airport traffic. This finding is further supported by the p values reported for the respective features. Note that the R² scores for all four cases are >0.8, suggesting that the input features explain a high proportion of the variance in the output labels. Overall, pre-infected count has a higher coefficient and R² score and emerges as a marginally better discriminating feature of post-lockdown effects than the pre-death count.

Effect of testing and lockdown on infection spread.
We explore the effect of testing and lockdown on infection spread. We utilize positivity ratio ρ (defined in Sec. 2.3) to gauge how widespread the infection is [63]. We acquire the daily infected and testing count in the US (see Sec. 2.1.3) and plot the mean daily ρ across all states over the period of February-July 2020. Fig 7a shows that testing increased over time, while the positivity ratio dropped post lockdown (shown in red dotted line). While testing (and, by extension, positivity ratio) is an effective epidemiological indicator, it cannot curb infection spread by itself. However, Fig 7a shows that ρ dropped approximately three weeks into the lockdown, suggesting that the latter had an impact on curbing spread by minimizing social contact.


Discussions
In Sec. 3.2, we perform PCA on the feature set of the key factors to show that states with high infection and death numbers stand out of the cluster of other states. These states include some erstwhile hotspots forming group 1 (such as New York City, New Jersey, Massachusetts, Connecticut, Rhode Island) as well as states experiencing a steady infection and death count and also a strong second wave forming group 2 (such as Texas, Washington, California, Georgia, Arkansas, Utah and Colorado) (Fig 3b). In the PCA, PC1 and PC2 account for 41% and 21% of the variance, respectively. We explore how each feature influences each component to show that PC1 is driven by factors such as airport activity and high age groups (70 and beyond), while PC2 is dominated by population density, airport, age (80+) and testing. Notice in Fig 3b that, though both groups 1 and 2 exhibit high spread across PC1, group 2 forms a slightly denser cluster than group 1, implying that it exhibits an even mix of PC1 and PC2 features. We intuit that the early peaking in infection in group 1 states is due to high road and airport mobility leading to high mixing and infection spread that is manifested in the elderly population. Group 2 shows enduring infection spread due to high population density and testing, in addition to airport activity and populations with higher age groups. We study how demographics affect COVID-19 numbers to show that states with larger populations in the higher age groups (particularly 60 and beyond) are the most vulnerable. Finally, we split the infected and death numbers on the pre- and post-lockdown epochs and apply multiple linear regression to show that pre-lockdown infected and death, population density and airport contribute highly to the post-lockdown numbers. This analysis can be particularly effective in pinpointing the most vulnerable states and recommending lockdown policies on starting dates and duration to curb pandemic spread.
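For readers who wish to reproduce this kind of analysis, a two-component PCA with per-feature loadings can be sketched as follows. The data are synthetic, the feature labels are hypothetical placeholders, and standardizing the features before PCA is an assumption, since the text only mentions mean centering.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical key features for 50 synthetic "states".
feature_names = ["PD", "air", "test", "age80"]
X = rng.random((50, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.random(50)  # make two features strongly correlated

# Standardize, then project onto the first two principal components.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Z)
projected = pca.transform(Z)

# Share of variance captured by PC1 and PC2.
explained = pca.explained_variance_ratio_

# Loadings: how strongly each feature contributes to PC1.
pc1_loadings = dict(zip(feature_names, pca.components_[0]))
```

Plotting the two columns of `projected` against each other gives the kind of scatter in which outlying groups of states can be identified.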
Note that our present study pertains to the identification of the discriminatory features with respect to the date of lockdown. There exist several unanswered questions regarding the impact of length, scheduling strategies, lockdown types and extent of lockdowns on pandemic spread that need to be answered. Such an analysis requires a richer feature set as well as a sound understanding of the dynamics of infection spread in terms of healthcare, distance, mobility, etc. As a preliminary study, we first explore whether there is any relationship between the health care index (Health) of a US state and the number of transitions from infected to death (Dth/Inf) in this state. The Pearson's correlation coefficient (see Sec. 2.3) between the two factors is 0.11, suggesting that the overall mortality numbers are largely unrelated to the healthcare facility and may solely depend on the infected individual's attributes, such as age, comorbidities, infection severity, etc.
Second, since proximity plays a role in infection spread, neighboring regions should peak at nearly the same time. We posit that mobility may play an even greater role in the spread than a static measure like distance between a pair of regions. In the absence of an inter-state mobility dataset, we create two feature sets for the NYC boroughs dataset (see Sec. 2.1): (1) inter-borough distance and (2) inter-borough mobility. Each borough b has a distance and mobility vector D_b = {d_b1, d_b2, ...} and M_b = {m_b1, m_b2, ...}, where d_bi and m_bi are the probabilistic measures of distance and mobility between borough b and borough i. We calculate the correlation of the mean squared error (see Sec. 2.3) of the distance/mobility vectors of any pair of boroughs b_1 and b_2 against the absolute difference of their peak-infected or peak-death features. Fig 7b suggests that mobility yields a higher correlation (0.44) than distance (0.22), suggesting that mobility is a slightly more informative feature to analyze infection spread.
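The pairwise comparison described above can be sketched as follows. The trip counts and peak days are synthetic, and reading "probabilistic measure" as row-normalizing each borough's vector so that it sums to one is an assumption made for this example.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n_boroughs = 5  # New York City has five boroughs

# Synthetic inter-borough trip counts and days-to-peak (not the NYMTC data).
trips = rng.random((n_boroughs, n_boroughs))
peak_day = rng.integers(20, 60, size=n_boroughs)

# Assumed normalization: each borough's mobility vector M_b sums to 1.
M = trips / trips.sum(axis=1, keepdims=True)


def mse_vs_peak_gap(vectors, peak):
    """Correlate pairwise MSE of borough vectors with the gap in peak days."""
    pairs = list(combinations(range(len(peak)), 2))
    mse = np.array([np.mean((vectors[a] - vectors[b]) ** 2) for a, b in pairs])
    gap = np.array([abs(int(peak[a]) - int(peak[b])) for a, b in pairs])
    return np.corrcoef(mse, gap)[0, 1]


r_mobility = mse_vs_peak_gap(M, peak_day)
```

Applying the same function to the normalized distance vectors D_b gives the second correlation, so the two feature sets can be compared on equal footing.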
We are currently working towards broadening the scope of this study in different directions. First, this work attempted to apply ML analysis on a wide range of features, making the states of the United States the ideal choice, specifically from the standpoint of data availability. In the future, we would like to extend this work by running these experiments on epidemiological, demographic and economic data of different countries. It would be interesting to report the variation in the discriminatory features identified for different countries. Second, we identify population density, testing, airport activity and pre-lockdown infected count as key features driving the post-lockdown infection and death numbers. We plan to utilize these findings to design policies on the timing, duration and stringency of lockdown for future pandemics. Third, all the input features discussed in this work are static or time invariant. It is imperative to analyze the evolution of dynamic features (such as GDP and unemployment rates) from the pre-COVID to the post-COVID timelines to uncover the long-term economic effects of COVID-19.

Conclusions
Machine learning is emerging as an important tool to predict the dynamics of spread of COVID-19 and identify the key factors driving infection and mortality rates. While existing works study the effects of gender, race, age, testing, social contact and distancing separately, we present a unified analysis of the demographic, economic, epidemiological, ethnic and health indicators for infection and mortality rates from COVID-19. We curate a dataset of US states comprising features (from varying sources discussed in Sec. 2.1) that may potentially impact infection and death rates of COVID-19. We run several supervised machine learning techniques to identify and rank the key factors correlating with infection and fatality counts. Population density, testing rate, airport traffic and high age groups emerge as significant, while ethnicity, gender, healthcare index, homeless population and GDP have little or no impact on pandemic spread and mortality.