Heavy-tailed distributions of confirmed COVID-19 cases and deaths in spatiotemporal space

This paper conducts a systematic statistical analysis of the characteristics of the geographical empirical distributions for the numbers of both cumulative and daily confirmed COVID-19 cases and deaths at county, city, and state levels over a time span from January 2020 to June 2022. The mathematical heavy-tailed distributions can be used for fitting the empirical distributions observed in different temporal stages and geographical scales. The estimations of the shape parameter of the tail distributions using the Generalized Pareto Distribution also support the observations of the heavy-tailed distributions. According to the characteristics of the heavy-tailed distributions, the evolution course of the geographical empirical distributions can be divided into three distinct phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II. These three phases could serve as an indicator of the severity degree of the COVID-19 pandemic within an area. The empirical results suggest important intrinsic dynamics of a human infectious virus spread in the human interconnected physical complex network. The findings extend previous empirical studies and could provide more strict constraints for current mathematical and physical modeling studies, such as the SIR model and its variants based on the theory of complex networks.


Introduction
The COVID-19 global pandemic broke out in the human interconnected physical complex network in late 2019.It has caused unprecedented damage to global public health, society safety, and economy [1][2][3][4][5].To control such a pandemic well, many countries have taken a variety of containment measures since the beginning phase of this epidemic [6].
Over the past few years, it has been observed that some countries, states, provinces, cities, and counties have a huge number of confirmed COVID-19 cases and deaths, while others have a few confirmed cases and deaths [1,[7][8][9].One may intuitively attribute this geographical heterogeneity to the different containment measures adopted in different regions.It is true that some countries have adopted relatively strict containment measures compared with others.However, within a country, such as China where unified COVID-19 containment measures are adopted, similar geographical heterogeneity is still observed.
The huge geographical heterogeneity implies spatial heavy-tailed distribution [10][11][12][13].Blasius [10] analyzed data of COVID-19 confirmed cases and deaths at the end of March 2020 for countries worldwide and for counties in the US, and revealed that the geographical distributions of confirmed cases and deaths follow the truncated power-law.Similarly, Toda and Beare [11] also investigated the US county-level distribution of confirmed cases during the early phase of this epidemic, and then found that this distribution obeys the power-law with an exponent close to 1. Subsequently, Ahundjanov et al. [12] computed the Chinese city-level distribution of confirmed cases at the end of May 2020, and obtained a similar result as that of the US county-level distribution [12].In 2022, Liu and Zheng [13] conducted a more comprehensive and systematic analysis of COVID-19 cumulative and daily confirmed cases and deaths for countries worldwide by considering its temporal evolution characteristics.This study shows that the power-law and the stretched exponential function can well describe the geographical country-level distributions, in a time period from the early phase of this epidemic to June 2022 [13].In addition to the geographical power-law distribution, a power-law growth pattern during the early phase of this epidemic was also observed [14][15][16][17].The power-law pattern reveals important intrinsic dynamics of the COVID-19 spread process in the complex network of human society.
The previous empirical studies focused only on data from the early phase of this epidemic and did not investigate the differences among different geographical scales.To obtain a comprehensive understanding of the dynamics of the COVID-19 spread process, it is valuable to perform a more in-depth and systematic analysis of COVID-19 data across different geographical scales in a time span as long as possible.Compared with the previous studies [10][11][12][13], at the county, city, and state/province levels, this paper extendedly analyzes the COVID-19 time series data from Australia, Canada, China, Denmark, France, Netherlands, New

Data source
This paper conducts a detailed analysis of the distributions of the COVID-19 numbers for both cumulative and daily confirmed cases and deaths at county, city, and state/province levels.The COVID-19 datasets analyzed in this paper are sourced from the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University [8] and the China Data Lab Dataverse operated by Harvard University [64,65].The dataset from CSSE was updated until March 2023, while the dataset from China Data Lab Dataverse was updated until May 2022.More details of these datasets can be accessed via the sources listed in Table 1 and the scientific articles published in journals of the Lancet Infectious Diseases [9] and the Data and Information Management [66].Table 1 lists the sources of the datasets for each geographical scale analyzed in this paper.

Level
Country Source

State (province)
Australia, Canada, China, Denmark, France, Netherlands, New Zealand, the UK, the US Reference [64] (for the US) Reference [8] (for others)

County
The US Reference [8] Mathematical heavy-tailed distributions and methods Many observables in natural and social systems obey the exponentially bounded distribution.It decays exponentially ( −  ) or faster than the exponential (e.g. the normal distribution  −  2 / 2 ) as the observable  increases.It is also known as thin-tailed distribution because of the very small probability of large observable .However, we also observed extreme values of some observables in complex systems.These extreme events are important and interesting since they often play a crucial role in understanding the complex system.The occurrence of extreme events implies a heavy-tailed distribution that decays more slowly than the exponentially bounded distribution [67].The heavy-tailed distributions that are often used to describe empirical data include the power-law (also known as the Pareto distribution or Zipf's law), the truncated power-law, the stretched exponential (also known as the Weibull distribution), and the lognormal distributions [67].
The probability density functions of the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions are mathematically written as Eqs ( 1) -( 4), respectively [67]. (1) For some special cases, we can let support of  is [  , +∞) or [  ,   ] with condition of   > 0. The  is a normalizer which is different for continuous and discrete random variables.In our study, we use the discrete form since the numbers of confirmed COVID-19 cases and deaths are integers.
The power-law distribution has been generalized as the Generalized Pareto Distribution (GPD).The GPD covers both cases of the thin and heavy-tailed distributions.Therefore, although the GPD might not be suitable for describing an empirical distribution over the entire range of a variable, it is often used to estimate the shape of a distribution's tail [44].
In this paper, the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions are employed to fit the empirical distribution over the observable's entire range with the popular fitting methods developed by Clauset et al. [68] and Klaus et al. [69].The mathematical details of the fitting methods can be found in references [68,69], and the algorithm's implementation has been coded in the Python package powerlaw [70], which is widely used in scientific research and thus in this work.To have a better quantitative understanding of the tail distribution, this paper also uses the GPD to estimate the tail's shape parameter [71,72].
Here we use four statistical hypothesis tests to evaluate the goodness of fit.The first one is the log-likelihood ratio test.It can identify which one of the two fits is better.This method compares one candidate distribution against another, and calculates two values of  and -value [70].The positive  indicates that the former is a better fit compared with the latter one, and vice versa [70].The -value quantifies the significance for that direction, and that direction is significant when -value is less than 0.05 ( < 5%) [70].The other tests include the Kormogorov-Smirnov test (KS test) [73], the Anderson-Darling test (AD test) [73], and the Cramér-von Mises test (CVM test) [73].The statistics for these three tests are denoted by ,  2 , and  in this paper, respectively.The -values of these three tests are obtained by performing a Monte Carlo method in our analysis.It is noteworthy that we should use the KS test with caution, as it is underpowered relative to alternative available test statistics [70,73,74].

The US county-level distributions
Here we first analyze the cumulative and daily numbers of confirmed COVID-19 cases and deaths at the US county-level.Since the COVID-19 pandemic has invaded more than 3,000 counties in the US, there are enough statistics to investigate the characteristics of the distributions for these four metrics.Figs 1 to 4 show the evolution behavior of these distributions over more than two years from the early stages of this pandemic to 1 June 2022.These four figures use the same techniques of statistical analysis.
Fig 1 shows the US county-level distributions of the numbers of cumulative confirmed cases.From this figure, it is seen that the power-law and lognormal distributions describe these empirical data well.The key parameters of theoretical distributions are obtained by the fitting algorithm [68,69] implemented in powerlaw package [70], and shown in legends.
The goodness of fits presented in Fig 1 is evaluated by the aforementioned four hypothesis tests.The black text in the bottom left corners of each panel refers to the log-likelihood ratio test which compares the exponential, the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions.The log-likelihood ratio test indicates that the lognormal distribution is the best candidate fit for the empirical data presented in the last 5 panels.For the data presented in the first panel, the power-law might be a better candidate distribution.However, the log-likelihood ratio test presented in the first panel shows that the power-law, the truncated power-law, and the stretched exponential distributions are indistinguishable.This indistinguishability may be caused by the limitation of statistics in the early stages of the COVID-19 pandemic.The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test, whose results are presented in Table 2. From this table, we see that both the KS test and the CVM test accept these fits at a significance level of 1%.    a time span of more than two years.In the early stages of this pandemic, the power-law with an index  = 1.50 may be a candidate distribution to describe the empirical data.A characteristic of such empirical distribution in this phase is that the curve of the probability density function is close to a straight line in a log-log plot.According to the log-likelihood ratio test, it is difficult to distinguish the power-law from the truncated power-law and the stretched exponential distributions.To identify the true distribution pattern underlying empirical data, the generative mechanism is highly needed to be investigated theoretically by developing mathematical and physical models.
In the subsequent course of this pandemic, the power-law gradually converges to the lognormal distribution, which is characterized by two key parameters: location  and scale .With the development of this pandemic, the location  gradually increases, and the scale  decreases except for the data presented in the last panel.The mode of the lognormal distribution equals  −  2 .The fitting result shows that the mode gradually moves to the right along with the evolution course.Depending on the locational relationship between the mode and the minimum value of the empirical data, the subsequent pandemic course has two distinct phases, called lognormal phase I and lognormal phase II.The mode and the minimum value of the empirical data are close to each other in the lognormal phase I, and the former is significantly larger than the latter in the lognormal phase II.
Based on the discussion above, we conclude that the evolution of the empirical distributions over more than two years can be divided into three phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II.The distributions of these three phases have the characteristic of the heavy-tailed distribution.The transformation from the power-law phase to the lognormal phase I is similar to the previous observation on the country-level distributions [13].However, the US county-level distributions especially exhibit the feature of the lognormal phase II, which is not observed in the country-level distributions [13].
Fig 2 shows the US county-level distributions of the numbers of cumulative confirmed deaths.The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test.The results of these three tests are presented in Table 3 and show that these theoretical distributions are acceptable.
A similar evolution process as shown in Fig 1 is observed here.However, the evolution speed for cumulative confirmed deaths is much slower than that for cumulative confirmed cases.This phenomenon can be explained by the development of the process of "infection" events and "death" events.The fact that the "death" event appears later and slower than the "infection" event causes the slower evolution speed of the distributions of cumulative confirmed deaths.The effective containment measures contribute partially to this slower evolution.For the US county-level distributions of both cumulative confirmed cases and deaths, the feature of the lognormal phase II is observed.In accordance with the mathematical model of the lognormal distribution, one could make a prediction that if no effective and timely containment measure is taken to control the development of such a pandemic, all counties will be invaded in the late stages of the COVID-19 course.
Compared with this study, the previous analysis on the country-level data [13] does not observe the lognormal phase II at least on 1 June 2022.This difference between the US county-level and the worldwide country-level may result from the fact that the severity degree of the COVID-19 pandemic from the perspective of the world is smaller than that from the perspective of the US county.Actually, these three phases could be an indicator of the severity degree in terms of the geographical scope affected and the number of people infected by the COVID-19 pandemic.4. The aforementioned four hypothesis tests all indicate that the power-law and the lognormal distributions are better fits to the empirical data in the early stages and the subsequent course of this pandemic, respectively.Such observation at the county-level is similar to the former analysis at the country-level [10,13].We note that the  for the stretched exponential distribution is negative in the second panel.Therefore, the stretched exponential distribution is also used to fit that empirical data, and the other three tests accept this fit.Only based on the statistical analysis, we can not distinguish the lognormal distribution from the truncated power-law and the stretched exponential distributions using the empirical data presented in the second panel.This limitation of the statistical analysis further stimulates the theoretical studies on the generative mechanism using mathematical and physical models.5 all show that the power-law, the truncated power-law, and the lognormal distributions well describe the empirical data in different stages of the COVID-19 pandemic.However, we can not make a solid conclusion based on the log-likelihood ratio test due to the limitation of the statistics of the daily data on confirmed deaths.

The Chinese city-level distributions
This paper also analyzes the city-level data of more than 300 Chinese cities. Fig 5 shows the Chinese city-level distributions of the numbers of cumulative confirmed cases over a time span through the very early stages of this pandemic to 1 April 2022.There is no illustration for the cumulative confirmed deaths and daily confirmed cases and deaths, since there are not enough cities with such confirmed cases and deaths.From the results of the KS test, the AD test, and the CVM test presented in Table 6, it is evident that the lognormal distribution can well describe the Chinese city-level data in different stages of this pandemic.However, the log-likelihood ratio test shows that some theoretical distributions are indistinguishable based on the empirical data on some dates.
Although a solid conclusion is not available based on the log-likelihood ratio test, the other hypothesis tests and the shapes of the empirical distributions indicate that the empirical distributions exhibit the feature of the lognormal phase I in all stages studied here.In the very early stages of this pandemic, at least by 1 February 2020, the lognormal distribution had been formed.Compared with the US county-level distributions shown in Fig 1 and the country-level distributions presented in former studies [10,13], the Chinese city-level data formed the lognormal distribution much earlier than the US county-level data and the worldwide country-level data.The reason for this difference between the Chinese city-level data and the US county-level data (or the worldwide country-level data) is that the infectious novel coronavirus SARS-CoV-2 first invaded China and then other countries.Note that no distribution is the form of the lognormal phase II.This absence of the lognormal phase II reflects the effect of China's effective pharmaceutical and non-pharmaceutical interventions before mid-2022.

State/province-level distributions
In addition to the US county-level and the Chinese city-level data, this paper also analyzes the state/province-level data from Australia, Canada, China, Denmark, France, Netherlands, New Zealand, the UK, and the US.There are more than 100 states and provinces in these 9 countries.Figs 6 and 7 show the state/province-level distributions of the numbers of cumulative confirmed cases and deaths by considering these more than 100 states and provinces together.We can not make an illustration of the daily metrics since there are not enough states and provinces with daily confirmed cases and deaths.
For the cumulative confirmed cases, the state/province-level distribution can be described by the stretched exponential distribution according to the log-likelihood ratio test.The log-likelihood ratio test also indicates that the stretched exponential and the lognormal distributions are indistinguishable based on the data presented in the first five panels.Therefore, for ease of comparison, we also plot the results of the lognormal fits in the first five panels.The KS test, AD test, and the CVM test presented in Table 7 also accept the lognormal fits, and the stretched exponential fits except for that on 1 December 2021.It is noteworthy that the empirical distribution shows the feature of lognormal phase I in all stages studied here, which is similar to the Chinese city-level distribution of cumulative confirmed cases.As a comparison, the country-level and the US county-level distributions still followed the power-law on March 2020 [10,13].The red dashed curves and legends are the results of the stretched exponential fits.For ease of comparison, the results of the lognormal fits are presented by blue dashed curves and legends.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.From the studies discussed above, it is concluded that the US county-level, the Chinese city-level, and the state/province-level empirical distributions behave as heavy-tailed distributions.These heavy-tailed distributions indicate that a large number of areas are infected slightly and a few areas are infected tremendously by the COVID-19 pandemic.The transformation from the power-law phase to the lognormal phase I, and then to the lognormal phase II, suggests that the geographical heterogeneity is narrowing along with the evolution of this pandemic.

Estimations of the shape parameter using the GPD
To further quantitatively reveal the characteristics of the tails of the US county-level, the Chinese city-level, and the state/provincelevel empirical distributions, here we estimate the shape parameters of the tails using the GPD.Tables 9, 10, and 11 report the point and the 95% confidence interval estimates of the shape parameters over different thresholds, since there is no universal guidance on the selection of threshold.
From Tables 9, 10, and 11, we see all point estimates are positive, which means that these empirical distributions are heavytailed.For a few cases, the 95% confidence interval estimates show that negative shape parameters are possible, but there is more possibility that the shape parameters are positive.Therefore, the GPD estimations of the shape parameters support the conclusion of heavy-tailed distributions, which is also concluded by fitting the empirical data using the theoretical heavy-tailed distributions.
Table 9. Estimation of the shape parameter of the US county-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.The numbers in [ ] are the 95% confidence intervals.

Conclusions
The COVID-19 global pandemic has caused unprecedented damage to global public health, society safety, and the economy.Understanding the spread dynamics of this pandemic is a challenging task.The geographical distribution characteristics of the numbers of confirmed COVID-19 cases and deaths suggest important intrinsic dynamics underlying a physical complex network in which the human infectious virus spreads.To investigate the distributions in different stages of this pandemic and different geographical scales of the pandemic spread, at the county, city, and state/province levels, this paper systematically analyzes the time series data of the confirmed COVID-19 cases and deaths from the early stages of this pandemic to June 2022.
The distributions, for both cumulative and daily confirmed cases and deaths at the US county-level, for cumulative confirmed cases at the Chinese city-level, and for both cumulative confirmed cases and deaths at the state/province-level of Australia, Canada, China, Denmark, France, Netherlands, New Zealand, the UK, and the US, are statistically analyzed.The statistical analysis provides evidence that these geographical distributions can be described by the heavy-tailed distributions in different stages of this pandemic.These heavy-tailed distributions indicate that a large number of areas are infected slightly and a few areas are infected tremendously by the COVID-19 pandemic.The evolution of this pandemic can be divided into three distinct phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II.The distributions in different phases have different shapes.The shape of the distribution in the power-law phase is close to a straight line in a log-log plot.For the lognormal phase I and phase II, the shape of a distribution is a concave curve in a log-log plot.The difference between these two lognormal phases is the locational relationship between the mode and the observed minimum value.The location of the mode is close to the observed minimum value for the lognormal phase I, and the former one is significantly larger than the latter one for the lognormal phase II.
The US county-level cumulative distributions feature these three phases.In the early stages, the distributions have the feature of the power-law phase.Then the distributions gradually converge to the lognormal phase I and phase II.The evolution speed for the cumulative confirmed deaths is slower than that for the cumulative confirmed cases because of the difference between the dynamics of the "death" event and the "infection" event.The Chinese city-level distributions for the cumulative confirmed cases only have the feature of the lognormal phase I across the entire stages we studied in this paper.As a comparison with the US county-level distributions, the absence of the power-law phase and the lognormal phase II indicates that the evolution speed for the Chinese city-level distribution is faster in the early stages of this pandemic, and that speed is slower in the middle and late stages.The reason for this difference in evolution speed between the Chinese city-level data and the US county-level data is that the infectious novel coronavirus first invaded China and then other countries, and the further development of this pandemic was significantly suppressed by the effective pharmaceutical and non-pharmaceutical interventions taken in China.The state/province-level data of cumulative confirmed cases have a similar evolution process of the Chinese city-level data, and that data for cumulative confirmed deaths show a much slower speed of evolution compared with the Chinese city-level data.
These three phases could be an indicator of the severity degree of the COVID-19 pandemic within an area.Generally, the power-law phase, the lognormal phase I, and the lognormal phase II represent low, middle, and high severity degrees of the COVID-19 pandemic within an area, respectively.Whether the COVID-19 number in an area forms the distribution following these three phases, depends on the evolution speed and the severity degree of the COVID-19 pandemic in that area.The transformation from the power-law phase to the lognormal phase I, and then to the lognormal phase II, suggests that the geographical heterogeneity is narrowing along with the evolution of this pandemic.
The findings presented in this paper extend previous empirical studies.And it provides another example of the damages from natural disasters characterized by the heavy-tailed distribution.This study could provide more strict constraints for current mathematical and physical modeling studies, such as the SIR model and its variants based on the theory of complex networks.Such a study is also especially important for predicting the COVID-19 evolution as many countries are stopping tracking COVID-19 [75,76].

3 47 Figure 1 .
Figure 1.The US county-level distributions of the numbers of cumulative confirmed COVID-19 cases.The star markers are the empirically estimated probability density  () of a county to have  cumulative confirmed cases by one day.The red dashed curves and legends are the fit results using theoretical distributions.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Figure 2 .
Figure 2. The US county-level distributions of the numbers of cumulative confirmed COVID-19 deaths.The star markers are the empirically estimated probability  () of a county to have  cumulative confirmed deaths by one day.The red dashed curves and legends are the fit results using theoretical distributions.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Fig 3
Fig 3 illustrates  the US county-level distributions for daily confirmed cases.The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test.The results of these three tests are presented in Table4.The aforementioned four hypothesis tests all indicate that the power-law and the lognormal distributions are better fits to the empirical data in the early stages and the subsequent course of this pandemic, respectively.Such observation at the county-level is similar to the former analysis at the country-level[10,13].We note that the  for the stretched exponential distribution is negative in the second panel.Therefore, the stretched exponential distribution is also used to fit that empirical data, and the other three tests accept this fit.Only based on the statistical analysis, we can not distinguish the lognormal distribution from the truncated power-law and the stretched exponential distributions using the empirical data presented in the second panel.This limitation of the statistical analysis further stimulates the theoretical studies on the generative mechanism using mathematical and physical models.

1 Figure 3 .
Figure 3.The US county-level distributions of the numbers of daily confirmed COVID-19 cases.The star markers are the empirically estimated probability density  () of a county to have  daily confirmed cases on one day.The red dashed curves and legends are the results of the lognormal fits.For ease of comparison, the results of the stretched exponential fit are presented as the blue dashed curve and legend in the top middle panel.The values of  (-value) of the log-likelihood ratio tests are shown as black text in the bottom left corners of each panel.

Figure 4 .
Figure 4.The US county-level distributions of the numbers of daily confirmed COVID-19 deaths.The star markers are the empirically estimated probability density  () of a county to have  daily confirmed deaths on one day.The red dashed curves and legends are the fit results using theoretical distributions.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Figure 5 .
Figure 5.The Chinese city-level distributions of the numbers of cumulative confirmed COVID-19 cases.The star markers are the empirically estimated probability density  () of a city to have  cumulative confirmed cases by one day.The red dashed curves and legends are the results of the lognormal fits.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Figure 6 .
Figure 6.State/province-level distributions of the numbers of cumulative confirmed COVID-19 cases.The star markers are the empirically estimated probability density  () of a state/province to have  cumulative confirmed cases by one day.The red dashed curves and legends are the results of the stretched exponential fits.For ease of comparison, the results of the lognormal fits are presented by blue dashed curves and legends.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Figure 7 .
Figure 7. State/province-level distributions of the numbers of cumulative confirmed COVID-19 deaths.The star markers are the empirically estimated probability density  () of a state/province to have  cumulative confirmed deaths.The red dashed curves and legends are the fit results using theoretical distributions.The values of  (-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

Table 2 . The goodness of fit for the US county-level distributions of the numbers of cumulative confirmed COVID-19 cases. Date Distribution KS test's 𝐷 (𝑝-value) AD test's 𝐴 2 (𝑝-value) CVM test's 𝑇 (𝑝-value)
Fig 1 exhibits a typical evolution characteristic of the US county-level distribution of cumulative confirmed COVID-19 cases over

Table 10 . Estimation of the shape parameter of the Chinese city-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.
The numbers in [ ] are the 95% confidence intervals.

Table 11 . Estimation of the shape parameter of the state/province-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.
The numbers in [ ] are the 95% confidence intervals.