Heavy-tailed distributions of confirmed COVID-19 cases and deaths in spatiotemporal space

Abstract

This paper conducts a systematic statistical analysis of the geographical empirical distributions of the numbers of both cumulative and daily confirmed COVID-19 cases and deaths at county, city, and state levels over a time span from January 2020 to June 2022. Mathematical heavy-tailed distributions can be used to fit the empirical distributions observed at different temporal stages and geographical scales. Estimates of the shape parameter of the tail distributions obtained with the Generalized Pareto Distribution also support the observation of heavy tails. According to the characteristics of the heavy-tailed distributions, the evolution of the geographical empirical distributions can be divided into three distinct phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II. These three phases could serve as an indicator of the severity of the COVID-19 pandemic within an area. The empirical results suggest important intrinsic dynamics of the spread of an infectious virus through the interconnected physical complex network of human society. The findings extend previous empirical studies and could provide stricter constraints for current mathematical and physical modeling studies, such as the SIR model and its variants based on the theory of complex networks.

Introduction

The COVID-19 global pandemic broke out in the interconnected physical complex network of human society in late 2019. It has caused unprecedented damage to global public health, societal safety, and the economy [1–5]. To control the pandemic, many countries have taken a variety of containment measures since its early phase [6].

Over the past few years, it has been observed that some countries, states, provinces, cities, and counties have a huge number of confirmed COVID-19 cases and deaths, while others have only a few [1, 7–9]. One may intuitively attribute this geographical heterogeneity to the different containment measures adopted in different regions. It is true that some countries have adopted relatively strict containment measures compared with others. However, similar geographical heterogeneity is still observed within a single country, such as China, where unified COVID-19 containment measures were adopted.

The huge geographical heterogeneity implies a spatial heavy-tailed distribution [10–13]. Blasius [10] analyzed data on COVID-19 confirmed cases and deaths at the end of March 2020 for countries worldwide and for counties in the US, and revealed that the geographical distributions of confirmed cases and deaths follow a truncated power-law. Similarly, Beare and Toda [11] investigated the US county-level distribution of confirmed cases during the early phase of this epidemic, and found that this distribution obeys a power-law with an exponent close to 1. Subsequently, Ahundjanov et al. [12] computed the Chinese city-level distribution of confirmed cases at the end of May 2020, and obtained a result similar to that for the US county-level distribution. In 2022, Liu and Zheng [13] conducted a more comprehensive and systematic analysis of COVID-19 cumulative and daily confirmed cases and deaths for countries worldwide by considering their temporal evolution. That study shows that the power-law and the stretched exponential function describe the geographical country-level distributions well over a period from the early phase of this epidemic to June 2022 [13]. In addition to the geographical power-law distribution, a power-law growth pattern during the early phase of this epidemic was also observed [14–17]. The power-law pattern reveals important intrinsic dynamics of the COVID-19 spread process in the complex network of human society.

The previous empirical studies focused only on data from the early phase of this epidemic and did not investigate the differences among geographical scales. To obtain a comprehensive understanding of the dynamics of the COVID-19 spread process, it is valuable to perform a more in-depth and systematic analysis of COVID-19 data across different geographical scales over as long a time span as possible. Extending the previous studies [10–13], this paper analyzes COVID-19 time series data at the county, city, and state/province levels from Australia, Canada, China, Denmark, France, Netherlands, New Zealand, the UK, and the US, and further investigates the spatial and temporal impact on the distributions of both cumulative and daily confirmed cases and deaths.

The heavy-tailed distribution features the concurrence of small and extremely large values. It has been observed widely in the complex systems of nature and human society [18–22]. Typical examples include the distributions of income and wealth [23, 24], return of financial assets [25–27], consumption [28], firm size [29], rainfall depth [30], agricultural land size [31], city size [32], rank-size of human settlements [33], human mobility [34–37], oil and natural gas production [38], frequency of words [39], university research activities [40], degree of complex networks [41–43], tropical cyclone damages [44], COVID-19 coronavirus superspreading [45], etc. Such a distribution is an indication of the essential dynamics underlying a complex system, and is often related to the phenomena of self-organized criticality, turbulence, fractals, and so on [43, 46–49]. This paper contributes to the literature by presenting the observation of heavy-tailed distributions in empirical data of confirmed COVID-19 cases and deaths. This paper also provides another example of the damages from natural disasters characterized by the heavy-tailed distribution [44].

To understand the spread dynamics of the COVID-19 pandemic, scholars have conducted a large number of theoretical studies using mathematical and physical modeling approaches. Pullano et al. [50] and Gilbert et al. [51] estimated the early importation risk by considering air travel flows originating from the infected cities in China. Jia et al. [52] introduced a spatiotemporal risk source model to study the geographical spread and the growth pattern during the early phase based on the population flow data. Maier and Brockmann [53] and Manchein et al. [54] investigated the impact of containment measures on the early subexponential and power-law growth patterns. Arenas et al. [55] introduced an age-stratified mobility-based metapopulation model to study the spatiotemporal spread by considering mobility and social distancing interventions. Sardar et al. [56] and Sardar and Rana [57] developed mathematical models with lockdown effect to assess lockdown policy and potential outbreak risks from hospitals and quarantined centers. Jentsch et al. [58] introduced a coupled social-epidemiological model by considering the interaction between social and epidemiological dynamics. Musa et al. [59] introduced a COVID-19 mathematical model which considered the effect of public awareness. Moore et al. [60] developed a model structured by age and UK region to study the effects of vaccination and non-pharmaceutical interventions. Özköse et al. [61] proposed a fractional order model to study the dynamics of the Omicron variant. Ikram et al. [62] developed a stochastic SIVR model for COVID-19. Sk et al. [17] introduced a completely new model which incorporated the power-law effect of the COVID-19 transmission process. Chang et al. [63] introduced a metapopulation SEIR model using mobile phone data to explain inequities and inform reopening. 
Although there are many mathematical and physical modeling studies as stated above, these studies did not capture the feature of the geographical heavy-tailed distributions presented in some empirical studies [10–13]. Blasius [10] proposed a model with two spatial scales to explain the emergence of power-law during the initial phase, and pointed out that the power-law pattern might be changed in the subsequent phase of this pandemic. Beare and Toda [11] used a simple mathematical model involving Gibrat’s law to explain the emergence of the US county-level power-law distribution. Ahundjanov et al. [12] developed a proportionate random growth model based on Gibrat’s law to investigate the emergence of the Chinese city-level power-law distribution.

In this paper, we observe not only the geographical power-law distribution of confirmed COVID-19 cases and deaths at different spatial scales, but also other heavy-tailed distributions across different temporal phases. The findings presented in this paper extend previous empirical studies, and could provide stricter constraints for current mathematical and physical modeling studies, such as the SIR model and its variants based on the theory of complex networks.

Data source

This paper conducts a detailed analysis of the distributions of the COVID-19 numbers for both cumulative and daily confirmed cases and deaths at county, city, and state/province levels. The COVID-19 datasets analyzed in this paper are sourced from the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University [8] and the China Data Lab Dataverse operated by Harvard University [64, 65]. The dataset from CSSE was updated until March 2023, while the dataset from China Data Lab Dataverse was updated until May 2022. More details of these datasets can be accessed via the sources listed in Table 1 and the scientific articles published in journals of the Lancet Infectious Diseases [9] and the Data and Information Management [66]. Table 1 lists the sources of the datasets for each geographical scale analyzed in this paper.

Table 1. The sources of COVID-19 time series data analyzed in this paper.

https://doi.org/10.1371/journal.pone.0294445.t001

Mathematical heavy-tailed distributions and methods

Many observables in natural and social systems obey an exponentially bounded distribution, which decays exponentially ($e^{-x}$) or faster than the exponential (e.g. the normal distribution, $e^{-x^2}$) as the observable x increases. It is also known as a thin-tailed distribution because of the very small probability of large values of x. However, extreme values of some observables are also observed in complex systems. These extreme events are important and interesting since they often play a crucial role in understanding the complex system. The occurrence of extreme events implies a heavy-tailed distribution that decays more slowly than the exponentially bounded distribution [67]. The heavy-tailed distributions that are often used to describe empirical data include the power-law (also known as the Pareto distribution or Zipf’s law), the truncated power-law, the stretched exponential (also known as the Weibull distribution), and the lognormal distributions [67].

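The probability density functions of the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions are mathematically written as Eqs (1)–(4), respectively [67].

$$p(x) = C x^{-\alpha} \tag{1}$$

$$p(x) = C x^{-\alpha} e^{-\lambda x} \tag{2}$$

$$p(x) = C x^{\beta - 1} e^{-\lambda x^{\beta}} \tag{3}$$

$$p(x) = \frac{C}{x} \exp\left[-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right] \tag{4}$$

Here α is the power-law index, λ and β are the rate and stretching parameters, and μ and σ are the location and scale of the lognormal distribution.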

In some cases, the support of x is restricted to [xmin, +∞) or [xmin, xmax], with xmin > 0. C is a normalization constant whose form differs for continuous and discrete random variables. In our study, we use the discrete form since the numbers of confirmed COVID-19 cases and deaths are integers.
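As a minimal sketch of the discrete form, the following Python code normalizes each candidate over a finite integer support [xmin, xmax]. The functional forms are the standard ones from the heavy-tail literature and are assumed here; the parameter names `lam` and `beta`, the helper names, and the finite cutoff `xmax` are illustrative choices, not taken from this paper.

```python
import math

# Unnormalized shapes of the four heavy-tailed candidates (standard forms,
# assumed here; function and parameter names are illustrative).
def power_law(x, alpha):
    return x ** (-alpha)

def truncated_power_law(x, alpha, lam):
    return x ** (-alpha) * math.exp(-lam * x)

def stretched_exponential(x, lam, beta):
    return x ** (beta - 1.0) * math.exp(-lam * x ** beta)

def lognormal(x, mu, sigma):
    return math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)) / x

def discrete_pmf(shape, xmin, xmax, **params):
    """Normalize a shape function over the integers xmin..xmax (xmin > 0).

    The normalizer C is the reciprocal of the sum of the shape over the
    support; in the continuous case the sum would be an integral instead.
    """
    if xmin <= 0:
        raise ValueError("xmin must be positive")
    weights = {n: shape(n, **params) for n in range(xmin, xmax + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}
```

For example, `discrete_pmf(power_law, 1, 1000, alpha=1.5)` returns probabilities whose ratio between N = 10 and N = 100 is 10^1.5, which is the straight line seen on a log-log plot.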

The power-law distribution has been generalized as the Generalized Pareto Distribution (GPD). The GPD covers both cases of the thin and heavy-tailed distributions. Therefore, although the GPD might not be suitable for describing an empirical distribution over the entire range of a variable, it is often used to estimate the shape of a distribution’s tail [44].
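For reference, the GPD is commonly written in terms of the excess y = x − u over a threshold u. The symbols ξ (shape), σ (scale), and u (threshold) are notation assumed here, since this section does not define them.

```latex
% Survival function of the excess over threshold u (standard GPD form;
% notation \xi, \sigma, u is assumed, not taken from the paper).
P(X - u > y \mid X > u) =
\begin{cases}
  \left(1 + \dfrac{\xi y}{\sigma}\right)^{-1/\xi}, & \xi \neq 0, \\[2ex]
  \exp\left(-\dfrac{y}{\sigma}\right),             & \xi = 0,
\end{cases}
\qquad y \ge 0,\ \sigma > 0 .
% \xi > 0 : power-law tail with exponent 1/\xi (heavy-tailed)
% \xi = 0 : exponential tail
% \xi < 0 : bounded support, 0 \le y \le -\sigma/\xi
```

Under this convention a positive shape parameter signals a heavy tail, while zero or negative values correspond to exponential or bounded tails.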

In this paper, the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions are employed to fit the empirical distribution over the observable’s entire range with the popular fitting methods developed by Clauset et al. [18] and Klaus et al. [68]. The mathematical details of the fitting methods can be found in references [18, 68], and the algorithm’s implementation has been coded in the Python package powerlaw [69], which is widely used in scientific research and thus in this work. To have a better quantitative understanding of the tail distribution, this paper also uses the GPD to estimate the tail’s shape parameter [70, 71].
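To illustrate the kind of estimate behind the power-law fits, here is a minimal sketch of the maximum-likelihood estimator of the exponent for a fixed xmin, following the closed-form expressions of Clauset et al. [18]. It is not the powerlaw package itself, which additionally selects xmin by minimizing the KS distance; the function name and return layout are assumptions made for this example.

```python
import math

def fit_power_law_alpha(data, xmin, discrete=True):
    """Maximum-likelihood estimate of the power-law exponent for x >= xmin.

    Continuous data: alpha = 1 + n / sum(ln(x / xmin)).
    Discrete data:   the same expression with xmin replaced by xmin - 0.5,
                     a closed-form approximation to the exact discrete MLE
                     that works best when xmin is not too small.
    Returns (alpha, standard_error, n_tail).
    """
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    if n < 2:
        raise ValueError("need at least two observations above xmin")
    shift = xmin - 0.5 if discrete else xmin
    log_sum = sum(math.log(x / shift) for x in tail)
    if log_sum <= 0:
        raise ValueError("all observations equal xmin; exponent undefined")
    alpha = 1.0 + n / log_sum
    stderr = (alpha - 1.0) / math.sqrt(n)  # asymptotic standard error
    return alpha, stderr, n
```

Observations below xmin are discarded before fitting, so the estimate describes only the tail of the distribution.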

Here we use four statistical hypothesis tests to evaluate the goodness of fit. The first one is the log-likelihood ratio test, which identifies which of two fits is better. This method compares one candidate distribution against another, and calculates two values, R and a p-value [69]. A positive R indicates that the former is a better fit than the latter, and vice versa [69]. The p-value quantifies the significance of that direction, which is significant when the p-value is less than 0.05 [69]. The other tests are the Kolmogorov–Smirnov test (KS test) [72], the Anderson–Darling test (AD test) [72], and the Cramér–von Mises test (CVM test) [72]. The statistics for these three tests are denoted by D, A², and T in this paper, respectively. The p-values of these three tests are obtained by a Monte Carlo method in our analysis. It is noteworthy that the KS test should be used with caution, as it is underpowered relative to the alternative test statistics [69, 72, 73].
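The log-likelihood ratio comparison can be sketched as follows, assuming the normal-approximation form described by Clauset et al. [18]: R is the sum of pointwise log-likelihood differences, and the p-value is erfc(|R| / sqrt(2 n s²)), with s² the variance of those differences. The two log-density helpers are continuous simplifications introduced only for illustration; the analysis in this paper uses discrete distributions via the powerlaw package.

```python
import math

def loglikelihood_ratio(loglik_a, loglik_b):
    """Compare two fitted candidates from their pointwise log-likelihoods.

    Returns (R, p). R > 0 favors candidate A, R < 0 favors candidate B;
    p is the significance of the sign of R under a normal approximation.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both candidates must be evaluated on the same data")
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    R = sum(diffs)
    mean = R / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    if variance == 0.0:
        return R, 1.0  # identical pointwise likelihoods: no preference
    p = math.erfc(abs(R) / math.sqrt(2.0 * n * variance))
    return R, p

# Illustrative continuous log-densities on [xmin, +inf) (assumed helpers).
def pareto_logpdf(x, alpha, xmin):
    return math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin)

def shifted_exponential_logpdf(x, lam, xmin):
    return math.log(lam) - lam * (x - xmin)
```

A typical use evaluates both log-densities at their fitted parameters on the same tail data and passes the two lists to `loglikelihood_ratio`; a small p-value means the sign of R can be trusted.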

Results and discussion

The US county-level distributions

Here we first analyze the cumulative and daily numbers of confirmed COVID-19 cases and deaths at the US county-level. Since the COVID-19 pandemic has invaded more than 3,000 counties in the US, there are enough statistics to investigate the characteristics of the distributions for these four metrics. Figs 1 to 4 show the evolution behavior of these distributions over more than two years from the early stages of this pandemic to 1 June 2022. These four figures use the same techniques of statistical analysis.

Fig 1. The US county-level distributions of the numbers of cumulative confirmed COVID-19 cases.

The star markers are the empirically estimated probability density P(N) of a county to have N cumulative confirmed cases by one day. The red dashed curves and legends are the fit results using theoretical distributions. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g001

Fig 2. The US county-level distributions of the numbers of cumulative confirmed COVID-19 deaths.

The star markers are the empirically estimated probability P(N) of a county to have N cumulative confirmed deaths by one day. The red dashed curves and legends are the fit results using theoretical distributions. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g002

Fig 3. The US county-level distributions of the numbers of daily confirmed COVID-19 cases.

The star markers are the empirically estimated probability density P(N) of a county to have N daily confirmed cases on one day. The red dashed curves and legends are the results of the lognormal fits. For ease of comparison, the results of the stretched exponential fit are presented as the blue dashed curve and legend in the top middle panel. The values of R (p-value) of the log-likelihood ratio tests are shown as black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g003

Fig 4. The US county-level distributions of the numbers of daily confirmed COVID-19 deaths.

The star markers are the empirically estimated probability density P(N) of a county to have N daily confirmed deaths on one day. The red dashed curves and legends are the fit results using theoretical distributions. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g004

Fig 1 shows the US county-level distributions of the numbers of cumulative confirmed cases. From this figure, it is seen that the power-law and lognormal distributions describe these empirical data well. The key parameters of theoretical distributions are obtained by the fitting algorithm [18, 68] implemented in powerlaw package [69], and shown in legends.

The goodness of fit of the distributions shown in Fig 1 is evaluated by the aforementioned four hypothesis tests. The black text in the bottom left corner of each panel refers to the log-likelihood ratio test, which compares the exponential, the power-law, the truncated power-law, the stretched exponential, and the lognormal distributions. The log-likelihood ratio test indicates that the lognormal distribution is the best candidate fit for the empirical data presented in the last 5 panels. For the data presented in the first panel, the power-law might be a better candidate distribution. However, the log-likelihood ratio test presented in the first panel shows that the power-law, the truncated power-law, and the stretched exponential distributions are indistinguishable. This indistinguishability may be caused by the limited statistics available in the early stages of the COVID-19 pandemic. The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test, whose results are presented in Table 2. From this table, we see that both the KS test and the CVM test accept these fits at a significance level of 1%.

Table 2. The goodness of fit for the US county-level distributions of the numbers of cumulative confirmed COVID-19 cases.

https://doi.org/10.1371/journal.pone.0294445.t002
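Monte Carlo p-values of the kind used for the KS test can be obtained along the following lines. This is a sketch under stated assumptions, not the procedure used in this paper: it uses a continuous Pareto model with a fixed xmin as a stand-in for the fitted distribution, refits each synthetic sample, and reports the fraction of synthetic KS distances at least as large as the observed one. All function names are hypothetical.

```python
import math
import random

def pareto_cdf(x, alpha, xmin):
    return 1.0 - (x / xmin) ** (1.0 - alpha)

def fit_alpha(data, xmin):
    # Continuous maximum-likelihood estimate of the exponent.
    return 1.0 + len(data) / sum(math.log(x / xmin) for x in data)

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def ks_monte_carlo_pvalue(data, xmin, n_sim=200, seed=0):
    """Return (D_observed, p) using a parametric bootstrap."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, xmin)
    d_obs = ks_statistic(data, lambda x: pareto_cdf(x, alpha, xmin))
    exceed = 0
    for _ in range(n_sim):
        # Inverse-transform sampling from the fitted Pareto model.
        synth = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
                 for _ in data]
        a_sim = fit_alpha(synth, xmin)  # refit, as done for the real data
        d_sim = ks_statistic(synth, lambda x: pareto_cdf(x, a_sim, xmin))
        if d_sim >= d_obs:
            exceed += 1
    return d_obs, exceed / n_sim
```

A fit is accepted at the 1% level when the returned p-value is at least 0.01. Refitting each synthetic sample matters because the model parameters were estimated from the same data being tested.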

Fig 1 exhibits a typical evolution of the US county-level distribution of cumulative confirmed COVID-19 cases over a time span of more than two years. In the early stages of this pandemic, the power-law with an index α = 1.50 may be a candidate distribution to describe the empirical data. A characteristic of the empirical distribution in this phase is that the probability density function is close to a straight line in a log-log plot. According to the log-likelihood ratio test, it is difficult to distinguish the power-law from the truncated power-law and the stretched exponential distributions. To identify the true distribution underlying the empirical data, the generative mechanism needs to be investigated theoretically by developing mathematical and physical models.

In the subsequent course of this pandemic, the power-law gradually converges to the lognormal distribution, which is characterized by two key parameters: location μ and scale σ. As the pandemic develops, the location μ gradually increases, and the scale σ decreases except for the data presented in the last panel. The mode of the lognormal distribution equals $e^{\mu - \sigma^{2}}$. The fitting results show that the mode gradually moves to the right as the pandemic evolves. Depending on the position of the mode relative to the minimum value of the empirical data, the subsequent pandemic course has two distinct phases, called lognormal phase I and lognormal phase II. The mode and the minimum value of the empirical data are close to each other in the lognormal phase I, and the former is significantly larger than the latter in the lognormal phase II.
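The position of the lognormal mode follows from a short calculation, written here for the continuous lognormal density with location μ and scale σ (the normalizer C is an assumed symbol for the constant prefactor).

```latex
% Log-density of the lognormal distribution
\ln p(x) = \ln C - \ln x - \frac{(\ln x - \mu)^{2}}{2\sigma^{2}}

% Setting the derivative to zero locates the mode
\frac{d \ln p}{dx} = -\frac{1}{x} - \frac{\ln x - \mu}{\sigma^{2} x} = 0
\;\Longrightarrow\;
\ln x_{\mathrm{mode}} = \mu - \sigma^{2}
\;\Longrightarrow\;
x_{\mathrm{mode}} = e^{\mu - \sigma^{2}}
```

An increasing μ and a decreasing σ both raise μ − σ², which is why the mode shifts to the right as the pandemic develops.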

Based on the discussion above, we conclude that the evolution of the empirical distributions over more than two years can be divided into three phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II. The distributions of these three phases have the characteristic of the heavy-tailed distribution. The transformation from the power-law phase to the lognormal phase I is similar to the previous observation on the country-level distributions [13]. However, the US county-level distributions especially exhibit the feature of the lognormal phase II, which is not observed in the country-level distributions [13].

Fig 2 shows the US county-level distributions of the numbers of cumulative confirmed deaths. The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test. The results of these three tests are presented in Table 3 and show that these theoretical distributions are acceptable.

Table 3. The goodness of fit for the US county-level distributions of the numbers of cumulative confirmed COVID-19 deaths.

https://doi.org/10.1371/journal.pone.0294445.t003

An evolution process similar to that shown in Fig 1 is observed here. However, the evolution for cumulative confirmed deaths is much slower than that for cumulative confirmed cases. This phenomenon can be explained by the way “infection” events and “death” events develop. Because “death” events appear later and accumulate more slowly than “infection” events, the distributions of cumulative confirmed deaths evolve more slowly. Effective containment measures also contribute partially to this slower evolution.

For the US county-level distributions of both cumulative confirmed cases and deaths, the feature of the lognormal phase II is observed. In accordance with the mathematical model of the lognormal distribution, one could make a prediction that if no effective and timely containment measure is taken to control the development of such a pandemic, all counties will be invaded in the late stages of the COVID-19 course.

Compared with this study, the previous analysis of the country-level data [13] did not observe the lognormal phase II, at least up to 1 June 2022. This difference between the US county-level and the worldwide country-level results may arise because the COVID-19 pandemic is less severe when viewed at the scale of countries worldwide than at the scale of US counties. In fact, these three phases could be an indicator of severity in terms of the geographical scope affected and the number of people infected by the COVID-19 pandemic.

Fig 3 illustrates the US county-level distributions for daily confirmed cases. The theoretical distributions, chosen to fit the empirical data based on the log-likelihood ratio test, are further tested using the KS test, the AD test, and the CVM test. The results of these three tests are presented in Table 4. The aforementioned four hypothesis tests all indicate that the power-law and the lognormal distributions are better fits to the empirical data in the early stages and the subsequent course of this pandemic, respectively. This observation at the county level is similar to the earlier analyses at the country level [10, 13]. We note that the R for the stretched exponential distribution is negative in the second panel. Therefore, the stretched exponential distribution is also used to fit those data, and the other three tests accept this fit. Based on the statistical analysis alone, we cannot distinguish the lognormal distribution from the truncated power-law and the stretched exponential distributions using the empirical data presented in the second panel. This limitation of the statistical analysis further motivates theoretical studies of the generative mechanism using mathematical and physical models.

Table 4. The goodness of fit for the US county-level distributions of the numbers of daily confirmed COVID-19 cases.

https://doi.org/10.1371/journal.pone.0294445.t004

Fig 4 illustrates the US county-level distributions for daily confirmed deaths. The KS test, the AD test, and the CVM test presented in Table 5 all show that the power-law, the truncated power-law, and the lognormal distributions describe the empirical data well in different stages of the COVID-19 pandemic. However, we cannot draw a firm conclusion from the log-likelihood ratio test because of the limited statistics of the daily data on confirmed deaths.

Table 5. The goodness of fit for the US county-level distributions of the numbers of daily confirmed COVID-19 deaths.

https://doi.org/10.1371/journal.pone.0294445.t005

The Chinese city-level distributions

This paper also analyzes the city-level data of more than 300 Chinese cities. Fig 5 shows the Chinese city-level distributions of the numbers of cumulative confirmed cases over a time span from the very early stages of this pandemic to 1 April 2022. There is no illustration for the cumulative confirmed deaths or the daily confirmed cases and deaths, since there are not enough cities with such confirmed cases and deaths. From the results of the KS test, the AD test, and the CVM test presented in Table 6, it is evident that the lognormal distribution describes the Chinese city-level data well in different stages of this pandemic. However, the log-likelihood ratio test shows that some theoretical distributions are indistinguishable based on the empirical data on some dates.

Fig 5. The Chinese city-level distributions of the numbers of cumulative confirmed COVID-19 cases.

The star markers are the empirically estimated probability density P(N) of a city to have N cumulative confirmed cases by one day. The red dashed curves and legends are the results of the lognormal fits. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g005

Table 6. The goodness of fit for the Chinese city-level distributions of the numbers of cumulative confirmed COVID-19 cases.

https://doi.org/10.1371/journal.pone.0294445.t006

Although a firm conclusion cannot be drawn from the log-likelihood ratio test, the other hypothesis tests and the shapes of the empirical distributions indicate that the empirical distributions exhibit the feature of the lognormal phase I in all stages studied here. In the very early stages of this pandemic, at least by 1 February 2020, the lognormal distribution had already formed. Compared with the US county-level distributions shown in Fig 1 and the country-level distributions presented in former studies [10, 13], the Chinese city-level data formed the lognormal distribution much earlier than the US county-level data and the worldwide country-level data. The reason for this difference between the Chinese city-level data and the US county-level data (or the worldwide country-level data) is that the infectious novel coronavirus SARS-CoV-2 first invaded China and then other countries. Note that no distribution takes the form of the lognormal phase II. This absence of the lognormal phase II reflects the effect of China’s effective pharmaceutical and non-pharmaceutical interventions before mid-2022.

State/province-level distributions

In addition to the US county-level and the Chinese city-level data, this paper also analyzes the state/province-level data from Australia, Canada, China, Denmark, France, Netherlands, New Zealand, the UK, and the US. There are more than 100 states and provinces in these 9 countries. Figs 6 and 7 show the state/province-level distributions of the numbers of cumulative confirmed cases and deaths, obtained by pooling these more than 100 states and provinces together. We cannot illustrate the daily metrics since there are not enough states and provinces with daily confirmed cases and deaths.

Fig 6. State/province-level distributions of the numbers of cumulative confirmed COVID-19 cases.

The star markers are the empirically estimated probability density P(N) of a state/province to have N cumulative confirmed cases by one day. The red dashed curves and legends are the results of the stretched exponential fits. For ease of comparison, the results of the lognormal fits are presented by blue dashed curves and legends. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g006

Fig 7. State/province-level distributions of the numbers of cumulative confirmed COVID-19 deaths.

The star markers are the empirically estimated probability density P(N) of a state/province to have N cumulative confirmed deaths. The red dashed curves and legends are the fit results using theoretical distributions. The values of R (p-value) of the log-likelihood ratio tests are shown as the black text in the bottom left corners of each panel.

https://doi.org/10.1371/journal.pone.0294445.g007

For the cumulative confirmed cases, the state/province-level distribution can be described by the stretched exponential distribution according to the log-likelihood ratio test. The log-likelihood ratio test also indicates that the stretched exponential and the lognormal distributions are indistinguishable based on the data presented in the first five panels. Therefore, for ease of comparison, we also plot the results of the lognormal fits in the first five panels. The KS test, the AD test, and the CVM test presented in Table 7 accept the lognormal fits, as well as the stretched exponential fits except for that on 1 December 2021. It is noteworthy that the empirical distribution shows the feature of lognormal phase I in all stages studied here, which is similar to the Chinese city-level distribution of cumulative confirmed cases. By comparison, the country-level and the US county-level distributions still followed the power-law in March 2020 [10, 13].

Table 7. The goodness of fit for the state/province-level distributions of the numbers of cumulative confirmed COVID-19 cases.

https://doi.org/10.1371/journal.pone.0294445.t007

For the cumulative confirmed deaths, according to the log-likelihood ratio test, the truncated power-law distribution well describes the data presented in the last five panels, and the lognormal distribution seems to be a better fit for the data presented in the first panel. This comparative test also indicates that the lognormal, the truncated power-law, and the stretched exponential distributions are indistinguishable based on the data presented in the first panel, and the truncated power-law and the stretched exponential distributions are indistinguishable based on the data presented in the last panel. However, all fits are acceptable according to the KS test, AD test, and CVM test which are presented in Table 8. From the shape of these empirical distributions and the fits, it is evident that these empirical distributions were in the transition stages from the power-law phase to the lognormal phase I. At least by 1 May 2022, the data had not yet formed the feature of the lognormal phase II.

Table 8. The goodness of fit for the state/province-level distributions of the numbers of cumulative confirmed COVID-19 deaths.

https://doi.org/10.1371/journal.pone.0294445.t008

From the studies discussed above, it is concluded that the US county-level, the Chinese city-level, and the state/province-level empirical distributions behave as heavy-tailed distributions. These heavy-tailed distributions indicate that a large number of areas are infected slightly and a few areas are infected tremendously by the COVID-19 pandemic. The transformation from the power-law phase to the lognormal phase I, and then to the lognormal phase II, suggests that the geographical heterogeneity is narrowing along with the evolution of this pandemic.

Estimations of the shape parameter using the GPD

To further quantify the characteristics of the tails of the US county-level, the Chinese city-level, and the state/province-level empirical distributions, here we estimate the shape parameters of the tails using the GPD. Since there is no universal guidance on the selection of the threshold, Tables 9–11 report the point estimates and the 95% confidence intervals of the shape parameters over several thresholds.

Table 9. Estimation of the shape parameter of the US county-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.

The numbers in [ ] are the 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0294445.t009

Table 10. Estimation of the shape parameter of the Chinese city-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.

The numbers in [ ] are the 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0294445.t010

Table 11. Estimation of the shape parameter of the state/province-level distribution over thresholds of 50th, 60th, 70th, and 80th percentiles using the GPD.

The numbers in [ ] are the 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0294445.t011

From Tables 9–11, we see that all point estimates are positive, which means that these empirical distributions are heavy-tailed. In a few cases, the 95% confidence intervals include negative values of the shape parameter, but positive values remain more likely. Therefore, the GPD estimations of the shape parameters support the conclusion of heavy-tailed distributions, which was also reached by fitting the empirical data with the theoretical heavy-tailed distributions.
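The threshold scan described above can be sketched as follows. The estimator used for Tables 9–11 is not restated in this section, so this sketch swaps in the probability-weighted-moments (PWM) estimator of the GPD shape parameter, which has a closed form and is valid only for shape values below 1. All function names are hypothetical, the data are synthetic, and no confidence intervals are computed.

```python
import random

def gpd_shape_pwm(exceedances):
    """PWM estimate of the GPD shape parameter from exceedances
    (observations minus the threshold). Valid only for shape < 1."""
    y = sorted(exceedances)
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 exceedances")
    a0 = sum(y) / n                                   # estimates E[X]
    a1 = sum((n - i) / (n - 1) * v                    # estimates E[X (1 - F(X))]
             for i, v in enumerate(y, start=1)) / n
    return 2.0 - a0 / (a0 - 2.0 * a1)

def shape_over_percentiles(data, percentiles=(50, 60, 70, 80)):
    """Shape estimate for each percentile threshold (percentiles < 100)."""
    xs = sorted(data)
    result = {}
    for p in percentiles:
        u = xs[int(len(xs) * p / 100)]                # empirical percentile
        result[p] = gpd_shape_pwm([x - u for x in xs if x > u])
    return result

def sample_gpd(n, shape, scale, rng):
    """Draw n GPD variates by inverse-transform sampling (shape != 0)."""
    return [scale * ((1.0 - rng.random()) ** (-shape) - 1.0) / shape
            for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_gpd(50000, shape=0.2, scale=1.0, rng=rng)  # synthetic data
    for p, xi in shape_over_percentiles(data).items():
        print(f"{p}th percentile threshold: shape estimate {xi:.3f}")
```

A positive estimate that stays roughly stable across thresholds is the pattern interpreted in the text as evidence of a heavy tail.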

Conclusions

The COVID-19 global pandemic has caused unprecedented damage to global public health, society safety, and the economy. Understanding the spread dynamics of this pandemic is a challenging task. The geographical distribution characteristics of the numbers of confirmed COVID-19 cases and deaths suggest important intrinsic dynamics underlying a physical complex network in which the human infectious virus spreads. To investigate the distributions in different stages of this pandemic and different geographical scales of the pandemic spread, at the county, city, and state/province levels, this paper systematically analyzes the time series data of the confirmed COVID-19 cases and deaths from the early stages of this pandemic to June 2022.

We statistically analyze the distributions of both cumulative and daily confirmed cases and deaths at the US county level, of cumulative confirmed cases at the Chinese city level, and of both cumulative confirmed cases and deaths at the state/province level of Australia, Canada, China, Denmark, France, the Netherlands, New Zealand, the UK, and the US. The analysis provides evidence that these geographical distributions can be described by heavy-tailed distributions in different stages of this pandemic. These heavy-tailed distributions indicate that a large number of areas are only lightly affected while a few areas are severely affected by the COVID-19 pandemic. The evolution of this pandemic can be divided into three distinct phases, namely the power-law phase, the lognormal phase I, and the lognormal phase II. The distributions in different phases have different shapes. The shape of the distribution in the power-law phase is close to a straight line in a log-log plot. For the lognormal phase I and phase II, the shape of the distribution is a concave curve in a log-log plot. The difference between these two lognormal phases lies in the location of the mode relative to the observed minimum value. In the lognormal phase I the mode is close to the observed minimum value, whereas in the lognormal phase II the mode is significantly larger than the observed minimum value.
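The shapes described above follow directly from the functional forms of the two densities. The block below is a short worked derivation; the symbols C, α, μ, and σ (normalization constant, power-law exponent, and lognormal parameters) are notation assumed here for illustration and may differ from the notation used elsewhere in the paper.

```latex
% Power-law density: a straight line in log-log coordinates
p(x) = C\,x^{-\alpha}
\quad\Longrightarrow\quad
\ln p(x) = \ln C - \alpha \ln x .

% Lognormal density: a downward-opening parabola in \ln x
p(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}
       \exp\!\left[-\frac{(\ln x-\mu)^2}{2\sigma^2}\right]
\quad\Longrightarrow\quad
\ln p(x) = -\ln x - \frac{(\ln x-\mu)^2}{2\sigma^2}
           - \ln\!\left(\sigma\sqrt{2\pi}\right) .

% Setting d\ln p / d\ln x = 0 gives the mode of the lognormal density
x_{\mathrm{mode}} = e^{\mu-\sigma^{2}} .
```

In this notation, the lognormal phase I corresponds to a mode close to the observed minimum value, so that mostly the decreasing branch of the parabola is visible, while the lognormal phase II corresponds to a mode well above the observed minimum value, so that both branches appear.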

The US county-level cumulative distributions feature these three phases. In the early stages, the distributions have the feature of the power-law phase. Then the distributions gradually converge to the lognormal phase I and phase II. The evolution for the cumulative confirmed deaths is slower than that for the cumulative confirmed cases because of the difference between the dynamics of the “death” event and the “infection” event. The Chinese city-level distributions for the cumulative confirmed cases only have the feature of the lognormal phase I across all stages studied in this paper. Compared with the US county-level distributions, the absence of the power-law phase and the lognormal phase II indicates that the Chinese city-level distribution evolved faster in the early stages of this pandemic and more slowly in the middle and late stages. The reason for this difference in evolution speed between the Chinese city-level data and the US county-level data is that the novel coronavirus first spread in China and then in other countries, and the further development of this pandemic was significantly suppressed by the effective pharmaceutical and non-pharmaceutical interventions taken in China. The state/province-level data of cumulative confirmed cases follow an evolution process similar to that of the Chinese city-level data, whereas the state/province-level data of cumulative confirmed deaths evolve much more slowly than the Chinese city-level data.

These three phases could be an indicator of the severity degree of the COVID-19 pandemic within an area. Generally, the power-law phase, the lognormal phase I, and the lognormal phase II represent low, middle, and high severity degrees of the COVID-19 pandemic within an area, respectively. Whether the distribution of COVID-19 numbers in an area passes through all three phases depends on the evolution speed and the severity degree of the COVID-19 pandemic in that area. The transformation from the power-law phase to the lognormal phase I, and then to the lognormal phase II, suggests that the geographical heterogeneity narrows as the pandemic evolves.

The findings presented in this paper extend previous empirical studies. They also provide another example of damages from natural disasters that are characterized by heavy-tailed distributions. This study could provide stricter constraints for current mathematical and physical modeling studies, such as the SIR model and its variants based on the theory of complex networks. Such a study is also especially important for predicting the COVID-19 evolution, as many countries are ceasing to track COVID-19 [74, 75].

Acknowledgments

The authors would like to thank the academic editor and the anonymous reviewers for their constructive comments, which made the statistical analysis of this paper more robust.

References

  1. World Health Organization. Coronavirus disease (COVID-19) pandemic; 2023. https://www.who.int/emergencies/diseases/novel-coronavirus-2019. Retrieved 6 October 2023.
  2. Sun J, He WT, Wang L, Lai A, Ji X, Zhai X, et al. COVID-19: Epidemiology, evolution, and cross-disciplinary perspectives. Trends in Molecular Medicine. 2020;26(5):483–495. pmid:32359479
  3. Hu B, Guo H, Zhou P, Shi ZL. Characteristics of SARS-CoV-2 and COVID-19. Nature Reviews Microbiology. 2021;19:141–154. pmid:33024307
  4. Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species severe acute respiratory syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology. 2020;5:536–544.
  5. COVID-19 Excess Mortality Collaborators. Estimating excess mortality due to the COVID-19 pandemic: A systematic analysis of COVID-19-related mortality, 2020–21. The Lancet. 2022;399:1513–1536.
  6. Pan A, Liu L, Wang C, Guo H, Hao X, Wang Q, et al. Association of public health interventions with the epidemiology of the COVID-19 outbreak in Wuhan, China. JAMA. 2020;323:1915–1923. pmid:32275295
  7. Our World in Data. Coronavirus pandemic (COVID-19); 2023. https://ourworldindata.org/coronavirus. Retrieved 6 October 2023.
  8. Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. COVID-19 data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University; 2023. https://github.com/CSSEGISandData/COVID-19. Retrieved 6 October 2023.
  9. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases. 2020;20:533–534. pmid:32087114
  10. Blasius B. Power-law distribution in the number of confirmed COVID-19 cases. Chaos. 2020;30:093123. pmid:33003939
  11. Beare BK, Toda AA. On the emergence of a power law in the distribution of COVID-19 cases. Physica D. 2020;412:132649. pmid:32834250
  12. Ahundjanov BB, Akhundjanov SB, Okhunjanov BB. Power law in COVID-19 cases in China. Journal of the Royal Statistical Society Series A. 2022;185:699–719.
  13. Liu P, Zheng Y. Temporal and spatial evolution of the distribution related to the number of COVID-19 pandemic. Physica A. 2022;603:127837. pmid:35783919
  14. Singer HM. The COVID-19 pandemic: Growth patterns, power law scaling, and saturation. Physical Biology. 2020;17:055001. pmid:32526721
  15. Vazquez A. Superspreaders and lockdown timing explain the power-law dynamics of COVID-19. Physical Review E. 2020;102:040302. pmid:33212569
  16. Vasconcelos GL, Macêdo AMS, Duarte-Filho GC, Brum AA, Ospina R, Almeida FAG. Power law behaviour in the saturation regime of fatality curves of the COVID-19 pandemic. Scientific Reports. 2021;11:4619. pmid:33633290
  17. Sk T, Biswas S, Sardar T. The impact of a power law-induced memory effect on the SARS-CoV-2 transmission. Chaos, Solitons & Fractals. 2022;165:112790.
  18. Clauset A, Shalizi CR, Newman MEJ. Power-Law Distributions in Empirical Data. SIAM Review. 2009;51:661–703.
  19. Mitzenmacher M. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics. 2004;1:226–251.
  20. Newman MEJ. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics. 2005;46:323–351.
  21. Kumamoto SI, Kamihigashi T. Power Laws in Stochastic Processes for Social Phenomena: An Introductory Review. Frontiers in Physics. 2018;6:20.
  22. Laherrère J, Sornette D. Stretched exponential distributions in nature and economy: “Fat tails” with characteristic scales. The European Physical Journal B. 1998;2:525–539.
  23. Oancea B, Andrei T, Pirjol D. Income inequality in Romania: The exponential-Pareto distribution. Physica A. 2017;469:486–498.
  24. Ribeiro MB. Income Distribution Dynamics of Economic Systems: An Econophysical Approach. Cambridge: Cambridge University Press; 2020.
  25. Mantegna RN, Stanley HE. Scaling behaviour in the dynamics of an economic index. Nature. 1995;376:46–49.
  26. Plerou V, Gopikrishnan P, Nunes Amaral LA, Meyer M, Stanley HE. Scaling of the distribution of price fluctuations of individual companies. Physical Review E. 1999;60:6519–6529. pmid:11970569
  27. Liu P, Zheng Y. Precision measurement of the return distribution property of the Chinese stock market index. Entropy. 2023;25:36.
  28. Toda AA. A note on the size distribution of consumption: More double Pareto than lognormal. Macroeconomic Dynamics. 2017;21:1508–1518.
  29. Luttmer EGJ. Selection, growth, and the size distribution of firms. The Quarterly Journal of Economics. 2007;122:1103–1144.
  30. Montfort MAJV, Witter JV. The generalized Pareto distribution applied to rainfall depths. Hydrological Sciences Journal. 1986;31:151–162.
  31. Akhundjanov SB, Chamberlain L. The power-law distribution of agricultural land size. Journal of Applied Statistics. 2019;46:3044–3056.
  32. Gabaix X. Zipf’s law for cities: An explanation. The Quarterly Journal of Economics. 1999;114:739–767.
  33. Reed WJ. On the rank-size distribution for human settlements. Journal of Regional Science. 2002;42:1–17.
  34. Barabási AL. The origin of bursts and heavy tails in human dynamics. Nature. 2005;435:207–211. pmid:15889093
  35. Brockmann D, Hufnagel L, Geisel T. The scaling laws of human travel. Nature. 2006;439:462–465. pmid:16437114
  36. Alessandretti L, Aslak U, Lehmann S. The scales of human mobility. Nature. 2020;587:402–407. pmid:33208961
  37. Schläpfer M, Dong L, O’Keeffe K, Santi P, Szell M, Salat H, et al. The universal visitation law of human mobility. Nature. 2021;593:522–527. pmid:34040209
  38. Balthrop A. Power laws in oil and natural gas production. Empirical Economics. 2016;51:1521–1539.
  39. Irmay S. The relationship between Zipf’s law and the distribution of first digits. Journal of Applied Statistics. 1997;24:383–394.
  40. Plerou V, Amaral LAN, Gopikrishnan P, Meyer M, Stanley HE. Similarities between the growth dynamics of university research and of competitive economic activities. Nature. 1999;400:433–437.
  41. Albert R, Jeong H, Barabási AL. Diameter of the World-Wide Web. Nature. 1999;401:130–131.
  42. Barabási AL, Albert R, Jeong H. Mean-field theory for scale-free random networks. Physica A. 1999;272:173–187.
  43. Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–512. pmid:10521342
  44. Conte MN, Kelly DL. An imperfect storm: Fat-tailed tropical cyclone damages, insurance, and climate policy. Journal of Environmental Economics and Management. 2018;92:677–706.
  45. Wong F, Collins JJ. Evidence that coronavirus superspreading is fat-tailed. PNAS. 2020;117:29416–29418. pmid:33139561
  46. Bak P, Tang C, Wiesenfeld K. Self-organized criticality: An explanation of the 1/f noise. Physical Review Letters. 1987;59:381–384. pmid:10035754
  47. Nelkin M. Universality and scaling in fully developed turbulence. Advances in Physics. 1994;43:143–181.
  48. Meneveau C, Sreenivasan KR. The multifractal nature of turbulent energy dissipation. Journal of Fluid Mechanics. 1991;224:429–484.
  49. Bunde A, Havlin S. Fractals and Disordered Systems. Springer Berlin, Heidelberg; 2012.
  50. Pullano G, Pinotti F, Valdano E, Boëlle PY, Poletto C, Colizza V. Novel coronavirus (2019-nCoV) early-stage importation risk to Europe, January 2020. Eurosurveillance. 2020;25:2000057. pmid:32019667
  51. Gilbert M, Pullano G, Pinotti F, Valdano E, Poletto C, Boëlle PY, et al. Preparedness and vulnerability of African countries against importations of COVID-19: A modelling study. The Lancet. 2020;395:871–877. pmid:32087820
  52. Jia JS, Lu X, Yuan Y, Xu G, Jia J, Christakis NA. Population flow drives spatio-temporal distribution of COVID-19 in China. Nature. 2020;582:389–394. pmid:32349120
  53. Maier BF, Brockmann D. Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China. Science. 2020;368:742–746. pmid:32269067
  54. Manchein C, Brugnago EL, da Silva RM, Mendes CFO, Beims MW. Strong correlations between power-law growth of COVID-19 in four continents and the inefficiency of soft quarantine strategies. Chaos. 2020;30:041102. pmid:32357675
  55. Arenas A, Cota W, Gómez-Gardeñes J, Gómez S, Granell C, Matamalas JT, et al. Modeling the spatiotemporal epidemic spreading of COVID-19 and the impact of mobility and social distancing interventions. Physical Review X. 2020;10:041055.
  56. Sardar T, Nadim SS, Rana S, Chattopadhyay J. Assessment of lockdown effect in some states and overall India: A predictive mathematical study on COVID-19 outbreak. Chaos, Solitons & Fractals. 2020;139:110078.
  57. Sardar T, Rana S. Effective lockdown and role of hospital-based COVID-19 transmission in some Indian states: An outbreak risk analysis. Risk Analysis. 2022;42:126–142. pmid:34223651
  58. Jentsch PC, Anand M, Bauch CT. Prioritising COVID-19 vaccination in changing social and epidemiological landscapes: A mathematical modelling study. The Lancet Infectious Diseases. 2021;21:1097–1106. pmid:33811817
  59. Musa SS, Qureshi S, Zhao S, Yusuf A, Mustapha UT, He D. Mathematical modeling of COVID-19 epidemic with effect of awareness programs. Infectious Disease Modelling. 2021;6:448–460. pmid:33619461
  60. Moore S, Hill EM, Tildesley MJ, Dyson L, Keeling MJ. Vaccination and non-pharmaceutical interventions for COVID-19: A mathematical modelling study. The Lancet Infectious Diseases. 2021;21(6):793–802. pmid:33743847
  61. Özköse F, Yavuz M, Şenel MT, Habbireeh R. Fractional order modelling of omicron SARS-CoV-2 variant containing heart attack effect using real data from the United Kingdom. Chaos, Solitons & Fractals. 2022;157:111954. pmid:35250194
  62. Ikram R, Khan A, Zahri M, Saeed A, Yavuz M, Kumam P. Extinction and stationary distribution of a stochastic COVID-19 epidemic model with time-delay. Computers in Biology and Medicine. 2022;141:105115. pmid:34922174
  63. Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, Leskovec J. Mobility network models of COVID-19 explain inequities and inform reopening. Nature. 2021;589:82–87. pmid:33171481
  64. China Data Lab. US COVID-19 Daily Cases with Basemap. Harvard Dataverse. 2020.
  65. China Data Lab. China COVID-19 Daily Cases with Basemap. Harvard Dataverse. 2020.
  66. Hu T, Guan WW, Zhu X, Shao Y, Liu L, Du J, et al. Building an open resources repository for COVID-19 research. Data and Information Management. 2020;4:130–147. pmid:35382104
  67. Barabási AL, Pósfai M. Network science. Cambridge, UK: Cambridge University Press; 2016. Available from: http://barabasi.com/networksciencebook/.
  68. Klaus A, Yu S, Plenz D. Statistical analyses support power law distributions found in neuronal avalanches. PLOS ONE. 2011;6:1–12. pmid:21720544
  69. Alstott J, Bullmore E, Plenz D. powerlaw: A Python package for analysis of heavy-tailed distributions. PLOS ONE. 2014;9:1–11. pmid:24489671
  70. Embrechts P, Klüppelberg C, Mikosch T. Modelling Extremal Events for Insurance and Finance. New York: Springer; 1997.
  71. Kotz S, Nadarajah S. Extreme Value Distributions: Theory and Applications. London: Imperial College Press; 2000.
  72. Stephens MA. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association. 1974;69:730–737.
  73. Lilliefors HW. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association. 1967;62:399–402.
  74. Nature editorials. This is no time to stop tracking COVID-19. Nature. 2022;603:550. pmid:35322258
  75. Leung K, Leung GM, Wu JT. Modelling the adjustment of COVID-19 response and exit from dynamic zero-COVID in China; 2022. Preprint from medRxiv at https://doi.org/10.1101/2022.12.14.22283460.