Course of the first month of the COVID-19 outbreak in the New York State counties

We illustrate and study the evolution of reported infections over the month of March in New York State as a whole, as well as in each individual county in the state. We identify piecewise exponential trends, and search for correlations between the timing and dynamics of these trends and statewide mandated measures on testing and social distancing. We conclude that the reports on April 1 may be dramatically under-representing the actual number of statewide infections, an idea which is supported by more recent retroactive estimates based on serological studies. A follow-up study is underway, reassessing data until June 1, using additional measures for validation and monitoring for effects of the PAUSE directive, and of the reopening timeline.


Introduction
Since its first confirmed infections in the US, it has become clear that the COVID-19 outbreak was going to affect all US states and territories. However, the reported rates of infection, hospitalization and death have been from the start subject to wide controversy that sprouted from the known limitations in testing abilities, potentially resulting in dramatic under-reporting. Failure to accurately report infection rates can crucially alter our assessment of the size and dynamics of the pandemic, and our predictions of its future evolution. It also affects administrative decisions on the type and timeline of necessary social distancing measures. In the absence of precise measurements of infection rates, it is therefore very important, when interpreting the available data on confirmed infections, to differentiate between trends that are specific to the epidemic dynamics (and the effects of social distancing specifications), and artifacts introduced by limitations on testing availability or by reaching medical care capacity. The former are governed by the virus' clinical characteristics and the interactions in the social network where the virus propagates (in the presence or absence of additional social regulations); the latter are only a reflection of assessment limitations. In this study, we aim to tease apart such patterns.
Overall, there have been notable between-state differences in the timeline and magnitude of the epidemic. It is likely that these differences arise from a combination of inherent factors, from timing of the first infection (earlier states were caught unprepared), to population density and intrinsic social dynamics. Differences may also arise from variations in timing and efficiency of statewide mandated directives (travel bans, social distancing, timing and availability of testing). We work from the premise that patterns resulting from statewide control (1) accentuate between-state differences, and (2) can explain unifying trends in the data from same-state counties, transcending the between-county variability based on inherent factors. Our study investigates this premise based on the first month of data on the epidemic development in the US, focused specifically on its evolution in the State of New York, which, to date, has had the highest and fastest climbing infection counts. After a brief comparison of the statewide evolution of confirmed infections with that in other early states, we focus on analyzing patterns unifying New York counties beyond intrinsic variables such as time of first exposure, or population density.
In terms of policy, New York State's response to the pandemic came in a few stages, both in terms of testing and of mandated social measures. Consistent with the national trend, the testing response initially had trouble scaling up in New York State [1], then acquired some momentum, but only for a short period of time [2,3]. Citing the shortage of protective equipment, a Health Department Advisory capped testing again in New York City on March 20th: "Outpatient testing must not be encouraged, promoted or advertised[. . .] There is a national shortage of personal protective equipment (PPE), [. . .] and it is critical that laboratory testing be prioritized for hospitalized patients." [4,5,6].
While not government mandated, similar restrictions on testing were simultaneously implemented in other parts of the state as a wide-spread approach, by both health care providers [7,8] and mobile testing sites [9,10].
In terms of social dynamics restrictions, among the first to close in New York State were university campuses and schools [11,12], followed by restaurants, gyms and other entertainment venues [13,14]. During the month of March, gatherings of gradually smaller sizes were banned, all eventually leading to the PAUSE directive that took effect on March 22 and shut down all non-essential activity statewide [15,16]. When investigating the potential effect of social dynamics on the epidemic spread, it is important to remember that, while changes in testing flow may be reflected immediately in the number of reported infections, effects of social distancing have a significant lag, due to the incubation period of the virus, and the corresponding time delay between exposure and symptoms.

Modeling methods
The study of early epidemic growth has historically revealed different patterns, depending on the particular pathogen, and even on the particular outbreak. While other growth patterns have also been found (e.g., polynomial, subexponential), exponential growth has been a seemingly ubiquitous trend detected in data of early outbreaks of influenza, Ebola, foot-and-mouth disease, plague, measles and smallpox [17]. This behavior appears to be driven by the free spread of the pathogen in the first stages of the epidemic [18]. Exponential growth patterns appear to also be representative of the current COVID-19 pandemic, during its early development in the US overall. The first aim of our study is to verify whether this was the case for the total number of confirmed infections in New York, as reported daily for the duration of the first month of the outbreak (March 2020). The second aim is to illustrate that, prior to April 1, the outbreak size and timeline varied significantly between different states. To do this, we chose to briefly compare three of the states with the earliest reported infections of COVID-19, and with the highest infection counts to date: California (9,816 infections by April 1), Washington State (5,588 infections) and New York State (83,948 infections). Thirdly, we aim to focus more specifically on the outbreak dynamics within New York State, to understand both the specific and the unifying trends in the data from different counties, and to correlate these trends with state-mandated measures on testing and social distancing.

Data sources
The epidemic data for this study was accessed on April 1 from two sources in the public domain: the data-set provided by the Johns Hopkins Center for Systems Science and Engineering [19] and the one maintained by the COVID Tracking Project [20]. The COVID Tracking Project public repository (covering the period March 4 to April 1) was used for the statewide reports of confirmed infections and number of tests, as well as of COVID-19 associated hospitalizations and deaths. The reports provided through the Johns Hopkins Center for Systems Science and Engineering (covering the period March 1 to April 1) were separated by county, and were used for the county-wise analyses and comparisons. We performed a cross-validation of the two data sources on the number of statewide confirmed infections, which was reported by both. On the common portion (March 4 to April 1), the two were identical, except for minor details which are discussed in the Results section. Additional demographic data (census data and population density in each New York county) was obtained from the New York State Department of Health web page [21].

Data fitting
In order to more easily detect potential exponential growth $N(t) = N_0 e^{\alpha t}$ (where $t$ is time in days, and $N$ is the number of infected individuals), we consider the logarithm of the time series, $\ln N = \ln N_0 + \alpha t$. This allows us to test for linearity and piecewise linearity of the logarithmic time series by using a simple linear regression algorithm, maximizing the goodness of fit. Piecewise linearity is understood in the context of discrete time series as linearity on each of the $K$ consecutive pieces $[t_{j_k}, t_{j_{k+1}}]$ that form a partition of the whole time interval $[t_{j_0}, t_{j_K}]$. For each such piece we calculate the slope α; since the pieces vary in length, we compute both the sum of squares and the Pearson $\chi^2$ to illustrate the goodness of fit.
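The fitting procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the hand-supplied breakpoints (the search that maximizes goodness of fit is not reproduced), the synthetic series, and the particular form of the Pearson-type statistic (squared residuals scaled by the fitted log-count) are all assumptions.

```python
import math

def fit_line(t, y):
    """Ordinary least squares for y = intercept + slope * t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean

def piecewise_fit(t, counts, breaks):
    """Fit ln N = ln N0 + alpha * t on each closed piece [breaks[k], breaks[k+1]]."""
    log_n = [math.log(c) for c in counts]
    pieces = []
    for start, end in zip(breaks[:-1], breaks[1:]):
        idx = [i for i, ti in enumerate(t) if start <= ti <= end]
        tp = [t[i] for i in idx]
        yp = [log_n[i] for i in idx]
        alpha, ln_n0 = fit_line(tp, yp)
        fitted = [ln_n0 + alpha * ti for ti in tp]
        sse = sum((y - f) ** 2 for y, f in zip(yp, fitted))
        # Assumed Pearson-type statistic: squared residuals scaled by the
        # fitted value (requires positive fitted log-counts, i.e. counts > 1).
        chi2 = sum((y - f) ** 2 / f for y, f in zip(yp, fitted))
        pieces.append({"start": start, "end": end,
                       "alpha": alpha, "sse": sse, "chi2": chi2})
    return pieces

# Synthetic series (not real data): rate 0.40 up to day 10, then 0.15.
days = list(range(21))
counts = [10 * math.exp(0.40 * d) if d <= 10
          else 10 * math.exp(4.0) * math.exp(0.15 * (d - 10)) for d in days]
pieces = piecewise_fit(days, counts, breaks=[0, 10, 20])
for p in pieces:
    print(p["start"], p["end"], round(p["alpha"], 3))
```

On the noise-free synthetic series the two recovered slopes match the rates used to generate it; with real counts, the sum of squares and the scaled statistic would indicate how well each piece is described by a single exponential rate.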

Exponential growth and between-state comparison
What one would generally expect to see in the epidemic development is an initial period with high growth rate (potentially exponential). As the growth rate subsides, the time series should move to a more slowly growing curve, then finally reach a peak, and eventually transition to a decreasing trend (as the epidemic is dying out). Figs 1 and 2 illustrate a comparison between time series extracted from the Johns Hopkins database [19] for the states of California, Washington and New York. These are all states with early infection compared to other states (January 21 in Washington, January 25 in California, March 1 in New York). Fig 1 shows the infection timeline for each of the three states, both as raw time series of confirmed infections (in the top two panels), and as number of confirmed infections normalized by the population size in each state (i.e., reported per 10,000 individuals, in the two bottom panels). For each panel, the insets magnify detail in the first period of the timeline, where the initial small numbers would otherwise be indistinguishable in comparison with the subsequent climb in the graph. California and Washington State had their first confirmed infections four days apart, and had comparable evolutions in terms of the sheer number of infections (although their increasing patterns appeared different). When normalizing by the state population, the curve corresponding to Washington State was significantly higher than that for California (three times as high at the end of the time interval). We chose to represent the state of New York in a separate panel for clarity, because in both instances (of raw and normalized time series), the values of the New York State curves were one order of magnitude higher than for the other two states, by the end of the time interval. We continued by investigating the more subtle trends in rate that may underlie the noticeable differences in these three evolutions.
[Fig 2. https://doi.org/10.1371/journal.pone.0238560.g002 (PLOS ONE)]
A feature common to the logarithmic representation of the three states (Fig 2) is that all three are approximately piecewise linear (with piecewise slopes α and goodness of fit shown in Table 1). It also suggests that all three original time series were still growing exponentially in the time interval preceding April 1. It is clear, however, that there are qualitative differences in the dynamic patterns for each state, reflected in the length, succession and slope of the linear pieces. In what follows, days are counted from each state's first confirmed infection. According to this data, the number of confirmed infections in California can be faithfully described by the same exponential curve with slope α ≈ 0.18 since day 36 (February 29), with a slight sub-exponential tendency towards the very end of this interval. Starting with day 38 (February 27), Washington follows a logarithmic pattern of successively moving (approximately every week) to a new linear piece with slightly lower slope, thus flattening out to an exponential rate close to zero by the end of March. In contrast, the logarithmic timeline for New York State shows an alternation in slope, even after the initial transient spike in confirmed infections (March 1-4): the first almost-linear piece is steep (March 4-7), followed by a more relaxed increase over the following 10 days (March 7-16), only to launch into another steeper increase (March 16-23), and then another apparent reduction in rate (captured with significant goodness of linear fit over the 10 day segment prior to April 1).
From here, we will pursue two directions with our analysis, aimed at understanding whether transitions from one exponential rate to another in the case of New York State are intrinsic to the epidemic dynamics or are a reporting artifact. To this aim, we first test if these transitions correlate temporally with the timeline of statewide directives on social measures or with the testing/reporting schedule. In Section 3.2, we examine statewide data on confirmed infections, hospitalizations and deaths in conjunction with data on testing, for the period from March 4 to April 1, and look for potential triggering events. We need to recall that, while the effects of changes in testing can be observed immediately, the effects of social distancing measures on decreasing infections involve an observation lag that can be as long as a few weeks.
New York State is relatively large geographically, and presents a wide range of social profiles (urban, suburban, rural). The transitions between rates may be driven by specific social behavior in counties with dense population (which have more statistical power), or may be similar across counties with different social profiles (in which case they are more likely to be triggered by statewide mandated measures). In Section 3.3 we analyze the piecewise linear succession in time series for confirmed infections for all New York counties which had reached over 100 infections by April 1, and we look for unifying trends. This will establish whether the slope alternation pattern which appears to be the signature of the evolution of confirmed infections in New York State is an effect of averaging geographically over many types of different evolutions, or if it emerges from a consistent similar pattern across county borders.

Statewide trends
We illustrated and compared the time series for confirmed infections, for hospitalizations and for COVID-19 related deaths versus the statewide number of individuals tested for COVID-19. Since this data was accessed from the COVID Tracking Project repository [20], a different source from that used in Section 1, we first cross-validated the two sources for the one variable they have in common: the statewide number of confirmed infections. While the first panel (raw counts) reveals that all variables increased dramatically, and provides a better illustration of their actual size, the logarithmic panel better conveys the fact that all variables had stabilized along an almost exponential curve of constant rate for over a week prior to April 1 (the exponential fit is quantified in Table 2). The confirmed infections and the death count show other transient rates before this steadier trend. The hospitalizations count is considered a more robust measure of epidemiological prevalence (since it relies on symptoms, rather than on general testing availability). However, data was only provided in the reference for the last 12 days of March, making it difficult to tell if there were any other consistent trends prior to March 20. In addition, symptomatic COVID-19 infections reportedly exceeded hospital capacities in New York State, so this measure is also subject to potential artifacts [22,23].
The time series for total COVID-19 tests performed in New York State shows a notable jump between March 12 and March 13. This spike stands out in the logarithmic form in Fig 4b, and appears even more prominent when representing the fraction of the tests that returned a positive diagnosis (shown in Fig 3c), which plummeted from 70% on March 12 to less than 15% on March 13. This is not surprising, since higher availability of tests can be responsible for a sudden increase in negative results (due to more liberally providing tests to susceptible individuals who turned out to not be infected). The idea will be revisited as a discussion point in Section 4.

County-wise behavior
To study infection dynamics at the county level, we used New York State data from the Johns Hopkins archive [19], which provided county-wise reports for the period from March 1 (when the first case was reported in the State of New York) to April 1 (the date this study was initiated). County-wise demographic information was obtained from the Department of Health web page [21].
When considering the number of confirmed infections for each county on April 1, one simple observation is that they correlated not only with the county total population (correlation coefficient R = 0.9, significance value p < 0.0001), but also with the population density (correlation coefficient R = 0.4195, significance value p = 0.0007). As a start, this suggests a deeper dependence of infection rates on the type of social dynamics associated with the population density distribution in each county. In our further analysis, we considered only the counties exceeding 100 confirmed infections by April 1 (providing enough data for an adequate assessment).
[Table 2. Fig 4: https://doi.org/10.1371/journal.pone.0238560.g004]
In [. . .] County (3,321 infections). On the other hand, Saratoga and Ulster Counties reported 122 and 222 infections respectively, by April 1. We chose to first illustrate and analyze separately these counties with confirmed early infection (before March 10th). These are networks of communities in which the infection had already propagated for a long enough time to provide more substantial data that may allow us to understand the mechanics of this propagation. We will illustrate separately the five counties which had over 3,000 confirmed infections by April 1, and the two counties which, despite an early start, were an order of magnitude lower in the number of confirmed infections by the same date.
In terms of raw infection counts, the New York City counties taken together transcended all other counties by a factor of at least five, and Rockland County showed the lowest numbers (Fig 5a and 5b). However, in a normalized representation in which the number of infections is reported per 1,000 individuals, New York City comes fourth, following in order Westchester, Rockland and Nassau (Fig 5c). It is also relatively easy to see that confirmed infections at the end of the time window (i.e., on April 1st) do not only show different counts for different counties, but also appear to be increasing at different rates: while Westchester is leading in terms of the proportion of individuals with confirmed diagnoses, Rockland is increasing at a higher rate. In order to better represent the evolution of the rate of change in the context of identifying exponential trends, we again considered the time series in logarithmic form.
In the logarithmic time series, we assessed piecewise linear behavior, starting with the day when the first case was reported until April 1. The slopes α (corresponding to the exponential growth rates) and the goodness of fit statistics are shown in Table 3.
Notice that all these counties show, up to negligible fluctuations, a piecewise linear pattern similar to that of the statewide logarithmic time series (see Fig 6). In each case, we were able to identify three pieces: a milder increasing segment ending between March 16-18, followed by a steeper segment ending on March 22 (March 26 in the case of Westchester), followed again by a linear segment with more relaxed slope, and significant goodness of fit.
While the confirmed infection counts were an order of magnitude lower for Saratoga and Ulster Counties, the same three-piece pattern was apparent in their time series as well (Fig 7), suggesting that the piecewise linear effect with alternating steepness which we observed in the statewide dynamics was not an averaging effect, driven by the counties with high-density urban and suburban areas, but is rather present in every one of the counties with early infections.
We want to further investigate whether the transitions between linear pieces depend on the original confirmed infection in the corresponding county as temporal reference, or if they are [. . .] Table 4. The unifying feature for these counties is the presence of two linear segments (the first with steeper slope, followed by one with flatter slope) with the transition occurring over the same short time window (March 22-26) as the similar transition in the counties with early starts. This confirms that the evolution in this later set of counties is not restarting and replicating the evolution of the earlier counties, but rather is correlating with their concurrent evolution (suggesting that the reason behind these trends is a statewide factor rather than an intrinsic epidemic development, as further discussed in the next section).

Discussion
In this study, we focused on the dynamics of the COVID-19 epidemic in the state of New York for the first month of the outbreak (March 1st to April 1st), and we analyzed statewide data on confirmed infections, testing, hospitalizations and deaths, as well as county-wise data on infection rates, for the counties in the state that were reporting over 100 infections by April 1. Our primary goal was to determine whether it is possible to dissociate between intrinsic trends in the epidemic dynamics, the effects of social distancing measures, and the effects of underreporting produced by the schedule and limitations in testing. We first identified a signature in the evolution of the number of confirmed infections specific to the state of New York, characterized by piecewise behavior with four distinct pieces, switching between exponential curves with higher versus lower alternating rates, and eventually settling to a relatively low exponential rate. We aimed to understand the factors behind the transitions, with a primary focus on answering whether the trend shown in the last piece of the data (preceding the start of April) corresponds to a damping in infections, or is a reporting artifact.
Table 3. Rate of exponential growth calculated as the slopes of the piecewise linear fit to the logarithmic plot. The sum of squares and Pearson goodness of fit statistic are also provided in each case. (One outlier, March 24, was left out when computing the slopes for Rockland County.) Columns: County, Interval [. . .]
Analyzing data on statewide testing, we found that, while the number of tests steadily increased, the ratio of positive to total tests also increased before March 12, to the point where over 70% of tests were positive. This gradual saturation could be indicative of two potential factors: the infection rate overtook the rate with which new tests were becoming available; or the limited testing was particularly targeted to individuals with highest probability of being infected (well-defined COVID-19 symptoms), dangerously leaving out other infected, but less symptomatic individuals.
We detected a noticeable jump in the number of tests on March 13th (which could be attributed to the state government directive on March 12 to ramp up COVID-19 testing), and a subsequent drop in the positive test percentage from 70% to under 15%. The first rate increase found in the confirmed-infection time series occurred shortly after this (March 16). A likely interpretation is that the increasingly wider availability of tests led to an increased detection rate, and subsequently to more accurate reporting. For specific counties, the timing of this transition in the slope ranged from March 12 to March 18, but can still be viewed as a potential effect of the increase in testing, as the factor synchronizing the piecewise behavior across county borders. This is supported by the fact that the counties with later infection onset do not show this transition, and start off with a segment of higher exponential rate directly. We therefore suggest that the higher confirmed infection rate during the period March 16-March 23 reflects the actual epidemic spread more faithfully than the rates along the periods preceding and following it (when tests were administered more conservatively). After March 13, the positive test percentage promptly started increasing again, having climbed back to almost 40% by April 1. The subsequent climb of this ratio could be explained by the rate of increase in actual infections outpacing the rate of increase in testing, thus slowly saturating the testing capacity again.
We also detected a statewide transition in the confirmed infections on March 23, from a higher exponential rate α ≈ 0.46 to a lower rate α ≈ 0.15. The subsequent analysis revealed that all counties showed a similar transition at dates spanning from March 19 to March 23, from a higher exponential rate (with mean μ_α = 0.51 and standard deviation σ_α = 0.18 between the 15 counties examined) to a milder rate (with mean μ_α = 0.16 and standard deviation σ_α = 0.04). As mentioned previously, there are two potential (and not mutually exclusive) factors that could be held responsible for this behavior: testing restrictions and social distancing. Below, we discuss both possibilities. One possibility is that the lower infection rate may be a reflection of the stricter testing cap, re-introduced throughout the state of New York around March 20, to address PPE and other resource shortages. The number of actual infections was increasing statewide at an exponential rate close to α ≈ 0.5 prior to March 23, to the point of overtaking the testing capacity (which was following a more relaxed exponential curve). In order to maintain realistic reporting, testing would have had to also increase at a comparable pace. The March 20 recalibration of testing priorities likely resulted in restricting testing of both positive (but potentially asymptomatic) and negative COVID-19 individuals to the same extent (people being instructed to refrain from testing unless in very serious condition). While there was no visible drop in test numbers, or additional imbalance in positive test percentage, the testing restriction may have effectively capped the reported infections. This would explain why the rate of the new infections (α = 0.15) settled to a rate comparable to the overall rate of testing (α = 0.115), likely reflecting more the slower rate of the testing cap rather than the potentially much faster rate of infections.
If the resulting lower exponential rate over the period from March 23 to April 1 is indeed only an artifact of the testing cap, we can estimate a more realistic number of infections on April 1, by assuming that the higher infection rate α = 0.46 shown prior to March 23 remained in effect until April 1. The estimate of the total number of infections in New York State by April 1 would then be 1,302,800. This should be compared to the reported 83,712, and the projected 80,524 (when using the α = 0.15 lower rate to extend). This is in line with reports from a preliminary serological study that estimates actual infections many times higher than the confirmed infection counts, indicating an 11-18% prevalence of COVID-19 infections in New York State residents (with wide geographic variation) [24,25]. However, the accuracy of serological findings is also problematic, with sampling biases, false positives and test performance issues potentially affecting the results.
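The arithmetic behind this extrapolation can be sketched as follows. The confirmed count on March 23 used as the starting point is not stated in the text, so this sketch backs it out from the quoted lower-rate projection (80,524 at α = 0.15 over the nine days from March 23 to April 1); that step, and all names in the code, are assumptions. Under these assumptions the higher-rate projection comes out at roughly 1.3 million, close to but not identical with the 1,302,800 quoted above, presumably because of rounding in the rates.

```python
import math

def extrapolate(n_start, alpha, days):
    """Project N(t) = N_start * exp(alpha * days)."""
    return n_start * math.exp(alpha * days)

ALPHA_LOW = 0.15         # rate after March 23, as quoted in the text
ALPHA_HIGH = 0.46        # rate before March 23, as quoted in the text
DAYS = 9                 # March 23 to April 1
PROJECTED_LOW = 80_524   # lower-rate projection for April 1, as quoted

# Assumption: recover the implied March 23 count by inverting the
# lower-rate projection; the study's actual starting value is not given.
n_march23 = PROJECTED_LOW / math.exp(ALPHA_LOW * DAYS)

# Continue from the same starting point at the higher rate instead.
n_april1_high = extrapolate(n_march23, ALPHA_HIGH, DAYS)

print(round(n_march23), round(n_april1_high))
```

The ratio between the two projections depends only on the rate difference and the horizon, exp((0.46 - 0.15) × 9), which is about a factor of 16.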
A second possibility worth investigating is that the rate flattening in all counties between March 19-23 represents an effect of social distancing. The timing of this change is too premature to reflect any effects of the PAUSE directive, or even of the state-mandated initial closures initiated between March 12 and 16 (which would need around two weeks to manifest significantly). But this does not exclude the possibility of people having initiated social distancing before it was required at state level. One way to assess whether this aspect had any contribution to the infection dynamics is to search for decreasing patterns in social mobility, and verify if the drop in infection rate correlates, with an appropriate time lag, with a drop in mobility.
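One way such a lagged comparison could be set up is sketched below. It is only an illustration of the idea, not the analysis performed in this or the follow-up study: the series are synthetic (a mobility index that drops between days 10 and 25, and a growth rate constructed to follow it with a 7-day delay), and the function names and the choice of a plain Pearson correlation are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(mobility, growth, max_lag):
    """Correlate mobility on day d with growth on day d + lag, lag = 0..max_lag."""
    corr = {}
    for lag in range(max_lag + 1):
        x = mobility[:len(mobility) - lag]
        y = growth[lag:]
        corr[lag] = pearson_r(x, y)
    return corr

# Synthetic data (not real): mobility index falls from 100 to 40 over days 10-25.
days = range(40)
mobility = [100.0 if d < 10 else max(40.0, 100.0 - 4.0 * (d - 10)) for d in days]
# Synthetic growth rate that tracks mobility with a 7-day delay.
growth = [0.10 + 0.004 * mobility[max(d - 7, 0)] for d in days]

corr = lagged_correlation(mobility, growth, max_lag=14)
best_lag = max(corr, key=corr.get)
print(best_lag, round(corr[best_lag], 3))
```

With real data, the lag at which the correlation peaks would then be compared against the delay expected from the incubation period and the time from exposure to a confirmed test.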
Social mobility trends can now be estimated based on data made available by both Google and Apple, describing traffic patterns (driving, walking and public transport), as well as number of direction requests to different types of destinations (Retail/Recreation, Grocery/Pharmacy, Parks, Public Transit, Workspace), and time spent at one's Residence. A broad assessment of this data suggests that, across New York State counties, mobility patterns did not start showing consistent decreasing patterns with respect to their respective baselines until the second week in March, with an inflection point around March 15, and a minimum consistently reached around the date of the PAUSE directive. Hence early social distancing does not seem to offer a compelling basis for the infection rate decline observed in the clinical data. In a subsequent study, we focus on addressing further potential correlations between mobility measures and epidemic measures, in longitudinal data spanning a timeline from March 1 until June 21 [26]. The study significantly correlates peak infections and epidemic outcome with the timing and degree of reduction in social mobility, and suggests that longer lags of up to 40 days may be more meaningful when assessing the full impact of social distancing on epidemic size and dynamics.

Limitations and future work
It is important to discuss some constraints imposed on our study by the length, quality and accessibility of data in the public domain. An inherent problem comes from the length of the time series for confirmed infections (only spanning one month, with infection in many of the counties having been ongoing in fact for much shorter than that). Aside from the analytical concern associated with detecting trends in short time series, the time period was insufficient to permit assessment of the effects of the PAUSE directive, or indeed of any social distancing. A second limiting aspect is the fact that some of the measures of interest (such as hospitalizations and deaths) were not included promptly or accurately in the preliminary reports. This is partly due to the priority assignment during the early stage of the outbreak, and partly due to the fact that assessment of certain measures may require time and hindsight. For example, data on hospitalizations only started being included in the public domain repository on March 20, and was not separated by county in the original reports, nor was the statewide number of available tests. The original identification criteria of COVID-related deaths had to be later redefined to better incorporate progressing knowledge on the clinical aspects. The death count, expected by many to be a more accurate epidemic measure than confirmed infections, had to be subsequently updated by an additional 3,000 individuals.
On one hand, these limitations are significant; on the other hand, however, they go together with the very necessity of data analyses in the early stages of the epidemic, so that it can be used to understand its dynamic signature, generate testable predictions, and apply the appropriate measures before the outbreak gets out of control and transcends the health care capacity. While longer, more complete and informed data streams would likely improve predictions, waiting for more complete data may concomitantly have a detrimental impact on the timeliness of the response [27].
Along these lines, social mobility time series that were not available during the first weeks of COVID-19 development in New York State have since been made easily accessible in the public domain by Google and Apple. This prompted a follow-up study, in which we revisited the New York State data, with an improved perspective based on (1) a longer timeline (covering the period starting with the March 1 initial confirmed infection and extending until the last week of June); (2) access to social mobility data, as a quantifiable measure of the efficiency of lockdown and social distancing measures in each New York county [26]. That study detects county-wise correlations between the number of confirmed infections, COVID-related hospitalizations and deaths, and effectiveness of social measures along the closing down, PAUSE, and reopening of the state. While the initial slope flattening in the current study appears to be primarily an effect of testing limitations, our follow-up study suggests that social distancing measures can be identified further down the line as a primary player in controlling the outbreak in New York State.