Modelling the first wave of COVID-19 in India

Estimating the burden of COVID-19 in India is difficult because the extent to which cases and deaths have been undercounted is hard to assess. Here, we use a 9-component, age-stratified, contact-structured epidemiological compartmental model, which we call the INDSCI-SIM model, to analyse the first wave of COVID-19 spread in India. We use INDSCI-SIM, together with Bayesian methods, to obtain optimal fits to daily reported cases and deaths across the span of the first wave of the Indian pandemic, over the period Jan 30, 2020 to Feb 15, 2021. We account for lock-downs and other non-pharmaceutical interventions (NPIs), an overall increase in testing as a function of time, the under-counting of cases and deaths, and a range of age-specific infection-fatality ratios. We first use our model to describe data from all individual districts of the state of Karnataka, benchmarking our calculations using data from serological surveys. We then extend this approach to aggregated data for Karnataka state. We model the progress of the pandemic across the cities of Delhi, Mumbai, Pune, Bengaluru and Chennai, and then for India as a whole. We estimate that deaths were undercounted by a factor between 2 and 5 across the span of the first wave, converging on 2.2 as a representative multiplier that accounts for the urban-rural gradient. We also estimate an overall under-counting of cases by a factor of between 20 and 25 towards the end of the first wave. Our estimates of the infection fatality ratio (IFR) are in the range 0.05—0.15, broadly consistent with previous estimates but substantially lower than values that have been estimated for other LMIC countries. We find that approximately 35% of India had been infected overall by the end of the first wave, results broadly consistent with those from serosurveys. These results contribute to the understanding of the long-term trajectory of COVID-19 in India.

in the main text and Section 1. 8  1 Computation of reproduction number R 0 Assuming uniform contacts across all age groups using the infectivity parameter β and efficiency parameters ϵ i , the F matrix [2,3] can be written as: Here a, p, m, s correspond to asymptomatic, pre-symptomatic, mild and severe compartments respectively. And the V matrix is obtained as, Here we have used the fractions and rates as defined in the main article.
The next generation matrix: The dominant eigenvalue of the next generation matrix is R 0 .
Since INDSCI-SIM model is age stratified and uses contact matrices, the F and V matrices are generalized according to the model. Given N age groups and M equations determining the Exposed and infected compartments we will have F and V matrices of order N M × N M. In our model we have 9 age compartments and 5 equations for exposed and 4 infected compartments. Therefore the F and V are of order 45 × 45.
The modified F matrix can be written as: Here [0] represents N × N matrix with all entries as 0. [B i (t)] represents the time dependent effective infectivity matrix (order N × N) for different infected compartments and is defined as follows: where the m, n'th element of the matix [C] are defined with the contact matrix C mn and the population fraction f m as C mn f m / f n .
Here τ i parameter represents the characteristic time-scale describing the increased effectiveness of non-pharmaceutical interventions, as introduced in the main paper. Note that β in the unifrom contact case andβ in the differential contacts are different. In the uniform contact β represents the infectivity averaged over all contacts while in this caseβ represents the infectivity per contact. Note that the contact matrix changes during lockdown and therefore changes the effective infectivity matrix.
Generalized V matrix is now denoted as, In the above N M × N M matrix each entry in square brackets represents a N × N matrix. In our model M = 5. For models with more infected compartments, M will increase for both F and V matrices. Γ, Λ i 's represent diagonal matrices with the transition rates as the diagonal terms. In our case, we have assumed the rates of transfer between compartments are same for all age groups. In a more general model, where the rates are different, the diagonal entries will change. The asymptomatic fractions and mild fractions are also represented by diagonal matrices, A and M respectively. Note that our fractions are age dependent and therefore, the diagonal terms contain the fractions for 9 age groups. With the F and V matrices prepared, we follow the same procedure as in the uniform contact case and the dominated eigenvalue of F V −1 is estimated as R 0 .
At any point in time, while the basic reproduction number R 0 can be obtained using the relations above, the effective reproduction number R(t) takes into account the susceptible population remaining at time t w.r.t. the initial susceptible population.
where the susceptible population at t = 0 and t = t are given by S (t = 0) and S (t = 0) respectively.

Contact matrices between stratified age-groups
Mixing between different age groups is determined by age-specific contact matrices. To compute these, we use contact matrices provided by [4], estimated for the Indian population, for 16 age groups ranging between 0-80 years, tabulated at 5 year intervals.
We reduce the matrix to further coarse-grained age brackets as follows: Let the number of contacts of the i-th individual in the age category j with any individual belonging to the age category k be C k j (i). The mean number of contacts for age category j with age category k is thusC k j = where N j is the population size in age category j. Suppose now that k is subdivided into 2 finer categories k 1 and k 2 , and we are given the entriesC k 1 k 1 ,C k 1 k 2 ,C k 2 k 1 andC k 2 k 2 , as well as,C jk 1 ,C jk 2 ,C k 1 j andC k 2 j . We then have with the off diagonal terms defined as C jk = [N k 1C jk 1 + N k 2C jk 2 ]/(N k 1 + N k 2 ), and C k j =C k 1 j +C k 2 j . We use this reduced matrix [C] in our analysis. We plot the contact matrices in Fig A. 3 Brief discussion on Nested Sampling Since the posterior probabilities are expected to be multi-modal Markov Chain Monte Carlo can not be used. Therefore we use nested sampling with PolyChord that is effective for higher dimensional parameter spaces [5,6]. This algorithm uses live points that are updated in each iteration shrinking the n-dimensional parameter spaces. We use 500-1000 live points according to the dimension of the parameter space for the model. Polychord starts with the live points and the points are sequentially updated in each iteration. Point with the lowest live point is discarded (termed as dead point) and replaced by a new point with a likelihood higher than that of the dead point. It can be shown [5] that the prior volume shrinks exponentially with each iteration. The evidence or the marginal likelihood is computed from the integral of likelihood and prior over the prior volume.

Flowchart of our analysis
In Fig B we plot the logical flow of the analysis. Centrally, we use 2 codes. For simulation, we have developed ELiXSIR -Extended, zone Linked IX-compartmental SIR model: a code to simulate COVID19 infection [7]. For sampling we use PolyChord [5], a widely used code for nested sampling. Both codes are available publicly. Given a set of parameters we set up the system for ELiXSIR. In this set-up we provide the population, number of age groups and fraction of population in these age groups. We fix the fraction and rates of transition between compartments that are runtime constants. Lockdown dates are mentioned based on which the code switches from lockdown to unlock modes switching contact matrices. We use the coarse grained contact matrices for each region using the population fractions within the age groups of base contact matrix and the coarse grained age groups [4]. After the initial set up, the priors, initial starting value of all parameter and widths are provided. Daily data of reported infection and deaths are supplied to the code with the dates.
The CosmoChord integrated with ELiXSIR is then run in several processors (MPI) in the cluster. The samples within the parameter volume are drawn and sent to ELiXSIR for the time-series. Daily infection and deaths are computed from the compartments. Bias evolution is generated according to the parametric form and the values of b, ∆ b from the samples. After scaling the theoretically obtained daily infection numbers by the bias, the prediction of daily infected is then compared with the reported daily infection. Prediction of daily deaths however is directly compared with the daily deaths data. If the death undercounting factor is included in the analysis then the daily deaths prediction is compared with the death data scaled by the undercounting factor. During the runtime, live points are generated and updated sequentially that reduces the prior volume, as discussed in the last section.
After termination of the PolyChord sampling, we use getdist [8] to generate posterior distributions. The samples are then supplied to ELiXSIR directly to generate the bounds on the timeseries.

Posterior distributions and correlations between parameters
Using getdist we compute the posterior distributions of the parameters for different regions and they are presented in triangle plots. We also tabulate the 2σ bounds on the parameters in the following tables.

Plots for cumulative infection and deaths
The timeseries of daily infection and deaths generate cumulative infection and deaths. In the following plots we plot the bounds on cumulative infection and deaths. Note that since we fit the daily infection and death reports, the error parameters correspond to those values and not the cumulative values.

Karnataka
Estimations for cumulative infections and deaths are plotted with data in Fig C ( Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.   have plotted the constraints obtained assuming and without assuming death multiplier for undercounting. We use death multipliers of 1, 2.2 and 5 to consider the following scenarios -no death undercounting, average and high undercounting obtained from districts. Please refer to Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.

Mumbai
Estimations for cumulative infections and deaths are plotted with data in Fig G. Marginalized posteriors of the parameters are shown as triangular plot in Fig H while Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.   Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.  Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.

Infection curves for select Indian cities
In the main text, we presented the analysis of Mumbai alone. Here, following the same protocol, we describe results for Bengaluru Urban, Chennai, Pune, and Delhi.

Bengaluru Urban
Our analysis for Bengaluru Urban is presented in Fig K; posterior distributions and parameter constraints are provided in SI: Section 3, Fig 2 and Table 1). For the calculations, the population data for Bangalore is sourced from the projections for the period 2020-2021, as provided by the report issued by Directorate of Economics and Statistics, Bangalore [9]. Our approach yields a double-peak structure in the curve of daily infections, reflecting a temporary flattening of the curve around September 2020 followed by an increase that resulted in a peak around October. Between mid-September and mid-October, we observe a plateau in deaths. Our results can be compared to those from serosurveys, as quoted in Refs. [10,11,12]. In September of 2020, these calculations yield a seroprevalence in Bengaluru Urban of about 30%, consistent with serosurvey results.
We note that the model indicates that a substantial fraction of the population in Bengaluru had already been infected by Feb 15. The effective reproductive ratio R(t) decreases abruptly during the lockdown, while displaying an equally sharp increase post lockdown. Our estimate for those infected in Bengaluru city by Feb 15, 2021 lies between 64% and 85% at the 95% confidence level. The bias multiplier narrows quite rapidly, being approximately 10 by mid-February 2021; at the time of the Bengaluru serosurvey, this factor was approximately 30. Our estimate for the IFR at the end of the first wave indicates a value in the vicinity of 0.05%, roughly consistent with similar estimates for the first wave based on serosurveys. An increase in the IFR, say to 0.08 or 0.1, can also be achieved if we assume that deaths have been under-counted by a suitable factor, while leaving our estimates for total infections the same.
Estimations for cumulative infections and deaths are plotted with data in Fig L. Marginalized posteriors of the parameters are shown as triangular plot in Fig M while Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.

Chennai
Our results for Chennai are plotted in Fig N, with posterior distributions and parameter constraints provided in the SI: Section 3, Fig 3 and Table 2). For these calculations, we source population data for Chennai from Wikipedia, which takes into account the expansion of the district limits in 2018 [13]. Apart from the national lockdown of 68 days, we note that Chennai had imposed a shorter, local lockdown during the period June 19 -July 5, 2020. Both lockdown periods are marked in grey. While across the first lockdown the numbers of infected increased steadily, the numbers began to show a decline following the second lockdown. We see minor peaks around mid-August and October of a lesser height when compared to the first peak. This decrease is also corroborated by the estimates for the R(t). Serosurvey results for Chennai have been reported in Ref. [14].
There are clearly issues with data here since around July 2020, the peak in deaths appears to precede the infection peak, although the data is quite noisy. After that, a steady decrease in the death numbers is seen. While between mid-July to end-October the infection numbers change only marginally, the numbers of deaths reduce largely monotonically. This suggests an effective decrease in IFR, perhaps related to improvements in patient management. The bias multiplier also show a decrease. However the rate of the decline is much slower compared to Bengaluru Urban. The plot indicates a 30-fold undercounting of cases initially. This is reduced to around 15 by the middle of February 2020. A low test positivity since the onset of infection is supported by this plot and attributed to a large scale testing program. The actual infection plot suggests that nearly 54-85% (at 95% C.L.) of the Chennai population has been infected by mid-February.
We can compare these predictions to data from Ref. [15] for the state of Tamil Nadu as a whole in December 2020, which found that seroprevalence in urban areas (36.9%) was higher than in rural areas (26.9%). They found that 22.6 million persons were infected by the end of November, roughly 36 times the number of confirmed cases. For Chennai, the estimated seroprevalance by December was in the vicinity of 40%, within the 95% bound of our own results.
Estimations for cumulative infections and deaths are plotted with data in Fig O. Marginalized posteriors of the parameters are shown as triangular plot in Fig P while Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.   Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.

Delhi
Fig Q present our analysis with the Delhi data. The data for Delhi presents a challenge to modelling because of the presence of multiple peaks. The data requires 5 adaptive windows to fit. Within these 5 windows, the constrained infectivity parameter β show a oscillatory trend. This oscillatory pattern may possibly reflect patterns in the return of migrant workers to the national capital region. A number of different serosurvey results have been reported for Delhi, including in Ref. [16]. For our calculations, we use Delhi's population data from the projections for the year 2020 reported in the 2019 report published by the National Commission on Population [17].
With these windows, our model captures the trends in the data. Posterior distributions and parameter constraints are provided in the SI: Section 3, Fig 4 and Table 3. We find multi-peak posteriors in the Node parameters reflecting several possible solutions of the Delhi trajectory. The reported deaths here show outliers around June 2020 that occur before infection peaks towards the middle and end of June. Although the IFR decreases we find that it is still consistent with 0.1% IFR around February 2021. The bias multiplier settles down around 20, indicating that a point estimate of about 80% (71-90% at 95% C.L.) infected population in Delhi by January 2021, with a confidence of interval of ±10%. Our limits at the 3σ confidence limit though indicate a range of 50-90% infected, given the skewed posterior distribution of bias. The estimated R(t) as expected show an oscillatory behavior, consistent with the independent estimates. We observe that R(t) fluctuates in a range R(t) ∼ 1 − 1.5 before it reduces below 1 around December 2020. Results from the fifth serosurvey in Delhi concluded that about 56% of the over 28,000 people whose blood samples were collected in January 2021 had developed antibodies against COVID-19. The first such survey done in the city in June-July had shown that 23.4% of people surveyed had developed antibodies against the virus. Similar surveys in August showed that 29.1% of people had antibodies to SARS-CoV-2 at that time. This became 25.1% in September and 25.5% in October. These results are approximately consistent with the results we present here, except for the January serosurvey, which lies outside our 95% confidence interval. However, due to the decay of antibodies, seroprevalence as derived from CLIA tests are expected to underestimate numbers of infected. We expect that accounting for this decay would yield closer agreement. Initial estimates of Delhi's IFR in Ref. [18] yield numbers consistent with a range (0.05-0.1%), consistent with our calculation here. However, other calculations yield a cumulative proportion of the population estimated infected of 48.7% (95% CrI 22.1% -76.8%) by end-September 2020. These are noticeably larger than our own estimates of about 30% at that time. (The IFR assumed in that work was considerably larger, though, between 0.22% and 0.39%; this required that a relatively small number of deaths were actually recorded, about 28% of the actual number if the IFR was 0.21%.).
Estimations for cumulative infections and deaths are plotted with data in Fig Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.

Pune
In this section we present our final city analysis, that of Pune in Fig T. The population of the district is taken as per projection for 2021 [19]. The posterior distributions and parameter constraints for this calculation are provided in SI: Section 3, Fig 6 and Table 5. As was the case with Chennai, Pune had imposed a local lockdown during 14-24 July 2020. There is substantial scatter in the reported death data, compared to the data for numbers of infected. This may arise from incomplete reporting initially. The infection data show a single prominent peak structure with a possible kink around mid-July 2020. The global peak comes around mid-September 2020, with infected numbers decreasing after that and becoming nearly constant around November. The death reports follow the infection in standard manner. However we do not find a substantial decrease in reported deaths vis a vis. infections. This is also reflected in the IFR plot where we see relatively minor changes from the initial IFR. The bias multiplier changes from 35 to 15 indicating an improvement in testing. We also find a large infected fraction, of 64-90% of the population in Pune. Our estimated R(t) stays around 1.5 till September 2020 after which it comes down sharply to nearly 0.5 in two months. In December we see a rise in R(t) which remains nearly constant till the end of the first wave around mid-February. R(t) from December 2020 to February 2021 stays at nearly 1 which can also be observed in the flatness in reported cases and deaths.
A serosurvey in Pune, conducted between 20th July and 5th August 2020, found a seroprevalance of 51·3% (95%CI 39·9 to 62·4) [20] . The overall IFR was calculated to be 0.21. The inferred seroprevalence lies well above our own median prediction of about 35% by August 2020, although our IFR estimates at that time are closer to the value inferred from the serosurveys, at about 0.17. We note, however, that the substantial spread in the inferred IFRs across different localities even within the city make comparisons more difficult. Compartmental models are insensitive to such ultra-local variations; other, more refined individual-based models will be required to assess these effects.
Estimations for cumulative infections and deaths are plotted with data in Fig U. Marginalized posteriors of the parameters are shown as triangular plot in Fig V while Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.   Table 3 in the main text and Section 1.8 and 1.9 in the main text for detailed descriptions of the symbols.