Estimating the infection burden of COVID-19 in Malaysia

Malaysia has reported 2.75 million cases and 31,485 deaths as of 30 December 2021. Underestimation remains an issue due to the underdiagnosis of mild and asymptomatic cases. We aimed to estimate the burden of COVID-19 cases in Malaysia based on an adjusted case fatality rate (aCFR). Data on reported cases and mortalities were collated from the Ministry of Health official GitHub between 1 March 2020 and 30 December 2021. We estimated the total and age-stratified monthly incidence rates, mortality rates, and aCFR. Estimated new infections were inferred from the age-stratified aCFR. The total estimated number of infections between 1 March 2020 and 30 December 2021 was 9,955,000 cases (95% CI: 6,626,000–18,985,000). The proportions of COVID-19 infections in ages 0–11, 12–17, 18–50, 51–65, and above 65 years were 19.9% (n = 1,982,000), 2.4% (n = 236,000), 66.1% (n = 6,577,000), 9.1% (n = 901,000), and 2.6% (n = 256,000), respectively. Approximately 32.8% of the total population in Malaysia was estimated to have been infected with COVID-19 by the end of December 2021. These estimations highlight a more accurate infection burden in Malaysia. They provide the first national-level prevalence estimates in Malaysia that are adjusted for underdiagnosis. Naturally acquired community immunity has increased, but approximately 68.1% of the population remains susceptible. Population estimates of the infection burden are critical to determine the need for booster doses and calibration of public health measures.


Introduction
The global transmission of COVID-19 is unprecedented and has led to more than 282 million cases and 5.4 million deaths as of 31 December 2021 [1]. Efforts to contain its transmission have focused primarily on public health and social measures (PHSM) that have come at significant economic and social costs [2].
Malaysia has reported 2.75 million cases and 31,485 deaths as of 31 December 2021 [3]. The first large outbreak in Malaysia was managed successfully using movement restrictions between March and April 2020 [4]. However, since September 2020, institutional outbreaks, state elections, and inconsistent implementation of PHSM have led to large periodic outbreaks [5].
Underestimation remains an issue despite the substantial reported burden of disease. Screening strategies and diagnostic test accuracy are two factors that drive this underestimation [6]. Reported cases are biased estimators of true disease burden. The true burden of disease may be estimated using seroprevalence surveys and random sampling [6,7]. Alternative indicators such as hospitalization and emergency room data do not estimate the overall infection rate [7]. A more accurate estimator of the actual COVID-19 infection burden may be COVID-19 mortalities, especially in countries with low excess mortalities [7][8][9].
Accurately estimating the epidemic size is critical in forming situational awareness in designing and evaluating future public health and social measures. Misclassified estimates of total COVID-19 cases may hamper forecasting and future disease control planning [9,10].
To the best of our knowledge, no studies have yet estimated the true burden of disease at a population level in Malaysia. We aimed to estimate the burden of COVID-19 infections in Malaysia between 1 March 2020 and 31 December 2021.

Statistical analysis
Data were explored for missingness using descriptive statistics, visualizations, and univariate and multivariate logistic regression models. A multiple imputation model using expectation-maximization with bootstrapping was utilized to impute age. Age was then categorized as 0-11, 12-17, 18-50, 51-65, and >65 years old.
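The age categorization can be sketched as a simple lookup. This is a minimal illustration in Python, not the authors' R code; the function name age_band and the assumption that ages are recorded in whole years are ours.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first four age bands named in the text.
UPPER_BOUNDS = [11, 17, 50, 65]
LABELS = ["0-11", "12-17", "18-50", "51-65", ">65"]


def age_band(age_years: int) -> str:
    """Map an age in whole years (assumption) to one of the five bands."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first bound >= age_years,
    # which is also the index of the matching label.
    return LABELS[bisect_left(UPPER_BOUNDS, age_years)]


if __name__ == "__main__":
    for age in (0, 11, 12, 50, 65, 66):
        print(age, age_band(age))
```

Using inclusive upper bounds keeps the boundary ages (11, 17, 50, 65) in the lower band, which matches the band labels given above.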
Cases, deaths, brought-in-dead, incidence rates, mortality rates, and adjusted case fatality rate (aCFR) estimates were tabulated cumulatively and stratified by age and time. Death location was classified as either in-hospital death or brought-in-dead. A 95% confidence interval around these parameters was estimated using Wald's bootstrapping approach [14]. The daily incidence and mortality rates were visualized to explore longitudinal trends within Malaysia. The mid-year population was assumed to be the population at risk for the risk set on each day. The incidence density (by date of reporting) and the mortality rate (by date of actual death) were calculated as the daily reported cases and deaths, respectively, divided by this population at risk.
The reported case fatality rate (CFR) is estimated as the percentage of COVID-19 mortalities on a specific date over the reported number of COVID-19 cases on the date of death. A limitation of this crude reporting is the misspecification of the population at risk, which yields a measure closer to prevalence than to an incident measure of risk [15][16][17][18]. We estimated an adjusted CFR (aCFR) by first attributing deaths to the date they were reported positive and then calculating the percentage of COVID-19 mortalities over the reported number of COVID-19 cases on the date the death was reported as a case.
We approximated the age-stratified infection fatality rate (IFR) using the lowest non-zero aCFR between 1 October 2020 and 31 October 2021. We utilized this period as few deaths (n = 12) occurred before 1 October 2020, and the risk of death is likely different after 31 October 2021 due to the National COVID-19 Immunization Program. The age-stratified aCFR was estimated over a moving 3-month period to stabilize the approximated IFR, as a low number of deaths was reported in some age strata. The lowest age-stratified non-zero aCFR was compared to reported pooled estimates [19,20].
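The moving-window aCFR and the selection of its lowest non-zero value can be sketched as below. This is a hedged Python illustration rather than the authors' R implementation: the function names, the trailing (rather than centred) 3-month window, and the monthly counts are all assumptions made for the example. The deaths series is assumed to be already attributed to the month in which each deceased person was reported positive.

```python
def rolling_acfr(cases, deaths, window=3):
    """Percent aCFR over a trailing window of months (window type is an assumption).

    cases[i]  : cases reported in month i
    deaths[i] : deaths among people who were reported positive in month i
    """
    if len(cases) != len(deaths):
        raise ValueError("cases and deaths must cover the same months")
    out = []
    for end in range(window, len(cases) + 1):
        c = sum(cases[end - window:end])
        d = sum(deaths[end - window:end])
        # A window with no reported cases has no defined aCFR.
        out.append(100.0 * d / c if c > 0 else None)
    return out


def lowest_nonzero(values):
    """Smallest strictly positive value, ignoring undefined windows."""
    positives = [v for v in values if v is not None and v > 0]
    if not positives:
        raise ValueError("no non-zero aCFR available")
    return min(positives)


# Made-up monthly counts for a single age stratum (illustration only).
cases = [1000, 2000, 4000, 8000, 6000]
deaths = [10, 10, 20, 80, 90]

acfr = rolling_acfr(cases, deaths)
ifr_proxy = lowest_nonzero(acfr)

if __name__ == "__main__":
    print([round(v, 3) for v in acfr], round(ifr_proxy, 3))
```

Pooling counts across the window before dividing, instead of averaging three monthly ratios, keeps months with very few cases from dominating the estimate.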
The ratio of CFR to IFR will equal 1 when all cases are ascertained (CFR/IFR = 1). As the number of unascertained cases increases, the CFR also increases, and as such CFR/IFR > 1. We corrected the observed number of cases to reflect the true number of cases by calibrating the observed cases against an adjustment factor given by the ratio CFR/IFR, so that the estimated number of infections equals the observed number of cases multiplied by CFR/IFR.
The expected distribution of COVID-19 cases from COVID-19 mortalities assumes: i) the estimated lowest non-zero CFR approximates the unknown IFR, ii) all COVID-19 deaths are reported, iv) exclusion of brought-in-dead mortalities will account for any excess mortality secondary to poor healthcare accessibility, and v) IFR is constant through time. Hospital protocols during this period likely resulted in high ascertainment of COVID-19 related deaths among all deaths. In addition, all reported deaths outside the hospital were tested for SARS-CoV-2 using RT-PCR tests. All adjustment factors below one were replaced with one, as it is impossible to estimate fewer cases than reported; such values were likely due to a small number of observed cases. The expected number of infections was rounded to the nearest thousand to ease interpretation. Reported cases and the estimated infections were tabulated by month.
A sensitivity analysis was carried out to quantify the effect of hospital death underreporting on aCFR and case estimation. The Farrington algorithm was utilised to model daily age-specific excess mortality counts between January 2020 and September 2021 in Malaysia using age-specific all-cause mortality data from January 2016 to December 2019 [21][22][23][24][25][26][27]. A back-projection model following a Poisson process was carried out to estimate the unobserved age-specific death curve on the day a case was reported positive using an empirically estimated time-lagged age-specific delay distribution from reporting to death [28][29][30].
Excess mortality counts were then utilised in estimating the aCFR and subsequent degree of underestimation. All analyses were carried out using the "tidyverse", "zoo", "epitools", "prevalence", "boot", "Amelia" and "surveillance" packages in R 4.3.1.
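The case-adjustment step described in this section (scaling reported cases by CFR/IFR, replacing factors below one with one, and rounding to the nearest thousand) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' R code; the function name estimate_infections and the example inputs are hypothetical.

```python
def estimate_infections(reported_cases, acfr, ifr_proxy):
    """Return (adjustment factor, estimated infections rounded to nearest thousand).

    acfr and ifr_proxy must be in the same units (for example, both in percent).
    """
    if ifr_proxy <= 0:
        raise ValueError("ifr_proxy must be positive")
    # Factors below one are replaced with one: the estimate cannot be
    # smaller than the number of cases actually reported.
    factor = max(1.0, acfr / ifr_proxy)
    estimate = reported_cases * factor
    # Round to the nearest thousand to ease interpretation.
    # Note: Python's round() sends exact ties to the nearest even number.
    return factor, int(round(estimate / 1000.0)) * 1000


if __name__ == "__main__":
    # Hypothetical stratum: aCFR three times the approximated IFR.
    print(estimate_infections(50_000, acfr=1.5, ifr_proxy=0.5))
    # Hypothetical stratum where aCFR falls below the approximated IFR.
    print(estimate_infections(12_600, acfr=0.3, ifr_proxy=0.5))
```

Summing the per-stratum, per-month estimates would then give a total comparable in form to the one reported, although the exact aggregation used in the study is not specified here.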
Results
A total of 27,711 deaths were reported between 1 March 2020 and 30 December 2021 (Table 1). Mortality trends were quadrimodal with peaks reported on 29 March 2020 (n = 7), 18 February 2021 (n = 25), 2 June 2021 (n = 126), and 9 August 2021 (n = 360). The age-specific mortality rate was highest among individuals aged above 60 between August and September 2021 (Fig 2). The distribution of cases by age remained similar across time.

Discussion
An estimated 32.8% (9.95 million) of the population are likely to have been infected, and 23.2% of COVID-19 infections were reported between March 2020 and December 2021. The adjustment factor for the burden of illness varied by time and age group. These results suggest that community immunity is higher than expected.
The number of infections is estimated to be, on average, 4.3 times (Range = 1-8.8) the number of reported cases in Malaysia, with variations by period and age group. The overall underestimation in Malaysia is comparable to estimations in the United States (US) across time and age strata [31,32]. Underestimation was estimated to be nine times (90% CI: 4–14) higher than reported cases in another IFR-based adjustment in the US [9]. Another global modelling study reported that the true number of infections was 1.4 to 18 times higher than reported cases with heterogeneity between countries [15]. The degree of underestimation is comparable in many of these settings to the findings observed in our study [6,7,33]. Over the study period, an estimated 32.8% of the population was infected. Estimates of seroprevalence over smaller geographical localities and periods are consistent with estimates here [15,34,35,36]. However, comparisons of the national-level period prevalence estimates were not carried out due to the sparse availability of published literature.
Age-specific prevalences between Malaysia and the US are comparable except for those above 65 years [31,32]. This may be due to lower aCFR estimations when reported deaths are utilised, as an increase in the number of estimated cases in individuals aged above 65 is observed when excess mortality counts are utilised. Nonetheless, per capita differences remain even when excess mortality counts are utilised. This residual difference of higher prevalence of infection among those above 65 years in the US may be due to the higher proportion of elderly under nursing or institutional care. Malaysia also reports high prevalences in the youngest age groups, with a significant degree of these infections being underestimated [37].
Explanations for this phenomenon include the possibility of preferential testing of older age groups due to their higher risk of severe disease [38]. Changes in this age dynamic have also been reported in countries that have changed testing strategies [39]. Lower susceptibility in younger age groups and a higher propensity to be asymptomatic have also been proposed as explanations for this dynamic [40]. One final possibility is the use of long-term school closures within many settings, particularly in lower-income countries, which has been suggested to be very effective in reducing transmission [41]. The lack of human-to-human interaction between younger individuals over a prolonged period, such as has been observed in Malaysia, could potentially amplify this age dynamic.

Published estimates of the true burden of disease have utilized diverse methodologies, and these published findings have primarily come from the global North. These include: i) random population sampling [6,42], ii) seroprevalence studies [43,44], iii) influenza-like illness (ILI) surveillance-based models to estimate prevalence [9,45], iv) crowdsourced data [46], v) testing-data-based adjustments using probabilistic bias analysis [33] or post-sampling stratification and reweighting [47], vi) mortality-data-based methods such as mechanistic disease models [15], vii) statistical curve fitting [48], viii) mortality mapping using Bayesian frameworks [9], and ix) combination methods that combine mechanistic models, random sampling, and other data sources with the IFR [7]. The methodology proposed here is advantageous to the global South, where estimates of the true burden of disease and resources remain scarce.
There are several limitations to this analysis. First, the CFR may deviate from the unknown IFR because of changes in the testing strategy, unreliable vital statistics data, saturation of surveillance systems, the fluidity of disease transmission, virulence, virus genotype, inability to provide adequate care, availability of resources including human capital, heterogeneities in the distribution of medical comorbidities, and changes in immunity due to vaccination [49]. We utilized an approximated IFR estimated from the lowest non-zero, age-specific aCFR, as the CFR has been shown to approximate the IFR when the CFR is smaller [50]. These approximated IFR estimates were similar to published pooled IFR estimates [51][52][53][54].
Second, we did not quantify the individual effect of various factors that drive the underestimation of COVID-19 incidence in Malaysia. Selection biases, including issues of access, may lead to varying levels of testing over time and space. Testing strategies may also modify the detection of disease within a population. Asymptomatic and pre-symptomatic transmission, variations in transmissibility, and long-tailed incubation periods further complicate the ascertainment of disease [51][52][53][54]. Misclassification biases are driven by the accuracy of tests, with studies suggesting nucleic acid amplification (NAA) test sensitivity ranging from 63% to 89% and NAA test specificity of almost 99% [6,9]. These biases limit a surveillance system's ability to ascertain the true burden of COVID-19 infections [45].
Third, the aCFR was estimated using reported deaths instead of the number of excess deaths. The number of infections may be underestimated using the aCFR compared to excess deaths in countries with poor COVID-19 specific mortality reporting. However, the Malaysian National Death Register has a high ascertainment coverage [55,56], and sensitivity analysis revealed small differences in the prevalence estimate using excess counts (25.8%) and reported deaths (28.8%). Fourth, the incidence density estimates utilised the mid-year population as the population at risk. However, this assumes that infections and vaccinations do not provide complete immunity to the SARS-CoV-2 virus. Violation of this assumption may lead to underestimation of the incidence density. Finally, we used a multiplier model approach instead of a mechanistic approach, limiting its utility in forecasting future disease dynamics.

Conclusion
The characterization of the true burden of disease is essential in developing and implementing policy measures and allocating resources. These estimations highlight a more accurate infection burden in Malaysia. They provide the first national-level estimates of prevalence in Malaysia that are adjusted for underdiagnosis. The higher underestimation of infections during April-September 2021 coincided with sustained higher community transmission and higher healthcare utilization. Naturally acquired community immunity is still low but is likely to increase secondary to an Omicron-fuelled infection surge. Booster doses may further hasten an equilibrium between the SARS-CoV-2 virus and its human host. Such an equilibrium should mark the start of an endemic state. Future variants may upend this equilibrium and necessitate periodic mitigation of disease transmission. Population estimates of the infection burden are critical to determine the need for booster doses and public health measures.

Supporting information
S1 Appendix. Adjudication of deaths in Malaysia. (DOCX)