Estimates of the basic reproduction number for rubella using seroprevalence data and indicator-based approaches

The basic reproduction number (R0) of an infection determines the impact of its control. For many endemic infections, R0 is often estimated from appropriate country-specific seroprevalence data. Studies sometimes pool estimates from the same region for settings lacking seroprevalence data, but the reliability of this approach is unclear. Plausibly, indicator-based approaches could predict R0 for such settings. We calculated R0 for rubella for 98 settings and correlated its value against 66 demographic, economic, education, housing and health-related indicators. We also trained a random forest regression algorithm using these indicators as the input and R0 as the output. We used the mean-square error (MSE) to compare the performances of the random forest, simple linear regression and a regional averaging method in predicting R0 using 4-fold cross validation. R0 was <5, 5–10 and >10 for 81, 14 and 3 settings respectively, with no apparent regional differences, and in the limited available data it was usually lower for rural than urban areas. R0 was most correlated with educational attainment and household indicators for the Pearson and Spearman correlation coefficients respectively, and with poverty-related indicators followed by the crude death rate considering the Maximum Information Coefficient, although the correlation for each was relatively weak (Pearson correlation coefficient: 0.4, 95% CI: (0.24, 0.48) for educational attainment). A random forest did not perform better in predicting R0 than simple linear regression, depending on the subsets of training indicators and studies, and neither out-performed a regional averaging approach. R0 for rubella is typically low and using indicators to estimate its value is not straightforward. A regional averaging approach may provide as reliable an estimate of R0 for settings lacking seroprevalence data as one based on indicators.
The findings may be relevant for other infections and studies estimating the disease burden and the impact of interventions for settings lacking seroprevalence data.


Effect of studies with high R0 values on the performance of simple linear regression and Random Forest prediction
In this section we explain in detail how the MSE variation of simple linear regression on different indicators can be directly traced to particular combinations of missing indicator values and the value of R0 as calculated using seroprevalence data.
Consequently, we argue that this variation in the MSE should not be interpreted as increased predictive power of some indicators over others.
More specifically, we show that those indicators that have a missing value for either or both of the studies with high seroprevalence-estimated R0 values ('Czech Republic, <1967' and 'Chile (rural), 1967-68', with R0 equal to 19.97 and 16.53 respectively) are associated with better MSE performance. In simple terms, the trained model's prediction cannot reach those high R0 values with either of the two regression methods that we consider (simple linear regression and random forest). Hence the indicators which have a missing value for both of those studies have an advantage in terms of the prediction MSE.
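The point that a fitted model cannot reach a high R0 value can be illustrated with a minimal sketch of cross-validated simple linear regression. Everything here is an assumption made for illustration: the indicator values, all R0 values except 19.97, the interleaved fold assignment and the function names are hypothetical and are not the data or code used in the study.

```python
# Hypothetical toy data: one indicator value and one R0 estimate per study.
# Only the R0 value 19.97 is taken from the text; the rest is made up.
indicator = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
r0 = [3.0, 4.0, 3.5, 5.0, 4.5, 19.97, 4.0, 5.5]
HIGH = 5  # position of the high R0 study in the lists above


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_predictions(x, y, k=4):
    """Predict each study from a line fitted on the other k-1 folds.

    Folds are assigned by position (i % k), an arbitrary choice for this sketch.
    """
    predictions = [0.0] * len(x)
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in range(len(x)):
            if i % k == fold:
                predictions[i] = intercept + slope * x[i]
    return predictions


def mean_squared_error(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


predictions = cross_validated_predictions(indicator, r0)
mse_with_high = mean_squared_error(r0, predictions)

# Repeat after dropping the high R0 study, as if the indicator were missing for it.
x_without = [v for i, v in enumerate(indicator) if i != HIGH]
y_without = [v for i, v in enumerate(r0) if i != HIGH]
mse_without_high = mean_squared_error(
    y_without, cross_validated_predictions(x_without, y_without)
)

print("prediction for the held-out high R0 study:", round(predictions[HIGH], 2))
print("MSE with the high R0 study:", round(mse_with_high, 2))
print("MSE without the high R0 study:", round(mse_without_high, 2))
```

In this toy setting the line fitted without the high R0 study predicts a value close to the bulk of the data, so that single study dominates the squared error, and removing it lowers the MSE without the indicator being any more informative.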
In contrast, indicators which have a valid value for both of those studies are associated with a higher MSE. This increase in the MSE is exacerbated for indicators which have very few valid values overall, as the higher error due to the two high R0 studies is averaged over a smaller number of total studies. Hence, indicators with a valid value for the 'Czech Republic, <1967' and 'Chile (rural), 1967-68' studies but with valid values for only a few studies overall are associated with the highest MSE.
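The averaging argument can be made concrete with a short arithmetic sketch. The error magnitudes and the helper name below are hypothetical, chosen only to show how the same two large errors weigh more heavily when fewer studies contribute to the mean; they are not values from Table A.

```python
def mse_with_outliers(n_studies, typical_error, outlier_errors):
    """MSE when all but a few studies share one typical absolute error.

    n_studies      -- total number of studies with a valid indicator value
    typical_error  -- assumed absolute prediction error for an ordinary study
    outlier_errors -- assumed absolute errors for the high R0 studies (may be empty)
    """
    n_ordinary = n_studies - len(outlier_errors)
    total = n_ordinary * typical_error ** 2 + sum(e ** 2 for e in outlier_errors)
    return total / n_studies


# Hypothetical errors: about 2 for ordinary studies, 15 and 12 for the two high R0 studies.
many_studies = mse_with_outliers(98, 2.0, [15.0, 12.0])
few_studies = mse_with_outliers(20, 2.0, [15.0, 12.0])
no_high_studies = mse_with_outliers(96, 2.0, [])

print(round(many_studies, 2), round(few_studies, 2), round(no_high_studies, 2))
```

With these made-up numbers the MSE is 4.0 when neither high R0 study is present, about 7.7 when both are present among 98 studies, and about 22 when both are present among only 20 studies.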
Starting with the simple linear regression case, as can be seen in Table A (which also has a version sorted by MSE value for presentation clarity), the best-performing indicator is 'Poverty gap at national poverty lines (%)' (listed in line 20 of Table A). Considering the indicators which have a valid value for both of the high R0 studies, those with valid values for only a few studies overall fare worst: as was described above, the higher error due to the high R0 studies is averaged over a much smaller number of total studies and, consequently, the MSE for those two indicators is much higher (33.32 and 17.4 respectively).
A similar effect can be seen in the random forest performance results for the indicator subsets used to train the random forest (a tabular summary is included in Table B). In the cross-validation experiment conducted using only the 25 indicators that have no missing values, both of the high R0 studies are present and the overall error is 9.87. In the experiments conducted using the indicators that have up to 10, 20 and 40 missing values, only one of the two high R0 studies is present and the overall error is initially lower. However, as the number of studies involved in the experiment falls from 69 to 52 and then 32 as we allow indicators with progressively more missing values, the overall error progressively increases (8.19, 9.6 and 13.8 respectively). Finally, neither of the two high R0 studies is included in the experiment conducted with the 20 indicators having up to 70 missing values, and the overall error in this case attains its lowest value of 6.88.
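The link between the missing-value threshold, the number of studies and the presence of a high R0 study can be sketched as follows. This assumes a complete-case rule (keep indicators with at most a given number of missing values, then keep only studies with a valid value for every kept indicator), which is our reading of the description rather than a confirmed detail of the analysis. The table, the study labels and the function name are hypothetical.

```python
def select_complete_cases(table, max_missing):
    """Keep indicators with at most max_missing missing values, then keep
    only the studies that have a valid value for every kept indicator."""
    indicators = sorted({name for row in table.values() for name in row})
    kept_indicators = [
        name for name in indicators
        if sum(row.get(name) is None for row in table.values()) <= max_missing
    ]
    kept_studies = [
        study for study, row in table.items()
        if all(row.get(name) is not None for name in kept_indicators)
    ]
    return kept_indicators, kept_studies


# Hypothetical toy table: None marks a missing indicator value.
toy_table = {
    "A": {"gdp": 1.0, "poverty_gap": None, "literacy": 0.9},
    "B": {"gdp": 2.0, "poverty_gap": 3.0, "literacy": 0.8},
    "C": {"gdp": 3.0, "poverty_gap": 4.0, "literacy": None},
    "HighR0": {"gdp": 1.5, "poverty_gap": None, "literacy": 0.7},
}

for threshold in (0, 1, 2):
    kept_indicators, kept_studies = select_complete_cases(toy_table, threshold)
    print(threshold, kept_indicators, kept_studies, "HighR0" in kept_studies)
```

In this toy table, raising the threshold admits more indicators but leaves fewer studies, and at the largest threshold the high R0 study drops out altogether, which is the combination associated with the lowest error in the text.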
None of those effects can be seen in the imputed results (Fig 4). Both of those effects follow exactly the pattern described above.