Disaggregate level estimates and spatial mapping of food insecurity in Bangladesh by linking survey and census data

Food insecurity is an important and persistent social issue in Bangladesh. Existing data based on socio-economic surveys produce divisional and nationally representative food insecurity estimates but these surveys cannot be used directly to generate reliable district level estimates. We deliberate small area estimation (SAE) approach for estimating the food insecurity status at district level in Bangladesh by combining Household Income and Expenditure Survey 2010 with the Bangladesh Population and Housing Census 2011. The food insecurity prevalence, gap and severity status have been determined based on per capita calorie intake with a threshold of 2122 kcal per day, as specified by the Bangladesh Bureau of Statistics.The results show that the food insecurity estimates generated from SAE are precise and representative of the spatial heterogeneity in the socioeconomic conditions than do the direct estimates. The maps showing the food insecurity indicators by district indicate that a number of districts in northern and southern parts are more vulnerable in terms of all indicators. These maps will guide the government, international organizations, policymakers and development partners for efficient resource allocation.


Introduction
Achieving food security for all remains one of the major development goals throughout the world. Food security exists when all people, at all times, have physical and economic access to sufficient, safe and nutritious food that meets their dietary needs and food preferences for an active and healthy life as described by Food and Agriculture Organization (FAO) of the United Nations [1]. Inversely food insecurity exists when people do not have adequate physical, social or economic access to food. The food security is one of the highest priorities of the Government of Bangladesh to achieve the second Sustainable Development Goal (SDG 2). Although the food security situation in Bangladesh has improved during the last few decades, growing inequality among lower administrative boundary levels is a serious concern. Policy interventions and foreign aids are more likely to be effective in reducing spatial inequality of food security if resources can be allocated and distributed based on local level food insecurity status. Despite the high importance, the estimates of food insecurity indicators at local area or lower administrative boundary (e.g. district) level are still lacking. The Bangladesh Bureau of Statistics (BBS) uses the estimate of food insecurity indicators as an alternative of poverty indicators calculated from the Household Income and Expenditure Survey (HIES) of BBS, see for example, BBS and UNWFP [2]. In Bangladesh "food insecurity" is defined as an average intake of less than 2122 kcal per capita per day based on the Direct Calorie Intake (DCI) method. However, the district level estimates of food insecurity indicators (as an alternative of poverty indicators) calculated by the BBS based on HIES 2000 data [2] are now obsolete in terms of its use-effectiveness. Thus, the district level estimates of food insecurity indicators using the latest round of HIES 2010 data can provide invaluable information for the policy planners to formulate effective action plans to achieve the relevant SDGs. Hence, to fill in the gaps in this respect, this paper aims to generate the district level estimates of food insecurity indicators using Empirical Best Prediction (EBP) approach of Small Area Estimation (SAE), seemingly reasonably well suited method for Bangladesh context (discussed in section 2), with supplementary interactive maps.
The rest of the paper is organized as follows. Section 2 delineates an overview of SAE methods to search for a better option to estimate food insecurity and describes the data from the HIES 2010 and the Bangladesh Population and Housing Census 2011 (hereafter Census 2011). This Section also provides discussion on model specification required for the SAE analysis. Section 3 briefly describes SAE methodology, in particular the EBP, employed for generating the district level estimates of the food insecurity indicators. Application of the EBP to food insecurity mapping at the district level is presented in Section 4. Finally, Section 5 set outs the main conclusions and recommendations.

Overview of SAE approach: Searching a better option
The existing data, based on national level socio-economic survey (e.g. HIES conducted by the BBS), produce estimates that are representative of the macro-geographical units (e.g. national and division (admin-1)) and cannot be used directly to produce reliable micro or disaggregate or local level (also referred as small area e.g. district or sub-district level) statistics due to very small or even zero sample sizes. In the survey literature, an area is regarded as small if the areaspecific sample size is not large enough to ensure that a direct survey estimator has adequate precision. In this context, the statistical methodology that tackles this problem of small sample sizes is often referred as SAE theory in the survey literature [3,4]. The technique is a modelbased method that links the variable of interest from survey with the auxiliary information available from other data sources (e.g. census) for small areas. See Rao and Molina [3] for a comprehensive overview of SAE techniques.
The standard approach of SAE methods based on both unit-level and area-level modelling have been developed comprehensively for linear parameters such as means and totals. However, direct use of such SAE methods to produce estimates for non-linear and complex parameters, for instance Foster-Greer-Thorbecke (FGT) type food insecurity and poverty indicators becomes problematic [5] and appropriate transformation of the target variable is needed (see methodology section for details). As unit (i.e. household) level data is accessible, this paper focuses on unit level small area modeling and considers the SAE of food insecurity parameters. In this context, the three commonly used SAE methods based on unit-level model are the Elbers, Lanjouw and Lanjouw (ELL) or World Bank method [6], the empirical Bayes (Best) prediction (EBP) method [7] and the M-Quantile (MQ) method [8]. What follows, these methods are compared to suggest for a better approach for estimating food insecurity indicators.
The ELL method is the simplest one in terms of application and theoretical complexity. In the ELL method, the original/transformed values of the target response is assumed to follow a unit-level nested error regression model where households are nested within the cluster that is clusters instead of the target small areas are assumed to have random effects. Although the ELL method is an attractive and the easiest one to implement, it provides poor mean squared errors (MSE) even if the bias is usually small [9]. Since the ELL method assumes cluster-variability rather than area-variability, the ELL estimator performs poorly and can even perform inferior to direct estimator when unexplained between-area variation remains in the model after accounting a number of explanatory variables [7]. Moreover, this method is not optimal for a given distribution since it does not give the empirical best estimator [9]. In the MQ method, the between-area variations are captured by calculating area-specific M-quantile coefficients instead of random effects. The distinguishing features of this method are distribution free assumptions on the model errors and the area effects, and this also allows outlier robust inference. However, when the functional form of the relationship between the q th MQ and the covariates is non-linear, it can lead to biased estimates [10]. The EBP method is based on the assumption that a suitable transformation (e.g. log) of the response variable will follow a nested-error linear regression model [11], called hereafter BHF (Battese-Harter-Fuller) model, with normally distributed errors. The EBP is similar to the ELL method but generates census predictions of response variable such as per capita calorie intake by using the conditional predictive distribution of the out-of-sample data, given the sample data under the BHF model. This method gives the best predictions by minimizing the MSE under the assumed unit-level model. Further, the EBP performs better than the ELL method when underlying assumptions are satisfied and the unexplained between-area variation remains significant [12].
The performances of three methods (i.e. ELL, EBP and MQ) differ in terms of their underlying model assumptions particularly distinctions in the consideration of random effects [13]. One method performs better when the real dataset satisfies the respective underlying model assumptions. The EBP utilizes the survey data to narrow down the random area effects while the conventional ELL method makes no such attempt. Consequently, the EBP method will make a difference for areas with some information in the survey, while the EBP method reduces to the conventional ELL prediction for those areas without any information in the survey. For Bangladesh poverty mapping study, when the prediction level goes down to sub-districts from district, there are many small domains without any information in the survey data. To use the survey information, the EBP method has been added in the World Bank POVMAP software version-2.5 [14] with the note that the conventional EBP method based prediction is expected to do better if there are relatively large random area effects, if many of the small areas are covered by the survey, while the error distributions can be reasonably well approximated by a normal distribution. In our analysis, the target variable of interest is daily per capita calorie intake averaged at the household level and on logarithmic (log) scale is approximately normally distributed with a negligible number of influential observation and the random district specific effect is significant for this data (for detail results see next Section). Further, all the target small domains (here district) are covered in the survey data. As a consequence, following [14] this paper adopts the EBP method rather than the ELL and the MQ methods to generate the district level estimates of the food insecurity indicators in Bangladesh by linking the HIES 2010 and the Census 2011.

Data and model specification
In this study, the variable of interest for which small area estimates are required is taken from the HIES 2010 conducted by BBS. The HIES 2010 survey data is collected following a two-stage stratified sampling design covering all the 7 divisions and 64 districts. Detail sampling plan is available in the HIES report [15]. In HIES 2010, the district-wise sample size ranges from 120 to 480 with an average of 191. The sampling fraction varies from 0.00019 to 0.00052 across district with an average of 0.00040. Eventually, the district level sample sizes are not sufficient to produce reliable estimates of the food insecurity indicators and their associated standard errors at district level. The application of SAE technique is an obvious choice for obtaining the district level estimates of the food insecurity indicators. The daily per capita calorie intake averaged at the household level is used as a response variable to estimate the food insecurity indicators. The per capita calorie intake is slightly right skewed, and so the final response variable is prepared by taking log-transformation by adding up a fixed quantity to make it always positive as suggested by Molina and Marhuenda [16]. Thus, the log-scale daily per capita calorie intake is used as the response variable in the BHF model considered in this paper.
The auxiliary variables for this analysis are obtained from the Census 2011. Since the full census data set is not accessible, the permitted 5% Census 2011 data are used. As auxiliary variables, we used only those variables for modelling and predicting that are common and comparable between HIES 2010 and Census 2011. Also, some contextual variables at sub-district levels are created from the Census 2011 and combined with the HIES 2010. There are 70 auxiliary variables (covariates) scrutinized for fitting a 2-level BHF model considering household as the first level and district as the second level. We adopted the principle of hierarchical modelling technique for selecting the best set of covariates [17]. The two-way interaction terms are included in the model if only their corresponding main effects are also included. The interaction and non-linear terms are added carefully and judiciously. Following this approach, we used a stepwise regression between log-transform per capita calorie intakes and 70 different auxiliary variables to identify the significant covariates. Finally, 23 auxiliary variables that significantly explain the linear model with R-squared (R 2 ) value of 22.5 percent, are identified for the use in the SAE analysis. Although diagnostic plots are not reported here, the residual diagnostic plots namely histogram and normal P-P plot of standardized residual for this data reveal fitted model satisfies the underlying assumption of normality and homoscedasticity reasonably well. Though the R 2 value is not particularly high, it is reasonable to obtain precise disaggregate level estimates, if the unexplained variation is mostly at household level rather than area-level [18].
The summary results of the finally selected 2-level BHF model are given in Table 1. The estimates of fixed effect parameters along with their level of significance are shown in Table 2. The results in Table 1 clearly reveal that the significant contribution of random district specific effect ðs 2 u Þ in the variability of response variable in the null model. The between district variation has been reduced when covariates are included in the null model. Following the idea of Nakagawa and Schielzeth [19], the marginal and conditional R-squared values for the full model indicate that about 22.2% of total variation is explained by the fixed effects alone and about 33.0% of total variation is explained by both the fixed and random effects. The smaller AIC value for full mixed-effects model also suggests the goodness of the fitted full model. The intra-class correlation coefficient (ICC) suggests that about 13% of the total variability in the response are due to between districts variation. The likelihood ratio (LR) test also confirms the significance of the between-district variation (w 2 ð1Þ ¼ 1416:46; P-value < 0.001). The BHF model is based on distributional assumption of level 1 (household) and level 2 (district) random effects (residuals), i.e. both the level 1 and 2 random effects in model are independently and identically distributed values from normal distributions with mean zero and fixed variances. Standard diagnostics such as distribution of residuals, histograms and normal probability (p-p) plots are used for this purpose. Diagnostic plots of the level 1 and 2 residuals obtained from the fitted BHF model are shown in Fig 1. The plots in Fig 1 indicate that these distribution features hold for both level 1 and 2 residuals when fitted to HIES 2010 data. We also observed that the household as well as the district level residuals are randomly distributed, and that their line of best fit does not significantly differ from the line y = 0 in both cases. Model diagnostics are therefore satisfactory for both the BHF model fitted to HIES 2010 data. The EBP method of SAE is therefore expected to provide efficient estimates of district level food insecurity indicators obtained from the fitted BHF model.

SAE Methodology: EBP method
This Section briefly presents an overview of SAE method used in the estimation of district wise food insecurity indicators. To start, let us assume that there is a known number N i of population units in area i, with n i of these sampled. The total number of units in the population is . The quantity of interest is the small area food insecurity indicator F αi , followed from Foster et al. [5], is defined as Here w ij ¼ w � ij = P j2s i w � ij is normalized survey weights and w � ij is inverse of the inclusion probability for unit j in area i. The design-based variance of the direct estimatorF Dir ai can be approximated by, The direct estimator becomes inefficient when area specific sample size is small and further, for areas with no sample data, direct estimates cannot be used. In this context, EBP method of Molina and Rao [7] is often used for estimating the FGT indicators [7]. Let x ij be the vector of values of p unit level auxiliary variables associated with the target variable y ij = log(E ij ). A twolevel nested error model (which is a special case of linear mixed model), often referred as BHF model in SAE [11], considering household at level-1 and target small area (here districts) at level-2 is Here β is a p-vector of fixed effects parameter; u i � Nð0; s 2 u Þ is a random area specific effect associated with area i and e ij � Nð0; s 2 e Þ is an individual level random effect for unit j in area and assume that they are mutually independent. The FGT food insecurity indicator for small

PLOS ONE
Spatial mapping of food insecurity in Bangladesh area i given by (1) can be expressed as where s i and r i denote sample and non-sample vectors of size n i and N i −n i units of area i. Molina and Rao [7] obtained the empirical best predictor (EBP) of F αij = h α (y ij ), as a nonlinear function of y ij ,by minimizing the MSE without restrictions of linearity or unbiasedness and is given byF where f y j ðy j jy s i Þ is the conditional density of y r i given the sample data y s i . The expected value in (6) cannot be calculated explicitly due to the complex non-linear parameters of F αij = h α (y ij ), even if this conditional distribution was completely known. In this case, Molina and Rao [7] proposed to estimate the unknown model parameters by consistent estimators such as Maxi- iii. Average the target quantity over the L simulations aŝ Since the size of y r i is typically very large, generation of y r i might be computationally cumbersome from a multivariate distribution and so Molina and Rao [7] proposes to generate y r i from univariate distribution as with The EBP of the food insecurity measure F αi is then given bŷ For areas with zero sample size, the EBP (9) reduces to a synthetic type estimator given bŷ The MSE estimates are required to measure the precision of estimates and also to construct the confidence interval for the estimates. Following González-Manteiga et al. [20] and Molina and Rao [7], the MSE estimate of (9) is obtained using the parametric bootstrap method. The EBP (9) defined under model (4) and its associated parametric bootstrap MSE estimates can be obtained using sae package in R [16].

Results and discussions
The district level estimates of three food insecurity indicators namely FIP, FIG, and FIS are generated from the EBP method under BHF model (4) with 23 significant covariates. The parametric bootstrap MSE estimates used in this analysis are based on B = 100 samples and L = 50 samples in the EBP estimation. In SAE application, two types of diagnostics measures are suggested and applied: (i) the model diagnostics, and (ii) the diagnostics for the small area estimates. The model diagnostics are applied to verify model assumptions. The other diagnostics are used to validate the reliability of the model-based small area estimates. In model (4), both level 1 (household level) and level 2 (district level) random effects (residuals), are assumed to be independently and identically distributed values from a normal distribution with mean zero and fixed variance. The standard diagnostics implemented reveals that the normality assumptions are satisfied reasonably well for the data utilized in this analysis.
To validate the reliability of the small area estimates produced by the EBP method (i.e. model-based small area estimates) from the fitted BHF model, a set of diagnostics is required. For example, model-based small area estimates should be consistent with unbiased direct estimates, and more precise than direct estimates. The values for the model-based small area estimates should be consistent with the unbiased direct estimates. Further, the model-based small area estimates should have MSEs significantly lower than the variances of the corresponding direct estimates [21,22]. In this context, we deliberate three widely used diagnostics, viz. the bias diagnostics, coefficient of variation (CV) and 95% confidence intervals (CIs) for the small area estimates. In addition, we inspect the aggregation diagnostic where the model-based estimates are aggregated to higher level and compared with direct estimates at this level [21].
The We also use Goodness of fit (GoF) diagnostic. This tests whether the direct and modelbased small area estimates are statistically different. The null hypothesis is that the direct and model-based small area estimates are statistically equivalent. The alternative is that the direct and model-based estimates are statistically different. The GoF diagnostic is computed using the Wald statistic for every model-based estimate:  The percent CV is calculated to assess the improved precision of the model-based estimates generated by the EBP method compared to the direct estimates. The CV shows the sampling variability as a percentage of the estimate. Estimates with large CVs are considered unreliable.  obtained via the Direct and EBP methods along with their percentage CVs and and 95 confidence intervals are set out in S1-S3 Appendices. From the results presented in S1-S3 Appendices and shown in Fig 3, it is evident that the CVs of the direct estimates are slightly higher and therefore the estimates are unreliable. As expected, the larger CVs occur in the district smaller sample size. Though there is no exact role of thumb for the CV, 20% CV is maintained by the Office for National Statistics in the United Kingdom [23]. The CVs in Fig 3 show that the direct estimates of food insecurity prevalence (FIP/HCR) have CVs over 20% for several districts, whereas the CVs of the EBP estimates do not exceed this limit for any of the districts. The similar scenarios are also observed in case of FIG and FIS, except in few districts. Overall, the EBP estimates are more reliable than the direct estimates in terms of percentage CV for all the food insecurity indicators. The district-wise 95% CIs of the model-based and the direct estimates are reported in S1-S3 Appendices. In general, 95% CIs for the direct estimates are wider than the 95% CIs for the model-based estimates (see S1-S3 Appendices). The direct 95% CI estimates are wider due to large standard errors.
We also examine the aggregation of the model-based estimates of food insecurity prevalence at divisional and national level. Standardized differences between the two estimators (i.e. Direct and EBP) calculated Z ¼ ðEBP estimateÀ Direct estimateÞ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi varðDirect estimateÞþMSEðEBP estimateÞ p are shown in Table 3. The Z score is used to examine how the small area estimates differ from the design-unbiased direct estimates. The Z-scores are observed within three standard errors of the direct estimates, indicating a reasonable level of agreement between the two estimators. In all cases, except the estimate for the national level where the standard errors (SEs) are exactly equal, the small area estimates are more precise than the direct estimates. The improvement of SAE estimates in terms of SE is expectedly less at the division and national as expected due to sufficient sample size, however significant gains are expected at the lower administrative units (say, district or sub-district). Summary statistics of the estimated food insecurity indicators for the 64 districts and along with their SEs generated by the Direct and the EBP methods are shown in Table 4. The results reveal that the averages of the EBP estimates of food insecurity indicators are slightly higher compared to those of the direct estimates but with a lower variation. As for example, average values of FIP/HCR are estimated by the direct and the EBP methods are 34.4% and 35.1% respectively. The standard deviations of these estimates generated by the direct and the EBP methods are 12.8% and14.2% respectively.
The FIP/HCR, FIG and FIS estimates calculated by the EBP method are presented in the cartograms of Fig 4. The maps show the spatial distribution of food insecurity indicators at district level. Darker regions of the maps correspond to the regions of high food insecurity. As the map demonstrates, food insecurity rates, intensity and severity are mainly concentrated more in the northern and southern parts of Bangladesh. For example, the Barisal and Chandpur districts in the southern part are found be the most vulnerable (FIP > 59%). Likewise, the districts under the recently announced Mymensingh division in the northern part of Bangladesh are found to be vulnerable in terms of all indicators. However, north-western part, north eastern part except Sylhet district, hill tracts area in south-eastern part except Bandarban district are found to be less vulnerable. The least vulnerable districts are Kushtia, Panchagarh, Meherpur, Thakurgaon, Noakhali, Khagrachhari, Nilphamari and Jhenaidah (FIP < 21%). Moreover, similar patterns are observed for the intensity and severity of food insecurity. The actual district-specific food insecurity estimates of all the indicators along with their percentage CVs and and 95 confidence intervals generated by the Direct and EBP methods are given in S1-S3 Appendices. The results reported in S1-S3 Appendices and maps in Fig 4 clearly show the degree of inequality with respect to distribution of food insecurity among the districts of Bangladesh. In particular, the maps show unequal distribution for all the three food insecurity indicators at district level in Bangladesh.
It is worth noting that the poverty indicators are calculated based on the Cost of Basic Needs (CBN) method in poverty mapping analysis, where the poverty line is calculated based on the household consumption expenditure for basic food and non-food items. The poverty line based on the cost for both items is referred to as upper poverty line, while the poverty line based on only the cost for the basket of basic food items is referred as lower poverty line. The quantities in the food basket are scaled according to the nutritional requirement of 2122 kcal per person per day, which is the food-insecurity line in this study. While the abovementioned upper poverty line is mainly used in the poverty mapping study. Consequently, the estimated food-insecurity prevalence based on the DCI method are not directly comparable with those head-count-rates estimated based on the CBN method. However, a relationship among these two types of indicators at district-level can provide a deeper insight to the policy makers. The most recent poverty mapping study based on the HIES 2010 and Census 2011 data indicates that the lowest (3.6%) and the highest (63.7%) HCRs were estimated for Kushtia (belongs to Khulna division) and Kurigram districts (belongs to Rangpur division) respectively (please see Table 5 of WB, BBS and WFP [15]). While this study based on food-intake measure indicates that estimated FIPs are 8.0% and 30.0% respectively for these two districts. For the capital Dhaka district, the estimated HCR was estimated as 15.7% while the FIP is estimated as 43.0% in this study. Similar opposite pattern is also observed for Rangpur (HCR: 42.0% and FIP:

Concluding remarks
Food insecurity maps are crucial for the allocation of funds by the governments and international organizations. Despite the importance, the local level food insecurity estimates in Bangladesh is lacking. The very latest available research conducted by the BBS in 2004, though as part of a poverty study, is now obsolete in terms of its use-effectiveness. To bridge this gap, this study aimed to estimate food insecurity prevalence at the district level in Bangladesh by using the latest available HIES 2010 dataset and the Census 2011. The reliable local level food insecurity indicators are estimated using the EBP method and as expected, the EBP estimates are found more reliable than direct estimates. For most of the districts, the reduction in CV is quite evident and the gains in efficiency of the EBP method tend to be larger for districts with smaller sample sizes. Finally, the generated district level cartograms of food insecurity prevalence, gap and severity indicates food insecurity indicators are mainly concentrated in the north and south areas of Bangladesh. In general, the degree of inequality with respect to the distribution of food insecure households among districts is quite high. Hence, the maps of this study help to show the districts with a relatively higher concentration of the food insecure people, which ultimately help the government, international organizations and policymakers for fund allocation and effective regional planning. It is worth noting that the empirical performance of EBP and ELL methods cannot be directly compered because of differences in methodological settings of two methods. The empirical results may be compared by applying the EBP and the ELL based on a three-level model to accommodate both types of variation (cluster-specific and area-specific). However, it is tough to estimate both cluster-specific and area-specific consistent variance component simultaneously from the HIES 2010 or earlier data of Bangladesh due to insufficient number of clusters per district [24]. Further, when the target domains are at the very detailed level the ELL method is preferred to EBP due to computational simplicity of the ELL method, while if survey data contains information for most of the target domains, the EBP method can be preferred to the ELL method. Therefore, a comparative study can be done as a future research by implementing both the ELL and EBP method to the recent HIES 2016 [25] data (not yet fully available for the researcher) which covers many sub-districts, for estimating both district and sub-district level estimates of food insecurity in Bangladesh.