Predicting local malaria exposure using a Lasso-based two-level cross validation algorithm

Recent studies have highlighted the importance of local environmental factors to determine the fine-scale heterogeneity of malaria transmission and exposure to the vector. In this work, we compare a classical GLM model with backward selection with different versions of an automatic LASSO-based algorithm with 2-level cross-validation aiming to build a predictive model of the space and time dependent individual exposure to the malaria vector, using entomological and environmental data from a cohort study in Benin. Although the GLM can outperform the LASSO model with appropriate engineering, the best model in terms of predictive power was found to be the LASSO-based model. Our approach can be adapted to different topics and may therefore be helpful to address prediction issues in other health sciences domains.


Introduction
Malaria is endemic and remains a major cause of mortality especially for children under the age of five years in sub-Saharan Africa [1]. Assessment of malaria burden is critical for the evaluation of control measures. The correct definition of "unexposed" individuals (not in contact with the malaria vector and then also not with the parasite) is important for the interpretation of results, since it may help in distinguishing protection (i.e immune individuals) from lack of exposure [2]. A precise characterization of exposure could mitigate classification error and facilitate clinical trial and cohort study designs. The exposure to the malaria vector (the Anopheles mosquito) is space and time dependent in endemic areas and highly related to the rainy Season. local-scale heterogeneity of malaria transmission and exposure to the vector bite [3][4][5][6]. A classical entomological indicator used to characterize the human exposure to the malaria vector is the human biting rate (hbr), which is the number of anopheles bites per man per time unit. In a previous work, we have built a predictive model to estimate the individual hbr in a population of Beninese children by using entomological and environmental data from a cohort study carried out between 2007 and 2010. Variable selection in statistical models is a highly complex and vast research area with a huge literature [7][8][9][10][11][12][13][14][15]. In health sciences, regression models are commonly used, but classical variable selection methods (backward, forward selection. . .) show limits (non convergence, collinearity. . .) as the ratio variables/observations increases. In particular, taking into account all the interactions terms in GLM with backward selection is often impossible in practice, although it can be useful to improve the prediction power. In our previous work, the selection of the variables introduced in the General Linear Model regression (GLM) was perfomed with a backward procedure and only few interactions terms could have been entered in the model based on an empirical expertise [3]. Machine learning is a growing field of research, particularly adapted for prediction problems in high dimension, and constitutes then an appealing approach to overcome this issue. Several recent studies in biology, epidemiology and medicine have actually shown that the predictive performance of classical methods can be improved by implementing machine learning methods, for example [16], [17][18][19], [20][21][22]. The present work aims to revisit the empirical algorithm and to propose an automatic machine learning method combining GLM-Lasso and a stratified two-levels cross validation in order to select the best subset of predictors. The Lasso method proposed by Tibshirani [7] is a regularized estimation approach for regression model using an L1-norm and constraining the regression coefficients, which simultaneously performs selection and estimation, and is robust for variables selection in high dimension [8,23]. The algorithms implemented in our work are based on [8,23,24]. The predictive performances of the automatic LASSO-based method and the reference method are evaluated and compared to each other.

Materials
In this section, we briefly recall the description of the study area, the mosquito collection and related variables. For more details, see [3].
Study area. The study was conducted in the district of Tori-Bossito (Republic of Benin), from July 2007 to July 2009. Tori-Bossito is on the coastal plain of Southern Benin, 40 kilometers north-east of Cotonou. This area has a subtropical climate and during the study, the rainy Season lasted from May to October. Average monthly temperatures varied between 27˚C and 31˚C. The original equatorial forest has been cleared and the vegetation is characterized by bushes with sparse trees, a few oil palm plantations, and farms. The study area contained nine villages (Avamé centre, Gbédjougo, Houngo, Anavié, Dohinoko, Gbétaga, Tori Cada Centre, Zébè, and Zoungoudo). Tori Bossito was recently classified as mesoendemic with a clinical malaria incidence of about 1.5 episodes per child per year [25]. Pyrethroid-resistant malaria vectors are present [26].
Mosquito collection and identfication. Entomological surveys based on human landing catches (HLC) were performed in the nine villages every six weeks for two years (July 2007 to July 2009). Mosquitoes were collected at four catch houses in each village over three successive nights (four indoors and four outdoors, i.e. a total of 216 nights every six weeks in the nine villages). Five catch sites had to be changed in the course of the study (2 in Gbedjougo, 1 in Avamè, 1 in Cada, 1 in Dohinoko) and a total of 19 data collections was performed in the field from July 2007 to July 2009. In total, data from 41 catch sites are available. Each collector caught of predictional mosquitoes landing on the lower legs and feet between 10 pm and 6 am. All mosquitoes were held in bags labeled with the time of collection. The following morning, mosquitoes were identified on the basis of morphological criteria [27,28]. All Anopheles gambiae complex and Anopheles funestus mosquitoes were stored in individual tube with silica gel and preserved at 220˚C. Plasmodium falciparum infection rates were then determined on the head and thorax of individual anopheline specimens by CSP-ELISA [29].
Environnement and behavioral data. Original variables: Rainfall was recorded twice a day with a pluviometer in each village. In and around each catch site, the following information was systematically collected: (1) type of soil (dry lateritic or humid hydromorphic)assessed using a soil map of the area (map IGN Benin at 1/200 000 e, sheets NB-31-XIV and NB-31-XV, 1968) that was georeferenced and input into a GIS; (2) presence of areas where building constructions are ongoing with tools or holes representing potential breeding habitats for anopheles; (3) presence of abandoned objects (or ustensils) susceptible to be used as oviposition sites for female mosquitoes; (4) a watercourse nearby; (5) number of windows and doors; (6) type of roof (straw or metal); (7) number of inhabitants; (8) ownership of a bed-net or (9) insect repellent; and (10) normalized difference vegetation index (NDVI) which was estimated for 100 meters around the catch site with a SPOT 5 High Resolution (10 m colors) satellite image (Image Spot5, CNES, 2003, distribution SpotImage S.A) with assessment of the chlorophyll density of each pixel of the image S1 Database. Due to logistical problems, rainfall measurements are only available after the second entomological survey. Consequently, we excluded the first and second survey (performed in July and August 2007 respectively) from the statistical analyses.
Recoded variables: Some pre-treatments based on the knowledge of experts in entomology and medicine are operated on some original variables. These pre-treatments generate a second type of covariables called recoded variables. The dependent variable was the number of Anopheles collected in a house over the three nights of each catch and the explanatory variables were the environmental factors, i.e. the mean rainfall between two catches (classified according to quartile), the number of rainy days in the ten days before the catch (3 classes [0-1], [2][3][4], >4 days), the Season during which the catch was carried out (4 classes: end of the dry Season from February to April; beginning of the rainy Season from May to July; end of the rainy Season from August to October; beginning of the dry Season from November to January), the type of soil 100 meters around the house (dry or humid), the presence of constructions within 100 meters of the house (yes/no), the presence of abandoned tools within 100 meters of the house (yes/no), the presence of a watercourse within 500 meters of the house (yes/no), NDVI 100 meters around the house (classified according to quartile), the type of roof (straw or Sheet metal), the number of windows (classified according to quartile), the ownership of bed nets (yes/no), the use of insect repellent (yes/no), and the number of inhabitants in the house (classified according to quartile).
The Original and the recoded variables are described in Tables 1 and 2. Two groups of covariables set are used: the first group (Group 1), the original covariables with all covariables obtained by interactions, the second group (Group 2), the recoded covariables with all covariables obtained by interactions.

Ethics.
A written informed consent was obtained from all participants involved in the study. The study protocol was approved by the Ethics Committee of the University of Abomey-Calavi (Faculté des Sciences de la Santé; FSS) in Benin and the Consultative Committee of Ethics of Institute of Development Research (IRD). Methods. The main objective is to predict the number of Anopheles Y using the environmental factors X.
For doing this, statistical analysis are conducted in three steps.
Step 1. First, the variables selection is performed using GLM-lasso method through a cross validation. For this part, we have implemented an automatic algorithm Leave One Level Out Double Cross-Validation (LOLO-DCV) 0.1. This algorithm developed in this work is a stratified cross validation with two levels [30,31]. ii. The two regularizing parameters λ.min k and λ.1se k are obtained.
iii. The coefficients of active variables i.e variables with non-zero coefficients associated to these two parameters are debiased iv. Predictions are performed using a GLM model on E k v. The presence PðX i Þ of each variable is determined using λ.min k and λ.1se k on A k 3. The step 2c is repeated until predictions are performed for all observations.
The second level allows to avoid over-fitting in learning stage in the process of variables selection because the number of observations is lower. Its aim is to compute a second cross validation (CV 2 ) for prediction at each step of learning of a first cross validation (CV 1 ). The GLMM-Lasso method of variables selection is based on the calculation of the coefficients of the variables defined as:b , n is the number of observations, p is the number of covariables, β is a (p +1)-vector of fixed parameters including the intercept, Y is the vector of the target variable, L GLM the likelihood of the model, λ is the regularizing parameter, The choice of the regularizing parameter lambda is done by minimizing a score function based on the deviance. In practice, Eq (2) is solved using a combination of Laplace approximation, Newton-Raphson method or Fisher scoring method. The deviance can be defined as: where LðMðbÞÞ the log-likelihood of the model MðbÞ, MðsatÞ is the "saturated" model and MðbÞ is the model of Poisson regression. The selection of the best subset of variables is done according to two strategies, LDLM (Lolo Dcv Lambda Min) and LDLS (Lolo Dcv Lambda 1Se). The strategy LDLM is based on the regularizing parameter defined as: The strategy LDLS is based on the value λ.1se defined by T. Hastie et al which minimizes the deviance plus its standard deviation [23,32,33]: The best subset of variables is selected as follows. Let V ¼ fV 1 ; V 2 ; . . . ; V N var g be the set of all variables including interactions, N var the number of variables. If N f is the number of folds, at each first level k, 1 k N f , the second level of cross validation provides two vectors β(λ.min k ) and β(λ.1se k ) of coefficients of variables using λ.min k and λ.1se k Eqs (4) and (5) respectively. Based on this, one can determine the presence or the absence of each covariable. For any λ, let define the function "Presence" of variable like: where β r (λ), 1 r N var is a vector of coefficients of covariables V r and Θ the null vector. For a threshold s, 1 s 100, the subset of selected variables (SV) is defined as: We also studied the influence of the variability of the threshold s on the quality of the prediction. Then we compared the predictive performance of the model for s taken in {75, 80, 90, 95, 100}. At the end of this step the strategies LDLM and LDLS provides two optimal subset of variables SV LDLM , and SV LDLS which are used to build a GLM predictive model.
Step 2. The predictive performance of the models described above are compared to each other and to the reference B-GLM model The comparison criteria are: 1. The mean of predictions 2. The quadratic risk of predictions

Summary of results on prediction accuracy and quality criteria with LOLO-DCV
The Tables 3 and 4 present the comparison of the performance of the three models B-GLM, LDLM, and LDLS models in terms of quadratic and absolute risks. When selection and prediction are performed using the recoded variables, the reference B-GLM model is the best regarding the indicators of performance for any threshold. On the other hand, when selection and prediction are performed using the original variables, LDLM and LDLS are superior to B-GLM but only with a 100% threshold.

Optimal subset of variables for prediction
Tables 5, 6, and 7 show that both of the strategies LDLM and LDLS provide a sparse optimal subset for original variables.
The best subset of variables selected for each group of covariable is:

B-GLM
For B-GLM [3], the best subset of covariables is Season, Number of rainy during the mission, Mean rainfall, Rainy days before mission, Repellent, Vegetation, the interaction between Season and Vegetation.

LOLO-DCV (LDLM and LDLS)
Based on the results of Fig 1 and the Tables 3, 4, and 5, the best covariables for the Group 1 and Group 2 is: Season; interaction between Mean rainfall and openings; interaction between Rainy days before mission and Nbr of inhabitants; interaction between Rainy days during the mission and Vegetation Figs 1 and 2, show how the number of selected variables is reduced as the threshold increases. The fact that the best model is obtained for a threshold equal to 100% (LDLM strategy) shows that the best prediction power is reached when unstable variables are removed from the final subset. Fig 3, show the number of mosquitoes collected at every mosquito collection mission at 4 collection sites (given as example of the 41 collection sites) predicted by the B-GLM model, predicted by the LDLS model, and the observations (the number of mosquitoes actually caught). The LDLS model predictions were better than the B-GLM model ones, and the predictive curve from LDLS is often able to mimic the observations curves in a very satisfactory way. This has been found for the highest majority of the 41 collection sites.

Discussion and conclusion
The main objective of this work was to propose an automatic algorithm (LOLO-DCV) based on a machine learning approach for variables selection and prediction of the malaria exposure from data of a cohort study carried out in Benin. This algorithm has been performed using both the original variables and then the variables recoded based on the expert knowledge of the topic, and its prediction power has been compared to an empirical algorithm previously used on the same data [3]. This automatic algorithm has shown a substantial improvement in terms of predictive power compared to the empirical algorithm. Our LOLO-DCV algorithm has several advantages on the reference empirical variable selection method (B-GLM). First, being based on the LASSO method, the high ratio variables/ observations is no longer an issue and all the variables can be entered together in the model, including all their second order interactions (automatically generated). This avoids the subjective part of the empirical analysis where a pre-selection based on the field expertise is needed to limit the variables/observations ratio. Second, the algorithm is automatically performed in a reasonable CPU-time (on our data), while the empirical algorithm would require much more time manually. Third, the second level of cross-validation makes this method more robust (and then more generalizable) than the empirical algorithm. Fourth, and most important, LOLO-DCV succeeded to improve the prediction performance of the empirical model, which is of course the ultimate goal. We observe that the global performance criteria as well as the local predictive power at the house (collection site) level are substantially improved compared to the empirical algorithm. In particular, LOLO-DCV algorithm was able to improve two important drawbacks that were observed for prediction at the house level by the reference method: (i) the extreme values were hardly reached by the B-GLM predictions and are much better predicted by LOLO-DCV and (ii) LOLO-DCV succeeded for most of the houses to mimic the exact shape of the observations curves, whereas the B-GLM only succeeded to approximate this shape. Overall, all these improvement make LOLO-DCV algorithm a superior alternative to the B-GLM method. Many other machine learning methods exist, for example random forest, boosting regression etc, [34] [35][36][37][38]. But a drawback of these alternative methods is that they do not lead to easily interpretable results [16,37,38], [39]. The interpretation of the results given by the LOLO-DCV method is the same as those from a classical regression model, thereby much easier to understand by the malaria experts than the results from other methods. In particular, the subset of variables and interactions selected by LOLO-DCV is consistent. As expected the rainfall and Season variables are of highest importance, which is relevant. However, we cannot ensure that our LOLO-DCV algorithm guaranties the best predictive performance, and maybe other approaches would even give better results. This is a limitation of our work and other experiments may be condutcted to explore this matter.
In our work, original variables have shown better results than recoded variables. It may be due to the fact that in our case, recoding was to categorize quantitative variables, which allows to interpret the results more easily, but is known to reduce the variability (and then the information) of the variables. However, it may not be a general result, and we do not recommend avoiding systematically recoded variables.
In conclusion, this work has confirmed the value of using a machine learning approach to address the important health science problem of predicting the individual malaria exposure in a cohort study. Such approach can be helpful to improve the predictive performance of the classical methods and to overcome their limits. Our Lasso-based LOLO-DCV algorithm has clearly shown a substantial improvement compared to the reference method, giving robust and easy-to-interpret results by non-statisticians or machine learning specialists. We think LOLO-DCV can then be recommended to predict any count outcome from a dataset of several dozen of variables and hundreds of observations, which is an average dataset dimension in this study area. For all these reasons the authors plan to build an easy-to-use R package and recommend the use of LOLO-DCV in prediction problem in health science.
Supporting information S1 Database. S1_Database.xls. This database contains all original environnement and behavioral variables used in this work. (XLS)