Bayesian Spatial Semi-Parametric Modeling of HIV Variation in Kenya

Spatial statistics has seen rapid application in many fields, especially epidemiology and public health. Many studies, nonetheless, make limited use of the geographical location information and also usually assume that the covariates, which are related to the response variable, have linear effects. We develop a Bayesian semi-parametric regression model for HIV prevalence data. Model estimation and inference is based on fully Bayesian approach via Markov Chain Monte Carlo (McMC). The model is applied to HIV prevalence data among men in Kenya, derived from the Kenya AIDS indicator survey, with n = 3,662. Past studies have concluded that HIV infection has a nonlinear association with age. In this study a smooth function based on penalized regression splines is used to estimate this nonlinear effect. Other covariates were assumed to have a linear effect. Spatial references to the counties were modeled as both structured and unstructured spatial effects. We observe that circumcision reduces the risk of HIV infection. The results also indicate that men in the urban areas were more likely to be infected by HIV as compared to their rural counterpart. Men with higher education had the lowest risk of HIV infection. A nonlinear relationship between HIV infection and age was established. Risk of HIV infection increases with age up to the age of 40 then declines with increase in age. Men who had STI in the last 12 months were more likely to be infected with HIV. Also men who had ever used a condom were found to have higher likelihood to be infected by HIV. A significant spatial variation of HIV infection in Kenya was also established. The study shows the practicality and flexibility of Bayesian semi-parametric regression model in analyzing epidemiological data.


Introduction
Globally, people living with HIV were estimated to be 35.3 million in 2012 with 2.2 million new infections. It is estimated that over two thirds of all persons living with HIV are in sub-Saharan Africa [1]. Furthermore, this region also has the highest prevalence and HIV infection incidence in the entire world [1].
Monitoring the prevalence of HIV in a country using national averages is important, especially for assessing the HIV trend at national level over time and for comparison purposes between countries. However, the approaches can facade HIV prevalence variability among administration units in the country [11]. Subsequently, policies and interventions developed from these approaches might have no effect at the grassroots level.
In Kenya, HIV prevalence in 2007 was estimated to be approximately at 7.1% among adults aged 15-64 years [13]. HIV prevalence varies considerably between sub-regions in the country. Counties like Bungoma, Embu, Isiolo, Kajiado, Machakos, Meru, Wajir and West Pokot having a prevalence of less than 3% while counties close to the Lake Victoria region, in the West having prevalence ranging between 13% and 25% [13]. The dissimilarities in these regions may be explained by cultural factors, demographic factors and socioeconomic factors [14][15].
A proper understanding of the spatial component of HIV among the administrative units in a country is key to structuring, developing and implementing apposite strategies that will have an effect on people at the local administrative units [16].
HIV prevalence in Kenya also varies significantly by sex and age. HIV prevalence among women in Kenya is 6.9% and 4.4% among men [12]. HIV prevalence by age has a nonlinear relationship, assuming an inverted U-shape [14,17]. In particular, HIV prevalence is low among people below the age of 18, the prevalence increases up to the age of 35-40 then starts declining with increase in age.
Several studies on spatial analysis of HIV prevalence at lower administration units in Kenya have been based on proportions [17,18]. This prevalence estimates and corresponding maps are prone to instability, hence unreliable due to small sample sizes from these administrative units. Some studies have applied spatial smoothing techniques to circumnavigate this sparse data problem [19]. However, these studies have inherently assumed that all covariates in the study have a linear relationship with the response variable. This linear relationship might not be valid for some variables. As noted earlier, HIV prevalence has a nonlinear relationship with age of an individual.
A primary objective of this study is to develop and apply flexible models to capture this nonlinear nature of some covariates while still accounting for the spatial heterogeneity. The study proposes a spatial semi-parametric model based on penalized regression spline to model HIV prevalence data among men, extracted from the Kenya AIDS Indicator Survey (2007).

Data
The Kenya AIDS Indicator Survey (KAIS) was carried out by the Government of Kenya with financial support from United States President's Emergency Plan for AIDS Relief (PEPFAR) and United Nations (UN). The key objective of survey was to collect high quality data on the prevalence of HIV and Sexually Transmitted Infections (STI) among adults, and to assess knowledge of HIV and STI in the populations.
The survey collected a representative sample of households selected from the eight provinces in the country. It involved men and women in the age of 15-64 years. Two questionnaires were used in the survey. A household questionnaire which was used to collected information about the household head and the characteristics of the dwelling place. The second one, the individual questionnaire collected information from men and women aged 15-64 years, about their demographic characteristics, and their knowledge on HIV and STI. Each individual was then asked for consent to provide a venous blood sample for HIV and HSV-2 testing. Readers are referred to the final survey report regarding survey methodologies used in collecting the data [13]. Even though a new round of KAIS, 2012 [12] has been done, this study uses the 2007 data since the final release of this new study has not been made hence the data was not availlable for use. In this study we use the men's data from this survey. In total, 3,662 men who provided venous blood for testing and also had full covariate information were used in the analysis. In the data, age was captured as both categorical and continous while all the other covariates were continuous. Table 1 gives an initial exploratory analysis used to identify variables that are significantly associated with HIV infection. The variables were categorised in to four groups, namely, demographic, social, biological and behavioural.
From this initial analysis, the following variables were found to be associated with HIV infection and were included in subsequent analyses: place of residence, age, education level, marital status, age at first sex, perceived risk of HIV, circumcision status, if had STI in the last 12 months and if ever used a condom. Further from this initial analysis, it was evident that age had a nonlinear effect on HIV infection, hence its continous form (mean = 33.70, SD = 13. 45) was used in the subsequent analyses.

Ethics Statement
Ethical clearance was granted by the institutional review board of the Kenya Medical Research Institute (KEMRI) and the US Centers for Disease Control and Prevention. The consent procedure, highlighted below, was approved by these two bodies.
Participants provided separate informed oral consent for interviews, blood draws and blood storage and, the interviewer signed the consent form to indicate whether or not consent was given for each part. An oral informed consent was given for participants in the age of 18-64 while for minors, in the age group 15-17, oral informed consent was obtained from a parent/ guardian or other adult responsible for the youth's health and welfare before the youth was asked for his/her consent. Only after the parent or guardian had agreed, was when the consent was asked of the adolescent.
Investigators in the study got a waiver of documentation of informed consent for all participants due to the fact that the research presented very minimal risk of harm to the individuals. The waiver did not adversely affect the rights and welfare of the participants, and the survey involved no procedures for which written consent is normally required outside the research context in Kenya.

Statistical model
Let y ij be the HIV status of individual i in county j. y ij~1 if individual i in county j is HIV positive and zero otherwise.
The vector x ij~( x ij1 ,x ij2 Á Á Á x ijp )' contains p continuous independent random variables and w ij~( w ij1 ,w ij2 Á Á Á w ijr )' contains rcategorical independent random variables with first component accounting for the intercept. In this study, p~2 and r~8.
This study assumes that the dependent variable y ij is Bernoulli distributed, i.e y ij jr ij~B ernoulli(r ij ) with an unknown E(y ij )~r ij , being related to the independent variables as follows: In this equation, g(:) is a logit link function, b is a p dimensional vector of regression coefficients for the continuous independent variables, and c is a rdimensional vector of regression coefficients for the categorical independent variables.
In order to cater for both the nonlinear effects of the continuous covariates and the spatial autocorrelation in the data, a semiparametric model utilizing the penalized regression spline approach and convolution model was employed.
The penalized regression spline approach relaxed the highly restrictive linear predictor by a more flexible semi-parametric predictor, defined as: Here, f t (:) is a nonlinear twice differentiable smooth function for the continuous covariates and f spat (s j ) is a factor that caters for the spatial effects of each county. This study considers a convolution approach to these spatial effects, which assumes that the spatial effect can be decomposed in to two pure components, spatially structured and spatially unstructured, i.e f spat (s j )~f str (s j )z f unstr (s j ).
The final model can be expressed as:

Estimation of parameters
A full Bayesian approach in estimation was used in this study. Prior distributions were assigned to all the parameters as discussed in the following subsections.

Smooth function
There exists a myriad of methods for estimating the smooth functionsf t (:). Several authors have given a thorough review of these methods [20][21]. Of particular importance to this current study is the use of penalized regression splines, which was proposed by Eilers and Marx [22]. In this method, the assumption is that the effect of the continuous covariates can be approximated by a polynomial spline. In particular, they assume that the smooth function f t (:) can be approximated by a spline of degree l with K equally spaced knots, x p, min~yp1 vy p2 Á Á Á y pk{1 vy pK~xp, max , yielding, is positive and zero otherwise. Following Caroll and Rupert [23], this study considers 20 knots to ensure enough flexibility and takes the k th knot to be defined as the sample quantile of the continuous independent variable obtained with a probability equal to k Kz1 . Furthermore, this study utilizes a quadratic spline, hence degree l~2.
To avoid getting a smooth function which ''wiggles'' too much, a roughness penalty, suggested by Green and Silverman [24], given by is imposed in the log-likelihood yielding the penalized loglikelihood function given by The balance between flexibility and smoothness is dictated by the positive parameter l.

Spatial components priors
For the prior distribution of the spatially structured effects, this study uses the nearest neighbour Gaussian Markov random field which is specified as follows: Here, N (i) and d i are, respectively, the set and number of neighbours of region i. The neighbourhood can be defined in terms of whether two regions share a border or not. If the two regions share a border then they are neighbours, otherwise they are not. This leads to the intrinsic conditional autoregressive (ICAR) prior distribution [25,26].
The unstructured spatial effect were assumed to have a Gaussian prior distribution, that is f unstr (s i )Dt 2 unstr~N (0,t 2 unstr ). In addition, the variance hyperparameters, t 2 str and t 2 unstr were assigned inverse gamma prior distributions as follows:

Posterior distribution
Posterior distribution refers to the distribution of the parameters after observing the data. It is obtained by updating the prior distribution with observed data. A full Bayesian inference gets its estimates by sampling from this posterior distribution. In reality, the posterior distribution is usually of high dimension and analytically intractable. This is aggravated by the heavy integration required when performing analytical methods. Markov chain Monte Carlo (McMC) methods is a class of techniques used to overcome this problem. It allows for direct sampling from this posterior distribution repeatedly and estimates are calculated from these samples using simple data summaries such as mean and median.
Assuming conditional independence, the posterior distribution of the parameters for the Bernoulli model is given by All the analyses in this study were carried out using WinBUGS 14 [27]. For each model, 40,000 Markov chain Monte Carlo (McMC) iterations were ran, with the initial 10,000 discarded to cater for the burn-in period and thereafter keeping every tenth sample value. The 3,000 iterations left were used for assessing convergence of the McMC and parameter estimation. We assessed McMC convergence of all models parameters by checking trace plots and autocorrelation plots of the McMC output [28]. Text S1 presents the WinBUGS code used during the analysis.
Model diagnostics. The models were compared using the Deviance Information Criterion (DIC) as suggested by Spiegelhalter et al. [29]. The best fitting model is one with the smallest DIC. DIC value is given by DIC~D(h)zpD, in which D is the posterior mean of the deviance that measures the goodness of fit, and pD gives the effective number of parameters in the model which penalizes for complexity of the model. In this criterion, low vales of D indicate a better fit while small values of pD indicate a parsimonious model.
When comparing two models, how big the difference between the DIC values of the two models need to be so as to declare that one model is better than the other is not clear cut. However, several authors have stated that a difference in DIC of 3 between two models cannot be distinguished while a difference between 3 and 7 can be weakly differentiated [26,29].

Application/Data Analysis
The following set of models were investigated in order to understand the effect of the observed covariates and unobserved effects on the distribution of HIV in Kenya based on the male data: Model M 1 is a model of fixed categorical covariate and one continuous covariate, age, modelled with a non-linear smooth function and linear effects of the categorical covariates, place of residence, education level, marital status, age at first sex, perceived risk of HIV, circumcision status, if had STI in the last 12 months and if ever used a condom. The choice of modelling age with nonlinear smoothing prior is supported by results from [14,17]. This model does not take into account the spatially structured and spatially unstructured random effects.
Model M 2 is an additive model that assumes a nonlinear function of the covariate age, linear effects of categorical covariates (listed in model 1 above) and spatially unstructured random effects which cover the unobserved covariates that are inherent within the counties.
The third model, M 3 , examines the effect of nonlinear covariate age, linear effects of categorical covariates and spatially structured random effect which accounts for any unobserved covariates that vary spatially among the counties.
The final model M 4 examines the effects of nonlinear effects of age, linear effects of categorical covariates and a convolution of spatially structured and spatially unstructured random effects. Table 2 summarizes the structure of nesting for the four models, under study, in order of complexity. The adjusted odds ratios,  were reported, for these multivariate models. The adjustment is to cater for confounding between the variables.  Table 4 gives the posterior estimates of the Odds ratios (OR) and their corresponding 95% credible intervals (CI) for the categorical covariates that were assumed to have a linear effect of HIV, based on the best fitting model 4. The following covariates were found to be significantly associated with HIV infection in this model: place of residence, education level, circumcision status, if the individual had an STI in the last 12 months and if the man has ever used a condom.

Fixed effects
HIV infection is negatively related to education. The likelihood of HIV infection was higher for those men with primary education as compared to those with no education, albeit, this was not significant as indicated by the odds ratio and its corresponding credible interval (OR: 1.25, 95% CI: 0.83 to 1.79). The chance of HIV infection was lower for men with secondary education compared to those with no education, although this difference was not significant (OR: 0.73, 95% CI: 0.45 to 1.10). The likelihood of HIV infection was lowest in men with higher education (OR: 0.46, 95% CI: 0.23 to 0.82). Circumcision was found to be significantly associated with HIV infection. The odds of an uncircumcised man Nonlinear effects of age Figure 1 shows the nonlinear association between of age of an individual and the HIV infection. The figure gives the posterior mean of the smooth function and its corresponding 95% CI. It is evident from the figure that the effect of age on HIV infection is nonlinear, and an assumption of a linear effect would have led to specious results and subsequently wrong interpretations. The likelihood of HIV infection is seen to increase with age up to an optimum age of approximately 40 years then starts declining with increase with age.

Spatial effects
We examined the spatial effects based on the best fitting model M 4 . The spatial effect with the corresponding 95% CI is given in Figure 2. From the figure, counties with light grey colour show low association with HIV infection while black shaded counties have high association with HIV infection prevalence. There is clear evidence of spatial variation of HIV prevalence. From the maps, counties in the Western part, around Lake Victoria region, show high spatial variation. Very high HIV prevalence (.10%) was recorded in the following counties: Homabay, Migori, Siaya, Turkana, and Kisumu. Figure 3 gives a box-plot of HIV prevalence by counties (the counties are numbered from 1 to 46, see Table 5 for corresponding county names) while Table 5 gives the exact estimates.
The structured spatial effects are dominant over the spatially unstructured random effects as shown by the ratio of variance components, calculated by Y str~t 2 str t 2 str zt 2 unstr~7 4:07% (Table 4).

Discussion
This study utilizes Bayesian techniques to analyze regional variation and risk factors of HIV infection. The study develops and uses Bayesian semi-parametric regression models to help assess factors associated with HIV infection. We build on existing contributions by Eilers and Marx [22] and Caroll and Rupert [23]. The models used in this study are based on the penalized regression splines, in a semi-parametric modeling paradigm, with allowance for spatial variation in the response variable. Many situations arise where the relationship between the function of the response variable and covariates is not pigeonholed linearly [21]. Semi-parametric models are a class of models which borrows strength from both parametric and non-parametric models. Subsequently, it can be viewed as enriching standard parametric models by exploring the non-parametric domain while still keeping intact the linear structure [30]. Furthermore, the regularly encountered assumption of fully parametric linear predictor for evaluating the effect of covariates on the dependent variable is usually restrictive, unrealistic and can lead to specious conclusions. The spatially structured effects in the model are modeled using a Gaussian Markov Random Field (GMRF) while the spatially unstructured random effects are modeled using a zero mean Gaussian process [25,26]. Due to the high complexity and intractability owing to this intricacy, the maximum likelihood approaches are not viable in the estimation of these models. Consequently, the Bayesian inference, utilizing McMC techniques is highly favoured.  In this study, we found that the male circumcision reduces the risk of HIV infection, precisely, uncircumcised men were more likely to be HIV positive than circumcised men. This finding is supported by previous studies and therefore adds to the large body of research indicating that circumcision lowers the risk of HIV infection [4][5][6]. The Ministry of Health in Kenya currently runs an aggressive Voluntary Medical Male Circumcision (VMMC) in the country [31][32]. With respect to this finding, there will be hope for decline in HIV prevalence due to this campaign.
Place of residence proved to have a significant relationship to HIV infection among men when controlled for other covariates. Men in urban areas were more likely to be HIV positive as compared to men in the rural areas. The effect of place of residence on HIV infection has been reported in many studies but with mixed conclusions [14,33]. This finding could be used to inform tailormade HIV campaign strategies depending on place of residence.
Another finding from this study is that the likelihood of HIV infection was lowest among men with higher education. Other studies have also shown similar result [14]. The government's introduction of free primary education and subsidized secondary education is hoped to increase the number of young people attaining higher level of education [34]. This is envisaged to increase awareness of this pandemic and further reduce its spread.
It was also found in this study that men who had an STI in the last 12 months were more likely to be HIV positive. In terms of condom usage, men who have ever used a condom were more likely to be infected by HIV. This is an unexpected finding. This finding may be attributed to two possible reasons. Firstly, some men who use condom, with their spouses, will later on end up stopping to use a condom by assuming that they ''know'' one another properly and trust each other. Subsequently having unprotected sex. Secondly, the way this question was captured in the study, wasn't useful. A better way was to capture consistent use of condom. If this could have been captured then, maybe, opposite and expected results could have been realized.
The nonlinear effect of age on HIV infection was evident from the analysis. The relationship between HIV infection and age has an inverted ''U'' shape. The likelihood of HIV infection increase with age and reaches a maximum at the age of around 40, then takes a nosedive. This result is in tandem with other bodies of research [14,17].
Spatial effects in the model acts as surrogate of the unobserved variables. Identification of high prevalence areas can provide insights for designing intervention and campaign programmes that are tailor-made to those regions, hence increasing the impact of the initiative. There was evidence of spatial variation of HIV infection among the counties, with highest prevalence rates being reported on the Western part of the country, around Lake Victoria region.
Vis-a-vis classical regression modeling frameworks, this study uses an approach that allows for flexible and realistic models to be implemented, thereby making it possible to establish and test epidemiological hypotheses. Availability of freely available software like R and WinBUGS makes implementation of these complex models easier and accessible to practitioners in medical, environmental and any other relevant field.
A major limitation of our analysis is that the data used for county estimation was collected when the country was still based on the old administrative units (provinces), the data was not powered to carry out estimation at these new administrative units. The study rides on the advantage that these new administrative units called counties were formed by combining several districts together. This made it easy for the county where an individual belongs to be allocated easily since each district belongs to only one county. Also, in the data, the way some variables were captured was not usefull; ever using a condom should be replaced with consistent use of condom. In terms of methodology, the knots used in the penalized spline regression were assumed to be fixed and were calculated as quantiles from the continuous variable age. A more flexible analysis can allow the knots to be data driven [35].  Despite the limitations highlighted above, the models introduced in this study can be replicated in other countries with similar data.

Supporting Information
Text S1 WinBUGS code used in the analysis. (DOCX)