Sex ratio at birth in Vietnam among six subnational regions during 1980–2050, estimation and probabilistic projection using a Bayesian hierarchical time series model with 2.9 million birth records

The sex ratio at birth (SRB, i.e., the ratio of male to female births) in Vietnam has been imbalanced since the 2000s. Previous studies have revealed a rapid increase in the SRB over the past 15 years and the presence of important variations across regions. More recent studies suggested that the nation’s SRB may have plateaued during the 2010s. Given the lack of exhaustive birth registration data in Vietnam, it is necessary to estimate and project levels and trends in the regional SRBs in Vietnam based on a reproducible statistical approach. We compiled an extensive database on regional Vietnam SRBs based on all publicly available surveys and censuses and used a Bayesian hierarchical time series mixture model to estimate and project SRB in Vietnam by region from 1980 to 2050. The Bayesian model incorporates the uncertainties from the observations and year-by-year natural fluctuation. It includes a binary parameter to detect the existence of sex ratio transitions among Vietnamese regions. Furthermore, we model the SRB imbalance using a trapezoid function to capture the increase, stagnation, and decrease of the sex ratio transition by Vietnamese regions. The model results show that four out of six Vietnamese regions, namely, Northern Midlands and Mountain Areas, Northern Central and Central Coastal Areas, Red River Delta, and South East, have existing sex imbalances at birth. The rise in SRB in the Red River Delta was the fastest, as it took only 12 years and was more pronounced, with the SRB reaching the local maximum of 1.146 with a 95% credible interval (1.129, 1.163) in 2013. The model projections suggest that the current decade will record a sustained decline in sex imbalances at birth, and the SRB should be back to the national SRB baseline level of 1.06 in all regions by the mid-2030s.


Compute SRB from censuses
We use the population at age zero from census to compute the sex ratio of population at age zero. Then we compute the SRB using the sex ratio of population divided by the sex ratio of survival rate between age zero to one year old: 1 N M 0 and 1 N F 0 are the number of male and female population below age one year old respectively. 1 L M 0 and 1 L F 0 are the number of years lived between age zero and one for male and female respectively. DHS, MICS, 2007 Survey of Birth and 2014 Intercensal Survey provide individual-level data with the full  birth history for each woman at reproductive age interviewed during the survey fieldwork period. We calculate the sampling error for log-transformed SRB for DHS data series using the jackknife method [1, 2, 3]. For a certain data series with full birth histories, let U denote the total number of clusters. The uth partial prediction of SRB is given by:

Sampling errors for data based on full birth histories
where n indexes the live births in each state-survey-year, N is the total number of live births. x n is the sex for the nth live birth. d n is the cluster number for the nth live birth. w n is the sampling weight for the nth live birth. I n (·) = 1 if the condition inside brackets is true and I n (·) = 0 otherwise. The uth pseudo-value estimate of the SRB on log-scale is: The sampling variance is: The annual log-transformed SRB observations are merged such that the coefficient of variation (CV) for log-transformed SRB is below 0.1 or the merged period reached 5 years [4]. For a certain data series, let {t n , t n−1 , · · · , t 1 } be the years with recorded births from recent to past. The merge starts from the most recent year t n : Merging process for data based on full birth histories 1: for t ∈ {t n , t n−1 , · · · , t 1 } do 2: if t = t n then 3: Compute σ as explained in above. Compute CV= σ/ log(y) * u 4: if CV < 0.1 or t n − t n−1 > 1 or t n+1 − t n = 5 then 5: stop and move to the previous time point 6: else 7: Repeat steps 3-5 based on births from t n and t n−1 The above procedures of computing sampling error and merging observation periods are performed for each data series with full birth history information.

Stochastic errors for data based on censuses
We assume that the census coverage of population is close to 100%. Hence, the main uncertainty from the census data are stochastic errors. We used a Monte Carlo simulation to approximate the stochastic variance. For a region-year, the gth simulated number of male live births B M (g) is obtained as: where B denotes a binomial distribution, G is the total number of simulations, B rep is the total number of live births as reported in VR data, and p M rep is the reported proportion of male live births among the reported total live births. The corresponding gth simulation for SRB is given by: The stochastic error for SRB on log-scale is: log(y (g) ).

Imputing sampling errors for Annual PCFPS data
Given that the individual-level birth records from Annual PCFPS are not publicly available, we are not able to directly compute the sampling errors associated with Annual PCFPS data. Hence, we imputed the sampling error for observed SRB from Annual PCFPS at 0.05, which is the median sampling error from 2007 and 2008 Survey of Birth and 2014 Intercensal Survey. Table 2 summarizes the notations and indexes used in this paper. N (µ, σ 2 ) refers to a normal distribution with mean µ and variance σ 2 . t 3 (µ, σ 2 ) refers to a Student-t distribution with degree of freedom 3, mean µ and variance σ 2 . U(a, b) denotes a continuous uniform distribution with lower and upper bounds at a and b respectively.

Model for Sex Ratio at Birth by Vietnam region
The model is largely based on the model described in [7]. In this study, we made a few modifications in the model to better address the data quality and availability of provincial SRB in Vietnam. The outcome of interest Θ r,t , the SRB in Vietnam region r in year t is modeled as: .063 is the SRB baseline level for the entire Vietnam. The Vietnam SRB baseline b is estimated based on national SRB observations from Vietnam before reference year 1970 [5,6]. Φ r,t follows an AR(1) times series model on the log scale to capture the natural fluctuations of SRB within each region over time. ρ = 0.9 and σ = 0.004 based on previous study [5,6]. Instead of estimating ρ and σ , we use the estimated values from prior study [5,6] since the study is based on an extensive national SRB database and the estimates of the parameters are robust. δ r is the binary identifier of the sex ratio transition, following a Bernoulli distribution: δ r |π r ∼ B(π r ), for r ∈ {1, · · · , 6}, logit(π r )|µ π , σ π ∼ N (µ π , σ 2 π ), for r ∈ {1, · · · , 6}.

Unknown Parameters Θ r,t
The model fitting for the true SRB for Vietnam region r in year t. Φ r,t The region-year-specific multiplier for capturing the natural fluctuation in SRB around the national baseline b for Vietnam region r in year t. α r,t The SRB imbalance for Vietnam region r in year t. t 0,r The start year of the SRB inflation for Vietnam region r. δ r The indicator of the presence (δ r = 1) or absence (δ r = 0) of SRB inflation in Vietnam region r. ξ r The maximum level of the SRB inflation for Vietnam region r. λ 1,r The period length for the stage of increase of the sex ratio transition for Vietnam region r. λ 2,r The period length for the stage of stagnation of the sex ratio transition for Vietnam region r. λ 3,r The period length for the stage of decrease back to national SRB baseline of the sex ratio transition for Vietnam region r. ω The non-sampling error for all the observations. Known Quantities y i The ith SRB observation.
The sampling error for the ith SRB observation as computed in Section 1.3. b The baseline level of SRB for the whole Vietnam [5], where b = 1.063. ρ The autoregressive indicator for Φ r,t , where ρ = 0.9 [5,6]. σ The SD of distortion parameter for Φ r,t , where σ = 0.004 [5,6]. The logit transformation ensures that the probability parameter π r lies in the interval [0, 1]. The logittransformed π r follows a hierarchical normal distribution with a global mean and variance µ π and σ 2 π . α r,t refers to the region-specific SRB imbalance process, and is modeled by a trapezoid function to represent the increase, stagnation, and decrease of the transition stages: t < t 0,r or t > t 3,r t 1,r = t 0,r + λ 1,r , t 2,r = t 1,r + λ 2,r , t 3,r = t 2,r + λ 3,r , The start year of SRB inflation t 0,r incorporates the fertility squeeze effect by using the data of total fertility rate (TFR). t 0,r is modeled with a Student-t distribution with degree of freedom 3 with mean at 2001 which indicates the year in which the TFR in Vietnam region r declines to 2.0 [7]. We use the year 2001 to determine the mean of the distribution for t 0,r because this is the estimated year in which Vietnam may start the sex ratio transition [7]. The low degrees of freedom for the Student-t distribution is needed to capture higher uncertainty at the tails for the distribution of t 0,r . Please refer to [7] for more details on the model assumption for α r,t .

Data quality model
, for i ∈ {1, · · · , 526}, σ 2 i is the sampling/stochastic error variance for log(y i ) which reflects the uncertainty associated with logscaled SRB observations due to survey sampling design or underlying natural fluctuations. σ 2 i is known and are calculated as explained in Section 1.3 and Section 1.4. ω 2 represents the non-sampling error that can be minimized but cannot be entirely eradicated due to causes such as non-responses, data input errors.
Note that we use both r, t and i to index the observations instead of only using r and t. This is because for some region r year t, there are multiple data point available. Although the mean of each region r year t is the same at Θ r,t , the variances of these data points are different at σ 2 i + ω 2 .

Priors
Informative priors are assigned to region-level parameters related to the sex ratio transition: the maximum level of SRB inflation ξ r , the period lengths for the stages of increase, stagnation and decrease as λ 1,r , λ 2,r and λ 3,r respectively. The means of the prior distributions are from the systematic study [7,8]  Vague priors are assigned to parameters related to the indicator for detecting the existence of a sex ratio transition, the SD of start year and the non-sampling error: inverse-logit(µ π ) ∼ U(0, 1), σ π ∼ U(0, 2), ω ∼ U(0, 0.5).

Statistical Computing
We obtained posterior samples of all the model parameters and hyper parameters using a Markov chain Monte Carlo (MCMC) algorithm, implemented in the open source softwares R 3.6.1 [9] and JAGS 4.3.0 [10], using R-packages R2jags [11] and rjags [12]. Results were obtained from 10 chains with a total number of 5,000 iterations in each chain, while the first 10,000 iterations were discarded as burn-in. After discarding burn-in iterations and proper thinning, the final posterior sample size for each parameter by combining all chains is 25,000. Convergence of the MCMC algorithm and the sufficiency of the number of samples obtained were checked through visual inspection of trace plots and convergence diagnostics of Gelman and Rubin [13], implemented in the coda R-package [14].

Model Validation
We assess the inflation model performance via two approaches: 1) out-of-sample validation; and 2) oneregion simulation.

Out-of-Sample Validation
We leave out 37% of the data points since the data collection year 2014 instead of reference year, which has been used in assessing model performance for demographic indicators largely based on survey data [15,16,17,18]. After leaving out data, we fit the model to the training data set, and obtain point estimates and credible intervals that would have been constructed based on the available data set in the survey year selected. We calculate median errors and median absolute errors for the left-out observations, where errors are defined as: e j = y j − y j , where y j refers to the posterior median of the predictive distribution based on the training data set for the jth left-out observation y j . Coverage is given by 1/J I[y j ≥ l j ]I[y j ≤ u j ], where J refers to the number of left-out observations, and l j and u j correspond to the lower and upper bounds of the 95% prediction interval for the jth left-out observation y j . The validation measures are calculated for 1000 sets of left-out observations, where each set consists one randomly selected left-out observation from each Vietnam region. The reported validation results are based on the mean of the outcomes from the 1000 sets of left-out observations.
For the point estimates based on full data set and training data set, errors for the true level of SRB are defined as e(Θ) r,t = Θ r,t − Θ r,t , where Θ r,t is the posterior median for region r in year t based on the full data set, and Θ r,t is the posterior median for the same region-year based on the training data set. Similarly, the error for the sex ratio transition process with probability is defined as e(αδ) r,t = α r,t δ r − α r,t δ r . Coverage is computed in a similar manner as for the left-out observations, based on the lower and upper bounds of the 95% credible interval of Θ r,t from the training data set.

One-Region Simulation
We assess the inflation model performance by one-region simulation. For each of the six Vietnam regions, we consider all data points as test data and simulate the SRB using the posterior samples of the global parameters from the sex ratio transition model (obtained using the full dataset).
The gth simulated SRB Θ r are simulated to refer to a "new" region, without taking into account any region-specific data, following the model specification for these parameters. α r,t and δ r are simulated using the posterior samples of all parameters and hyper-parameters related to them. After generating the simulated values, we calculate the same set of results as described in Section 3.1 on out-of-sample validation.

Validation and Simulation Results
where t = 2050, and h ∈ {1, 2, · · · , 70}. Given that the autoregressive indicator for Φ r,t is set to positive (ρ = 0.9 [5,6]), the ACF across Vietnam regions decreases exponentially to zero as the lag increases. Table 3 summarizes the results related to the left-out SRB observations for the out-of-sample validation exercise and the one-country simulation. Median errors and median absolute errors are very close to zero for left-out observations. The coverage of 95% and 80% prediction intervals are symmetrical and more conservative than expected. The wider-than-expected prediction interval for left-out observations are mainly due to the greater uncertainty associated in more recent observations. Table 4 shows results for the comparison between model estimates obtained based on the full dataset and based on the training set for the out-of-sample validation exercise. We look at the model estimates for the true SRB Θ r,t and the inflation process with country-specific probability δ r α r,t . Median errors and the median absolute errors are close to zero.
In summary, the validation results indicate reasonably good calibrations and predicting power of the inflation model with conservative credible intervals.   Table 3: Validation and simulation results for left-out SRB observations. Error is defined as the difference between a left-out SRB observation and the posterior median of its predictive distribution. SRB observations with data collection years since 2014 are left out. Numbers in the parentheses after the proportions indicate the average number of left-out observations fall below or above their respective 95% and 80% prediction intervals.

Model Validation
Θ r,t Θ r,t Θ r,t δ r α r,t δ r α r,t δ r α r,t  Table 4: Validation results for estimates based on training set. Error is defined as the differences between a model estimate (i.e. Θ r,t or δ r α r,t ) based on full dataset and training set. The proportions refer to the proportions (%) of countries in which the median estimates based on the full dataset fall below or above their respective 95% and 80% credible intervals based on the training set.