The authors have declared that no competing interests exist.
Wrote the paper: MVV DS. Wrote, prepared, and proofread the manuscript: MVV. Helped write and proofread the manuscript: DS. Performed orbit analysis of the data: MVV DS. Gathered data and performed detailed statistical analysis of the data: BS. Generated orbits and constructed the GUI/software for orbit visualization: FC.
We analyse demographic longitudinal survey data of South African (SA) and Mozambican (MOZ) rural households from the Agincourt Health and Socio-Demographic Surveillance System in South Africa. In particular, we determine whether absolute poverty status (APS) is associated with selected household variables pertaining to socio-economic determination, namely household head age, household size, cumulative death, adult-to-minor ratio, and influx. For comparative purposes, households are classified according to household head nationality (SA or MOZ) and APS (rich or poor). The longitudinal data of each of the four subpopulations (SA rich, SA poor, MOZ rich, and MOZ poor) is a five-dimensional space defined by binary variables (questions), subjects, and time. We use the orbit method to represent the binary multivariate longitudinal data (BMLD) of each household as a two-dimensional orbit and to visualise the dynamics and behaviour of the population. At each time step, a point (
Binary multivariate longitudinal data (BMLD) is here exemplified by the binary responses in a Yes/No form to a set of
Many BMLD studies use regression techniques [
In [
Here we use the orbit method to analyse binary demographic data of households from the Agincourt Health and Socio-Demographic Surveillance System (AHDSS) in South Africa. The longitudinal AHDSS data have been studied e.g. in [
The AHDSS longitudinal data analysed here is of about 4,000 households from 2001–2007. With purpose
The advantage of representing the data of each subject as a two-dimensional orbit is that orbits capture the dynamics of change in each subject's responses, revealing change over time at both the individual and the population level while retaining the full information of the original data. Using orbits for data analysis gives a way to visualise the data, i.e. to identify clusters associated with stable (less frequently changing) variables, and patterns in subpopulations associated with those clusters. BMLD can involve hundreds of variables so visualising in
The Agincourt Health and Socio-Demographic Surveillance System (HDSS) is located in Bushbuckridge in northeast South Africa and was established in 1992. Bushbuckridge is a poor rural sub-district whose population is made up of South Africans and former Mozambican refugees (approximately a third of the population) [
Recall our purpose and questions given in (
Aside from
Coded value | HH | HS | HD | AM | IF |
---|---|---|---|---|---|
0 | HH<40 | HS<3 | HD high | A<M | IF^{+} |
1 | HH≥40 | HS≥3 | HD low | A≥M | IF^{−} |
Population | SA Rich | SA Poor | MOZ Rich | MOZ Poor |
---|---|---|---|---|
4158 | 2610 | 421 | 468 | 659 |
Given the number of variables
Analysis of BMLD
To explain the orbit method, we illustrate for
Time step | ℓ | ℓ′ | ℓ″ |
---|---|---|---|
0 | 010 | 100 | 111 |
1 | 001 | 100 | 001 |
2 | 000 | 101 | 001 |
3 | 001 | 110 | 101 |
4 | 000 | 101 | 001 |
Suppose we order questions and give more weight to those that least frequently change. As in numbers, we let the digit in the left-most position of the question order be most significant, and digit in the right-most be least significant. Observe that for both ℓ and ℓ′, answer to
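The idea of weighting questions by how rarely their answers change can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer strings are those of the three subjects in the table above, the left-most digit of each string is taken to be question 0, and the names `change_counts` and `stability_order` are hypothetical. It computes a single whole-series ordering, which is simpler than the step-by-step construction of an orbit.

```python
from typing import Dict, List

# Answer strings per time step, copied from the illustrative table above.
# Assumption: the left-most digit of each string is the answer to question 0.
subjects: Dict[str, List[str]] = {
    "l": ["010", "001", "000", "001", "000"],
    "l_prime": ["100", "100", "101", "110", "101"],
    "l_double_prime": ["111", "001", "001", "101", "001"],
}


def change_counts(answers: List[str]) -> List[int]:
    """Count, for each question, how often its answer differs between consecutive time steps."""
    n_questions = len(answers[0])
    return [
        sum(a[q] != b[q] for a, b in zip(answers, answers[1:]))
        for q in range(n_questions)
    ]


def stability_order(answers: List[str]) -> List[int]:
    """Order questions from least to most frequently changing (ties keep index order)."""
    counts = change_counts(answers)
    return sorted(range(len(counts)), key=lambda q: counts[q])


for name, answers in subjects.items():
    print(name, change_counts(answers), stability_order(answers))
```

Under these assumptions, question 0 never changes for ℓ and ℓ′ and is therefore placed in the most significant position for both, whereas for ℓ″ it is the right-most question that never changes.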
We recall terms and notations as introduced in [
For convenience, states in
(a) Orbit of subject ℓ staying in subset of
The algorithm for determining the next states
[initial state
[state
[edge color] Draw an edge from
[state
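Time step | Answer to question 0 | Answer to question 1 | Answer to question 2 | Ordered answers | Question order | State index |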
---|---|---|---|---|---|---|
0 | 1 | 1 | 1 | 111 | 120 | 24 |
1 | 1 | 1 | 1 | 111 | 120 | 24 |
2 | 0 | 1 | 1 | 110 | 120 | 23 |
3 | 0 | 1 | 0 | 100 | 102 | 29 |
4 | 0 | 1 | 1 | 101 | 102 | 30 |
5 | 0 | 1 | 1 | 101 | 102 | 30 |
6 | 1 | 1 | 1 | 111 | 120 | 24 |
7 | 0 | 1 | 1 | 110 | 120 | 23 |
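The rows of the table above are consistent with a simple "move to the least significant position" rule: whenever the answer to a question changes, that question is moved to the right-most place in the question order, and the answers are then read off in that order. The Python sketch below reproduces the ordered answers and question order in the table (its fifth and sixth columns) under that reading. It is an inference from the table, not a statement of the authors' algorithm; the function name `orbit_states`, the supplied initial order, and the handling of several simultaneous changes (moved in index order) are assumptions. The numeric state index in the last column is not reconstructed.

```python
from typing import List, Sequence, Tuple


def orbit_states(
    answers: Sequence[Sequence[int]], initial_order: Sequence[int]
) -> List[Tuple[str, str]]:
    """Return (ordered answers, question order) for every time step.

    Assumed rule: a question whose answer changed since the previous time step
    is moved to the least significant (right-most) position of the order.
    """
    order = list(initial_order)
    states: List[Tuple[str, str]] = []
    previous = None
    for current in answers:
        if previous is not None:
            changed = [q for q in range(len(current)) if current[q] != previous[q]]
            for q in changed:  # assumption: simultaneous changes are moved in index order
                order.remove(q)
                order.append(q)
        ordered_bits = "".join(str(current[q]) for q in order)
        states.append((ordered_bits, "".join(str(q) for q in order)))
        previous = current
    return states


# Answers to questions 0, 1, 2 at time steps 0..7, copied from the table above.
table_answers = [
    (1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 1, 0),
    (0, 1, 1), (0, 1, 1), (1, 1, 1), (0, 1, 1),
]
# The order at time 0 is taken from the table; how it is chosen is not modelled here.
for t, state in enumerate(orbit_states(table_answers, initial_order=(1, 2, 0))):
    print(t, state)
```

A move-to-back rule of this kind keeps rarely changing questions in the most significant positions, which matches the weighting idea described earlier in the section.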
Using orbits, the complete
Note that many households may share an edge (or orbit) in
Observe the clusters formed in regions of each subpopulation. Variable 4 (influx) is the most frequently changing variable in all four subpopulations, so orbits do not stay in the region where y = 4****. Not all 5! significance states are shown.
Household orbits in
Panels: SA Rich, SA Poor, MOZ Rich, MOZ Poor.
SA Rich | SA Poor | MOZ Rich | MOZ Poor | |
---|---|---|---|---|
374 | 78 | 110 | 102 | |
466 | 53 | 50 | 85 | |
1280 | 164 | 216 | 309 | |
1334 | 308 | 299 | 405 | |
3373 | 580 | 599 | 971 |
There are immediately regions of interest in
Observed units that often stay in a region determined by one significant variable often experience the property of that region.
Regions in
In general, we say that a variable
Clusters in the right half regions of
As observed in
This is the zoomed region in
Regarding Remark 6, we can further analyse orbits from the SA Rich subpopulation.
Highlighted lines are self-transition, i.e.
The capacity at state 48 = (11111, 01234) is dominant. This is expected as most orbits idle in this state, as given by the numbers in
The capacity graphs for the state pairs 48 and 47 = (11110, 01234), and 39 = (11110, 01243) and 40 = (11111, 01243), behave inversely and are almost symmetrical. Note that transitions between states 47 and 48, and between states 39 and 40, are associated with a change in variable 4 (IF) and variable 3 (AM), respectively. It is expected that an increase in capacity at 48 (more individuals migrating into households) results in a decrease in capacity at 47. The same argument holds for the exchange in capacity between states 39 and 40.
Transitions between 23 = (11110, 01342) and 24 = (11111, 01342) are associated with a change in variable 2 (HD). The capacity graph of 23 (HD = 0) is always above that of 24, except at
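The inverse behaviour of paired capacity graphs can be illustrated with a small Python sketch. It assumes that the capacity of a state at a given time step is the number of household orbits occupying that state at that time, which is how we read the text; the function name `capacity_over_time` and the toy orbits are hypothetical and are not taken from the AHDSS data.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def capacity_over_time(orbits: Dict[str, Sequence[Hashable]]) -> List[Counter]:
    """Count, for each time step, how many orbits occupy each state.

    Assumption: 'capacity' of a state at time t is the number of orbits at that state at t.
    """
    n_steps = min(len(orbit) for orbit in orbits.values())
    return [Counter(orbit[t] for orbit in orbits.values()) for t in range(n_steps)]


# Toy orbits (made-up data): each household is at state 47 or 48 at every time step.
toy_orbits = {
    "household_a": [48, 48, 47, 48],
    "household_b": [48, 47, 48, 48],
    "household_c": [47, 48, 48, 48],
}

capacity = capacity_over_time(toy_orbits)
for t, counts in enumerate(capacity):
    print(t, "capacity at 48:", counts[48], "capacity at 47:", counts[47])
```

When households only move between two states, the two capacities always sum to the same total, so any gain at one state is an equal loss at the other. This is the mechanism behind capacity graphs that mirror each other.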
We particularly use
There is one dominant peak in SA Rich. This occurs at the fully fit state (11111, 01234), where it is most stable in
The peaks for SA Poor, MOZ Rich, and MOZ Poor are at states
Spikes at states ii., iv., v., vi., and vii. are identified with relatively stable unfavourable states HH<40, A<M, and HS<3 with IF^{+}. We then associate absence of visits to these states with SA Rich, and their presence with the other three subpopulations.
For the two dominant peaks at states i. and ii. in MOZ Rich in
We discuss the use of other conventional methods in analysis of BMLD and mention the advantage of using orbits.
For
To compare the performance of the conventional statistical model to the deterministic orbit approach we have adopted a generalized estimating equation (GEE) population modelling approach. In [
Before presenting the GEE model, we note that, with regard to the correlated indicators, there is potential collinearity between household size and certain other covariates. This is suggested by the marginally high variance inflation factor (VIF) for this variable (
| Variable | VIF | 1/VIF |
|---|---|---|
| | 10.18 | 0.0982 |
| | 7.14 | 0.1340 |
| | 4.73 | 0.2115 |
| | 3.24 | 0.3087 |
| | 2.69 | 0.3714 |
| Mean VIF | 5.60 | |
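For readers less familiar with the diagnostic, the relation between a VIF, its reciprocal (the tolerance), and the implied R² of regressing one covariate on the others can be sketched as follows. The sketch assumes the usual definition VIF = 1/(1 − R²); the text does not say whether centred or uncentred VIFs were computed, so the implied R² values are indicative only. The function names are hypothetical, and the inputs are the VIF values listed in the table above.

```python
from typing import List

# VIF values copied from the table above (covariate labels are not available here).
reported_vifs: List[float] = [10.18, 7.14, 4.73, 3.24, 2.69]


def tolerance(vif: float) -> float:
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def implied_r_squared(vif: float) -> float:
    """R^2 of the auxiliary regression, assuming VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / vif


mean_vif = sum(reported_vifs) / len(reported_vifs)

for vif in reported_vifs:
    print(f"VIF={vif:5.2f}  tolerance={tolerance(vif):.4f}  implied R^2={implied_r_squared(vif):.3f}")
print(f"mean VIF = {mean_vif:.2f}")
```

A VIF near 10 corresponds to an implied R² of about 0.9, which is why such a value is commonly read as a sign of collinearity.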
1.0000 | |||||
0.0251 | 1.0000 | ||||
-0.0226 | -0.0231 | 1.0000 | |||
0.1068 | -0.2919 | -0.0302 | 1.0000 | ||
0.0381 | 0.6787 | -0.0114 | -0.1908 | 1.0000 |
GEE population-averaged model | Number of obs | = | 22270 | |||
Group variable: | hh_id | Number of groups | = | 5567 | ||
Link: | logit | Obs per group: min | = | 4 | ||
Family: | binomial | avg | = | 4.0 | ||
Correlation: | exchangeable | max | = | 5 | ||
Wald chi2(5) | = | 483.36 | ||||
Scale parameter: | 1 | Prob > chi2 | = | 0.0000 | ||
| Variable | Odds Ratio | Std. Err. | z | p-value | 95% CI lower | 95% CI upper |
| | 1.2576 | 0.0315 | 9.14 | 0.000 | 1.1973 | 1.3209 |
| | 1.0118 | 0.0338 | 0.35 | 0.724 | 0.9477 | 1.0803 |
| | 1.3439 | 0.0346 | 11.48 | 0.000 | 1.2777 | 1.4134 |
| | 1.2846 | 0.0375 | 8.58 | 0.000 | 1.2132 | 1.3603 |
| | 0.7615 | 0.0180 | -11.51 | 0.000 | 0.7270 | 0.7977 |
| _cons | 0.7559 | 0.0345 | -6.13 | 0.000 | 0.6912 | 0.8266 |
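The test statistics and confidence limits in the GEE output above can be recovered from the odds ratios and their standard errors, which may help when reading the table. The sketch below assumes the usual convention for logit models reported on the odds-ratio scale: the Wald test and interval are computed on the log-odds scale, and the standard error of the odds ratio is obtained by the delta method (SE of OR = OR × SE of the coefficient). The function name `wald_summary` is hypothetical.

```python
import math
from typing import Tuple


def wald_summary(
    odds_ratio: float, se_odds_ratio: float, z_crit: float = 1.959964
) -> Tuple[float, float, float]:
    """Return (z, lower, upper) for an odds ratio and its delta-method standard error.

    Assumption: the interval is computed on the log-odds scale and then exponentiated.
    """
    beta = math.log(odds_ratio)              # coefficient on the logit scale
    se_beta = se_odds_ratio / odds_ratio     # invert the delta-method relation
    z = beta / se_beta
    lower = math.exp(beta - z_crit * se_beta)
    upper = math.exp(beta + z_crit * se_beta)
    return z, lower, upper


# First covariate row of the table above: OR = 1.2576, Std. Err. = 0.0315.
z, lower, upper = wald_summary(1.2576, 0.0315)
print(f"z = {z:.2f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Small differences from the tabulated z values are expected, because the inputs are rounded to four decimal places.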
| Variable | VIF | 1/VIF |
|---|---|---|
| | 4.43 | 0.2257 |
| | 3.27 | 0.3059 |
| | 3.24 | 0.3089 |
| | 2.66 | 0.3762 |
| | 1.37 | 0.7301 |
| Mean VIF | 2.99 | |
A motion chart is a dynamic bubble chart that enables the display of large multivariate data with a large number of data points [
s | Answer | s | Answer | s | Answer | s | Answer |
---|---|---|---|---|---|---|---|
1 | 00000 | 9 | 01000 | 17 | 10000 | 25 | 11000 |
2 | 00001 | 10 | 01001 | 18 | 10001 | 26 | 11001 |
3 | 00010 | 11 | 01010 | 19 | 10010 | 27 | 11010 |
4 | 00011 | 12 | 01011 | 20 | 10011 | 28 | 11011 |
5 | 00100 | 13 | 01100 | 21 | 10100 | 29 | 11100 |
6 | 00101 | 14 | 01101 | 22 | 10101 | 30 | 11101 |
7 | 00110 | 15 | 01110 | 23 | 10110 | 31 | 11110 |
8 | 00111 | 16 | 01111 | 24 | 10111 | 32 | 11111 |
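The index s in the table above is the five-digit answer string read as a binary number, plus one. A minimal Python sketch of the mapping in both directions follows; the function names are hypothetical.

```python
def answer_to_index(answer: str) -> int:
    """Map a binary answer string such as '10110' to its index s in the table above."""
    return int(answer, 2) + 1


def index_to_answer(s: int, n_questions: int = 5) -> str:
    """Inverse mapping: recover the zero-padded answer string from the index s."""
    return format(s - 1, f"0{n_questions}b")


for answer in ("00000", "01101", "10110", "11111"):
    print(answer, "->", answer_to_index(answer))
```

With five questions this gives the 2^5 = 32 indices listed in the table.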
The heat map approach illustrated in
Using variables pertaining to socio-economic determination, we have illustrated via two-dimensional orbits the dynamics and patterns of four subpopulations in the AHDSS. Stable and unstable variables (in terms of frequency of change) have been identified. The high frequency of change of the IF variable (
The value of the orbit method for the analysis of binary multivariate longitudinal data is that it gives a picture of how subjects and the population behave. We know of no other method that visualises such data exactly. Orbits can serve as an additional tool for, say, demographers and social scientists in the analysis of data. A further value of the method is the insight it gives into possible cause and effect. Presenting longitudinal data as a time-evolving geometric orbit naturally enables visual identification of possible cause and effect along the orbit (e.g. if only state
One obvious limitation of the orbit method is that it considers only complete data. Extending the method to accommodate missing data is necessary. Tools for (demographic) estimation from limited, deficient and defective data [
The primary confounder we included and stratified on in this analysis was household head nationality. Previous papers [
The authors would like to thank the referees for their valuable comments. The data used in this study was supplied by the MRC/Wits Rural Public Health and Health Transitions Research Unit (Agincourt). The Agincourt HDSS is funded by the Medical Research Council and University of the Witwatersrand, South Africa, Wellcome Trust, UK (grant no. 058893/Z/99/A, 069683/Z/02/Z, 085477/Z/08/Z), and National Institute on Aging of the NIH (grants 1R24AG032112-01 and 5R24AG032112-03).