Analysis of Binary Multivariate Longitudinal Data via 2-Dimensional Orbits: An Application to the Agincourt Health and Socio-Demographic Surveillance System in South Africa

We analyse demographic longitudinal survey data of South African (SA) and Mozambican (MOZ) rural households from the Agincourt Health and Socio-Demographic Surveillance System in South Africa. In particular, we determine whether absolute poverty status (APS) is associated with selected household variables pertaining to socio-economic determination, namely household head age, household size, cumulative death, adults to minor ratio, and influx. For comparative purposes, households are classified according to household head nationality (SA or MOZ) and APS (rich or poor). The longitudinal data of each of the four subpopulations (SA rich, SA poor, MOZ rich, and MOZ poor) is a five-dimensional space defined by binary variables (questions), subjects, and time. We use the orbit method to represent binary multivariate longitudinal data (BMLD) of each household as a two-dimensional orbit and to visualise dynamics and behaviour of the population. At each time step, a point (x, y) from the orbit of a household corresponds to the observation of the household, where x is a binary sequence of responses and y is an ordering of variables. The ordering of variables is dynamically rearranged such that clusters and holes associated to least and frequently changing variables in the state space respectively, are exposed. Analysis of orbits reveals information of change at both individual- and population-level, change patterns in the data, capacity of states in the state space, and density of state transitions in the orbits. Analysis of household orbits of the four subpopulations show association between (i) households headed by older adults and rich households, (ii) large household size and poor households, and (iii) households with more minors than adults and poor households. Our results are compared to other methods of BMLD analysis.


Introduction
Binary multivariate longitudinal data (BMLD) is here exemplified by the binary responses in a Yes/No form to a set of p ! 1 questions (variables) asked to each subject of a (sample) population over a period of time. As in the convenient convention of binary coding of 0 for a negative response and 1 a for positive response [1], the outcome of each of the binary variables here is coded as 0 if the outcome is unfavourable (by hypothesis) to a given purpose, and 1 if favourable.
Many BMLD studies use regression techniques [2] or Markov, transition and forecasting models ( [3][4][5]). These methods involve parameter estimation for the explanatory variables. However, visual analysis of data is equally important as it presents initial insights about the data. Descriptive tools such as tables and charts give a visual summary and simpler interpretation. For visual analysis of multivariate longitudinal data, some analysis is given in ( [6,7]) but very few tools are available when the data is binary.
In [8], the focus is on visualizing the complex border between patterns of BMLD. The border in a multidimensional space is converted into visual 2-dimensional and 3-dimensional forms. However, it does not illustrate patterns and dynamics of the population over time. A technique that accounts for dynamics of BMLD and within subject information is the orbit method discussed in [9]. Orbit method here is distinct from the Kirillov orbit method used in representation theory [10]. By an orbit we mean a sequence of points related by the evolution function of the underlying system. The method of orbit is a technique based on deterministic outcomes with emphasis on geometric visualization of multivariate longitudinal data as 2-dimensional orbits. It considers the frequency of change of variables and uses the order of variables for constructing orbits that represent subjects from the population. Orbits give insight for data analysis and provide exact data visualization.
Here we use the orbit method to analyse binary demographic data of households from the Agincourt Health and Socio-Demographic Surveillance System (AHDSS) in South Africa. The longitudinal AHDSS data have been studied e.g. in [11] and [12] where a spatial-temporal model to analyse distribution of mortality and asset accumulation rate respectively were employed. Determinants of socio-economic status/poverty or the relationship between poverty and increased mortality were viewed from more of a static perspective i.e. not from the more dynamic approach offered by the orbit. The orbit approach presents a visualisation of the data in a truly longitudinal-temporal manner. The orbit method is briefly illustrated in [9] using variables regarding child educational progression in AHDSS. However, detailed interpretation of subsets in the subspace, nor analysis of orbits in the space, were not discussed. Aside from [9], we know of no other visual analysis employed for AHDSS, particularly involving household variables pertaining to socio-economic determination. The Each question in (Eq 2) is associated to a variable, e.g. Q 0 to household head age (HH), Q 1 to household size (HS), and so on. We will sometimes refer to question Q i (i = 0, 1, . . ., 6) as variable i. The advantage in representing the data of each subject as a two-dimensional orbit is that orbits capture the dynamics of change in response of each subject so it reveals information of change over time at both individual and population level while retaining the full information of the original data. Using orbits for data analysis give a way to visualize data, i.e. identify clusters associated to stable (less frequently changing) variables, and patterns in subpopulations associated to clusters. BMLD can involve hundreds of variables so visualising in d ! 4 dimensions is difficult. Survey data (e.g. in the social sciences) is usually large both in dimension and in size but orbit representation is feasible for large numbers of variables and subjects. In this application of orbits to the analysis of AHDSS survey data, we hope to contribute in giving new insights in the analysis of binary multivariate longitudinal data.

Description of Data
The Agincourt Health and Socio-Demographic Surveillance System (HDSS) is located in Bushbuckridge in northeast South Africa and was established in 1992. Bushbuckridge is a poor rural sub-district that is made up of South African and former Mozambican refugees (approximately a third of the population) [13,14]. There have been annual updates of births, deaths, in-and out-migrations of individuals identified as members of households, as well as regular special modules (e.g. household asset ownership) used to derive a socio-economic status index.
Recall our purpose and questions given in (Eq 1) and (Eq 2). Regarding absolute poverty status (APS), it is independent of the household variables associated to questions in (Eq 2). Here, households above the absolute poverty line was defined using the definition proposed in [15] for a sub-Saharan African setting, namely ownership of a radio and bicycle, a cement floor in the house, and access to public water and a pit latrine (toilet). Absolute poverty classification is thus independent of the 6 explanatory variables used in our orbit analysis. APS of households below the poverty line are coded 0, while APS above the poverty line is coded 1. Because APS is gathered only in 4 out of the 7 observation years (i.e. at t = 2001, 2003, 2005, and 2007), we use the mean APS of a household over 7 years, which we denote by APS. If APS 2[0, 0.5], then APS is coded 0. Otherwise, APS is coded 1. Our sample population consists of 7715 household units, 4158 of which are either always above or below the absolute poverty line for all four years that APS was gathered. For these households, APS information not gathered for the three years 2002, 2004, and 2006 will not affect the coding of their APS. Households with APS = 0 are referred to as Poor households, and households with APS = 1 as Rich households. Our analysis will only consider these 4158 households. Ethical clearance for the primary study was given by the University of the Witwatersrand Human Research Ethics Committee (Medical). The data used in this study does not contain clinical records (nor does the core Agincourt HDSS database). Individual and household identifiers are anonymized/de-identified by the data managers prior to handing it over to researchers for analysis to ensure confidentiality.
Aside from APS, household head nationality is also constant throughout the survey period. In addition, former Mozambican (MOZ) refugees experience significantly higher levels of poverty compared to their South African (SA) counterparts and this gap has persisted over time [12,16,17]. It is then useful to extract Q 5 and Q 6 and analyse by these subpopulations of households where both poverty status and household head nationality are unchanging. We divide our population into four subpopulations, namely SA Rich, SA Poor, MOZ Rich, and MOZ Poor. Each subpopulation is analysed using p = 5 variables associated to Q 0 to Q 4 . Binary data of the four subpopulations is given in S1 Dataset. From [12,16], a 'yes' answer to questions Q 0 to Q 4 is assumed to be favourable to APS so we code all yes = 1, and all no = 0. Table 1 gives the favourable and unfavourable code for each of the five questions. Table 2 gives the number of households by constant variables (i.e. nationality and APS).

The Method of Orbits
Given the number of variables p ! 1, denote by M p ¼ f0; 1g p the space of binary strings (responses) of length p. For a subject ℓ observed at times t = 0, 1, . . ., T, we define the binary multivariate longitudinal data in p ! 1 variables of subject ℓ. The binary longitudinal data in p variables from a population of n ! 1 subjects observed over time T is the set D ½p;n;T ¼ fD 1 ; D 2 ; . . . ; D n g: We will only consider subjects with complete data. Analysis of BMLD always involves a fixed variable order where one can use the summary measure of the frequency of response pattern (elements of M p ) and perform factor analysis on the longitudinal data [1] or construct Markov models using information of change of time encoded in the matrix of transition probabilities [3]. The method of orbits [9] uses the fundamental information of frequency of change of variables and order of variables for analysis. The information of change is used to define a non-autonomous dynamical system from data of each subject, dynamically rearranging order of variables so that most stable least changing variable is eventually placed to the left, but keeping the full information in the original data. Mathematical properties of the map are discussed in [9].
To explain the orbit method, we illustrate for p = 3 variables. Let Q = {Q 0 , Q 1 , Q 2 } be a questionnaire and assign index i to Q i , i = 0, 1, 2. Table 3 illustrates concatenated coded answers to Q of three subjects from a sample population. To Q 0 , subject ℓ has constant answer 0 while ℓ 0 has constant answer 1. On the other hand, ℓ@ has constant answer 1 to Q 2 . Observe that this property of subjects having constant answers to certain questions is not trivially illustrated in the time series of the three subjects given in Fig 1. Suppose we order questions and give more weight to those that least frequently change. As in numbers, we let the digit in the left-most position of the question order be most significant, and digit in the right-most be least significant. Observe that for both ℓ and ℓ 0 , answer to Q 0 is the most stable (i.e. it is constant), followed by Q 1 (changes once in ℓ and twice in ℓ 0 ), and finally Q 2 as most frequently changing. Then question order for both ℓ and ℓ 0 is chosen as Q 0 , Q 1 , Q 2 , which we will denote by 012. Now position lexicographically in increasing order as binary integers the states (responses) 000; 001; 010; 011; 100; 101; 110; 111 ð3Þ along an axis, and denote this by X 3 . For fixed question order 012, a one-dimensional dynamics on the states in X 3 arises, where answers of subjects ℓ and ℓ 0 are visualised jumping from one state to another, particularly staying in the distinct regions 0ÃÃ and 1ÃÃ, the left and right half of X 3 , respectively. However, different subjects may have different frequencies of change in answer values. Because ℓ@ has constant answer to Q 2 , question order for ℓ@ is chosen such that Q 2 is given more weight. In particular, question order for ℓ@ is set to 210. We recall terms and notations as introduced in [ Definition 1 Given p ! 1, the spaces of sequences both with the lexicographic ordering of sequences (i.e. as increasing integers) are the fitness axis and significance axis for p variables, respectively. An element x 2 X p is called a fitness state, and y 2 Y p a significance state. The space is the change space in p variables composed of P = 2 p p! states. For convenience, states in S p are labeled from 1 to P = 2 p p! starting from left to right, top to bottom. The labeled space S p for p = 3 is illustrated in Fig 2. The space X 3 is the sequences in (Eq 3), Y 3 = {012, 021, 102, 120, 201, 210}, and the cardinality of jS 3 j = 2 3 3! = 48.  Table 3.
Then the initial question order of ℓ is ; use population frequencies f n i j ; f n i jþ1 to determine order between i j and i j+1 .
where each x j is the answer to question i j in y ' 0 .

The initial state of ℓ is s
The algorithm for determining the next states s ' t (t > 0) is as follows: Step 1: [initial state s ' 0 ] For t = 0 and subject ℓ, determine the initial significance state y ' 0 , followed by the initial fitness state x ' 0 . Step 2: [state s ' 1 ] Given initial state s ' 0 ¼ ðx ' 0 ; y ' 0 Þ of ℓ, identify the questions that change answer values at t = 1. If there are none, then the next state s ' If both Q i j and Q i j 0 change answers at t = 1 and j < j 0 , then sequentially swap to the right i j and i j 0 (resp. x j and x j 0) of the question order (resp. answer order), starting with i j 0 (resp.
This new answer order and question order is the next state s ' To show direction of transitions between states, color the edge red if transition is from right to left, green if transition is from left to right, and blue otherwise (i.e. same x-coordinate).
Step 4: [state s ' t ; t ! 2 Update state s ' 0 as s ' 1 and time t as t = 2 in Step 2. Repeat Steps 2 and 3, and iterate until t = T−1.
t Þ be the fitness, significance, and state of subject ℓ at time t, respectively. The orbit of ℓ is the sequence of points Example 1 Table 4 gives coded data of a subject ℓ to p = 3 questions. Recall that coding of answer is 0 = unfavourable and 1 = favourable according to purpose. The coded answer of ℓ to Q i at time t is denoted by a ' i;t . From (Eq 4), we have f ' This corresponds to state index 24 in Fig 2(a). No answer changes at t = 1 so y ' 1 ¼ y ' 0 and x ' 1 ¼ x ' 0 and state transition from t = 0 to t = 1 is denoted by 24 ! 24. Now at t = 2, Q 0 changes answer so we swap 0 and 1 to the right of y ' 1 and x ' 1 respectively (note that both are already on the right), and then change answer 1 to 0. Hence, y ' 2 ¼ 120 and x ' 2 ¼ 110, which corresponds to state 23. Columns 3 and 4 give the rest of Table 4. Coded data and orbit of a subject ℓ. The number a i, t is answer to where question i 0 = 1 is favourable. The longitudinal data of ℓ in S 3 is visualised as the orbit in Fig 2(a) with its time series illustrated in Fig 2(b). Example 2 The orbits of the three subjects in Table 3 in S 3 and over time are illustrated in Fig  3(a) and 3(b) respectively. Observe that orbit of ℓ stays strictly on the left half of S 3 , and the other two on the right half. Subject ℓ is unfavourable in stable variable 0, while ℓ 0 and ℓ@ are favourable in stable variable 0 and variable 2, respectively.
Remark 1 By the 0/1 coding of data, it is reasonable to suppose that the (concatenated) answers composed of unfavourable values 00Á Á Á0 is 'less fit' than the answer composed of favourable values 11Á Á Á1. By the weighting of variables, 'relative fitness' is made precise so that ordering of elements from the space M 3 = {0, 1} 3 of multivariate binary responses has meaning. Because more weight (significance) is given to the left-most position, we can then, for a fixed question order, write x = 010 < x 0 = 100, where most significant variable is unfavourable in x, and favourable in x 0 . For a fixed question order, we say that 100 is fitter than 010 (or 000, 001). The same argument holds in stating that 110 is less fit that 111.
Using orbits, the complete p-dimensional information of each subject at any moment is coded to a point in the 2-dimensional discrete space S p . No information in data is lost nor approximated as each subject's orbit has a one-to-one correspondence with the subject's original data. Clearly, question order of each subject at each time step is monitored. The ordering is selected as frequently changing variables are swapped to the far right (less significant digit of y), thus pushing slow changing variables to the left (significant digit of y). This 'swapping-changing-variable-to-the-right' process exposes clusters associated to stable variables.
Remark 2 The time complexity of computation of orbits for n subjects and time T scales like pnT and is feasible for large data. We also note that there are admissible and non-admissible state transitions in S p [9], e.g. in Fig 2, a transition that starts at 23 can only end at 23, 24, 29, and 31. Remark 3 The tendency of a subject to favour a particular state, or subset of states, is clustering in S p . The strategy for choosing the initial question order in (Eq 5) places an orbit in its most likely position. This facilitates clustering and is useful for short data sets.
Note that many households may share an edge (or orbit) in S p . The following definitions are of interest regarding transitions in S p .
Definition 4 a. The accumulated number of transitions from state s = (x, y) to s 0 = (x 0 y 0 ) is called the density from s to s 0 , denoted by d (s, s 0 ).
b. The number of orbits at state s at time t is called the capacity of state s at time t and is denoted by c s, t .
Remark 4 We can deduce correlation among variables from orbits in S p . We explain for p = 3. Using Fig 4,  then we can check for (positive) correlation between the first two variables 0 and 1. The asterisk Ã in x (resp. in y) can take any binary value (resp. any question index except for i).

Remark 5
The orbit method is not limited to binary data in p variables. For m-ary valued data, the space S p is composed of same number p! of significance states but now with m p fitness states. For the continuous case, data of each observation are binned, where bins are labelled from 0 to m−1 [18,19]. For instance, binary coding can be done by assigning 0/1 if variable is above or below a given value, tertiary coding if the variable is in a good/neutral/bad range of values, and so on.

Orbit Results
Household orbits in S 5 for each of the four subpopulations are illustrated in Fig 4. The x-axis is composed of 2 5 = 32 states but not all the 5! = 120 states in the y-axis are shown. There are no transitions between the four subpopulations as they are associated to constant variables. Recall that a red edge is used to denote a transition that goes from right to left on the next time step, a green edge for a transition that goes from left to right, and a blue edge a transition that goes to the same fitness state. The percentage of unfavourable answers for each question in each of the four subpopulations is given in Table 5 while the frequencies of answer change are given in Table 6. The frequencies of change for Q 0 (HH) and Q 1 (HS) are low, while Q 4 (IF) is the highest. This means that there is stability in the variables HH and HS in that most subjects will stay in the region where significance is either y = 01i 2 i 3 i 4 or y = 10i 2 i 3 i 4 , i j 2 {2, 3, 4}. In addition, there is high activity of the IF variable, which means few transitions where y = 4i 1 i 2 i 3 i 4 , i j 2 {0, 1, 2, 3}. All of this is recognized in Fig 4. There are immediately regions of interest in Fig 4. As an aide in interpreting regions in S 5 , we present in Fig 5 the subsets of S 5 determined by the first significant variable i 0 . A subject ℓ that spends most of its time in the region where x = jÃÃÃÃ, y = iÃÃÃÃ means that answer of ℓ to Q i is least frequently changing, with answer = j. The initial state of ℓ is chosen to lie in this Regions in Fig 5 may be further analysed. A more detailed description of the regions HH!40, HH<40, HS!3, and HS<3 is illustrated in Fig 6. In each sub region, the two variables i 0 and i 1 are significant (i.e. less frequently changing answers), with i 0 being more significant. Of course these sub regions may be further subdivided.
In general, we say that a variable i is stable if orbits cluster in a subset of S p determined by first significant variable i. Regions that are never visited (e.g. those associated to variables IF + , IF − , and A<M in Fig 4) are termed holes. Clusters are contained in regions where the leading significant variable is stable while holes are contained in regions with high activity of the leading variable. By visual inspection of orbits in S p , we can immediately detect stable variables (via clusters) and unstable variables (via holes).
Clusters in the right half regions of S p are fitter than clusters located on the left half of S p as they are associated to stable leading variable with favourable condition. From the orbits of subpopulations in Fig 4, observe that there are no transitions between the left and right half of S 5 in both SA Poor and MOZ Poor subpopulations. This is verified by Fig 7, the orbits of the four subpopulations, in time. In addition, columnar structures over clusters correspond to variables that are stable over the survey period. Although there are few transitions between clusters, there is considerable activity within each. Household orbits in one cluster may then reasonably be analysed independently of households in other clusters. The time series representation of orbits reveals idle behaviour (sequence of vertical blue edges) that are not always visible in orbits in S 5 .
As observed in Fig 4, some regions in a subpopulation appear denser than those of the other subpopulations (e.g. the region HH<40 appears to be more dense in MOZ poor than in MOZ rich, with the opposite phenomenon for the SA population). We use histograms to denote the   associated again to older household head and larger household size, but with household size more stable.
Remark 6 Population-level information is visible, but detailed individual information may be lost in the cluster. We may zoom into regions of interest (e.g. regions of high percentage of visit) to unclutter the display, as in the SA Rich region R HH ! 40, HS ! 3 illustrated in Fig 11. As for individual orbits, of interest in Fig 4 are those that seem to be 'outliers'. They can further be analysed e.g. by using interactive techniques such as focusing and brushing, as in dynamic parallel component plots [20].
Regarding Remark 6, we can further analyse orbits from the SA Rich subpopulation.

Results Regarding Purpose
We particularly use Fig 8 to draw conclusions with regard to our Purpose as stated in (Eq 1).
For spikes at states ii., iv., and vi., unfavourable answer is either in Q 0 or Q 3 (i.e. HH<40 or A<M). We then associate young household heads and less adults to minors to poorer (i.e. not Rich SA) households. Now spike at state vii. which is unfavourable in Q 1 and Q 4 (HS<3 and IF + ) is also associated to poorer households. The condition of small households should be examined.
3. Spikes at states ii., iv., v., vi., and vii. are identified with relatively stable unfavourable states HH<40, A<M, and HS<3 with IF + . We then associate absence of visits to these states with SA Rich, and their presence with the other three subpopulations.
4. For the two dominant peaks at states i. and ii. in MOZ Rich in Fig 8(c), A ! M has a higher peak than A<M. This is reversed in MOZ Poor in Fig 8(d). We associate MOZ Poor with a stable, dominant population of households with small adult component.

Other Methods of Binary Multivariate Longitudinal Data Analysis
We discuss the use of other conventional methods in analysis of BMLD and mention the advantage of using orbits. Markov Chain Model. For p binary variables, Markov chain models considers the analysis of change over time measures in M p = {0, 1} p . Question order is arbitrarily fixed and a 2 p ×2 p matrix of transition probabilities is constructed [3]. If a fixed question order alone is used for all times and for all subjects (say 012Á Á Á(p−1)) in analysing binary multivariate longitudinal data, then some information (e.g. clusters and holes) may not be revealed as orbits overlap in a single row (question order) of S p . For example, the six clusters visible in Fig 4(a) are not resolved in Markov analysis. This phenomenon of 'unfolding' states from a general case of a fixed question order is an advantage in analysing orbits in S p . Given the fundamental weighting    by frequency of change of variables, it is of great interest that S p is the space of all possible states to which subjects can change, and also captures change of significant variables. By prioritising slowly changing variables, orbits give a natural spatial ordering of states in S p by fitness.
Generalized Estimating Equation Model. To compare the performance of the conventional statistical model to the deterministic orbit approach we have adopted a generalized estimating equation (GEE) population modelling approach. In [21], the estimation-equation approach is proposed for population average models. It is argued that in general, mixed models involve unverifiable assumptions on the data-generating distribution resulting to potentially misleading estimates and biased inference. We use the quasi-information criterion (QIC) to identify the best working correlation structure to be used for our data [22]. Maximum likelihood based model selection methods, such as the widely used Akaike Information Criterion (AIC), are not directly applicable to the GEE approach [23]. The exchangeable correlation structure proved to be the best when fitted to our data.
Before presenting the GEE model, we note that with regards to the correlated indicators, there is potential co-linearity between the household size and certain other covariates. This is suggested by the marginally high variance inflation factor (VIF) for this variable (Q 1 ) of * 10 in Table 7. Further, the spearman rank correlation coefficient of 0.68 between Q 1 (household size) and Q 4 (influx) in Table 8 would be cause for further concern. Removing the co-linear effect of Q 1 , the GEE model for a binary outcome (APS = 0/1) using a binomial family, logit link function and an exchangeable correlation structure is given in Table 9. The VIF without Q 1 is given in Table 10.
Remark 7 The GEE model shows that HH ! 40, HS ! 3, HD low, A ! M, IF − , and HN = SA are more likely in the rich households. This is consistent with our favourable/unfavourable orbit coding to APS. In addition, the model also informs us that variables associated to holes (not just clusters) in S p should also be analysed. In particular, the Q 4 (IF) variable (associated to holes)  and Q 3 (AM) variable (associated to very few transitions) appear to be statistically significant and associated to households above the absolute poverty line. Motion Charts and Heat Maps. A motion chart is a dynamic bubble chart that enables the display of large multivariate data with large number of data points [6]. The central object in motion charts is a blob, or in general a 2-dimensional shape, which represents one entity from the dataset. This allows for visualization of the data by using additional dimensions (e.g. time, size and color of the blobs) to show different facets of the data. The dynamic appearance of the data in a motion chart facilitates visual inspection of associations, patterns and trends in multivariate datasets. The problem with motion charts is that for many variables, there is not enough dimensions (e.g. size, shape, color, etc.) to represent different entities. The advantage of using orbits is that adding more variables is easily accommodated by the increase in the number of fitness and significance states in the change space S p . Fig 15 show Table 11. While this gives a sense of where more households fall with regards to relative poverty probability (stratified by household nationality and then by  While some differences can be observed by nationality, the clearer visualisation offered by the orbit approach is evident in our opinion. The heat map approach is not without merits (one being easy to implement) and would require more extensive and detailed application to longitudinal data such as ours to fully surmise its utility relative to the deterministic orbit approach.

Discussion
Using variables pertaining to socio-economic determination, we have illustrated via 2-dimensional orbits the dynamics and patterns of 4 subpopulations in the AHDSS. Stable and unstable variables (in terms of frequency of change) have been identified. The high frequency of change of IF variable (Q 4 ) in each of the four subpopulations intuitively, is an unfavorable phenomenon because it directly measures instability of household numbers i.e. the rapid flow of individuals into and out of households, within the community. Policies that might stabilize this phenomenon are of interest.
The value of using the method of orbits for analysis of binary multivariate longitudinal data is that it gives a picture of how subjects and the population behave. There are no known methods that show exact visualisation of such data. Orbits can be used as an additional tool for say demographers and social scientist in analysis of data. An additional value of the method is to give insight into possible cause and effect. Presentation of longitudinal data as a time-evolving geometric orbit naturally enables visual identification of possible cause and effect along the orbit (e.g. if only state i precedes j, then state i causes j). Using orbits for longitudinal data analysis is fundamentally different from conventional longitudinal statistical models in that it develops visible orbits for fitness states and therefore extracts more information from the data. For instance, the standard statistical model does not give a visual sense of the density of households in a given state, rather just the magnitude of association (odds ratio).  Table 11. Label for answer combination associated to Fig 15 and  One obvious limitation in using orbits is that it considers only complete data. Extending the method to accommodate missing data is necessary. Tools for (demographic) estimation from limited, deficient and defective data [24] may be used, where longitudinal data does not satisfy the assumption that there is no missing data, or that each variable and each subject is measured at the same times.
The primary confounder we included and stratified on in this analysis was household head nationality. Previous papers [12,14,16] on socio-economic status in Agincourt have identified the proximal importance of household nationality as a determinant/confounder for socio-economic status/poverty. Our GEE regression results confirm the importance of this confounder as a determinant of poverty status. As for potential confounders of socio data-economic determination such as occupation and income, they are rarely tracked in the Agincourt HDSS. In addition, given the large amount of missing data for these variables, we would not have be able to apply the orbit theory to the key indicators in the manner presented currently. Within our study period from 2001-2007, the education modules was run only in 2002 and 2006 i.e. not directly captured in the same time points. Mozambicans generally have a significantly lower number of education years compared to South Africans (e.g. [14]) so we believe the nationality would also capture any confounding effects of education status. However we cannot discount any residual confounding influence of occupation, income, and education on our results.