Flexible and modular latent transition analysis—A tutorial using R

Lisbeth Lund; Christian Ritz

doi:10.1371/journal.pone.0317617

Abstract

Latent transition analysis (LTA) is a useful statistical modelling approach for describing transitions between latent classes over time. LTA may be characterized in terms of prevalence at each time point and through transition probabilities over time. Investigating predictors of these transitions is often of key interest. Currently, LTA can mostly be carried out using commercial and specialized software and only to some limited extent by means of open source statistical software. This tutorial demonstrates a flexible and modular approach for LTA, providing a powerful alternative using R through a combination of latent class analysis and multiple logistic regression models. This approach has several advantages from a modelling perspective, as demonstrated through revisiting a previously conducted LTA, recently published in PLoS ONE. In short, results were very similar to the original analysis using commercial software although some additional novel results were also obtained. The proposed alternative approach offers more options in terms of choice of effect measures, model assumptions such as hierarchical structures and covariate adjustment, and differential handling of missing data. R code snippets are provided in the tutorial. A detailed accompanying script is also provided for full reproducibility.

Citation: Lund L, Ritz C (2025) Flexible and modular latent transition analysis—A tutorial using R. PLoS ONE 20(1): e0317617. https://doi.org/10.1371/journal.pone.0317617

Editor: Umair Khalil, Abdul Wali Khan University Mardan, PAKISTAN

Received: March 14, 2024; Accepted: January 1, 2025; Published: January 17, 2025

Copyright: © 2025 Lund, Ritz. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are provided within the manuscript and the supplementary material on zenodo.org (https://doi.org/10.5281/zenodo.10794077).

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

A range of statistical methods may be used for data-driven, exploratory identification of latent classes in longitudinal data. These methods may crudely be divided into 1) ones that focus on characterization of determinants of longitudinal patterns and 2) ones that focus on characterization of determinants of transitions [1]. The fundamental distinction between the two approaches is whether individuals can be assumed 1) to remain in an identified latent class during the study period, that is a longitudinal pattern, or 2) to transition between latent classes over time during the study period.

The former is often referred to as estimating growth curves or latent class trajectories [2]. Typically these methods involve two steps: initial estimation of a number of trajectories, often by means of latent class analysis (LCA), followed by statistical modelling where membership of trajectories as an outcome is described by relevant determinants. One example from type II diabetes research is the use of regression modelling to characterize obesity patterns by means of available baseline information; another example is characterization of child growth using baseline information [3, 4]. Combined estimation of the latent class and mixed model structures in one step is also possible and it has been used for modelling cerebral and functional aging [5]. In short, these methods are powerful for characterizing features that are associated with certain trajectories.

The latter approach, which is the topic of this tutorial, does not focus on characterizing static pattern membership but on dynamic transition between patterns as time evolves. Such a characterization makes sense in case the longitudinal data correspond to having observed participants at several distinct and well-defined time points that are well-separated in time, as is common in randomized trials and, more generally, in some types of cohort studies. The corresponding statistical method is called latent transition analysis (LTA). In contrast to above LCA-type methods where pattern membership of each participant is determined once and assumed constant over time, LTA allows for pattern membership to change over time, capturing the dynamics in participants’ behaviour over time. LTA has been applied to explore and explain transitions in substance, tobacco, and nicotine use, changes between dietary patterns and between psychological needs profiles [6–9]; several of these examples are secondary analyses of data from randomized controlled trials where the interaction between transitions and interventions was of interest.

Currently, LTA is mostly carried out using highly specialized commercial statistical software, including Latent GOLD^® with the Adv/Syntax add-on (https://www.statisticalinnovations.com) and Mplus (https://www.statmodel.com) [10]. A freely available add-on procedure is available for SAS [11]. These software solutions rely on fitting multinomial models. There is also similar functionality in R for carrying out LTA [12]. Specifically, the package "LMest" can be used for estimating transition probabilities and also for investigating how transitions are affected by covariates, assuming a multinomial regression model [13]. There also exists an R package "MplusAutomation" providing convenience functions for use with the above-mentioned software Mplus.

The aim of this tutorial is to demonstrate that LTA can be carried out in a very modular way through the open source statistical environment R. This approach involves a combination of modelling steps, including latent class analysis and logistic regression. Moreover, this approach also provides a much greater flexibility in terms of model specification, modelling assumptions, and handling of missing values, as compared to existing software solutions that rely on multinomial models. R code is shown for the re-analysis of published open-access data, previously analyzed using commercial software.

Materials and methods

Data

To study well-being in early adolescence and, specifically, to explore how well-being changes over time, a longitudinal study with two waves in autumn 2019 and 2020 was carried out in Switzerland [14]. At each wave the participants were asked to fill in an online questionnaire. In order to construct latent classes, six indicators were used as detailed below. Specifically, through the online questionnaire, the participating high school students were asked about three hedonic indicators (life satisfaction (range 1 to 7), self-esteem (range 1 to 4), and well-being (range 1 to 5)) and three eudemonic indicators (self-efficacy, self-determination, and satisfaction with grades at school (range 1 to 4 for all three indicators)); these indicators were based on validated scales. For each indicator, Likert scales were originally used, but only dichotomized versions (low vs. high) were available in the accompanying open-access dataset. In addition, a number of potential predictor variables were recorded: age, gender, socio-economic status (SES) based on information on parental education level (0 to 5, with 0 representing the highest education level and 5 the lowest), migration background (yes or no), and school level based on academic performance and teacher recommendations (low or high) [14].

Latent transition analysis

In principle, LTA may be carried out in a single estimation step assuming a simultaneous statistical model including LCA (for a fixed number of classes). Subsequently transition probabilities and odds ratios are estimated [13]. However, LTA is often carried out in two separate steps: LCA and multinomial regression. In this tutorial, an alternative approach is proposed where the multinomial regression model has been replaced by multiple logistic regression models, leading to three separate analytic steps, which can conveniently be handled sequentially, one at a time.

Step 1.

In the first step, LCA was applied to each wave separately to identify distinct latent classes that capture salient features in the indicator variables. In short, LCA corresponds to fitting a mixture model, assuming that the data follow a distribution consisting of a mixture of several component distributions (for categorical indicators, as used here, these are distributions of categorical responses rather than normal distributions), interpreted as the underlying distributions that generate the latent classes. Estimation aims to find the model parameters that assign the highest likelihood to the data (the global maximum of the likelihood function is being searched for), and it is commonly carried out for a fixed number of latent classes, which needs to be specified in advance. However, multiple models for different choices of the number of latent classes may easily be fitted and subsequently compared. In our analysis, LCA models were fitted assuming 2 to 5 latent classes [15]. To avoid ending up with sub-optimal model fits due to local maxima found during the estimation, which is not uncommon in LCA, one approach is to estimate the LCA models multiple times with different starting values for the parameter estimates [15]. Trying out a range of starting values should be done routinely when fitting LCA models.
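The role of multiple starting values can be illustrated with a minimal sketch in base R. It is a hand-written expectation-maximization (EM) routine for dichotomous indicators, run from several random starts, and it is not the implementation used by poLCA(). The simulated data and the names joint_prob() and em_lca() are assumptions introduced purely for illustration.

```r
set.seed(2024)

# Illustrative data: 400 respondents, 6 dichotomous indicators, 2 true classes
n <- 400
J <- 6
true_class <- rbinom(n, 1, 0.4) + 1
true_theta <- rbind(rep(0.2, J), rep(0.8, J))  # P(indicator = 1 | class)
y <- matrix(rbinom(n * J, 1, true_theta[true_class, ]), n, J)

# Joint probability of each respondent's responses and each class (n x K matrix)
joint_prob <- function(y, prev, theta) {
  logp <- y %*% t(log(theta)) + (1 - y) %*% t(log(1 - theta))
  sweep(exp(logp), 2, prev, "*")
}

# EM for a K-class model, started from random item-response probabilities
em_lca <- function(y, K, n_iter = 300) {
  prev  <- rep(1 / K, K)
  theta <- matrix(runif(K * ncol(y), 0.2, 0.8), K, ncol(y))
  for (iter in seq_len(n_iter)) {
    joint <- joint_prob(y, prev, theta)
    post  <- joint / rowSums(joint)             # E-step: posterior class probabilities
    prev  <- colMeans(post)                     # M-step: class prevalences
    theta <- (t(post) %*% y) / colSums(post)    # M-step: item-response probabilities
    theta <- pmin(pmax(theta, 1e-6), 1 - 1e-6)  # keep estimates away from 0 and 1
  }
  list(loglik = sum(log(rowSums(joint_prob(y, prev, theta)))),
       prev = prev, theta = theta)
}

# Ten random starts; keep the fit with the highest log-likelihood
fits    <- lapply(1:10, function(i) em_lca(y, K = 2))
logliks <- sapply(fits, function(f) f$loglik)
best    <- fits[[which.max(logliks)]]

round(logliks, 2)    # identical values suggest that all starts reached the same maximum
round(best$prev, 2)  # estimated class prevalences (class labels are arbitrary)
```

In a simple setting like this one the starts may well agree; with more classes or sparser data, differing log-likelihoods across starts would indicate local maxima.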

The optimal number of latent classes is often decided on using an information criterion, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), or adjusted versions of AIC and BIC [16]. Following the approach of the original study, AIC and the sample-size adjusted BIC (aBIC) were reported [17]. A difference of less than 10 in an information criterion is taken to mean that the two model fits do not differ appreciably [18]. However, other considerations, such as the interpretability of results, may also play an important role in deciding on the optimal number of latent classes [2, 9]. LCA produced prevalences of identified latent classes and item-response probabilities (conditional on a latent class). Item-response probabilities denote the probability of responding yes/no to a specific indicator variable, given the latent class status, and they were used to distinguish between latent classes and, consequently, to assign a meaningful label to each latent class. In addition, item-response probabilities at wave 1 and 2 were compared visually to assess measurement invariance across time, i.e., similar probabilities would support the measurement invariance assumption that the same latent class structure was observed both at wave 1 and 2 [6]. As measurement invariance is a modelling assumption, a visual assessment seems sensible and in line with common practice for other statistical models such as linear regression; statistical tests for model assumptions should be avoided and an element of subjective judgment should be accepted [19, 20]. In case the assumption of measurement invariance is not tenable then the proposed approach is still applicable. The interpretation of results may, however, become more challenging (but in some cases most likely also more realistic) as it would mean that individuals transition between different classes over time. For instance, it might be difficult to entertain measurement invariance for large gaps between waves.
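As a worked illustration of how such criteria are computed and compared, the following sketch uses the standard definitions of AIC, BIC, and the sample-size adjusted BIC, in which the sample size n is replaced by (n + 2)/24. The log-likelihood values are hypothetical numbers chosen for illustration and are not results from this tutorial; the parameter count assumes six dichotomous indicators.

```r
# Hypothetical maximized log-likelihoods for LCA fits with 2 to 5 classes
# (illustrative numbers only, not results from the tutorial)
n_obs   <- 377   # number of respondents
n_items <- 6     # dichotomous indicators
classes <- 2:5
loglik  <- c(-1250.4, -1228.9, -1224.1, -1220.3)

# Free parameters: (classes - 1) prevalences + classes * n_items response probabilities
n_par <- (classes - 1) + classes * n_items

aic  <- -2 * loglik + 2 * n_par
bic  <- -2 * loglik + n_par * log(n_obs)
abic <- -2 * loglik + n_par * log((n_obs + 2) / 24)  # sample-size adjusted BIC

fit_table <- data.frame(classes, n_par, aic, bic, abic)
fit_table$delta_aic <- fit_table$aic - min(fit_table$aic)  # difference to best AIC
print(fit_table)
```

Models whose difference to the smallest value is below 10 would be regarded as fitting about equally well, in which case the smallest number of classes among them is the parsimonious choice.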

In Steps 2 and 3, as detailed below, multiple logistic regression models were fitted as a flexible alternative to multinomial regression models. It has been shown that there is almost no loss in efficiency when using logistic regression instead of multinomial regression [21].

Step 2.

Logistic regression models were used to estimate transition probabilities. Specifically, a binary outcome variable was defined for each class at wave 2, indicating whether or not the participant belongs to that particular class. Separate logistic regression models were fitted for each of these binary outcomes, with class membership at wave 1 as the only independent (categorical) variable. The number of logistic regression models that were fitted corresponded to the number of classes at wave 2. As results from logistic regression, by tradition, are reported on the log-odds scale, a subsequent back-transformation step was needed. Specifically, back-transformation from the log-odds scale was used to obtain estimated transition probabilities for changing from one latent class at wave 1 to any of the three latent classes at wave 2; corresponding 95% confidence intervals were also obtained through back-transformation.
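The logic of Step 2 can be sketched on simulated complete data (no imputation), using base R only. The class labels, the simulated transition mechanism, and the object names are assumptions made for illustration. With wave 1 class as the only predictor and no intercept, the back-transformed coefficients reproduce the observed row proportions of the cross-tabulation of wave 1 by wave 2 class.

```r
set.seed(1)
classes <- c("low", "mid", "high")
n <- 300

# Simulated class memberships: 60% stay in their class, the rest move at random
lca1 <- factor(sample(classes, n, replace = TRUE), levels = classes)
stay <- runif(n) < 0.6
lca2 <- factor(ifelse(stay, as.character(lca1), sample(classes, n, replace = TRUE)),
               levels = classes)

# One logistic regression per wave 2 class, without intercept,
# back-transformed from the log-odds scale to probabilities
trans <- sapply(classes, function(cl) {
  fit <- glm(as.numeric(lca2 == cl) ~ lca1 - 1, family = binomial)
  plogis(coef(fit))
})
rownames(trans) <- paste("from", classes)
colnames(trans) <- paste("to", classes)
round(trans, 3)

# Observed row proportions for comparison
obs <- prop.table(table(lca1, lca2), 1)
round(obs, 3)

# 95% confidence intervals for transitions into "high", back-transformed
fit_high <- glm(as.numeric(lca2 == "high") ~ lca1 - 1, family = binomial)
est <- coef(summary(fit_high))
ci_high <- plogis(cbind(estimate = est[, "Estimate"],
                        lower = est[, "Estimate"] - 1.96 * est[, "Std. Error"],
                        upper = est[, "Estimate"] + 1.96 * est[, "Std. Error"]))
round(ci_high, 3)
```

Because the confidence limits are computed on the log-odds scale and then back-transformed, they always stay within the interval from 0 to 1.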

Step 3.

Logistic regression models were fitted to estimate ORs that quantify associations between transitions from wave 1 to wave 2 and relevant predictors. Specifically, for each transition of interest, a binary outcome variable was defined indicating whether or not the participant underwent the transition or remained in the same class at wave 2 as at wave 1. Separate logistic regression models were fitted for each of these derived binary outcomes and for each of the following five predictors: age, gender, migration, school, and socio-economic status. Each of these logistic regression models included an interaction term between the class membership at wave 1 and the predictor in question. The number of logistic regression models fitted for each predictor corresponded to the number of transitions of interest. In this tutorial, three transitions were considered: from low to middle wellbeing, low to high wellbeing, and middle to high wellbeing [14]. However, it should be noted that there would be three more transitions to consider (middle to low, high to low, and high to middle), but they seem less relevant in the present context as they corresponded to transitions towards reduced instead of improved well-being.
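A minimal sketch of the Step 3 model on simulated complete data may clarify how the interaction term yields one odds ratio per wave 1 class. The data-generating mechanism and object names are assumptions for illustration. Since the model contains the main effect of wave 1 class and a class-specific age slope, the estimate for a given class coincides with that from a logistic regression fitted to that class alone.

```r
set.seed(42)
n <- 600
classes <- c("low", "mid", "high")
d <- data.frame(
  lca1 = factor(sample(classes, n, replace = TRUE), levels = classes),
  Age  = sample(12:14, n, replace = TRUE)
)

# Simulated wave 2 class: age raises the odds of "mid" only for those starting in "low"
p_mid  <- plogis(-0.3 + 0.6 * (d$Age - 13) * (d$lca1 == "low"))
d$lca2 <- ifelse(runif(n) < 0.3, "high", ifelse(runif(n) < p_mid, "mid", "low"))

# Outcome: "mid" (1) versus "low" (0) at wave 2, with class-specific age effects
fit <- glm(as.numeric(lca2 == "mid") ~ lca1 + lca1:Age, family = binomial,
           data = d, subset = lca2 %in% c("low", "mid"))

# OR per year of age for participants in "low" at wave 1, with 95% CI
est <- coef(summary(fit))["lca1low:Age", ]
or_low <- exp(c(OR    = est[["Estimate"]],
                lower = est[["Estimate"]] - 1.96 * est[["Std. Error"]],
                upper = est[["Estimate"]] + 1.96 * est[["Std. Error"]]))
round(or_low, 2)

# Same OR from a model fitted only to those in "low" at wave 1
fit_low <- glm(as.numeric(lca2 == "mid") ~ Age, family = binomial,
               data = d, subset = lca2 %in% c("low", "mid") & lca1 == "low")
or_low_only <- exp(coef(fit_low)[["Age"]])
round(or_low_only, 2)
```

Selecting the coefficient by name, as done here with "lca1low:Age", avoids relying on the row order of the output, which depends on how the factor levels are ordered.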

Handling missing values.

Missing values for the latent class membership were handled using multiple imputation through chained equations (MICE) based on the six input variables at wave 1 and 2, predictors to be investigated (age, gender, socio-economic status, migration status, and school level), and the latent class membership variables at wave 1 and wave 2 (in total 19 variables). Based on this dataset, MICE was used to generate 10 complete, imputed datasets. Often a small number of imputed datasets, such as 5 or 10, is used as it suffices for capturing the uncertainty in the imputation step while keeping the computational burden manageable [22]. Note that a pre-specified random seed was given to ensure reproducibility of results, as was also the case for LCA in the above Step 1. For each imputed dataset, separate logistic regression model fits were obtained. Subsequently, individual parameter estimates were combined using Rubin’s rule, which makes due allowance for the uncertainty introduced through use of multiple imputation [23]. The MICE approach is advantageous as it can handle arbitrary types of variables (e.g., continuous and categorical variables), ensuring that imputed values will be of the same type as the observed values [23]. In this tutorial MICE was applied after LCA but before fitting any logistic regression models, as multiple imputation is useful for statistical inference, yielding estimates and standard errors that reflect the uncertainty in the imputation step.
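The pooling step can be written out by hand for a single coefficient. The sketch below applies the standard form of Rubin’s rule to hypothetical log-odds estimates and standard errors from five imputed datasets; the numbers are made up for illustration. Note that pool() in "mice" additionally derives degrees of freedom for a t reference distribution, whereas the normal quantile 1.96 is used here.

```r
# Hypothetical log-odds estimates and standard errors from m = 5 imputed datasets
est <- c(0.52, 0.47, 0.55, 0.50, 0.46)
se  <- c(0.21, 0.22, 0.20, 0.21, 0.23)
m   <- length(est)

pooled_est  <- mean(est)     # pooled estimate: average over imputations
within_var  <- mean(se^2)    # average within-imputation variance
between_var <- var(est)      # variance of estimates between imputations
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)

# Back-transformation to an odds ratio with an approximate 95% confidence interval
pooled_or <- exp(c(OR    = pooled_est,
                   lower = pooled_est - 1.96 * pooled_se,
                   upper = pooled_est + 1.96 * pooled_se))
round(pooled_or, 2)
```

The between-imputation term makes the pooled standard error larger than the average within-imputation standard error, which is how the uncertainty from the imputation step enters the final inference.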

Software used.

Statistical analyses were carried out in R using the extension packages "mice" (version 3.15) and "poLCA" (version 1.6) [12, 15, 23]. Supplementary material containing an annotated R script, including precise definitions of binary outcome variables used and detailed explanations of model specifications, is available online at zenodo.org (https://doi.org/10.5281/zenodo.10794077). Key steps using R functionality are also explained and shown in boxes in the Results section. A significance level of 0.05 was assumed.

Approach in the original study.

LTA was carried out using the commercial software Mplus. Specifically, latent classes were identified and the corresponding transition probabilities were estimated. Missing values in wave 1 were imputed although the imputation method was not described. Subsequently, for each wave, a multivariate multinomial regression model was fitted with latent class membership as a multinomial outcome and gender, migration background (yes or no), school level, and socio-economic status as predictors. However, these analyses, which were carried out using SPSS (version 25), did not investigate the effects of predictors on transitions but only evaluated cross-sectional effects [14].

Results

Descriptive statistics for wave 1

A total of 377 students completed the questionnaires at both wave 1 and 2. There were 167 female and 190 male respondents, respectively, and 20 respondents missing gender information. The age range was 11–15 years, but almost all respondents were 12–14 years (99%); 13 respondents did not provide information on age. Socio-economic status was reported using five categories (lowest, lower, middle, high, and highest), with most respondents in the middle or lower categories (51%). Data on socio-economic status were missing for 58 respondents. Likewise, there were 13 missing values for the well-being indicator, which will be used to define latent classes, for wave 1.

LTA step 1—estimating latent classes.

Box 1 shows the R code for carrying out LCA, exemplified for wave 1; it can briefly be explained as follows: the LCA requires that input variables are initially combined into a model formula, which will be the first argument for the function poLCA() from the extension package of the same name. The second argument of poLCA() needs to be the data set where the variables in the model formula can be found. Finally, the third argument specifies the number of latent classes to be assumed. The random number generator is initiated by means of the function set.seed() to ensure reproducibility.

Box 1. LCA at wave 1 in R

library(poLCA) # activating the R package "poLCA"

# defining a suitable model formula

varList1 <- cbind(lifesat1, selfeff1, selfacc1, selfdet1, wellbe1, satis1) ~ 1

# fitting LCA models

set.seed(202408143)

lc1.2 <- poLCA(varList1, ltadata, nclass = 2, nrep = 10) # LCA assuming 2 classes

lc1.3 <- poLCA(varList1, ltadata, nclass = 3, nrep = 10)

lc1.4 <- poLCA(varList1, ltadata, nclass = 4, nrep = 10)

lc1.5 <- poLCA(varList1, ltadata, nclass = 5, nrep = 10)

# showing the results

lc1.2

lc1.3

lc1.4

lc1.5

Based on the fitted LCA models, relevant fit statistics were extracted as summarized in Table 1. The model fit statistics indicated that either 2 or 3 latent classes were optimal. This result was found both for the proposed alternative approach and for the original study despite the discrepancy in AIC and aBIC values for wave 1 where missing values were imputed in the original study. Specifically, for both waves, AIC values did not appreciably differ for 3, 4, and 5 classes (as differences were not larger than 10) but AIC was appreciably larger for 2 classes. A similar picture was observed for aBIC. Consequently, a parsimonious choice was to select the smallest number of latent classes among the options with similar AIC and aBIC values, resulting in a three latent class solution; this choice also ensured that sufficient participants were included in each class [16]. The above analysis, as described in Box 1, included the argument nrep = 10 in the function poLCA(), implying that LCA was repeated 10 times with different starting values to make it more likely that the final model fit corresponded to the global maximum and not some local maximum [15].

Table 1. Model fit statistics for the proposed approach and as reported in the original study by Kassis et al. (n = 377).

https://doi.org/10.1371/journal.pone.0317617.t001

The three latent classes were well-separated as item-response probabilities grouped into a low, middle, and high trend (Fig 1). Moreover, the same latent classes were found for both waves, supporting the measurement invariance model assumption [10]. For wave 1, predicted class membership percentages were 21%, 48%, and 31% for the low, middle, and high well-being classes, respectively, compared to 24%, 46%, and 30% in the original study; the small discrepancy was due to the missing values being imputed in the original study. For wave 2, predicted class membership percentages were 31%, 31%, and 38% for the low, middle, and high well-being classes, respectively, compared to 31%, 30.5%, and 38.5% in the original study.

Fig 1. Item-response probabilities for the low, middle, and high well-being latent classes. Solid and dashed lines connect item-response probabilities between the 6 input variables (indicators) for wave 1 and wave 2, respectively.

https://doi.org/10.1371/journal.pone.0317617.g001

LTA step 2—estimating transition probabilities.

In R, multiple imputation is carried out using the package "mice" as shown in Box 2. The first argument of the function mice() was the dataset containing the variables that should be used for the imputation step. In this case all variables needed for the subsequent logistic regression analyses were provided. The second argument specified that 10 complete, imputed datasets should be generated. Note that the seed for the random number generator is again specified using the function set.seed(). Once these datasets were constructed, logistic regression models were fitted to each of these 10 datasets using the function with() with arguments "data" and "exp" (the latter is matched to the formal argument "expr"), specifying the list of imputed datasets and the logistic regression model to be fitted. The logistic regression model is fitted using the model fitting function glm() with an outcome that is defined as being in the high well-being class at wave 2 (yes or no), a single predictor that is the variable denoting the class membership at wave 1, and, finally, the argument needed to inform glm() that a logistic regression model is specified: family = binomial, i.e., logistic regression is a generalized linear model (glm) for a binomially distributed outcome, using the logit link function by default [24]. Subsequently, the use of summary(pool()) implied that results from the 10 separate model fits were combined into a single result, in this case three parameter estimates corresponding to the three latent classes at wave 1. These estimates were then back-transformed to probabilities.

Box 2. Transition probabilities in R

library(mice) # activating the R package "mice"

# constructing a dataset only containing the variables to be imputed/used in the analyses

# lca1 and lca2 are variables on class membership at wave 1 and 2

ltadata.sim <- cbind(ltadata[, c("School_Level", "Gender", "Age", "Migration", "Socioeconomicstatus", "lifesat1", "selfeff1", "selfacc1", "selfdet1", "wellbe1", "satis1",

"lifesat2", "selfeff2", "selfacc2", "selfdet2", "wellbe2", "satis2")], lca1, lca2)

# generating 10 complete, imputed datasets

set.seed(202401233)

ltadata.imputed <- mice(ltadata.sim, m = 10)

# fitting logistic regression to each imputed dataset

ltadata.mice1 <- with(data = ltadata.imputed,

exp = glm(lca2 == "high" ~ lca1 - 1, family = binomial))

# pooling results using Rubin’s rule

logreg1 <- summary(pool(ltadata.mice1))

# transforming back to probability scale

coef1 <- logreg1[, 2]

llim1 <- coef1 - 1.96 * logreg1[, 3]

ulim1 <- coef1 + 1.96 * logreg1[, 3]

data.frame(class = logreg1[, 1],

exp(cbind(coef1, llim1, ulim1)) / (1 + exp(cbind(coef1, llim1, ulim1))))

For both the low and high well-being classes, students were most likely to remain in the same class at wave 2 (Table 2). For the high well-being class, there was a 67% probability of remaining in that class, whereas the probability was somewhat smaller, 56%, for the low well-being class. For the middle well-being class, transitions to the low and high well-being classes at wave 2 were almost equally likely, with probabilities of 27% and 33%, respectively, leaving a 40% probability of remaining in the middle well-being class.

Table 2. Transition probabilities between wave 1 and 2 expressed as percentages (n = 377) for the proposed approach and as reported in the original study by Kassis et al.

https://doi.org/10.1371/journal.pone.0317617.t002

LTA step 3—investigating predictors of transition.

Box 3 shows the R code for fitting a logistic regression model where the effect of a specific predictor on a specific transition was investigated. In this case it was the effect of age on the transition from low to middle (or remain in low); for the other predictors the R code may be found in the online supplementary material. As in Step 2 the logistic regression model was fitted to each of the imputed datasets. The logistic regression model was fitted to the subset corresponding to low or middle well-being at wave 2, and, consequently, the outcome, which was class membership at wave 2, became binary and suitable for logistic regression. Moreover, the model included an interaction between class membership at wave 1 and the predictor age. The model specification involved both the main effect of the class membership variable and the interaction term specified using ":" and not "*". This parameterization ensured that the output contained results that could be directly back-transformed into interpretable odds ratios reported in Table 3 below.

Box 3. Predictors of transition in R

# fitting logistic regression to each imputed dataset

# with the transition to low or middle well-being (middle = 1; low = 0) at wave 2 as outcome

mice.step3 <- with(data = ltadata.imputed,

exp = glm(lca2 == "mid" ~ lca1 + lca1:Age - Age, family = binomial,

subset = lca2 %in% c("low", "mid")))

# pooling results using Rubin’s rule

logreg.step3 <- summary(pool(mice.step3))

# back-transforming to OR for transitioning from low well-being at wave 1

# (row 6 corresponds to "low" at wave 1)

coef.step3 <- as.vector(unlist(logreg.step3[6,])) # row 6 selected

c(exp(c(coef.step3[2], # estimated OR

coef.step3[2] - 1.96*coef.step3[3], # lower limit of 95% CI

coef.step3[2] + 1.96*coef.step3[3])), # upper limit of 95% CI

coef.step3[6]) # p-value

Table 3. Associations between selected transition and predictors (n = 377).

https://doi.org/10.1371/journal.pone.0317617.t003

There was a significant association between age and the low to middle transition, where the odds of transition were increased by 179% (95% CI: [27, 511] %; p = 0.01) per year, whereas odds for transitions from low to high and middle to high well-being class were reduced by 20% and 1% per year, respectively (Table 3).

No other associations between predictors and transitions were significant. Other notable findings included: Male students had higher odds for the transition from low to middle and low to high well-being classes compared to female students (64% and 29%, respectively) but 28% reduced odds for the transition from middle to high. Migration only affected transitions slightly (-14% to 2% change in odds). Students at higher secondary schools had 79% and 87% higher odds for transitioning from low to middle well-being and from low to high well-being, respectively, but 52% reduced odds for transitioning from middle to high well-being.

The high SES group had 20% and 10% higher odds for the transition from low to middle and from low to high well-being, respectively, compared to the middle SES group, but only 1% higher for the transition from middle to high well-being. In contrast, the low SES group had 78% and 55% reduced odds for the transitions from low to middle and from low to high, respectively, but 465% increased odds for the transition from middle to high.

No similar results were reported in the original study where only cross-sectional associations of predictors of class membership at wave 1 and wave 2, respectively, were investigated [14]. It is also worth noting that the effect of age was not investigated at all in the original study. However, in the original study, a gender effect was found at wave 2 but not at wave 1.

Discussion

The proposed alternative approach identified exactly the same three latent classes and found very similar estimated transition probabilities when compared to the results obtained using commercial software in the original study [14]. The findings in Step 3 could not be compared as they were not reported in the original study. However, this tutorial explicitly imputed missing values such that transition probabilities and ORs were estimated using data from all 377 students. Moreover, as a novel contribution and added value compared to the original study, associations between predictors and selected transitions were investigated and indeed resulted in establishing a novel and meaningful association between age and the transition from low to middle well-being; this finding may be a chance finding as any kind of multiplicity adjustment of the reported p-values would render it non-significant. Moreover, it should be noted that many LTAs are secondary analyses of data from studies that were neither designed nor powered to investigate transitions over time. Indeed, LTA may be viewed in the same way as statistical analyses investigating effect modifications (as Step 3 involves interaction terms); these are also often under-powered.

The proposed approach gains flexibility through a stepwise procedure that essentially replaces fitting a multinomial model by fitting a number of logistic regression models. Consequently, this approach has a number of advantages as compared to available software solutions: Arbitrary hierarchical study designs, such as cluster-randomized trials and repeated measurements studies, can be conveniently handled using logistic mixed-effects models with suitable random effects, including random intercepts, which can be different for different models. Similarly, covariate adjustments can be easily included, and they may vary across analyses, e.g., be different for different waves or even transitions. Missing values can be handled flexibly through imputation methods such as MICE [23]. In this tutorial, the same imputation model was used for all logistic regression models, but differential handling of missing values through different collections of imputed datasets would also be possible. Over-dispersion can be addressed and, more generally, robust sandwich-type standard errors may be applied in case of model misspecification [25]. In addition, relative risks and risk differences may be estimated as alternatives to odds ratios where appropriate [26]; this is not easily achieved using available LTA software, which relies on multinomial models with odds ratio as the effect measure. The proposed LTA approach does not require classes to remain consistent across waves. Admittedly, changing classes over time renders interpretation of transitions more difficult but possibly also more realistic, in particular if time gaps between waves are large. In contrast to the proposed approach, the R package "LMest" is of limited value for analyzing more complex study designs involving hierarchical structures and it lacks flexible handling of missing values and is limited to odds ratios as effect measures.

Finally, as an alternative to R, the proposed LTA approach could also be carried out using the commercial, but general-purpose statistical software Stata (StataCorp, College Station, Texas 77845, USA) with the LCA Stata plug-in [27, 28]. The advantage of using R or Stata is that LTA can be carried out step-wise such that powerful functionality can be applied in each step, rendering the statistical analysis very flexible. Output from one step feeds directly into the next step.

The modular structure of the proposed approach also entails some limitations. In contrast to a single joint model for both the LCA step and subsequent estimation of transition probabilities and odds ratios, some uncertainty from the estimation of latent classes in Step 1 will not be propagated to Step 2 and Step 3. Specifically, class membership at wave 1 will be assumed to be known without error. One way to alleviate this problem is to consider modifications of logistic regression that address measurement error [29, 30] or to use a weighting method [31], exploiting the modular structure of the approach even more. Some additional limitations should also be mentioned. There exist several alternatives to the R package "poLCA" for doing LCA, providing additional flexibility, such as the packages "BayesLCA" and "tidySEM" [32, 33]. On a related note, it should be pointed out that the use of LCA as outlined in Step 1 (using poLCA() in R) tacitly assumed independent data, which might be an assumption that is compromised by certain hierarchical structures. Extensions of LCA allowing for dependent data could be considered instead, including the R packages "glca" and "multilevLCA" [13, 34].

It has previously been shown that using multiple logistic regression models instead of a multinomial regression model leads to a very modest loss in efficiency, but it cannot be ruled out that a correctly specified multinomial regression model leads to a small gain in efficiency, especially in case of highly correlated binary outcomes [21]. Specifically, multinomial regression models could also be used in the proposed approach in Step 2 and Step 3 instead of logistic regression, but they introduce more modelling assumptions. However, the stronger assumptions also alleviate some of the problems encountered in case of sparse data, where logistic regression models may face convergence problems. Another limitation is that LTA is only applicable to longitudinal data from randomized trials and repeated cross-sectional studies where time points are usually well defined and the same for all participants. LTA is not suitable for longitudinal data where measurements for each participant may be taken at different time points, as is not unusual in cohort studies. Finally, it is a limitation, and indeed the price for a high degree of flexibility, that the programming needed involves some lines of code, as shown in Boxes 1, 2 and 3, although it should be mentioned that Mplus and SAS, for example, also require programming in order to define suitable matrices prior to carrying out LTA [10].
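The idea of replacing one multinomial model by several logistic regression models can be sketched in base R as follows. The sketch assumes simulated two-wave data with three hypothetical classes ("A", "B", "C") and an assumed transition matrix `P`, and it uses one binary indicator per wave-2 class as the outcome. This is one possible formulation for illustration and not necessarily the exact specification used in Boxes 2 and 3.

```r
# Simulated two-wave data; class labels and transition matrix are hypothetical.
set.seed(1)
n <- 600
lev <- c("A", "B", "C")
class1 <- sample(lev, n, replace = TRUE)

# Assumed true transition probabilities (rows: wave 1, columns: wave 2)
P <- rbind(A = c(0.7, 0.2, 0.1),
           B = c(0.2, 0.6, 0.2),
           C = c(0.1, 0.3, 0.6))
class2 <- vapply(class1, function(k) sample(lev, 1, prob = P[k, ]),
                 character(1), USE.NAMES = FALSE)
dat <- data.frame(class1 = factor(class1, levels = lev),
                  class2 = factor(class2, levels = lev))

# One logistic regression per wave-2 class: the outcome indicates ending up in
# class k, and wave-1 class is the only predictor (no intercept, so each
# coefficient is the logit of one transition probability)
fit_one <- function(k) {
  d <- dat
  d$y <- as.integer(d$class2 == k)
  glm(y ~ 0 + class1, family = binomial, data = d)
}
fits <- lapply(lev, fit_one)
names(fits) <- lev

# Back-transform the logits to estimated transition probabilities
trans <- sapply(fits, function(f) plogis(coef(f)))
rownames(trans) <- lev
round(trans, 3)
```

Without covariates, each row of `trans` reproduces the observed transition proportions and therefore sums to 1. Once covariates or random effects are added to the individual models, this is no longer guaranteed, which is part of the trade-off between flexibility and the joint multinomial formulation discussed above.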

References

1. Johnson SK. Latent profile transition analyses and growth mixture models: A very non-technical guide for researchers in child and adolescent development. New Dir Child Adolesc Dev. 2021; 2021: 111–139. pmid:33634554
2. Herle M, Micali N, Abdulkadir M, Loos R, Bryant-Waugh R, Hübel C, Bulik CM, De Stavola BL. Identifying typical trajectories in longitudinal data: modelling strategies and interpretations. Eur J Epidemiol. 2020; 35: 205–222. pmid:32140937
3. Vistisen D, Witte DR, Tabák AG, Herder C, Brunner EJ, Kivimäki M, et al. Patterns of Obesity Development before the Diagnosis of Type 2 Diabetes: The Whitehall II Cohort Study. PLoS Med. 2014; 11: e1001602. pmid:24523667
4. Aris IM, Rifas-Shiman SL, Li LJ, Kleinman KP, Coull BA, Gold DR, et al. Patterns of body mass index milestones in early life and cardiometabolic risk in early adolescence. Int J Epidemiol. 2019; 48: 157–167. pmid:30624710
5. Proust-Lima C, Philipps V, Liquet B. Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm. J Stat Softw. 2017; 78: 1–56. https://doi.org/10.18637/jss.v078.i02.
6. Lanza ST, Patrick ME, Maggs JL. Latent transition analysis: Benefits of a latent variable approach to modeling transitions in substance use. J Drug Issues. 2010; 40: 93–120. pmid:20672019
7. Oliveira A, Lopes C, Torres D, Ramos E, Severo M. Application of a Latent Transition Model to Estimate the Usual Prevalence of Dietary Patterns. Nutrients. 2021; 13: 133. https://doi.org/10.3390/nu13010133.
8. Huyghebaert-Zouaghi T, Gillet N, Fernet C, Thomas J, Ntoumanis N. Managerial predictors and motivational outcomes of workers’ psychological need states profiles: A two-wave examination. Eur J Work Organ Psy. 2022; 32: 216–233. https://doi.org/10.1080/1359432X.2022.2127354.
9. Simon P, Buta E, Gueorguieva R, Kong G, Morean ME, Camenga DR, et al. Transitions across tobacco use profiles among adolescents: results from the Population Assessment of Tobacco and Health (PATH) study waves 1 and 2. Addiction. 2020; 115: 740–747. pmid:31618491
10. Nylund-Gibson K, Garber AC, Carter DB, Chan M, Arch DAN, Simon O, et al. Ten frequently asked questions about latent transition analysis. Psychol Methods. 2023; 28: 284–300. pmid:35834194
11. Lanza ST, Dziak JJ, Huang L, Wagner A, Collins LM. Proc LCA & Proc LTA users’ guide (Version 1.3.2). The Methodology Center, Penn State; 2015.
12. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2021. Available from: https://www.r-project.org/.
13. Bartolucci F, Pandolfi S, Pennoni F. LMest: An R Package for Latent Markov Models for Longitudinal Categorical Data. J Stat Softw. 2017; 81: 1–38. https://doi.org/10.18637/jss.v081.i04.
14. Kassis W, Janousch C, Sidler P, Aksoy D, Favre C, Ertanir B. Patterns of students’ well-being in early adolescence: A latent class and two-wave latent transition analysis. PLoS ONE. 2022; 17: e0276794. pmid:36454868
15. Linzer DA, Lewis JB. poLCA: an R Package for Polytomous Variable Latent Class Analysis. J Stat Softw. 2011; 42: 1–29. https://www.jstatsoft.org/v42/i10.
16. Weller BE, Bowen NK, Faubert SJ. Latent Class Analysis: A Guide to Best Practice. J Black Psychol. 2020; 46: 287–311. https://doi.org/10.1177/0095798420930932.
17. Nylund KL, Asparouhov T, Muthén BO. Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Struct Equ Modeling. 2007; 14: 535–569. https://doi.org/10.1080/10705510701575396.
18. Burnham KP, Anderson DR. Model Selection and Multimodel Inference. New York: Springer; 2010.
19. Schmitt N, Kuljanin G. Measurement invariance: Review of practice and implications. Hum Resour Manag Rev. 2008; 18: 210–222. https://doi.org/10.1016/j.hrmr.2008.03.003.
20. Welzel C, Brunkert L, Kruse S, Inglehart RF. Non-invariance? An Overstated Problem With Misconceived Causes. Sociol Methods Res. 2023; 52: 1368–1400. https://doi.org/10.1177/0049124121995521.
21. Begg CB, Gray R. Calculation of Polychotomous Logistic Regression Parameters Using Individualized Regressions. Biometrika. 1984; 71: 11–18. https://doi.org/10.2307/2336391.
22. Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res. 2011; 20: 40–49. pmid:21499542
23. van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw. 2011; 45: 1–67. https://doi.org/10.18637/jss.v045.i03.
24. McCullagh P, Nelder JA. Generalized Linear Models (2nd ed.). New York: Routledge; 1989. https://doi.org/10.1201/9780203753736.
25. Zeileis A. Object-Oriented Computation of Sandwich Estimators. J Stat Softw. 2006; 16: 1–16. https://doi.org/10.18637/jss.v016.i09.
26. Gallis JA, Turner EL. Relative Measures of Association for Binary Outcomes: Challenges and Recommendations for the Global Health Researcher. Ann Glob Health. 2019; 85: 137. pmid:31807416
27. Lanza ST, Dziak JJ, Huang L, Wagner AT, Collins LM. LCA Stata plugin users’ guide (Version 1.2). University Park: The Methodology Center, Penn State; 2015.
28. Lund L, Andersen S, Ritz C, Bast LS. Predicting longitudinal changes in patterns of tobacco and nicotine product among adolescents: A Latent Transition Analysis based on the X:IT study. Soc Sci Med. 2024; 352: 117029. https://doi.org/10.1016/j.socscimed.2024.117029.
29. Thoresen M, Laake P. A Simulation Study of Measurement Error Correction Methods in Logistic Regression. Biometrics. 2000; 56: 868–872. pmid:10985228
30. Chen J, Hanfelt JJ, Huang Y. A Simple Corrected Score for Logistic Regression with Errors-in-Covariates. Commun. Stat—Theory Methods. 2015; 44: 2024–2036. https://doi.org/10.1080/03610926.2013.773350.
31. Vermunt JK. Latent Class Modeling with Covariates: Two Improved Three-Step Approaches. Polit Anal. 2010; 18: 450–469. https://doi.org/10.1093/pan/mpq025.
32. White A, Murphy TB. BayesLCA: An R Package for Bayesian Latent Class Analysis. J Stat Softw. 2014; 61: 1–28. http://www.jstatsoft.org/v61/i13/.
33. Van Lissa CJ, Garnier-Villarreal M, Anadria D. Recommended Practices in Latent Class Analysis Using the Open-Source R-Package tidySEM. Struct Equ Modeling. 2023; 31: 526–534. https://doi.org/10.1080/10705511.2023.2250920.
34. Kim Y, Jeon S, Chang C, Chung H. glca: An R Package for Multiple-Group Latent Class Analysis. Appl Psychol Meas. 2022; 46: 439–441. pmid:35812815

