Abstract
One of the main concerns in multidimensional item response theory (MIRT) is to detect the relationship between observed items and latent traits, which is typically addressed by exploratory analysis and factor rotation techniques. Recently, an EM-based L1-penalized log-likelihood method (EML1) was proposed as a viable alternative to factor rotation. Based on the observed test response data, EML1 can yield a sparse and interpretable estimate of the loading matrix. However, EML1 suffers from a high computational burden. In this paper, we consider the coordinate descent algorithm to optimize a new weighted log-likelihood, and consequently propose an improved EML1 (IEML1) which is more than 30 times faster than EML1. The performance of IEML1 is evaluated through simulation studies, and an application to a real data set from the Eysenck Personality Questionnaire is used to demonstrate our methodology.
Citation: Shang L, Xu P-F, Shan N, Tang M-L, Ho GT-S (2023) Accelerating L1-penalized expectation maximization algorithm for latent variable selection in multidimensional two-parameter logistic models. PLoS ONE 18(1): e0279918. https://doi.org/10.1371/journal.pone.0279918
Editor: Mahdi Roozbeh, Semnan University, IRAN, ISLAMIC REPUBLIC OF
Received: May 17, 2022; Accepted: December 16, 2022; Published: January 17, 2023
Copyright: © 2023 Shang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The research of Ping-Feng Xu is supported by the Natural Science Foundation of Jilin Province in China (No. 20210101152JC) and the National Natural Science Foundation of China (No. 11571050). The research of Na Shan is supported by the National Natural Science Foundation of China (No. 11871013). The research of George To-Sum Ho is supported by the Research Grants Council of Hong Kong (No. UGC/FDS14/P05/20) and the Big Data Intelligence Centre in The Hang Seng University of Hong Kong. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Multidimensional item response theory (MIRT) models are widely used to describe the relationship between the designed items and the intrinsic latent traits in psychological and educational tests [1]. Early research on the estimation of MIRT models was confirmatory, where the relationships between the responses and the latent traits are pre-specified by prior knowledge [2, 3]. Under this setting, parameters are estimated by various methods including the marginal maximum likelihood method [4] and Bayesian estimation [5]. However, misspecification of the item-trait relationships in the confirmatory analysis may lead to serious model lack of fit, and consequently, erroneous assessment [6].
To avoid the misfit problem caused by improperly specifying the item-trait relationships, exploratory item factor analysis (IFA) [4, 7] is usually adopted. Exploratory IFA freely estimates the entire item-trait relationships (i.e., the loading matrix) with only some constraints on the covariance of the latent traits. To obtain a simpler loading structure for better interpretation, factor rotation [8, 9] is adopted, followed by a cut-off. Although exploratory IFA and rotation techniques are very useful, they have limitations. For some applications, different rotation techniques yield very different or even conflicting loading matrices. Therefore, it can be arduous to select an appropriate rotation or decide which rotation is the best [10]. In addition, different subjective choices of the cut-off value may lead to a substantial change in the loading matrix [11].
Recently, regularization has been proposed as a viable alternative to factor rotation, and it can automatically rotate the factors to produce a sparse loadings structure for exploratory IFA [12, 13]. Scharf and Nestler [14] compared factor rotation and regularization in recovering predefined factor loading patterns and concluded that regularization is a suitable alternative to factor rotation for psychometric applications. Regularization has also been applied to produce sparse and more interpretable estimations in many other psychometric fields such as exploratory linear factor analysis [11, 15, 16], the cognitive diagnostic models [17, 18], structural equation modeling [19], and differential item functioning analysis [20, 21].
For MIRT models, Sun et al. [12] proposed a latent variable selection framework to investigate the item-trait relationships by maximizing the L1-penalized likelihood [22]. In this framework, one can impose prior knowledge of the item-trait relationships into the estimate of loading matrix to resolve the rotational indeterminacy. Based on the observed test response data, the L1-penalized likelihood approach can yield a sparse loading structure by shrinking some loadings towards zero if the corresponding latent traits are not associated with a test item. Consequently, it produces a sparse and interpretable estimation of loading matrix, and it addresses the subjectivity of rotation approach.
Since the marginal likelihood for MIRT involves an integral over unobserved latent variables, Sun et al. [12] carried out the expectation maximization (EM) algorithm [23] to solve the L1-penalized optimization problem. We denote this method as EML1 for simplicity. In the E-step of EML1, numerical quadrature by fixed grid points is used to approximate the conditional expectation of the log-likelihood. This results in a naive weighted log-likelihood on an augmented data set of size N × G, where N is the total number of subjects and G is the number of grid points. To optimize the naive weighted L1-penalized log-likelihood in the M-step, the coordinate descent algorithm [24] is used, whose computational complexity is O(N × G). However, N × G is usually very large, which leads to a high computational burden for the coordinate descent algorithm in the M-step. As shown by Sun et al. [12], EML1 requires several hours for MIRT models with three to four latent traits. Another limitation of EML1 is that it does not update the covariance matrix Σ of latent traits in the EM iteration. Sun et al. [12] proposed a two-stage method. It first computes an estimate of Σ via a constrained exploratory analysis under identification conditions, and then substitutes the estimated Σ into EML1 as a known Σ to estimate the discrimination and difficulty parameters. However, our simulation studies show that the estimate of Σ obtained by the two-stage method could be quite inaccurate.
Further development for latent variable selection in MIRT models can be found in [25, 26]. Zhang and Chen [25] proposed a stochastic proximal algorithm for optimizing the L1-penalized marginal likelihood. They used the stochastic approximation in the stochastic step, which avoids repeatedly evaluating the numerical integral with respect to the multiple latent traits. However, the choice of several tuning parameters, such as a sequence of step size to ensure convergence and burn-in size, may affect the empirical performance of stochastic proximal algorithm. Xu et al. [26] applied the expectation model selection (EMS) algorithm [27] to minimize the L0-penalized log-likelihood (for example, the Bayesian information criterion [28]) for latent variable selection in MIRT models. In their EMS framework, the model (i.e., structure of loading matrix) and parameters (i.e., item parameters and the covariance matrix of latent traits) are updated simultaneously in each iteration. In the simulation of Xu et al. [26], the EMS algorithm runs significantly faster than EML1, but it still requires about one hour for MIRT with four latent traits.
In this paper, we focus on the classic EM framework of Sun et al. [12] and give an improved EM-based L1-penalized marginal likelihood (IEML1) with the M-step’s computational complexity being reduced to O(2 × G). The fundamental idea comes from the “artificial data” widely used in the EM algorithm for computing maximum marginal likelihood estimation in the IRT literature [4, 29–32]. In Bock and Aitkin (1981) [29] and Bock et al. (1988) [4], “artificial data” are the expected number of attempts and correct responses to each item in a sample of size N at a given ability level. Essentially, “artificial data” are used to replace the unobservable statistics in the expected likelihood equation of MIRT models. It should be noted that, the number of “artificial data” is G but not N × G, as “artificial data” correspond to G ability levels (i.e., grid points in numerical quadrature). As a result, the number of data involved in the weighted log-likelihood obtained in E-step is reduced and the efficiency of the M-step is then improved.
In our IEML1, we use a slightly different artificial data to obtain the weighted complete data log-likelihood [33] which is widely used in generalized linear models with incomplete data. Specifically, we classify the N × G augmented data into 2 × G artificial data (z, θ(g)), where z (equals to 0 or 1) is the response to one item and θ(g) is one discrete ability level (i.e., grid point value). Thus, we obtain a new weighted L1-penalized log-likelihood based on a total number of 2 × G artificial data (z, θ(g)), which reduces the computational complexity of the M-step to O(2 × G) from O(N × G).
In addition, it is crucial to choose the grid points being used in the numerical quadrature of the E-step for both EML1 and IEML1. There are various papers that discuss this issue in non-penalized maximum marginal likelihood estimation in MIRT models [4, 29, 30, 34]. To the best of our knowledge, there is however no discussion about the penalized log-likelihood estimator in the literature. In this paper, we will give a heuristic approach to choose artificial data with larger weights in the new weighted log-likelihood. Based on this heuristic approach, IEML1 needs only a few minutes for MIRT models with five latent traits.
The rest of the article is organized as follows. In Section 2, we introduce the multidimensional two-parameter logistic (M2PL) model as a widely used MIRT model, and review the L1-penalized log-likelihood method for latent variable selection in M2PL models. In Section 3, we give an improved EM-based L1-penalized log-likelihood method for M2PL models with unknown covariance of latent traits. In Section 4, we conduct simulation studies to compare the performance of IEML1, EML1, the two-stage method [12], a constrained exploratory IFA with hard-threshold (EIFAthr) and a constrained exploratory IFA with optimal threshold (EIFAopt). In Section 5, we apply IEML1 to a real dataset from the Eysenck Personality Questionnaire. A concluding remark is provided in Section 6.
2 Latent variable selection in multidimensional two-parameter logistic models
In this section, the M2PL model that is widely used in MIRT is introduced. Furthermore, the L1-penalized log-likelihood method for latent variable selection in M2PL models is reviewed.
2.1 Multidimensional two-parameter logistic model
Consider a J-item test that measures K latent traits of N subjects. Let Y = (yij)N×J be the dichotomous observed responses to the J items for all N subjects, where yij = 1 represents the correct response of subject i to item j, and yij = 0 represents the wrong response. Let θi = (θi1, …, θiK)T be the K-dimensional latent traits to be measured for subject i = 1, …, N. The relationship between the jth item response and the K-dimensional latent traits for subject i can be expressed by the M2PL model as follows
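$$P(y_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{\exp(a_j^{T}\theta_i + b_j)}{1 + \exp(a_j^{T}\theta_i + b_j)}, \tag{1}$$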
where aj = (aj1, …, ajK)T and bj are known as the discrimination and difficulty parameters, respectively. The parameter ajk ≠ 0 implies that item j is associated with latent trait k. P(yij = 1|θi, aj, bj) denotes the probability that subject i correctly responds to the jth item based on his/her latent traits θi and item parameters aj and bj. For the sake of simplicity, we use the notation A = (a1, …, aJ)T, b = (b1, …, bJ)T, and Θ = (θ1, …, θN)T. The discrimination parameter matrix A is also known as the loading matrix, and the corresponding structure is denoted by Λ = (λjk) with λjk = I(ajk ≠ 0).
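As a minimal sketch of the response function in Eq (1), the following evaluates the M2PL probability for one subject and one item. The function name `m2pl_prob` and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def m2pl_prob(theta, a, b):
    """P(y = 1 | theta, a, b) under the M2PL model of Eq (1)."""
    linear = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + b
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical item that loads on traits 1 and 3 only (a_j2 = 0).
a_j = [1.2, 0.0, 0.8]
b_j = -0.5
theta_i = [0.5, -1.0, 0.25]
p = m2pl_prob(theta_i, a_j, b_j)

# Changing the trait with a zero loading leaves the probability unchanged.
p_shifted = m2pl_prob([0.5, 3.0, 0.25], a_j, b_j)
```

Because a_j2 = 0 in this example, the second latent trait has no influence on the response probability, which is exactly what λ_j2 = 0 encodes in the loading structure.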
In M2PL models, several general assumptions are adopted. The latent traits θi, i = 1, …, N, are assumed to be independent and identically distributed, and follow a K-dimensional normal distribution N(0, Σ) with zero mean vector and covariance matrix Σ = (σkk′)K×K. Furthermore, the local independence assumption is adopted, that is, given the latent traits θi, the responses yi1, …, yiJ are conditionally independent.
To guarantee parameter identification and resolve the rotational indeterminacy for M2PL models, some constraints should be imposed. To identify the scale of the latent traits, we assume the variances of all latent traits are unity, i.e., σkk = 1 for k = 1, …, K. Dealing with the rotational indeterminacy issue requires additional constraints on the loading matrix A. We adopt the constraints used by Sun et al. [12] and Xu et al. [26], that is, each of the first K items is associated with only one latent trait separately, i.e., ajj ≠ 0 and ajk = 0 for 1 ≤ j ≠ k ≤ K. In practice, the constraint on A should be determined according to prior knowledge of the items and the entire study.
2.2 Latent variable selection based on L1-penalized method
The response function for the M2PL model in Eq (1) takes a logistic regression form, where yij acts as the response, the latent traits θi as the covariates, and aj and bj as the regression coefficients and intercept, respectively. We are interested in exploring the subset of the latent traits related to each item, that is, in finding all non-zero ajk. This can be viewed as a variable selection problem in a statistical sense.
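Under the local independence assumption, the likelihood function of the complete data (Y, Θ) for the M2PL model can be expressed as follows
$$L(A, b, \Sigma \mid Y, \Theta) = \prod_{i=1}^{N} \varphi(\theta_i \mid \Sigma) \prod_{j=1}^{J} P(y_{ij} = 1 \mid \theta_i, a_j, b_j)^{y_{ij}} \left[1 - P(y_{ij} = 1 \mid \theta_i, a_j, b_j)\right]^{1 - y_{ij}}, \tag{2}$$
where φ(θi|Σ) is the density function of latent trait θi. The log-likelihood function of observed data Y can be written as
$$l(A, b, \Sigma \mid Y) = \sum_{i=1}^{N} \log \int \varphi(\theta_i \mid \Sigma) \prod_{j=1}^{J} P(y_{ij} = 1 \mid \theta_i, a_j, b_j)^{y_{ij}} \left[1 - P(y_{ij} = 1 \mid \theta_i, a_j, b_j)\right]^{1 - y_{ij}} \, d\theta_i. \tag{3}$$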
To investigate the item-trait relationships, Sun et al. [12] applied the L1-penalized marginal log-likelihood method to obtain the sparse estimate of A for latent variable selection in the M2PL model. They carried out the EM algorithm [23] with the coordinate descent algorithm [24] to solve the L1-penalized optimization problem. However, the covariance matrix Σ of latent traits is assumed to be known, which is not realistic in real-world applications.
Instead, we will treat Σ as an unknown parameter and update it in each EM iteration. For this purpose, the L1-penalized optimization problem including Σ is represented as
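$$(\hat{A}, \hat{b}, \hat{\Sigma}) = \arg\max_{A, b, \Sigma} \left\{ l(A, b, \Sigma \mid Y) - \eta \|A\|_1 \right\}, \tag{4}$$
where $\|A\|_1 = \sum_{j=1}^{J} \sum_{k=1}^{K} |a_{jk}|$
denotes the entry-wise L1 norm of A. The tuning parameter η > 0 controls the sparsity of A. A larger value of η results in a sparser estimate of A. The tuning parameter is typically chosen by cross validation or certain information criteria. In this paper, we employ the Bayesian information criterion (BIC) as described by Sun et al. [12].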
3 Implementation of the EM algorithm
Due to the presence of the unobserved variables (i.e., the latent traits Θ), the parameter estimates in Eq (4) cannot be directly obtained. Sun et al. [12] carried out EML1 to optimize Eq (4) with a known Σ. Similarly, we first give a naive implementation of the EM algorithm to optimize Eq (4) with an unknown Σ. Then, we give an efficient implementation with the M-step’s computational complexity being reduced to O(2 × G), where G is the number of grid points. Lastly, we give a heuristic approach to choose the grid points used in the numerical quadrature in the E-step.
3.1 A naive implementation of the EM algorithm
The EM algorithm iteratively executes the expectation step (E-step) and maximization step (M-step) until a certain convergence criterion is satisfied. Specifically, the E-step computes the Q-function, i.e., the conditional expectation of the L1-penalized complete log-likelihood with respect to the posterior distribution of latent traits Θ. The M-step maximizes the Q-function. Let Ψ = (A, b, Σ) be the set of model parameters, and Ψ(t) = (A(t), b(t), Σ(t)) be the parameters in the tth iteration. The (t + 1)th iteration is described as follows.
3.1.1 E-step.
In the E-step of the (t + 1)th iteration, under the current parameters Ψ(t), we compute the Q-function involving a Σ-term as follows
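$$Q(\Psi \mid \Psi^{(t)}) = Q_0 + \sum_{j=1}^{J} Q_j, \tag{5}$$
where $Q_0$ is
$$Q_0 = \sum_{i=1}^{N} E\left[\log \varphi(\theta_i \mid \Sigma) \mid y_i, \Psi^{(t)}\right],$$
and for j = 1, …, J, $Q_j$ is
$$Q_j = \sum_{i=1}^{N} E\left[y_{ij} \log F_j(\theta_i) + (1 - y_{ij}) \log\left(1 - F_j(\theta_i)\right) \mid y_i, \Psi^{(t)}\right] - \eta \|a_j\|_1,$$
where $F_j(\theta_i) = P(y_{ij} = 1 \mid \theta_i, a_j, b_j)$ is the response function in Eq (1), $y_i = (y_{i1}, \ldots, y_{iJ})^{T}$, and $\|a_j\|_1 = \sum_{k=1}^{K} |a_{jk}|$
denotes the L1-norm of vector aj. The conditional expectations in Q0 and each Qj are computed with respect to the posterior distribution of θi as follows
$$p(\theta_i \mid y_i, \Psi^{(t)}) = \frac{\varphi(\theta_i \mid \Sigma^{(t)}) \prod_{j=1}^{J} \left[F_j^{(t)}(\theta_i)\right]^{y_{ij}} \left[1 - F_j^{(t)}(\theta_i)\right]^{1 - y_{ij}}}{\int \varphi(\theta \mid \Sigma^{(t)}) \prod_{j=1}^{J} \left[F_j^{(t)}(\theta)\right]^{y_{ij}} \left[1 - F_j^{(t)}(\theta)\right]^{1 - y_{ij}} \, d\theta},$$
where $F_j^{(t)}(\theta_i) = P(y_{ij} = 1 \mid \theta_i, a_j^{(t)}, b_j^{(t)})$, $a_j^{(t)}$ is the jth row of A(t), and $b_j^{(t)}$ is the jth element in b(t).
Note that the conditional expectations in Q0 and each Qj do not have closed-form solutions. They are usually approximated using Gauss-Hermite quadrature [4, 29] or Monte Carlo integration [35]. For simplicity, we approximate these conditional expectations by summations following Sun et al. [12]. Specifically, we choose a set $\mathcal{G}$ of fixed grid points, and the posterior distribution of θi is then approximated by
$$\tilde{p}(\theta_i \mid y_i, \Psi^{(t)}) = \frac{1}{C_i} \varphi(\theta_i \mid \Sigma^{(t)}) \prod_{j=1}^{J} \left[F_j^{(t)}(\theta_i)\right]^{y_{ij}} \left[1 - F_j^{(t)}(\theta_i)\right]^{1 - y_{ij}}, \quad \theta_i \in \mathcal{G}, \tag{6}$$
where $C_i = \sum_{\theta_i \in \mathcal{G}} \varphi(\theta_i \mid \Sigma^{(t)}) \prod_{j=1}^{J} \left[F_j^{(t)}(\theta_i)\right]^{y_{ij}} \left[1 - F_j^{(t)}(\theta_i)\right]^{1 - y_{ij}}$
serves as a normalizing factor. Thus, Q0 can be approximated by
$$\tilde{Q}_0 = \sum_{i=1}^{N} \sum_{\theta_i \in \mathcal{G}} \tilde{p}(\theta_i \mid y_i, \Psi^{(t)}) \log \varphi(\theta_i \mid \Sigma), \tag{7}$$
and Qj for j = 1, …, J is approximated by
$$\tilde{Q}_j = \sum_{i=1}^{N} \sum_{\theta_i \in \mathcal{G}} \tilde{p}(\theta_i \mid y_i, \Psi^{(t)}) \left[y_{ij} \log F_j(\theta_i) + (1 - y_{ij}) \log\left(1 - F_j(\theta_i)\right)\right] - \eta \|a_j\|_1. \tag{8}$$
Hence, the Q-function can be approximated by
$$\tilde{Q}(\Psi \mid \Psi^{(t)}) = \tilde{Q}_0 + \sum_{j=1}^{J} \tilde{Q}_j. \tag{9}$$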
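The grid approximation of the posterior in Eq (6) can be sketched as follows. This is only an illustration under simplifying assumptions: Σ(t) is taken to be the identity matrix so that the normal kernel needs no matrix inverse, K = 2 is used for brevity, and the helper names and item parameters are hypothetical.

```python
import itertools
import math

def m2pl_prob(theta, a, b):
    """Response probability of Eq (1)."""
    return 1.0 / (1.0 + math.exp(-(sum(x * t for x, t in zip(a, theta)) + b)))

def posterior_weights(y_i, grid, A, b):
    """Posterior of theta_i on the grid as in Eq (6), assuming Sigma = I."""
    unnormalized = []
    for theta in grid:
        # N(0, I) kernel; constant factors cancel in the normalization.
        value = math.exp(-0.5 * sum(t * t for t in theta))
        for y, a_j, b_j in zip(y_i, A, b):
            p = m2pl_prob(theta, a_j, b_j)
            value *= p if y == 1 else 1.0 - p
        unnormalized.append(value)
    c_i = sum(unnormalized)  # normalizing factor C_i
    return [u / c_i for u in unnormalized]

# 11 equally spaced points on [-4, 4] per dimension, K = 2.
axis = [-4.0 + 0.8 * m for m in range(11)]
grid = list(itertools.product(axis, repeat=2))

# Hypothetical parameters for J = 3 items.
A = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.9]]
b = [0.0, -0.3, 0.4]
weights = posterior_weights([1, 0, 1], grid, A, b)
```

The returned list holds one weight per grid point for a single subject; stacking these lists over all N subjects gives the N × G weights that enter Eqs (7) and (8).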
3.1.2 M-step.
In the M-step of the (t + 1)th iteration, we maximize the approximation of Q-function obtained by E-step
$$\Psi^{(t+1)} = \arg\max_{\Psi} \tilde{Q}(\Psi \mid \Psi^{(t)}), \tag{10}$$
subject to Σ ≻ 0 and diag(Σ) = 1, where Σ ≻ 0 denotes that Σ is a positive definite matrix, and diag(Σ) = 1 denotes that all the diagonal entries of Σ are unity.
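It can be easily seen from Eq (9) that $\tilde{Q}(\Psi \mid \Psi^{(t)})$ can be factorized as the summation of $\tilde{Q}_0$
involving Σ and $\tilde{Q}_j$
involving (aj, bj). Thus, the maximization problem in Eq (10) can be decomposed into maximizing $\tilde{Q}_0$
and maximizing the penalized $\tilde{Q}_j$
separately, that is,
$$\Sigma^{(t+1)} = \arg\max_{\Sigma \succ 0,\ \mathrm{diag}(\Sigma) = 1} \tilde{Q}_0, \tag{11}$$
and for j = 1, …, J,
$$(a_j^{(t+1)}, b_j^{(t+1)}) = \arg\max_{a_j, b_j} \tilde{Q}_j. \tag{12}$$
For maximization problem (11), $\tilde{Q}_0$ can be represented as
$$\tilde{Q}_0 = -\frac{N}{2} \log|\Sigma| - \frac{N}{2} \mathrm{tr}\left[\Sigma^{-1} S^{*}\right] + \text{const},$$
where tr[⋅] denotes the trace operator of a matrix, const collects the terms that do not depend on Σ, and
$$S^{*} = \frac{1}{N} \sum_{i=1}^{N} \sum_{\theta_i \in \mathcal{G}} \theta_i \theta_i^{T} \, \tilde{p}(\theta_i \mid y_i, \Psi^{(t)}). \tag{13}$$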
Therefore, the optimization problem in (11) is known as a semi-definite programming problem in convex optimization. We can obtain the Σ(t + 1) in the same way as Zhang et al. [36] by applying a proximal gradient descent algorithm [37]. It is noteworthy that in the EM algorithm used by Sun et al. [12], Q0 is a constant and thus need not be optimized, as Σ is assumed to be known.
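The matrix entering the trace term of problem (11), i.e., the posterior-weighted second moment of the grid points in Eq (13), can be computed as in the following sketch. The toy grid and the flat posteriors are hypothetical, and the proximal gradient update for Σ itself is not shown.

```python
def second_moment(post, grid):
    """(1/N) * sum_i sum_g post[i][g] * theta_g theta_g^T, as in Eq (13)."""
    n_subjects, n_traits = len(post), len(grid[0])
    S = [[0.0] * n_traits for _ in range(n_traits)]
    for p_i in post:
        for p, theta in zip(p_i, grid):
            for k in range(n_traits):
                for l in range(n_traits):
                    S[k][l] += p * theta[k] * theta[l] / n_subjects
    return S

# Hypothetical 3 x 3 grid (K = 2) and flat posteriors for N = 2 subjects.
grid = [(x, y) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]
post = [[1.0 / 9] * 9, [1.0 / 9] * 9]
S = second_moment(post, grid)
```

With flat weights on a symmetric grid the off-diagonal entries vanish, which makes the toy result easy to check by hand; real posteriors from the E-step would generally give non-zero off-diagonal entries.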
For maximization problem (12), it is noted that $\tilde{Q}_j$ in Eq (8) can be regarded as the weighted L1-penalized log-likelihood in logistic regression with naive augmented data (yij, θi) and weights $\tilde{p}(\theta_i \mid y_i, \Psi^{(t)})$, where i = 1, …, N and $\theta_i \in \mathcal{G}$. Hence, the maximization problem in Eq (12) is equivalent to variable selection in logistic regression based on the L1-penalized likelihood. Several existing methods such as the coordinate descent algorithm [24] can be directly used.
After solving the maximization problems in Eqs (11) and (12), it is straightforward to obtain the parameter estimates Σ(t + 1) and $(a_j^{(t+1)}, b_j^{(t+1)})$, j = 1, …, J,
for the next iteration.
We call the implementation described in this subsection the naive version since the M-step suffers from a high computational burden. It should be noted that the computational complexity of the coordinate descent algorithm for maximization problem (12) in the M-step is proportional to the sample size of the data set used in the logistic regression [24]. In (12), the sample size (i.e., N × G) of the naive augmented data set $\{(y_{ij}, \theta_i) \mid i = 1, \ldots, N,\ \theta_i \in \mathcal{G}\}$ is usually large, where G is the number of quadrature grid points in $\mathcal{G}$. For example, if N = 1000, K = 3 and 11 quadrature grid points are used in each latent trait dimension, then G = $11^3$ = 1331 and N × G = 1.331 × $10^6$. This leads to a heavy computational burden for maximizing (12) in the M-step. As a result, the EML1 developed by Sun et al. [12] is computationally expensive.
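The arithmetic in this example can be checked with a few lines. The variable names are illustrative only; the last two lines compare the naive size with the 2 × G size stated in the introduction to Section 3.

```python
n_subjects = 1000      # N
n_traits = 3           # K
points_per_dim = 11    # quadrature points per latent trait dimension

n_grid = points_per_dim ** n_traits   # G
naive_size = n_subjects * n_grid      # N x G rows in the naive augmented data
reduced_size = 2 * n_grid             # 2 x G rows in the improved version
ratio = naive_size / reduced_size     # equals N / 2
```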
3.2 An improved EM-based L1-penalized likelihood method
In this subsection, motivated by the idea about “artificial data” widely used in maximum marginal likelihood estimation in the IRT literature [30], we will derive another form of weighted log-likelihood based on a new artificial data set with size 2 × G. Therefore, the computational complexity of the M-step is reduced to O(2 × G) from O(N × G).
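As described in Section 3.1.1, we use the same set of fixed grid points for all θi to approximate the conditional expectation. Let $\mathcal{G} = \{\theta^{(g)} \mid g = 1, \ldots, G\}$
with θ(g) representing a discrete ability level, and let $\tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)})$
denote the value of $\tilde{p}(\theta_i \mid y_i, \Psi^{(t)})$
at θi = θ(g). Using the traditional “artificial data” described in Baker and Kim [30], we can write $\tilde{Q}_j$
as
$$\tilde{Q}_j = \sum_{g=1}^{G} \left[\bar{r}_{jg} \log F_j(\theta^{(g)}) + (\bar{n}_g - \bar{r}_{jg}) \log\left(1 - F_j(\theta^{(g)})\right)\right] - \eta \|a_j\|_1, \tag{14}$$
where $\bar{n}_g = \sum_{i=1}^{N} \tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)})$
is the “expected sample size” at ability level θ(g), and $\bar{r}_{jg} = \sum_{i=1}^{N} y_{ij}\, \tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)})$
is the “expected frequency” of correct response to item j at ability θ(g). Note that, in the IRT literature, $\bar{n}_g$
and $\bar{r}_{jg}$
are known as “artificial data”, and they are applied to replace the unobservable sufficient statistics in the complete data likelihood equation in the E-step of the EM algorithm for computing maximum marginal likelihood estimation [30–32]. If η = 0, differentiating Eq (14), we can obtain a likelihood equation involving the traditional “artificial data”, which can be solved by standard optimization methods [30, 32].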
For L1-penalized log-likelihood estimation, we should maximize Eq (14) for η > 0. Although the coordinate descent algorithm [24] can be applied to maximize Eq (14), some technical details are needed. In this paper, from a novel perspective, we will view $\tilde{Q}_j$ as a weighted L1-penalized log-likelihood of logistic regression based on our new artificial data, inspired by Ibrahim (1990) [33], and maximize $\tilde{Q}_j$
by applying the efficient R package glmnet [24].
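Specifically, we group the N × G naive augmented data in Eq (8) into 2 × G new artificial data (z, θ(g)), where z (equal to 0 or 1) is the response to item j and θ(g) is a discrete ability level. Thus, $\tilde{Q}_j$ in Eq (8) can be rewritten as
$$\begin{aligned} \tilde{Q}_j &= \sum_{i=1}^{N} \sum_{g=1}^{G} \sum_{z=0}^{1} I(y_{ij} = z)\, \tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)}) \left[z \log F_j(\theta^{(g)}) + (1 - z) \log\left(1 - F_j(\theta^{(g)})\right)\right] - \eta \|a_j\|_1 \\ &= \sum_{g=1}^{G} \sum_{z=0}^{1} \left[z \log F_j(\theta^{(g)}) + (1 - z) \log\left(1 - F_j(\theta^{(g)})\right)\right] \sum_{i=1}^{N} I(y_{ij} = z)\, \tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)}) - \eta \|a_j\|_1 \\ &= \sum_{g=1}^{G} \sum_{z=0}^{1} w_{jgz} \left[z \log F_j(\theta^{(g)}) + (1 - z) \log\left(1 - F_j(\theta^{(g)})\right)\right] - \eta \|a_j\|_1, \end{aligned} \tag{15}$$
where $w_{jgz} = \sum_{i=1}^{N} I(y_{ij} = z)\, \tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)})$
is the “expected frequency” of correct or incorrect response to item j at ability θ(g). The second equality in Eq (15) holds since z and Fj(θ(g)) do not depend on yij and the order of the summation is interchanged. Thus, we obtain a new form of weighted L1-penalized log-likelihood of logistic regression in the last line of Eq (15) based on the new artificial data (z, θ(g)) with a weight $w_{jgz}$. Note that, in the notation of Eq (14), $w_{jg1} = \bar{r}_{jg}$
and $w_{jg0} = \bar{n}_g - \bar{r}_{jg}$, so the traditional “artificial data” can be viewed as weights for our new artificial data (z, θ(g)).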
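The regrouping behind Eq (15) can be verified numerically on a toy example. In the sketch below, the posterior weights, the responses to one item, and the values of Fj(θ(g)) are hypothetical inputs; the penalty term is omitted because it is identical in both forms.

```python
import math

def collapse_weights(y_col, post):
    """w[z][g] = sum_i I(y_ij = z) * post[i][g] for a single item j."""
    n_grid = len(post[0])
    w = [[0.0] * n_grid, [0.0] * n_grid]
    for y, p_i in zip(y_col, post):
        for g in range(n_grid):
            w[y][g] += p_i[g]
    return w

def loglik_naive(y_col, post, F):
    """Unpenalized part of Eq (8): a sum over all N x G augmented data."""
    return sum(p * (y * math.log(F[g]) + (1 - y) * math.log(1.0 - F[g]))
               for y, p_i in zip(y_col, post) for g, p in enumerate(p_i))

def loglik_collapsed(w, F):
    """Unpenalized part of Eq (15): a sum over the 2 x G artificial data."""
    return sum(w[z][g] * (z * math.log(F[g]) + (1 - z) * math.log(1.0 - F[g]))
               for z in (0, 1) for g in range(len(F)))

# Toy inputs: N = 4 subjects, G = 3 grid points.
post = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
y_col = [1, 0, 1, 1]
F = [0.25, 0.5, 0.8]

w = collapse_weights(y_col, post)
naive_value = loglik_naive(y_col, post, F)
collapsed_value = loglik_collapsed(w, F)
```

The two values agree up to rounding, and w[0][g] + w[1][g] recovers the expected sample size at grid point g, in line with the relation between the traditional and the new artificial data noted above.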
Since Eq (15) is a weighted L1-penalized log-likelihood of logistic regression, it can be optimized directly via the efficient R package glmnet [24]. This is an advantage of using Eq (15) instead of Eq (14). Moreover, the size of the new artificial data set $\{(z, \theta^{(g)}) \mid z = 0, 1,\ g = 1, \ldots, G\}$ involved in Eq (15) is 2 × G, which is substantially smaller than N × G. This significantly reduces the computational burden for optimizing $\tilde{Q}_j$
in the M-step. We call this version of EM the improved EML1 (IEML1). Since the computational complexity of the coordinate descent algorithm is O(M), where M is the sample size of data involved in the penalized log-likelihood [24], the computational complexity of the M-step of IEML1 is reduced to O(2 × G) from O(N × G).
It is noteworthy that, for yi = yi′ with the same response pattern, the posterior distribution of θi is the same as that of θi′, i.e., $\tilde{p}(\theta^{(g)} \mid y_i, \Psi^{(t)}) = \tilde{p}(\theta^{(g)} \mid y_{i'}, \Psi^{(t)})$. When the sample size N is large, the item response vectors y1, ⋯, yN can be grouped into distinct response patterns, and then the summation in computing the weights in Eq (15)
is not over N, but over the number of distinct patterns, which will greatly reduce the computational time [30].
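To illustrate what the M-step solves for one item, the sketch below maximizes a weighted L1-penalized logistic log-likelihood on 2 × G artificial data. It deliberately swaps in a plain proximal gradient (soft-thresholding) loop rather than the coordinate descent of glmnet, and the function names, toy grid, and weights are all hypothetical.

```python
import itertools
import math

def soft_threshold(v, t):
    """Proximal operator of t * |.|."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_weighted_l1_logistic(thetas, z, w, eta, n_iter=500):
    """Maximize sum_m w_m * loglik_m(a, b) - eta * ||a||_1 (intercept unpenalized)."""
    n_traits = len(thetas[0])
    a, b = [0.0] * n_traits, 0.0
    # Upper bound on the curvature of the weighted negative log-likelihood.
    lipschitz = sum(w_m * (1.0 + sum(t * t for t in th))
                    for th, w_m in zip(thetas, w)) / 4.0
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad_a, grad_b = [0.0] * n_traits, 0.0
        for th, z_m, w_m in zip(thetas, z, w):
            linear = b + sum(a_k * t_k for a_k, t_k in zip(a, th))
            resid = w_m * (z_m - 1.0 / (1.0 + math.exp(-linear)))
            grad_b += resid
            for k in range(n_traits):
                grad_a[k] += resid * th[k]
        b += step * grad_b
        a = [soft_threshold(a[k] + step * grad_a[k], step * eta)
             for k in range(n_traits)]
    return a, b

# Toy artificial data: G = 4 grid points, each paired with z = 0 and z = 1.
grid = list(itertools.product((-1.0, 1.0), repeat=2))
thetas, z, w = [], [], []
for th in grid:
    for z_val in (0, 1):
        thetas.append(th)
        z.append(z_val)
        # Weight 3 when z agrees with the sign of the first trait, else 1.
        w.append(3.0 if (th[0] > 0) == (z_val == 1) else 1.0)

a_free, b_free = fit_weighted_l1_logistic(thetas, z, w, eta=0.0)
a_pen, b_pen = fit_weighted_l1_logistic(thetas, z, w, eta=2.0)
a_zero, b_zero = fit_weighted_l1_logistic(thetas, z, w, eta=5.0)
```

In this toy problem only the first trait carries information, so the second loading is driven to zero, and the first loading shrinks from log 3 (η = 0) to log(5/3) (η = 2) and then to exactly zero (η = 5). The cost of each pass is proportional to the 2 × G rows, not to N.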
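The saving from grouping identical response patterns can be sketched as follows. The one-dimensional grid, the item parameters, and the responses are hypothetical, and a standard normal prior is assumed for simplicity.

```python
import math
from collections import Counter

GRID = [-2.0, -1.0, 0.0, 1.0, 2.0]   # 1-D grid for brevity
A = [1.0, 0.8, 1.5]                  # hypothetical discrimination parameters
B = [0.0, -0.5, 0.3]                 # hypothetical difficulty parameters

def posterior(pattern):
    """Grid posterior for one response pattern (standard normal prior)."""
    unnormalized = []
    for t in GRID:
        value = math.exp(-0.5 * t * t)
        for y, a, b in zip(pattern, A, B):
            p = 1.0 / (1.0 + math.exp(-(a * t + b)))
            value *= p if y else 1.0 - p
        unnormalized.append(value)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

Y = [(1, 0, 1), (1, 0, 1), (0, 0, 1), (1, 0, 1), (0, 0, 1), (1, 1, 1)]

# Per subject: N posterior evaluations.
n_bar_subjects = [sum(posterior(y)[g] for y in Y) for g in range(len(GRID))]

# Per pattern: one evaluation per distinct pattern, weighted by its count.
counts = Counter(Y)
post = {pattern: posterior(pattern) for pattern in counts}
n_bar_patterns = [sum(c * post[pattern][g] for pattern, c in counts.items())
                  for g in range(len(GRID))]
```

Here six subjects share three distinct patterns, so the posterior is evaluated three times instead of six while the expected sample sizes are unchanged.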
It should be noted that any fixed quadrature grid points set, such as Gaussian-Hermite quadrature points set, will result in the same weighted L1-penalized log-likelihood as in Eq (15). However, neither the adaptive Gaussian-Hermite quadrature [34] nor the Monte Carlo integration [35] will result in Eq (15) since the adaptive Gaussian-Hermite quadrature requires different adaptive quadrature grid points for different θi while the Monte Carlo integration usually draws different Monte Carlo samples for different θi.
3.3 Heuristic approach for choosing grid points
In the new weighted log-likelihood in Eq (15), the more artificial data (z, θ(g)) are used, the more accurate the approximation of $Q_j$ is, but the heavier the computational burden of IEML1. To reduce the computational burden of IEML1 without sacrificing too much accuracy, we will give a heuristic approach for choosing a few grid points used to compute $\tilde{Q}_j$.
Let us consider a motivating example based on an M2PL model with item discrimination parameter matrix A1 with K = 3 and J = 40, which is given in Table A in S1 Appendix. The grid point set is $\mathcal{G} = \mathcal{G}_1 \times \mathcal{G}_1 \times \mathcal{G}_1$, where $\mathcal{G}_1$
denotes a set of 11 equally spaced grid points on the interval [−4, 4]. Therefore, the size of our new artificial data set used in Eq (15) is 2 × $11^3$ = 2662. Based on one iteration of the EM algorithm for one simulated data set, we calculate the weights of the new artificial data (z, θ(g))
and then sort them in descending order.
Fig 1 (left) gives the histogram of all weights, which shows that most of the weights are very small and only a few of them are relatively large. Fig 1 (right) gives the plot of the sorted weights, in which the top 355 sorted weights are bounded by the dashed line. The sum of the top 355 weights constitutes 95.9% of the sum of all the 2662 weights. This suggests that only a few (z, θ(g)) contribute significantly to $\tilde{Q}_j$. Furthermore, Fig 2 presents scatter plots of our artificial data (z, θ(g)), in which the darker the color of (z, θ(g)), the greater the weight.
It can be seen roughly that most (z, θ(g)) with greater weights are included in {0, 1} × $[-2.4, 2.4]^3$. In fact, artificial data with the top 355 sorted weights in Fig 1 (right) are all in {0, 1} × $[-2.4, 2.4]^3$. These observations suggest that we should use a reduced grid point set
with each dimension consisting of 7 equally spaced grid points on the interval [−2.4, 2.4]. Thus, the size of the corresponding reduced artificial data set is 2 × $7^3$ = 686. In this way, only 686 artificial data are required in the new weighted log-likelihood in Eq (15). Our simulation studies show that IEML1 with this reduced artificial data set performs well in terms of correctly selected latent variables and computing time.
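The sizes of the full and reduced artificial data sets in this example can be reproduced with a short sketch; the helper `cartesian_grid` is a hypothetical name introduced here for illustration.

```python
import itertools

def cartesian_grid(lower, upper, n_points, n_traits):
    """K-fold Cartesian power of n_points equally spaced values on [lower, upper]."""
    step = (upper - lower) / (n_points - 1)
    axis = [lower + step * m for m in range(n_points)]
    return list(itertools.product(axis, repeat=n_traits))

full_grid = cartesian_grid(-4.0, 4.0, 11, 3)
reduced_grid = cartesian_grid(-2.4, 2.4, 7, 3)

# Each grid point is paired with z = 0 and z = 1.
full_size = 2 * len(full_grid)
reduced_size = 2 * len(reduced_grid)
```

Both grids have a spacing of 0.8 per dimension, so the reduced grid keeps the central points of the full grid and drops the outer ones, where the weights were observed to be small.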
In the literature, Xu et al. [26] give a similar approach to choose the naive augmented data (yij, θi) with larger weights for computing Eq (8). In this paper, however, we choose our new artificial data (z, θ(g)) with larger weights
to compute Eq (15).
4 Simulation studies
In this section, we conduct simulation studies to evaluate and compare the performance of our IEML1, the EML1 proposed by Sun et al. [12] and the constrained exploratory IFAs with hard-threshold and optimal threshold. In all methods, we use the same identification constraints described in subsection 2.1 to resolve the rotational indeterminacy. In addition, we also give simulation studies to show the performance of the heuristic approach for choosing grid points. The R codes of the IEML1 method are provided in S4 Appendix.
Here, we consider three M2PL models with the item number J equal to 40. Three true discrimination parameter matrices A1, A2 and A3 with K = 3, 4, 5 are shown in Tables A, C and E in S1 Appendix, respectively. The corresponding difficulty parameters b1, b2 and b3 are listed in Tables B, D and F in S1 Appendix. The non-zero discrimination parameters are generated independently from the uniform distribution U(0.5, 2). The true difficulty parameters are generated from the standard normal distribution. The diagonal elements of the true covariance matrix Σ of the latent traits are set to unity, with all off-diagonal elements being 0.1.
For parameter identification, we constrain items 1, 10, 19 to be related only to latent traits 1, 2, 3 respectively for K = 3, that is, (a1, a10, a19)T in A1 is fixed as a diagonal matrix in each EM iteration. Similarly, items 1, 7, 13, 19 are related only to latent traits 1, 2, 3, 4 respectively for K = 4 and items 1, 5, 9, 13, 17 are related only to latent traits 1, 2, 3, 4, 5 respectively for K = 5.
Two sample sizes (i.e., N = 500, 1000) are considered. For each setting, we draw 100 independent data sets for each M2PL model. We obtain results by IEML1 and EML1 and evaluate them in terms of computational efficiency, correct rate (CR) for the latent variable selection and accuracy of the parameter estimation. The computational efficiency is measured by the average CPU time over 100 independent runs. The CR for the latent variable selection measures the recovery of the loading structure Λ = (λjk), that is, the proportion of loading-structure entries for which $\hat{\lambda}_{jk} = \lambda_{jk}$,
where $\hat{\Lambda} = (\hat{\lambda}_{jk})$
is an estimate of the true loading structure Λ. The following mean squared error (MSE) is used to measure the accuracy of the parameter estimation:
$$\mathrm{MSE}(a_{jk}) = \frac{1}{S} \sum_{s=1}^{S} \left(\hat{a}_{jk}^{(s)} - a_{jk}\right)^2,$$
where $\hat{a}_{jk}^{(s)}$
denotes the estimate of ajk from the sth replication and S = 100 is the number of data sets. The MSE of each bj in b and σkk′ in Σ is calculated similarly to that of ajk.
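The two evaluation criteria can be sketched as below. Averaging the CR over all J × K entries of the loading structure is an assumption made for this illustration, and the function names and toy matrices are hypothetical.

```python
def correct_rate(A_hat, A_true):
    """Share of entries whose zero/non-zero status is recovered.

    Assumption: the average runs over all J x K entries.
    """
    hits = total = 0
    for row_hat, row_true in zip(A_hat, A_true):
        for a_hat, a_true in zip(row_hat, row_true):
            hits += (a_hat != 0) == (a_true != 0)
            total += 1
    return hits / total

def mse(estimates, truth):
    """Mean squared error of one parameter over S replications."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

# Toy example with J = 2 items and K = 2 traits.
A_true = [[1.2, 0.0], [0.5, 0.9]]
A_hat = [[1.0, 0.0], [0.0, 0.3]]
cr = correct_rate(A_hat, A_true)

# Toy estimates of a single loading over S = 3 replications.
mse_value = mse([1.0, 2.0, 3.0], 2.0)
```

In the toy example three of the four entries have the correct zero/non-zero status, so the CR is 0.75.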
4.1 Computational efficiency
We first compare computational efficiency of IEML1 and EML1. To make a fair comparison, the covariance of latent traits Σ is assumed to be known for both methods in this subsection.
In this study, we consider M2PL with A1. We use the fixed grid point set $\mathcal{G} = \mathcal{G}_1 \times \mathcal{G}_1 \times \mathcal{G}_1$, where $\mathcal{G}_1$
is the set of 11 equally spaced grid points on the interval [−4, 4]. In each M-step, the maximization problem in (12) is solved by the R package glmnet for both methods. Because of the long computing time of EML1, we only run the two methods on 10 data sets. For each replication, the initial value of (a1, a10, a19)T is set as the identity matrix, and the other initial values in A are set as 1/J = 0.025. The initial value of b is set as the zero vector. The candidate tuning parameters are given as (0.10, 0.09, …, 0.01) × N, and we choose the best tuning parameter by the Bayesian information criterion as described by Sun et al. [12].
The average CPU times (in seconds) for IEML1 and EML1 are given in Table 1. From Table 1, IEML1 runs at least 30 times faster than EML1. Moreover, IEML1 and EML1 yield comparable results, with the absolute error no more than $10^{-13}$. This numerically verifies that the two methods are equivalent.
4.2 Simulation for the unknown Σ case
In this subsection, we compare our IEML1 with a two-stage method proposed by Sun et al. [12], a constrained exploratory IFA with hard threshold (EIFAthr) and a constrained exploratory IFA with optimal threshold (EIFAopt). In EIFAthr, all parameters are estimated via a constrained exploratory analysis satisfying the identification conditions, and then the estimated discrimination parameters that are smaller than a given threshold are truncated to zero. In the simulation studies, several thresholds, i.e., 0.30, 0.35, …, 0.70, are used, and the corresponding EIFAthr are denoted by EIFA0.30, EIFA0.35, …, EIFA0.70, respectively. In EIFAthr, the threshold is preset subjectively, while in EIFAopt we further choose, in a data-driven manner, the truncated estimates corresponding to the threshold with the minimum BIC value among several given thresholds (e.g., 0.30, 0.35, …, 0.70 used in EIFAthr).
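The hard-thresholding step of EIFAthr can be sketched as follows. Comparing the magnitude of each loading with the threshold is an assumption of this illustration, the estimated loadings are made up, and the BIC-based choice used by EIFAopt is not shown.

```python
def hard_threshold(A_hat, threshold):
    """Set loadings whose magnitude is below the threshold to zero."""
    return [[a if abs(a) >= threshold else 0.0 for a in row] for row in A_hat]

# Thresholds 0.30, 0.35, ..., 0.70 as used in the simulation studies.
thresholds = [round(0.30 + 0.05 * m, 2) for m in range(9)]

# Hypothetical estimated loadings for J = 2 items and K = 3 traits.
A_hat = [[1.10, 0.32, -0.05], [0.41, 0.95, 0.66]]

# Estimated loading structure (1 = non-zero) for each threshold.
structures = {
    thr: [[int(a != 0) for a in row] for row in hard_threshold(A_hat, thr)]
    for thr in thresholds
}
```

Even in this small example the selected structure changes as the threshold moves from 0.30 to 0.70, which is the sensitivity to the cut-off that a data-driven choice tries to address.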
For IEML1, the initial value of Σ is set to be an identity matrix. For the other three methods, a constrained exploratory IFA is first adopted to estimate Σ by the R package mirt with the setting “method = EM”, and the same grid points are used as in subsection 4.1.
We consider M2PL models with A1 and A2 in this study. To compare the latent variable selection performance of all methods, the boxplots of CR are displayed in Fig 3. From Fig 3, IEML1 performs the best, followed by the two-stage method. As we expect, different hard thresholds lead to different estimates and hence different CRs, and it would be difficult to choose a best hard threshold in practice. EIFAopt performs better than EIFAthr. As complements to CR, the false negative rate (FNR), false positive rate (FPR) and precision are reported in S2 Appendix. The boxplots of these metrics show that our IEML1 has very good performance overall.
Fig 4 presents boxplots of the MSE of A obtained by all methods. From Fig 4, IEML1 and the two-stage method perform similarly, and better than EIFAthr and EIFAopt. We can see that a larger threshold leads to a smaller median MSE, although some very large MSEs still occur in EIFAthr.
Figs 5 and 6 show boxplots of the MSE of b and Σ obtained by all methods. Note that EIFAthr and EIFAopt obtain the same estimates of b and Σ, and consequently they produce the same MSE of b and Σ. Therefore, their boxplots for b and Σ are identical and are represented by “EIFA” in Figs 5 and 6. We can see that all methods obtain very similar estimates of b. IEML1 gives significantly better estimates of Σ than the other methods.
4.3 Evaluation on heuristic approach for choosing grid points
As presented in the motivating example in Section 3.3, most of the grid points with larger weights are distributed in the cube [−2.4, 2.4]^3. Intuitively, the grid points for each latent trait dimension can be drawn from the interval [−2.4, 2.4]. In this subsection, we generate three grid point sets denoted by Grid11, Grid7 and Grid5 and compare the performance of IEML1 based on these three grid point sets via a simulation study. Specifically, Grid11, Grid7 and Grid5 are K-ary Cartesian powers of 11, 7 and 5 equally spaced grid points on the intervals [−4, 4], [−2.4, 2.4] and [−2.4, 2.4] in each latent trait dimension, respectively.
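The construction of these grid point sets can be sketched as follows for K = 3. The helper name `make_grid` is ours and is used for illustration only. The sketch also makes the source of the computational saving visible, since the number of grid points is the number of points per dimension raised to the power K.

```python
import itertools
import numpy as np

def make_grid(points_per_dim, half_width, K):
    """K-ary Cartesian power of equally spaced points on [-half_width, half_width]."""
    axis = np.linspace(-half_width, half_width, points_per_dim)
    return np.array(list(itertools.product(axis, repeat=K)))

K = 3  # number of latent traits in this illustration
grid11 = make_grid(11, 4.0, K)  # 11 points per dimension on [-4, 4]
grid7 = make_grid(7, 2.4, K)    # 7 points per dimension on [-2.4, 2.4]
grid5 = make_grid(5, 2.4, K)    # 5 points per dimension on [-2.4, 2.4]

# Each row is one K-dimensional grid point.
sizes = {"Grid11": len(grid11), "Grid7": len(grid7), "Grid5": len(grid5)}
```

For K = 3 the three sets contain 11^3 = 1331, 7^3 = 343 and 5^3 = 125 grid points, respectively.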
Fig 7 summarizes the boxplots of CRs and MSE of parameter estimates by IEML1 for all cases. From Fig 7, we obtain very similar results when Grid11, Grid7 and Grid5 are used in IEML1. Table 2 shows the average CPU time for all cases. The computing time increases with the sample size and the number of latent traits. The simulation studies show that IEML1 can give quite good results in several minutes if Grid5 is used for M2PL with K ≤ 5 latent traits.
In fact, we also tried the grid point set Grid3, in which each dimension uses three grid points equally spaced in the interval [−2.4, 2.4]. However, the numerical quadrature with Grid3 is not accurate enough to approximate the conditional expectation in the E-step. It should be noted that IEML1 may depend on the initial values. In all simulation studies, we use initial values similar to those described for A1 in subsection 4.1. These initial values yield quite good results and are good enough for practical use in real data applications.
5 Real data analysis
In this section, we analyze a data set of the Eysenck Personality Questionnaire given in Eysenck and Barrett [38]. The data set includes 754 Canadian females’ responses (after eliminating subjects with missing data) to 69 dichotomous items, where items 1–25 measure psychoticism (P), items 26–46 measure extraversion (E) and items 47–69 measure neuroticism (N). This data set was also analyzed in Xu et al. [26]. In order to guarantee the psychometric properties of the items, we select those items whose corrected item-total correlation values are greater than 0.2 [39]. The selected items and their original indices are listed in Table 3, with 10, 19 and 23 items corresponding to P, E and N, respectively. Items marked with an asterisk are negatively worded items whose original scores have been reversed.
In the analysis, we designate two items related to each factor for identifiability. Based on the meaning of the items and previous research, we assign items 1 and 9 to P, items 14 and 15 to E, and items 32 and 34 to N. We employ IEML1 to estimate the loading structure and then compute the observed BIC for each candidate tuning parameter in (0.040, 0.038, 0.036, …, 0.002) × N, where N denotes the sample size 754. The minimal BIC value is 38902.46, corresponding to η = 0.02 × N. The parameter estimates of A and b are given in Table 4, and the estimate of Σ is
From the results, most items remain associated with a single trait, while some items are related to more than one trait. Most of these findings are sensible. For example, item 19 (‘Would you call yourself happy-go-lucky?’), designed for extraversion, is also related to neuroticism, which reflects individuals’ emotional stability. Item 49 (‘Do you often feel lonely?’) is also related to extraversion, which is characterized by enjoying going out and socializing. In addition, it is reasonable that item 30 (‘Does your mood often go up and down?’) and item 40 (‘Would you call yourself tense or ‘highly-strung’?’) are related to both neuroticism and psychoticism.
6 Concluding remarks
In this paper, we obtain a new weighted log-likelihood based on a new artificial data set for M2PL models, and consequently we propose IEML1 to optimize the L1-penalized log-likelihood for latent variable selection. We give a heuristic approach for choosing the quadrature points used in the numerical quadrature in the E-step, which reduces the computational burden of IEML1 significantly. IEML1 has three advantages over EML1, the two-stage method, EIFAthr and EIFAopt. First, the computational complexity of the M-step in IEML1 is reduced from O(N × G) to O(2 × G). In our simulation studies, IEML1 needs a few minutes for M2PL models with no more than five latent traits. Second, IEML1 updates the covariance matrix Σ of the latent traits and gives a more accurate estimate of Σ. Third, IEML1 outperforms the two-stage method, EIFAthr and EIFAopt in terms of the CR of the latent variable selection and the MSE of the parameter estimates.
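The reduction from N × G to 2 × G terms can be illustrated numerically for a single item. The sketch below uses the standard collapse of posterior weights into expected counts at each grid point, in the spirit of Bock and Aitkin [29]; it is not a reproduction of the weighted log-likelihood derived in this paper, and all data, weights and probabilities in it are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, G = 200, 25                           # subjects and grid points (hypothetical sizes)
y = rng.integers(0, 2, size=N)           # binary responses to one item
w = rng.dirichlet(np.ones(G), size=N)    # posterior weights; each row sums to 1
p = rng.uniform(0.05, 0.95, size=G)      # P(y = 1 | grid point g) at current parameters

# N x G form: one weighted Bernoulli term per (subject, grid point) pair.
full = np.sum(w * (y[:, None] * np.log(p) + (1 - y[:, None]) * np.log1p(-p)))

# 2 x G form: aggregate the weights once, then use two terms per grid point.
r1 = w.T @ y          # expected number of correct responses at each grid point
r0 = w.T @ (1 - y)    # expected number of incorrect responses at each grid point
collapsed = np.sum(r1 * np.log(p) + r0 * np.log1p(-p))
```

Both forms give the same value, but once `r1` and `r0` are computed in the E-step, each evaluation of the objective in the M-step touches only 2 × G weighted terms.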
The current study will be extended in the following directions in future research. First, we will generalize IEML1 to multidimensional three-parameter (or four-parameter) logistic models, which have received much attention in recent years. Second, other numerical integration methods such as Gaussian-Hermite quadrature [4, 29] and adaptive Gaussian-Hermite quadrature [34] can be adopted in the E-step of IEML1. Gaussian-Hermite quadrature uses the same fixed grid point set for each individual and can be easily adopted in the framework of IEML1. However, further simulation results are needed. Compared to the Gaussian-Hermite quadrature, the adaptive Gaussian-Hermite quadrature produces an accurate, fast-converging solution with as few as two points per dimension for the estimation of MIRT models [34]. Therefore, the adaptive Gaussian-Hermite quadrature also has potential for use in penalized likelihood estimation for MIRT models, although our new weighted log-likelihood in Eq (15) cannot be obtained with it because different grid point sets are applied to different individuals. Third, we will accelerate IEML1 by parallel computing techniques for medium-to-large scale variable selection, as [40] reported larger gains in performance for MIRT estimation by applying parallel computing. Fourth, the new weighted log-likelihood on the new artificial data proposed in this paper will be applied to the EMS in [26] to reduce the computational complexity of the MS-step.
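As a one-dimensional illustration of the fixed-grid Gaussian-Hermite quadrature mentioned above, the sketch below approximates expectations under a standard normal latent trait with five nodes. The item parameters a = 1 and b = 0 and the logit form a·θ + b are assumptions made for illustration only; the sketch is not part of IEML1.

```python
import numpy as np

# Nodes and weights for the weight function exp(-x^2).
nodes, weights = np.polynomial.hermite.hermgauss(5)

# Change of variables so that the rule integrates against the N(0, 1) density.
theta = np.sqrt(2.0) * nodes
w = weights / np.sqrt(np.pi)

def expect(f):
    """Approximate E[f(theta)] for theta ~ N(0, 1) with the 5-point rule."""
    return float(np.sum(w * f(theta)))

# Second moment of N(0, 1); the rule is exact for low-degree polynomials.
second_moment = expect(lambda t: t ** 2)

# Marginal probability of a correct response for a hypothetical item
# with discrimination a = 1 and intercept b = 0.
mean_prob = expect(lambda t: 1.0 / (1.0 + np.exp(-(1.0 * t + 0.0))))
```

The same nodes are shared by all individuals, which is the property that makes this rule compatible with a fixed grid point set.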
Supporting information
S1 Appendix. True discrimination and difficulty parameters in simulation studies.
https://doi.org/10.1371/journal.pone.0279918.s001
(PDF)
S2 Appendix. FNR, FPR and precision of the loading structure in the simulation for the unknown Σ case.
https://doi.org/10.1371/journal.pone.0279918.s002
(PDF)
References
- 1. Reckase MD. Multidimensional Item Response Theory. 1st ed. New York: Springer; 2009.
- 2. Janssen R, De Boeck P. Confirmatory analyses of componential test structure using multidimensional item response theory. Multivariate Behavioral Research. 1999; 34(2): 245–268. pmid:26753937
- 3. Mckinley R. Confirmatory analysis of test structure using multidimensional item response theory. ETS Research Report Series. 1989; 2: i–40.
- 4. Bock RD, Gibbons R, Muraki E. Full-information item factor analysis. Applied Psychological Measurement. 1988; 12(3): 261–280.
- 5. Béguin AA, Glas CAW. MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika. 2001; 66(4): 541–561.
- 6. da Silva MA, Liu R, Huggins-Manley AC, Bazán JL. Incorporating the Q-matrix into multidimensional item response theory models. Educational and Psychological Measurement. 2019; 79(4): 665–687. pmid:32655178
- 7. Cai L. High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika. 2010; 75(1): 33–57.
- 8. Bernaards CA, Jennrich RI. Gradient projection algorithms and software for arbitrary rotation criteria in factor analysis. Educational and Psychological Measurement. 2005; 65(5): 676–696.
- 9. Browne MW. An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research. 2001; 36(1): 111–150.
- 10. Sass DA, Schmitt TA. A comparative investigation of rotation criteria within exploratory factor analysis. Multivariate Behavioral Research. 2010; 45(1): 73–103. pmid:26789085
- 11. Jin S, Moustaki I, Yang-Wallentin F. Approximated penalized maximum likelihood for exploratory factor analysis: An orthogonal case. Psychometrika. 2018; 83(3): 628–649. pmid:29876715
- 12. Sun J, Chen Y, Liu J, Ying Z, Xin T. Latent variable selection for multidimensional item response theory models via L1 regularization. Psychometrika. 2016; 81(4): 921–939.
- 13. Hui FKC, Tanaka E, Warton DI. Order selection and sparsity in latent variable models via the ordered factor LASSO. Biometrics. 2018; 74(4): 1311–1319. pmid:29750847
- 14. Scharf F, Nestler S. Should regularization replace simple structure rotation in exploratory factor analysis? Structural Equation Modeling: A Multidisciplinary Journal. 2019; 26(4): 576–590.
- 15. Hirose K, Konishi S. Variable selection via the weighted group lasso for factor analysis models. The Canadian Journal of Statistics. 2012; 40(2): 345–361.
- 16. Hirose K, Yamamoto M. Sparse estimation via nonconcave penalized likelihood in factor analysis model. Statistics and Computing. 2015; 25(5): 863–875.
- 17. Chen Y, Liu J, Xu G, Ying Z. Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association. 2015; 110(510): 850–866. pmid:26294801
- 18. Liu J, Kang HA. Q-matrix learning via latent variable selection and identifiability. In: von Davier M, Lee YS, editors. Handbook of Diagnostic Classification Models. Cham: Springer; 2019. pp. 247–263.
- 19. Huang PH, Chen H, Weng LJ. A penalized likelihood method for structural equation modeling. Psychometrika. 2017; 82(2): 329–354. pmid:28417228
- 20. Magis D, Tuerlinckx F, De Boeck P. Detection of differential item functioning using the lasso approach. Journal of Educational and Behavioral Statistics. 2015; 40(2): 111–135.
- 21. Tutz G, Schauberger G. A penalty approach to differential item functioning in Rasch models. Psychometrika. 2015; 80(1): 21–43. pmid:24297435
- 22. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996; 58(1): 267–288.
- 23. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B. 1977; 39(1): 1–38.
- 24. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010; 33(1): 1–22. pmid:20808728
- 25. Zhang S, Chen Y. Computation for latent variable model estimation: A unified stochastic proximal framework. Psychometrika. 2022; 87(4): 1473–1502. pmid:35524934
- 26. Xu PF, Shang L, Zheng QZ, Shan N, Tang ML. Latent variable selection in multidimensional item response theory models using the expectation model selection algorithm. British Journal of Mathematical and Statistical Psychology. 2022; 75(2): 363–394. pmid:34918834
- 27. Jiang J, Nguyen T, Rao JS. The E-MS algorithm: Model selection with incomplete data. Journal of the American Statistical Association. 2015; 110(511): 1136–1147. pmid:26783375
- 28. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978; 6(2): 461–464.
- 29. Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika. 1981; 46(4): 443–459.
- 30. Baker FB, Kim SH. Item Response Theory: Parameter Estimation Techniques. 2nd ed. Boca Raton: CRC Press; 2004.
- 31. Zheng C, Meng X, Guo S, Liu Z. Expectation-maximization-maximization: A feasible MLE algorithm for the three-parameter logistic model based on a mixture modeling reformulation. Frontiers in Psychology. 2018; 8:2302. pmid:29354089
- 32. Chen P, Wang C. Using EM algorithm for finite mixtures and reformed supplemented EM for MIRT calibration. Psychometrika. 2021; 86(1): 299–326. pmid:33591556
- 33. Ibrahim JG. Incomplete data in generalized linear models. Journal of the American Statistical Association. 1990; 85(411): 765–769.
- 34. Schilling S, Bock RD. High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika. 2005; 70(3): 533–555.
- 35. Meng XL, Schilling S. Fitting full-information item factor models and an empirical investigation of bridge sampling. Journal of the American Statistical Association. 1996; 91(435): 1254–1267.
- 36. Zhang S, Chen Y, Liu Y. An improved stochastic EM algorithm for large-scale full-information item factor analysis. British Journal of Mathematical and Statistical Psychology. 2020; 73(1): 44–71. pmid:30511445
- 37. Parikh N, Boyd S. Proximal algorithms. Foundations and Trends in Optimization. 2014; 1(3): 127–239.
- 38. Eysenck S, Barrett P. Re-introduction to cross-cultural studies of the EPQ. Personality and Individual Differences. 2013; 54(4): 485–489.
- 39. Kline P. A Handbook of Test Construction: Introduction to Psychometric Design. New York: Methuen; 1986.
- 40. von Davier M. New results on an improved parallel EM algorithm for estimating generalized latent variable models. In: van der Ark LA, Wiberg M, Culpepper SA, Douglas JA, Wang WC, editors. Quantitative Psychology. Cham: Springer; 2017. pp. 1–8. https://doi.org/10.1007/978-3-319-56294-0_1