Abstract
In the analysis of gene expression data, ultrahigh dimensionality and measurement error are ubiquitous features. It is therefore crucial to correct measurement error effects and perform variable selection when fitting a regression model. In this paper, we introduce a Python package BOOME, which refers to a BOOsting algorithm for Measurement Error in binary responses and ultrahigh-dimensional predictors. We primarily focus on logistic regression and probit models with responses, predictors, or both contaminated with measurement error. BOOME aims to address measurement error effects and employs a boosting procedure to perform variable selection and estimation.
Citation: Chen L-P (2022) BOOME: A Python package for handling misclassified disease and ultrahigh-dimensional error-prone gene expression data. PLoS ONE 17(10): e0276664. https://doi.org/10.1371/journal.pone.0276664
Editor: Angelo Moretti, Utrecht University: Universiteit Utrecht, NETHERLANDS
Received: May 7, 2022; Accepted: October 11, 2022; Published: October 27, 2022
Copyright: © 2022 Li-Pang Chen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The gene expression data considered in the manuscript are available in the R package SIS. In this package, one can insert two inputs leukemia.train and leukemia.test to get the dataset, where the first 7129 columns are gene expression values, and the last column is AML (labeled 1) and ALL (labeled 0).
Funding: Chen’s research was supported by the Ministry of Science and Technology (MOST 110-2118-M-004-006-MY2). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
1.1 Motivation
Analysis of gene expression data is a popular topic and deserves careful research development. A motivating example in this paper is the gene expression microarray data collected by [1] and explored in several references (e.g., [2]). The full dataset can be found in the R package SIS. The data contain binary responses, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), and 7128 gene expression levels that were measured using Affymetrix oligonucleotide arrays. In addition, the 72 samples come from two classes, with 47 specimens in class ALL and 25 specimens in class AML. The primary objective of this study is to characterize the relationship between leukemia and gene expression values, and to see how gene expression values explain leukemia. To achieve this goal, a commonly used approach is to build a regression model by treating leukemia and gene expression values as the binary response and the predictors, respectively. To model a binary response, logistic regression or probit models are perhaps the most frequently implemented parametric approaches.
According to this gene expression dataset, ultrahigh dimensionality (p ≫ n) is a challenging feature. Since not every gene expression value is informative, using irrelevant predictors in regression models may affect the performance of classification and induce wrong conclusions. Therefore, performing variable selection to retain important predictors is needed. While variable selection techniques have been widely explored (e.g., [3–5]), those strategies may fail to handle the case where the dimension is much larger than the sample size. The other concern is measurement error in the response and predictors. As discussed in [6–8], gene expression values may be measured imprecisely due to unadjusted machines. Moreover, as commented by [9], it is also possible to falsely record ALL (or AML) as AML (or ALL), known as misclassification, because the microscopy images of AML bone marrow cells contain many immature granulocytes and monocytes, and ALL bone marrow cell microscopy images contain many immature lymphocytes. It is known that ignoring measurement error effects may cause tremendous biases and induce incorrect decisions, such as the false exclusion of truly informative predictors or the false inclusion of irrelevant predictors when performing variable selection (e.g., [6, 10, 11]). Therefore, it is crucial to correct measurement error effects. In particular, unlike existing literature that handles either variable selection or measurement error, the main challenge of this dataset is to correct measurement error and select informative predictors simultaneously under ultrahigh-dimensional data, since measurement error may explicitly affect the performance of variable selection. In other words, truly noninformative predictors may be falsely included if measurement error effects are ignored (e.g., [6, 11, 12]). As a result, it is necessary to suitably adjust for measurement error effects and then use the corrected version to perform variable selection.
1.2 Contributions
To address those concerns, we develop a package BOOME that is now available on https://pypi.org/project/BOOME/0.0.2/. The purpose of this package is to correct two measurement error processes in responses and predictors, and employ the boosting procedure to retain important predictors and estimate nonzero coefficients simultaneously.
In standard analysis of regression models, to estimate unknown parameters, one may need to derive likelihood functions or, more generally, unbiased estimating functions. The resulting estimator can then be obtained by optimizing the constructed estimating functions. In the presence of measurement error, however, naively plugging error-prone predictors into the estimating functions would yield biased estimators (e.g., [10, 13]). Therefore, to address this challenge, as discussed in the measurement error framework, one should derive corrected estimating functions with measurement error effects eliminated before implementing computational algorithms or estimation methods to derive the estimator, which is the standard step in measurement error analysis (e.g., [6, 10–18]). Following this idea, our strategy is to derive a new estimating function with measurement error effects in responses and predictors corrected, and then adopt it to select informative predictors and obtain the corresponding estimators. Specifically, to correct measurement error effects in the binary response, we define the misclassification matrix (e.g., [13], p.131), which is formulated by specificity and sensitivity (e.g., [13], p.70) and will be described in detail in Section 2.2, to adjust for measurement error effects in the responses and derive a new corrected response. Regarding the error-prone predictors, we adopt the sufficient statistic of the predictors and the regression calibration to correct measurement error effects in the predictors. Based on such strategies of measurement error correction, we develop the corrected estimating functions under logistic regression or probit models, respectively. After that, we implement the corrected estimating functions in the boosting algorithm to perform variable selection and estimation (e.g., [19]; [20], p.608). Detailed descriptions of the measurement error corrections and the boosting algorithm are deferred to Sections 3.1 and 3.2, respectively.
1.3 Comparisons
Variable selection and estimation with correction of measurement error have been discussed, and many methods based on different settings have been developed. For example, [21] considered generalized linear models (GLM) and proposed the generalized matrix uncertainty selector (GMUS), whose idea is based on a Taylor series expansion of the GLM mean function around the true, but unknown, predictors. [22] considered parametric and semiparametric regression models with error-prone predictors, and developed a corrected estimating equation to perform variable selection. For survival data with incomplete responses, [6, 11, 12] considered several types of survival models and developed penalized estimating functions to deal with variable selection. [23] proposed the MEBoost method, which adopts the boosting method to select informative variables under error-prone linear regression models. While many methods have been developed, they primarily focus on measurement error in predictors, and little work is available to address measurement error effects in responses. In addition, although the boosting method has been applied to error-prone data, the existing method focuses only on linear regression models, and other types of regression models have not been explored.
In past developments, several packages based on different software have been developed to deal with either measurement error or variable selection. To name a few, for the R software, the two packages glmnet [24] and SIS [25] are popular tools for variable selection. For the Python software, xverse [26] can be adopted to perform feature selection. However, they fail to deal with measurement error effects. On the other hand, the two packages GLSME [27] and mecor [28] in the R software focus on linear models and aim to adjust for measurement error effects in the response and/or predictors, but they cannot deal with variable selection.
Compared with existing packages, the package BOOME has several distinguishing features. Specifically, our package is able to handle ultrahigh dimensionality and mismeasured data simultaneously. Unlike most existing frameworks that focus on measurement error in predictors or continuous responses, our approach extends to measurement error in binary responses (a.k.a. misclassification), whose model structure is more complex than that for continuous responses. Our approach can deal with error-prone responses and predictors simultaneously. Moreover, the boosting iteration may reduce the possibility of falsely excluding important predictors and enhance the accuracy of the estimator. Most importantly, our development is based on the Python language, and, to the best of our knowledge, there is no relevant development among Python packages.
1.4 Organization of this paper
The remainder is organized as follows. In Section 2, we introduce two regression models to characterize binary responses. In addition, we introduce two measurement error models to describe error-prone responses and predictors, respectively. In Section 3, we present the BOOME method. Specifically, we first discuss some valid strategies to handle measurement error effects, and then discuss the boosting method for variable selection and estimation. In Section 4, we introduce the Python package BOOME, including some important functions as well as their implementation. In Section 5, we demonstrate the application of the package BOOME and analyze the gene expression data. Moreover, we also demonstrate simulation studies. Finally, a general discussion is presented in Section 6.
2 Regression models
2.1 Regression models with binary responses
Following the motivating example in Section 1.1, let n = 72 denote the sample size. For i = 1, …, n, let Yi be a binary response where Yi = 1 represents AML and Yi = 0 indicates ALL. Moreover, let Xi be a p-dimensional vector of gene expressions with p = 7128.
In the absence of measurement error effects, our goal is to use the gene expression values Xi to characterize the disease Yi through a p-dimensional vector of parameters β. In the framework of GLM, logistic regression or probit models are commonly used. Specifically, the logistic regression model (LR) is formulated by
P(Yi = 1 | Xi) = exp(Xiᵀβ) / {1 + exp(Xiᵀβ)},  (1)
and the probit model (PM) is given by
P(Yi = 1 | Xi) = Φ(Xiᵀβ),  (2)
where Φ(⋅) is the cumulative distribution function of the standard normal distribution.
To estimate β, a common strategy is to optimize the likelihood function, or equivalently, solve the estimating equation. Specifically, for i = 1, …, n, the estimating function based on (1) is defined as
gLR,i(β) = Xi [Yi − exp(Xiᵀβ) / {1 + exp(Xiᵀβ)}],  (3)
and the estimating function based on (2) is given by
gPM,i(β) = Xi {Yi − Φ(Xiᵀβ)},  (4)
Solving gLR(β) = 0 or gPM(β) = 0 yields the estimator of β.
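As a concrete illustration of solving these estimating equations in the error-free case, the following sketch fits (1) by Newton–Raphson with NumPy; all names here are illustrative and not part of the BOOME package.

```python
import numpy as np

def logistic_score(beta, X, Y):
    """Estimating function g_LR(beta) = sum_i X_i {Y_i - expit(X_i' beta)}."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (Y - p)

def fit_logistic(X, Y, n_iter=25):
    """Solve g_LR(beta) = 0 by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                  # Bernoulli variances
        H = X.T @ (X * W[:, None])         # negative Jacobian of the score
        beta = beta + np.linalg.solve(H, logistic_score(beta, X, Y))
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -1.0, 0.5])     # hypothetical true coefficients
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = fit_logistic(X, Y)
```

The probit equation (4) can be solved analogously by replacing the logistic probability with Φ(Xiᵀβ).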
2.2 Measurement error models
In applications, as discussed in Section 1.1, Yi and Xi might be subject to measurement error due to wrong records by investigators or imprecise measurements by unadjusted machines. Under this scenario, we particularly denote Yi and Xi as unobserved variables, and let Yi* and Xi* denote the surrogate measurements of Yi and Xi, respectively, which are recorded in the data.
We now provide an intuition of modeling error-prone data. Let f(⋅|⋅) represent the conditional distribution for variables indicated by the corresponding arguments, and let f(⋅) denote the marginal or joint distribution of random variables. Following a similar discussion in [13] (Chapter 8), we consider the joint distribution f(Yi*, Xi*, Yi, Xi) and factorize it as

f(Yi*, Xi*, Yi, Xi) = f(Yi* | Xi*, Yi, Xi) f(Xi* | Yi, Xi) f(Yi | Xi) f(Xi)
                    = f(Yi* | Yi, Xi) f(Xi* | Xi) f(Yi | Xi) f(Xi),  (5)

where the second step is obtained by the nondifferential measurement error mechanism. With the marginal distribution f(Xi) left unspecified, the factorization (5) says that inference about f(Yi | Xi) is conducted based on examining f(Yi* | Xi*), with the predictor measurement error process being characterized by f(Xi* | Xi), while f(Yi* | Yi, Xi) facilitates the response measurement error process.
To analyze measurement error effects when constructing regression models, we first need to characterize the relationship between Yi and Yi* as well as Xi and Xi*. Specifically, to connect Yi and Yi*, we consider the conditional probability

πikl ≜ P(Yi* = k | Yi = l, Xi)  (6)

for k, l ∈ {0, 1}, satisfying πi10 + πi00 = 1 and πi01 + πi11 = 1, where πi11 and πi00 are called sensitivity and specificity, respectively, or known as classification probabilities; πi10 and πi01 are known as misclassification probabilities (e.g., [13], p.70). Moreover, to characterize πi01 and πi10, logistic regression (1) or probit models (2) with an additional parameter γ are suitable choices. By the law of total probability, P(Yi* = 1 | Xi) and P(Yi* = 0 | Xi) can be expressed as

(P(Yi* = 1 | Xi), P(Yi* = 0 | Xi))ᵀ = Πi (P(Yi = 1 | Xi), P(Yi = 0 | Xi))ᵀ,  (7)

where

Πi = ( πi11  πi10
       πi01  πi00 )

is called a 2 × 2 misclassification matrix (e.g., [13], p.131) that is assumed to be invertible.
Next, to describe the relationship between Xi* and Xi, we employ the classical measurement error model

Xi* = Xi + ϵi,  (8)

where the ϵi are independently and identically distributed as the normal distribution N(0, Σϵ), with Σϵ being a p × p covariance matrix representing the magnitude of measurement error effects in the predictors. We assume that ϵi is independent of Xi.
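To make the two measurement error models concrete, the following sketch generates data from (6)–(8) with hypothetical, constant misclassification probabilities and error variance; it is an illustration, not the package's generator.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p = 200, 5
X = rng.normal(size=(n, p))            # unobserved true predictors
Y = rng.binomial(1, 0.5, size=n)       # unobserved true binary responses

# Response misclassification, models (6)-(7), with constant probabilities:
pi10, pi01 = 0.1, 0.15                 # P(Y*=1|Y=0) and P(Y*=0|Y=1), assumed
flip_prob = np.where(Y == 1, pi01, pi10)
Y_star = np.where(rng.uniform(size=n) < flip_prob, 1 - Y, Y)

# Classical additive measurement error, model (8), with diagonal Sigma_eps:
sigma_eps2 = 0.5                       # common diagonal entry of Sigma_eps
X_star = X + rng.normal(scale=np.sqrt(sigma_eps2), size=(n, p))
```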
3 Method
3.1 Correction of measurement error
In this section, we describe how to correct measurement error effects in the responses and the predictors.
Motivated by (7), multiplying both sides of (7) by the inverse matrix of Πi yields

(P(Yi = 1 | Xi), P(Yi = 0 | Xi))ᵀ = Πi⁻¹ (P(Yi* = 1 | Xi), P(Yi* = 0 | Xi))ᵀ,  (9)

which indicates that the unobserved response Yi = 1 can be implicitly characterized by Yi* with the adjustment in terms of πi01 and πi10. It motivates us to consider the “corrected” response, denoted Yi**, which satisfies

E(Yi** | Xi) = P(Yi = 1 | Xi),  (10)

suggesting that

Yi** = (Yi* − πi10) / (1 − πi10 − πi01).  (11)

In addition, (11) indicates that E(Yi** | Yi, Xi) = Yi, verifying that (11) is a suitable correction to recover Yi. Moreover, we note that (11) holds regardless of the choice of regression models because it is obtained by the equalities (9) and (10), where the conditional probability can be (1) or (2).
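The unbiasedness property E(Yi** | Yi, Xi) = Yi can be checked numerically; the sketch below conditions on Yi = 1 with assumed values of πi10 and πi01.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
pi10, pi01 = 0.1, 0.2               # assumed misclassification probabilities

# Condition on Y = 1 and generate surrogates: P(Y* = 0 | Y = 1) = pi01
Y_star = np.where(rng.uniform(size=n) < pi01, 0.0, 1.0)

# Corrected response (11)
Y_dd = (Y_star - pi10) / (1.0 - pi10 - pi01)
print(Y_dd.mean())                   # approximately 1, matching E(Y** | Y = 1) = 1
```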
To correct measurement error effects in the predictors, we provide two different strategies for the two models. For the logistic regression model in terms of Yi** and the unobserved Xi, we follow a similar discussion in [18] and aim to replace Xi by its sufficient statistic

Δi(β) = Xi* + Yi** Σϵ β,  (12)

which can be regarded as a correction of Xi*. Replacing Yi and Xi in (1) by (11) and (12) gives the corrected estimating function

gLR,i**(β) = Δi(β) [Yi** − exp{Δi(β)ᵀβ − βᵀΣϵβ/2} / (1 + exp{Δi(β)ᵀβ − βᵀΣϵβ/2})].  (13)
On the other hand, to handle measurement error effects in the probit model, we adopt the regression calibration (e.g., [10], Chapter 4), whose key idea is to replace Xi by the conditional expectation E(Xi | Xi*). By the best linear unbiased prediction, it can be expressed as (e.g., [11, 15])

E(Xi | Xi*) = μX + ΣX ΣX*⁻¹ (Xi* − μX*),

where μX and μX* represent the mean vectors of X and X*, respectively, and ΣX and ΣX* represent the covariance matrices of X and X*, with ΣX = ΣX* − Σϵ under model (8). Since μX = μX*, by the method of moments, we obtain the calibrated predictor

X̂i = X̄* + (Σ̂X* − Σϵ) Σ̂X*⁻¹ (Xi* − X̄*),  (14)

where X̄* = n⁻¹ Σᵢ Xi* and Σ̂X* = (n − 1)⁻¹ Σᵢ (Xi* − X̄*)(Xi* − X̄*)ᵀ are empirical estimates of μX* and ΣX*, respectively. Consequently, replacing Yi and Xi in (2) by (11) and (14) gives the corrected estimating function

gPM,i**(β) = X̂i {Yi** − Φ(X̂iᵀβ)}.  (15)
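A minimal sketch of the moment-based regression calibration, with an assumed diagonal Σϵ, is given below; the calibrated predictors should be closer to the unobserved Xi than the raw surrogates are.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 1000, 4
Sigma_eps = 0.5 * np.eye(p)                  # assumed error covariance
X = rng.normal(size=(n, p))                  # unobserved predictors
X_star = X + rng.multivariate_normal(np.zeros(p), Sigma_eps, size=n)

# Method-of-moments regression calibration:
# X_hat_i = xbar + (S - Sigma_eps) S^{-1} (X*_i - xbar)
xbar = X_star.mean(axis=0)
S = np.cov(X_star, rowvar=False)             # empirical covariance of X*
A = np.linalg.solve(S, S - Sigma_eps)        # S^{-1}(S - Sigma_eps); A^T = (S - Sigma_eps)S^{-1}
X_hat = xbar + (X_star - xbar) @ A
```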
3.2 Boosting algorithm
Let g**(β) denote the unified notation to represent the corrected estimating function (13) or (15). To make variable selection and estimation for β, we adopt the boosting algorithm with the correction of measurement error effects. The proposed method is called BOOME, and the procedure is summarized in Algorithm 1.
Specifically, the algorithm starts with an initial value β(0) taken as the p-dimensional zero vector 0p. Suppose that we run T iterations. For each iteration step t = 1, …, T, we compute the estimating function g**(β) evaluated at the (t − 1)th iterated value β(t−1), and denote it as Δ(t−1). After that, we define the active set

A(t) = { j : |Δj(t−1)| ≥ τ maxk |Δk(t−1)| },

where τ ∈ [0, 1] is a constant and Δj(t−1) is the jth component of the vector Δ(t−1). That is, the active set A(t) aims to retain informative predictors by treating a large magnitude |Δj(t−1)| as a signal. Finally, for those j ∈ A(t), we update the jth component of β(t−1), say βj(t−1), by adding an increment η ⋅ sign(Δj(t−1)) for some positive constant η. Repeating those steps T times yields the final estimator of β.
In Algorithm 1, τ, η, and T can be user-specified and may affect the iteration result. Similar to the comment in [29], the algorithm satisfying Tη → 0 as T → ∞ and η → 0 is approximately equivalent to the LASSO method. Therefore, it is suggested to take η as a small value, such as η = 0.01, in applications. On the other hand, while T is suggested to be large, a large T may cause over-fitting. To provide a suitable T and stop the iteration early, we suggest a criterion: the iteration stops at step T if ‖β(T) − β(T−1)‖ < ξ is satisfied for some positive constant ξ. Finally, for the choice of τ, one may adopt criteria such as cross-validation (e.g., [19]).
Algorithm 1: Boosting Procedure in BOOME
Let β(0) = 0p denote an initial value;
for step t with t = 1, 2, …, T do
 (a) calculate Δ(t−1) = g**(β)|β = β(t−1);
 (b) determine the active set A(t) = { j : |Δj(t−1)| ≥ τ maxk |Δk(t−1)| };
 (c) update βj(t) = βj(t−1) + η ⋅ sign(Δj(t−1)) for all j ∈ A(t), and define βj(t) = βj(t−1) for j ∉ A(t);
end
The final estimator is given by β̂ = β(T).
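The loop in Algorithm 1 can be sketched generically as follows; the helper names and the toy estimating function are illustrative stand-ins, not the package's internals.

```python
import numpy as np

def boome_boost(g, p, T=200, tau=0.9, eta=0.01):
    """Generic boosting loop of Algorithm 1 for an estimating function g(beta).
    Illustrative re-implementation; not the BOOME package's own code."""
    beta = np.zeros(p)                               # beta^(0) = 0_p
    for _ in range(T):
        delta = g(beta)                              # Delta^(t-1) = g**(beta^(t-1))
        active = np.abs(delta) >= tau * np.max(np.abs(delta))
        beta[active] += eta * np.sign(delta[active])  # update active components
    return beta

# Toy example: naive logistic score as g (no measurement error correction)
rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta0 = np.zeros(p); beta0[:2] = 1.0                 # two informative predictors
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))

def g(beta):
    return X.T @ (Y - 1.0 / (1.0 + np.exp(-X @ beta)))

beta_hat = boome_boost(g, p)
selected = np.nonzero(beta_hat)[0]                   # indices of retained predictors
```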
4 Description and implementation of BOOME
We develop a Python package, called BOOME, to implement the variable selection and estimation with measurement error correction described in Section 3. The package BOOME contains three functions: ME_Generate, LR_Boost, and PM_Boost. The function ME_Generate aims to generate artificial data, with error-prone responses and predictors, under the models listed in Section 2.1. The functions LR_Boost and PM_Boost implement the boosting procedure in Algorithm 1, with the difference that LR_Boost is based on the logistic regression model while PM_Boost focuses on the probit model. We now describe the details of these three functions.
4.1 ME_Generate
We use the following command to obtain the artificial data:
where the meaning of each argument is described as follows:
- n: The number of observations.
- beta: A p-dimensional vector of parameter β specified by users.
- matrix: A user-specified covariance matrix implemented in (8).
- X: A user-specific n × p matrix of predictors.
- gamma: A p-dimensional vector of parameter γ in πi10 and πi01 specified by users.
The function ME_Generate returns a list of components:
- data: A dataset with error-prone predictors and responses. It is an n × (p + 1) data frame, where the column with label y represents the error-prone response, and the column with label j, j = 1, …, p, represents the jth error-prone predictor Xij*.
- pr: Two misclassification probabilities πi10 and πi01 in (7).
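The original command listing appears as a figure in the published article; in outline, with the argument names taken from the list above (the exact signature should be checked against the package documentation on PyPI), the call has the form:

```
data, pr = ME_Generate(n, beta, matrix, X, gamma)
```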
4.2 LR_Boost
To demonstrate Algorithm 1 with the corrected estimating function (13) for the logistic regression model, we adopt the following command:
where the meaning of each argument is described as follows:
- X: An n × p matrix of continuous predictors that are precisely measured or subject to measurement error.
- Y: An n-dimensional vector of binary responses that are precisely measured or subject to measurement error.
- ite: The number of iterations T in Algorithm 1.
- thres: A threshold value τ in Algorithm 1.
- correct_X: Determines whether measurement error in the predictors is corrected. Select “1” if correction is needed, and “0” otherwise.
- correct_Y: Determines whether measurement error in the response is corrected. Select “1” if correction is needed, and “0” otherwise.
- pr: Two misclassification probabilities πi10 and πi01 in (7).
- lr: A learning rate η in Algorithm 1.
- matrix: A p × p covariance matrix Σϵ in (8).
The function LR_Boost returns a list of components:
- estimated coefficients: the p-dimensional vector of estimators of β.
- predictors: Indexes of nonzero values in estimated coefficients.
- number of predictors: The number of nonzero values in estimated coefficients.
4.3 PM_Boost
To make variable selection and estimation for probit model by using Algorithm 1 with the corrected estimating function (15), we implement the following function:
The arguments in PM_Boost as well as the output produced by PM_Boost are the same as those in LR_Boost.
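The command listings for LR_Boost and PM_Boost appear as figures in the published article; in outline, with the argument names taken from the list in Section 4.2 (the exact signature and argument order should be checked against the package documentation), the calls have the form:

```
result = LR_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)
result = PM_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)
```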
5 Numerical studies
In this section, we implement the functions in the package BOOME to analyze a real dataset as well as demonstrate simulation studies. Detailed code demonstrations are also available on the pypi website https://pypi.org/project/BOOME/0.0.2/.
5.1 Analysis of gene expression microarray data
In this section, we implement the package BOOME to analyze the gene expression microarray data introduced in Section 1.1. The steps of the analysis are summarized in Fig 1. As shown in Step 1 of Fig 1, we recognize that, for i = 1, …, n, Yi* is the binary random variable with outcomes AML and ALL that may possibly be subject to misclassification, and Xi* represents the gene expression values that are contaminated with measurement error. Before analyzing this dataset, we first standardize all predictors, so that the mean and the variance of each predictor become 0 and 1, respectively. Let data_GE in the Python code represent the gene expression microarray data introduced in Section 1.1, where the first column is the binary outcome and the remaining columns are gene expression values. Based on this dataset, the following code shows the input of the gene expression data and the standardization procedure:
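The original listing is rendered as a figure; a minimal sketch of the standardization step, using a synthetic stand-in for data_GE (the real data come from the R package SIS), is:

```python
import numpy as np

# Synthetic stand-in for data_GE: column 0 is the binary outcome,
# the remaining columns play the role of gene expression values.
rng = np.random.default_rng(5)
data_GE = np.column_stack([rng.binomial(1, 0.4, 50),
                           rng.normal(2.0, 3.0, size=(50, 10))])

y = data_GE[:, 0]
x = data_GE[:, 1:]
x = (x - x.mean(axis=0)) / x.std(axis=0)   # each predictor: mean 0, variance 1
```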
To examine the impact of measurement error effects, we primarily consider four settings in Step 2 of Fig 1:
- Setting 1: Yi* and Xi* are not corrected.
- Setting 2: Yi* is corrected while Xi* is not.
- Setting 3: Xi* is corrected while Yi* is not.
- Setting 4: Yi* and Xi* are corrected.
Here Setting 1 implements Algorithm 1 with the estimating functions (3) or (4), with Yi and Xi replaced by the error-prone variables Yi* and Xi*, respectively. Setting 2 considers the estimating function in Setting 1 with Yi* replaced by the corrected responses (11); and Setting 3 adopts the estimating function in Setting 1 with Xi* replaced by the corrected predictors (12) or (14). Setting 4 applies Algorithm 1 to the estimating functions (13) or (15), where measurement error in both responses and predictors is corrected. As discussed in Section 1.1, both responses and predictors are contaminated with measurement error, so Setting 4 is the proposed method that corrects measurement error effects in responses and predictors. On the other hand, Settings 1-3, known as naive methods, reflect that measurement error in leukemia, gene expressions, or both is not corrected. Basically, Settings 1-3 are considered to show the impact of ignoring measurement error effects and are compared with the proposed method in Setting 4.
In Step 3, we implement the functions in BOOME for the four settings. Since the dataset has no additional information, such as repeated measurements or validation data, to estimate the parameter Σϵ in the measurement error model as well as the two misclassification probabilities πi10 and πi01, we conduct sensitivity analyses, which specify reasonable values for Σϵ and enable us to examine the impact of different magnitudes of measurement error. In our study, we specify Σϵ as a diagonal matrix with diagonal entries being 0.2, 0.5, or 0.7. For the implementation, we take one of these magnitudes as an example and use the following command to specify Σϵ, denoted as matrix:
With Σϵ being specified, we further determine the misclassification probabilities πi10 and πi01. Specifically, since πi10 and πi01 defined in (6) rely on Xi, we reproduce Xi by (8), where Xi* is the observed gene expression values and ϵi is generated from a normal distribution with variance given by 0.2, 0.5, or 0.7. After that, we adopt logistic regression or probit models to characterize (6) with the corresponding parameter specified as γ ≜ 1p, where 1p is a p-dimensional vector with all entries being one. Therefore, values of πi10 and πi01 are obtained. For the demonstration, we summarize the following function pi that implements this idea and computes πi10 and πi01. The resulting values of πi10 and πi01 are denoted as pr:
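The function pi is likewise shown as a figure in the published article; the sketch below conveys the idea with a logistic form and γ = 1p, using the same simplified form for both probabilities (the package's actual formulation may differ):

```python
import numpy as np

def pi(X, gamma):
    """Compute misclassification probabilities (pi_i10, pi_i01) from a
    logistic model in X_i, as in (6). Illustrative sketch only."""
    pr = 1.0 / (1.0 + np.exp(-X @ gamma))   # logistic form in X_i' gamma
    return np.column_stack([pr, pr])        # columns: pi_i10, pi_i01

rng = np.random.default_rng(9)
X = rng.normal(size=(10, 4))
gamma = np.ones(4)                          # gamma = 1_p, as in the text
pr = pi(X, gamma)
```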
We now implement two functions LR_Boost and PM_Boost to analyze the data, where, for logistic regression model, T is given by 1000, η is set as 0.01, and τ is equal to 0.9; for probit models, T is given by 2000, η is set as 0.01, and τ is equal to 0.9.
Detailed implementations of the proposed method are described below. Keep in mind that we demonstrate correct_X = 1 and correct_Y = 1 for the proposed method (Setting 4); different values of the arguments correct_X and correct_Y reflect the other settings mentioned above.
In Step 4 of Fig 1, we report the estimation results. To save space and provide precise information, we summarize the predictors and their estimates that are commonly selected under all considered magnitudes of measurement error (0.2, 0.5, or 0.7), and numerical results for all settings obtained by (1) and (2) are placed in Tables 1 and 2, respectively. Moreover, to see the impacts of different regression models, we summarize the commonly chosen predictors in Table 3.
We first examine Setting 1, where measurement error corrections are not incorporated. Based on BOOME, the logistic regression model retains 45 gene expression values, and the probit model suggests that 53 gene expression values should be included. Next, we explore the cases where either the response or the predictors are corrected. Under Setting 2, the logistic regression model retains 59 gene expression values, and the probit model suggests that 74 gene expression values should be included. Under Setting 3, the logistic regression model retains 42 gene expression values, and the probit model suggests that 36 gene expression values should be included. Finally, under Setting 4, where measurement error effects in the response and the predictors are corrected, the logistic regression model retains 51 gene expression values, and the probit model retains 75 gene expression values.
For the overall comparisons, we first observe that the variable selection result may depend on the correction of measurement error effects in the response and/or the predictors. The number of selected gene expressions under Setting 1 is smaller than that under most other settings. Between the two regression models, the probit model retains more predictors than the logistic regression model does, except under Setting 3. Finally, there are 35, 45, 29, and 8 gene expressions that are commonly selected by the two models under Settings 1-4, respectively.
5.2 Demonstration of simulation studies
To show the validity of the BOOME method as well as the implementation of the package, we conduct simulation studies and demonstrate the programming code in this section.
Let n = 100 denote the sample size, and let p = 1000 or 5000 denote the dimension of predictors. For i = 1, …, n, we generate the p-dimensional vector of predictors Xi from the standard multivariate normal distribution. Let β0 = (1, 1, 1, 0p−3ᵀ)ᵀ denote the true value of the parameters, where 0q represents the q-dimensional zero vector. Given Xi and β0, we generate the binary response Yi.
Noting that {(Yi, Xi): i = 1, …, n} is regarded as unobserved data, we now generate error-prone data {(Yi*, Xi*): i = 1, …, n}. For the generation of error-prone responses Yi*, we adopt the misclassification model (6), where the misclassification probabilities πi10 and πi01 are formulated by logistic regression models. On the other hand, to generate error-prone predictors Xi*, we adopt the classical measurement error model (8), with Σϵ being specified as a diagonal matrix whose diagonal entries take a common value, 0.75 being one considered magnitude.
To see the data generation in detail, we demonstrate the following code. We first specify the generation of predictors:
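The published listing for this step is a figure; a minimal stand-in consistent with the description (standard multivariate normal predictors) is:

```python
import numpy as np

# n = 100 observations, p = 1000 predictors, each X_i drawn from the
# standard multivariate normal distribution.
rng = np.random.default_rng(2022)
n, p = 100, 1000
X = rng.normal(size=(n, p))
```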
Next, we specify the sample size and β0, taking p = 1000 with the β0 given above as an example. Based on this information, we employ the function ME_Generate to generate error-prone data, where data represents the artificial data from the output of the function ME_Generate and pr represents the two misclassification probabilities.
Given the generated data, we define the response y and the predictors x. To implement the BOOME method, we specify the iteration number and the values of τ and η as ite = 1000, thres = 0.9, and lr = 0.00001, respectively. We now implement the function LR_Boost to examine the logistic regression model with measurement error in responses and predictors corrected. Detailed implementation and partial output are given below:
For comparison with the proposed method that corrects measurement error in responses and predictors, we examine naive analyses based on Settings 1-3 in Section 5.1. Detailed implementation and partial outputs are given below. In general, we find from the outputs that the first three estimates based on the proposed method are close to the true value 1, and the selected predictors are the same as the underlying true setting. On the other hand, without correcting measurement error effects, we observe from the results below that the first three estimates have larger biases and are far from the true value 1. Moreover, additional irrelevant predictors are falsely included.
In addition to the logistic regression models, we further examine the probit model based on the four settings described in Section 5.1. Specifically, we implement the function PM_Boost to construct the probit model and specify the arguments (correct_X, correct_Y) as (0,0), (0,1), (1,0), and (1,1), which reflect Settings 1-4 in Section 5.1, respectively. Detailed implementation and partial outputs are available below. Similar to the findings based on logistic regression models, we observe that the estimator with measurement error in responses and predictors corrected outperforms the other scenarios because of smaller biases and precise variable selection. As expected, without suitable correction of measurement error, the estimators incur tremendous biases and some irrelevant predictors are included.
Finally, to precisely assess the accuracy of the estimator β̂, we use the L1-norm and the L2-norm, which are respectively defined as

‖β̂ − β0‖1 = Σᵢ |β̂i − β0,i|  and  ‖β̂ − β0‖2 = {Σᵢ (β̂i − β0,i)²}^(1/2),

where β̂i and β0,i are the ith entries of β̂ and β0, respectively. To assess the performance of variable selection, we examine specificity (SPE) and sensitivity (SEN), which are respectively defined as

SPE = #{j : β̂j = 0 and β0,j = 0} / #{j : β0,j = 0}  and  SEN = #{j : β̂j ≠ 0 and β0,j ≠ 0} / #{j : β0,j ≠ 0}.
Numerical results under all settings described above are reported in Table 4. We observe that the biases in the L1- and L2-norms increase when the magnitude of measurement error and the dimension p become large. As expected, when measurement error in responses and predictors is corrected (Setting 4), the biases are the smallest and SPE as well as SEN are the largest among all settings, which verifies that the proposed method is valid for handling measurement error regardless of the specification of regression models. On the other hand, without correcting measurement error effects, we find that the naive methods (Settings 1-3) produce significant biases and low values of SPE and SEN, indicating worse performance of variable selection. In particular, if measurement error in responses and predictors is not corrected, as in Setting 1, we have the worst estimation results. Comparing Settings 2 and 3, it is interesting to see that the biases under Setting 2 are greater than those under Setting 3, and the values of SPE and SEN obtained under Setting 2 are smaller than those under Setting 3. This implies that ignoring measurement error in the predictors incurs severe biases and is worse than ignoring measurement error effects in the responses.
6 Discussion
In this paper, we introduce the Python package BOOME, which aims to address ultrahigh-dimensional data subject to measurement error in responses and predictors. Unlike existing packages that deal with either variable selection or measurement error but not both, our package can perform variable selection and correct measurement error effects in both responses and predictors simultaneously. In addition, the computational time is fairly fast and the arguments are flexible for public use. In applications, some variables in a dataset, such as age or gender, can be precisely measured and shown to be free of measurement error. The package BOOME is flexible enough to handle those scenarios as well. For example, if researchers believe that the predictors in their dataset are free of measurement error, they can adopt Setting 2 in Section 5.1 by employing corrected responses and precisely measured predictors; if only the predictors are shown to have measurement error, one can adopt Setting 3 in Section 5.1 by implementing corrected predictors and precisely measured responses.
There are several possible extensions of the current developments. First, in addition to continuous or binary random variables, categorical or count data are frequently encountered in bioinformatics, such as RNA sequencing or GWAS data, and they might be subject to mismeasurement. Therefore, it is important to propose a valid approach to adjust for measurement error effects in those data. In addition, our current approach focuses on parametric logistic regression or probit models. To provide more general formulations, it is interesting to extend the BOOME method to nonparametric or semiparametric models. In the current development, our attention primarily focuses on variable selection for high-dimensional data subject to measurement error. In supervised learning, examining the performance of classification and prediction is a crucial concern. Provided that additional information, such as validation samples, is available, it is interesting to adopt the selected predictors and measurement error adjustments from the BOOME method to define a general model of measurement heterogeneity and develop measures (e.g., the C-statistic or Brier score) to assess predictive performance (e.g., [30]). Moreover, since the responses are subject to measurement error as well, handling measurement error in responses when doing prediction deserves careful exploration.
Finally, as commented by a referee, dimension reduction techniques, such as principal component analysis (PCA) or factor analysis, can be valid tools to reduce the dimension of ultrahigh-dimensional predictors. However, there are two main issues in the current development. First, the purpose of this study is to detect informative predictors and exclude irrelevant ones, while dimension reduction techniques aim to reduce the dimension through linear combinations of the high-dimensional predictors. Second, when the predictors are subject to measurement error, the BOOME package is able to address measurement error effects and correctly retain important predictors, while correction of measurement error effects for dimension reduction techniques has not been explored, especially when the response is contaminated with measurement error as well. Undoubtedly, this is an interesting perspective on handling ultrahigh-dimensional data and deserves careful exploration in future research.
Acknowledgments
The author would like to thank his master's student, QinYing OuYang, for assistance in developing the package and summarizing the results of data analysis, and thank Lingyu Cai for revising the Python code, helpful language editing, grammar revision, and proofreading. The author also thanks two referees for their useful comments, which significantly improved the presentation of the initial manuscript.
References
- 1. Guyon I., Weston J., Barnhill S., Vapnik V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
- 2. He W., Yi G. Y., and Chen L.-P. (2019). Support vector machine with component graphical structure incorporated. Proceedings, Machine Learning and Data Mining in Pattern Recognition, 15th International Conference on Machine Learning and Data Mining, MLDM 2019, vol. II, 557–570.
- 3. Tibshirani R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267–288.
- 4. Zou H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
- 5. Zou H. and Hastie T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.
- 6. Chen L.-P. and Yi G. Y. (2021a). Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics, 77, 956–969.
- 7. Chen L.-P. and Yi G. Y. (2022a). De-noising analysis of noisy data with graphical models. Electronic Journal of Statistics, 16, 3861–3909.
- 8. Chen L.-P. and Yi G. Y. (2022b). Sufficient dimension reduction for survival data analysis with error-prone variables. Electronic Journal of Statistics, 16, 2082–2123.
- 9. Huang F., Guang P., Li F., Liu X., Zhang W., and Huang W. (2020). AML, ALL, and CML classification and diagnosis based on bone marrow cell morphology combined with convolutional neural network. Medicine, 99:45, 1–8.
- 10. Carroll R. J., Ruppert D., Stefanski L. A., and Crainiceanu C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall.
- 11. Chen L.-P. and Yi G. Y. (2020). Model selection and model averaging for analysis of truncated and censored data with measurement error. Electronic Journal of Statistics, 14, 4054–4109.
- 12. Chen L.-P. (2020). Variable selection and estimation for the additive hazards model subject to left-truncation, right-censoring and measurement error in covariates. Journal of Statistical Computation and Simulation, 90, 3261–3300.
- 13. Yi G. Y. (2017). Statistical Analysis with Measurement Error and Misclassification: Strategy, Method and Application. Springer.
- 14. Carroll R. J., Spiegelman C. H., Gordon Lan K. K., Bailey K. T., and Abbott R. D. (1984). On errors-in-variables for binary regression models. Biometrika, 71, 19–25.
- 15. Chen L.-P. and Yi G. Y. (2021b). Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Annals of the Institute of Statistical Mathematics, 73, 481–517.
- 16. Roy S., Banerjee T., and Maiti T. (2005). Measurement error model for misclassified binary responses. Statistics in Medicine, 24, 269–283. pmid:15546132
- 17. Schafer D. W. (1993). Analysis for probit regression with measurement errors. Biometrika, 80, 899–904.
- 18. Stefanski L. A., and Carroll R. J. (1987). Conditional scores and optimal scores for generalized linear measurement error models. Biometrika, 74, 703–716.
- 19. Brown B., Miller C. J., and Wolfson J. (2017). ThrEEBoost: Thresholded boosting for variable selection and prediction via estimating equations. Journal of Computational and Graphical Statistics, 26, 579–588.
- 20. Hastie T., Tibshirani R., and Friedman J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- 21. Sørensen Ø, Hellton K. H., Frigessi A., and Thoresen M. (2018). Covariate selection in high-dimensional generalized linear models with measurement error. Journal of Computational and Graphical Statistics, 27, 739–749.
- 22. Ma Y. and Li R. (2010). Variable selection in measurement error models. Bernoulli, 16, 273–300. pmid:20209020
- 23. Brown B., Weaver T., and Wolfson J. (2019). MEBoost: Variable selection in the presence of measurement error. Statistics in Medicine, 38, 2705–2718. pmid:30856279
- 24. Friedman J., Hastie T., Tibshirani R., Narasimhan B., Tay K., Simon N., et al. (2022). glmnet: Lasso and Elastic-net regularized generalized linear models. R package version 4.1-4. https://CRAN.R-project.org/package=glmnet
- 25. Fan J. and Lv J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.
- 26. Krishnan S. (2019). xverse. Python package version 1.0.5. https://pypi.org/project/xverse/#description
- 27. Bartoszek K. (2019). GLSME: Generalized least squares with measurement error. R package version 1.0.5. https://CRAN.R-project.org/package=GLSME
- 28. Nab L., van Smeden M., Keogh R. H., and Groenwold R. H. H. (2021). mecor: An R package for measurement error correction in linear regression models with a continuous outcome. Computer Methods and Programs in Biomedicine, 208, 106238. pmid:34311414
- 29. Wolfson J. (2011). EEBoost: A general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association, 106, 296–305.
- 30. Luijken K., Groenwold R.H.H., Van Calster B., Steyerberg E.W., and van Smeden M. (2019). Impact of predictor measurement heterogeneity across settings on the performance of prediction models: A measurement error perspective. Statistics in Medicine, 38, 3444–3459. pmid:31148207