Abstract
In the analysis of gene expression data, ultrahigh dimensionality and measurement error are ubiquitous features. It is therefore crucial to correct measurement error effects and perform variable selection when fitting a regression model. In this paper, we introduce a Python package BOOME, which refers to a BOOsting algorithm for Measurement Error in binary responses and ultrahigh-dimensional predictors. We primarily focus on logistic regression and probit models with responses, predictors, or both contaminated with measurement error. BOOME aims to address measurement error effects and employs a boosting procedure to perform variable selection and estimation.
Citation: Chen L-P (2022) BOOME: A Python package for handling misclassified disease and ultrahigh-dimensional error-prone gene expression data. PLoS ONE 17(10): e0276664. https://doi.org/10.1371/journal.pone.0276664
Editor: Angelo Moretti, Utrecht University: Universiteit Utrecht, NETHERLANDS
Received: May 7, 2022; Accepted: October 11, 2022; Published: October 27, 2022
Copyright: © 2022 Li-Pang Chen. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The gene expression data considered in the manuscript are available in the R package SIS. In this package, one can insert two inputs leukemia.train and leukemia.test to get the dataset, where the first 7129 columns are gene expression values, and the last column is AML (labeled 1) and ALL (labeled 0).
Funding: Chen’s research was supported by the Ministry of Science and Technology (MOST 110-2118-M-004-006-MY2). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
1.1 Motivation
Analysis of gene expression data is a popular topic and deserves careful research development. A motivating example in this paper is the gene expression microarray data collected by [1] and explored in several references (e.g., [2]). The full dataset can be found in the R package SIS. The data contain binary responses, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), and 7128 gene expression levels that were measured using Affymetrix oligonucleotide arrays. In addition, the 72 samples come from two classes, with 47 specimens in class ALL and 25 specimens in class AML. The primary objective of this study is to characterize the relationship between leukemia and gene expression values, and to see how gene expression values explain leukemia. To achieve this goal, a commonly used approach is to build a regression model by treating leukemia and gene expression values as the binary response and the predictors, respectively. To model a binary response, logistic regression or probit models are perhaps the most frequently implemented parametric approaches.
According to this gene expression dataset, ultrahigh dimensionality (p ≫ n) is a challenging feature. Since not every gene expression value is informative, using irrelevant predictors in regression models may affect the performance of classification and induce wrong conclusions. Therefore, performing variable selection to retain important predictors is needed. While variable selection techniques have been widely explored (e.g., [3–5]), those strategies may fail to handle the case where the dimension is much larger than the sample size. The other concern is measurement error in the response and predictors. As discussed in [6–8], gene expression values may be measured imprecisely due to unadjusted machines. Moreover, as commented by [9], it is also possible to falsely record ALL (or AML) as AML (or ALL), known as misclassification, because the microscopy images of AML bone marrow cells contain many immature granulocytes and monocytes, and ALL bone marrow cell microscopy images contain many immature lymphocytes. It is known that ignoring measurement error effects may cause tremendous biases and induce incorrect decisions, such as the false exclusion of truly informative predictors or the false inclusion of irrelevant predictors when performing variable selection (e.g., [6, 10, 11]). Therefore, it is crucial to correct measurement error effects. In particular, unlike existing literature that handles either variable selection or measurement error, the main challenge of this dataset is to correct measurement error and select informative predictors simultaneously under ultrahigh-dimensional data, since measurement error may explicitly affect the performance of variable selection. In other words, truly noninformative predictors may be falsely included if measurement error effects are ignored (e.g., [6, 11, 12]). As a result, it is necessary to suitably adjust for measurement error effects and then use the corrected version to perform variable selection.
1.2 Contributions
To address those concerns, we develop a package BOOME that is now available on https://pypi.org/project/BOOME/0.0.2/. The purpose of this package is to correct two measurement error processes in responses and predictors, and employ the boosting procedure to retain important predictors and estimate nonzero coefficients simultaneously.
In standard analysis of regression models, to estimate unknown parameters, one may need to derive likelihood functions or, more generally, unbiased estimating functions. The resulting estimator can then be obtained by optimizing the constructed estimating functions. In the presence of measurement error, however, naively plugging error-prone predictors into the estimating functions would yield biased estimators (e.g., [10, 13]). Therefore, to address this challenge, as discussed in the measurement error framework, one should derive corrected estimating functions with measurement error effects eliminated before implementing computational algorithms or estimation methods to derive the estimator, which is the standard step in measurement error analysis (e.g., [6, 10–18]). Following this idea, our strategy is to derive a new estimating function with measurement error effects in responses and predictors corrected, and then adopt it to select informative predictors and obtain the corresponding estimators. Specifically, to correct measurement error effects in the binary response, we define the misclassification matrix (e.g., [13], p.131), which is formulated by specificity and sensitivity (e.g., [13], p.70) and will be described in detail in Section 2.2, to adjust for measurement error effects in the responses and derive a new corrected response. Regarding the error-prone predictors, we adopt the sufficient statistic of the predictors and the regression calibration to correct measurement error effects in the predictors. Based on such strategies of measurement error correction, we develop the corrected estimating functions under logistic regression or probit models, respectively. After that, we implement the corrected estimating functions in the boosting algorithm to perform variable selection and estimation (e.g., [19]; [20], p.608). Detailed descriptions of the measurement error corrections and the boosting algorithm are deferred to Sections 3.1 and 3.2, respectively.
1.3 Comparisons
Variable selection and estimation with correction of measurement error have been discussed, and many methods based on different settings have been developed. For example, [21] considered generalized linear models (GLM) and proposed the generalized matrix uncertainty selector (GMUS), whose idea is based on a Taylor series expansion of the GLM mean function around the true, but unknown, predictors. [22] considered parametric and semiparametric regression models with error-prone predictors, and developed a corrected estimating equation to perform variable selection. For survival data with incomplete responses, [6, 11, 12] considered several types of survival models and developed penalized estimating functions to deal with variable selection. [23] proposed the MEBoost method, which adopts the boosting method to select informative variables under error-prone linear regression models. While many methods have been developed, they primarily focus on measurement error in predictors, and little work is available to address measurement error effects in responses. In addition, although the boosting method has been applied to error-prone data, the existing method focuses only on linear regression models, and other types of regression models have not been explored.
In past developments, several packages based on different software have been developed to deal with either measurement error or variable selection. To name a few, for the R software, the two packages glmnet [24] and SIS [25] are popular tools for variable selection. For the Python software, xverse [26] can be adopted to perform feature selection. However, they fail to deal with measurement error effects. On the other hand, the two packages GLSME [27] and mecor [28] in the R software focus on linear models and aim to adjust for measurement error effects in the response and/or predictors, but they cannot deal with variable selection.
Compared with existing packages, the package BOOME has several distinguishing features. Specifically, our package is able to handle ultrahigh dimensionality and mismeasured data simultaneously. Unlike most existing frameworks that focus on measurement error in predictors or continuous responses, our approach extends to measurement error in binary responses (a.k.a. misclassification), whose model structure is more complex than that for continuous responses. Our approach can deal with error-prone responses and predictors simultaneously. Moreover, the boosting iteration may reduce the possibility of falsely excluding important predictors and enhance the accuracy of the estimator. Most importantly, our development is based on the Python language, and, to the best of our knowledge, there is no relevant development among Python packages.
1.4 Organization of this paper
The remainder is organized as follows. In Section 2, we introduce two regression models to characterize binary responses. In addition, we introduce two measurement error models to describe error-prone responses and predictors, respectively. In Section 3, we present the BOOME method. Specifically, we first discuss some valid strategies to handle measurement error effects, and then discuss the boosting method for variable selection and estimation. In Section 4, we introduce the Python package BOOME, including some important functions as well as their implementation. In Section 5, we demonstrate the application of the package BOOME and analyze the gene expression data. Moreover, we also demonstrate simulation studies. Finally, a general discussion is presented in Section 6.
2 Regression models
2.1 Regression models with binary responses
Following the motivating example in Section 1.1, let n = 72 denote the sample size. For i = 1, …, n, let Yi be a binary response where Yi = 1 represents AML and Yi = 0 indicates ALL. Moreover, let Xi be a p-dimensional vector of gene expressions with p = 7128.
In the absence of measurement error effects, our goal is to use the gene expression values Xi to characterize the disease Yi through a p-dimensional vector of parameters β. In the framework of GLM, logistic regression or probit models are commonly used. Specifically, the logistic regression model (LR) is formulated by
P(Yi = 1 | Xi) = exp(Xiᵀβ) / {1 + exp(Xiᵀβ)},  (1)
and the probit model (PM) is given by
P(Yi = 1 | Xi) = Φ(Xiᵀβ),  (2)
where Φ(⋅) is the cumulative distribution function of the standard normal distribution.
To estimate β, a common strategy is to optimize the likelihood function, or equivalently, solve the estimating equation. Specifically, for i = 1, …, n, the estimating function based on (1) is defined as
gLR,i(β) = Xi [Yi − exp(Xiᵀβ) / {1 + exp(Xiᵀβ)}],  (3)
and the estimating function based on (2) is given by
gPM,i(β) = Xi {Yi − Φ(Xiᵀβ)},  (4)
Solving gLR(β) = 0 or gPM(β) = 0 yields the estimator of β.
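As a concrete illustration of solving these estimating equations in the error-free case, the following sketch fits (1) by Newton–Raphson with NumPy; all names here are illustrative and not part of the BOOME package.

```python
import numpy as np

def logistic_score(beta, X, Y):
    """Estimating function g_LR(beta) = sum_i X_i {Y_i - expit(X_i' beta)}."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (Y - p)

def fit_logistic(X, Y, n_iter=25):
    """Solve g_LR(beta) = 0 by Newton-Raphson, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                  # Bernoulli variances
        H = X.T @ (X * W[:, None])         # negative Jacobian of the score
        beta = beta + np.linalg.solve(H, logistic_score(beta, X, Y))
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -1.0, 0.5])     # hypothetical true coefficients
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = fit_logistic(X, Y)
```

The probit equation (4) can be solved analogously by replacing the logistic probability with Φ(Xiᵀβ).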
2.2 Measurement error models
In applications, as discussed in Section 1.1, Yi and Xi might be subject to measurement error due to wrong records by investigators or imprecise measurements by unadjusted machines. Under this scenario, we particularly denote Yi and Xi as unobserved variables, and let Yi* and Xi* denote the surrogate measurements of Yi and Xi, respectively, which are recorded in the data.
We now provide an intuition of modeling error-prone data. Let f(⋅|⋅) represent the conditional distribution for variables indicated by the corresponding arguments, and let f(⋅) denote the marginal or joint distribution of random variables. Following a similar discussion in [13] (Chapter 8), we consider the joint distribution f(Yi*, Xi*, Yi, Xi) and factorize it as

f(Yi*, Xi*, Yi, Xi) = f(Yi* | Xi*, Yi, Xi) f(Xi* | Yi, Xi) f(Yi | Xi) f(Xi)
                    = f(Yi* | Yi, Xi) f(Xi* | Xi) f(Yi | Xi) f(Xi),  (5)

where the second step is obtained by the nondifferential measurement error mechanism. With the marginal distribution f(Xi) left unspecified, the factorization (5) says that inference about f(Yi | Xi) is conducted based on examining f(Yi* | Xi*), with the predictor measurement error process being characterized by f(Xi* | Xi), while f(Yi* | Yi, Xi) facilitates the response measurement error process.
To analyze measurement error effects when constructing regression models, we first need to characterize the relationship between Yi and Yi* as well as Xi and Xi*. Specifically, to connect Yi and Yi*, we consider the conditional probability

πikl ≜ P(Yi* = k | Yi = l, Xi)  (6)

for k, l ∈ {0, 1}, satisfying πi10 + πi00 = 1 and πi01 + πi11 = 1, where πi11 and πi00 are called sensitivity and specificity, respectively, or known as classification probabilities; πi10 and πi01 are known as misclassification probabilities (e.g., [13], p.70). Moreover, to characterize πi01 and πi10, logistic regression (1) or probit models (2) with an additional parameter γ are suitable choices. By the law of total probability, P(Yi* = 1 | Xi) and P(Yi* = 0 | Xi) can be expressed as

(P(Yi* = 1 | Xi), P(Yi* = 0 | Xi))ᵀ = Πi (P(Yi = 1 | Xi), P(Yi = 0 | Xi))ᵀ,  (7)

where

Πi = ( πi11  πi10
       πi01  πi00 )

is called a 2 × 2 misclassification matrix (e.g., [13], p.131) that is assumed to be invertible.
Next, to describe the relationship between Xi* and Xi, we employ the classical measurement error model

Xi* = Xi + ϵi,  (8)

where the ϵi are independently and identically distributed as the normal distribution N(0, Σϵ), with Σϵ being a p × p covariance matrix representing the magnitude of measurement error effects in the predictors. We assume that ϵi is independent of Xi.
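To make the two measurement error models concrete, the following sketch generates data from (6)–(8) with hypothetical, constant misclassification probabilities and error variance; it is an illustration, not the package's generator.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p = 200, 5
X = rng.normal(size=(n, p))            # unobserved true predictors
Y = rng.binomial(1, 0.5, size=n)       # unobserved true binary responses

# Response misclassification, models (6)-(7), with constant probabilities:
pi10, pi01 = 0.1, 0.15                 # P(Y*=1|Y=0) and P(Y*=0|Y=1), assumed
flip_prob = np.where(Y == 1, pi01, pi10)
Y_star = np.where(rng.uniform(size=n) < flip_prob, 1 - Y, Y)

# Classical additive measurement error, model (8), with diagonal Sigma_eps:
sigma_eps2 = 0.5                       # common diagonal entry of Sigma_eps
X_star = X + rng.normal(scale=np.sqrt(sigma_eps2), size=(n, p))
```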
3 Method
3.1 Correction of measurement error
In this section, we describe how to correct measurement error effects in the responses and the predictors.
Motivated by (7), multiplying both sides of (7) by the inverse matrix of Πi yields

(P(Yi = 1 | Xi), P(Yi = 0 | Xi))ᵀ = Πi⁻¹ (P(Yi* = 1 | Xi), P(Yi* = 0 | Xi))ᵀ,  (9)

which indicates that the unobserved response Yi = 1 can be implicitly characterized by Yi* with the adjustment in terms of πi01 and πi10. It motivates us to consider the “corrected” response, denoted Yi**, which satisfies

E(Yi** | Xi) = P(Yi = 1 | Xi),  (10)

suggesting that

Yi** = (Yi* − πi10) / (1 − πi10 − πi01).  (11)

In addition, (11) indicates that E(Yi** | Yi, Xi) = Yi, verifying that (11) is a suitable correction to recover Yi. Moreover, we note that (11) holds regardless of the choice of regression models because it is obtained by the equalities (9) and (10), where the conditional probability can be (1) or (2).
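The unbiasedness property E(Yi** | Yi, Xi) = Yi can be checked numerically; the sketch below conditions on Yi = 1 with assumed values of πi10 and πi01.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
pi10, pi01 = 0.1, 0.2               # assumed misclassification probabilities

# Condition on Y = 1 and generate surrogates: P(Y* = 0 | Y = 1) = pi01
Y_star = np.where(rng.uniform(size=n) < pi01, 0.0, 1.0)

# Corrected response (11)
Y_dd = (Y_star - pi10) / (1.0 - pi10 - pi01)
print(Y_dd.mean())                   # approximately 1, matching E(Y** | Y = 1) = 1
```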
To correct measurement error effects in the predictors, we provide two different strategies for the two models. For the logistic regression model in terms of Yi** and the unobserved Xi, we follow a similar discussion in [18] and aim to replace Xi by its sufficient statistic

Δi(β) = Xi* + Yi** Σϵ β,  (12)

which can be regarded as a correction of Xi*. Replacing Yi and Xi in (1) by (11) and (12) gives the corrected estimating function

gLR,i**(β) = Δi(β) [Yi** − exp{Δi(β)ᵀβ − βᵀΣϵβ/2} / (1 + exp{Δi(β)ᵀβ − βᵀΣϵβ/2})].  (13)
On the other hand, to handle measurement error effects in the probit model, we adopt the regression calibration (e.g., [10], Chapter 4), whose key idea is to replace Xi by the conditional expectation E(Xi | Xi*). By the best linear unbiased prediction, it can be expressed as (e.g., [11, 15])

E(Xi | Xi*) = μX + ΣX ΣX*⁻¹ (Xi* − μX*),

where μX and μX* represent the mean vectors of X and X*, respectively, and ΣX and ΣX* represent the covariance matrices of X and X*, with ΣX = ΣX* − Σϵ under model (8). Since μX = μX*, by the method of moments, we obtain the calibrated predictor

X̂i = X̄* + (Σ̂X* − Σϵ) Σ̂X*⁻¹ (Xi* − X̄*),  (14)

where X̄* = n⁻¹ Σᵢ Xi* and Σ̂X* = (n − 1)⁻¹ Σᵢ (Xi* − X̄*)(Xi* − X̄*)ᵀ are empirical estimates of μX* and ΣX*, respectively. Consequently, replacing Yi and Xi in (2) by (11) and (14) gives the corrected estimating function

gPM,i**(β) = X̂i {Yi** − Φ(X̂iᵀβ)}.  (15)
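A minimal sketch of the moment-based regression calibration, with an assumed diagonal Σϵ, is given below; the calibrated predictors should be closer to the unobserved Xi than the raw surrogates are.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 1000, 4
Sigma_eps = 0.5 * np.eye(p)                  # assumed error covariance
X = rng.normal(size=(n, p))                  # unobserved predictors
X_star = X + rng.multivariate_normal(np.zeros(p), Sigma_eps, size=n)

# Method-of-moments regression calibration:
# X_hat_i = xbar + (S - Sigma_eps) S^{-1} (X*_i - xbar)
xbar = X_star.mean(axis=0)
S = np.cov(X_star, rowvar=False)             # empirical covariance of X*
A = np.linalg.solve(S, S - Sigma_eps)        # S^{-1}(S - Sigma_eps); A^T = (S - Sigma_eps)S^{-1}
X_hat = xbar + (X_star - xbar) @ A
```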
3.2 Boosting algorithm
Let g**(β) denote the unified notation to represent the corrected estimating function (13) or (15). To make variable selection and estimation for β, we adopt the boosting algorithm with the correction of measurement error effects. The proposed method is called BOOME, and the procedure is summarized in Algorithm 1.
Specifically, the algorithm starts with an initial value β(0) taken as the p-dimensional zero vector 0p. Suppose that we run T iterations. For each iteration step t = 1, …, T, we compute the estimating function g**(β) evaluated at the (t − 1)th iterated value β(t−1), and denote it as Δ(t−1). After that, we define the active set

A(t) = { j : |Δj(t−1)| ≥ τ maxk |Δk(t−1)| },

where τ ∈ [0, 1] is a constant and Δj(t−1) is the jth component of the vector Δ(t−1). That is, the active set A(t) aims to retain informative predictors by treating a large magnitude |Δj(t−1)| as a signal. Finally, for those j ∈ A(t), we update the jth component of β(t−1), say βj(t−1), by adding an increment η ⋅ sign(Δj(t−1)) for some positive constant η. Repeating those steps T times yields the final estimator of β.
In Algorithm 1, τ, η, and T can be user-specified and may affect the iteration result. Similar to the comment in [29], the algorithm satisfying Tη → 0 as T → ∞ and η → 0 is approximately equivalent to the LASSO method. Therefore, it is suggested to take η as a small value, such as η = 0.01, in applications. On the other hand, while T is suggested to be large, a large T may cause over-fitting. To provide a suitable T and stop the iteration early, we suggest a criterion: the iteration stops at step T if ‖β(T) − β(T−1)‖ < ξ is satisfied for some positive constant ξ. Finally, for the choice of τ, one may adopt criteria such as cross-validation (e.g., [19]).
Algorithm 1: Boosting Procedure in BOOME
Let β(0) = 0p denote an initial value;
for step t with t = 1, 2, …, T do
 (a) calculate Δ(t−1) = g**(β)|β = β(t−1);
 (b) determine the active set A(t) = { j : |Δj(t−1)| ≥ τ maxk |Δk(t−1)| };
 (c) update βj(t) = βj(t−1) + η ⋅ sign(Δj(t−1)) for all j ∈ A(t), and define βj(t) = βj(t−1) for j ∉ A(t);
end
The final estimator is given by β̂ = β(T).
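The loop in Algorithm 1 can be sketched generically as follows; the helper names and the toy estimating function are illustrative stand-ins, not the package's internals.

```python
import numpy as np

def boome_boost(g, p, T=200, tau=0.9, eta=0.01):
    """Generic boosting loop of Algorithm 1 for an estimating function g(beta).
    Illustrative re-implementation; not the BOOME package's own code."""
    beta = np.zeros(p)                               # beta^(0) = 0_p
    for _ in range(T):
        delta = g(beta)                              # Delta^(t-1) = g**(beta^(t-1))
        active = np.abs(delta) >= tau * np.max(np.abs(delta))
        beta[active] += eta * np.sign(delta[active])  # update active components
    return beta

# Toy example: naive logistic score as g (no measurement error correction)
rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta0 = np.zeros(p); beta0[:2] = 1.0                 # two informative predictors
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))

def g(beta):
    return X.T @ (Y - 1.0 / (1.0 + np.exp(-X @ beta)))

beta_hat = boome_boost(g, p)
selected = np.nonzero(beta_hat)[0]                   # indices of retained predictors
```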
4 Description and implementation of BOOME
We develop a Python package, called BOOME, to implement the variable selection and estimation with measurement error correction described in Section 3. The package BOOME contains three functions: ME_Generate, LR_Boost, and PM_Boost. The function ME_Generate aims to generate artificial data, with error-prone responses and predictors, under the models listed in Section 2.1. The functions LR_Boost and PM_Boost implement the boosting procedure in Algorithm 1, with the difference that LR_Boost is based on the logistic regression model while PM_Boost focuses on the probit model. We now describe the details of these three functions.
4.1 ME_Generate
We use the following command to obtain the artificial data:
where the meaning of each argument is described as follows:
- n: The number of observations.
- beta: A p-dimensional vector of parameter β specified by users.
- matrix: A user-specified covariance matrix implemented in (8).
- X: A user-specific n × p matrix of predictors.
- gamma: A p-dimensional vector of parameter γ in πi10 and πi01 specified by users.
The function ME_Generate returns a list of components:
- data: A dataset with error-prone predictors and responses. It is an n × (p + 1) data frame, where the column with label y represents the error-prone response, and the column with label j, j = 1, …, p, represents the jth error-prone predictor Xij*.
- pr: Two misclassification probabilities πi10 and πi01 in (7).
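The original command listing appears as a figure in the published article; in outline, with the argument names taken from the list above (the exact signature should be checked against the package documentation on PyPI), the call has the form:

```
data, pr = ME_Generate(n, beta, matrix, X, gamma)
```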
4.2 LR_Boost
To demonstrate Algorithm 1 with the corrected estimating function (13) for the logistic regression model, we adopt the following command:
where the meaning of each argument is described as follows:
- X: An n × p matrix of continuous predictors that are precisely measured or subject to measurement error.
- Y: An n-dimensional vector of binary responses that are precisely measured or subject to measurement error.
- ite: The number of iterations T in Algorithm 1.
- thres: A threshold value τ in Algorithm 1.
- correct_X: Determines whether measurement error in the predictors is corrected. Select “1” if correction is needed, and “0” otherwise.
- correct_Y: Determines whether measurement error in the response is corrected. Select “1” if correction is needed, and “0” otherwise.
- pr: Two misclassification probabilities πi10 and πi01 in (7).
- lr: A learning rate η in Algorithm 1.
- matrix: A p × p covariance matrix Σϵ in (8).
The function LR_Boost returns a list of components:
- estimated coefficients: the p-dimensional vector of estimators of β.
- predictors: Indexes of nonzero values in estimated coefficients.
- number of predictors: The number of nonzero values in estimated coefficients.
4.3 PM_Boost
To make variable selection and estimation for probit model by using Algorithm 1 with the corrected estimating function (15), we implement the following function:
The arguments in PM_Boost as well as the output produced by PM_Boost are the same as those in LR_Boost.
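The command listings for LR_Boost and PM_Boost appear as figures in the published article; in outline, with the argument names taken from the list in Section 4.2 (the exact signature and argument order should be checked against the package documentation), the calls have the form:

```
result = LR_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)
result = PM_Boost(X, Y, ite, thres, correct_X, correct_Y, pr, lr, matrix)
```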
5 Numerical studies
In this section, we implement the functions in the package BOOME to analyze a real dataset as well as demonstrate simulation studies. Detailed code demonstrations are also available on the pypi website https://pypi.org/project/BOOME/0.0.2/.
5.1 Analysis of gene expression microarray data
In this section, we implement the package BOOME to analyze the gene expression microarray data introduced in Section 1.1. The steps of the analysis are summarized in Fig 1. As shown in Step 1 of Fig 1, we recognize that, for i = 1, …, n, Yi* is the binary random variable with outcomes AML and ALL that may possibly be subject to misclassification, and Xi* represents the gene expression values that are contaminated with measurement error. Before analyzing this dataset, we first standardize all predictors, so that the mean and the variance of each predictor become 0 and 1, respectively. Let data_GE in the Python code represent the gene expression microarray data introduced in Section 1.1, where the first column is the binary outcome and the remaining columns are gene expression values. Based on this dataset, the following code shows the input of the gene expression data and the standardization procedure:
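The original listing is rendered as a figure; a minimal sketch of the standardization step, using a synthetic stand-in for data_GE (the real data come from the R package SIS), is:

```python
import numpy as np

# Synthetic stand-in for data_GE: column 0 is the binary outcome,
# the remaining columns play the role of gene expression values.
rng = np.random.default_rng(5)
data_GE = np.column_stack([rng.binomial(1, 0.4, 50),
                           rng.normal(2.0, 3.0, size=(50, 10))])

y = data_GE[:, 0]
x = data_GE[:, 1:]
x = (x - x.mean(axis=0)) / x.std(axis=0)   # each predictor: mean 0, variance 1
```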
To examine the impact of measurement error effects, we primarily consider four settings in Step 2 of Fig 1:
- Setting 1: Yi* and Xi* are not corrected.
- Setting 2: Yi* is corrected while Xi* is not.
- Setting 3: Xi* is corrected while Yi* is not.
- Setting 4: Yi* and Xi* are corrected.
Here Setting 1 implements Algorithm 1 with the estimating functions (3) or (4), with Yi and Xi replaced by the error-prone variables Yi* and Xi*, respectively. Setting 2 considers the estimating function in Setting 1 with Yi* replaced by the corrected responses (11); and Setting 3 adopts the estimating function in Setting 1 with Xi* replaced by the corrected predictors (12) or (14). Setting 4 applies Algorithm 1 to the estimating functions (13) or (15), where measurement error in both responses and predictors is corrected. As discussed in Section 1.1, both responses and predictors are contaminated with measurement error, so Setting 4 is the proposed method that corrects measurement error effects in responses and predictors. On the other hand, Settings 1-3, known as naive methods, reflect that measurement error in leukemia, gene expressions, or both is not corrected. Basically, Settings 1-3 are considered to show the impact of ignoring measurement error effects and are compared with the proposed method in Setting 4.
In Step 3, we implement the functions in BOOME for the four settings. Since the dataset has no additional information, such as repeated measurements or validation data, to estimate the parameter Σϵ in the measurement error model as well as the two misclassification probabilities πi10 and πi01, we conduct sensitivity analyses, which specify reasonable values for Σϵ and enable us to examine the impact of different magnitudes of measurement error. In our study, we specify Σϵ as a diagonal matrix with diagonal entries being 0.2, 0.5, or 0.7. For the implementation, we take one of these magnitudes as an example and use the following command to specify Σϵ, denoted as matrix:
With Σϵ being specified, we further determine the misclassification probabilities πi10 and πi01. Specifically, since πi10 and πi01 defined in (6) rely on Xi, we reproduce Xi by (8), where Xi* is the observed gene expression values and ϵi is generated from a normal distribution with variance given by 0.2, 0.5, or 0.7. After that, we adopt logistic regression or probit models to characterize (6) with the corresponding parameter specified as γ ≜ 1p, where 1p is a p-dimensional vector with all entries being one. Therefore, values of πi10 and πi01 are obtained. For the demonstration, we summarize the following function pi that implements this idea and computes πi10 and πi01. The resulting values of πi10 and πi01 are denoted as pr:
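The function pi is likewise shown as a figure in the published article; the sketch below conveys the idea with a logistic form and γ = 1p, using the same simplified form for both probabilities (the package's actual formulation may differ):

```python
import numpy as np

def pi(X, gamma):
    """Compute misclassification probabilities (pi_i10, pi_i01) from a
    logistic model in X_i, as in (6). Illustrative sketch only."""
    pr = 1.0 / (1.0 + np.exp(-X @ gamma))   # logistic form in X_i' gamma
    return np.column_stack([pr, pr])        # columns: pi_i10, pi_i01

rng = np.random.default_rng(9)
X = rng.normal(size=(10, 4))
gamma = np.ones(4)                          # gamma = 1_p, as in the text
pr = pi(X, gamma)
```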
We now implement two functions LR_Boost and PM_Boost to analyze the data, where, for logistic regression model, T is given by 1000, η is set as 0.01, and τ is equal to 0.9; for probit models, T is given by 2000, η is set as 0.01, and τ is equal to 0.9.
Detailed implementations of the proposed method are described below. Keep in mind that we demonstrate correct_X = 1 and correct_Y = 1 for the proposed method (Setting 4); different values of the arguments correct_X and correct_Y reflect the other settings mentioned above.
In Step 4 of Fig 1, we report the estimation results. To save space and provide precise information, we summarize the predictors and their estimates that are commonly selected under all considered magnitudes of measurement error (0.2, 0.5, or 0.7), and numerical results for all settings obtained by (1) and (2) are placed in Tables 1 and 2, respectively. Moreover, to see the impacts of different regression models, we summarize the commonly chosen predictors in Table 3.
We first examine Setting 1, where measurement error corrections are not incorporated. Based on BOOME, the logistic regression model retains 45 gene expression values, and the probit model suggests that 53 gene expression values should be included. Next, we explore the cases where either the response or the predictors are corrected. Under Setting 2, the logistic regression model retains 59 gene expression values, and the probit model suggests that 74 gene expression values should be included. Under Setting 3, the logistic regression model retains 42 gene expression values, and the probit model suggests that 36 gene expression values should be included. Finally, under Setting 4, where measurement error effects in the response and the predictors are corrected, the logistic regression model retains 51 gene expression values, and the probit model retains 75 gene expression values.
For the overall comparisons, we first observe that the variable selection result may depend on the correction of measurement error effects in the response and/or the predictors. The number of selected gene expressions under Setting 1 is smaller than that under most other settings. Between the two regression models, the probit model retains more predictors than the logistic regression model does, except under Setting 3. Finally, there are 35, 45, 29, and 8 gene expressions that are commonly selected by the two models under Settings 1-4, respectively.
5.2 Demonstration of simulation studies
To show the validity of the BOOME method as well as the implementation of the package, we conduct simulation studies and demonstrate the programming code in this section.
Let n = 100 denote the sample size, and let p = 1000 or 5000 denote the dimension of predictors. For i = 1, …, n, we generate the p-dimensional vector of predictors Xi from the standard multivariate normal distribution. Let β0 = (1, 1, 1, 0p−3ᵀ)ᵀ denote the true value of the parameters, where 0q represents the q-dimensional zero vector. Given Xi and β0, we generate the binary response Yi.
Noting that {(Yi, Xi): i = 1, …, n} is regarded as unobserved data, we now generate error-prone data {(Yi*, Xi*): i = 1, …, n}. For the generation of error-prone responses Yi*, we adopt the misclassification model (6), where the misclassification probabilities πi10 and πi01 are formulated by logistic regression models. On the other hand, to generate error-prone predictors Xi*, we adopt the classical measurement error model (8), with Σϵ being specified as a diagonal matrix whose diagonal entries take a common value, 0.75 being one considered magnitude.
To see the data generation in detail, we demonstrate the following code. We first specify the generation of predictors:
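The published listing for this step is a figure; a minimal stand-in consistent with the description (standard multivariate normal predictors) is:

```python
import numpy as np

# n = 100 observations, p = 1000 predictors, each X_i drawn from the
# standard multivariate normal distribution.
rng = np.random.default_rng(2022)
n, p = 100, 1000
X = rng.normal(size=(n, p))
```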
Next, we specify the sample size and β0, taking p = 1000 with the β0 given above as an example. Based on this information, we employ the function ME_Generate to generate error-prone data, where data represents the artificial data from the output of the function ME_Generate and pr represents the two misclassification probabilities.
Given the generated data, we define the response y and the predictors x. To implement the BOOME method, we specify the iteration number and the values of τ and η as ite = 1000, thres = 0.9, and lr = 0.00001, respectively. We now implement the function LR_Boost to examine the logistic regression model with measurement error in responses and predictors corrected. Detailed implementation and partial output are given below:
For comparison with the proposed method that corrects measurement error in responses and predictors, we examine naive analyses based on Settings 1-3 in Section 5.1. Detailed implementation and partial outputs are given below. In general, we find from the outputs that the first three estimates based on the proposed method are close to the true value 1, and the selected predictors are the same as the underlying true setting. On the other hand, without correcting measurement error effects, we observe from the results below that the first three estimates have larger biases and are far from the true value 1. Moreover, additional irrelevant predictors are falsely included.
In addition to the logistic regression models, we further examine the probit model based on the four settings described in Section 5.1. Specifically, we implement the function PM_Boost to construct the probit model and specify the arguments (correct_X, correct_Y) as (0,0), (0,1), (1,0), and (1,1), which reflect Settings 1-4 in Section 5.1, respectively. Detailed implementation and partial outputs are available below. Similar to the findings based on logistic regression models, we observe that the estimator with measurement error in responses and predictors corrected outperforms the other scenarios because of smaller biases and precise variable selection. As expected, without suitable correction of measurement error, the estimators incur tremendous biases and some irrelevant predictors are included.
Finally, to precisely assess the accuracy of the estimator β̂, we use the L1-norm and the L2-norm, which are respectively defined as

‖β̂ − β0‖1 = Σᵢ |β̂i − β0,i|  and  ‖β̂ − β0‖2 = {Σᵢ (β̂i − β0,i)²}^(1/2),

where β̂i and β0,i are the ith entries of β̂ and β0, respectively. To assess the performance of variable selection, we examine specificity (SPE) and sensitivity (SEN), which are respectively defined as

SPE = #{j : β̂j = 0 and β0,j = 0} / #{j : β0,j = 0}  and  SEN = #{j : β̂j ≠ 0 and β0,j ≠ 0} / #{j : β0,j ≠ 0}.
Numerical results under all settings described above are reported in Table 4. We observe that the biases in the L1- and L2-norms increase when the magnitude of measurement error and the dimension p become large. As expected, when measurement error in responses and predictors is corrected (Setting 4), the biases are the smallest and SPE as well as SEN are the largest among all settings, which verifies that the proposed method is valid for handling measurement error regardless of the specification of regression models. On the other hand, without correcting measurement error effects, we find that the naive methods (Settings 1-3) produce significant biases and low values of SPE and SEN, indicating worse performance of variable selection. In particular, if measurement error in responses and predictors is not corrected, as in Setting 1, we have the worst estimation results. Comparing Settings 2 and 3, it is interesting to see that the biases under Setting 2 are greater than those under Setting 3, and the values of SPE and SEN obtained under Setting 2 are smaller than those under Setting 3. This implies that ignoring measurement error in the predictors incurs severe biases and is worse than ignoring measurement error effects in the responses.
6 Discussion
In this paper, we introduce the Python package BOOME, which aims to address ultrahigh-dimensional data subject to measurement error in responses and predictors. Unlike existing packages that deal with either variable selection or measurement error but not both, our package can perform variable selection and correct measurement error effects in both responses and predictors simultaneously. In addition, the computational time is fairly fast and the arguments are flexible for public use. In applications, some variables in a dataset, such as age or gender, can be precisely measured and shown to be free of measurement error. The package BOOME is flexible enough to handle those scenarios as well. For example, if researchers believe that the predictors in their dataset are free of measurement error, they can adopt Setting 2 in Section 5.1 by employing corrected responses and precisely measured predictors; if only the predictors are shown to have measurement error, one can adopt Setting 3 in Section 5.1 by implementing corrected predictors and precisely measured responses.
There are several possible extensions of the current developments. First, in addition to continuous or binary random variables, categorical or count data are frequently encountered in bioinformatics, such as RNA sequencing or GWAS data, and they might be subject to mismeasurement. Therefore, it is important to propose a valid approach to adjust for measurement error effects in those data. In addition, our current approach focuses on parametric logistic regression or probit models. To provide more general formulations, it is interesting to extend the BOOME method to nonparametric or semiparametric models. In the current development, our attention primarily focuses on variable selection for high-dimensional data subject to measurement error. In supervised learning, examining the performance of classification and prediction is a crucial concern. Provided that additional information, such as validation samples, is available, it is interesting to adopt the selected predictors and measurement error adjustments from the BOOME method to define a general model of measurement heterogeneity and develop measures (e.g., the C-statistic or Brier score) to assess predictive performance (e.g., [30]). Moreover, since the responses are subject to measurement error as well, handling measurement error in responses when doing prediction deserves careful exploration.
Finally, as commented by a referee, dimension reduction techniques, such as principal component analysis (PCA) or factor analysis, can be valid tools to reduce the dimension of ultrahigh-dimensional predictors. However, there are two main issues in the current development. First, the purpose of this study is to detect informative predictors and exclude irrelevant ones, while dimension reduction techniques aim to reduce the dimension through linear combinations of the high-dimensional predictors. Second, when the predictors are subject to measurement error, the BOOME package is able to address measurement error effects and correctly retain important predictors, while correction of measurement error effects for dimension reduction techniques has not been explored, especially when the response is contaminated with measurement error as well. Undoubtedly, this is an interesting perspective on handling ultrahigh-dimensional data and deserves careful exploration in future research.
Acknowledgments
The author would like to thank his master's student, QinYing OuYang, for assistance in developing the package and summarizing the results of data analysis, and thank Lingyu Cai for revising the Python code, helpful language editing, grammar revision, and proofreading. The author also thanks two referees for their useful comments, which significantly improved the presentation of the initial manuscript.
References
- 1. Guyon I., Weston J., Barnhill S., Vapnik V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
- 2. He W., Yi G. Y., and Chen L.-P. (2019). Support vector machine with component graphical structure incorporated. Proceedings, Machine Learning and Data Mining in Pattern Recognition, 15th International Conference on Machine Learning and Data Mining, MLDM 2019, vol. II, 557–570.
- 3. Tibshirani R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267–288.
- 4. Zou H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
- 5. Zou H. and Hastie T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.
- 6. Chen L.-P. and Yi G. Y. (2021a). Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics, 77, 956–969.
- 7. Chen L.-P. and Yi G. Y. (2022a). De-noising analysis of noisy data with graphical models. Electronic Journal of Statistics, 16, 3861–3909.
- 8. Chen L.-P. and Yi G. Y. (2022b). Sufficient dimension reduction for survival data analysis with error-prone variables. Electronic Journal of Statistics, 16, 2082–2123.
- 9. Huang F., Guang P., Li F., Liu X., Zhang W., and Huang W. (2020). AML, ALL, and CML classification and diagnosis based on bone marrow cell morphology combined with convolutional neural network. Medicine, 99:45, 1–8.
- 10. Carroll R. J., Ruppert D., Stefanski L. A., and Crainiceanu C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall.
- 11. Chen L.-P. and Yi G. Y. (2020). Model selection and model averaging for analysis of truncated and censored data with measurement error. Electronic Journal of Statistics, 14, 4054–4109.
- 12. Chen L.-P. (2020). Variable selection and estimation for the additive hazards model subject to left-truncation, right-censoring and measurement error in covariates. Journal of Statistical Computation and Simulation, 90, 3261–3300.
- 13. Yi G. Y. (2017). Statistical Analysis with Measurement Error and Misclassification: Strategy, Method and Application. Springer.
- 14. Carroll R. J., Spiegelman C. H., Gordon Lan K. K., Bailey K. T., and Abbott R. D. (1984). On errors-in-variables for binary regression models. Biometrika, 71, 19–25.
- 15. Chen L.-P. and Yi G. Y. (2021b). Semiparametric methods for left-truncated and right-censored survival data with covariate measurement error. Annals of the Institute of Statistical Mathematics, 73, 481–517.
- 16. Roy S., Banerjee T., and Maiti T. (2005). Measurement error model for misclassified binary responses. Statistics in Medicine, 24, 269–283. pmid:15546132
- 17. Schafer D. W. (1993). Analysis for probit regression with measurement errors. Biometrika, 80, 899–904.
- 18. Stefanski L. A., and Carroll R. J. (1987). Conditional scores and optimal scores for generalized linear measurement error models. Biometrika, 74, 703–716.
- 19. Brown B., Miller C. J., and Wolfson J. (2017). ThrEEBoost: Thresholded boosting for variable selection and prediction via estimating equations. Journal of Computational and Graphical Statistics, 26, 579–588.
- 20. Hastie T., Tibshirani R., and Friedman J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- 21. Sørensen Ø, Hellton K. H., Frigessi A., and Thoresen M. (2018). Covariate selection in high-dimensional generalized linear models with measurement error. Journal of Computational and Graphical Statistics, 27, 739–749.
- 22. Ma Y. and Li R. (2010). Variable selection in measurement error models. Bernoulli, 16, 273–300. pmid:20209020
- 23. Brown B., Weaver T., and Wolfson J. (2019). MEBoost: Variable selection in the presence of measurement error. Statistics in Medicine, 38, 2705–2718. pmid:30856279
- 24. Friedman J., Hastie T., Tibshirani R., Narasimhan B., Tay K., Simon N., et al. (2022). glmnet: Lasso and Elastic-net regularized generalized linear models. R package version 4.1-4. https://CRAN.R-project.org/package=glmnet
- 25. Fan J. and Lv J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.
- 26. Krishnan S. (2019). xverse. Python package version 1.0.5. https://pypi.org/project/xverse/#description
- 27. Bartoszek K. (2019). GLSME: Generalized least squares with measurement error. R package version 1.0.5. https://CRAN.R-project.org/package=GLSME
- 28. Nab L., van Smeden M., Keogh R. H., and Groenwold R. H. H. (2021). mecor: An R package for measurement error correction in linear regression models with a continuous outcome. Computer Methods and Programs in Biomedicine, 208, 106238. pmid:34311414
- 29. Wolfson J. (2011). EEBoost: A general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association, 106, 296–305.
- 30. Luijken K., Groenwold R.H.H., Van Calster B., Steyerberg E.W., and van Smeden M. (2019). Impact of predictor measurement heterogeneity across settings on the performance of prediction models: A measurement error perspective. Statistics in Medicine, 38, 3444–3459. pmid:31148207