
Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data

Abstract

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.

Introduction

Robust and efficient selection of covariates in logistic regression models is a core objective of statistical analysis in domains as diverse as epidemiology, forestry and insurance risk assessment, as it enhances interpretation of fitted models and leads to improved prediction of the binary outcome (see, for example, [1–3]). As compared to the ordinary logistic regression (OLR) model combined with traditional variable selection methods such as best subset selection, regularized versions of OLR have the potential to flag important variables in the presence of multicollinearity and are computationally attractive for high dimensional data. Bühlmann and Van De Geer [4] provide an excellent exposition of regularization techniques such as Lasso [5], elastic net [6], group Lasso [7], Dantzig selector [8] and SCAD [9].

The presence of rare events, as manifested through severe imbalance in the observed frequencies of the binary classes, along with high dimensional and potentially multicollinear data, can lead to serious challenges in deriving stable variable importance rankings in regularized regression models. For instance, it is well known that the presence of a large number of irrelevant (noise) covariates having a strong correlation structure with the relevant (signal) covariates can seriously degrade variable selection in regularized regression models such as Lasso [10]. Class-imbalance occurs when one class is represented by a very small number of examples (minority class) as compared to the other (majority) class; despite the fact that the majority class makes up most of the cases, it is the minority class that is often relevant for statistical analysis. Rarity of the minority class introduces problems and complications in conducting statistical inference [11, 12]. A severe minority-to-majority class ratio, such as in the range of 1:100 to 1:1000, often leads to very high volumes of data (big data) and significantly increases the computational cost of model estimation [13].

The issue of severe class-imbalance can be addressed by employing response-based random sampling, which is done by sampling a subset of instances from both classes [14]. Response-based sampling is commonly used in case-control studies in epidemiological research and choice-based studies in econometrics [15–17]. Hosmer and Lemeshow [18] provide a thorough treatment of the effect of response-based sampling on estimation of the regression coefficients in logistic regression. A key advantage of response-based downsampling is that it results in substantially reduced datasets, which in turn leads to a marked reduction in the computational burden of training models for data with millions of records. We refer the reader to [13, 19] for a detailed overview of sampling-based approaches in statistical and machine learning applications involving high class-imbalance in big binary data.

In this study, we employ response-based downsampling to develop a novel variable ranking and selection algorithm in the context of regularized OLR models. Our methodology consists of the following two key features: (i) repeated subsampling of the minority class (controls) to create a large number of balanced datasets (say M), and (ii) computing stabilized aggregate covariate rank scores by generating an ensemble of regularized OLR model fits using the M balanced datasets created via case-control sampling. Our approach is similar in spirit to Bach’s [20] Bolasso algorithm, which is based on replications of bootstrap sampling and intersecting the supports (i.e. covariates with nonzero coefficients) of the resulting Lasso bootstrap estimates. Bach [20] shows that this approach leads to consistent model selection, whereas the original Lasso model estimated from a single dataset is shown to be inconsistent in the presence of strong correlations between signal and noise covariates [21]. Bolasso exploits the fact that the bootstrap mimics availability of a large number of datasets by resampling from the same unique dataset, thereby diminishing the probability of selecting noise covariates from the intersected supports. We refer the reader to [22–28] for examples of other recent approaches that involve the general notion of subsampling to perform variable selection in the context of machine learning based classifiers. However, recent variable selection literature, both in the context of machine learning classifiers and other regularization-based methods, is focused on analyzing datasets that are moderate in size and in class-imbalance severity. For example, the highest sample size and the most severe class-imbalance ratio considered in [22–28] are 20,000 and 1:42 [26], respectively. In contrast, our methodology is designed to deal with variable selection in very large datasets in a computationally efficient manner, e.g. our case study models involve approximately 5 to 13 million observations with class-imbalance ratios around 1:500.

Here we replace the bootstrap sampling employed in [20] with repeated response-based subsampling to create a large number of balanced datasets for variable ranking and selection in the presence of severe class-imbalance, potentially large volumes of data, and varying degrees of multicollinearity. This study presents an extension of a heuristic variable ranking approach previously introduced in Nadeem et al. [29], who employed it to rank covariates in a large wildland fire occurrence dataset. Our work here is novel in that we rigorously develop the ranking approach, employ it to propose a new variable selection methodology for a general class of regularization methods, and thoroughly assess its selection performance by conducting an extensive simulation study. We demonstrate our framework using three regularization methods: Lasso [5, 30], adaLasso [21] and ridge regression [31]. Another important novel element of our stable variable ranking and selection (SVRS) methodology is its flexibility: it does not require hard shrinkage of the regression coefficients, as is the case with Lasso, and works equally well with the general class of regularization techniques that induce soft shrinkage only. We conduct a detailed simulation experiment and use a real-world case study involving big and highly imbalanced wildland fire occurrence datasets to show that, when compared to the usual practice where only a single regularized model fit is employed to determine variable importance, our variable selection approach successfully filters out the noise covariates and recovers a substantially higher proportion of signal covariates.

Materials and methods

The regularization methods

Binary response outcomes, observed as a negative ($y_i = 0$) or a positive ($y_i = 1$) occurrence from the ith instance (example), can be modeled using the ordinary logistic regression model with the joint likelihood function for n instances given as $L(\alpha, \boldsymbol\beta) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$, where $\pi_i = P(Y_i = 1 \mid \mathbf{x}_i)$ for a given vector of p covariate values, $\mathbf{x}_i$. The probability of a positive occurrence is then parametrised via the logistic link function, $\pi_i = \exp(\alpha + \mathbf{x}_i^\top \boldsymbol\beta) / \{1 + \exp(\alpha + \mathbf{x}_i^\top \boldsymbol\beta)\}$, where α is the intercept term and β is the vector of regression coefficients. The resulting negative log-likelihood function can be expressed as follows:

$\ell(\alpha, \boldsymbol\beta) = -\sum_{i=1}^{n} \left[ y_i (\alpha + \mathbf{x}_i^\top \boldsymbol\beta) - \log\{1 + \exp(\alpha + \mathbf{x}_i^\top \boldsymbol\beta)\} \right] \quad (1)$
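The negative log-likelihood (1) is easy to evaluate directly; the minimal NumPy sketch below (function name is ours) uses `logaddexp` for a numerically stable $\log(1 + e^{\eta})$:

```python
import numpy as np

def neg_log_lik(alpha, beta, X, y):
    """Negative log-likelihood of ordinary logistic regression, eq. (1):
    -sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ], with eta_i = alpha + x_i' beta."""
    eta = alpha + X @ beta
    # log(1 + exp(eta)) computed stably as logaddexp(0, eta)
    return float(-(y * eta - np.logaddexp(0.0, eta)).sum())

X = np.array([[0.2], [0.8], [0.5], [0.1]])
y = np.array([0, 1, 1, 0])
print(neg_log_lik(0.0, np.zeros(1), X, y))  # 4*log(2) ≈ 2.7726 at alpha = beta = 0
```

At $\alpha = 0$, $\boldsymbol\beta = \mathbf{0}$ every $\pi_i = 0.5$, so the value reduces to $n \log 2$, a convenient sanity check.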

Lasso.

Lasso is a commonly employed regularization method for OLR [5, 30] in which an l1 penalty on β is imposed, leading to the following penalised form of (1):

$\ell_\lambda(\alpha, \boldsymbol\beta) = \ell(\alpha, \boldsymbol\beta) + \lambda \lVert \boldsymbol\beta \rVert_1 \quad (2)$

where $\lVert \boldsymbol\beta \rVert_1 = \sum_{j=1}^{p} |\beta_j|$ is the l1-norm penalty, λ is a tuning parameter to be estimated separately, and covariates typically enter (2) in standardized form for the penalty to be meaningful. Lasso is a much more popular choice than OLR for high dimensional data in that the l1 penalty allows simultaneous regularization of the coefficients and model selection by shrinking some of the coefficients to zero. Furthermore, efficient algorithms are available for computing the entire solution path in the tuning parameter λ when optimizing (2) [6, 30].

However, Lasso does have its limitations, including a lack of robustness to high correlations among covariates and violation of the oracle property. A procedure is said to satisfy the oracle property if it estimates the regression coefficients corresponding to noise covariates as zero with probability approaching 1 and produces asymptotically unbiased and normally distributed estimates of the nonzero coefficients. Lasso can only perform consistent variable selection under strong conditions on the design matrix [21].

Adaptive lasso (adaLasso).

The adaLasso, introduced by Zou [21], is, unlike the standard Lasso, consistent in variable selection and satisfies the oracle property. Here the l1 penalty is modified by incorporating adaptive data-driven weights. The adaLasso solution is obtained by minimizing the following penalized form of (1):

$\ell_\lambda(\alpha, \boldsymbol\beta) = \ell(\alpha, \boldsymbol\beta) + \lambda \sum_{j=1}^{p} w_j |\beta_j| \quad (3)$

where $w_j = 1/|\tilde\beta_j|^\gamma$ are the data-driven weights, γ is a positive constant and $\tilde\beta_j$ is an initial estimator of βj. Cross-validation can be employed to obtain optimal values of γ and λ from a grid of values, where (0.5, 1, 2) are commonly used values of γ [21]. Here we use the standard Lasso solution for the initial values and set γ = 1 for simplicity of exposition.
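The adaptive weights in (3) are a one-line computation from the initial estimates; the sketch below (function name and the small `eps` guard are ours) shows how coefficients zeroed by the initial Lasso receive effectively infinite penalties:

```python
import numpy as np

def adaptive_weights(beta_init, gamma=1.0, eps=1e-8):
    """adaLasso weights w_j = 1 / |beta_init_j|^gamma (eq. 3); eps guards
    against division by zero when the initial Lasso zeroes a coefficient."""
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

beta_lasso = np.array([2.0, 0.5, 0.0])   # initial (e.g. standard Lasso) estimates
print(adaptive_weights(beta_lasso))      # large initial estimates get small penalties
```

In practice the weighted problem (3) can be handed to standard Lasso software by supplying per-covariate penalty factors; the glmnet package, for example, accepts such factors directly.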

Ridge regression.

Ridge regression [31, 32] constrains the regression coefficients using the l2-norm penalty, $\lVert \boldsymbol\beta \rVert_2^2 = \sum_{j=1}^{p} \beta_j^2$, instead of the l1 penalty employed in (2). It tends to perform particularly well in the presence of strong multicollinearity among many covariates with potentially small effect sizes, as it effectively regularizes the coefficients through a trade-off between bias and variance. As opposed to the Lasso solution, ridge regression produces even shrinkage across correlated covariates [30]. A limitation of ridge regression is that it induces soft shrinkage, i.e. it does not force coefficients to vanish and therefore does not automatically perform variable selection, as is the case with Lasso and adaLasso.

The ridge regression based penalised log-likelihood function takes the following form:

$\ell_\lambda(\alpha, \boldsymbol\beta) = \ell(\alpha, \boldsymbol\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \quad (4)$

The tuning parameter λ in (2)–(4) can be estimated using various methods including cross-validation [5], which we employ here in fitting models (2)–(4) using the R package glmnet [30].

Class-imbalance and response-based sampling

As noted earlier, rarity of one of the binary response classes (usually the positive cases, Y = 1) can lead to significant imbalance in the observed frequencies of the two classes. Datasets with severe class-imbalance are often prohibitively large and challenging for training statistical and machine learning models in a computationally tractable manner [33]. In this study we employ a response-based downsampling scheme for controls (Y = 0), where we retain all case instances and draw a simple random sample from the controls. Let n be the size of the entire available sample and n0 and n1 (n0 > n1) be the total number of controls and cases, respectively; then n1 instances are randomly selected from the n0 controls, resulting in a balanced (and therefore reduced) dataset of size nb = 2n1. This has the effect of inducing an offset intercept term, log(p1/p0), into the full-data linear predictor $\eta_i = \alpha + \mathbf{x}_i^\top \boldsymbol\beta$, i.e. [18]:

$P(Y_i = 1 \mid \mathbf{x}_i, s_i = 1) = \frac{\exp\{\alpha + \log(p_1/p_0) + \mathbf{x}_i^\top \boldsymbol\beta\}}{1 + \exp\{\alpha + \log(p_1/p_0) + \mathbf{x}_i^\top \boldsymbol\beta\}} \quad (5)$

where si is the selection status (0 or 1) of the ith instance in the original dataset; p1 and p0 are the selection probabilities of cases and controls, respectively. These probabilities can be determined from the respective sampling proportions. Under the balanced sampling design employed in this study, p1 = 1 and p0 is estimated as the ratio of the number of cases to the number of controls in the full dataset.
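The offset correction in (5) is straightforward to apply after fitting; a minimal sketch (function name is ours) that recovers the full-data intercept from a balanced-sample fit, using p1 = 1 and p0 = n1/n0:

```python
import math

def offset_corrected_intercept(alpha_balanced, n0, n1):
    """Recover the full-data intercept from a balanced-sample logistic fit:
    the balanced-data intercept equals alpha + log(p1/p0), with p1 = 1 and
    p0 = n1/n0, so we subtract log(n0/n1) to undo the offset (eq. 5)."""
    return alpha_balanced - math.log(n0 / n1)

# e.g. 10,000 controls and 100 cases: the offset is log(100) ≈ 4.605
print(offset_corrected_intercept(-1.2, 10_000, 100))
```

The slope coefficients β are unaffected by response-based sampling; only the intercept needs this adjustment when full-data probabilities are required.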

Stable Variable Ranking and Selection (SVRS) algorithm

Our variable ranking and selection algorithm is based on the following key steps:

  1. Start with the full dataset of size n, Z = [z1, z2, …, zn], where $\mathbf{z}_i = (y_i, \mathbf{x}_i^\top)^\top$ is the ith observed data vector and Z can be written as Z = [Z0, Z1]; Z0 and Z1 denote the data partitions corresponding to yi = 0 (control) and yi = 1 (case) observations, respectively.
  2. For each j in (1, 2, …, M), do:
    1. Draw a simple random sample of size n1, Z0,j, without replacement from the control observations, Z0.
    2. Generate the balanced dataset of size nb as $Z_b^{(j)} = [Z_{0,j}, Z_1]$.
    3. Fit a regularized OLR model (e.g. Lasso) to $Z_b^{(j)}$ and store the estimated regression coefficient vector, $\hat{\boldsymbol\beta}^{(j)}$.
  3. Apply the variable ranking algorithm described in the next section to the resulting M × p coefficient matrix, $\hat{B} = [\hat{\boldsymbol\beta}^{(1)}, \hat{\boldsymbol\beta}^{(2)}, \ldots, \hat{\boldsymbol\beta}^{(M)}]^\top$, to compute an aggregate rank score, Rank(Xi), for each covariate Xi, i = 1, 2, 3, …, p.
  4. Choose a subset of variables as the most influential (relevant) covariates by selecting a threshold rank score from the sorted rank scores, Rank(Xi).
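The subsampling loop (steps 1–3) can be sketched as follows; `toy_fit` is a stand-in for the regularized OLR fitter (glmnet in the paper) so that the example runs self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_subsamples(X, y, M, rng):
    """Step 2 of SVRS: repeatedly subsample controls (y == 0) without
    replacement to match the n1 cases, yielding M balanced datasets."""
    cases, controls = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    for _ in range(M):
        picked = rng.choice(controls, size=cases.size, replace=False)
        idx = np.concatenate([cases, picked])
        yield X[idx], y[idx]

def svrs_coefficients(X, y, fit, M=100, rng=rng):
    """Fit a regularized OLR model to each balanced dataset and stack the
    estimated coefficient vectors into the M x p matrix used in step 3."""
    return np.vstack([fit(Xb, yb) for Xb, yb in balanced_subsamples(X, y, M, rng)])

# Toy stand-in for the regularized fitter (a mean-difference score), only so
# that the sketch executes without a Lasso implementation:
toy_fit = lambda Xb, yb: Xb[yb == 1].mean(0) - Xb[yb == 0].mean(0)

X = rng.normal(size=(1000, 5)); y = (rng.random(1000) < 0.05).astype(int)
B = svrs_coefficients(X, y, toy_fit, M=20)
print(B.shape)  # (20, 5): one coefficient vector per balanced dataset
```

Because each balanced fit is independent, the loop parallelizes trivially across the M datasets, which is exploited in the case study below.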

Next, we describe the variable ranking algorithm and our methodology for selecting a rank threshold for variable selection.

Ranking algorithm.

For simplicity of exposition, we drop the hat symbol from the various quantities presented in this section, e.g. $\hat{\boldsymbol\beta}^{(j)}$ is replaced by $\boldsymbol\beta^{(j)}$. Let us assume that the computed regression coefficients βi are on their original scale of measurement (i.e., the covariates are not standardized) and let $\boldsymbol\beta^{s}$ denote the corresponding vector of standardized coefficients whose elements, $\beta^{s}_i$, are defined as $\beta^{s}_i = \beta_i \, sd(x_i)$, where sd(xi) is the standard deviation of the observed values for the ith covariate. These standardized coefficients correspond to the transformed covariates, $x^{s}_i = x_i / sd(x_i)$.

Also, let (R1,j, R2,j, …, Rp,j) be the ranks assigned to the absolute values of the elements in the vector of standardized coefficients from the jth fit, $\boldsymbol\beta^{s,(j)}$, where 1 ≤ Ri,j ≤ p. Note that the covariates ranked at p and 1 have the highest and lowest absolute standardized coefficients, respectively. After sorting the absolute standardized coefficients in increasing order we denote the resulting vector as $\boldsymbol\beta^{s}_{(\cdot)}$, such that $|\beta^{s}_{(1)}| \le |\beta^{s}_{(2)}| \le \cdots \le |\beta^{s}_{(p)}|$, and $X_{i(h)}$ is the covariate whose assigned rank value is h. For instance, if X5 gets assigned a rank value of 10, we have i(10) = 5. Note also that for a given rank position h, $X_{i(h)}$ can differ between model fits for j = 1, 2, …, M.

An aggregate rank score based on the M fits of the model is given by the following formula:

$\mathrm{Rank}(X_i) = \frac{1}{M} \sum_{j=1}^{M} \sum_{h=1}^{p} h \cdot I_{\{i_j(h) = i\}} \quad (6)$

where IA is an indicator function of event A, $i_j(h)$ denotes i(h) in the jth fit, and h indicates the rank position.

Theorem 1. We have:

  1. 1 ≤ Rank(Xi) ≤ p, and
  2. $\sum_{i=1}^{p} \mathrm{Rank}(X_i) = p(p+1)/2$.

Proof: i) Let us rewrite Rank(Xi) as:

$\mathrm{Rank}(X_i) = \frac{1}{M} \sum_{j=1}^{M} R_{i,j}, \quad \text{where } R_{i,j} = \sum_{h=1}^{p} h \cdot I_{\{i_j(h) = i\}}.$

Then we must have 1 ≤ Ri,j ≤ p for a given j, since each covariate occupies exactly one rank position in (1, 2, …, p). So, the average over the index j must also satisfy the same constraints.

ii) Consider the sum

$\sum_{i=1}^{p} \mathrm{Rank}(X_i) = \frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{p} R_{i,j} = \frac{1}{M} \sum_{j=1}^{M} \frac{p(p+1)}{2} = \frac{p(p+1)}{2},$

since, within each fit j, the ranks (R1,j, …, Rp,j) form a permutation of (1, 2, …, p).
For regularization methods with hard shrinkage, such as Lasso logistic regression, it is common for several of the estimated coefficients to be forced to exactly zero at convergence as a consequence of the l1 regularization penalty. When ranking estimated coefficients from a Lasso logistic fit, these zeroed-out coefficients need not be included in the ranking as they have no effect on predictive performance. A modified version of (6) is therefore presented as follows.

Let x* be a covariate having a zero coefficient in some jth Lasso fit. Suppose x* is ranked at 10 in that fit; then its rank contribution in (6) from this fit is h = 10, even though its estimated coefficient is 0. However, we should penalize x* here, as it was in fact dropped from selection in that jth fit. This is achieved by restructuring (6) as follows:

$\mathrm{Rank}(X_i) = \frac{1}{M} \sum_{j=1}^{M} \sum_{h=1}^{p} h \cdot I_{\{i_j(h) = i\}} \cdot I_{\{\beta^{(j)}_i \neq 0\}} \quad (7)$

where 0 ≤ Rank(Xi) ≤ p, and $\sum_{i=1}^{p} \mathrm{Rank}(X_i) \le p(p+1)/2$.

A related scenario can also arise when several covariates are repeatedly zeroed-out in all M model fits, causing all the regression coefficients falling within a certain rank position to become zero. This leads us to the following definitions and a related theorem.

Definition 1 (supported variables): The set of supported covariates is given as

$\chi = \{X_i : \beta^{(j)}_i \neq 0 \text{ for at least one } j \in (1, 2, \ldots, M)\}.$

Essentially, χ consists of all covariates that were not dropped in at least one of the M model fits. For soft-shrinkage regularization such as in ridge regression, we typically have χ = X, the full set of covariates.

Definition 2 (effective maximum rank): The effective maximum rank is given as $p_{EF} = \#\{h : S_h > 0\}$, where $S_h = \sum_{j=1}^{M} |\beta^{s,(j)}_{(h)}|$ for a given value of h, and pEF ≤ p.

Theorem 2. Let $S_h = \sum_{j=1}^{M} |\beta^{s,(j)}_{(h)}|$, and assume that $S_{h^*} = 0$ for some $h^* \in (1, 2, \ldots, p)$; then we have Sh = 0 for all $h \le h^*$.

Proof: $S_{h^*} = 0$ implies that $\beta^{s,(j)}_{(h^*)} = 0$ for all j. This also implies that $\beta^{s,(j)}_{(1)}, \ldots, \beta^{s,(j)}_{(h^*-1)}$ must all be zero because, for any given j, we have by definition: $|\beta^{s,(j)}_{(1)}| \le |\beta^{s,(j)}_{(2)}| \le \cdots \le |\beta^{s,(j)}_{(h^*)}|$. Therefore, we must have $S_h \le S_{h^*} = 0$; and hence $S_h = 0$ for all $h \le h^*$.
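Under our reading of Definition 2, Sh and pEF can be computed by sorting each fit's absolute coefficients; Theorem 2 is then visible directly, since any all-zero columns occupy the lowest rank positions:

```python
import numpy as np

def effective_max_rank(B):
    """S_h sums the h-th smallest |coefficient| over the M fits; p_EF counts
    the rank positions with nonzero column sum (Definition 2, as read here)."""
    sorted_abs = np.sort(np.abs(B), axis=1)   # columns = rank positions 1..p
    S = sorted_abs.sum(axis=0)
    return S, int((S > 0).sum())

B = np.array([[0.0, 0.0, 0.7, 1.2],     # two fits, p = 4; zeroed coefficients
              [0.0, 0.3, 0.5, 0.9]])    # always sort into the lowest positions
S, p_ef = effective_max_rank(B)
print(S, p_ef)  # S_1 = 0, so only p_EF = 3 rank positions are informative
```

This is an illustrative reconstruction of the definition, not code from the paper; the zero column at rank position 1 is exactly the situation Theorem 2 characterizes.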

Therefore, considering Theorem 2, we further modify (7) as follows.

Definition 3. Let $\mathcal{H}$ be the M × p matrix whose entries (j, h) are given as $\mathcal{H}_{j,h} = h \cdot I_{\{\beta^{(j)}_{i_j(h)} \neq 0\}}$, i.e. the per-fit rank contributions appearing in (7). By Theorem 2, the zeroed-out coefficients occupy the lowest $p - p_{EF}$ rank positions, so these positions can be discarded and the remaining ranks re-based to the range 1, …, pEF. A modified version of Rank(Xi) for a maximum effective rank pEF is defined as:

$\mathrm{Rank}(X_i) = \frac{1}{M} \sum_{j=1}^{M} \sum_{h=p-p_{EF}+1}^{p} \big(h - (p - p_{EF})\big) \cdot I_{\{i_j(h) = i\}} \cdot I_{\{\beta^{(j)}_i \neq 0\}} \quad (8)$

where Xi ∈ χ, and the second summation on the right is taken over the pEF columns of $\mathcal{H}$ for which $S_h > 0$.

Here, (8) now satisfies the properties: i) 0 < Rank(Xi) ≤ pEF for Xi ∈ χ and, ii) $\sum_{X_i \in \chi} \mathrm{Rank}(X_i) \le p_{EF}(p_{EF}+1)/2$, where Rank(Xi) = 0 for Xi ∈ χc.

We use (8) as the basis for ranking covariates arising from the model fits. Note that (8) is equivalent to (6) when no coefficients are zeroed out, i.e. when χ = X and pEF = p.

Threshold selection.

The algorithm described above only completes the task of computing stable variable rank scores for a list of covariates ordered by relative importance. Our objective here is to determine how many, and which, of these covariates are influential predictors in the logistic regression model. One simple approach is to examine a selection metric and choose a threshold that partitions the covariates into influential and irrelevant sets. For instance, Nadeem et al. [29] used a selection metric, pdrop, to determine the selection threshold. The metric pdrop is defined as $p_{drop,i} = \frac{1}{M} \sum_{j=1}^{M} I_{\{\beta^{(j)}_i = 0\}}$; that is, it denotes the fraction of times the ith covariate is dropped across the Lasso logistic model fits. Nadeem et al. [29] then plotted the $p_{drop,i}$ values and noted that they tend to exhibit a change point, which they determined visually by choosing a threshold that separates the variables into clusters of important and irrelevant covariates (see Fig 8 in Nadeem et al. [29]). Notice that the pdrop metric used by Nadeem et al. [29] for variable selection is similar in spirit to the concept of empirical selection probability introduced by Meinshausen and Bühlmann [34]. However, here we introduce Rank(Xi) as the selection metric and show that it works well both for Lasso and for regularization methods that do not enforce hard shrinkage of the coefficients.

We employ automatic thresholding of the rank scores based on change-point detection methodology [35–37]. Suppose the rank scores are ordered as a sequence of values (r1, r2, …, rv). If there exists an index τ ∈ (1, 2, …, v − 1) such that some feature (e.g. the mean) of the probability distributions of (r1, r2, …, rτ) and (rτ+1, rτ+2, …, rv) differs, then a change point has occurred. We employ the changepoint package [38] in R to find a single change point in the mean Rank(Xi) values. We refer the reader to Killick et al. [39] for further details on the detection tests implemented in the changepoint package.
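A single mean change point of the kind detected by the changepoint package can be sketched with a least-squares (AMOC-style) search; this is an illustrative stand-in, not the package's exact implementation:

```python
def single_changepoint(r):
    """Single change point in the mean: choose the split index tau that
    minimizes the summed within-segment squared error (least-squares cost)."""
    def sse(seg):
        m = sum(seg) / len(seg)
        return sum((x - m) ** 2 for x in seg)
    costs = {t: sse(r[:t]) + sse(r[t:]) for t in range(1, len(r))}
    return min(costs, key=costs.get)

# Sorted rank scores with a clear drop between "important" and "irrelevant":
scores = [9.8, 9.5, 9.9, 9.7, 2.1, 1.9, 2.3, 2.0]
print(single_changepoint(scores))  # 4: the first four covariates form one cluster
```

Covariates on the high-mean side of the detected split are retained as the selected (influential) set.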

Computational efficiency.

Here we provide an illustration of the computational efficiency of the SVRS algorithm in the context of the Lasso regularization method. Similar gains in efficiency are expected under other regularization techniques, such as ridge regression, due to the use of response-based subsampling. The computational complexity of the Lasso algorithm is O(np2 + p3) [34] for the n > p case considered in this study. The regularization parameter is selected in practice using K-fold cross-validation, which results in a computing time of O(np2 (K − 1) + p3). On the other hand, the cost for a single balanced dataset with K-fold cross-validation is O(2n1 (K − 1)p2 + p3), where n1 is the total number of cases in the entire dataset. The computational complexity of SVRS is therefore O(2n1M(K − 1)p2 + p3) when the Lasso model is fitted to M balanced datasets. As we demonstrate in our case study analyses later, M = 100 is a reasonable choice for very large imbalanced datasets. Hence, the algorithmic complexity of SVRS scales with the size of the minority class only. For instance, one of the datasets analyzed in our case study has n = 10.7 million and n1 = 22,525; fitting the Lasso model to the entire original dataset would therefore require approximately 238 times the computational resources needed to analyze a single balanced dataset, and still about 2.4 times the resources needed to run the full SVRS algorithm.
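Since the n·p²·(K − 1) term dominates for n ≫ p, the cost ratios above reduce, to a good approximation, to ratios of sample sizes; the arithmetic for the quoted case study figures:

```python
# Dominant-cost comparison for K-fold cross-validated Lasso: the n*p^2*(K-1)
# term is linear in n, so cost ratios reduce to ratios of sample sizes.
n, n1, M = 10_700_000, 22_525, 100   # case study: full size, cases, SVRS fits
nb = 2 * n1                          # size of one balanced dataset

full_vs_one = n / nb                 # full-data fit vs one balanced fit
full_vs_svrs = n / (M * nb)          # full-data fit vs all M SVRS fits
print(round(full_vs_one), round(full_vs_svrs, 1))  # 238 2.4
```

These ratios understate the practical advantage, since the M balanced fits can also run in parallel while a single full-data fit cannot be split this way.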

It is, however, crucial to note that fitting regularized versions of the OLR model to massively large datasets is often computationally infeasible due to the restricted amount of memory (RAM) available on computers. This issue is especially acute for commonly used analysis languages such as R. The SVRS algorithm is therefore attractive in the sense that: i) it circumvents the computational bottleneck by making it feasible to estimate the model from much smaller subsets of the original data, and ii) its implementation is highly parallelizable, which yields further gains in computational efficiency. For the abovementioned case study example, a parallelized implementation of the Lasso-based SVRS algorithm (M = 100, p = 82) took only about 7.5 minutes on a 3.4-GHz machine with 16 cores, using the glmnet R package [30].

Simulation experiment

We simulate data under (1) using four OLR models with varying numbers of signal and noise covariates, where two of the models include correlated noise and signal covariates while the remaining two do not. Model descriptions, along with the respective ratios of the number of signal to the number of noise covariates, s:m, and the regression coefficient vectors, are reported in Table 1.

Table 1. Number of covariates (p), ratio of number of signal to number of noise covariates (s: m), and the nonzero regression coefficients corresponding to the signal covariates for the four simulation models.

https://doi.org/10.1371/journal.pone.0280258.t001

Simulation of model with uncorrelated covariates.

Data under the models with uncorrelated covariates, UNCOR-12 and UNCOR-24, are generated as follows where (X1, X2, …., Xs) are the signal covariates (Table 1).

UNCOR-12. Signal covariates are distributed as Xj ~ Bernoulli(0.5) for j = 1, 2, …, 6 and Xj ~ N(0,1) for j = 7, 8, …, 12; whereas the noise covariates are Xj ~ Bernoulli(0.5) for j = 13, 14, …, 18 and Xj ~ Unif(0,1) for j = 19, 20, …, 100.

UNCOR-24. We have Xj ~ Bernoulli(0.5) for j = 1, 2, …, 12 and Xj ~ N(0,1) for j = 13, 14, …, 24 generated as the signal covariates. Noise covariates are simulated as Xj ~ Unif(0,1), j = 25, …, 200.

All covariates under UNCOR-12 and UNCOR-24 are distributed as independent random variables. We also apply the logistic transformation to the normally distributed covariates so that their support is the unit interval, [0,1]. This is done to ensure that the covariate scales are relatively consistent and that the magnitudes of the regression coefficients (Table 1) reflect relative variable importance in the simulated OLR model. The distribution and correlation structure of the covariates under the COR-12 and COR-24 simulation models are described in the next subsection.
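The UNCOR-12 covariate design can be sketched as follows (function name is ours); the logistic transformation maps the normal signals into the unit interval as described above:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_uncor12(n, rng):
    """Covariate matrix of the UNCOR-12 design: 6 Bernoulli and 6
    logistic-transformed normal signals, then 6 Bernoulli and 82 uniform
    noise covariates (100 columns in total)."""
    expit = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.column_stack([
        rng.binomial(1, 0.5, (n, 6)),      # X1..X6   signal, Bernoulli(0.5)
        expit(rng.normal(size=(n, 6))),    # X7..X12  signal, N(0,1) mapped to (0,1)
        rng.binomial(1, 0.5, (n, 6)),      # X13..X18 noise, Bernoulli(0.5)
        rng.uniform(size=(n, 82)),         # X19..X100 noise, Unif(0,1)
    ])

X = simulate_uncor12(500, rng)
print(X.shape)  # (500, 100)
```

Binary responses would then be drawn as $y_i \sim \mathrm{Bernoulli}(\pi_i)$ with πi computed from the coefficients in Table 1 and an intercept chosen to induce the target imbalance ratio.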

Simulation of models with correlated covariates.

We include both categorical and continuous correlated covariates in the COR-12 and COR-24 simulation models. Touloumis [40] provides a computational framework to simulate correlated clusters of categorical variables using marginal baseline-category logit models [41], describing and implementing the NORmal-To-Anything (NORTA) method introduced by Cario and Nelson [42] for simulating correlated binary and nominal responses under a desired marginal model specification. We employ the R package SimCorMultRes [40] to simulate the data and use the NORTA method to generate the categorical responses, which we incorporate in our simulation models as categorical covariates.

The data generation process under the COR-12 and COR-24 models is described as follows, where the first s covariates (X1, X2, …, Xs) are signals and the rest are all noise (Table 1).

COR-12. We generate four correlated clusters of a 4-category nominal variable Z, where (X1, X2, X3, X*1) denote the four dummy variables (i.e. X1 + X2 + X3 + X*1 = 1) corresponding to the categories of Zc,1 in cluster 1. Similarly, we have (X4, X5, X6, X*2) and (X13, X14, X15, X*3) for clusters 2 and 3 respectively, where (X13, X14, X15) contains noise covariates. Here, the dummy variables within each cluster are all correlated, and we do not use X*i to avoid perfect multicollinearity among the four dummy variables (e.g. X*1 = 1 − X1 − X2 − X3). We also do not incorporate the fourth cluster, Zc,4, in the COR-12 simulations. We further generate a separate set of correlated binary covariates (X16, X17, X18) independently from the categorical covariates.

The rest of the covariates are continuous and independently generated as follows: (X7, X8, …, X12) ~ MVN(0, R), where the off-diagonal entries ρk,l of the correlation matrix R are all 0.8; and the remaining noise covariates (X19, X20, …, X100) are iid Unif(0,1).

COR-24. Here, we retain the four correlated clusters of the 4-category variable Z with the revised labeling (X25, X26, X27, X*3) and (X28, X29, X30, X*4) for clusters 3 and 4, respectively. Labels for clusters 1 and 2 remain unchanged. Independently of Zc,i, we simulate (X7, …, X12) and (X31, …, X36) as correlated binary covariates. Continuous signal covariates are simulated as (X13, X14, …, X24) ~ MVN(0, R), where ρk,l = 0.8 for the off-diagonal entries of R; and the rest of the noise covariates (X37, X38, …, X200) are iid Unif(0,1).

For COR-12 and COR-24, all normally distributed covariates were mapped to [0,1] using the logistic transformation.

Generation of balanced datasets.

We create balanced datasets using response-based sampling performed on initial datasets (of size n) that are generated under the four simulation models with three class-imbalance ratios (IR) and increasing sizes (nb) of the balanced samples. Combinations of the (IR, nb) values considered in our simulation experiment, along with the initial sample sizes (n) required to achieve each imbalance ratio, are reported in Table 2. We induce rarity of the positive class by incorporating appropriately small values of the intercept, α, in the simulation models. We generate M = 500 balanced datasets under each of the resulting 60 combinations involving four models, three IR values and five nb values (Tables 1 and 2). Lasso, adaLasso and ridge regression are then fitted to each balanced dataset to generate the estimated coefficient vectors, $\hat{\boldsymbol\beta}^{(j)}$ (see the SVRS algorithm steps above).

Table 2. Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb).

https://doi.org/10.1371/journal.pone.0280258.t002

Fig 1 provides an example of the distribution of simulated probabilities of the positive class in the original data sample (of size n) and for balanced datasets under the COR-24 simulation model with varying imbalance ratios (IR). We notice that the distribution of simulated probabilities for the original imbalanced data is highly skewed towards zero (i.e. cases are rare), and the degree of skewness increases as IR becomes more severe (Fig 1A). Balancing the sample, however, removes the skewness and renders a distribution that is invariant across imbalance ratios (Fig 1B). We observed similar characteristics in the other simulation models as well.

Fig 1. Distribution of simulated probabilities with imbalance ratios 1:50 (black), 1:100 (dark grey), and 1:1000 (light grey), under COR-24.

(A) Original imbalanced datasets (B) Balanced datasets based on response-based downsampling.

https://doi.org/10.1371/journal.pone.0280258.g001

Results

The simulation results in this section focus on the variable selection performance of the SVRS algorithm based on performance metrics such as the true positive rate (TPR), false positive rate (FPR), false discovery rate (FDR) and area under the ROC curve (AUC). The rate metrics are defined as follows: TPR = TP/s, FPR = FP/m and FDR = FP/(TP + FP), where s, m, FP and TP denote the number of signal covariates, the number of noise covariates, false positives, and true positives, respectively. We also compare selection performance against the distribution of these metrics computed from variable selection performed on the 500 individual fits of Lasso and adaLasso, where variables with nonzero regression coefficients are retained in the selected set. Notice that ridge regression does not perform automatic variable selection, and therefore selection performance based on individual fits is presented for the Lasso and adaLasso methods only. Fig 2 reports a summary of the predictive performance of the fitted regularized logistic regression models across all simulation scenarios. We find that the individual model fits have reasonable skill in predicting the binary response variable, based on the mean cross-validated AUC scores corresponding to the optimal value of the tuning parameter, denoted here as λmin. Notice that all results presented in this section are based on λmin.
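The selection metrics above can be computed directly from the selected and true signal index sets; a minimal sketch with hypothetical sets:

```python
def selection_metrics(selected, signal, p):
    """TPR = TP/s, FPR = FP/m, FDR = FP/(TP+FP) for a selected covariate set,
    where s = |signal| and m = p - s noise covariates."""
    tp = len(selected & signal)           # correctly selected signals
    fp = len(selected - signal)           # noise covariates selected in error
    s, m = len(signal), p - len(signal)
    return tp / s, fp / m, fp / (tp + fp) if tp + fp else 0.0

sel, sig = {1, 2, 3, 7}, {1, 2, 3, 4}     # hypothetical selected / true sets
print(selection_metrics(sel, sig, p=10))  # TPR 0.75, FPR 1/6, FDR 0.25
```

The FDR branch guards against an empty selected set, for which FDR is conventionally reported as 0.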

Fig 2. Predictive performance for various regularized logistic regression models fitted to simulated balanced datasets with imbalance ratios 1:50, 1:100 and 1:1000.

AUC values reported here correspond to λmin.

https://doi.org/10.1371/journal.pone.0280258.g002

Selection performance with uncorrelated data

UNCOR-12.

Fig 3 depicts TPR, FPR, FDR and AUC values obtained from the SVRS algorithm and the corresponding distributions of scores across the individual fits for the various regularization methods, imbalance ratios (IR) and balanced sample sizes, nb. Apart from the considerable variation in individual TPR scores for Lasso at the smallest sample size (nb = 1000), both Lasso and adaLasso scores are stable and close to 1 irrespective of the sample size and IR. It is also evident that SVRS TPR scores are near perfect for the three regularization methods in all scenarios. The individual FPR and FDR scores for Lasso and adaLasso are, however, much more variable, with median values dropping as a function of nb. The highest mean FPR scores for Lasso and adaLasso were 0.191 and 0.176 respectively, corresponding to approximately 17 and 15 noise covariates selected on average (Table 3). Similarly, the maximum FDR values ranged from 0.261 to 0.521 and from 0.218 to 0.504 for Lasso and adaLasso respectively, showing that a substantially large proportion of covariates selected from an individual fit can be false positives. The SVRS FPR and FDR scores across the two Lasso methods, on the other hand, range from 0 to 0.023 and from 0 to 0.143 respectively, across all scenarios (Table 3). In comparison, ridge regression registers higher SVRS FPR and FDR values, ranging from 0 to 0.136 and from 0 to 0.520 respectively.

Fig 3. Simulation under UNCOR-12.

Performance metric scores for variable selection based on the SVRS algorithm (symbol x) and corresponding distributions across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.g003

Table 3. Simulation under UNCOR-12.

Performance metric scores for variable selection based on the SVRS algorithm and average scores (reported in parenthesis) across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.t003

The suboptimal performance of the individual Lasso and adaLasso fits is also clear from the large variability in AUC scores (Fig 3), with poor scores observed in all scenarios. The SVRS algorithm improves selection performance by a substantial margin, with AUC scores ranging from 0.958 to 1 for the two Lasso methods. SVRS-based AUC scores for ridge regression are comparatively lower, due mostly to worse performance in terms of FPR and FDR.

UNCOR-24.

Distributions of TPR values for the two Lasso methods reveal that individual fits tend to recover a much lower proportion of signal covariates under UNCOR-24 than under UNCOR-12; however, mean TPR increases with sample size (Fig 4; Table 4). This is explained in part by the fact that UNCOR-24 has a substantial proportion of covariates with relatively small effect sizes (|βi| < 0.35; Table 1) that are hard to detect at lower sample sizes. The SVRS TPR scores, on the other hand, are generally above the 25th percentile of the individual scores, especially for the higher sample sizes. Ridge regression shows further improvement in terms of SVRS TPR scores. The behavior of the FPR and FDR scores based on individual fits and the SVRS algorithm is similar to that under UNCOR-12, with one exception: scores do not improve much with increasing sample size. It is evident from Fig 4 that SVRS AUC scores are generally higher than the median score from the individual fits, which shows that SVRS is superior at achieving stable ranking and selection as compared to individual fits.

Fig 4. Simulation under UNCOR-24.

Performance metric scores for variable selection based on the SVRS algorithm (symbol x) and corresponding distributions across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.g004

Table 4. Simulation under UNCOR-24.

Performance metric scores for variable selection based on the SVRS algorithm and average scores (reported in parenthesis) across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.t004

Selection performance with correlated data

COR-12.

Mean TPR scores under COR-12 for individual fits of the various regularization methods are lower than those attained under UNCOR-12, falling in the range 0.847 to 1 (Table 5). Likewise, SVRS tends to recover most or all signal covariates despite the severe class-imbalance and high correlation between the signal and noise covariates (TPR > 0.83 for nb > 1000; Table 5). Even at nb = 1000, SVRS TPR values are higher than the 25th percentile of the individual score distributions in most cases (Fig 5). Performance in terms of FPR and FDR again follows the pattern seen under the un-correlated simulation models, with individual Lasso and adaLasso scores much worse and more variable than the SVRS scores. Ridge regression has substantially higher SVRS FPR and FDR scores. The two Lasso methods have similar performance in terms of AUC scores, far superior to that of the individual fits, with values ranging from 0.875 to 1 across the two methods. On the other hand, SVRS AUC scores for ridge regression are relatively smaller owing to a larger number of false discoveries.

Fig 5. Simulation under COR-12.

Performance metric scores for variable selection based on the SVRS algorithm (symbol x) and corresponding distributions across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.g005

Table 5. Simulation under COR-12.

Performance metric scores for variable selection based on the SVRS algorithm and average scores (reported in parenthesis) across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.t005

COR-24.

COR-24 is the most involved simulation model, including several highly correlated candidate covariates (Table 1) whose correlation structure spans both signal and noise variables. Here the mean TPR scores for individual Lasso/adaLasso fits range from 0.740 to 0.921 with high variability in the score distribution (Fig 6). On the other hand, SVRS-based TPR values exceed the 25th percentile of the individual score distribution in most cases. It is evident that the SVRS algorithm with Lasso/adaLasso excels at filtering out the noise covariates, with very low FPR and FDR scores ranging from 0 to 0.034 and from 0 to 0.217, respectively. Individual Lasso/adaLasso fits, in comparison, do not provide a reliable selection tool as they tend to include a very large proportion of noise covariates in the selected set (Fig 6; Table 6; FDR range: 0.349 to 0.485). Ridge regression has near perfect SVRS TPR values in all cases, but with much higher FPR and FDR scores than Lasso and adaLasso. Nevertheless, SVRS-based selection performance of the three regularization methods is quite similar in terms of AUC scores, with ridge regression registering slightly higher scores (Fig 6; Table 6). We emphasize that a majority of the SVRS AUC scores for Lasso and adaLasso lie beyond the 75th percentile of the individual score distributions across all scenarios. Overall, our results clearly demonstrate that the SVRS algorithm renders a marked improvement over individual fits by stabilizing the selection variability inherent in those fits.

Fig 6. Simulation under COR-24.

Performance metric scores for variable selection based on the SVRS algorithm (symbol x) and corresponding distributions across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.g006

Table 6. Simulation under COR-24.

Performance metric scores for variable selection based on the SVRS algorithm and average scores (reported in parenthesis) across 500 individual fits of each regularization model with imbalance ratios 1:50, 1:100 and 1:1000. Performance scores for individual ridge regression fits are not applicable as it does not perform automatic variable selection.

https://doi.org/10.1371/journal.pone.0280258.t006

Case study: Analysis of wildland fire occurrence data

In this section, we consider a large wildland fire dataset compiled by Nadeem et al. [29] comprising fire occurrence records by ignition source (human- and lightning-caused) and a suite of explanatory variables on a fine spatial grid in British Columbia, Canada, during 1981–2014. Wildland fires account for an average of 2.5 million hectares of burnt area across Canada annually, costing the government of Canada up to $1.5 billion per year in fire suppression and management [43]. The province of British Columbia is a major contributor to the daily fire load in Canada, as 70% of its land area contains coniferous forest or grasslands that are potentially flammable [29]. Anthropogenic climate change has been shown to further exacerbate the risk of extreme fire-weather conditions in recent years [44]. For instance, a record 1.2 million ha of wildland burned during the historically extreme 2017 wildfire season alone [45]. It is therefore crucial to develop robust statistical models that enable identification of the key underlying drivers of the fire occurrence process and prediction of daily ignitions across the province for fire management preparedness and detection planning.

The response variable in Nadeem et al.'s [29] dataset is a Bernoulli random variable Y (an indicator for at least one ignition) observed on a space-time voxel that represents a 24-h time period for a 20×20-km cell in the National Forest Inventory (NFI) grid [46]. There are approximately 13 million 1-day space–time voxels spanning 2541 grid cells and 34 fire seasons (1981–2014), with human- and lightning-caused fire occurrences observed in only 0.23% and 0.18% of the voxels respectively, which highlights the severity of class-imbalance in the observed distribution of Y. A large number of candidate covariates were compiled, including geographic, vegetation, ecumene, surface fire weather, atmospheric stability and other derived variables (see Table 1 in Nadeem et al. [29]). We refer the reader to Nadeem et al. [29] for more details about the study area and the methods used in data collection and compilation.

Nadeem et al. [29] employed the lasso-logistic modelling framework to develop the following three fire occurrence models: i) a human-caused fire (HCF) model; ii) a lightning-caused fire (OLCF) model, which included observed cloud-to-ground lightning strikes as a candidate covariate; and iii) a lightning-caused fire (PLCF) model, which did not include the lightning strike covariate but included atmospheric stability covariates as proxies for lightning strikes. Note that OLCF did not include the atmospheric stability covariates. The PLCF and HCF models were trained on data from 1981–2008 (10.69 million observations), leaving the years 2009–2014 for testing the fitted models. The OLCF model was trained on data from 1999–2008 and 2010–2013 (5.12 million observations), leaving the years 2009 and 2014 for testing. The number of cases in the training dataset under HCF, PLCF and OLCF was 22,525; 24,392 and 10,128, respectively, so the case-control class-imbalance ratios for the HCF, PLCF and OLCF models were approximately 1:469, 1:433 and 1:504, respectively. Nadeem et al. [29] employed response-based sampling to handle this severe class-imbalance and used pdrop to perform variable selection. We emphasize that our selection methodology in this study is instead based on Rank(Xi), which works for both Lasso and Ridge regularization.
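As an illustration, response-based (case-control) sampling of this kind can be sketched as below: keep every case and draw an equal-sized random subset of controls. The function name and toy imbalance ratio are ours, not from the paper:

```python
import numpy as np

def balanced_subsample(y, rng):
    """Draw one balanced index set via response-based sampling:
    keep all cases (y == 1) and subsample an equal number of controls."""
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    picked = rng.choice(controls, size=cases.size, replace=False)
    return np.concatenate([cases, picked])

rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:20] = 1                          # 20 cases vs 9980 controls (~1:499)
idx = balanced_subsample(y, rng)    # 40 observations, 20 cases and 20 controls
```

Applied to the HCF training data with its 22,525 cases, one such draw yields a balanced dataset of 45,050 observations in place of the full 10.69 million.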

We illustrate our SVRS algorithm by reanalyzing the HCF, PLCF and OLCF logistic regression models using both Lasso and Ridge regularization, where we also permute a subset of the variables that were deemed unimportant in Nadeem et al. [29]. Permuting a covariate destroys its relationship with the observed response vector and is widely used to assess variable importance in regression and classification models, e.g. random forests [47]. Here, barring the 36, 32 and 27 highest ranked covariates in the HCF, PLCF and OLCF models respectively, as reported in Nadeem et al. [29], all remaining candidate variables are permuted before implementing the SVRS algorithm. The total numbers of candidate variables included in HCF, PLCF and OLCF are 82, 69 and 67, respectively (Table 7).
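The permutation scheme can be sketched as follows (the helper name is ours): shuffling a column preserves its marginal distribution while severing its link to the response, so any permuted covariate that is still selected is a false positive.

```python
import numpy as np

def permute_columns(X, cols, rng):
    """Return a copy of X in which the listed columns are independently
    shuffled, destroying their relationship with the response while
    preserving each column's marginal distribution."""
    Xp = X.copy()
    for j in cols:
        Xp[:, j] = rng.permutation(Xp[:, j])
    return Xp
```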

Table 7. Case study.

Number of permuted and unpermuted covariates, number of permuted covariates classified as important (false positives, FP), and number of unpermuted covariates classified as important (UI). FP and UI values outside and inside the parentheses correspond to λ1se and λmin, respectively. The results are based on M = 500 balanced datasets.

https://doi.org/10.1371/journal.pone.0280258.t007

Figs 7 and 8 plot the Rank(Xi) scores for the various models along with selection thresholds computed via the changepoint detection method. We also assess the change in selection performance with low and high numbers of downsampled balanced datasets. We further consider the effect of choosing the optimally regularized model as compared to the more conventional practice of choosing a parsimonious model corresponding to the tuning parameter (λ) value that satisfies the one-standard-error rule [48–50]. Following the terminology used in the glmnet R package, we denote the λ values corresponding to the best and one-standard-error based models as λmin and λ1se in Figs 7 and 8 and Table 7.

Fig 7. Lasso regularization-based variable rank scores computed using SVRS algorithm for various wildland fire occurrence models.

Symbol × marks a changepoint in the sorted Rank(X) values; variables falling to the right of × are classified as noise covariates.

https://doi.org/10.1371/journal.pone.0280258.g007

Fig 8. Ridge regularization-based variable rank scores computed using the SVRS algorithm for various wildland fire occurrence models.

Symbol × marks a changepoint in the sorted Rank(X) values; variables falling to the right of × are classified as noise covariates.

https://doi.org/10.1371/journal.pone.0280258.g008

Regardless of the choice of the tuning parameter and the number of balanced datasets (M) used, the selected threshold values for the HCF, PLCF and OLCF models in nearly all cases show a marked change (or breakpoint) in the sorted Rank(Xi) scores which naturally separates the candidate covariates into two mutually exclusive sets of important (signal) and unimportant (noise) covariates (Figs 7 and 8). It is also evident that the changepoint method accurately detects these breakpoints, as shown by the symbols ×. The key finding here is that all permuted covariates are classified as unimportant under Lasso and Ridge regression for both λmin and λ1se (Table 7; M = 500), providing strong evidence that our variable selection approach achieves very low FPR and FDR scores for complex real-world datasets. This result remains unchanged with M = 100 for both Lasso and Ridge regularization. On the other hand, FDR and FPR scores computed from the individual Lasso fits are at unacceptably high levels, as shown in Fig 9, even when λ1se is used to force stronger penalization of the regression coefficients. This finding is in line with the behavior observed in the simulation experiments, where individual-fit-based FDR and FPR values for Lasso/adaLasso were highly elevated under all simulation models. Furthermore, we found that a large proportion of the unpermuted covariates are deemed important, which agrees with the variable selection results obtained by Nadeem et al. [29] based on Lasso-logistic modeling of the original unperturbed dataset (Table 7). We refer the reader to Nadeem et al. [29] for a detailed discussion of the relevance of the selected covariates to the wildland fire occurrence process in British Columbia, Canada.
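To illustrate the thresholding step, a minimal single-changepoint detector for a sorted score sequence can be written as below. This is a simplified, least-squares stand-in for the routines of the changepoint R package actually used in the analysis, and the toy scores are ours:

```python
import numpy as np

def single_changepoint(scores):
    """Locate a single changepoint in the mean of a score sequence by
    minimizing the total within-segment sum of squares (an at-most-one-
    changepoint split in the spirit of the methods cited in the text)."""
    x = np.asarray(scores, dtype=float)
    best_tau, best_cost = 1, np.inf
    for tau in range(1, x.size):            # candidate split: x[:tau] | x[tau:]
        left, right = x[:tau], x[tau:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

ranks = [0.98, 0.96, 0.95, 0.93, 0.12, 0.10, 0.08, 0.05]  # sorted rank scores
tau = single_changepoint(ranks)   # variables before position tau are kept as signal
```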

Fig 9. Lasso regularization-based distributions of FDR and FPR for variable selection performed across 500 individual model fits to permuted and balanced wildland fire occurrence datasets.

https://doi.org/10.1371/journal.pone.0280258.g009

Discussion

This study presents a novel variable ranking and selection method, SVRS, for regularized logistic regression in the context of severely imbalanced and potentially massive and high-dimensional datasets. It consists of three basic components: i) a base regularization method, e.g. Ridge regression, that outputs regularized estimates of the regression coefficients; ii) response-based sampling to generate an ensemble of regularized coefficients by fitting the base algorithm to several balanced datasets; and iii) an algorithm to stabilize the variable rank scores from the ensemble and select variables with high rank scores. The method is very flexible, as we demonstrate its applicability for regularization methods that enforce both hard- and soft-shrinkage of the regression coefficients. The simulation experiments show that methods with built-in structure estimation, such as Lasso, can produce highly unstable and misleading selection results for high-dimensional data. Analysis of the permuted wildland fire occurrence data further reveals that these methods can fail spectacularly at controlling the false discovery rate. SVRS, on the other hand, stabilizes the noise in the estimated regression coefficients and yields variable selection with high accuracy and very low false discovery rate.
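A minimal sketch of these three components is given below, under the assumption that a covariate's rank score aggregates per-fit rankings of the absolute regularized coefficients (the precise Rank(Xi) definition appears in the paper's methods section, not in this excerpt), with scikit-learn's L1-penalized logistic regression standing in for glmnet:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def svrs_rank(X, y, M=50, rng=None):
    """Illustrative SVRS-style ensemble ranking (an assumption-laden sketch,
    not the paper's exact algorithm)."""
    if rng is None:
        rng = np.random.default_rng(0)
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(M):
        # (ii) response-based sampling: one balanced dataset per iteration
        idx = np.concatenate([cases, rng.choice(controls, cases.size, replace=False)])
        # (i) base regularization method: here an L1-penalized logistic fit
        fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
        fit.fit(X[idx], y[idx])
        # (iii) accumulate per-fit ranks of |coefficients| (largest gets rank p)
        rank_sum += np.abs(fit.coef_.ravel()).argsort().argsort() + 1
    return rank_sum / M  # average rank score per covariate
```

Averaging the ranks across the M balanced fits is what damps the fit-to-fit selection noise that the individual Lasso fits exhibit in the simulations.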

Another potential general regularization framework for logistic regression is the elastic net penalty [6], which includes the Lasso and Ridge penalties as special cases. It involves an additional mixing parameter, 0 ≤ α ≤ 1, where values of 0 and 1 correspond to the Ridge and Lasso penalties, respectively. Here, α induces a tradeoff between the l1-norm (Lasso) and l2-norm (Ridge) penalties and works well in the presence of strongly correlated covariates. We studied the performance of the SVRS algorithm with the elastic net in a small simulation experiment under various settings as defined in Tables 1 and 2 and found that the results (not reported here) were similar to the Lasso and Ridge cases examined herein. We therefore opted to focus on the two widely applicable special cases of the elastic net to simplify the exposition of our methodology.

The use of response-based downsampling in our work is an instance of the general notion of subsampling and is particularly useful in reducing the computational burden when dealing with prohibitively large datasets with extreme class-imbalance. For instance, the full BC wildland fire occurrence dataset for person-caused fires comprises 13 million instances with a case-control class-imbalance ratio of 1:469, whereas a single balanced dataset generated via response-based sampling consists of only 45,050 observations. The methodology introduced herein is in the same vein as stability selection [34, 51], a general subsampling based variable selection approach designed to improve the selection performance of a base selection algorithm for high-dimensional data. However, SVRS differs from stability selection in two aspects: i) it stabilizes the regression coefficients across the ensemble, whereas the latter is designed to stabilize the selections (i.e. the classification of a variable into the noise or signal group) performed on the subsampled data; and ii) response-based sampling in SVRS ensures that the size of the subsamples can be orders of magnitude smaller than ⌊n/2⌋, the required subsample size for stability selection. This latter aspect is crucial in the context of severely imbalanced and massive datasets, as training the base algorithm repeatedly on millions of observations can be computationally prohibitive. It is also important to note that the response-based sampling approach in SVRS alleviates class-imbalance in the individual subsamples without affecting the predictive performance of the base regularization method, whereas the usual random sampling, as implemented in stability selection, inherits the same severe class-imbalance present in the original dataset.

In summary, this study introduces a new variable selection method for logistic regression modeling of extreme rare events data. The method combines response-based subsampling and commonly employed regularization methods to perform accurate variable selection for high-dimensional and large datasets. The performance results are supported by an extensive simulation experiment and analysis of big and severely imbalanced real-life datasets.

Supporting information

S1 Appendix. Analysis of a reduced case study dataset.

https://doi.org/10.1371/journal.pone.0280258.s001

(DOCX)

Acknowledgments

We wish to thank Steve Taylor of the Pacific Forestry Centre in British Columbia for providing the data and Douglas Woolford of the University of Western Ontario for reviewing an earlier heuristic version of the SVRS algorithm.

References

  1. Nusinovici S, Tham YC, Yan MY, Ting DS, Li J, Sabanayagam C, et al. Logistic regression was as good as machine learning for predicting major chronic diseases. Journal of Clinical Epidemiology. 2020 Jun 1;122:56–69. pmid:32169597
  2. Costa e Silva E, Lopes IC, Correia A, Faria S. A logistic regression model for consumer default risk. Journal of Applied Statistics. 2020 Nov 17;47(13–15):2879–94. pmid:35707418
  3. Bavaghar MP. Deforestation modelling using logistic regression and GIS. Journal of Forest Science. 2015 May 15;61(5):193–9.
  4. Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media; 2011 Jun 8.
  5. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996 Jan;58(1):267–88.
  6. Trofimov I, Genkin A. Distributed coordinate descent for generalized linear models with regularization. Pattern Recognition and Image Analysis. 2017 Apr;27(2):349–64.
  7. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49–67.
  8. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics. 2007 Dec;35(6):2313–51.
  9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001 Dec 1;96(456):1348–60.
  10. Sirimongkolkasem T, Drikvandi R. On regularisation methods for analysis of high dimensional data. Annals of Data Science. 2019 Dec;6(4):737–63.
  11. King G, Zeng L. Logistic regression in rare events data. Political Analysis. 2001;9(2):137–63.
  12. Owen AB. Infinitely imbalanced logistic regression. Journal of Machine Learning Research. 2007 Apr 1;8(4):761–73.
  13. Hasanin T, Khoshgoftaar TM, Leevy JL, Bauder RA. Severely imbalanced big data challenges: investigating data sampling approaches. Journal of Big Data. 2019 Dec;6(1):1–25.
  14. Arezzo MF, Guagnano G. Response-based sampling for binary choice models with sample selection. Econometrics. 2018 Mar;6(1):12.
  15. Breslow NE, Day NE, Heseltine E. Statistical methods in cancer research. Lyon: International Agency for Research on Cancer; 1980.
  16. Jiang Y, Scott AJ, Wild CJ. Adjusting for non‐response in population‐based case‐control studies. International Statistical Review. 2011 Aug;79(2):145–59.
  17. Manski CF, McFadden D, editors. Structural analysis of discrete data with econometric applications. Cambridge, MA: MIT Press; 1981 Mar. p. 2–50.
  18. Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. New York: Wiley; 2000.
  19. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. Journal of Big Data. 2018 Dec;5(1):1–30.
  20. Bach FR. Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th International Conference on Machine Learning; 2008 Jul 5. p. 33–40.
  21. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006 Dec 1;101(476):1418–29.
  22. Antelo-Collado A, Carrasco-Velar R, García-Pedrajas N, Cerruela-García G. Effective feature selection method for class-imbalance datasets applied to chemical toxicity prediction. Journal of Chemical Information and Modeling. 2020 Dec 22;61(1):76–94. pmid:33350301
  23. Chen H, Li T, Fan X, Luo C. Feature selection for imbalanced data based on neighborhood rough sets. Information Sciences. 2019 May 1;483:1–20.
  24. Chen RC, Dewi C, Huang SW, Caraka RE. Selecting critical features for data classification based on machine learning methods. Journal of Big Data. 2020 Dec;7(1):1–26.
  25. Fu GH, Xu F, Zhang BY, Yi LZ. Stable variable selection of class-imbalanced data with precision-recall criterion. Chemometrics and Intelligent Laboratory Systems. 2017 Dec 15;171:241–50.
  26. Kamalov F, Thabtah F, Leung HH. Feature selection in imbalanced data. Annals of Data Science. 2022 Jan 24:1–5.
  27. Khaldy MA, Kambhampati C. Resampling imbalanced class and the effectiveness of feature selection methods for heart failure dataset. International Robotics & Automation Journal. 2018 Feb;4(1):1–0.
  28. Massi MC, Gasperoni F, Ieva F, Paganoni AM. Feature selection for imbalanced data with deep sparse autoencoders ensemble. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2022 Jun;15(3):376–95.
  29. Nadeem K, Taylor SW, Woolford DG, Dean CB. Mesoscale spatiotemporal predictive models of daily human- and lightning-caused wildland fire occurrence in British Columbia. International Journal of Wildland Fire. 2019 Dec 24;29(1):11–27.
  30. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22.
  31. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970 Feb 1;12(1):55–67.
  32. Lee AH, Silvapulle MJ. Ridge estimation in logistic regression. Communications in Statistics-Simulation and Computation. 1988 Jan 1;17(4):1231–57.
  33. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications. 2017 May 1;73:220–39.
  34. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010 Sep;72(4):417–73.
  35. Eckley IA, Fearnhead P, Killick R. Analysis of changepoint models. In: Bayesian Time Series Models. 2011 Jan. p. 205–24.
  36. Hinkley DV. Inference about the change-point in a sequence of random variables. Biometrika. 1970 Apr;57(1):1–7.
  37. Silva EG, Teixeira AA. Surveying structural change: seminal contributions and a bibliometric account. Structural Change and Economic Dynamics. 2008 Dec 1;19(4):273–300.
  38. Killick R, Haynes K, Eckley I, Fearnhead P, Lee J. changepoint: methods for changepoint detection. R package version 2.2.2.
  39. Killick R, Eckley I. changepoint: an R package for changepoint analysis. Journal of Statistical Software. 2014;58(3):1–9.
  40. Touloumis A. Simulating correlated binary and multinomial responses under marginal model specification: the SimCorMultRes package. The R Journal. 2016 Dec 1;8(2):79.
  41. Agresti A. Categorical data analysis. John Wiley & Sons; 2003 Mar 31.
  42. Cario MC, Nelson BL. Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois; 1997 Apr 9.
  43. Government of Canada [Internet]. Forest Fires; 2021 [updated 2021 Apr 14; cited 2022 Jan 8]. https://www.nrcan.gc.ca/our-natural-resources/forests/wildland-fires-insects-disturbances/forest-fires/13143.
  44. Kirchmeier‐Young MC, Gillett NP, Zwiers FW, Cannon AJ, Anslow FS. Attribution of the influence of human‐induced climate change on an extreme fire season. Earth's Future. 2019 Jan;7(1):2–10. pmid:35860503
  45. Government of British Columbia [Internet]. Wildfire Season Summary; 2020 [cited 2022 Jan 8]. https://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-history/wildfire-season-summary.
  46. Gillis MD, Omule AY, Brierley T. Monitoring Canada's forests: the national forest inventory. The Forestry Chronicle. 2005 Apr 1;81(2):214–21.
  47. Breiman L. Random forests. Machine Learning. 2001 Oct;45(1):5–32.
  48. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Routledge; 2017 Oct 19.
  49. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. Springer New York; 2009.
  50. Krstajic D, Buturovic LJ, Leahy DE, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. Journal of Cheminformatics. 2014 Dec;6(1):1–5. pmid:24678909
  51. Shah RD, Samworth RJ. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013 Jan;75(1):55–80.