
Enhanced random forest with geologically-informed feature optimization for complex volcanic rock lithology identification: A case study in the Wangfu Fault Depression, Songliao Basin

  • Xiu Jin,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing

    Affiliation School of Business Administration, Liaoning Technical University, Huludao, Liaoning, China

  • Taiji Yu,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing

    yutj1988@126.com

    Affiliations College of Safety Science and Engineering, Liaoning Technical University, Huludao, Liaoning, China, College of Earth Sciences, Jilin University, Changchun, Jilin, China

  • Pujun Wang,

    Roles Data curation, Funding acquisition, Resources, Supervision

    Affiliation College of Earth Sciences, Jilin University, Changchun, Jilin, China

  • Minglong Nie,

    Roles Software, Validation

    Affiliation College of Safety Science and Engineering, Liaoning Technical University, Huludao, Liaoning, China

  • Shichang Lu

    Roles Investigation, Visualization

    Affiliation School of Business Administration, Liaoning Technical University, Huludao, Liaoning, China

Abstract

Identifying lithologies within the volcanic reservoirs of the Huoshiling Formation (Wangfu Fault Depression, Songliao Basin) remains challenging due to extreme heterogeneity, limited core control, and ambiguous responses on conventional logs. We introduce an enhanced machine-learning framework for high-precision classification of these complex volcanic sequences, leveraging detailed core descriptions and five conventional well logs—gamma ray (GR), compensated neutron (CNL), bulk density (DEN), acoustic travel-time/sonic (AC), and deep array laterolog resistivity (RLA5)—from 12,388 depth-matched samples across 20 wells, encompassing 18 lithologies. The core innovation is an enhanced Random Forest (eRF) specifically engineered for geological and data-centric challenges. The eRF synergistically integrates: (1) Borderline-SMOTE to counteract severe class imbalance by selectively augmenting minority instances near decision boundaries, critical for rare but geologically significant facies; (2) C4.5 decision trees with gain-ratio splitting to optimize node-level feature selection from correlated continuous logs; and (3) Kendall’s coefficient of concordance (Kendall’s W) to stabilize feature-importance ranking across trees, prioritizing robust predictors. Against standard RF, back-propagation neural network (BPNN), k-nearest neighbors (kNN), and support-vector machine (SVM), the eRF attains 96.34% overall accuracy. Per-class accuracies exceed 0.88 for all 18 lithologies, with the largest improvement (+43 percentage points) for trachytic tuff. Sensitivity analysis indicates GR and AC dominate, together accounting for >60% of model decisions. This geologically attuned, optimized ensemble provides a transferable route to high-resolution lithology logs in uncored intervals, substantially aiding hydrocarbon sweet-spot prediction in complex volcanic settings.

Introduction

The escalating global demand for hydrocarbons necessitates exploration in increasingly complex geological settings, with deep volcanic sequences emerging as critical targets [1]. Volcanic reservoirs are globally significant, documented in numerous basins worldwide; in China, high-yield discoveries in the Liaohe, Jilin, and Shengli oilfields underscore their potential [2]. The Songliao Basin—Northeast China’s largest Late Mesozoic continental hydrocarbon-bearing basin—hosts substantial Cretaceous volcanic–sedimentary successions owing to its complex tectonic history [3–6]. Within this basin, the Huoshiling Formation of the Wangfu Fault Depression is a key exploration focus with significant natural-gas reserves [7] (Fig 1). To contextualize the classification task, a generalized lithostratigraphic section for the Huoshiling succession is shown in Fig 1b [8].

Fig 1. Location and tectonic subdivision of the Songliao Basin.

(a) Distribution of fault depressions within the basin, highlighting the Wangfu Fault Depression; (b) Stratigraphic column of Huoshiling Formation succession.

https://doi.org/10.1371/journal.pone.0335630.g001

Accurate lithology identification in the Huoshiling Formation is hampered by the extreme complexity of its volcanic rock assemblages. The succession comprises 18 distinct lithologies—including lavas, pyroclastic lavas, pyroclastic rocks, and sedimentary volcaniclastic rocks—that frequently exhibit overlapping responses on conventional logs: gamma ray (GR), compensated neutron (CNL), bulk density (DEN), acoustic travel-time/sonic (AC), and deep array laterolog resistivity (RLA5). Rapid facies transitions, brecciation/vesiculation, and post-emplacement alteration (e.g., zeolitization, clay enrichment) further blur contrasts across gradational contacts, while limited core control for several lithologies constrains supervision and hampers validation [9–11]. Under these conditions, thin tuffaceous or brecciated interbeds embedded within thick lava packages are readily obscured in routine cross-plots and rule-based interpretations, yielding inconsistent and operator-dependent results [12].

Machine learning (ML) offers a scalable alternative by discovering implicit, nonlinear patterns in high-dimensional well-log data. Classical algorithms—neural networks, support-vector machines (SVMs), and decision-tree ensembles—have been applied to lithology prediction with encouraging outcomes in multiple basins [13,14]. Recent contributions further demonstrate both the promise and the methodological caveats: ML-based lithology prediction from conventional logs in the Cambay Basin with multi-classifier evaluation [15]; vertical lithological proxies using statistical and AI approaches in the Krishna–Godavari offshore [16]; a state-of-the-art synthesis on petrographic classification from geophysical logs reviewing data, features, and algorithm choices [12]; and ML-assisted petrographic classification in coal-measure successions from the Bokaro coalfield [17]. Collectively, these works highlight the potential of data-driven classification while emphasizing the need to address class imbalance, correlated predictors, and validation protocols aligned with geological structure and data provenance.

To address these challenges in the Huoshiling succession, we develop an enhanced Random Forest (eRF) tailored to volcanic-lithology identification. The approach couples geological realities with algorithmic mechanisms in a unified framework: Borderline-SMOTE concentrates synthetic augmentation at difficult decision boundaries to improve separability for minority lithologies; C4.5-style gain-ratio splitting prioritizes geologically discriminative thresholds and mitigates the multi-valued-attribute bias of Gini when selecting among correlated continuous predictors; and Kendall’s coefficient of concordance (Kendall’s W) promotes stability in feature importance across trees, discouraging idiosyncratic splits and enhancing ensemble robustness. We evaluate the eRF against established baselines—standard RF, back-propagation neural network (BPNN), k-nearest neighbors (kNN), and SVM—using a multi-well dataset and rigorous model assessment. The generalized lithostratigraphic context (Fig 1b) anchors the classification task in the volcanic architecture of the Huoshiling succession and facilitates geological interpretation of the results.

Geological setting and data sources

The Songliao Basin in northeastern China is the world’s largest Late Mesozoic continental hydrocarbon-bearing basin. The Wangfu Fault Depression, in its southern sector, covers ~3,500 km² and hosts extensive volcanic successions within the Cretaceous Huoshiling and Shahezi formations. The Huoshiling Formation unconformably overlies Jurassic strata and is lithologically subdivided, from base to top, into trachyte, pyroclastic rocks, and rhyolite [18]. Data from 20 wells indicate a cumulative thickness of Huoshiling Formation volcanic rocks reaching 9,679.69 m.

Following the volcanic-reservoir classification of Wang Pujun et al. [19], and integrating core observations with thin-section analyses, the Huoshiling succession in the study area comprises four structural classes and 18 principal lithologies (Fig 2; Table 1): volcanic lava, volcanic breccia, pyroclastic lava, and pyroclastic rock. These encompass diverse compositional types (e.g., trachyte, rhyolite, basalt) and multiple tuff and breccia variants, each with distinctive textures and fabrics that influence well-log responses. The large number of classes and their uneven representation (Table 1) underscore a pronounced class-imbalance challenge.

Table 1. Lithology types and relative thicknesses of volcanic rocks in the Huoshiling Formation, Wangfu Fault Depression.

https://doi.org/10.1371/journal.pone.0335630.t001

Fig 2. Representative core photographs of volcanic lithologies from the Huoshiling Formation, Wangfu Fault Depression.

Each image corresponds to a specific volcanic rock type classified based on texture and composition. See captions (a–t) for sample depth and well ID. (a) Rhyolite, C9, 2270.12 m; (b) Rhyolitic tuff lava, C607, 2720.84 m; (c-d) Rhyolitic breccia lava, C14, 3005.85 m; (e) Rhyolitic tuff, C9, 2277.62 m; (f) Rhyolitic volcanic breccia, C14, 2292.82 m; (g) Andesite, C13, 2241.00 m; (h) Andesitic tuff lava, C11, 2769.00 m; (i) Andesitic breccia lava, C11, 2769.00 m; (j) Andesitic tuff, C4, 3929.00 m; (k-l) Andesitic volcanic breccia, C8, 2209.02 m; (m) Trachyte, C606, 2405.15 m; (n) Trachytic tuff lava, C6, 3020.00 m; (o) Trachytic breccia lava, C606, 2406.55 m; (p) Trachytic tuff, C14, 3090.00 m; (q) Trachytic volcanic breccia, C13, 2308.00 m; (r) Basalt, C10, 3324.00 m; (s) Sedimentary tuff, C9, 2279.52 m; (t) Sedimentary volcanic breccia, C607, 2740.04 m.

https://doi.org/10.1371/journal.pone.0335630.g002

Core descriptions and five conventional well logs—gamma ray (GR), compensated neutron (CNL), bulk density (DEN), acoustic travel-time/sonic (AC), and deep array laterolog resistivity (RLA5)—were compiled from 20 wells as input features. Ground-truth lithology labels were derived from detailed core/cuttings analyses. In total, 12,388 depth-matched samples were assembled, each comprising a lithology label and five log readings. Prior to modeling, the logging data underwent quality control and normalization to correct depth mismatches and scale differences, yielding a consistent dataset for training and validation.

Materials and methods

To achieve robust lithology identification, we compare the proposed enhanced Random Forest (eRF) against four widely used machine-learning algorithms: standard Random Forest (RF), back-propagation neural network (BPNN), k-nearest neighbors (kNN), and support-vector machine (SVM). All models are configured for 18-class prediction using five conventional logs (GR, CNL, DEN, AC, RLA5) as inputs. Hyperparameters are tuned by nested cross-validation with groups defined at the well level (Table 4).

Table 4. Key hyperparameters and training settings for each machine-learning model.

https://doi.org/10.1371/journal.pone.0335630.t004

Standard machine learning algorithms (Baselines)

Random Forest (RF). The standard RF uses 100 decision trees with bootstrap resampling; node splits are determined by the Gini index, considering a random subset of features at each split (max_features = sqrt of total features). See Table 4 for complete settings.

Back-propagation Neural Network (BPNN). A feed-forward network is configured with one hidden layer (10 neurons, ReLU activation) and a softmax output layer for multiclass classification. Optimization is performed with Adam (Table 4).

k-Nearest Neighbors (kNN). This instance-based learner [20] classifies a sample by the majority class among its k nearest neighbors (k = 5, Euclidean distance) in the normalized feature space (Table 4).

Support-Vector Machine (SVM). A multiclass SVM [21] with one-vs-one strategy and an RBF kernel is employed. Hyperparameters (C and γ) are selected by grid search (Table 4).
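As an illustrative sketch only (not the authors’ exact configuration), the four baselines with the hyperparameters named above can be instantiated in scikit-learn; any settings not stated in the text or Table 4 are left at library defaults here, and the SVM’s C and γ would be filled in from the grid search described above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Baselines with the hyperparameters named in the text; everything
# else is a scikit-learn default, which may differ from Table 4.
baselines = {
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini",
                                 max_features="sqrt"),
    "BPNN": MLPClassifier(hidden_layer_sizes=(10,), activation="relu",
                          solver="adam"),
    "kNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "SVM": SVC(kernel="rbf", decision_function_shape="ovo"),
}
```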

Enhanced random forest (eRF)

To address the specific challenges of Huoshiling lithology identification—namely, severe class imbalance among 18 lithotypes and the need to exploit subtle, overlapping log responses—we propose an eRF that integrates three components synergistically (Fig 3): Borderline-SMOTE for targeted imbalance correction at decision boundaries [22]; a C4.5-style gain-ratio probe to prioritize geologically discriminative thresholds on continuous logs [23]; and Kendall’s coefficient of concordance (Kendall’s W) to promote stability in feature usage across trees [24]. The source code for the eRF model is publicly available on GitHub at https://github.com/yuzc18/erf-volcanic-lithology and is archived with the DOI 10.5281/zenodo.17121940.

Fig 3. Workflow of the proposed enhanced Random Forest (eRF) algorithm for volcanic lithology classification.

Data preprocessing includes quality control, Z-score standardization, and targeted imbalance correction with Borderline-SMOTE. The eRF is trained under nested GroupKFold cross-validation (inner loop: hyperparameter tuning by macro-F1; outer loop: model assessment). Trees adopt C4.5 gain-ratio splitting, and Kendall’s W is used to promote stability in feature usage across trees. After aggregating outer-fold results, the final eRF is refit on the full training set and applied to the held-out test set and to blind wells in a zero-shot manner.

https://doi.org/10.1371/journal.pone.0335630.g003

Borderline-SMOTE for imbalance correction.

The dataset exhibits pronounced imbalance (e.g., 2,670 trachyte samples vs. 116 basalt samples; Table 3), which biases learners toward majority classes. We therefore apply Borderline-SMOTE [22] during preprocessing. Unlike global oversampling, Borderline-SMOTE selectively synthesizes minority instances near decision boundaries—precisely where misclassification risk is highest—thereby increasing minority density in ambiguous regions and improving recognition of rare but geologically significant lithologies. Given our data characteristics, this choice is preferable to alternatives such as NearMiss or ADASYN [25,26], which either remove informative boundary cases or may introduce noisier samples.
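To make the borderline selection rule concrete, the following is a minimal numpy-only sketch of the idea (a practical pipeline could instead use an off-the-shelf implementation such as imbalanced-learn’s BorderlineSMOTE). The function name, parameters, and thresholds here are illustrative, not the authors’ code.

```python
import numpy as np

def borderline_smote(X, y, minority, k=5, n_new=50, seed=0):
    """Sketch of Borderline-SMOTE: oversample only minority samples whose
    k-NN neighbourhood is dominated (but not saturated) by other classes."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    per_point = max(1, n_new // max(1, len(X_min)))
    synthetic = []
    for x in X_min:
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest, excluding the point itself
        n_maj = np.sum(y[nn] != minority)
        if k / 2 <= n_maj < k:               # "danger" point: on the boundary, not noise
            d_min = np.linalg.norm(X_min - x, axis=1)
            mates = X_min[np.argsort(d_min)[1:k + 1]]  # minority-class neighbours
            for _ in range(per_point):
                mate = mates[rng.integers(len(mates))]
                synthetic.append(x + rng.random() * (mate - x))  # interpolate
    if not synthetic:
        return X, y
    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(len(synthetic), minority)])
    return X_new, y_new
```

Because augmentation is restricted to “danger” points, synthetic samples accumulate exactly where class boundaries are contested, rather than uniformly across the minority class.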

Table 3. Sample counts and statistical ranges of log parameters for volcanic lithologies.

https://doi.org/10.1371/journal.pone.0335630.t003

Feature optimization with C4.5 decision trees.

To strengthen each tree within the ensemble, we adopt C4.5-style gain-ratio splitting [23]. The gain ratio normalizes information gain by split information, mitigating the tendency to favor attributes with many effective cut points—a known issue for the Gini index on continuous, potentially multi-valued predictors. By recursively selecting thresholds with the highest gain ratio, C4.5 emphasizes log features that are most discriminative for lithology and reduces interference from redundant parameters, yielding stronger base learners.
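The gain-ratio criterion can be sketched for a single binary threshold split on a continuous log; this hypothetical helper illustrates how split information normalizes the raw information gain.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(x, y, threshold):
    """C4.5-style gain ratio for the split x <= threshold vs x > threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    # Information gain: parent entropy minus weighted child entropies.
    gain = (entropy(y)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
    # Split information penalizes attributes with many effective cut points.
    w = np.array([len(left) / n, len(right) / n])
    w = w[w > 0]
    split_info = float(-np.sum(w * np.log2(w)))
    return gain / split_info if split_info > 0 else 0.0
```

A split that cleanly separates the classes attains the maximum gain ratio, while an off-centre split of the same feature scores lower.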

Ensuring feature stability with Kendall’s W.

To further refine feature utilization and enhance model stability, Kendall’s coefficient of concordance (Kendall’s W) [24] is integrated into the eRF framework. Kendall’s W quantifies the consistency (agreement) of feature importance rankings derived from multiple decision trees within the ensemble. In this eRF, features demonstrating high concordance (i.e., consistently ranked as important across many trees) and strong correlation with lithology are prioritized or weighted more heavily during the construction of subsequent decision trees or in the feature selection process for node splitting. This mechanism aims to reduce information redundancy, enhance the diversity of effective features used by different trees, and ensure that the model relies on genuinely robust and stable discriminators rather than spurious correlations present in subsets of the data. This contributes to the overall stability and reliability of the ensemble’s predictions.
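The concordance statistic itself is compact; the sketch below (assuming untied ranks, so no tie correction) computes Kendall’s W from a matrix of per-tree feature-importance ranks.

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W for an (m trees x n features) matrix of importance
    ranks (1 = most important). W = 1 means perfect agreement, 0 none.
    Assumes no tied ranks (no tie correction applied)."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    # Sum of squared deviations of the column rank sums from their mean.
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return float(12.0 * s / (m ** 2 * (n ** 3 - n)))
```

When every tree ranks the logs identically, W = 1; features with persistently high concordant ranks are the ones an eRF-style scheme would up-weight in subsequent tree construction.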

By combining Borderline-SMOTE to rebalance informative boundaries, C4.5 gain-ratio to improve node-level feature selection, and Kendall’s W to stabilize feature usage, the eRF is designed to learn geologically meaningful decision rules from noisy, imbalanced logs. The rationale for each enhancement module is summarized in Table 2.

Table 2. Rationale for enhanced random forest (eRF) enhancement modules.

https://doi.org/10.1371/journal.pone.0335630.t002

In summary, the eRF directly addresses the geological complexities of the study area by increasing sensitivity to minority lithologies and stabilizing feature usage via gain-ratio splits and Kendall’s W, thereby improving robustness and generalization.

Results

Data preprocessing and model training

Data Preparation: A dataset of 12,388 samples from 18 volcanic lithologies was compiled from cored intervals of 20 wells in the Wangfu Fault Depression. Each sample (0.125 m interval) includes five logging parameters (GR, CNL, DEN, AC, RLA5) and a lithology label. Table 3 details sample counts and logging characteristics, highlighting severe class imbalance (e.g., trachyte: 2,670 samples; basalt: 116 samples) and overlapping log responses, which pose classification challenges. Despite overlaps, some lithologies show distinct signatures (e.g., basalt’s low GR; rhyolitic tuff’s high GR).

Data Standardization: Z-score normalization was applied to the five log curves to eliminate scale disparities:

x′ = (x − μ) / σ (1)

where x is the raw logging value, μ is the sample mean, σ is the standard deviation, and x′ is the normalized value.

Following the protocol, data were split into a held-out test set (20%) and a training set (80%). Model selection on the training set used nested GroupKFold cross-validation (group = well): the inner loop performed grid search with macro-F1 as the selection metric, and the outer loop estimated generalization. Selected models were then refit on the full training set and evaluated on the held-out test set. Table 4 summarizes the deployed hyper-parameters; the Notes list the inner-CV search spaces. The decision threshold was 0.5 unless stated otherwise.
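A minimal sketch of this protocol (toy stand-ins for the real logs, labels, and well IDs; the grid is a placeholder, not the Table 4 search space) might look as follows, with the scaler fitted inside each fold via a pipeline so no test information leaks into standardization:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: five logs (GR, CNL, DEN, AC, RLA5), labels, well IDs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 3, size=200)
wells = rng.integers(0, 8, size=200)          # grouping variable for CV

pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier(random_state=0))])
grid = {"rf__n_estimators": [50, 100]}        # placeholder search space

outer_scores = []
for tr, te in GroupKFold(n_splits=3).split(X, y, groups=wells):
    # Inner loop: grid search by macro-F1, again grouped by well.
    inner = GridSearchCV(pipe, grid, scoring="f1_macro",
                         cv=GroupKFold(n_splits=3))
    inner.fit(X[tr], y[tr], groups=wells[tr])
    # Outer loop: generalization estimate on entirely held-out wells.
    outer_scores.append(inner.score(X[te], y[te]))
```

Grouping at the well level keeps depth-adjacent samples from the same well out of both sides of a split, which would otherwise inflate the cross-validation scores.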

Test-set performance was reported by accuracy, precision, recall, and F1-score, following standard definitions for classification evaluation [27]. For the multiclass case, results are macro-averaged unless otherwise noted. For a given class c in a one-vs-rest setting with true positives TP, true negatives TN, false positives FP, and false negatives FN, the metrics are

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

Precision:

Precision = TP / (TP + FP) (3)

Recall:

Recall = TP / (TP + FN) (4)

F1-score:

F1 = 2 × (Precision × Recall) / (Precision + Recall) (5)

These per-class quantities are generalized to the multiclass setting by macro-averaging across the 18 lithology classes.
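As a sketch, the macro-averaged metrics above can be computed directly from predicted and true labels (hypothetical helper; `macro_metrics` reports overall accuracy alongside macro precision, recall, and F1):

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Overall accuracy plus macro-averaged one-vs-rest precision,
    recall, and F1 (cf. Eqs 2-5)."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, np.mean(precisions), np.mean(recalls), np.mean(f1s)
```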

Lithology identification results

Table 5 summarizes the overall performance metrics. The proposed eRF model achieved the highest accuracy (96.34%) and F1-score (0.9635), significantly outperforming standard RF (87.73% accuracy), BPNN (86.97%), kNN (94.84%), and SVM (82.32%). The balanced precision and recall for eRF indicate robust classification across all lithologies.

Table 5. Overall evaluation metrics for volcanic lithology identification using different machine-learning models.

https://doi.org/10.1371/journal.pone.0335630.t005

Per-class accuracies (Table 6, Fig 4) further highlight the eRF’s superiority. The eRF achieved >0.88 accuracy for all 18 lithologies. Notably, for trachytic tuff (a minority class), accuracy improved from 0.5455 (standard RF) to 0.9787 (eRF). Other models showed weaknesses: standard RF struggled with minority classes; BPNN underperformed for classes like sedimentary tuff (0.7087 vs. eRF’s 0.9771); kNN, while competitive for some distinct lithologies, showed performance degradation for those with overlapping features; SVM had the lowest overall per-class performance.

Table 6. Per-class identification accuracy for all 18 volcanic lithologies across five classification models.

https://doi.org/10.1371/journal.pone.0335630.t006

Fig 4. Confusion matrices for lithology classification using five machine-learning models.

Each matrix shows the classification performance for 18 volcanic lithologies. The enhanced Random Forest (eRF) model exhibits the highest diagonal accuracy and lowest misclassification rates. (Lithology labels 1–18 correspond to: 1. Basalt; 2. Trachyte; 3. Trachytic brecciated lava; 4. Trachytic tuff lava; 5. Trachytic volcanic breccia; 6. Trachytic tuff; 7. Andesite; 8. Andesitic brecciated lava; 9. Andesitic tuff lava; 10. Andesitic volcanic breccia; 11. Andesitic tuff; 12. Rhyolite; 13. Rhyolitic brecciated lava; 14. Rhyolitic tuff lava; 15. Rhyolitic volcanic breccia; 16. Rhyolitic tuff; 17. Sedimentary volcanic breccia; 18. Sedimentary tuff).

https://doi.org/10.1371/journal.pone.0335630.g004

Fig 4 shows that eRF delivers the sharpest diagonal and the sparsest off-diagonals. Residual errors concentrate within compositionally/texturally related pairs where log responses overlap—e.g., Trachytic Tuff Lava vs Trachytic Tuff, Trachytic Brecciated Lava vs Trachytic Volcanic Breccia, Andesitic Tuff Lava vs Andesitic Tuff, and their breccia counterparts; limited cross-confusion also appears inside the rhyolitic suite. Compared with the standard RF, eRF markedly suppresses leakage from minority pyroclastic units into dominant lavas, especially for Trachytic Tuff, consistent with Table 6. Baselines show characteristic weaknesses: BPNN exhibits broad low-amplitude bands under imbalance; kNN is competitive for distinctive end-members but degrades for overlapping pairs; SVM yields the weakest diagonal. These matrices corroborate that eRF improves not only aggregate metrics but also error structure.

Fig 5 (panels a–r) shows that the enhanced Random Forest (eRF) discriminates most lithologies very well: the one-vs-rest ROC curves cluster near the upper-left, with uniformly high AUCs (macro-AUC ≈ 0.999; most AUCs ≥ 0.98). At the operating threshold of 0.5 (orange markers), false-positive rates remain near zero while true-positive rates span ~0.82–0.99. Slightly lower TPRs are observed for several transitional or texturally heterogeneous classes (b, c, e, g, h, o), consistent with overlapping log responses near facies contacts.
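A sketch of how each panel’s curve and its threshold-0.5 operating point can be derived from predicted class probabilities is shown below (toy inputs; `ovr_roc` is a hypothetical helper, not the plotting code used for Fig 5):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

def ovr_roc(y_true, proba, cls):
    """One-vs-rest ROC AUC for class `cls`, plus the (FPR, TPR)
    operating point at the default probability threshold of 0.5."""
    y_bin = (y_true == cls).astype(int)
    scores = proba[:, cls]
    fpr, tpr, _ = roc_curve(y_bin, scores)
    # Operating point: hard predictions at p >= 0.5.
    pred = (scores >= 0.5).astype(int)
    tp = np.sum((pred == 1) & (y_bin == 1))
    fn = np.sum((pred == 0) & (y_bin == 1))
    fp = np.sum((pred == 1) & (y_bin == 0))
    tn = np.sum((pred == 0) & (y_bin == 0))
    return auc(fpr, tpr), (fp / (fp + tn), tp / (tp + fn))
```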

Fig 5. Class-wise ROC on the test set for the enhanced Random Forest (eRF) (18 lithologies) with eRF train–test accuracy comparison.

Panels a–r display one-vs-rest ROC curves for each lithology; the orange dot marks the operating point at threshold = 0.5, and the parenthetical label gives its (FPR, TPR). The bottom box in each panel reports the class name and AUC (three decimals, truncated). Panel (s) shows per-class train/test one-vs-rest accuracy; x-axis letters a–r correspond to the ROC panels. Letter–lithology mapping: a Basalt; b Trachyte; c Trachytic brecciated lava; d Trachytic tuff lava; e Trachytic volcanic breccia; f Trachytic tuff; g Andesite; h Andesitic brecciated lava; i Andesitic tuff lava; j Andesitic volcanic breccia; k Andesitic tuff; l Rhyolite; m Rhyolitic brecciated lava; n Rhyolitic tuff lava; o Sedimentary volcanic breccia; p Rhyolitic tuff; q Rhyolitic volcanic breccia; r Sedimentary tuff.

https://doi.org/10.1371/journal.pone.0335630.g005

Panel (s) indicates small train–test gaps across classes, with no systematic drop on the test set, suggesting limited overfitting and stable generalization of the eRF. In conjunction with the method design, these observations align with the intended mechanism: Borderline-SMOTE enriches boundary cases in rare/ambiguous facies, C4.5 gain-ratio splits exploit subtle log sensitivities, and Kendall’s W promotes stability in feature usage across trees. Together, these elements underpin the eRF’s effectiveness for multi-class volcanic-lithology identification from conventional logs and motivate the subsequent discussion on geological factors and algorithmic synergy.

Discussion

Influence of geological factors on model design and success

The Huoshiling volcanic succession comprises 18 lithologies with diverse mineralogy and textures, and many contacts are gradational rather than sharp. Intra-facies textural variability, variable welding degree, and changes in clast–matrix proportions or alteration can make gamma ray (GR), compensated neutron (CNL), bulk density (DEN), acoustic travel-time/sonic (AC), and deep array laterolog resistivity (RLA5) respond in similar ways even when the underlying lithologies are distinct. These geological realities make class boundaries intrinsically fuzzy on conventional logs and motivate an ensemble-based solution that can accommodate multi-modal distributions and correlated continuous predictors. Accordingly, we selected Random Forest (RF) as the base learner and constructed an enhanced Random Forest (eRF) that embeds algorithmic elements tailored to the data-generating processes in this setting.

The five logs were chosen because they are physically sensitive to composition, porosity, fluid content, welding, and clast–matrix architecture, which together govern the petrophysical contrasts among lavas, pyroclastic lavas, pyroclastic rocks, and sedimentary volcaniclastic rocks. Within eRF, C4.5 gain-ratio splitting is used to identify informative thresholds on these continuous predictors while mitigating the bias of criteria that favor attributes with many effective cut points. Borderline-SMOTE is introduced because the scarcity of certain facies—thin tuffs, breccias, or transitional units—is a geological fact rather than an artifact of sampling, and the most consequential ambiguities occur precisely at inter-facies boundaries where minority instances are under-represented. Kendall’s coefficient of concordance (Kendall’s W) is incorporated to promote stability by favoring features whose importance is repeatedly high across trees, thereby limiting reliance on idiosyncratic splits that may be induced by local noise in complex volcanic sequences.

Viewed through this geological lens, the confusion matrices in Fig 4 are not arbitrary. The remaining off-diagonal entries concentrate within geologically related pairs whose petrophysical responses are known to overlap near facies contacts and within texturally heterogeneous intervals. Typical examples occur between Trachytic Tuff Lava and Trachytic Tuff; between Trachytic Brecciated Lava and Trachytic Volcanic Breccia; between Andesitic Tuff Lava and Andesitic Tuff; between Andesitic Brecciated Lava and Andesitic Volcanic Breccia; and within the rhyolitic suite between brecciated end-members and volcanic-breccia end-members. Occasional confusion is also observed between Sedimentary Volcanic Breccia and volcanic-breccia classes. In each of these cases, similar combinations of GR, DEN, CNL, AC, and RLA5 arise from comparable mineral assemblages and fabric, as well as from gradational contacts that attenuate sharp log contrasts. The pattern is therefore geologically plausible and provides a diagnostic target for the algorithmic design choices embedded in eRF.

Taken together, these geologically plausible confusions also expose where a vanilla RF is prone to fail in this setting: random feature selection may not consistently prioritize the most geologically informative log parameters at node splits; pronounced class imbalance tends to bias decisions toward dominant lithologies and suppress recall for rare but diagnostically important facies; and high structural similarity among trees can limit the ensemble’s capacity to resolve multi-faceted boundaries. In the Huoshiling dataset, these effects are amplified by domain specifics—thin tuffaceous or brecciated interbeds embedded within thick lava packages, strongly correlated and multi-valued continuous predictors in conventional logs, and weak separability near gradational contacts—so boundary cases become both the most informative and the most easily misclassified. These considerations directly motivate the targeted enhancements embedded in the eRF (Borderline-SMOTE, C4.5 gain-ratio splitting, and Kendall’s W), and they are consistent with the residual off-diagonal patterns in Fig 4 and the class-wise improvements for eRF in Fig 5 and Table 6.

Effectiveness and synergy of algorithmic optimizations in eRF

The empirical improvements delivered by eRF can be traced to the coordinated roles of its three components and their alignment with the geological sources of ambiguity. Borderline-SMOTE enriches the local density of minority instances specifically at transition zones, which are the regions of greatest confusion in Fig 4. This increases the chance that decision boundaries are learned where they matter most rather than being pulled by the statistical dominance of thick lava units. C4.5 gain-ratio splitting then identifies thresholds on continuous logs that exploit subtle but physically meaningful contrasts while reducing the bias that would otherwise favor highly partitionable attributes. Kendall’s W promotes ensemble consistency by prioritizing features that are repeatedly discriminative across trees, which curbs idiosyncratic splits that could widen scattered off-diagonals.

These mechanisms are reflected both in the aggregate metrics and in the structure of errors. Relative to the standard RF, eRF produces the sharpest diagonals and the sparsest off-diagonals in Fig 4, with especially clear suppression of leakage from minority pyroclastic units into dominant trachytic or andesitic lavas. The benefit is most striking for Trachytic Tuff, where accuracy rises from 0.5455 with standard RF to 0.9787 with eRF (Table 6). The baseline methods exhibit characteristic weaknesses that align with their inductive biases under severe class imbalance and overlapping clusters: back-propagation neural networks show broader low-amplitude off-diagonals; k-nearest neighbors remains competitive for distinctive end-members but degrades for overlapping trachytic and andesitic pairs; and support vector machines with radial kernels display the weakest diagonal when skew is pronounced. The eRF-only ROC panels in Fig 5 further corroborate these observations. The one-vs-rest curves cluster near the upper-left, macro-AUC is approximately 0.999 with most class AUCs at or above 0.98, and at the default operating threshold of 0.5 false-positive rates are near zero while true-positive rates span approximately 0.82 to 0.99 across classes. Panel (s) shows small train–test gaps, which is consistent with the stability imparted by Kendall’s W and indicates that the observed gains are not the by-product of overfitting.

From a methodological perspective, the novelty here is not in any single component taken in isolation but in their integrated use within an RF framework that is explicitly tuned to the geological context of 18-class volcanic lithology identification from conventional logs. Addressing imbalance at critical boundaries, optimizing per-node splits for continuous and potentially multi-valued predictors, and enforcing stability of feature usage across trees are complementary interventions. Their synergy explains why the error structure itself changes in Fig 4 in a way that is geologically meaningful and why the ROC behavior in Fig 5 is uniformly strong across facies, including many of the previously hard-to-separate minority types.

External blind-well validation and generalization

Spatial generalization was assessed by zero-shot inference on an external blind-well set comprising 2,964 depth samples from five Huoshiling wells (C_1, C_8, C_9, C_10, C_21) that were entirely excluded from model development. The final eRF trained on the source field was applied directly—without retraining, without calibration, and without any resampling on blind data—using the same five logs as inputs. Ground-truth labels were compiled from the operator’s well documentation, and intervals with missing or ambiguous labels were removed. Preprocessing mirrored the training pipeline with parameters frozen from the training stage: rows with any missing log value were discarded; the StandardScaler and label encoder fitted on the training set were reused to transform logs and map lithology codes; and predictions were generated once at the default threshold of 0.5.
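The frozen-preprocessing discipline described above, fit all transformations on the source field once and reuse them unchanged on blind wells, can be sketched in a few lines. This is a minimal pure-Python illustration of the idea, not the authors' pipeline; `FrozenScaler` and `drop_incomplete` are hypothetical names standing in for the fitted StandardScaler and the missing-value filter.

```python
class FrozenScaler:
    """Standardize logs with means/stds fitted on training wells only,
    mimicking reuse of a fitted scaler on blind wells (no refitting)."""

    def fit(self, rows):
        cols = list(zip(*rows))  # one column per log (GR, AC, ...)
        self.means = [sum(c) / len(c) for c in cols]
        self.stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
                     for c, m in zip(cols, self.means)]
        return self

    def transform(self, rows):
        # Apply the *training* statistics to any well, source or blind.
        return [[(v - m) / s for v, m, s in zip(r, self.means, self.stds)]
                for r in rows]

def drop_incomplete(rows):
    """Discard depth samples with any missing log value (None)."""
    return [r for r in rows if all(v is not None for v in r)]
```

Because the blind wells never influence the scaling statistics, label encoding, or resampling, any performance reported on them measures genuine transfer under spatial domain shift rather than information leakage.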

Under this zero-shot setting, eRF achieved an average accuracy of 92.24%, with macro Precision, Recall, and F1 of 0.900, 0.906, and 0.903, respectively (Table 7). These values exceed those of the baselines—standard RF at 87.63%, k-nearest neighbors at 87.37%, back-propagation neural network at 81.38%, and support vector machine at 78.27%—by between 4.6 and 14.0 percentage points in accuracy. The per-class results corroborate robust transfer across lithologies: accuracies remain high for most classes, such as basalt at 1.0000, sedimentary volcanic breccia at 0.9670, and trachytic volcanic breccia at 0.9360. Performance is relatively lower for transitional or texturally heterogeneous facies, such as rhyolitic volcanic breccia at 0.7941 and trachytic tuff at 0.8429, which is consistent with gradational contacts and overlapping log responses. Taken together with the small train–test gaps observed earlier for eRF in Fig 5, panel (s), these blind-well outcomes indicate limited overfitting and stable generalization beyond the training area, suggesting that the learned decision rules transfer across fields under spatial domain shift.
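The macro Precision, Recall, and F1 figures quoted above are unweighted means of the per-class values. A minimal pure-Python sketch of that computation follows; `macro_scores` is an illustrative name, not a function from the study's code.

```python
def macro_scores(y_true, y_pred, classes):
    """Macro-averaged precision, recall, and F1 over all classes.

    Each class contributes equally, so rare lithologies weigh as much
    as abundant ones; empty denominators score 0 by convention.
    """
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

That macro Recall (0.906) slightly exceeds macro Precision (0.900) on the blind wells is consistent with the Borderline-SMOTE augmentation, which trades a small amount of false-positive leakage for better recovery of minority facies.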

Table 7. Lithology classification performance on blind test wells using different machine-learning models.

https://doi.org/10.1371/journal.pone.0335630.t007

Conclusions

This study developed and validated an enhanced Random Forest (eRF) for identifying 18 volcanic lithologies in the Wangfu Fault Depression (Huoshiling Formation, Songliao Basin) from five conventional well logs. The main conclusions are:

  (i) The Huoshiling volcanic succession exhibits high lithological diversity (18 classes) and pronounced class imbalance; gradational contacts produce overlapping log signatures, which complicate automated identification. A labeled dataset integrating core/cuttings descriptions with GR, AC, DEN, CNL, and RLA5 was established accordingly.
  (ii) The proposed eRF—synergistically integrating Borderline-SMOTE for targeted imbalance correction at decision boundaries, C4.5 decision trees (gain-ratio splitting) for optimized feature thresholds on continuous logs, and Kendall's coefficient of concordance to guide stable feature prioritization across trees—achieved a test-set accuracy of 96.34% with macro-F1 = 0.9635, indicating balanced performance across all classes.
  (iii) eRF outperformed four classical machine-learning algorithms (standard RF, BPNN, kNN, SVM) on both overall and per-class metrics. The enhancements effectively address class imbalance at critical boundaries, improve split quality with a robust criterion, and promote stability of feature usage in the ensemble, yielding excellent recognition for all lithologies, especially previously hard-to-classify minority types and facies with ambiguous boundary signatures.
  (iv) In external blind-well evaluation (zero-shot; five wells; 2,964 depth samples), eRF attained 92.24% average accuracy with macro Precision/Recall/F1 = 0.900/0.906/0.903, demonstrating robust spatial generalization beyond the training area and supporting deployment in similar multi-class volcanic settings to generate high-resolution lithological profiles in uncored intervals.

Acknowledgments

We would like to thank the working group on Volcanic Reservoirs and their Exploration at Jilin University, Changchun, China, for their help with field work. We also thank the two reviewers and Academic Editor Hu Li for the constructive reviews that significantly improved the manuscript.

References

  1. Feng ZQ, Liu JQ, Wang PJ. New oil and gas exploration field: volcanic hydrocarbon reservoirs enlightenment from the discovery of the large gas field in Songliao basin. Chin J Geophys. 2011;54:269–79.
  2. Tang H, Tian Z, Gao Y, Dai X. Review of volcanic reservoir geology in China. Earth-Sci Rev. 2022;232:104158.
  3. Xu W-L, Pei F-P, Wang F, Meng E, Ji W-Q, Yang D-B, et al. Spatial–temporal relationships of Mesozoic volcanic rocks in NE China: Constraints on tectonic overprinting and transformations between multiple tectonic regimes. J Asian Earth Sci. 2013;74:167–93.
  4. Wang P-J, Mattern F, Didenko NA, Zhu D-F, Singer B, Sun X-M. Tectonics and cycle system of the Cretaceous Songliao Basin: an inverted active continental margin basin. Earth-Sci Rev. 2016;159:82–102.
  5. Yu T, Wang P, Zhang Y, Gao Y, Chen C. Discovery of the Late Jurassic–Early Cretaceous lamprophyres in western Songliao Basin of Northeast China and their constraint on regional lithospheric evolution. Front Earth Sci. 2022;10.
  6. Yu T, Wang P, Gao Y, Zhang Y, Chen C. Discovery of the Late Jurassic peraluminous rhyolites and tonalite porphyrites in the Tuquan area along the western margin of the Songliao Basin: Geological records from closure of the Mongol-Okhotsk Ocean to continental collision between the Siberian plate and the Erguna-Songliao block. Acta Petrologica Sinica. 2024;40(1):159–77.
  7. Wang P, Chen S. Cretaceous volcanic reservoirs and their exploration in the Songliao Basin, northeast China. AAPG Bulletin. 2015;99(3):499–523.
  8. Qu XJ, Wang PJ, Yao RS. Stratigraphical sequence and regional correlation of Huoshiling Formation in southern Songliao Basin. J Cent South Univ (Sci Technol). 2014;45:2716–27.
  9. Wang YQ, Hu DQ, Cai GG. Characteristics and controlling factors of Cenozoic volcanic reservoirs in Liaohe Basin, NE China. Acta Petrol Sin. 2013;34:896–904.
  10. Zhang B, Gu GZ, Shan JF. Lithology, lithofacies characteristics and reservoir control factors of Cenozoic igneous rocks in eastern sag of Liaohe depression. J Jilin Univ (Earth Sci Ed). 2019;49:279–93.
  11. Nie ML, Zhang B, Zhao W. Geochemical characteristics and exploration significance of middle-lower Jurassic source rocks in the Beshkent Depression and adjacent areas of the Amu Darya Basin. Nat Gas Ind. 2023;43:47–54.
  12. Mukherjee B, Kar S, Sain K. Machine learning assisted state-of-the-art-of petrographic classification from geophysical logs. Pure Appl Geophys. 2024;181(9):2839–71.
  13. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13–17; San Francisco, CA. New York: ACM; 2016. pp. 785–94.
  14. Mou D, Zhang LC, Xu CL. Comparison of three classical machine learning algorithms for lithology identification of volcanic rocks using well logging data. J Jilin Univ (Earth Sci Ed). 2021;51:951–6.
  15. Prajapati R, Mukherjee B, Singh UK, Sain K. Machine learning assisted lithology prediction using geophysical logs: a case study from Cambay basin. J Earth Syst Sci. 2024;133(2).
  16. Mukherjee B, Sain K. Vertical lithological proxy using statistical and artificial intelligence approach: a case study from Krishna-Godavari Basin, offshore India. Mar Geophys Res. 2021;42(1).
  17. Banerjee A, Mukherjee B, Sain K. Machine learning assisted model based petrographic classification: a case study from Bokaro coal field. Acta Geod Geophys. 2024;59(4):463–90.
  18. Wang PJ, Chi YL, Liu WZ. Volcanic facies of the Songliao basin: classification, characteristics and reservoir significance. J Jilin Univ (Earth Sci Ed). 2016;46:1056–70.
  19. Wang PJ, Zheng CQ, Ping S. Classification of deep volcanic rocks in Songliao Basin. Pet Geol Oilfield Dev Daqing. 2007;26:17–22.
  20. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inform Theory. 1967;13(1):21–7.
  21. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
  22. Han HY, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Hoffmann F, Hutter M, Klinkenberg R, Renz J, editors. Advances in Intelligent Data Analysis VI. IDA 2005. Lecture Notes in Computer Science, vol 3646. Berlin, Heidelberg: Springer; 2005. pp. 878–87.
  23. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann; 1993.
  24. Yang ZY. Educational Statistics. Chongqing: Chongqing Branch of the Scientific and Technical Documentation Press; 1990.
  25. Zhang JP, Mani I. kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML Workshop on Learning from Imbalanced Data Sets. Washington, D.C.; 2003.
  26. He HB, Bai Y, Garcia EA, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008 Jun 1–8; Hong Kong. Piscataway, NJ: IEEE; 2008. pp. 1322–8.
  27. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J Mach Learn Technol. 2011;2(1):37–63.