Biologically-informed regional subset analysis with CatBoost for robust tissue-of-origin prediction

Sungmin Yang; Hong-Gee Kim

doi:10.1371/journal.pone.0337106

Abstract

Accurate identification of cancer tissue/cell of origin (TOO/COO) is critical for diagnosis and treatment, yet existing whole-genome approaches demand extensive computation and often struggle with sparse mutation signals. Here, we introduce an informative regional subset framework that selects a small number of biologically and statistically significant 1 Mbp genomic intervals to train a CatBoost prediction model. On a benchmark of 137 whole-genome samples across six cancer types, our method achieved a 4.0 percentage-point gain in melanoma accuracy (from 88.0% using all 2,128 regions to 92.0% with 300 regions), a 4.4 percentage-point gain in multiple myeloma (from 82.6% to 87.0% with 600 regions), and perfect (100%) accuracy in high-mutation cancers such as esophageal adenocarcinoma and glioblastoma with as few as 50 informative regions. When extended to 934 PCAWG samples spanning 14 cancer lineages, the same limited regional subsets matched or improved whole-genome performance, reaching up to 100% accuracy in gastrointestinal, skin, and brain cancers, demonstrating exceptional scalability. Our approach not only reduces computational burden and enhances interpretability but also provides a robust, generalizable tool for precision oncology and the diagnosis of cancers of unknown primary.

Citation: Yang S, Kim H-G (2025) Biologically-informed regional subset analysis with CatBoost for robust tissue-of-origin prediction. PLoS One 20(12): e0337106. https://doi.org/10.1371/journal.pone.0337106

Editor: Yogendra Kumar Prajapati, MNNIT Allahabad: Motilal Nehru National Institute of Technology, INDIA

Received: August 4, 2025; Accepted: November 4, 2025; Published: December 4, 2025

Copyright: © 2025 Yang, Kim. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Section [Method] 2.1 Somatic Mutation Data Collection of the manuscript describes the data sources and provides information about the data owners. The PCAWG data can also be easily downloaded from publicly accessible services such as https://www.cbioportal.org/.

Funding: This work was supported by the government of the Republic of Korea (MSIT) and the National Research Foundation of Korea (NRF-RS-2023-00268071 to Hong-Gee Kim).

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

The remarkable advances in DNA sequencing technologies have enabled the large-scale accumulation of genome-wide somatic mutation data across diverse cancer types, precancerous lesions, normal tissues, and stem cells [1]. This wealth of genomic data has progressively unveiled the complexity of cancer genomes, encompassing driver and passenger mutations, mutational signatures [2], clonal and subclonal evolution, and structural variations including kataegis, chromothripsis, and whole genome doubling.

Tissue-of-origin (TOO) and cell-of-origin (COO) prediction addresses the question: “From which normal cell or tissue did this cancer initially arise?” with clinical implications for metastatic cancers and cancers of unknown primary [3,4]. Cancers of unknown primary (CUP) account for approximately 3% of all cancer diagnoses, and patients suffering from this condition face significant therapeutic challenges, as primary cancer type classification is the dominant factor guiding treatment decisions. Early computational methods applied Random Forests for classification [5], later adopting XGBoost for scalable gradient boosting [6], and most recently CatBoost to leverage ordered boosting and categorical feature support [7,8]. Among these approaches, CUPLR [9] has established itself as the current state-of-the-art WGS-based classifier, achieving 90% recall across 35 cancer types by integrating complex features, including structural variant signatures, viral DNA integrations, and gene fusions. Complementary modalities have also demonstrated promise: deep neural networks trained on pan-cancer gene expression data achieve 97% accuracy with robust performance on metastatic samples [10,11], though they remain constrained by tissue availability and RNA stability limitations. DNA methylation-based machine learning models classify primary organs with 87–97% accuracy from tissue or cell-free DNA [12,13], and serum miRNA classifiers achieve up to 95% accuracy in top-3 predictions [14,15]. Histopathology-based deep learning approaches can predict tumor origin with an AUC of 0.95–0.99, occasionally outperforming experienced pathologists [16,17]. Despite these advances, these approaches face significant limitations: they often lack biological grounding, require data modalities that may not be available in all clinical contexts, and impose substantial computational burdens on clinical laboratories.

Existing whole-genome methods present fundamental limitations that constrain their clinical adoption and biological interpretability. Existing regional approaches such as COOBoostR [18] partition the genome arbitrarily without integrating biological context or employing systematic statistical selection criteria. These genome-wide strategies generate high-dimensional feature spaces that are computationally expensive and difficult to interpret biologically, creating practical barriers to clinical implementation. Furthermore, genome-wide models fail to leverage known cancer biology regarding tissue-specific mutational processes and chromatin organization [19–21], resulting in feature sets that may contain substantial noise and redundancy. This disconnect between computational methodology and underlying cancer biology limits confidence in model predictions and complicates clinical validation in molecular tumor board settings. Additionally, current whole-genome approaches emphasize high-dimensional feature selection without systematically identifying which genomic intervals are truly informative for cancer classification, and they do not provide cancer-specific biological insights that could advance understanding of tissue-specific mutation patterns.

Here, we introduce a fundamentally different approach that addresses these critical limitations. Rather than leveraging whole-genome features, we present a novel biologically-informed regional subset framework that systematically identifies small sets of informative genomic intervals (optimized at various scales: 50, 100, 200, 300, 400, 500, and 600 Mbp), guided by mutation density profiles reflecting tissue-type-specific chromatin states and chromatin accessibility features. This represents a paradigm shift from traditional whole-genome approaches: instead of using all genomic information indiscriminately, we employ rigorous biological and statistical criteria to identify the most discriminative regional subsets that capture tissue-specific mutational signatures. Using these carefully selected regional subsets, we train an optimized CatBoost model that achieves superior performance compared to existing whole-genome methods. The key innovations of this work are threefold: (1) we demonstrate that informative regional subsets (typically representing only 2–30% of the genome) can replace genome-wide features while maintaining or improving classification accuracy, fundamentally challenging the assumption that more data necessarily yields better predictions; (2) we develop a machine learning algorithm with dramatically improved efficiency and accuracy compared to existing approaches, significantly reducing computational burden while enhancing prediction speed and reliability; and (3) we identify cancer-specific biological regional subsets that provide actionable insights into the genomic basis of tissue-of-origin classification, enabling clinicians to understand which genomic features drive predictions for specific cancer types. These regional subsets represent a novel biological resource that can inform future studies of cancer genomics and tissue-specific mutational processes. By combining biological sophistication with computational efficiency and clinical practicality, our approach represents a significant advance toward integrating precision genomics into routine cancer diagnostics.

2 Materials and methods

2.1 Somatic mutation data collection and preprocessing

2.1.1 Benchmark dataset.

To directly compare the predictive performance of CatBoost with existing Random Forest and XGBoost–based methods, we assembled a benchmark cohort of 137 whole-genome sequencing (WGS) tumor samples across six cancer types: melanoma (ME, n = 25) [22], multiple myeloma (MM, n = 23) [23], liver cancer (RK, n = 64) [24], colorectal cancer (CRC, n = 9) [25], glioblastoma (GBM, n = 7) [26], and esophageal adenocarcinoma (ESO, n = 9) [27]. All variant coordinates were converted from GRCh38 to GRCh37 (hg19) using CrossMap [28]. Hypermutant samples and low-quality or duplicated variant calls were removed to ensure high-confidence somatic mutation profiles. This dataset served both to benchmark CatBoost performance and to inform our Informative Region Selection framework.

2.1.2 PCAWG dataset.

For generalizability and scalability assessment, we obtained 934 tumor samples from the ICGC and PCAWG Consortium [29]. Samples were grouped into lineage cohorts analogous to the benchmark set (e.g., SKCM/MELA for melanoma; ESAD/READ/STAD for esophageal; LIHC/LIRI/LICA for liver; COAD for colorectal; CLLE/LAML for hematologic; GBM for brain; MALY for myeloma). All samples underwent identical GRCh37 liftover and filtering of low-confidence and duplicate variants.

2.1.3 Mutation density profiling.

Autosomal chromosomes (1–22) were partitioned into consecutive 1 Mbp windows, excluding centromeric/telomeric regions, low-mappability segments (score < 0.5), and extreme GC content (< 20% or > 80%). Somatic point mutations were aggregated per window using BEDOPS [30] to generate mutation density profiles. These densities served as input features for CatBoost and as the basis for regional subset identification.
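
The windowing step can be illustrated with a minimal sketch in plain Python. This is not the BEDOPS pipeline used in the study: the function name `bin_mutations`, the `(chromosome, position)` tuple layout, and the 0-based coordinates are assumptions made for illustration, and the exclusion filters (centromeres, mappability, GC content) are omitted.

```python
from collections import Counter

WINDOW_SIZE = 1_000_000  # 1 Mbp windows, as described in the text


def bin_mutations(mutations, window_size=WINDOW_SIZE):
    """Count point mutations per (chromosome, window index).

    `mutations` is an iterable of (chromosome, position) pairs with
    0-based positions; this input layout is an illustrative assumption.
    """
    counts = Counter()
    for chrom, pos in mutations:
        counts[(chrom, pos // window_size)] += 1
    return counts


# Toy input with hypothetical coordinates: three mutations on chr1, one on chr2.
toy = [("chr1", 150_000), ("chr1", 999_999), ("chr1", 1_000_000), ("chr2", 2_500_000)]
density = bin_mutations(toy)
```

Each key of `density` identifies one window, so stacking the counts for all retained windows would give one feature vector per sample.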

2.2 Chromatin mark data collection and preprocessing

We collected 673 ChIP-seq datasets of histone modifications from ENCODE [31], the International Human Epigenome Consortium [32], and the NIH Roadmap Epigenomics Project [33]. Data were binned into the same 1 Mbp windows (hg19), normalized, and aggregated to yield chromatin state feature matrices for each sample.

2.3 CatBoost–based tissue/cell of origin prediction

2.3.1 Gradient boosting fundamentals.

CatBoost is a variant of gradient boosting that sequentially learns decision trees to correct prediction errors from previous trees. The objective of basic gradient boosting is to minimize the following loss function:

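$$L = \sum_{i=1}^{N} l\left(y_i, \hat{y}_i\right) \tag{1}$$

where $y_i$ represents the true class (tissue/cell of origin) of sample $i$, $\hat{y}_i$ is the model’s prediction, $N$ is the number of samples, and $l$ is the loss function. For our multi-class classification problem, we use multinomial cross-entropy loss:

$$l\left(y_i, \hat{y}_i\right) = -\sum_{k=1}^{K} \mathbb{1}[y_i = k] \log p_{i,k} \tag{2}$$

where $K$ is the total number of tissue/cell origin classes and $p_{i,k}$ is the predicted probability that sample $i$ belongs to class $k$.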
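
As a small numerical illustration of Eqs 1 and 2, the sketch below evaluates the multinomial cross-entropy for two samples. The probabilities and labels are hypothetical values chosen for the example, and the function names are not taken from the study.

```python
import math


def multiclass_cross_entropy(true_class, probs):
    """Eq 2 for one sample: negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])


def total_loss(true_classes, prob_rows):
    """Eq 1: sum of the per-sample losses."""
    return sum(multiclass_cross_entropy(y, p) for y, p in zip(true_classes, prob_rows))


# Hypothetical predicted probabilities for two samples over K = 3 classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = total_loss(labels, probs)
```

Only the probability of the true class enters each term, so a confident correct prediction contributes a loss close to zero.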

2.3.2 Algorithm rationale and architecture.

A key feature of CatBoost is its effective handling of categorical variables. To encode categorical variables present in genomic metadata, CatBoost employs ordered target encoding:

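$$\hat{c}_i = \frac{\sum_{j<i} \mathbb{1}[c_j = c_i]\, y_j + a\,p}{\sum_{j<i} \mathbb{1}[c_j = c_i] + a} \tag{3}$$

where $c_i$ is the categorical value of sample $i$, the sums run over the samples that precede $i$ in a random permutation, $a$ is a regularization parameter, and $p$ is the prior class probability. This approach prevents target leakage by using only data prior to each sample.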
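
A minimal sketch of ordered target encoding follows, assuming a binary target for a single class and samples that are already in permuted order. The function name and toy values are illustrative; this is not the CatBoost library implementation.

```python
def ordered_target_encoding(categories, targets, prior, a=1.0):
    """Encode each sample from the samples that precede it only.

    categories: category values in (already permuted) order
    targets:    0/1 targets for one class (illustrative simplification)
    prior:      prior class probability p
    a:          regularization weight
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # The current sample's own target is not used for its encoding.
        encoded.append((s + a * prior) / (n + a))
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


cats = ["A", "B", "A", "A"]
ys = [1, 0, 0, 1]
enc = ordered_target_encoding(cats, ys, prior=0.5, a=1.0)
```

The first occurrence of each category receives the prior, and later occurrences are pulled toward the running mean of earlier targets in the same category.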

Symmetric trees.

CatBoost uses symmetric tree structures to reduce overfitting. At each split, the optimal feature f^* and threshold t^* are selected as follows:

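$$(f^*, t^*) = \arg\min_{f,\,t} \left[ L_{\text{left}}(f, t) + L_{\text{right}}(f, t) \right] \tag{4}$$

where $L_{\text{left}}$ and $L_{\text{right}}$ are the losses of the left and right child nodes, respectively.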
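
The split criterion in Eq 4 can be sketched as an exhaustive search over features and thresholds. Squared error is used here as a stand-in node loss, which is an assumption for illustration. The sketch scores a single node, whereas a symmetric tree applies the same split to every node at a given depth.

```python
def sse(values):
    """Sum of squared errors around the mean; 0 for an empty node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature index, threshold, loss) minimizing left loss + right loss."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, targets) if row[f] <= t]
            right = [y for row, y in zip(rows, targets) if row[f] > t]
            loss = sse(left) + sse(right)
            if best is None or loss < best[2]:
                best = (f, t, loss)
    return best


# Toy data: feature 0 separates the two target values cleanly, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0.0, 0.0, 1.0, 1.0]
feature, threshold, loss = best_split(X, y)
```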

Boosting procedure.

The model is trained iteratively as follows:

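$$\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + \eta\, h_m(x_i) \tag{5}$$

where $\hat{y}_i^{(m)}$ is the prediction for sample $i$ after the m-th iteration, $x_i$ is its feature vector, $h_m$ is the m-th tree, and $\eta$ is the learning rate. Each tree $h_m$ learns to fit the negative gradient of the previous prediction:

$$h_m(x_i) \approx -\frac{\partial\, l\left(y_i, \hat{y}_i^{(m-1)}\right)}{\partial\, \hat{y}_i^{(m-1)}} \tag{6}$$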
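
The update in Eq 5 and the gradient-fitting step in Eq 6 can be sketched with a deliberately simplified setup. The sketch assumes squared-error loss, for which the negative gradient is the residual, and one-split regression stumps as weak learners; the study itself uses multinomial cross-entropy and CatBoost trees. The learning rate and round count below are illustrative.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (the weak learner h_m)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the largest value so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=30, eta=0.1):
    """Apply the additive update repeatedly, starting from a zero prediction."""
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is y - y_hat.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        preds = [p + eta * h(x) for p, x in zip(preds, xs)]
    return preds


xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
preds = boost(xs, ys)
```

In this toy case each round closes a fraction `eta` of the remaining gap, which shows why a small learning rate needs more rounds to converge.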

We selected CatBoost [34] for its ordered boosting and permutation-driven training, which mitigates target leakage and captures sequential correlations among adjacent genomic windows. Input features comprised mutation density and chromatin signals across all 2,128 windows. A backward-elimination procedure, coupled with ten-fold cross-validation, was employed over 20 iterative training rounds to identify the top 20 predictive chromatin features.

2.3.3 Hyperparameter optimization.

We defined the search space for hyperparameter tuning as follows:

Learning rate (η): 0.005 to 0.5, representing the contribution of each tree to the final prediction
Tree depth (d): 3 to 7, controlling the complexity of individual decision trees
Number of iterations (M): 10 to 100, determining the total number of boosting rounds

The learning rate controls the step size in the boosting update (Eq 5), with smaller values typically requiring more iterations to converge but potentially avoiding overfitting. Tree depth determines the interaction complexity among features, with deeper trees capturing more complex patterns but increasing overfitting risk. The number of iterations determines the total boosting rounds performed in the ensemble.

We employed an exhaustive grid search combined with early stopping to identify the optimal hyperparameter configuration. The grid search evaluated all combinations of hyperparameters within the defined search space. To prevent unnecessary computation and overfitting, we implemented early stopping with a patience of 10 rounds, which terminates training if the validation loss does not improve for 10 consecutive iterations.
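
The following sketch shows how an exhaustive grid and patience-based early stopping fit together. The text specifies only the ranges, so the individual grid points are assumptions, and the validation curve is hypothetical. Model training is left out so that the example stays self-contained.

```python
from itertools import product

# Illustrative grid points inside the stated ranges (exact values are assumptions).
learning_rates = [0.005, 0.01, 0.05, 0.1, 0.5]
depths = [3, 4, 5, 6, 7]
iterations = [10, 30, 50, 100]

# Exhaustive grid: every combination of the three hyperparameters.
grid = list(product(learning_rates, depths, iterations))


def early_stop_round(val_losses, patience=10):
    """Return how many rounds run before patience-based early stopping triggers."""
    best, best_round = float("inf"), -1
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_round = loss, i
        elif i - best_round >= patience:
            return i + 1
    return len(val_losses)


# Hypothetical validation curve: improves for 5 rounds, then plateaus.
curve = [1.0, 0.8, 0.7, 0.65, 0.64] + [0.64] * 20
stopped_at = early_stop_round(curve)
```

With this toy curve, training halts after 15 rounds: 5 improving rounds followed by 10 consecutive rounds without improvement.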

The optimization criterion was root mean square error (RMSE), defined as:

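$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{7}$$

where $y_i$ is the true class score, $\hat{y}_i$ is the predicted score, and $N$ is the number of samples. We utilized stratified k-fold cross-validation (k = 5) to ensure robust evaluation across the entire dataset and mitigate the impact of data splits.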
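
A minimal sketch of the RMSE criterion and a stratified fold assignment is given below. The class sizes follow the benchmark cohort in Section 2.1.1. The round-robin assignment without shuffling is a simplification chosen for clarity and is not necessarily the procedure used in the study.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds


# Benchmark cohort sizes from Section 2.1.1 (137 samples in total).
labels = ["ME"] * 25 + ["MM"] * 23 + ["RK"] * 64 + ["CRC"] * 9 + ["GBM"] * 7 + ["ESO"] * 9
folds = stratified_folds(labels, k=5)

# Hypothetical scores for three samples.
error = rmse([1.0, 0.0, 1.0], [0.9, 0.2, 0.7])
```

Stratification matters here because the smallest class (GBM, n = 7) would otherwise be absent from some folds.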

Grid search identified the optimal hyperparameter configuration as follows:

Learning rate: η = 0.01
Tree depth: d = 5
Number of iterations: M = 30

This configuration was selected based on multiple criteria: (1) minimal validation RMSE, (2) stable convergence behavior without oscillations, and (3) minimal overfitting as evidenced by small gaps between training and validation metrics. The moderate learning rate of 0.01 provides a good balance between convergence speed and optimization precision, avoiding both unstable updates with larger rates and the additional boosting rounds that smaller rates require. The tree depth of 5 captures sufficient feature interactions while remaining interpretable and avoiding excessive model complexity. The iteration count of 30 represents a point of diminishing returns where additional boosting rounds provided negligible improvements in performance.

2.4 Informative region selection

2.4.1 Procedure overview.

To overcome sparsity in low-mutation cancers, we devised a two-stage selection: (1) iterative sampling and CatBoost training (1,000 repeats per subset size of 25, 50, 100, 150, and 200 windows), recording binary prediction outcomes; (2) Fisher’s exact test with Benjamini–Hochberg FDR correction (q < 0.05) was used to identify regions whose repeated inclusion yielded significantly high accuracy.
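
The statistical stage can be sketched in plain Python as a one-sided Fisher's exact test followed by the Benjamini–Hochberg procedure. The 2×2 layout (window included or excluded versus prediction correct or incorrect) and all counts below are assumptions for illustration; the text does not specify the exact contingency table.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Probability of a top-left count of `a` or more, given fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans marking the hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Hypothetical counts per window:
# (included & correct, included & incorrect, excluded & correct, excluded & incorrect)
tables = {"window_A": (40, 10, 20, 30), "window_B": (25, 25, 25, 25)}
pvals = [fisher_exact_greater(*t) for t in tables.values()]
rejected = benjamini_hochberg(pvals)
```

In this toy example only `window_A`, whose inclusion coincides with more correct predictions, passes the q < 0.05 threshold.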

2.4.2 Optimal region number determination.

Using the significant windows, we retrained CatBoost models on subsets of increasing size (50, 100, 200, 300, 400, 500, 600 regions) to chart the relationship between the number of regions and TOO/COO prediction accuracy, thereby selecting the optimal subset scale.
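
As a hedged sketch of this step, the snippet below ranks windows by a score, takes the top k, and picks the smallest subset size that reaches the best observed accuracy. The selection rule and the helper names are assumptions, since the text does not state how the optimal scale is chosen. The 92.0% values for melanoma at 300 and 600 regions come from the Results, while the value for 50 regions is hypothetical.

```python
SUBSET_SIZES = [50, 100, 200, 300, 400, 500, 600]
TOTAL_WINDOWS = 2128  # whole-genome window count from the text


def top_k_windows(scores, k):
    """Return the k window ids with the highest score (ties broken by id)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def pick_subset_size(accuracy_by_size):
    """Smallest subset size that reaches the best observed accuracy (assumed rule)."""
    best = max(accuracy_by_size.values())
    return min(k for k, acc in accuracy_by_size.items() if acc == best)


# Share of the 2,128 windows covered by each subset size, in percent.
fractions = {k: round(100 * k / TOTAL_WINDOWS, 1) for k in SUBSET_SIZES}

chosen = pick_subset_size({50: 84.0, 300: 92.0, 600: 92.0})
top = top_k_windows({"w1": 3, "w2": 5, "w3": 5}, k=2)
```

The subset sizes cover roughly 2.3% to 28.2% of the 2,128 windows, in line with the 2–30% range given in the Introduction.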

3 Results

3.1 Benchmark evaluation demonstrates CatBoost superiority in multi-cancer TOO/COO classification

To demonstrate the superior performance of our CatBoost-based prediction model compared to existing methods, we evaluated tissue-of-origin (TOO) and cell-of-origin (COO) classification accuracy on a benchmark dataset of 137 samples spanning six cancer types. Fig 1 visualizes the prediction results obtained using Random Forest [5], XGBoost (COOBoostR) [18], and our CatBoost model [34]. The Random Forest study (Nature, 2015) reported variance-explained values per cancer type and deemed a prediction correct if epigenetic marker signals fell within predefined thresholds. COOBoostR instead counted the number of exact top-1 epigenetic marker matches [18]. Our evaluation applies the stricter top-1 accuracy criterion across all methods to ensure a fair comparison.

Fig 1. Comparison of top-1 prediction accuracy for tissue-of-origin classification across models.

CatBoost (our study) outperformed or matched previous methods, including Random Forest (Nature, 2015) and COOBoostR (XGBoost-based), across all cancer types, demonstrating enhanced robustness and accuracy in tissue-of-origin prediction.

https://doi.org/10.1371/journal.pone.0337106.g001

Our CatBoost approach achieved equal or superior accuracy in all six cancers. In melanoma, CatBoost correctly classified 25/25 samples versus 22/25 for Random Forest and 19/25 for COOBoostR. Liver cancer accuracy reached ∼90%, exceeding 84% for COOBoostR and outperforming Random Forest. Multiple myeloma classification was correct in 19/23 samples, matching or improving upon existing methods. For colorectal cancer, esophageal adenocarcinoma, and glioblastoma, CatBoost achieved 100% accuracy. These results establish CatBoost as a highly reliable alternative for TOO/COO prediction.

3.2 Informative regional subset analysis

We next assessed the effectiveness of Informative Region Selection by comparing CatBoost performance on regional subsets of size 50, 100, 200, 300, 400, 500, and 600 windows against whole-genome input (2,128 windows) (Table 1, Fig 2). Heatmap visualizations of selected regions per cancer are shown in Fig 3 and S1–S6 Figs.

Table 1. TOO/COO prediction accuracy by region selection strategy and subset size.

https://doi.org/10.1371/journal.pone.0337106.t001

Fig 2. Prediction accuracy by number of informative regions across six cancer types.

Most cancer types achieved over 60% accuracy with only 100 informative regions, demonstrating the efficiency of the regional subset strategy. Accuracy steadily improved with increasing region numbers. In liver cancer (RK), although initial accuracy was low, a consistent upward trend was observed, indicating effective learning with larger region sets.

https://doi.org/10.1371/journal.pone.0337106.g002

Fig 3. Selection of 600 informative regional subsets out of 2,128 whole regions for each cancer type in the benchmark.

The figure shows the most frequently occurring genomic regions by aggregating the locations of regional subsets that successfully predicted the tissue or cell of origin for each cancer type.

https://doi.org/10.1371/journal.pone.0337106.g003

High-mutation cancers (ESO, GBM) yielded 100% accuracy across all subset sizes, indicating robust signals that survive aggressive downsampling. Low-mutation cancers benefited most: melanoma accuracy rose from 88.0% (whole genome) to 92.0% (300 and 600 regions), and multiple myeloma from 82.6% to 87.0%. Colorectal cancer remained at 100% for several subset sizes but dipped to 88.9% when key regions were omitted. Liver cancer accuracy fell from 90.6% (whole genome) to 21.9% (50–100 regions) before recovering to 56.3–85.9% at larger subset sizes.
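
The reported percentages can be related to sample counts with a short arithmetic check. The cohort sizes come from Section 2.1.1, but the numbers of correctly classified samples below are inferred from the percentages and are not reported in the text.

```python
def pct(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# (inferred correct samples, cohort size); the counts are assumptions, not reported values.
implied = {
    "ME whole genome": (22, 25),
    "ME 300 regions": (23, 25),
    "MM whole genome": (19, 23),
    "MM 600 regions": (20, 23),
    "RK whole genome": (58, 64),
    "RK 50 regions": (14, 64),
    "CRC dip": (8, 9),
}
accuracies = {name: pct(c, n) for name, (c, n) in implied.items()}

# Gains expressed in percentage points.
me_gain = round(accuracies["ME 300 regions"] - accuracies["ME whole genome"], 1)
mm_gain = round(accuracies["MM 600 regions"] - accuracies["MM whole genome"], 1)
```

Under these inferred counts, the melanoma and multiple myeloma gains each correspond to one additional correctly classified sample, which is worth keeping in mind when interpreting differences in cohorts of this size.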

These findings demonstrate that Informative Region Selection significantly improves or maintains classification accuracy compared to random subsets, particularly in cancers with sparse mutation profiles, while reducing computational burden relative to whole-genome analysis.

3.3 Generalizability assessment in independent cohort (PCAWG)

To evaluate cross-cohort generalizability, we applied the selected regional subsets to 934 PCAWG samples across 14 lineages (Table 2, Fig 4). Gastrointestinal cancers (ESAD, COAD, STAD, READ) exhibited dramatic accuracy gains: starting at 28–46% with 50 regions, rising above 90% for 200+ regions, and reaching 100% for READ at 300 regions. Liver lineages (LIHC, LIRI, LICA, LINC) showed heterogeneous patterns, with cholangiocarcinoma (LICA) at 100% from 200 regions onward, whereas LIHC improved more gradually (up to 87.0% at 600 regions). Skin lineages improved from ∼40% to >90% with 600 regions, and GBM accuracy climbed from 51.2% to 95.1%. Hematologic malignancies (MALY, CLLE, LAML) displayed variable performance, reflecting the complexity of predicting closely related cell origins.

Table 2. PCAWG dataset TOO/COO prediction accuracy by informative regional subset size.

https://doi.org/10.1371/journal.pone.0337106.t002

Fig 4. PCAWG dataset accuracy donut chart.

The accuracy of tissue/cell-of-origin prediction using informative regional subsets is shown for cancer types collected from similar tissues in the PCAWG dataset. Most similar cancer types showed strong performance, but cancers such as LAML, which arise from similar tissues but have different oncogenic mechanisms, were poorly predicted from the informative regional subsets.

https://doi.org/10.1371/journal.pone.0337106.g004

Overall, Informative Region Selection consistently preserved or enhanced TOO/COO prediction accuracy across diverse independent cohorts, demonstrating its utility for efficient, interpretable cancer origin classification.

4 Discussion

In this study, we developed a regional subset–based framework to predict tissue-of-origin (TOO) and cell-of-origin (COO) across diverse cancer types, moving beyond conventional whole-genome approaches. A striking finding emerged from our analysis: two specific genomic intervals—the 921 region (chr2:155,326,172-156,326,171) and the 379 region (chr12:107,856,695-108,856,694)—were consistently and unanimously selected across all six cancer types in the benchmark cohort. This convergent selection across multiple cancer types suggests that these loci are not statistical artifacts but rather encode genuine tissue-associated biological signals. To understand why our CatBoost model converged on these regions, we conducted a detailed mechanistic analysis to validate that our predictive framework is grounded in real biological features rather than spurious correlations.

The predictive signal of the 921 region (chr2) appears to be rooted in mechanisms related to cell survival, transcriptional control, and stress response. Although a notable protein-coding gene, NR4A2 (Nuclear Receptor 4A2), is located immediately adjacent to the defined 1Mb boundary, its established function is highly pertinent to cancer biology. NR4A2 is a nuclear transcription factor that, when aberrantly regulated, is known to promote cancer progression by enhancing autophagy and inducing chemotherapy resistance. The consistent selection of this locus across multiple cancer types, therefore, suggests that intrinsic differences in survival and stress-evasion capabilities among various TOO phenotypes constitute a decisive genomic signature for classification. Furthermore, the presence of non-coding elements in these intervals indicates that predictive power may not rely exclusively on protein-coding genes; instead, lncRNAs and pseudogene/snRNA loci could act as tissue-specific transcriptional or chromatin-state markers that contribute indirect signals for origin classification. LINC01876 and nearby snRNA/pseudogene loci (e.g., RNU6-546P) are plausible markers of lineage-specific transcriptional programs; their roles are best interpreted as indicators of local transcriptional or epigenomic context that distinguish cellular origins.

In contrast, the 379 region (chr12) presents a dense cluster of protein-coding genes governing fundamental cellular processes, suggesting that the TOO/COO signature reflects inherent functional properties inherited from the cell of origin. A significant functional cluster within this region relates to cellular motility and invasion. Genes such as CORO1C and SSH1 are essential components of the actin cytoskeleton remodeling pathway. CORO1C has been implicated in cell migration and invasion in several studies; however, its role as a driver in the tumor types examined here remains unestablished and requires further validation. Similarly, SSH1—which facilitates actin dynamics by regulating the actin-depolymerizing factor cofilin—may contribute to motility-related programs but also needs confirmatory evidence. The consistent selection of these motility-regulating genes across multiple cancer types strongly suggests that a tumor’s innate tendency toward metastasis or migration is a major genomic feature distinguishing TOO and COO phenotypes. Additionally, the inclusion of ISCU (Iron-Sulfur Cluster Assembly Enzyme) may reflect the predictive signature of cancer metabolic reprogramming; ISCU-related mitochondrial dysfunction may contribute to tissue-specific metabolic phenotypes that the model captures. Finally, the roles of SART3 (RNA splicing and tumor antigen) and FICD (ER protein quality control) suggest that subtle differences in cellular housekeeping mechanisms—such as RNA processing fidelity and cellular stress management—are also predictive features captured by this region.

In summary, the robust selection of the 921 and 379 regions is strongly supported by a clear mechanistic analysis demonstrating that the genes within these loci are deeply integrated into core cancer pathways involving survival, motility, and metabolism. This finding is particularly significant because it demonstrates that our model is capturing genuine biological signals that distinguish cancer origins, rather than exploiting statistical noise or trivial patterns. Having established this biological foundation, we turn to examining how our CatBoost model leverages these mechanisms to achieve high predictive accuracy, while also identifying contexts where performance varies.

Our CatBoost model achieved high overall accuracy in predicting tissue and cell of origin across the six cancer types in the benchmark cohort. However, we observed notable variability in performance among liver cancer subtypes, providing insights into how the mechanistic features identified above translate into practical model behavior. Hepatocellular carcinoma (LIHC) and intrahepatic cholangiocarcinoma (LICA) were predicted with high accuracy, whereas the LIRI and LINC subcohorts showed diminished performance. To understand this discrepancy and connect it to the underlying biological mechanisms, we compared tumor mutational burden (TMB) distributions among these subtypes. LICA samples exhibited relatively uniform TMB (mean = 6.68; range 3.462–13.95), whereas LINC samples had lower and more homogeneous TMB (mean = 4.38; range 1.449–8.346). In contrast, LIRI displayed broad TMB variation (mean = 6.14; range 0.156–75.604) (S1 Table). These differences indicate that TMB magnitude and consistency critically impact model learning: the high heterogeneity observed in LIRI may obscure the consistent mutational patterns and metabolic signatures captured within the 921 and 379 regions, making it difficult for the model to identify reliable cell-of-origin markers. Conversely, the uniform and elevated TMB in LICA facilitates accurate classification because the biological signals encoded in these regions remain consistent and distinguishable across the tumor population. This observation underscores that the mechanistic features we identified (survival pathways, motility-related genes, and metabolic regulators) are most effective as classification signals when tumor populations maintain reasonable internal consistency.

Our region-selection approach generalized effectively to an independent PCAWG cohort, maintaining high accuracy in cancer types that were represented during model training. This cross-cohort validation strengthens our confidence that the 921 and 379 regions capture generalizable biological features rather than artifacts specific to our benchmark dataset. However, predictive performance declined for lineages not included in model training—such as acute myeloid leukemia (LAML)—highlighting an important limitation. This performance drop for untrained cancer types suggests that while the biological mechanisms we identified (cell survival, motility, and metabolism) are indeed central to cancer identity, the specific manifestations of these mechanisms may vary significantly across diverse lineages. Consequently, extending our framework to rare or highly heterogeneous malignancies will require lineage-specific or more granular subset selection strategies that capture the unique genomic signatures of each cancer type.

These results demonstrate that biologically informed regional subset selection can reduce input dimensionality, mitigate data sparsity, and capture key epigenetic–mutation associations, thereby enhancing TOO/COO prediction accuracy. This work moves beyond presenting a mere predictive model, as it successfully identifies the essential genomic regions involved in establishing cancer identity, with clear mechanistic explanations for why these regions matter. The consistently selected loci represent highly promising targets for focused functional genomic studies that can validate the roles of identified genes and pathways in determining cancer origin. Future work will focus on refining selection procedures to account for intra-tumor heterogeneity, conducting experimental validation of the biological mechanisms identified in this analysis, and developing lineage-specific frameworks for cancer types not represented in the current training set. By integrating computational predictions with functional studies and expanding our approach to encompass broader cancer lineages, we can develop more mechanistically informed and clinically actionable frameworks for cancer origin determination, ultimately supporting their diagnostic and therapeutic applications.

Supporting information

S1 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Liver cancer).

https://doi.org/10.1371/journal.pone.0337106.s001

(PNG)

S2 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Melanoma).

https://doi.org/10.1371/journal.pone.0337106.s002

(PNG)

S3 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Multiple Myeloma).

https://doi.org/10.1371/journal.pone.0337106.s003

(PNG)

S4 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Glioblastoma).

https://doi.org/10.1371/journal.pone.0337106.s004

(PNG)

S5 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Esophagus cancer).

https://doi.org/10.1371/journal.pone.0337106.s005

(PNG)

S6 Fig. Occurrence frequency of each genomic region in random regional subsets yielding correct predictions (Colorectal cancer).

https://doi.org/10.1371/journal.pone.0337106.s006

(PNG)

S1 Table. Summary of cancer types sourced from PCAWG.

https://doi.org/10.1371/journal.pone.0337106.s007

(PNG)

References

1. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31–46. pmid:19997069
2. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21. pmid:23945592
3. Varadhachary GR, Raber MN. Cancer of unknown primary site. N Engl J Med. 2014;371(8):757–65. pmid:25140961
4. Rhee JW, et al. Epigenomic analysis of cancers of unknown primary. Genome Research. 2020;30(12):1712–23.
5. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
6. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
7. Bahrami A, Rakhshaninejad M, Ghousi R, Atashi A. Enhancing machine learning performance in cardiac surgery ICU: hyperparameter optimization with metaheuristic algorithm. PLoS One. 2025;20(2):e0311250. pmid:39928609
8. Yin H, Wang K, Yang R, Tan Y, Li Q, Zhu W, et al. A machine learning model for predicting acute exacerbation of in-home chronic obstructive pulmonary disease patients. Comput Methods Programs Biomed. 2024;246:108005. pmid:38354578
9. Nguyen L, Van Hoeck A, Cuppen E. Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features. Nat Commun. 2022;13(1):4013. pmid:35817764
10. Divate M, Tyagi A, Richard DJ, Prasad PA, Gowda H, Nagaraj SH. Deep learning-based pan-cancer classification model reveals tissue-of-origin specific gene expression signatures. Cancers (Basel). 2022;14(5):1185. pmid:35267493
11. Wang X, Li J, Zhou Y, Chen Y, Liu T, Chen S. Tumor origin identification through machine learning and gene expression profiling. JCO. 2024;42(16_suppl):e13597–e13597.
12. De Velasco M, Sskai K, Mitani S, Kura Y, Minamoto S, Haeno T, et al. Abstract 4331: machine learning-based classification of tissue origin of cancer using methylation profiles. Cancer Research. 2024.
13. Farashahi S, Kia A, Kashef D, Brown E, Hantash F, Chacko KI. Abstract PR019: transfer learning for accurate tissue of origin classification from cfDNA methylation. Clinical Cancer Research. 2024;30(21_Supplement):PR019–PR019.
14. Matsuzaki J, Kato K, Oono K, Tsuchiya N, Sudo K, Shimomura A, et al. Prediction of tissue-of-origin of early stage cancers using serum miRNomes. JNCI Cancer Spectr. 2023;7(1):pkac080. pmid:36426871
15. Zhang A, Rui H, Hu H. Machine learning-based noninvasive diagnostic classifiers for the prediction of cancer tissue of origin using serum microRNAs. JCO. 2024;42(23_suppl):101–101.
16. Shaban M, Lu MY, Williamson DFK, Chen RJ, Lipkova J, Chen TY, et al. Abstract PR005: Deep learning-based multimodal integration of histology and genomics improves cancer origin prediction. Cancer Research. 2023;83(2_Supplement_2):PR005–PR005.
17. Tian F, Liu D, Wei N, Fu Q, Sun L, Liu W, et al. Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning. Nat Med. 2024;30(5):1309–19. pmid:38627559
18. Yang S, Ha K, Song W, Fujita M, Kübler K, Polak P, et al. COOBoostR: an extreme gradient boosting-based tool for robust tissue or cell-of-origin prediction of tumors. Life (Basel). 2022;13(1):71. pmid:36676020
19. Schuster-Böckler B, Lehner B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature. 2012;488(7412):504–7. pmid:22820252
20. Polak P, Karlić R, Koren A, Thurman R, Sandstrom R, Lawrence M, et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature. 2015;518(7539):360–4. pmid:25693567
21. Morganella S, Alexandrov LB, Glodzik D, Zou X, Davies H, Staaf J, et al. The topography of mutational processes in breast cancer genomes. Nat Commun. 2016;7:11383. pmid:27136393
22. Berger MF, Hodis E, Heffernan TP, Deribe YL, Lawrence MS, Protopopov A, et al. Melanoma genome sequencing reveals frequent PREX2 mutations. Nature. 2012;485(7399):502–6. pmid:22622578
23. Chapman MA, Lawrence MS, Keats JJ, Cibulskis K, Sougnez C, Schinzel AC, et al. Initial genome sequencing and analysis of multiple myeloma. Nature. 2011;471(7339):467–72. pmid:21430775
24. Totoki Y, Tatsuno K, Yamamoto S, Arai Y, Hosoda F, Ishikawa S, et al. High-resolution characterization of a hepatocellular carcinoma genome. Nat Genet. 2011;43(5):464–9. pmid:21499249
25. Bass AJ, Lawrence MS, Brace LE, Ramos AH, Drier Y, Cibulskis K, et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet. 2011;43(10):964–8. pmid:21892161
26. Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, Salama SR, et al. The somatic genomic landscape of glioblastoma. Cell. 2013;155(2):462–77. pmid:24120142
27. Dulak AM, Stojanov P, Peng S, Lawrence MS, Fox C, Stewart C, et al. Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity. Nat Genet. 2013;45(5):478–86. pmid:23525077
28. Zhao H, Sun Z, Wang J, Huang H, Kocher J-P, Wang L. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014;30(7):1006–7. pmid:24351709
29. International Cancer Genome Consortium, Pan-Cancer Analysis of Whole Genomes Consortium. Pan cancer analysis of whole genomes. Nature. 2020;578(7793):82–93.
30. Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, et al. BEDOPS: high-performance genomic feature operations. Bioinformatics. 2012;28(14):1919–20. pmid:22576172
31. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. pmid:22955616
32. Stunnenberg HG, International Human Epigenome Consortium, Hirst M. The international human epigenome consortium: a blueprint for scientific collaboration and discovery. Cell. 2016;167(7):1897. pmid:27984737
33. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–30. pmid:25693563
34. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint 2018. https://arxiv.org/abs/1810.11363
