
UNNT: A novel Utility for comparing Neural Net and Tree-based models

  • Vineeth Gutta ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    vineethg@udel.edu

    Affiliation Department of Computer & Information Sciences, University of Delaware, Newark, Delaware, United States of America

  • Satish Ranganathan Ganakammal,

    Roles Conceptualization, Data curation, Methodology, Supervision, Writing – review & editing

    Affiliation Cancer Science Data Initiatives, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America

  • Sara Jones,

    Roles Data curation, Supervision, Writing – review & editing

    Affiliation Cancer Science Data Initiatives, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America

  • Matthew Beyers,

    Roles Conceptualization, Project administration, Supervision, Writing – review & editing

    Affiliation Cancer Science Data Initiatives, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, Frederick, Maryland, United States of America

  • Sunita Chandrasekaran

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    Affiliation Department of Computer & Information Sciences, University of Delaware, Newark, Delaware, United States of America

Abstract

The use of deep learning (DL) is steadily gaining traction in scientific challenges such as cancer research. Advances in data generation, machine learning algorithms, and compute infrastructure have accelerated the use of deep learning in various domains of cancer research, such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single drug response model and demonstrate that tree-based models such as XGBoost (eXtreme Gradient Boosting) have advantages over deep learning models, such as convolutional neural networks (CNNs), for single drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models). The case studies in this manuscript focus on cancer drug response datasets; however, the application can be used on datasets from other domains, such as chemistry.

Author summary

Advances in data science, machine learning (ML), and artificial intelligence (AI) methods have enabled the extraction of meaningful information from large and complex datasets, assisting in better understanding, diagnosing, and treating cancer. Understanding of the drug response domain in cancer research has been accelerated by ML models that help predict the effectiveness of drugs based on specific genomic molecular features. In this study we developed a robust framework called UNNT (A novel Utility for comparing Neural Net and Tree-based models) that trains and compares a deep learning method (a CNN) and a tree-based method (XGBoost) on a user-supplied dataset. We applied this software to the single drug response problem in cancer to identify the best performing ML method on the National Cancer Institute 60 (NCI60) dataset. In addition, we studied the computational cost of training each of these models; our results show that neither model is clearly superior across both CPUs and GPUs during training. This suggests that when both models achieve similar error rates on a dataset, the available hardware determines the model of choice for training.

Introduction

To leverage machine learning (ML) for cancer applications, the National Cancer Institute (NCI) at the National Institutes of Health (NIH) in collaboration with the Department of Energy (DOE) established the Joint Design of Advanced Computing Solutions for Cancer (JDACS4C) program [1]. This program developed three pilot projects focused on cancer research: Pilot 1-cellular level; Pilot 2-molecular level; Pilot 3-population level [2]. Along with the pilots, NCI-DOE also developed the CANDLE (Cancer Distributed Learning Environment) [3] project for hyperparameter optimization (HPO) on the models.

Our work in this paper is related to a subset of the drug response problems addressed in Pilot 1 (cellular level). Specifically, our work builds on the existing single drug response predictor model officially known as P1B3, which uses a deep neural network to model tumor growth based on gene expression, drug concentration, and drug descriptor data. We compared the performance of the existing CNN-based P1B3 model, built within the CANDLE framework, with new tree-based methods. We show that a tree-based method, such as XGBoost, is a better model than a CNN when the training data, such as drug response data, is tabular.

Background

Unlike computer vision models, which rely on unstructured data such as images, the CANDLE drug response model and other CANDLE models rely on structured data in tabular format. The big breakthrough in deep learning came from the ability of neural networks to perform well on unstructured data such as images. Deep neural networks have been successfully adapted to domains outside of computer vision, such as natural language processing (NLP) [4], and, through the CANDLE framework, to various problems in cancer research [3]. We tested the existing CNN model architecture used in CANDLE's single drug response model on the NCI60 dataset, and its accuracy peaked at around 70%. Further improving such models would require data augmentation or a new model architecture.

Recent studies [5] have shown that tabular data may not require complex black-box models such as CNNs to perform well. Gradient boosted decision tree (GBDT) models such as XGBoost [6] can match or exceed the performance of deep learning models [5]. State-of-the-art deep learning models for tabular data perform worse than XGBoost when tested on data outside their respective original studies [5]. In addition, more recent work investigates the differences between the models to help researchers understand the inductive biases of each type of model [7]. Another recent work conducts an in-depth survey comparing machine learning methods with deep learning approaches [8]; it also concludes that GBDT ensembles tend to outperform state-of-the-art deep learning models on tabular datasets [8]. On the other hand, [9] takes a slightly different perspective from the two works cited above [8] [7]: its authors explore the properties of datasets that make them better suited to either neural networks or GBDTs, finding that GBDT models handle skewed feature distributions better than neural networks. This is yet another study performed with the explicit goal of helping practitioners choose the best model for their work. All of these studies [8] [7] [9] help researchers and practitioners understand which strategies to use for their own tabular datasets. In spite of this prior analysis, researchers and practitioners still need a way to seamlessly compare the two model architectures on their own tabular datasets, rather than relying only on analyses and comparisons over the fixed datasets used in recent studies. One of the major contributions of our work is to address that gap.

In the following two paragraphs, we summarize the major characteristics of the two types of models on which our analysis in the rest of the paper is based. Tree-based models are supervised learning methods that build decision trees from the training data. A decision tree consists of nodes that each "split" the data at a particular point in the range of a particular feature, inferring rules in the process. XGBoost uses the Exact Greedy Algorithm [6] for split finding. Tree-based models can be used for both classification and regression problems; the main difference between the two lies in the metrics used to decide splits while minimizing loss. Unlike classification trees, regression trees contain a score on each leaf, and the final score is calculated by summing over the leaves [6]. These models are formally known as Gradient Boosted Regression Trees (GBRT) and belong to a larger class of methods known as Classification and Regression Trees (CART). During training, XGBoost builds decision trees sequentially, using a technique known as boosting in which each successive tree gives more weight to examples that previous trees predicted poorly. After each iteration, XGBoost computes the gradients of the loss function based on the current predictions and then creates a new decision tree to reduce the errors made by previous trees [6].
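
To make the boosting loop concrete, the following is a minimal sketch of the idea for squared loss, written with scikit-learn decision trees rather than the XGBoost codebase; for squared loss the negative gradient is simply the residual. The data here is a toy placeholder, and XGBoost adds regularization, shrinkage schedules, and the Exact Greedy split finding on top of this basic loop.

```python
# Illustrative boosting loop: each new tree fits the residuals
# (negative gradients of squared loss) left by the ensemble so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # toy feature matrix
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)    # toy target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())                # start from the mean
trees = []
for _ in range(100):                                  # 100 boosting rounds
    residuals = y - prediction                        # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)     # shrink each tree's contribution
    trees.append(tree)
```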

The CNN model contains a multilayer perceptron (MLP) and is a feed-forward neural network. It generally consists of an input layer, hidden layers, and an output layer. The input layer receives the data, and the hidden layers learn a continuous function of the training data. The hidden layers include convolutional layers whose filters (kernels) slide over the input data and compute the dot product between their weights and the input, i.e., the sum of the products of corresponding elements of two matrices [10]. Following this operation, known as a convolution, non-linearity is introduced into the network using an activation function so that complex relationships between features can be learned. A pooling layer, also using a kernel, then slides across the data to reduce spatial dimensions and overfitting [11]. A series of convolutional and pooling layers is typically followed by at least one fully connected layer that learns the high-level features extracted by previous layers and the relationships between those features. After the forward pass through the network, a loss function compares the CNN's output to the ground truth, and the network's weights and biases are updated using backpropagation and gradient descent. Backpropagation computes the gradient of the loss with respect to each weight and bias in the network, back to the input layer, and these gradients are used to update the model parameters to minimize loss via optimization techniques such as gradient descent [10]. Finally, the output layer performs prediction, which can be a classification or a numerical output in a regression problem. For classification, a softmax layer converts the raw output of the network into class probabilities. Regularization techniques such as dropout are applied to prevent overfitting and improve model performance [11]. This network architecture achieved state-of-the-art results in domains with grid-like unstructured data such as images, but it is not ideal for structured data.
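
The following is a minimal sketch of the layer pattern described above (convolution, activation, pooling, dense, dropout) applied to a tabular row treated as a 1-D signal. It uses the modern tf.keras API for readability, not the TensorFlow 1.0 code that UNNT wraps, and the layer sizes are illustrative assumptions rather than the exact P1B3 architecture.

```python
# Illustrative 1-D CNN for a regression target on tabular features.
import tensorflow as tf

n_features = 1000  # assumed input width: one tabular row as a 1-D signal
model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, kernel_size=5, activation="tanh",
                           input_shape=(n_features, 1)),   # convolution + activation
    tf.keras.layers.MaxPooling1D(pool_size=2),             # reduce spatial dimension
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="tanh"),          # fully connected layer
    tf.keras.layers.Dropout(0.2),                          # regularization against overfitting
    tf.keras.layers.Dense(1),                              # single numeric output (regression)
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
```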

Design and implementation

Recent works such as [8], [7], and [9], which are complementary to ours, explain how the models and datasets they explore perform on tabular data. The extensive analysis in these papers [8] [7] [9] still leaves a gap: domain scientists and researchers need to validate the findings of these works with their own data. This reinforces the need for an abstraction that lets users quickly test their own data and see whether the conclusions of those studies replicate. We address the major gap that remains because a user must eventually apply the findings of these studies to their own datasets.

To address this gap and give users with structured (tabular) data the option to compare both the CNN and XGBoost models, we developed an open-source comparative library called UNNT that allows users to bring their own data to train both models. Fig 1 shows a flowchart of the library (see also S1 Fig). The XGBoost side of UNNT relies on the open-source libraries from the Distributed (Deep) Machine Learning Community (DMLC), which UNNT uses to give users the ability to train XGBoost models [6].

Fig 1. Flow chart of UNNT.

UNNT combines existing libraries and adds functionality to create multiple models from the same dataset, with the option to run on a CPU or a GPU.

https://doi.org/10.1371/journal.pcbi.1011504.g001

For data preprocessing, calculating metrics after training, and displaying results, we rely on packages such as Pandas, Scikit-Learn, NumPy, and Matplotlib. To build CNN models, we employ functionality provided by the CANDLE library in Pilot 1 Benchmark 3 for data preprocessing, model definition and instantiation, and model training; this introduces dependencies such as TensorFlow 1.0. We also use scikit-learn metrics to quantify model performance.

The preprocessing steps are important for the predictive accuracy of the CNN and XGBoost models trained on any data that users bring. Because the exact format of user-provided datasets is unknown, users are responsible for any preprocessing and data-cleaning steps necessary prior to model training. Users need to decide which features to keep in the dataset used to train and test the models; our models can accommodate data of any shape as long as it fits into device memory. If the data does not fit into the device memory of the system being used, the user will need to distribute the data and computation across more than one CPU/GPU. Libraries such as cuDF [12] can be used to distribute data across NVIDIA GPUs, and Dask [13] can be used to distribute computation across CPUs/GPUs.
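
As one possible approach, the following is a hypothetical sketch of out-of-core preprocessing with Dask for data that will not fit in memory once merged. The file names and join columns ("CELLNAME", "DRUG_ID") are placeholders, not the actual UNNT or NCI60 schema.

```python
# Hypothetical out-of-core merge with Dask; names below are placeholders.
import dask.dataframe as dd

expr = dd.read_csv("rnaseq_expression.csv")   # gene expression features
resp = dd.read_csv("drug_response.csv")       # AUC response values
desc = dd.read_csv("drug_descriptors.csv")    # molecular drug descriptors

# Dask plans the merge lazily over partitions instead of materializing
# the full table at once; computation runs only when results are requested.
merged = resp.merge(expr, on="CELLNAME").merge(desc, on="DRUG_ID")
print(merged.head(5))                         # pulls only a small piece into memory
```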

Once users provide their data, UNNT splits it into training, validation, and testing sets; the user can specify the percentage of data used for testing and validation. The defaults are 70% training and 30% test data for XGBoost, while half of the test data is used for validation of the CNN model. We use scikit-learn's train_test_split() method to sample randomly, since the CNN model we compare against was also trained on randomly sampled data. We convert each data split into NumPy arrays for training both the CNN and XGBoost. Several hyperparameters can be set for both models; we provide recommended defaults that users can customize. It is important to note that CNN models are more complex than XGBoost and have more hyperparameters, and in many cases there is little overlap between the two sets. Users have full control over the parameters tested to find the best combination of hyperparameters for their dataset for each model.
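
The following is a minimal sketch of this splitting step with the defaults described above: a 70/30 train/test split, with half of the test portion reused as the CNN's validation set. The data here is a random placeholder.

```python
# Train/validation/test splitting as described in the text.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # placeholder feature matrix
y = rng.normal(size=1000)         # placeholder target values

# 70% training, 30% test (default for XGBoost)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# half of the test split becomes the CNN's validation set
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.50, random_state=42)
```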

Finally, UNNT provides users with error metrics to evaluate both models trained on their data, including R2 and root mean square error (RMSE).
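
The following is a minimal sketch of computing these two metrics with scikit-learn; the arrays are placeholder values standing in for a model's test-set output.

```python
# R2 and RMSE, the metrics UNNT reports for both models.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.2, 0.5, 0.8, 0.4])     # placeholder ground truth
y_pred = np.array([0.25, 0.45, 0.75, 0.5])  # placeholder predictions

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```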

Data

The data for the study was obtained from the Predictive Oncology Model and Data Clearinghouse (MoDaC) [14], a data warehouse released as part of JDACS4C [1]. The dataset includes RNA-Seq expression profiles, drug response data, and molecular drug descriptors for the National Cancer Institute 60 (NCI60) [15] cell lines.

The RNA-Seq expression data from these various cell lines are normalized using ComBat-seq [16]. Drug dose response data obtained from these cell lines are normalized using a Hill slope model with three bounded parameters [17] and drug molecular descriptors were generated using Dragon software package (version 7.0) [18].

NCI60 data includes a combination of gene expression values [19], drug response values [20], and drug descriptors [21] found in MoDaC [14].

Our NCI60 data is similar to the NCI60 data used to train the original single drug response model in the CANDLE Benchmarks, with two main differences. First, our dataset uses only the lincs1000 genes [22] in the RNA-Seq gene expression data, instead of all protein-coding genes, owing to the importance of those genes in the dataset. Second, the original NCI60 drug response data uses 'Growth' and 'Concentration' values, where 'Growth' is the target variable and 'Concentration' is a feature; we no longer use 'Concentration', and 'Growth' is replaced with 'AUC' in the drug response dataset because area under the curve (AUC) [23] is a more definitive parameter that combines both the potency (concentration) and efficacy of a drug and is more robust when comparing a single drug across multiple cell lines at similar dose levels.

Results

Experimental setup

The compute resources for this work were the NSF-sponsored cluster DARWIN [24] at the University of Delaware (UDEL) and Perlmutter [25] at Lawrence Berkeley National Laboratory (LBNL). Both DARWIN and Perlmutter have GPU and CPU nodes. DARWIN uses NVIDIA V100 GPUs, while Perlmutter has newer A100 GPUs. DARWIN has 32-core AMD EPYC processors, comparable to Perlmutter's 64-core AMD EPYC 7713 CPUs.

XGBoost

We trained an XGBoost model using NCI60 expression, dose response, and drug descriptor data with AUC as the target variable. The observed test performance was an R2 score of 0.83 and an RMSE of 0.05. In addition, we created a separate hold-out dataset, set aside before training, containing 10% of the cell lines. Table 1 shows the difference in test errors between the two.

To find the best set of hyperparameters, we performed hyperparameter optimization using a grid search with cross validation. Grid search trains a new model for every combination of hyperparameters, while cross validation averages performance across five data subsets, each used in turn as test data. The best set of hyperparameters found was eta: 0.1, max depth: 10, subsample: 0.5, and n_estimators: 500. We used these hyperparameters to train a new model, and the results are shown in Table 1. Hyperparameter optimization led to a smaller improvement than we initially expected; because the ranges of the various parameters are well documented, our initial values were already close to optimal.
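
The following is a hedged sketch of this search using scikit-learn's GridSearchCV with 5-fold cross validation over the same four XGBoost parameters. The candidate values shown are illustrative, not the exact grid used in this study, and the data is a random placeholder.

```python
# Grid search with 5-fold cross validation over XGBoost hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # placeholder features
y = rng.normal(size=200)         # placeholder target

param_grid = {
    "learning_rate": [0.05, 0.1, 0.3],   # "eta" in XGBoost terminology
    "max_depth": [6, 10],
    "subsample": [0.5, 1.0],
    "n_estimators": [100, 500],
}
search = GridSearchCV(XGBRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # this study found eta=0.1, max_depth=10,
                            # subsample=0.5, n_estimators=500
```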

XGBoost model training requires the training data to be fully merged before training commences, and this resulted in memory issues. To solve them, we experimented with smaller datasets and reduced the number of drugs from 30,000 to 159, based on an FDA-approved drugs list [20]. The results are shown in Table 2. The list can be found in the file S1 FDA.

Table 2. XGBoost errors for the model trained on NCI60 data with only the FDA-approved drugs list.

https://doi.org/10.1371/journal.pcbi.1011504.t002

CNN

We trained the original CNN model using the new NCI60 data described in the data section above. The results shown in Table 3 were obtained after hyperparameter optimization (HPO); the best parameters were a learning rate of 0.01 and tanh as the activation function. The model failed to converge with other activation functions, such as ReLU, on the NCI60 data.

Table 3. CNN errors for NCI60 after training with HPO-selected parameters.

https://doi.org/10.1371/journal.pcbi.1011504.t003

As shown in Table 3, the CNN model performed much worse than XGBoost when trained on NCI60 data. The RMSE metric's best possible value is 0 and it is unbounded above, but a value such as 0.81 by itself does not give good insight into the quality of a regression model. The negative R2 value for the test score indicates poor model performance, although its magnitude does not indicate how poor. The best value for R2 is 1, meaning the model completely explains the variability of the predicted data [26]. These results indicate that both the R2 and RMSE scores show the XGBoost model outperforming the CNN model trained on this NCI60 dataset. It is important to emphasize that the choice of R2 was highly dependent on the nature of the dataset used in this study and may not be widely applicable. For further analysis of the advantages of the R2 score for datasets similar to ours, see [26].

Training times of CNN and XGBoost

The original CNN model is capable of running on both GPUs and CPUs by design, since it is built with the TensorFlow framework. The XGBoost model runs on the CPU by default but can also be trained on a GPU by passing the parameter tree_method="gpu_hist" to the XGBRegressor constructor. This means both models in UNNT can be accelerated using GPUs. In the following section we compare model training times on CPUs and GPUs. A comparison of CPU threads versus a single GPU is shown in Table 4.
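
The following is a minimal sketch of switching XGBoost between CPU threads and GPU training with the parameter named above; the thread count is an illustrative choice, and the parameter names follow the XGBoost 1.x API used in this work.

```python
# Selecting CPU threads vs. GPU training for XGBoost (XGBoost 1.x API).
from xgboost import XGBRegressor

cpu_model = XGBRegressor(n_jobs=8)                # CPU training with 8 threads
gpu_model = XGBRegressor(tree_method="gpu_hist")  # training on an NVIDIA GPU
# (in XGBoost >= 2.0 the equivalent is tree_method="hist" with device="cuda")
```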

Table 4. Results using XGBoost.

Times for threads represent model runs on a CPU with the corresponding number of threads used for speedup. The last row corresponds to running the same model on a single NVIDIA V100 GPU.

https://doi.org/10.1371/journal.pcbi.1011504.t004

Training of full NCI60 drugs.

In addition to the model built using only the FDA-approved drugs, we also built an XGBoost model using all the available drug data we had access to; however, using 30,000 drugs presented many challenges due to the data volume. The main challenge with the entire drug list was finding a system with at least 500 GB of memory to hold the merged data before training the XGBoost model.

CNN models can have varying training times depending on the training parameters, such as subsampling with fewer features, specifying fewer training, validation, and test steps, and reducing the number of training epochs. We discuss some of those results here. Table 5 shows that the CNN model does not improve as the number of epochs increases, eliminating any benefit of training longer, and the difference between the training times of the CNN and XGBoost is large. Table 6 shows that the CNN model converges to its optimal learning capacity in one epoch; even in this best case of one epoch, it would still take three times longer to train than an XGBoost model trained on a V100 GPU.

Table 5. CNN model trained with all features on an NVIDIA V100 GPU.

https://doi.org/10.1371/journal.pcbi.1011504.t005

Comparing the training times in Tables 5 and 6, we see that the CNN model takes half as long to train on a CPU as on a GPU. This is most likely a result of the size of the dataset, where data transfer from the CPU to the GPU becomes a bottleneck and increases training time.

Training on FDA approved drugs.

Table 7 shows that an XGBoost model trains much faster on a GPU when the dataset contains only the FDA-approved drug subset. We observed that training times on the CPU increased as more threads were added; this occurs when the communication overhead outweighs the computational benefit of distributing a model across cores.

Table 7. Fastest training times for CNN and XGBoost on CPU and GPU (all features).

https://doi.org/10.1371/journal.pcbi.1011504.t007

Table 8 shows the one instance where the CNN model trains faster than an XGBoost model on a similar dataset. Comparing Tables 8 and 9, we see that the CNN model trains faster on a CPU even with less training data. The CPU results for the CNN are not broken down by the number of threads because TensorFlow 1.0, the framework used to build the CNN model, does not support threading on CPUs.

Table 8. Results using XGBoost.

Times for threads represent model runs on a CPU with the corresponding number of threads used for speedup. The last row corresponds to running the same model on a single NVIDIA V100 GPU.

https://doi.org/10.1371/journal.pcbi.1011504.t008

Tables 7 and 10 show the advantage of training XGBoost on a V100 GPU, which is consistently the fastest option for the same data. Finally, Table 11 confirms that using a GPU for XGBoost training is optimal even with a smaller dataset.

Table 11. Fastest training times for CNN and XGBoost on CPU and GPU (FDA model).

https://doi.org/10.1371/journal.pcbi.1011504.t011

Conclusions

Exploring a niche domain such as drug response modeling for cancer cell lines, we show that a neural network (CNN) does not yield the best results. Instead, we show the advantage of a tree-based XGBoost model over a CNN model, especially when the training datasets are tabular and the model runs on a GPU. Our results demonstrate that, on the same dataset, an XGBoost model trains faster than a CNN model on an NVIDIA GPU. An observable downside to using XGBoost is the larger memory requirement for training, as documented in this work, which varies with the size of the dataset. As part of this work, we have developed software, UNNT, that allows users to bring their own data, build both CNN and XGBoost models, and compare how the models perform on their dataset. UNNT is thus a useful tool for domain scientists to experiment with two distinct model architectures for tabular data.

Supporting information

S1 Text. UNNT installation + running instructions.

https://doi.org/10.1371/journal.pcbi.1011504.s001

(PDF)

S1 Data. unnt.zip file with the software and data.

https://doi.org/10.1371/journal.pcbi.1011504.s003

(ZIP)

References

  1. Bhattacharya T, Brettin T, Doroshow JH, Evrard YA, Greenspan EJ, Gryshuk AL, et al. AI Meets Exascale Computing: Advancing Cancer Research With Large-Scale High Performance Computing. Frontiers in Oncology. 2019;9. pmid:31632915
  2. ECP-CANDLE Benchmarks. Available from: https://github.com/ECP-CANDLE/Benchmarks.
  3. Wozniak JM, Jain R, Balaprakash P, Ozik J, Collier NT, Bauer J, et al. CANDLE/Supervisor: a workflow framework for machine learning applied to cancer research. BMC Bioinformatics. 2018;19(18):491. pmid:30577736
  4. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. Journal of Big Data. 2021;8(1). pmid:33816053
  5. Shwartz-Ziv R, Armon A. Tabular Data: Deep Learning is Not All You Need. CoRR. 2021;abs/2106.03253.
  6. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD'16. New York, NY, USA: ACM; 2016. p. 785–794. Available from: http://doi.acm.org/10.1145/2939672.2939785.
  7. Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in Neural Information Processing Systems. vol. 35. Curran Associates, Inc.; 2022. p. 507–520. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/0378c7692da36807bdec87ab043cdadc-Paper-Datasets_and_Benchmarks.pdf.
  8. Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems. 2022; p. 1–21. pmid:37015381
  9. McElfresh DC, Khandagale S, Valverde J, VishakPrasad C, Feuer B, Hegde C, et al. When Do Neural Nets Outperform Boosted Trees on Tabular Data? ArXiv. 2023;abs/2305.02997.
  10. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
  11. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira F, Burges CJ, Bottou L, Weinberger KQ, editors. Advances in Neural Information Processing Systems. vol. 25. Curran Associates, Inc.; 2012. Available from: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
  12. NVIDIA. cuDF. Available from: https://github.com/rapidsai/cudf.
  13. Dask. Available from: https://github.com/dask/dask.
  14. National Cancer Institute. Predictive Oncology Model and Data Clearinghouse (MoDaC); 2023. Available from: https://modac.cancer.gov.
  15. Shoemaker RH. The NCI60 human tumour cell line anticancer drug screen. Nature Reviews Cancer. 2006;6(10):813–823. pmid:16990858
  16. Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genomics and Bioinformatics. 2020;2(3):lqaa078. pmid:33015620
  17. Smirnov P, Kofia V, Maru A, Freeman M, Ho C, El-Hachem N, et al. PharmacoDB: an integrative database for mining in vitro anticancer drug screening studies. Nucleic Acids Research. 2017;46(D1).
  18. Kode Chemoinformatics. Dragon (software for molecular descriptor calculation); 2017. Available from: https://chm.kode-solutions.net/pf/dragon-7-0/.
  19. Reinhold WC. NCI60 RNA-sequence gene expression value dataset. Available from: https://discover.nci.nih.gov/cellminer/download/processeddataset/nci60_RNA__RNA_seq_composite_expression.zip.
  20. NCI60 drug response dataset. Available from: https://modac.cancer.gov/api/v2/dataObject/NCI_DOE_Archive/JDACS4C/JDACS4C_Pilot_1/cancer_drug_response_prediction_dataset/combined_single_response_agg.
  21. NCI60 molecular drug descriptors dataset. Available from: https://modac.cancer.gov/api/v2/dataObject/NCI_DOE_Archive/JDACS4C/JDACS4C_Pilot_1/cancer_drug_response_prediction_dataset/descriptors.2D-NSC.5dose.filtered.txt.
  22. Subramanian A. Broad Institute Human L1000 epsilon. Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL20573.
  23. Fallahi-Sichani M, Honarnejad S, Heiser LM, Gray JW, Sorger PK. Metrics other than potency reveal systematic variation in responses to cancer drugs. Nature Chemical Biology. 2013;9(11):708–714. pmid:24013279
  24. DARWIN. Delaware Advanced Research Workforce and Innovation Network (DARWIN); 2021. Available from: https://dsi.udel.edu/core/computational-resources/darwin/.
  25. NERSC. Perlmutter (National Energy Research Scientific Computing Center, NERSC); 2022. Available from: https://docs.nersc.gov/systems/perlmutter/architecture/.
  26. Chicco D, Warrens MJ, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021;7:e623. pmid:34307865