Statistical approaches to identifying significant differences in predictive performance between machine learning and classical statistical models for survival data

Research that seeks to compare two predictive models requires a thorough statistical approach to draw valid inferences about comparisons between the performance of the two models. Researchers present estimates of model performance with little evidence on whether they reflect true differences in model performance. In this study, we apply two statistical tests, that is, the 5 × 2-fold cv paired t-test, and the combined 5 × 2-fold cv F-test to provide statistical evidence on differences in predictive performance between the Fine-Gray (FG) and random survival forest (RSF) models for competing risks. These models are trained on different scenarios of low-dimensional simulated survival data to determine whether the differences in their predictive performance that exist are indeed significant. Each simulation was repeated one hundred times on ten different seeds. The results indicate that the RSF model is superior in predictive performance in the presence of complex relationships (quadratic and interactions) between the outcome and its predictors. The two statistical tests show that the differences in performance are significant in quadratic simulation but not significant in interaction simulations. The study has also revealed that the FG model is superior in predictive performance in linear simulations and its differences in predictive performance compared to the RSF model are significant. The combined 5 × 2-fold cv F-test has lower type I error rates compared to the 5 × 2-fold cv paired t-test.


Introduction
The advent of machine learning has provided challenges especially to the statistical community [1]. Unlike the classical statistical models that have decision theory embedded in them, the machine learning models are yet to have this theory embedded within them. The fears of over fitting, type I and II errors have therefore led to criticism of machine learning models since they were first conceived [2]. The statistical community has for sometime ignored these models until it became categorically clear that they can not be ignored especially in this era of big data. There has therefore been a shift in research on making sure that robust tests are developed to make sure that the machine learning and statistical models agree at least on the basics or the building blocks of statistical theory [3].
In this study, we focus on statistical tests to evaluate whether the difference in the predictive performance of the Fine-gray (FG) [4] and the random survival forests (RSF) [5] models for competing risks data are significant under three low-dimensional data simulation scenarios. Both the FG and the RSF models for competing risk outcomes, model time-to-event distributions for mutually exclusive event.
In the analysis of time-to-event outcomes, a competing risk is an event whose occurrence precludes the occurrence of the event of primary interest [6]. This complicates the analysis of such a dataset [6][7][8].
When outcomes are time-to-event in nature, the objective of prognostic models is frequently focused on estimating the cumulative incidence function (CIF) [7].
The cause-specific hazard approach is the most commonly used classical statistical approach in analysing competing risk data [4]. However, it treats events other than the event of interest as censored. This leads to inflated survival probabilities and therefore does not result into meaningful conclusions [9,10]. The Fine-Gray model or the proportional hazards model for the sub-distribution approach is known to handle competing risks well by allowing the events that are competing with the event of interest to continue being in the risk set [4]. The Fine-gray model also has an advantage of directly modeling the effects of the covariates on the cumulative incidence function [10]. An alternative state of the art model in modelling competing risk events is the random survival forest for competing risks [5]. It is a machine learning model whose goal is to also estimate the CIF. Assessing the accuracy of predictions from the above mentioned models is an important part in their development. This is because they are commonly used in predicting important and very sensitive biological phenomena of occurrence of binary outcomes like presence of disease, death within a given duration of time, or hospital readmission within a given duration of time [11]. A study by [12] noted that methods for assessing the calibration of prognostic models for use with competing risk data have received little attention.
A recent study by [7] provides strong evidence that random survival forests models predict default and prepayment risk more accurately than statistical benchmarks in the form of the Cox proportional hazard model and the Fine and Gray model. However, no statistical tests were used to show evidence for the significance difference in their predictive performance.
Properties of the random survival model in modeling competing risks in low and high dimension data were studied by [5]. The authors' results show that the Fine-gray model was better than the random survival forest model in predictive performance in linear low-dimension settings. The results further show that that the random survival forest is better in non-linear low-dimension settings. To obtain these results, they compared the predictive performance values of the models with all the covariates to a benchmark or threshold value. The threshold model's predictive performance value was obtained from the null model that ignored all the covariates. In this study, we use two statistical tests to evaluate whether the difference in the predictive performance of the Fine-gray and the random survival forests models for competing risks data is significant under three low-dimensional data simulation scenarios.
We employed two statistical tests, namely; the 5 × 2-fold cv paired t-test, and the combined 5 × 2-fold cv F-test [13,14] via a simulation study to examine whether the differences in the predictive performance of the two models are significant in each of the three scenarios considered.
The rest of the article is structured as follows: Section 2, describes the nature of competing risks data and the methods used in this study; Section 3, describes the two statistical tests; Section 4, describes the simulation study; Section 5, presents the simulation results, and in Section 6, we discuss and present conclusions of the study.

Competing risk models in survival analysis
A competing risk is an event that, if it occurs, prevents the primary event of interest from occurring. For competing risks, we are interested in the time T j between the time origin and the occurrence of an event of interest. Individuals who are subjected to competing risks are observed from the time they enter the study to the time the competing event or the event of interest occurs. Often, individuals are observed before the occurrence of one of the events. To describe the nature of competing risk data let T 0 j denote event time for the j th individual, and let d 0 j be his or her event type, such that d 0 j 2 f1 . . . Kg, where K � 1. Furthermore, we let C 0 j denote the individual's censoring time such that the actual time of event T 0 j is unobserved and one only observes When δ j = 0, the individual is said to be censored at T j , otherwise if δ j = k > 0, the individual is said to have an event of type k at time T j . Thus, the observed competing risk data is such that (T j , δ j , X j ) 1�j�n where X j is a p-dimensional vector of covariates. In addition, we let t 1 < t 2 < . . . < t m , m � n, be distinct event times.
Thus, the main goal of survival analysis is to estimate the survival probability of the event T j for a new instance using the feature predictors denoted by X j . It should be noted that in survival analysis problems, T j will be both continuous and non-negative.

The survival and hazard functions
The survival function S(t) is represented by: Eq 1 estimates the probability that the survival probability of an event of interest does not occur before time t [15,16]. S(t) is non-negative and has an initial condition, (S(0) = 1), indicating that 100% of the observed individuals survive when none of the events of interest has occurred. The survival function has two important properties: S(0) = 1 (i.e., the event has not yet occurred for any subjects at the start of the study) and lim t!1 S(t) = 0 (i.e., the event of interest eventually occurs for all subjects).
The hazard function (λ(t)), is another commonly used function that is referred to as the instantaneous death rate [17].
The hazard function is mathematically defined by [16]: Fðt þ DtÞ À FðtÞ Dt:SðtÞ ; , is a non-negative function. According to [18], the survival function S(t) can also be expressed as: where LðtÞ ¼

Fine Gray model
The survival function S(t) can be estimated using traditional statistical methods and machine learning methods. This study focuses on a semi-parametric method, the Fine-Gray (FG) [4] model to evaluate the cumulative incidence function (CIF). Fine and Gray [4] developed the sub-distribution hazard function defined by; where l � k ðtÞ is known as the sub-distribution hazard and it measures the instantaneous rate of occurrence of the event of interest among subjects that have not yet experienced it. In this study, and our interest is modeling the cumulative incidence function for failure from cause 1 conditional on the covariates.
As reported by [19], classical survival methods are not appropriate to analyse time-to-event data in complex situations such as in a competing risk setup, in which an individual in the risk set is exposed to multiple causes of failure. The proportional hazard (PH) model [20] is one of the classical methods for analysing competing risk data to examine the effect of covariates on the cause specific hazard function. The main drawback of using the PH model in a competing risk setup is that when estimating regression parameters for a specific cause, it considers individuals failing for reasons other than the cause of interest as censored observations [19,21]. To address the limitation of the PH model, Fine and Gray [4] developed a survival regression based model that uses the cumulative incidence function (CIF) and sub-distribution hazard functions to describe the likelihood of an event occurring prior to a specific time. Unlike the PH model, the CIF does not exclude other competing risks when a specific cause is of interest [22].
The cumulative incidence function (CIF) is defined by CIF(k) = P(T � t, δ j = k). Furthermore, CIF(k) represents the probability of the k th event occurring before time t and before the occurrence of another type of event [21]. This means that CIF allows for the estimation of the occurrence of an event while accounting for competing risk. A key point is that, in the competing risks setting, only one event type can occur, such that the occurrence of one event precludes the subsequent occurrence of other event types.
Although the FG model was developed to address the limitations of Cox-based models, there is still considerable confusion regarding how the estimates from FG models are interpreted [23]. The confusion arises because the regression coefficients associated with this model are unclear or incorrectly interpreted. Also, when comparing results from different studies, an incorrect and inconsistent interpretation of the regression coefficients can cause confusion. Furthermore, an incorrect interpretation of the estimated regression coefficients can lead to an incorrect understanding of the magnitude of the relationship between exposure and incidence of the outcome.
The predictive performance of FG, a classical statistical model, is compared to that of a machine learning model, the random survival model (RSF). When the PH assumption is violated, survival trees and random survival forests (RSF) approaches offer an appealing alternative to Cox proportional hazards models [24]. Survival trees and RSF extend the classification and regression trees [24]. In addition, survival tree methods are non-parametric, flexible, and capable of dealing with high-dimensional covariate data.

Random survival forests for competing risks
In recent years, random forests [25] have been extended to regression problems and survival outcomes. The random survival forest (RSF) algorithm [26] is a collection of survival trees that extends the random forest to evaluate survival analysis with censored data. The RSF's algorithm implementation [26] is illustrated in a flowchart in Fig 1 below.
RSFs have also been extended to competing risks. Random survival forests for competing risks are grown in a manner similar to the general algorithm (Algorithm 1) in Fig 1 of random survival forests, with the main difference being the splitting rule used [27,28]. Furthermore, the RSF differs from the random forest method in that the RSF's tree-growing splitting rule takes into account both the survival time and the censoring indicator. In this study, we will implement RSFs for competing risks Algorithm 2 outlined in the flowchart in Fig 2 that uses the log-rank splitting rule described in detail in [29] to split nodes by maximizing the log-rank test statistic. Before we outline the random survival forest algorithm for competing risks, we describe the split criteria used.
The generalised log-rank split-rule. Let the number of individuals at risk in the two daughter nodes be R α (t j ) and R γ (t j ), respectively. Then R a ðt j Þ ¼ P n j¼1 IðT j � t; x j � sÞ, R g ðt j Þ ¼ P n j¼1 IðT j � t; x j > sÞ , and x j is the x-predictor for individual j = 1, 2, . . ., n. The total number of individuals at risk at time t is R(t) = R α (t) + R γ (t). The number of type K events for the left and right daughter nodes is, respectively.
, is the number of type k, events at time t. Suppose that t m ; t m a ; and t m g are the largest times of study in the root node and the two daughters, respectively. The generalised log-rank split-rule in the competing risk setting is based on a null hypothesis that the where τ, is a fixed time point set by the user in accordance with the observed follow-up period for the given dataset [5]. The split-rule at a point s on covariate x is given as: whereŝ k ;a ðx; sÞ is the variance estimate given by: Time-dependent weights, W k (t) > 0, are used to make the test more sensitive to early or late differences between the cause-specific hazards. The best split is found by maximizing, | The flow chart illustrates the details of Algorithm 1, that is to say, the random survival forest algorithm. This a general algorithm for building a random survival forest. A survival tree is grown for each bootstrap sample by splitting the node after selecting a variable that maximizes the difference between daughter nodes using a predetermined split rule.

PLOS ONE
i k (x, s)|, over all covariates and the split-points. Often the log-rank splitting rule is used to build trees for competing risks. As earlier stated, it tests the null hypothesis H 0 : λ k,α (t j ) = λ k, γ (t j ), 8t j � t, which makes it inefficient in accounting for competing risks. It is therefore recommended that one uses the Gray's test. An approximation to the Gray's test which is performed by modifying the risk set of the log-rank test is available and implemented in R. It is a weighted log-rank test for testing the equivalence of the subdistribution hazard functions between two groups. It tests the null hypothesis H 0 :

Simulations
Data simulations. We used the Cox-exponential cause-specific hazard approach [5,30] to simulate competing risk data. This is the standard approach that is achieved by formulating competing risk data using the hazard for each cause: where λ k (tjX) is the cause specific hazard for event k at time t for an individual with covariates X, λ 0k is a baseline hazard function that describes the risk for individuals with no covariate information, and expðb T k XÞ is the relative risk for two competing events k = 1, 2, given a vector of covariates X = (x 1 , x 2 , . . ., x p ). With two competing risk events, the cause specific hazards of event one and two given the covariates are defined using: where λ 1 (t|X) and λ 2 (t|X) are the cause specific hazards for event 1 and 2 at time t, respectively. The overall hazard is defined as: In all simulations, we set λ 0k = 0.01. Six continuous covariates (x 1 , x 2 , . . ., x 6 ), were drawn independently from a standard normal distribution and six binary predictors (x 7 , x 8 , . . ., x 12 ), from a binomial distribution with success probability of 50%. We considered the following three simulation scenarios for low-dimensional data (p < n): i) Linear simulations; ii) Quadratic simulations; and iii) Interaction simulations.
Linear simulations. The linear simulation scenario has an additive structure, and we set the effect size of the covariates at: The continuous effect size was set at a 1 = log (2), and the discrete effect size was set at a 2 = 1.5. Covariates x 1 , x 2 , x 7 , x 8 have an effect on the hazard of event one only, whereas, covariates x 5 , x 6 , x 11 , x 12 have an effect on both hazards. The covariates x 3 , x 4 , x 9 , x 10 have an effect on the hazard of event two only.
Quadratic simulations. The linear additive structure was broken by introducing squared covariates x 2 1 ; x 2 2 ; . . . ; x 2 6 with their effect sizes set at: b Quad Interaction simulations. The interaction terms are constructed as: b Int k Ifx l > 0gx i for l ¼ f1; 2; . . . ; 6g and i ¼ f7; 8; . . . ; 12g : The interaction effect sizes for the interaction terms are set at:

Model training for each simulation experiment
In this simulation experiment, we consider two models, the RSF and the FG models. The R packages randomForestSRC [31] and cmprsk [32] were used to implement the random survival forest for competing risks and the Fine-Gray model, respectively. Six datasets with sample sizes; 200, 300, 400, 500, 2000 and 3000 are used. For each simulation experiment the dataset is divided into two equal-sized sets, and the models are trained on one set and tested on the other. The difference between the error rates (integrated Brier scores) of the models are computed. The t-statistics, the F-statistics and the p-values associated with the tests are evaluated in each experiment. For the random survival forest, 500 trees are trained using the "log-rankCR" splitting rule. A default terminal node size, n 0 = 15 is used. Randomized splitting as described above is used, that is to say, at each parent node, for each of the randomly selected subset of covariates, "nsplit" randomly selected split points were chosen. The tree node is then split on that variable and random split point maximizing the absolute value of the split-statistic. For this simulation study, nsplit is set at 2 (nsplit = 2) because the simulation study has both continuous and discrete covariates. A small nsplit value is recommended in cases where there are both discrete and continuous covariates [5,24]. The number of randomly selected subsets of the covariates to split on at each node known as "mtry" is set at ffi ffi ffi p p .

Model evaluation for each simulation experiment
Evaluation metrics. The integrated Brier score (IBS) [33] is used as a measure of predictive performance for both models. The IBS is the squared difference between actual and predicted outcome.
Integrated Brier score (IBS). The Brier score is used when one is investigating the overall performance of survival models. It is desirable to have a model that is both discriminative (high concordance) and calibrated [34]. The Brier score is desirable because it measures both calibration and discrimination.
The Brier score is the average squared distances between the observed survival status and the predicted survival probability. For example, at a given time point t, the Brier score for a single subject is the squared difference between the observed event status (e.g., 1 = alive at time t and 0 = dead at time t) and a model based prediction of surviving to time t. For a test sample of size n test , the Brier score at time t, is given by: WhereĜðtjxÞ � PðC > tjX ¼ xÞ, is the Kaplan-Meier estimate of the conditional survival function of the censoring times. These are weightings of the Brier score to adjust for the presecnce of censored survival times. The integrated Brier score (IBS) is often used and it is given by: The IBS gives an average Brier score across a time interval, and we use it as a metric to compare the performance of the FG and RSF models. As stated above, the Brier score is used to measure both calibration and discrimination. This implies that it can be employed when one is evaluating the overall performance of survival models or when the goal is to find a model that performs well on both calibration and discrimination.
The 5 × 2-fold cv paired t-test, and the combined 5 × 2-fold cv F-test statistics are calculated based on the differences of the values of IBS scores.

Approximate statistical tests for comparing the Fine-Gray model and the random survival forest
Statistical hypothesis tests can be used to evaluate whether the difference in performance between two models is statistically significant. Two tests, that is, the 5 × 2-fold cv paired t-test [13], and the combined 5 × 2-fold cv F-test [14] were used in this study to determine whether the difference in the predictive performance between the FG and RSF models are significant.

× 2-fold cv paired t-test.
The K-fold cross-validated paired t-test is the most commonly used method for comparing the performance of two models. The problem with this method, however is that the training sets overlap and it is therefore not recommended to be used in practice [13]. The 5 × 2-fold cv paired t-test solves the problem of overlap in the training datasets that is prevalent in K-fold cross-validation paired t-test [13]. In addition, the 5 × 2-fold cv paired t-test yields larger test data and training data sets that do not overlap. Thus, the 5 × 2fold cv paired t-test becomes a more powerful test compared to the k-fold cross-validated paired t-test. This is because it measures directly the variation that is brought about by the choice of the training data set. The 5 × 2-fold cv paired t-test is therefore used as a post-hoc analysis to test whether the differences in the mean Brier scores of the FG and RSF models are statistically significant. The test statistict, for the 5 × 2-fold cv paired t-test is calculated as: 1 ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 5 where p ð1Þ 1 is the difference in the Brier Scores of the FG and RSF models for the first fold of the first iteration, s 2 i is the variance of the Brier Scores differences of the ith iteration. The variance  is computed using: In addition, p ðjÞ i is the difference in the Brier Scores of the FG and RSF models for the i th iteration and fold j. Note that: The flowchart in Fig 3 below is Algorithm 3 for the 5 × 2-fold cv paired t-test. Although the 5 × 2-fold cv paired t-test described in Fig 3 produces acceptable Type I errors, it fails in situations where performance metric's scores that are measured in the various 2-fold cross-validation replications vary wildly [14].
Combined 5 × 2-fold cv F-test. A study by [14] proposed a variant, the combined 5 × 2fold cv F test, that combines the results of the 10 possible statistics to get a more robust test. The test statistic of the combined 5 × 2-fold cv F-test is computed using: The statistic f is approximately F distributed with 10 and 5 degrees of freedom, and the hypothesis that the FG and RSF algorithms have the same value of the evaluation measurement is rejected if the statistic f is greater than 4.74 at α-level equal to 0.05. To compare the performance of FG and RSF models, the integrated Brier score (IBS) is used in this study.
Type I error. To control Type I error, that is, the likelihood of rejecting the null hypothesis that is true at some level α, we should reject the null hypothesis when the observed p-value is less than α: The p-value is a random variable that depends on the observed data used to compute it. From the definition of the cumulative distribution function of any random variable, when the null is true, the p-value has a uniform distribution on the interval 0 � p-value � 1 klam-mer2009statistical. If the null is true, the sample of the p-values will look exactly like a sample of uniform random variables from the interval [0, 1]. To calculate the the Type I error of the 5 × 2-fold cv paired t-test, and the combined 5 × 2-fold cv F-test. The null and alternative hypotheses are such that: Hypothesis H 0 : There is no significant difference in performance between the two models.
Hypothesis H 1 : There is a significant difference in performance between the two models.

Results
The simulations were repeated 100 times at ten different seeds for each of the sample sizes considered in the study. For each sample size, there is therefore a total of 1000 independent simulations. shows that the mean cv IBS scores for the RSF model decrease markedly for larger sample sizes. These results therefore indicate that for the linear simulations, the FG outperforms the RSF model as it produced the lowest mean IBS for the different sample sizes. It is also important to note that the mean cv IBS scores are below 0.25, which indicates that both models are predictive on the datasets given. The results of the quadratic simulations are shown in Fig  5. They indicate that for different sample sizes, the mean cv IBS scores of the FG model are higher than those of the RSF model. In addition, Fig 5 shows that the mean cv IBS scores for the RSF model decrease with the increase in the sample size. These results show that for the quadratic simulations, the RSF outperforms the FG model as it produced the lowest mean IBS for the different sample sizes. Furthermore, Fig 5 shows that for the quadratic simulations, the IBS results of the RSF model are more consistent than those of the FG model. Fig 6 shows summarises of the results of the interaction simulations. The summary indicates that the RSF model has lower mean cv IBS scores compared to the RSF model. It is also important noted that for large samples (greater than 500) the results for the interaction simulations are indistinguishable. Figs 4 to 6 also show that the variability in the predictive performance of the two models decrease with increase in the sample size. Our results are consistent with previous studies that indicated that variability in predictive performance decreases with increasing sample size [35,36]. Table 1 summarises all the simulations of the linear, quadratic and interaction results based on the FG and the RSF models. The results for the linear simulations show that the mean cv IBS scores for the FG model are on average between 0.16-0.18 across all sample

PLOS ONE
sizes, which is lower compared to the mean cv IBS scores for the RSF model which are betwen 0.19-0.22 across all sample sizes.
The F and t-statistics show that the proportion of significant tests largely increases as the sample sizes increase for the linear simulations. This means that for larger sample sizes, the two models have an even large significant difference in their predictive performance. For example, up-to 95% of the simulated samples have significant F-statistics and t-statistics for the sample size N = 3000. The results imply that the FG model is superior in predictive performance in linear simulations compared to the RSF model because it has a lower mean IBS scores for linear simulations in larger sample sizes as shown in Table 1.
The results further show that, the mean cv IBS scores for the FG model on the quadratic simulations are on average between 0.24-0.27 across all sample sizes. In contrast, the mean cv IBS scores for the RSF model are on average much lower and between 0.21-0.23 across all sample sizes. The table shows that 100% of the samples considered have significant F and t-statistics for a sample size of N = 3000. This indicates that the performance of the RSF model on the quadratic simulations is statistically significant and better than that of the FG model especially in larger sample sizes.
The results in Table 1 also indicate that the mean cv IBS scores for the FG range from 0.23 to 0.25 for the interaction simulations compared to 0.22 to 0.23 for the RSF model. The results for the interaction simulations further suggest that this difference in the predictive performance of the two models is not statistically significant. This is because, the percentage of statistically significant t-statistics for the samples considered range from 1.80% to 21.40%. The percentage of significant F-statistics range from 0.10% to 10.20%. Large sample simulations confirm the result that the two model's predictive performance is not significantly different

PLOS ONE
with approximately 1.8% samples with significant t-statistics and 0.10% samples with significant F-statistics for the samples of size of N = 3000. Table 1 states performance values together with the statistical tests results to tell whether these differences that exist in predictive performance are significant or not. This type of reporting is a good practice in clinical research especially when deciding on the best model to use when the choice is between a more interpretable (classical statistical) and a machine learning (black box) type of model. This is because the researcher can use the statistical test results to justify their model choice.
The study further investigated the Type I error of the 5 × 2-fold cv paired t-test, and the combined 5 × 2-fold cv F-test. Fig 7 shows that the 5 × 2-fold cv paired t-test, has a higher Type I error compared to the combined 5 × 2-fold cv F-test. This is expected because the combined 5 × 2-fold cv F-test combines the results of the 10 possible statistics rather-than using only one of them. The observed p-values tend to have a uniform distribution for the larger sample sizes. The large type I error of the two tests in smaller sample sizes needs to be investigated further. However, the most plausible explanation of this phenomena arises from the assumptions made when constructing the tests. One of the assumptions is that the difference of two identically distributed predictive performance values (p ðjÞ i ) are assumed to be independent when in-fact they are not independent. The differences are also assumed to be independently normally distributed which is not strictly true because the training and test sets are not drawn independently of each other [14]. The strict assumption could be very amplified in the smaller sample sizes than in larger sample sizes. The independence assumption affects the combined 5 × 2-fold cv F-test where the assumption is that P 5 i¼1 P 2 j¼1 ðp ðjÞ i Þ 2 and P 5 i¼1 s 2 i are independent which is not technically true [14].
Another explanation is the fact that machine learning models trained on a small dataset are more likely to see patterns that do not exist, which results in high variance and very high error on a test set. These are the common signs of overfitting. A study by [37] used datasests to train supervised ML methods to classify healthy individuals and individuals with brain disorders. The study used datasets with smaller sample sizes with a median number of samples equal to 88 and interestingly, the overall reported accuracy was higher in the datasets with smaller sample sizes [37,38]. A study by [38] trained machine learning and classical statistical methods using simulations at different sample sizes to provide an insight into whether the tendency to report higher performance estimates with smaller sample sizes could be due to insufficiently reliable validation. They used the K-Fold CV and their results showed that the machine learning model accuracies were considerably higher than the theoretical chance level of 50%. The highest difference was observed with smaller sample sizes; however, the difference was still evident even at the sample size of N = 1000. The results from these two studies agree with the results from our study as demonstrated in Fig 8 below  This is an indication that the F-statistic is "hovering around" 1 after repeatedly computing the F-statistic for situations when the null is true. Fig 10 shows that the t-statistics under the null is normally distributed but the peak of the graph is not at zero, for most of the sample sizes except for the one where N = 3000. This indicates that obtaining a sample value close to the null hypothesis is most likely in a larger sample size.

Discussion and conclusion
This study explores the existing statistical tests that can be used to identify a significant difference between classical and machine learning models in the analysis of survival data. The study trained two models, that is the Fine-Gray and the Random survival forests for competing risks in three low-dimensional data scenarios namely; the linear, the quadratic and the interaction models. The Fine-Gray is a classical statistical model while the random survival forest model for competing risks is a machine learning model. The study revealed that the FG model is superior in predictive performance in the linear-low-dimension data simulation scenarios and that the difference in the predictive performance in comparison to the RSF is significant.
The study further revealed that the RSF model has lower IBS values in the interaction lowdimension simulations but the two statistical tests showed that there is no significant difference in the predictive performance of this model in comparison with the FG model in the interaction simulations.
Furthermore, the study showed that the RSF model is superior in predictive performance in quadratic low-dimension data simulation scenarios compared to the FG model. The F and the t-statistics tests also revealed that the difference in this predictive performance is highly significant especially in large data samples.
This study confirms that in the presence of complex relationships between the outcome and the predictors, the machine learning model (RSF) is superior in predictive performance. In linear simulations, however, the FG model model is superior. These results are similar to those obtained in the study by [5]. However, this study goes further to state whether this difference in predictive performance is significant or not.
The results revealed that sometimes there is no significant difference between the classical statistical model and the machine learning model. Having knowledge of this can guide clinicians to use the most interpretable model. This result is important especially if the goal of any

PLOS ONE
given study is not to predict the outcome which is usually the motivation for using a machine learning model.
The study recommends that statistical tests such as the ones used in this study that is, 5 × 2fold cv paired t-test, and the combined 5 × 2-fold cv F-test become part of regular practice in justification for the use of both machine learning and classical statistics models for data analysis in medical studies.