Integrating additional factors into the TNM staging for cutaneous melanoma by machine learning

Background Integrating additional factors into the TNM staging system is needed for more accurate risk classification and survival prediction for patients with cutaneous melanoma. In the present study, we introduce machine learning as a novel tool that incorporates additional prognostic factors to improve the current TNM staging system. Methods and findings Cancer-specific survival data for cutaneous melanoma with at least a 5 years follow-up were extracted from the Surveillance, Epidemiology, and End Results Program of the National Cancer Institute and split into the training set (40,781 cases) and validation set (5,390 cases). Five factors were studied: the primary tumor (T), regional lymph nodes (N), distant metastasis (M), age (A), and sex (S). The Ensemble Algorithm for Clustering Cancer Data (EACCD) was applied to the training set to generate prognostic groups. Utilizing only T, N, and M, a basic prognostic system was built where patients were stratified into 10 prognostic groups with well-separated survival curves, similar to 10 AJCC stages. These 10 groups had a significantly higher accuracy in survival prediction than 10 stages (C-index = 0.7682 vs 0.7643; increase in C-index = 0.0039, 95% CI = (0.0032, 0.0047); p-value = 7.2×10−23). Nevertheless, a positive association remained between the EACCD grouping and the AJCC staging (Spearman’s rank correlation coefficient = 0.8316; p-value = 4.5×10−13). With additional information from A and S, a more advanced prognostic system was established using the training data that stratified patients into 10 groups and further improved the prediction accuracy (C-index = 0.7865 vs 0.7643; increase in C-index = 0.0222, 95% CI = (0.0191, 0.0254); p-value = 8.8×10−43). Both internal validation using the training set and temporal validation using the validation set showed good stratification and a high predictive accuracy of the prognostic systems. Conclusions The EACCD allows additional factors to be integrated into the TNM to create a prognostic system that improves patient stratification and survival prediction for cutaneous melanoma. This integration separates favorable from unfavorable clinical outcomes for patients and improves both cohort selection for clinical trials and treatment management.


Introduction
Cutaneous melanoma is responsible for the vast majority of skin cancer-related deaths in the US and accounts for 7 percent of new cancer cases in men and 4 percent of new cancer cases in women [1]. There are multiple prognostic factors for cutaneous melanoma that correlate with survival. These include lesion thickness, lymph node involvement, metastasis, ulceration, age, sex, anatomic location, mitotic rate, satellite/in-transit lesions, and serum lactic dehydrogenase level, etc [2]. Among these, the primary tumor feature (thickness and ulceration), nodal involvement, and distant metastasis are the three factors that constitute the AJCC staging system [3]. The initial staging system for cutaneous melanoma was developed by the American Joint Committee on Cancer (AJCC) in 1977 [4]. Through the years, the AJCC makes periodic revisions to the TNM system in an effort to reflect a better understanding of the disease. The current TNM system for melanoma classifies the disease from stage I through IV. Surgical excision is often adequate for stages I-II. Sentinel lymph node biopsy may have a role in early to intermediate cases to improve staging and identify cases who may benefit from earlier lymphadenectomy [5]. Treatment of the locally advanced or metastatic disease is complex and may require additional adjuvant treatments. Multidisciplinary tumor boards are needed to individualize treatment plans for advanced stage patients. Although melanoma has been shown historically to be a radioresistant and chemoresistant cancer, clinical trials have shown the efficacy of targeted immunotherapy, especially checkpoint blockade in metastatic cases. Additional trials are ongoing to evaluate its use in the neoadjuvant, adjuvant, and palliative settings, with or without local therapy such as surgery or radiation [6].
In the context of evolving knowledge and modern tumor registries, additional factors should be integrated to augment TNM to create prognostic systems for improved research and treatment. Cox regression modeling [7] and tree modeling [8,9] are two approaches that could expand the AJCC staging system by integrating additional factors. Cox regression, which focuses on the optimal fitting of the data, can achieve a high accuracy in predicting survival. However, the assumptions made with Cox modeling (e.g., each factor/variable makes a linear contribution to the model) usually do not prove realistic. In addition, there is no clear rule that can guide how to use the output (e.g., the nomogram) to stratify patients into risk groups analogous to AJCC stages. As a result, studies often empirically partition patients into a small number of groups (e.g., 2 or 3 quantile groups) [10]. On the other hand, the traditional survival tree method, partitioning the space of values of factors into disjoint and non-overlapping regions, can be used to form prognostic groups explicitly. However, the tree models in general have a low prediction accuracy [11].
In this study, we described a machine learning approach using the Ensemble Algorithm for Clustering Cancer Data (EACCD) [12][13][14][15][16][17][18][19][20][21][22] to expand the AJCC staging system by integrating additional factors that contribute to the prognosis of the disease. We demonstrated the approach by building 2 prognostic systems. One system, based on primary tumor (T), regional lymph nodes (N), and distant metastasis (M), was employed to examine the current 8th AJCC staging system. The second system, based on T, N, M, age (A), and sex (S), expanded the current staging system.

Data
Data from patients diagnosed with cutaneous melanoma during 2004 through 2013 were obtained from 18 databases of the Surveillance, Epidemiology, and End Results Program (SEER) of the National Cancer Institute [23]. The restriction on year of diagnosis was used for reasons that 1) information of some factors' levels was not available until 2004; 2) 2013 is the latest year that ensures a minimum 5-year follow up since the current release of SEER data followed patients up to 2018. Survival time (measured in months) and SEER cause-specific death classification variable [24] were used to obtain melanoma-specific survival for analysis. In addition to restricting the years of diagnosis between 2004 and 2013, we required patients' nodes to be pathologically evaluated (CS Lymph Nodes Eval = 2,3,6,8 [25]). This study investigated 5 factors: T, N, M, A, and S. The datasets used for analysis were generated as follows.
Survival time (in months), SEER cause-specific death classification variable, T, N, M, A, and S were collected for the studied patients. The levels of T (9 levels: T0, T1a, T1b, T2a, T2b, T3a, T3b, T4a, and T4b), N (7 levels: N0, N1a, N1b, N2a, N2b, N2c, and N3), M (2 levels: M0 and M1), A (2 levels: A0, A1), and S (2 levels: S1 and S2) are defined in S1 Table. Age 70 was used as the cut-off since an age greater than 70 years was associated with worse primary tumor characteristics and higher mortality compared with younger patients [26]. A combination is defined as a subset of the data corresponding to one level of each factor and is denoted by levels of factors (e.g., T1N0M0A0S1 represents a subset of patients with T = T1, N = N0, M = M0, A = A0, S = S1). We retained patients without missing values in any of the five factors (i.e., T, N, M, A, and S) so that their combinations can be formed. Then patients were assigned to the training set if they were diagnosed with melanoma before 2013, and to the validation set if diagnosed in 2013. The validation set obtained this way was used in temporal validation to examine the reproducibility of the prognostic system built on the training set [27]. For more accurate estimation in the development of prognostic systems, we removed all rare combinations each containing less than 25 patients in the training set. The final training set used for analysis includes 113 combinations giving a total number of 40,781 cases. For the validation set, we excluded patients who could not be classified into any of the 113 combinations of the training set. The final validation set contains 5,390 patients. See Fig 1 and Table 1 for details on data. Additional information on datasets is contained in S1 Data.

EACCD
The EACCD is a machine learning algorithm that can be used to cluster patients into n � prognostic groups so that i) patients from the same group have a similar survival experience and patients from two distinct groups have different survival experiences; and ii) n � is as small as possible while n � groups can achieve the maximum possible accuracy in survival prediction. In this study, we will use an expanded version of the algorithm that consists of 4 steps: 1) defining initial dissimilarities between survival functions of any two combinations; 2) obtaining learned dissimilarities by using initial dissimilarities and an ensemble learning method; 3) applying hierarchical clustering to cluster combinations by the learned dissimilarities; and 4) generating n � prognostic groups by exploiting the C-index. Different approaches are available for each step. In this paper, the initial dissimilarity between two combinations in step 1 was defined by Mann-Whitney based effect size |P(T 1 >T 2 )−0.5|, where T 1 and T 2 are the failure times of two underlying populations corresponding to the two combinations (see below). In step 2, the ensemble learning method was based on the two-phase Partitioning Around Medoids algorithm [28]. The minimax linkage method [29] was chosen for hierarchical clustering in step 3. The output of this step is a tree-structured dendrogram, which represents the relationship among survival of patients in different combinations. In step 4, a C-index curve was first produced using the dendrogram in step 3 and then the optimal number n � of prognostic groups was found using the "knee" point of the C-index curve and then the n � prognostic groups were generated by cutting the dendrogram [16,18,20].

Mann-Whitney based effect size
The Mann-Whitney parameter arises from the widely used Mann-Whitney test [30] that examines whether one of two random variables is stochastically larger than the other. Let T 1 and T 2 denote the variables of survival time for patients from population 1 (with survival function S 1 (t)) and population 2 (with survival function S 2 (t)), respectively. The Mann-Whitney parameter is defined as P(T 1 >T 2 ), the probability that a randomly chosen patient from population 1 has a longer survival time than a randomly chosen patient from population 2. The difference between the Mann-Whitney parameter and 0.5 suggests a difference between S 1 (t) and S 2 (t). Efron proposed to useD ¼ À R 1 0Ŝ1 ðtÞdŜ 2 ðtÞ to estimate the Mann-Whitney parameter for censoring data, whereŜ 1 ðtÞ andŜ 2 ðtÞ are Kaplan-Meier estimates of S 1 (t) and S 2 (t), respectively [31]. However, Efron's estimator requires that bothŜ 1 ðtÞ andŜ 2 ðtÞ drop to 0 at the maximum study time (the longest following-up time) and is rather unstable when censoring occurs due to incomplete follow up [32]. To overcome this problem, Wang [20] proposed to 1) use the exponential tails to complete survival functions; and 2) use the completed survival curves to approximate P(T 1 >T 2 ). More specifically, Accordingly,D e ðt 1 ; t 2 Þ ¼Â þB þĈ, an estimate of P(T 1 >T 2 ), can be obtained by using the Kaplan-Meier estimates and Lebesgue-Stieltjes integration. In this paper, we used jD e ðt 1 ; t 2 Þ À 0:5j to compute the initial dissimilarity in survival between two combinations, with τ 1 and τ 2 set to be the maximum possible time by which the Kaplan-Meier estimates of the survival of all combinations can be calculated.

Prognostic systems
Survival curves using the Kaplan-Meier estimates [33] were plotted for the n � prognostic groups from EACCD. These curves allow visually evaluating survival differences among the groups. The final prognostic system includes the dendrogram, group assignment, C-index, and survival curves for the prognostic groups.

Validation approach
The prognostic system was evaluated through both internal and temporal validation [27]. Internal validation, conducted on the training set, was mainly used to evaluate the validity of the system. Stratification and the predictive accuracy, two key characteristics of the system, were evaluated with internal validation. Stratification was examined by inspecting the survival curves of prognostic groups and comparing them using the univariate Cox proportional hazards regression model and the logrank test. The predictive accuracy was examined by computing the C-index of the prognostic system. The temporal validation, conducted on the validation set, was used to assess the system's reproducibility, i.e., both stratification and the predictive accuracy obtained using the training set were also valid for patients in the validation set.

Software
All statistical analyses were done in R (R version 3.6.2) using the following libraries: survival, cluster, protoclust, factoextra, and compareC.

Prognostic system for T, N, M
Applying the EACCD with T, N, and M to the training set yielded the dendrogram in Fig 2A. Dissimilarity between two combinations was computed by completing survival curves of combinations with exponential tails from 149 months, i.e., τ 1 = τ 2 = 149. The C-index curve based on the dendrogram, shown in Fig 2B, was used to find the optimal number of prognostic groups n � . The knee point of the curve corresponds to 10 groups (C-index value of 0.7682), which suggested n � = 10. Cutting the dendrogram into n � = 10 groups is shown by rectangles in Fig 2C. Fig 2D (and S1 Fig) shows the survival curves of these 10 groups, which are well separated. For convenience, the definition for all 10 groups is restated in the 4th column of S2 Table. The resulting prognostic system for T, N, and M includes the dendrogram in Fig 2A, the groups in S2 Table (4th column), and the survival curves in Fig 2D. Note that the risk of mortality is increased as the group number increases. For a quick comparison, Fig 3 (and S2  Fig) shows the survival curves of 10 stages of the 8th AJCC staging system.

Validation of the EACCD systems
The EACCD systems in {T, N, M} and {T, N, M, A, S} built above can be validated internally using the training data and temporally using the validation data. Below we present validation results for the EACCD system based on {T, N, M, A, S}. The same procedure can be followed for the EACCD system on{T, N, M}.
The internal validation evaluates the stratification and predictive accuracy of the EACCD system using only the training set.   Table 2. This table reveals that the hazard ratios between adjacent groups are uniformly significant and are mostly between 1.3 and 1.6., and that the logrank test results between adjacent groups are also uniformly significant. The previous section showed that the C-index value of the EACCD system based on {T, N, M, A, S} has a C-index value close to 0.8, which implies a good predictive accuracy of the system [27]. Therefore, the EACCD system on {T, N, M, A, S} well stratifies and predicts the patients from the training set.
The external validation examines the stratification and predictive accuracy of the EACCD system on the validation set. Fig 6 (and S4 Fig) shows the survival curves of the 10 prognostic groups on the validation set. Although due to the small sample size these curves do not provide accurate estimates of the survival rates, they preserve the pattern from the training set (Fig 5). This is further confirmed by Table 2, which shows separation between groups, though some results are less significant due to small sample sizes. Thus a good stratification occurs in the validation set. (The large p-value between groups 6 and 7 is mainly due to their partially overlapping survival curves in early years (Fig 6 and S4 Fig).) Calculation shows that the EACCD system has a C-index value of 0.7941 on the validation set, which is high. Therefore the system has a high predictive accuracy for the patients in the validation set.
The above internal and external validation shows that the EACCD system on {T, N, M, A, S} has good stratification and a high predictive accuracy for patients used in the development and patients in the future. Applying a similar validation procedure to the EACCD system on {T, N, M} led to the same conclusion for the system in three variables.

Comparison with AJCC
With the training data, the EACCD prognostic system on {T, N, M} can be compared with the 8th ed. AJCC staging system in terms of both prediction and stratification.

PLOS ONE
The 8th edition of AJCC divides the training set into 10 stages (The 5th column of S2 Table  and Fig 3). Calculation shows that the AJCC staging system has a C-index of 0.7643. The pvalue of the non-parametric C-index based test [34] for testing the difference (0.0039, 95% CI = (0.0032, 0.0047)) between the prediction accuracy of the above EACCD prognostic system (10 groups, C-index = 0.7682) and the AJCC staging system TNM (10 stages, Cindex = 0.7643) is 7.2×10 −23 , showing that the former has a significantly higher survival prediction accuracy than the latter. Table 3 presents the distribution of patients for each of the 10 AJCC stages over the 10 EACCD groups. Roughly, it is seen that a higher stage in AJCC corresponds to a higher-risk group in EACCD, and vice versa. Indeed the assignment to ordered stages (from the smallest to the biggest risk) and the assignment to ordered prognostic groups have a large Spearman's rank correlation coefficient [35] of 0.8316 with a p-value of 4.5×10 −13 . Therefore, there is a strong positive association between AJCC staging and EACCD grouping. The p-value for the hazard ratio is from the Wald test testing the null hypothesis that the hazard ratio equals 1. The p-value for the logrank test is from the logrank test testing the null hypothesis that two survival curves are equal.

Table 2. Output of the Cox proportional hazards regression model and the logrank test for EACCD grouping on the basis of T, N, M, A, and S on the training set.
https://doi.org/10.1371/journal.pone.0257949.t002 However, survival curves of different AJCC stages often overlap, which can cause confusion in treatment planning. For instance, stages IIID and IV have overlapping survival curves (Fig 3  and S2 Fig). In addition, due to heterogeneity, higher stage groups can have higher survival rates, which is counterintuitive. For example, stage IIIA has a more favorable survival curve than stage IIB, and stage IIIB has a more favorable survival curve than stage IIC. These drawbacks were also reported by Abdel-Rahman et al. [36], though they applied a different data management process to the SEER data. We must emphasize that the EACCD prognostic groups rarely have these problems-the survival curves are well separated and the index of prognostic groups preserves the risk order. Table 3 details how this happens.
In fact, Table 3 shows whether cases of an AJCC stage should be divided according to prognostic groups included in the EACCD. Horizontally, we can examine all stages line by line. First, we observe that no splitting for stage IA and all 14,588 cases are assigned to group 1. Then we observe all cases included in stage IB are assigned to prognostic group 2. We continue until reaching the last line where stage IV equals prognostic group 10. The key observation is that each of stages IIIA, IIIB, and IIIC is split into cases of multiple prognostic groups. For instance, cases of stage IIIC are decomposed into those in prognostic groups 4-9. We note that based on its design, EACCD grouping represents an assignment of cases so that cases within a group are more homogeneous in survival and cases from different groups differ in survival. Therefore, the above observation may indicate that each of stages IIIA, IIIB, and IIIC contains more heterogeneous cases than other stages. This is consistent with the AJCC definition of stages IIIA, IIIB, and IIIC [3].
In summary, in predicting survival, the EACCD prognostic system on {T, N, M} has a significantly higher accuracy than the AJCC staging system TNM. In stratifying patients, the EACCD grouping produces well-separated survival curves while AJCC staging does not, but EACCD grouping and AJCC staging are strongly positively associated.

Effect of factor levels on survival
The EACCD prognostic system on {T, N, M, A, S} represents an expansion to TNM. Compared with the EACCD system on {T, N, M}, this expanded system further improved the prediction accuracy of the AJCC staging system (C-index = 0.7865 vs 0.7643; increase in Cindex = 0.0222, 95% CI = 0.0191, 0.0254); p-value = 8.8×10 −43 ).
Below we applied this expanded system to examine the effect of levels of individual factors on survival. To simplify the analysis, we consider the following three risk categories by combining 10 prognostic groups from the system: low risk (groups 1-3), medium risk (groups Table 3. Contingency table between EACCD grouping

4-7)
, and high risk (groups 8-10). Fig 7 shows how patients associated with a factor level are distributed across these three risk categories. The 1 st panel of Fig 7 shows the distribution of patients associated with different T levels. T1a, T1b, and T2a curves peak at low risk with a near 100% proportion, indicating that patients associated with T1a, T1b, and T2a are almost exclusively distributed in groups with low risk. The T2b and T3a curves also peak at low risk. Patients with these two levels are more likely to appear in low-risk groups, but there is a reasonably big proportion (about 40%) of them classified into middle-risk groups. Patients with T0, T3b, and T4a are mostly distributed in middle-risk groups. We note that T0 is not a highly optimistic level as what we would typically imagine. The reason is that patients with T0 are always associated with N levels higher than N0. The indication of a medium risk is also reflected in the AJCC staging system where T0 patients are assigned to either stage IIIB or IIIC. T4b is the level with the worst prognosis and patients with T4b are mostly distributed in high-risk groups. The above indicates an increased severity of the disease in terms of the order {T1a, T1b, T2a}, {T2b, T3a}, {T0, T3b, T4a}, T4b.
The 2 nd panel shows that: 1) N0 patients are most likely to have a low risk; 2) N1a patients are more likely to assume a medium risk; 3) N1b and N2a patients are most likely at either a medium or high risk; and 4) N2b, N2c, and N3 patients are most likely to have a high risk. These indicate the increased severity of the disease in terms of the order N0, N1a, {N1b, N2a}, {N2b, N2c, N3}.
The 3 rd panel reveals that M0 patients have a decreasing distribution from low to high-risk groups, whereas the entire M1 patients are distributed in the high-risk group only. This clearly suggests that patients with M1 have a poor prognosis.
The 4 th and 5 th panels clearly show that A0 and S2 have a better prognosis than A1 and S1, respectively.
The above analysis shows how a factor level is associated with risk. Although these findings are known from the literature, this is the first time these factor levels are integrated together and described explicitly in ordered risk groups of the prognostic system TNMAS created in this paper.

Relationship to stratification and prediction
It is seen that an EACCD prognostic system is derived by partitioning, according to survival, patients into a minimum number of groups that yields approximately the maximum value of the predictive accuracy, i.e., C-index. A prognostic system is neither a model only targeting at the "best" stratification nor a model only aiming at the "highest" predictive accuracy. A prognostic system is a model that shows a tradeoff between stratification and prediction. It is this balance that possesses unique utilizations in clinical settings such as staging patients, offering treatment, and designing trials, etc. a relatively large data set in order to obtain robust estimates of survival. This report includes combinations with a minimum of 25 cases, which may exclude some "rare" but interesting combinations. The impact of this requirement on the size of combinations will be minimized as more data become available. In addition, the absence of N1c subcategory in our study could affect, to some extent, the comparison between our system and the AJCC system (S1 Data).

Conclusions
We used SEER data and the EACCD to build two prognostic systems to demonstrate how machine learning techniques can be used to improve prognostic assessment and risk stratification for cutaneous melanoma. The basic system, on the basis of T, N, and M, classifies patients into prognostic groups that are strongly positively correlated with the AJCC TNM stages but have a higher prediction accuracy in survival. Survival curves of the prognostic system from EACCD do not overlap and are reasonably ordered, which is in contrast to the AJCC TNM system where different stages can present overlapping survival. The second system, incorporating age and sex as additional prognostic factors, further improves prediction accuracy. The prognostic systems from EACCD serve the same role as the TNM staging system and are created through optimizing both prediction and stratification, which can impact clinical and treatment decisions.
As our understanding of melanoma and other cancers improves, more factors will come to the forefront of clinical significance. These additional factors will be very difficult to manually incorporate into conventional TNM staging systems. Our approach to staging allows incorporation of unlimited additional factors to create nuanced staging systems that will provide realtime prognosis information. When sufficient data become available, other variables/factors, such as treatment and immune checkpoint inhibitors, could be readily integrated into known systems by the EACCD to generate prognostic systems for refinements in stratification and outcome prediction that are necessary for patient care, such as monitoring large therapeutic trials.