SurvivalPath:A R package for conducting personalized survival path mapping based on time-series survival data

Summary The survival path mapping approach has been proposed for dynamic prognostication of cancer patients using time-series survival data. The SurvivalPath R package was developed to facilitate building personalized survival path models. The package contains functions to convert time-series data into time-slices data by fixed interval based on time information of input medical records. After the pre-processing of data, under a user-defined parameters on covariates, significance level, minimum bifurcation sample size and number of time slices for analysis, survival paths can be computed using the main function, which can be visualized as a tree diagram, with important parameters annotated. The package also includes function for analyzing the connections between exposure/treatment and node transitions, and function for screening patient subgroup with specific features, which can be used for further exploration analysis. In this study, we demonstrate the application of this package in a large dataset of patients with hepatocellular carcinoma, which is embedded in the package. The SurvivalPath R package is freely available from CRAN, with source code and documentation hosted at https://github.com/zhangt369/SurvivalPath.


Introduction
The idea of developing the SurvivalPath R package stems from our previous exploratory work, in which we attempted to achieve dynamic prognosis prediction by establishing survival paths based on the time-series data of patients with hepatocellular carcinoma (HCC) [1]. In the last decade, enormous efforts have been made in establishing prognostic models using machine learning or deep learning techniques with large dataset to facilitate dynamic management of cancer patients [2] [3]. However, most of these models are black-box, which limit their interpretability and clinical application [4] [5]. The survival path approach we proposed provide a potential solution for dynamic prognosis prediction and management of cancer patients by constructing survival path maps using returned key prognostic factors after analysis of structured time-series survival data. More importantly, the survival path model could be easily understood and utilized by clinicians when compared to black-box models.
The SurvivalPath R package is a newly developed tool to facilitate fast building of survival path models, with an aim of promoting standardization of this methodology. To make the Sur-vivalPath R package adapted to datasets in whom the included variables have potential collinearity [6], we optimized the feature selection process. One-to one collinearity analysis was embedded to screen out noncollinear candidate variables before formal feature selection in the main function. This step simplifies the process of picking variables among different categorical variables and reduces the confounding impact of potential collinearity on feature selection in the Cox model. In addition, the SurvivalPath R package is compatible with continuous variable, enabling automatic binary classification of continuous variables and their entry into the model [7]. To facilitate fast building and efficient usage of survival path, this package also provide functions to enable: 1. pre-processing of the data into regular time slices; 2. fast selection of patient subgroups based on personalized design; 3. demonstration of the connections between exposure/treatment and node transitions.

Ethics statement
This study was approved by Sun Yat-sen University Cancer Center's (SYSUCC) Ethics Committee (no. B2022-220-01). The usage example presented was an anonymously retrospective analysis of routine data and therefore we requested and were granted a waiver of individual informed consent from the SYSUCC ethics committee. Software design 1) Converting data of time-slices into data of time slices. The time series data of each patient firstly needs to be sorted out in rows with each row represent data at specific time point. The time axis needs to be divided into time slices with the proposed interval given by the researcher. For each patient, the zero point was set at first/earliest time point, which is usually the first row of the patient's time-series data. The time slice with complete data of all included variables was defined as slice with complete data (Fig 1A). This process of data preprocessing could be finished using the built-in function "timedivision" in this package.
2) Feature selection and survival path mapping. The key features for bifurcation for each node in the construction of the survival paths are selected by repeated cycles of processing flow. The processing flow in each node includes the following steps.
Step 1: Calling the data in the parent node. Initially, the data of all included cohort at no.1 time slice was called, which is denoted S (all; ts = 1) (Fig 1B). Univariate analysis with Kaplan-Meier (KM) method was utilized to identify candidate variables (X 1 , X 2 , X a , X b ,. . ., X p ). In constructing the survival path, to control the false discovery rate and create a balanced survival

PLOS COMPUTATIONAL BIOLOGY
path, the researcher can set the significance level for feature selection in the argument of p. value in the main function survivalpath().
Step 2: Judgment. The minimum sample size required for the Cox proportional hazard regression model with multiple covariates was input with the parameter "Minsample" in the main function Survivalpath ().
Step 3: Feature selection. All the significant variables detected using the KM method will undergo collinearity analysis between pairs. The correlation coefficient between variables in pairs was calculated by the formula below, where X a and X b represent one of the variables, respectively.
Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P n i¼1 ðx ai À � x a Þ 2 q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi The researcher can set the correlation coefficient by himself, and the pair of variables that exceed the correlation coefficient will automatically compare their Akaike information criterion (AIC) values when each of two serve as the only predictor for outcome; the variable with the smaller AIC value will be removed. All the significant variables left were the "candidate variables" and then be put into Cox proportional hazard regression model, which assumes the hazard as follows, where (X 1 , X 2 ,. . ., X p ) is a vector of p predictor variables, and β 1 , β 2 ,. . ., β p are the corresponding regression coefficients, which are the weights given to each variable by the model. The original model included all the candidate variables and was presented as follows: The backward elimination (BE) procedure was carried out, with the following p tests, H 0j : β j = 0; j = 1; 2,. . ., p, the lowest partial F-test value F l corresponding to H 0l : β l = 0 is compared with the pre-set significance values F 0 . If F l < F 0 , then F l is deleted and the new original model is: Then, a stepwise BE procedure was continued, until all F l > F 0 , and the model is what we choose. The importance of each variable in the fixed Cox model can be obtained as follows: where L h refers to the likelihood of the fixed model and L h−q refers to the likelihood of model after elimination of the variable X q . The variable eliminated from the model with the maximal change of −2log(L) was selected.
Step 4: Cycles of processing. Based on the selected dichotomized variable X q , the cohort S (all; ts = 1) will be divided into two subgroups S (−q; ts = 1) and S (+q; ts = 1) . The two subgroups serve as parent node and the corresponding cohort at the next time slice (after removal of those lose follow-up or reach endpoint in last time slice) ( Fig 1B). If no variable is selected in certain node, the cohort in that node stays the current classification and the data in next time slice will repeat step 1-3. The process flow will end until last time slice, which could be assigned with the argument "time_slices" in the main function.
Step 5: Graphic representation: A tree-based survival path can be constructed.

3) Dealing with continuous variable: Automatic dichotomization.
A large number of studies have shown that the method based on time-dependent receiver operating characteristic (ROC) curves can accurately find the ideal cutoff value of continuous variables [8]. Therefore, we embed the function for automatic dichotomization of the continuous variables according to the best cutoff recommended by the Time-dependent ROC curve in the classifydata () function, thus the program is compatible with candidate features in the form of continuous variables.

4) Survival Path
Mapping of Re-arranged data. In the past, researchers often used the traditional staging system for dynamic staging, that is, patients with similar disease states at different time points after the onset of disease are considered to have disease with similar severity [9]. Based on the concept that similar disease status at different time points across patients can sometimes be regarded as similar diseases and set as the same starting line, we designed the matchsubgroup () function, which can capture the time-series data of patient once the disease status happen from the time series data through matching values of key features.

Implementation
Before constructing a survival path model, time-series dataset in the form of continuous time slices is needed. The prepared time-series data should meet the following requirements: 1. The time series data of each participant is arranged in rows with each row represent data at each specific time point.
2. The dataset need to contain separate fields for participants' ID, time points of the recorded data, survival time and survival outcome. The records in each row need to have complete data on these fields.
3. Dichotomies as candidate variables are recommended. When the input variables are grade variables or continuous variables, re-classification is feasible by setting argument "ifclassifydata = TRUE" in the function generatorDTSD(). Multiple categorical variables with more than three categories that are not logically correlated or in rank correlation are not recommended as the interpretation of result will be difficult.
To build a survival path model using SurvivalPath R package, four key functions are important in implementation. The first and the main function is survivalpath(), whose function signature is survivalpath (DTSD, time_slices, treatments = NULL, num_categories = 2, p. value = 0.05, minsample = 15, degreeofcorrelation = 0.7, rates = 365). The DTSD is a list object that can be generated using the built-in function generatorDTSD(), which includes list data of time, status, timeslicedata and tspatientid of each time slice. The time_slices argument specify the total number of time slices (starting from the front) need to be included in the survival path model. The default value of treatments is NULL, this argument can specify the intervention/exposure variable the researcher find interest, which could be utilized during analysis of efficacy of different arms in certain nodes. The last five arguments were used to specify the maximum number of branches that each node can divide, p.value for hypothesis testing, minimum sample size for branching, cutoff value of correlation coefficient for collinearity analysis and the time point set for estimating survival rates of each node, respectively.
The first argument dataset specify the dataframe of row arranged time-series survival data, which is usually un-preprocessed data prepared by the researcher; the data structure requirement is displayed in Fig 2. The ID and time arguments correspond to the specified variables that represent the unique identifier of each participant in each row and the date to which each data point belongs, respectively. The last three arguments were parameters used to specify the length of time slices and the time interval for inclusion of time point data into each time slice.
The third function generatorDTSD() is used to generate DTSD class objects using a preprocessed dataframe arranged in time slices. The function signature of generatorDTSD() is generatorDTSD(dataset, periodindex, IDindex, timeindex, statusindex, variable, ifclassifydata = TRUE, predict.time = 365, isfill = TRUE) The first argument dataset refers to the dataframe of time-series observations, containing identification numbers of each subject, index of time slice, value of risk factors, survival time, and survival outcomes; the dataset usually could be returned by timedivision() function. The arguments periodindex, IDindex, timeindex, statusindex specify the time slice indicator, ID indicator, time and status indicators in the dataset, respectively. The argument variable is a list object to specify the risk factors required for modeling. Arguments ifclassifydata, predict. time are optional and useful in automatic dichotomizing risk factors. The argument isfill is a logical value, which is used to confirm whether to fill in missing data. If it is True, then fill with the average level.
The last key function matchsubgroup () is used for screening and extracting data of subjects that meet the given conditions, whose function signature is matchsubgroup(DTSD, varname, varvalue) The first argument DTSD specifies a list object that can be generated using generatorDTSD (). The argument varname specifies a of list variables used to screen subjects, and the variables needs to be included in the DTSD object. The varvalue is defined as a list object, and subjects whose varname variables are equal to the values of varvalue will be selected. This function could be utilized for generating personalized survival path model for participant with certain features.

Results
In this section, we demonstrate the usage of SurvivalPath package through analyzing a dataset of patients with hepatocellular carcinoma (HCC).

Pre-processing of time-series data
The pre-processing of the time-series data could be accomplished manually or using the builtin functions. Regard the built-in function approach, first, by setting the parameters on periods The new dataset returned in the form of time slices with one columns added: "time_slice". Usually, the number of rows in the converted dataset will be less than the original dataset after screening out records of ineligible time points. After finishing converting the dataset, we can check the amount of data at each time slice.
R> table(dataset$time_slice As several variables included in building survival paths were continuous variable, here we use a build-in argument "ifclassifydata = TRUE" to convert these variables into dichotomies, which facilitates automatic identification of cutoff through time-dependent ROC curve estimation. To run this function, survivalROC package is required. The function returns a DTSD class object for conducting survivalpath() function. The object includes 4 list objects (time, status, tsdata, tsid), which correspond to the associated data of each time slice. The object have two parameters (length and ts_size) in describing the number of time slices and sample size of each time slice. The list "cutoff" stores the calculated cutoffs of continuous variables that have been dichotomized.

Construct a survival path model using main function
After finishing the preparation of data, we can input the data into the main function, namely the survivalpath(). The investigator need to set the number of time slices when constructing the survival paths. The investigator could set p.value for feature selection process, the default value is 0.05. The investigator could also set the value of correlation coefficient to screen out noncollinear candidate variables, whose default value is 0.7. As we had already prepared this variable, we included it in our demonstration case. To view 1year OS rates on the graph, we set the rates of 365.
R> result <survivalpath(DTSDdata,time_slices = 8, p.value = 0.01, degreeofcorrelation = 0.5) The main function will return a dataset that indicate the steps each participant take at each time slice, a time-slices dataset with node annotation and a treedata that could be used for visualization the survival paths.
R>  Here we get the graph of survival path mapping, which includes all the significant bifurcations identified (Fig 3).

Survival Analysis and correlation analysis based on survival path
The built-in functions plotKM(), compareTreatmentPlans() and EvolutionAfterTreatment() are utilized for data analysis in survival paths. Once we finish the construction of survival paths model, we can generate survival curves comparing survival of patients in any of the nodes. Here we compare the survival curves of no. 24 and no. 36 nodes as a demonstration (Fig 4A).

R> treepoints = c(24,36) R> plotKM(result$data, treepoints,mytree,risk.table = T)
We can also draw survival curves for any of the nodes to investigate the correlation between treatment/exposure and survival (Fig 4B).
R> The results showed that in node no.24, 46.5% patients transit to node no.25, 11.0% transit to node no.31 and 42.4% patients lost follow-up in the next time slice without embolization treatment; while 61.4% patients transit to node no.25, 14.4% transit to node no.31 and 24.4% patients lost follow-up in the next time slice after embolization treatment.

Export data with node annotation
The time slice data with annotation of node information is stored in the results data after computation by the main function and could be exported for further usage and analysis. R> write.csv(result$df, file = "output.csv", row.names = FALSE)

Availability and future directions
In this study, we have introduced the SurvivalPath R package. The package facilitates fast generation of personalized designed survival path models, which could be used for dynamic prognosis prediction. Survival paths can be computed using the main function survivalpath(), under a user-defined parameters on covariates, significance level, minimum bifurcation sample size and number of time slices for analysis. The tree-structure result can be visualized as a tree diagram using ggtree R package, with important parameters annotated. We also designed built-in functions plotKM(), compareTreatmentPlans() and EvolutionAfterTreatment() to facilitate comparison of survival between nodes, evaluate exposure/treatment and survival, and explore the connection between exposure/treatment and node transitions, respectively. Moreover, function for picking up patient subgroup with specific features within the survival path model was created, which can be used to generate refined survival paths for further exploration analysis.
In future work, we will keep on developing the survivalpath() function that will facilitate node bifurcation through risk assessment that integrated multiple prognostic factors. Besides, we will keep on optimizing the visualization of survival paths through drawing 3-D graph. Lastly, we will set up cloud database for users to upload time-series data of cancer patients, which aims to be open to all users.