Expression based biomarkers and models to classify early and late stage samples of Papillary Thyroid Carcinoma

In this study, we describe the key transcripts and machine learning models developed for classifying the early and late stage samples of Papillary Thyroid Cancer (PTC), using transcripts’ expression data from The Cancer Genome Atlas (TCGA). First, we rank all the transcripts on the basis of area under receiver operating characteristic curve, (AUROC) value to discriminate the early and late stage, based on an expression threshold. With the expression of a single transcript DCN, we can classify the stage samples with a 68.5% accuracy and AUROC of 0.66. Then we implemented various combination of multiple gene panels, selected using various gold standard feature selection techniques. The model based on the expression of 36 multiple transcripts (protein coding and non-coding) selected using SVC-L1 achieves the maximum accuracy of 74.51% with AUROC of 0.75 on independent validation dataset with balanced sensitivity and specificity. Further, these signatures also performed well on external microarray data obtained from GEO, predicting nearly 70% (12 samples out of 17 samples) early stage samples correctly. Further, multiclass model, classifying the normal, early and late stage samples achieves the accuracy of 75.43% with AUROC of 0.80 on independent validation dataset. With correlation analysis, we found that transcripts with maximum change in correlation of their expression in both the stages are significantly enriched in neuroactive ligand receptor interaction pathway. We also propose a panel of five protein coding transcripts, which on the basis of their expression, can segregate cancer and normal samples with 97.32% accuracy and AUROC of 0.99 on independent validation dataset. All the models and dataset used in this study are available from the web server CancerTSP (http://webs.iiitd.edu.in/raghava/cancertsp/).


24
In this study, we describe the key transcripts and machine learning models developed

Other datasets 149
In addition to stage classification, we also developed model for discriminating cancer 150 patients and normal tissues. This dataset comprises of 500 cancer samples and 58 151 normal samples. Similarly, we also developed models for predicting normal, early and 152 late stage. The dataset for multiclass classification comprises of 58 normal, 333 early 153 stage and 167 late stage samples. These datasets were further subdivided into training 154 and independent validation dataset, where training dataset contained 80% samples and 155 independent validation dataset contained 20% samples. 156

Data processing 157
The FPKM values were log2 transformed after addition of 1.0 as a constant number to 158 each of the value. After that, we removed features with low variance (less than 0.25) by 159 employing caret package in R, followed by z-score normalization of data. The equations 160 (1) and (2) were used for log transformation and normalization of data, respectively. 161 = 2 ( + 1) (1) 162 In equation (2), Zscore is the normalized scaled and centered score, x is the log-164 transformed transcript expression, ̅ is the mean of transcript's expression in the 165 training dataset, and s is the standard deviation of a transcript in the training dataset. 166 The log2 transformed independent validation data is z normalized by taking mean and 167 standard deviation of training features. 168 Features filtering using threshold-based models 169 In this study, we employed a simple expression threshold based approach similar to our 170 previous study (Bhalla et al., 2017) to develop threshold-based models. This method is 171 based on the fact that few transcripts are differentially expressed in early stage as 172 compared to the late stage. In this approach, for every transcript, we designated a threshold, which determines whether a sample is in the early or late stage of the cancer. 174 The threshold is selected by iterating from the minimum to maximum expression of that 175 transcript across all the patients. The threshold which gives maximum AUROC of 176 classification between early and late stage sample is reported. If the mean expression of 177 transcript is greater in early stage than late stage and for a given sample, its log2 FPKM 178 value is higher than the given threshold, then we assign that sample as early otherwise as 179 Thirdly, for the model which performed best in comparison to other models, we applied 195 two more feature selection methods. One is wrapper approach for feature selection and 196 other is SVC with L1 penalty (Baraniuk, 2007). In wrapper based approach, human The models developed using 80% data on the best parameters obtained using grid 222 search were used to evaluate independent 20% dataset which was not used for feature 223 selection and training. 224

External validation 225
Further to validate the models on the external validation dataset, first the TCGA data 226 was log2 normalized. Then, the GSE48953 expression data was quantile normalized 227 with reference to TCGA training dataset (target set) using FSQN package (Franks et 228 al., 2018). 229

230
In present study, performance of different models was measured by employing 231 threshold-dependent and threshold-independent parameters. In case of threshold-232 dependent parameters, sensitivity (Sen), specificity (Spc), overall accuracy (Acc (%)) 233 and Matthews correlation coefficient (MCC) was calculated by using equations (3), (4), 234 We also calculated a threshold-independent parameter called AUROC, which is 244 computed from receiver operating characteristic (ROC) plot in this study. The ROC curve 245 is produced by plotting true positive rate against the false positive rate at different 246 thresholds. Lastly, we calculated the area under ROC curve to compute a single parameter 247 from this curve called AUROC in the current study. We used this AUROC value for 248 optimizing and measuring the performance of our models. 249 In addition, to ascertain the reliability of prediction, we calculated PPV (Positive 250 Predictive Value) and NPV (Negative Predictive Value) at various thresholds using 251 probability score obtained by employing SVM. 252 253

Multiclass classification 254
Multiclass classification is implemented using WEKA algorithm using 255 WEKA.classifiers.meta.MultiClassClassifier with random forest as classifier. This 256 method is capable for handling multi-class datasets with 2-class classifiers. This classifier 257 also applies error adjusting output codes for improved accuracy.   Table S3) 296 which points out these genes have also been previously implicated in many cancers. 297 The 166 protein coding transcripts are enriched in many pathways of KEGG database 298 such as Focal adhesion pathway (5% genes, adjusted p-value= 4.0e-5), PI3K-Akt 299 signaling pathway (3.5% genes, adjusted p-value=0.001), Proteoglycans in cancer (3% 300 genes, adjusted p-value=0.007). Focal adhesion kinase has been already shown to be 301  Figure S3). Six long non-coding RNAs are found whose AUROC score 310 is greater than equal to 0.6. Thus, literature and enrichment analysis shows that these 311 prioritized genes are involved in various cancer progression related processes and thus 312 can be potential biomarkers of stage classification. 313

Stage classification model using multiple RNA transcripts 314
As shown in above section and in Supplementary Table S2, individual 179 RNA  315 transcripts (THCA-EL-AUROC) have limited ability to classify early and late stage 316 samples with AUROC more than 0.66. Therefore, to develop a model, that can classify stage of samples with high accuracy, we used expression of multiple RNA transcripts. 318 The models based on different machine learning techniques were developed for stage 319 classification using THCA-EL-AUROC (179 RNA transcripts) as features. As shown 320 in Supplementary Table S4, we achieved accuracy of 69.35% with AUROC of 0.72 on 321 training data and accuracy of 67.65% on independent validation data with AUROC of 322 0.70. The models developed using THCA-EL-AUROC only improved performance 323 marginally from AUROC from 0.66 to 0.70. This means rank-based features are still 324 not sufficient for developing prediction models. 325

Protein coding transcripts 326
As from the previous results, it is evident that protein coding transcripts were major 327 type of transcripts in THCA-EL-AUROC signature, therefore in this analysis we 328  Table 1) 335 using 37 features obtained using SVM. There was marginal increase in the performance 336 accuracy but the number of features was reduced to reasonable extent as compared to 337 THCA-EL-AUROC. The performance using other algorithms like random forest, 338 Naïve Bayes, SMO and J48 is also shown in Table 1.   Table S14). 393 As the above three feature selection methods and prediction models gave maximum 394 performance using both protein coding and non-coding transcripts, as compared to 395 protein coding and cancer hallmark protein coding transcripts, we employed another 396 feature selection method which selected features using SVC with L1 penalty (see Further, to ascertain its capabilities, we calculated PPV and NPV on various thresholds 403 of SVM probability score (Table 3). On training data, at the SVM score, greater than 404 0.9, 161 early stage samples are correctly predicted out of total 170 samples predicted 405 as early stage samples (PPV=94.71%). In case of late stage samples, 60 out of 64 late 406 stage predicted samples are correct (NPV=93.75%). In case of independent validation 407 data, the PPV is 85.71% and the NPV is 66.67% (Table 3). This shows that at SVM 408 score of 0.9, there is high probability of correct positive (early stage) and negative (late 409 stage) prediction. At threshold of 0.7, at which we presented other performance 410 measures in Table 2, the PPV for training data is 93.03% and NPV is 94.51% whereas 411 in case of independent validation the PPV is 83.87% and NPV is 63.64% (Table 3). 412 We also applied various other state-of-the-art machine learning methods on THCA-EL-413 SVC-L1, but SVM performed best out of all (Supplementary Table S15). One of the 414 advantage of this method is that it resulted a more balanced sensitivity and specificity  (Table 4). 445 An interesting observation was seen when these 211 transcripts were analyzed in 446 STRING. As we added 5 indirect iterations to the network, many of the protein coding 447 transcripts in our signature were interacting with PCNA (Supplementary Figure S5). 448 PCNA has been associated with fatal outcomes (Basolo et al., 1993). The genes in our 449 signature directly interacting with PCNA points out to their importance in acting as 450 stage specific biomarkers. 451

Correlation Disturbance in Early and late stages of cancer 452
To further elucidate the expression of various transcripts in early versus late stage, the 453  Table S18) were segregated for subsequent analysis based on the 458 assumption that this large change or disturbance in correlation is due to diseased 459 condition. These were the gene pairs whose correlation has been disturbed drastically 460 in early stage as compared to late stage and vice versa. From this analysis, we obtained 461 two types of transcripts i.e. one which had low correlation in early stage patients but 462 high correlation in late stage patients (L_pairs) and others had high correlation in early than 0.6. There were total 453 pairs which had correlation of 0.60 or higher in early 467 samples and their correlation difference between early and late samples is at least 0.7.  (Table 5).

Web Server Implementation 561
We established a web server, CancerTSP (Thyroid cancer stage prediction), that 562 implements models established in the present study for investigation and estimation of 563 cancer stage from the transcripts' expression data. CancerTSP consists of three key 564 modules; first for prediction whether sample is cancerous or normal, second is for predicting whether it is in early or late stage cancer, and third is for the analysis of 566 transcripts' expression data. 567 Prediction module allows the users to predict whether the sample is cancerous or not 568 and also predicts the stage of the cancer using FPKM values. The user needs to provide 569 transcript expression (FPKM values) of potential biomarker genes for every patient.