Increasing prediction accuracy of pathogenic staging by sample augmentation with a GAN

Accurate prediction of cancer stage is important in that it enables more appropriate treatment for patients with cancer. Many measures and methods have been proposed for more accurate prediction of cancer stage, but recently machine learning, especially deep learning-based methods, have been receiving increasing attention, mostly owing to their good prediction accuracy in many applications. Machine learning methods can be applied to high-throughput DNA mutation or RNA expression data to predict cancer stage. However, because the number of genes or markers generally exceeds 10,000, a considerable number of data samples is required to guarantee high prediction accuracy. To solve this problem of a small number of clinical samples, we used Generative Adversarial Networks (GANs) to augment the samples. Because GANs are not effective with whole genes, we first selected significant genes using DNA mutation data and random forest feature ranking. Next, RNA expression data for the selected genes were expanded using GANs. We compared the classification accuracies using the original dataset and expanded datasets generated by the proposed and existing methods, using random forest, Deep Neural Networks (DNNs), and 1-Dimensional Convolutional Neural Networks (1DCNN). When using the 1DCNN, the F1 score of GAN5 (a 5-fold increase in data) was improved by 39% in relation to the original data. Moreover, the results using only 30% of the data were better than those using all of the data. Our attempt is the first to use a GAN for augmentation using numeric data for both DNA and RNA. The augmented datasets obtained using the proposed method demonstrated significantly increased classification accuracy in most cases. By using a GAN and 1DCNN in the prediction of cancer stage, we confirmed that good results can be obtained even with small numbers of samples, and it is expected that much of the cost and time required to obtain clinical samples can be reduced.
The proposed sample augmentation method could also be applied for other purposes, such as prognostic prediction or cancer classification.


Introduction
Correct prediction of cancer stage is beneficial because it can help medical doctors determine more appropriate treatment for patients with cancer. For example, such information can help determine the type of surgery to perform, or whether chemotherapy or radiation therapy is required. Numerous measures and methods have been proposed for accurate prediction of cancer stage, and one of the most widely used is the Tumor, Node, and Metastasis (TNM) staging system developed by the American Joint Committee on Cancer (AJCC). TNM is a clinically useful staging system for cancers of almost every anatomic site and histology. From the 7th edition of the AJCC Cancer Staging Manual to the most recent 8th edition, few changes may be observed with respect to some cancers [1,2], but in other cancer types, such as lung, gastric, and breast cancer [3][4][5][6], numerous changes are present in the criteria for prediction of cancer stage. These changes in the criteria may cause confusion in patient treatment.
Recently, alternative methods to predict cancer stage with additional clinical information or genomic information have been proposed. These methods, for the most part, adopt machine learning techniques to increase prediction accuracy. The machine learning methods used include Random Forest (RF) [7,8], Support Vector Machine (SVM) [9], Naïve Bayes (NB) [9,10], J48 Decision Tree [11], Logistic Regression [10,11], Neural Network (NN) [12], and Neuro-Fuzzy Model [13]. In many cases, these methods showed better performance than the TNM staging system. For example, the Neuro-Fuzzy computational intelligence model [13] classified the pathological stage of patients with prostate cancer using data from The Cancer Genome Atlas (TCGA) [14], and compared these results with results using the AJCC pTNM (Pathological Tumor-Node-Metastasis) Staging Nomogram, as well as other machine learning methods such as Artificial Neural Network (ANN) or SVM, and found fewer false positives than the number achieved with AJCC or other machine learning models.
However, most of these studies applied machine learning methods to a relatively small number of samples, although machine learning methods generally require a substantial number of samples to ensure high predictive power. To overcome this limitation of small sample size, many sample augmentation methods have been developed. The Synthetic Minority Oversampling Technique (SMOTE) [15,16] was primarily developed to oversample a small number of samples, and has additionally shown its ability to convert highly imbalanced data into balanced data. Since 2012, deep learning has been applied in many fields, and the Denoising Autoencoder (DA) [17] has been applied to solve the problem of insufficient training samples by expanding small gene expression datasets. Generative Adversarial Networks (GANs) [18] can be used to generate synthetic samples. GANs and their variations are widely used to synthesize images, but they can also be used to generate tabular numerical data, such as medical or educational records. TableGAN [19] showed that fake tables statistically similar to the original can be synthesized with GANs, using four real-world datasets from four different domains, to address the privacy problems that arise when sharing or delivering data to the public or to partners. Tabular GAN (TGAN) [20] applies Long Short-Term Memory (LSTM) with attention to generate data column by column, using tabular datasets with mixed variable types.
In this study, we also used GANs to oversample a small number of mRNA expression samples. GANs are difficult to use for data with a small sample size, especially when the number of features (genes) exceeds 10,000. To solve this problem, we first selected 300-800 genes, depending on cancer type, using DNA mutation data and RF. We synthesized the expression profiles of the selected genes by applying GANs to gene expression of twelve cancer types, including STAD (Stomach adenocarcinoma), BRCA (Breast invasive carcinoma), HNSC (Head and Neck squamous cell carcinoma), KIRC (Kidney renal clear cell carcinoma), KIRP (Kidney renal papillary cell carcinoma), LUAD (Lung adenocarcinoma), THCA (Thyroid carcinoma), READ (Rectal adenocarcinoma), ESCA (Esophageal carcinoma), KICH (Kidney chromophobe), LIHC (Liver hepatocellular carcinoma), and LUSC (Lung squamous cell carcinoma), from the TCGA database [14]. We then classified the cancer stage of the augmented data using three classification methods. Comparisons with the original data and with augmented data obtained using existing sample augmentation methods confirmed that the prediction accuracy of cancer stage was significantly improved. This paper is organized as follows. In the Materials and Methods Section, we first describe the data used for the experiment, the selected features, and the normalization algorithm. Then, the sample augmentation method using a GAN and the three classification algorithms are described. In the Results Section, we describe the characteristics of the augmented samples, and compare the effects of the five known algorithms and the four GAN series that we implemented. We also verify whether our method is effective for small samples, and evaluate the importance of the selected genes. In the Discussion Section, we compare the selection criteria of our experiment with the results of other groups, and mention various fields in which our method could be applied.
• We use feature selection based on DNA mutation data and GAN for augmentation of mRNA expression data to increase the accuracy of our cancer-stage classification.
• The augmented datasets obtained using the proposed method demonstrate significant increase in the classification accuracy.
• By using GAN and 1DCNN in the prediction of cancer stage, good results are obtained even with a small number of samples.

Data preparation and feature selection
We downloaded mRNA and DNA mutation data from the TCGA database [14] for twelve cancer types, STAD, BRCA, HNSC, KIRC, KIRP, LUAD, THCA, READ, ESCA, KICH, LIHC, and LUSC, which have at least twelve samples for each of the four stages. From the downloaded data, we selected only samples whose DNA and RNA IDs matched and for which stage information exists. Specific information regarding the data is provided in Table 1.
As the feature space is too large relative to the number of samples for training the proposed model, we selected the most important features (= genes) for each dataset. An RF classifier [7,21], which showed the best performance, was used to rank genes using DNA mutation data. Through iterative experiments, we set the p-value threshold to 0.004. The number of selected features is shown in Table 1, and the list of genes is provided in S1 Table. Finally, matched mRNA data with the selected genes were normalized using ComBat [22] to correct batch effects.
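As an illustrative sketch of this ranking step, scikit-learn's impurity-based feature importances can stand in for the exact procedure used; the p-value filtering described above is omitted, and the function and parameter names below are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_genes(mutations, stages, n_top=300):
    """Rank genes by random-forest importance on binary DNA mutation calls
    and keep the top n_top (300-800 genes per cancer type were kept in this
    study, after an additional p-value threshold of 0.004)."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(mutations, stages)
    # Indices of genes, most important first
    order = np.argsort(rf.feature_importances_)[::-1]
    return order[:n_top]
```

The selected gene indices are then used to subset the matched mRNA expression matrix before normalization.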

Sample augmentation and classification algorithm
A Generative Adversarial Network (GAN) is composed of a generator and a discriminator, which are trained in parallel. Typically, the generator learns to map from a latent space to a data distribution of interest, while the discriminator distinguishes candidates produced by the generator from the true data distribution.
In this study, we used GANs to augment mRNA samples. When images are generated using GANs, random values are input to the generator. In our case, random values from a normal distribution with the mean and standard deviation of the training mRNA data are fed into the generator. The training data are 70% of the entire dataset, selected at random. We used one hidden layer with 256 neurons for both the generator and the discriminator, with reference to a previous study [23]; the discriminator judges the synthesized data and real data as real or fake, and the two networks are trained repeatedly. The number of epochs varies from 900 to 1,100 depending on the cancer type.
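A minimal NumPy sketch of this adversarial setup is given below. The single 256-unit hidden layer and the latent distribution drawn from the training-data statistics follow the text; the optimizer (plain SGD), learning rate, and initialization are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

class TwoLayerNet:
    """Single-hidden-layer network (256 units), used for both the generator
    and the discriminator; trained here with plain SGD (an assumption)."""
    def __init__(self, n_in, n_out, hidden=256, lr=1e-3):
        self.W1 = rng.normal(0, np.sqrt(2.0 / n_in), (n_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, np.sqrt(2.0 / hidden), (hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.x, self.h = x, relu(x @ self.W1 + self.b1)
        return self.h @ self.W2 + self.b2

    def backward(self, g_out, update=True):
        # g_out = dLoss/d(output logits); returns dLoss/d(input)
        g_h = (g_out @ self.W2.T) * (self.h > 0.0)
        g_in = g_h @ self.W1.T
        if update:
            n = len(self.x)
            self.W2 -= self.lr * (self.h.T @ g_out) / n
            self.b2 -= self.lr * g_out.sum(0) / n
            self.W1 -= self.lr * (self.x.T @ g_h) / n
            self.b1 -= self.lr * g_h.sum(0) / n
        return g_in

def train_gan(X, epochs=200, batch=32):
    """Adversarial training: latent vectors are drawn from a normal
    distribution with the mean/std of the training mRNA data, per the text."""
    n, d = X.shape
    mu, sd = X.mean(0), X.std(0) + 1e-8
    G, D = TwoLayerNet(d, d), TwoLayerNet(d, 1)
    for _ in range(epochs):
        real = X[rng.choice(n, batch)]
        z = rng.normal(mu, sd, (batch, d))
        fake = G.forward(z)
        # Discriminator step: binary cross-entropy, real = 1 / fake = 0
        D.backward(sigmoid(D.forward(real)) - 1.0)
        D.backward(sigmoid(D.forward(fake)) - 0.0)
        # Generator step: push D(fake) toward 1, gradient flows through a
        # frozen discriminator (update=False)
        fake = G.forward(z)
        g_fake = D.backward(sigmoid(D.forward(fake)) - 1.0, update=False)
        G.backward(g_fake)
    return G, (mu, sd)
```

The returned mean and standard deviation are kept so that the same latent distribution can be reused in the Generating Step.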
After the generator is trained, we generate n (= number of training samples) samples (GAN1), n × 5 samples (GAN5), n × 20 samples (GAN20), and n × 100 samples (GAN100) using the trained generator, with the latent space generated from the mean and standard deviation values that were used to train the generator. The mean and standard deviation used to construct the latent space in the Training Step are stored in a global variable and passed as arguments to the Generating Step. The stage ratio is preserved in the augmented samples, which are then used as training data for classification of cancer stage.
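The Generating Step can be sketched as follows. The `generator` argument is a stand-in for the trained generator, and drawing latent statistics per stage (rather than globally) is an assumption made here so that the stage ratio of the training set is preserved:

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(X, y, generator, fold=5):
    """Generate fold * n synthetic samples with a trained generator,
    preserving the stage ratio of the training set (fold=5 gives GAN5).
    Per-stage latent statistics are an illustrative assumption."""
    X_new, y_new = [], []
    for stage in np.unique(y):
        Xs = X[y == stage]
        mu, sd = Xs.mean(0), Xs.std(0) + 1e-8
        z = rng.normal(mu, sd, (fold * len(Xs), X.shape[1]))
        X_new.append(generator(z))
        y_new.append(np.full(len(z), stage))
    return np.vstack(X_new), np.concatenate(y_new)
```

With fold=5, a stage split of 12/8 training samples yields 60/40 augmented samples, keeping the same ratio.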
We used three types of classifiers: 1DCNN [24], DNNs, and RF [7]. The 1DCNN was originally proposed to process 1-dimensional spectral channels. The 1DCNN used in this study consists of two convolution layers, with 20 and 40 filters of kernel size 5 for the first and second convolution layers, respectively. For both layers, the pool size is two and ReLU is used as the activation function. After the convolution step, the output is flattened and fed into a hidden layer of size 64. The activation function is ReLU, the optimizer is Adam, the batch size is 32, and the number of epochs is 1,000. For the DNNs, we used three hidden layers of sizes 64, 32, and 4, with ReLU activation for the hidden layers and Softmax for the final layer; Adam is used as the optimizer. For RF, we used the RandomForestClassifier module of scikit-learn (version 0.23.2) in Python (version 3.5.2). The number of trees in the forest (n_estimators) is 100, oob_score (whether to use out-of-bag samples to estimate the generalization accuracy) is true, and random_state (the random seed) is 123456. We tried varying n_estimators (70, 100, and 130), and adopted 100 according to S3 Table. Finally, these classifiers were evaluated using the remaining 30% of the entire sample. The steps described above form one cycle, and are illustrated in (Fig 1).
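Of the three classifiers, the RF configuration is fully specified above and can be written directly with scikit-learn; the toy matrices below are hypothetical stand-ins for the augmented training set and the 30% hold-out set (the 1DCNN and DNN would additionally require a deep learning framework):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Configuration as stated in the text
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=123456)

# Hypothetical stand-ins: 200 augmented training samples over 50 selected
# genes with stage labels 1-4, and a 30-sample hold-out set
X_train = rng.normal(size=(200, 50))
y_train = rng.integers(1, 5, 200)
rf.fit(X_train, y_train)
pred = rf.predict(rng.normal(size=(30, 50)))
```

Setting oob_score=True gives an out-of-bag estimate of generalization accuracy without touching the hold-out split.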

Characteristics of augmented samples
As described in detail in the Methods, we augmented samples by constructing GANs composed of a Training Step and a Generating Step (as shown in Fig 1). These augmented samples were used to train three classifiers, and the remaining 30% of the original data were classified using these classifiers. To characterize the augmented samples and to confirm that they can be used effectively for cancer stage classification, we performed principal component analysis (PCA) on the original and augmented datasets.
The first column of (Fig 2) shows PCA plots of the original dataset for eight cancer types, in which the stages are not distinguished. For the GAN1 data, however, the stages are clearly distinguished. These results imply that the augmented samples have distinct characteristics for each stage. The differences in the augmented samples are not the result of changes in gene expression patterns, however, as the distribution of gene expression is not very different between the original and augmented data, as shown in the third column of (Fig 2).
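The PCA projection used for these plots can be sketched with scikit-learn (the synthetic matrix below is a hypothetical stand-in for an expression matrix of samples by selected genes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def first_two_components(X):
    """Project samples onto the first two principal components, as used
    for the stage-colored scatter plots."""
    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)
    return scores, pca.explained_variance_ratio_

# Hypothetical stand-in: 100 samples x 300 selected genes
X = rng.normal(size=(100, 300))
scores, evr = first_two_components(X)
```

Plotting `scores` colored by stage label reproduces the kind of separation check shown in (Fig 2).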
Features of the FS data are selected from DNA mutation data using the RF classifier, and are the same as those used to create GAN1, GAN5, GAN20, and GAN100. MS consists of samples randomly generated using the mean and half the standard deviation of the training samples of each stage. SMOTE data are generated using the basic algorithm of SMOTE [16].
SMOTE was proposed to handle imbalanced data. For example, if SMOTE is run using 657 (110/383/152/12) training samples of BRCA, it generates 1,532 (383/383/383/383) samples. DA data are generated using a Denoising Autoencoder [17]. DA uses the denoising method to extract features that capture useful structure in the input distribution and eventually generate gene expression data. Given n samples and m features, DA generates n × floor(m / 5) + n samples, where floor(x) returns the largest integer not greater than x. For example, breast cancer has 659 training samples and 19,738 features, so 2,601,732 samples are generated. In (Fig 3), we can see that GAN1, GAN5, GAN20, and GAN100 show increased accuracy over the compared datasets. S2 Table shows that most of the p-values from t-tests between GAN and comparison results are < 0.05. In particular, all GAN5 datasets showed significantly increased accuracy, and most GAN20 datasets showed good accuracy.
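For reference, the basic SMOTE algorithm interpolates between a minority-class sample and one of its k nearest same-class neighbours until every class reaches the majority size. A simplified sketch (not the exact implementation of [16]) is:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def smote_balance(X, y, k=5):
    """Oversample every minority class to the majority size by interpolating
    between a sample and a random one of its k nearest same-class neighbours.
    Assumes each class has at least two samples."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_out, y_out = [X], [y]
    for c, n_c in zip(classes, counts):
        need = n_max - n_c
        if need == 0:
            continue
        Xc = X[y == c]
        nn = NearestNeighbors(n_neighbors=min(k + 1, n_c)).fit(Xc)
        _, idx = nn.kneighbors(Xc)
        base = rng.integers(0, n_c, need)
        # column 0 of idx is the sample itself, so draw from columns 1..k
        nb = idx[base, rng.integers(1, idx.shape[1], need)]
        lam = rng.random((need, 1))
        X_out.append(Xc[base] + lam * (Xc[nb] - Xc[base]))
        y_out.append(np.full(need, c))
    return np.vstack(X_out), np.concatenate(y_out)
```

Run on the BRCA split above (110/383/152/12), this kind of balancing yields 383 samples per stage, i.e. 1,532 in total.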
We can also see that the accuracies of FS increased by up to 9% compared to Ori, and the error bars narrowed except in the case of KIRP. In particular, the accuracy was 0.48 with the 19,738 gene features in BRCA, but increased to 0.57 using the 359 selected features. These results show the effect of gene selection using DNA mutation data.
Next, we compared three classifiers, 1DCNN, DNN, and RF. Tables 2-13 show the accuracy and F1 score for each dataset and for each cancer type. Tables 2-13 also show that GAN1, GAN5, GAN20, and GAN100 demonstrate better predictive performance, regardless of classifier. Overall, 1DCNN and DNN showed good results and RF showed a poor F1 score.
Next, we examined whether the proposed sample augmentation method is effective for datasets with small samples. We used whole samples and randomly selected 50%, 30%, and 10% of samples from the BRCA, LUAD, and KIRC datasets, and applied 1DCNN. The results are shown as 100O, 50O, 30O, and 10O in (Fig 4). We next expanded the sampled datasets 5 times (GAN5) and applied 1DCNN. The results are shown as 100G, 50G, 30G, and 10G in (Fig 4). We can see that reducing the number of samples lowers the classification accuracy; however, accuracies are much higher when samples are augmented. More importantly, we can see that the decrease in accuracy is generally smaller when samples are augmented. These results imply that the proposed method is effective for small datasets. Lastly, we performed experiments to determine the optimal fold for sample augmentation. We compared classification accuracies from samples augmented by 1, 5, 10, 20, 30, 50, 70, and 100 fold. The results are shown in (Fig 5). In general, we can conclude that the optimal folds differ for different cancer types; however, we can observe that 5 fold (GAN5) demonstrates generally good results.
Selected genes also overlap with genes in the Online Mendelian Inheritance in Man (OMIM) database. We can see that the overlapping percentages for KEGG are the largest in general, which means that a significant number of the selected genes are important genes involved in pathways. The PI(3)K/AKT/MTOR pathway (altered in 28% of tumors) has been shown to be important in KIRC in papers published by TCGA, and the genes in S1 Table match the PI3K-AKT pathway with p-value 0.026. The list contains most of the upstream genes of the AKT pathway, for example PIK3CA, PTEN, Receptor Tyrosine Kinase (RTK)-related genes (EPHB, PDGFR), and Integrin Subunit (ITG)-related genes (ITGA7, ITGA9, ITGA11, ITGB1BP, LABA, LAMB, THBS).
A Warburg effect-like state achieved through downregulation of AMP-activated kinase (AMPK) and upregulation of acetyl-CoA carboxylase (ACC) has also been shown to be important in cancers. Among the genes in S1 Table,

Discussion
We noted that both GAN5 and GAN20 show good results, in that their error bars in (Fig 3) are generally narrower than those of GAN1 for most carcinomas. This observation indirectly demonstrates that increasing the number of samples leads to increased classification accuracy. In addition, Tables 2-13 confirm that the 1DCNN classification method was excellent in both accuracy and F1 score. In Jian Liu's paper [17], Sample Expansion-Based 1DCNN (SE1DCNN), a method of obtaining a large number of samples through multiple, partially corrupted inputs, improved accuracy by 1-9% compared to using only a 1DCNN. In addition, the Sample Expansion-Based SAE (SESAE) method improved accuracy by 2-17% compared to using only a Stacked Autoencoder (SAE). This confirms that combining a good sample augmentation method with a good classification model yields better performance, and the development of such combined models is always worthwhile.
The optimal number of samples differs for different cancer types, as observed in (Fig 5). Our model used one hidden layer with 256 neurons, which is the most suitable size for an imbalanced dataset according to a previous study [23]. However, the remaining five options (256/512/102, 256/512, 128/256/512, 128/256, and 128) require further study. In addition, optimization of the hyperparameters used in our GAN model (such as the learning rate, number of epochs, cost function, and hidden layer units) needs additional work.
In addition to the DNA mutation data used for feature selection in this study, various combinations of more omics data such as mRNA, DNA methylation, and miRNA data can be used to further increase the classification accuracy. Application of those data combinations will be the focus of our follow-up work. Moreover, various recently developed deep generative models such as DCGAN, cycleGAN, and Variational Autoencoder, could be explored for more accurate classification, which could be our future study.

Conclusions
In this paper, we proposed a sample augmentation method using GANs, and showed that the augmented samples significantly increased the classification accuracy of cancer stages. In particular, we confirmed that the proposed method is effective for datasets with a small number of samples. Therefore, the proposed sample augmentation method could also be applied for other purposes, such as prognostic prediction or cancer classification.

• Advantages
• The proposed method can generate additional data samples more accurately, which can increase the accuracy of cancer-stage prediction.
• The proposed method can generally be applied to other types of mRNA expression data whose aim differs from cancer-stage prediction.
• Disadvantages
• If the number of features is large, the learning time is significantly slower than with other machine learning approaches such as random forest or gradient boosting.
Supporting information S1