Maximizing the Reusability of Public Gene Expression Data by Predicting Missing Metadata

Reusability is part of the FAIR data principle, which aims to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we develop a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We propose a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically-designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we show that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.


I. Introduction
Currently, large volumes of high-throughput genomic data are being generated in biomedical research every day by laboratories in both academia and industry. For example, as of May 23, 2018, the Gene Expression Omnibus (GEO) database [1] consisted of a total of 2,498,466 samples in 98,354 series generated by 18,519 different experimental platforms. Many federally-funded projects and initiatives are also generating unprecedented volumes of genomic data, such as The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/), the Genotype-Tissue Expression (GTEx) project [2], and the ENCyclopedia Of DNA Elements (ENCODE) project [3]. The availability of such biomedical Big Data has allowed new scientific discoveries to be made by mining and analyzing these data. As high-throughput datasets are rich in information on cellular events, the same dataset can be reanalyzed alone or together with other data to address important questions that were previously unstudied or infeasible due to limited availability of data [4][5][6].
However, the majority of public genomic data do not contain enough metadata, which severely limits their reusability. Overcoming this limited reusability forms part of the FAIR (Findable, Accessible, Interoperable and Reusable) data principle [7]. For example, to understand the heterogeneity of breast cancer and to develop personalized treatment for breast cancer [4,8], the biomarker information that determines the subtypes of breast cancer samples, such as estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status, is normally required. Information on race is also necessary to study the racial disparity of breast cancer [5,6,9]. We analyzed the available breast cancer gene expression data generated by platform GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array) in the GEO database and found 29,631 samples, of which ~85% do not have ER information, ~88% do not have HER2 information, and ~95% do not have race information. This high percentage of missing metadata severely limits what can be studied using these data. Accurate prediction of the missing metadata would substantially increase the reusability of existing genomic data.
There are two major strategies to predict missing metadata for existing genomic data. First, the traditional approach is to fill missing data by imputation, which uses the information of the same variable from other subjects in the same dataset to infer the missing data using relatively simple statistics [10]. For example, to fill the missing value of variable i for subject j, one can use the mean value of variable i from all the other subjects (or from subjects similar to subject j based on other variables). Second, modern approaches based on machine learning are now widely used for imputing missing data in the biomedical sciences, with better performance than the traditional approach [10][11][12][13][14].
However, if our goal is to recover the missing metadata and use such information to perform additional analysis, several issues need to be considered. First, we will likely want to exclude the data for which our prediction may not be accurate enough, since the error will be transferred to downstream analysis. One may prefer the accuracy of the prediction to be above a certain threshold. Second, as genomic data are being generated by many different platforms, transferring models generated from one platform to other platforms is of concern. Third, the existing metrics for evaluating machine learning methods, such as area under the curve (AUC), F1-score, precision, recall, and accuracy, are not ideal for such tasks, as they aim to optimize the accuracy for whole datasets. A better objective would be to recover as much data as possible with accuracy above the preferred threshold. Here accuracy can be defined by any proper metric, such as AUC, F1-score, etc. We believe this objective is better suited to maximizing the reusability of public gene expression data when predicting missing metadata.
In this study, we investigate the above issues to infer missing metadata using multiple gene expression datasets with a wide variety of machine learning methods. We evaluate their performance on a few representative clinical variables to assess the accuracy level that current machine learning methods can achieve. We found that models built using data generated by rank normalization are likely more transferable than those built using other commonly used normalization methods, such as global normalization and quantile normalization [35,36]. We introduce a new performance measure, called Proportion of Cases Accurately Predicted (PCAP), to evaluate the effectiveness of methods in recovering missing metadata. PCAP is the percentage of data that can be recovered given a desired level of accuracy, where the actual accuracy measure (i.e., overall accuracy, precision, F1-score, etc.) can be defined by the researcher. PCAP90 and PCAP95 stand for the proportions of data that can be recovered with an accuracy of 90% and 95%, respectively. Through this study, we propose a framework to select the optimal pipeline, which includes several components such as data processing, oversampling method, variable selection, machine learning model, and choice of performance measure, for recovering missing metadata by maximizing PCAP90 or PCAP95.

II. Materials and Methods

Data
Gene expression data from both sequencing and microarray platforms were used in this study. The clinical variables considered include race, ER, PR, and HER2 status, and treatment response, one possible value of which is residual disease (RD). Figure 1 shows the percentages of each possible value for the clinical variables used in this study. We assigned binary values to each clinical variable, labeling the value with fewer cases as +1 and the value with more cases as -1 (or 0, depending on the convention used in a machine learning algorithm). For example, for race, we labeled AA as +1 and CA as -1.

Normalization for gene expression
Three normalization methods were used in this study: reads per million (RPM) normalization for next generation sequencing (NGS) data, quantile normalization for microarray data, and rank normalization (RN) for both NGS and microarray data. For RN, each gene expression value was replaced by its rank, ranging from 1 (lowest) to n (highest), within a patient. If there was a tie between two or more values, the average of their ranks was used. The ranks were rescaled by dividing by the sum of all ranks and multiplying by 10^6.
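The rank-normalization step described above can be sketched as follows (a minimal NumPy sketch, not the authors' code; the function name is illustrative):

```python
import numpy as np

def rank_normalize(expr):
    """Rank-normalize one sample's expression vector.

    Each value is replaced by its rank (1 = lowest); tied values
    receive the average of their ranks. Ranks are then rescaled by
    dividing by the sum of all ranks and multiplying by 1e6.
    """
    expr = np.asarray(expr, dtype=float)
    n = expr.size
    order = np.argsort(expr, kind="stable")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    # average the ranks of tied values
    for v in np.unique(expr):
        mask = expr == v
        if mask.sum() > 1:
            ranks[mask] = ranks[mask].mean()
    total = n * (n + 1) / 2  # sum of ranks 1..n (unchanged by tie averaging)
    return ranks / total * 1e6
```

After this transformation every sample sums to 10^6 regardless of the platform's measurement scale, which is what makes the representation comparable across platforms.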

Oversampling methods
There are many oversampling methods based on different assumptions and goals [37][38][39]. We chose the Synthetic Minority Oversampling Technique (SMOTE) [37] for our study. Rather than oversampling with replacement, SMOTE increases the number of minority-class cases by creating synthetic samples, generated from the information in the existing minority samples. With the minority and synthetic samples together, classifiers build larger and more general decision regions, which tend to give more accurate predictions. SMOTE is available in the Imbalanced-learn package in Python [40].
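The core interpolation idea behind SMOTE can be sketched as follows (a simplified NumPy illustration, not the Imbalanced-learn implementation; all names are illustrative):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours (the central idea of SMOTE, simplified)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self as a neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                # pick a minority sample
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

In practice one would call `imblearn.over_sampling.SMOTE` as the text notes; the sketch only shows why the synthetic points fall between existing minority samples rather than duplicating them.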

Models
We define a model as a combination of a machine learning algorithm, number of genes selected, oversampling choice, and data normalization choice. The machine learning algorithms used in this study include XGBoost [41], random forest (RF) [42], support vector machines (SVM) [43], and LASSO [44]. Tuning parameters were optimized over 10 sets of random combinations using 3-fold cross-validation. To select the most predictive genes, a Welch two-sample t-test was first conducted to find differentially expressed genes between the two classes at the significance level of 0.1, where the two classes correspond to the two binary values of the variable to be predicted. Next, we used recursive feature elimination (RFE) [45] with 3-fold cross-validation to select either an optimal set of genes, restricted to at most 100, or a predefined number (10 or 25) of genes.
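The first filtering step, ranking genes by a Welch two-sample t statistic, can be sketched as follows (a SciPy-free NumPy sketch that ranks genes by |t| rather than thresholding p-values at 0.1, as the paper does; function names are illustrative):

```python
import numpy as np

def welch_t(X_pos, X_neg):
    """Welch two-sample t statistic for each gene (columns = genes).

    Welch's version does not assume equal variances between the two
    classes: t = (m1 - m2) / sqrt(v1/n1 + v2/n2).
    """
    m1, m2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    v1, v2 = X_pos.var(axis=0, ddof=1), X_neg.var(axis=0, ddof=1)
    n1, n2 = len(X_pos), len(X_neg)
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

def top_genes(X, y, n_keep=25):
    """Indices of the n_keep genes with the largest absolute t statistic."""
    t = welch_t(X[y == 1], X[y != 1])
    return np.argsort(-np.abs(t))[:n_keep]
```

The surviving genes would then be passed to a wrapper such as `sklearn.feature_selection.RFE` for the recursive elimination step.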

Evaluation measures and optimization procedure
To evaluate the prediction performance of each model, three measures were used: area under the receiver operating characteristic curve (AUROC), F1-score, and our proposed new measure, PCAP. In this study, PCAPx stands for the percentage of cases that can be predicted with x% precision in a model. Performance measures other than precision, such as accuracy, recall, or F1-score, can also be used in practice. PCAPx is calculated from a 10-fold cross-validation, each fold using 90% of the data for training and 10% for testing. Within each of the 10 folds, a k-fold (k=10 in this study) cross-validation was further performed, splitting the training data into a smaller training set and a validation set in each inner fold. The predicted probabilities for the validation set were first sorted. Each percentile of the predicted probability (from 50 to 99) was used as a cutoff to calculate the precision and recall in the validation set. If a percentile resulted in a precision greater than or equal to x%, then it was labeled as 1 (otherwise 0). The smallest percentile with the most 1s (up to k) among the k folds was selected as the cutoff, and the corresponding average recall was recorded. After the 10-fold cross-validation, the average of the percentiles selected in each k-fold cross-validation was chosen as the cutoff value for the test dataset, and the average recall was used as the estimated PCAPx for the test dataset.

The 10-fold cross-validation in the training data helps to find the cutoff that results in the desired precision. The returned recall value is the PCAPx, an estimate of the recall in the unseen test data given x% precision. In this study, we use PCAP90 for the assessment of the different models.

Another benefit of performing cross-validation in the training data is model selection. Given various models, we select the model with the best median PCAP, instead of the maximum PCAP, to achieve a more robust estimate of the average PCAP value. The selected model is most likely to generalize well when recovering missing metadata in future datasets.
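The per-fold cutoff search described above can be sketched as follows (a simplified single-validation-set illustration of the percentile scan, not the authors' full nested cross-validation; names are illustrative):

```python
import numpy as np

def pcap_cutoff(probs, labels, x=0.90):
    """Scan probability percentiles 50-99 on a validation set and return
    (percentile, recall) for the smallest percentile whose cutoff gives
    precision >= x. The recall at that cutoff is this fold's PCAP_x
    estimate. Returns (None, 0.0) if no percentile reaches precision x."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n_pos = (labels == 1).sum()
    for pct in range(50, 100):
        cut = np.percentile(probs, pct)
        called = probs >= cut                       # cases kept at this cutoff
        precision = (labels[called] == 1).mean()
        if precision >= x:
            recall = (labels[called] == 1).sum() / n_pos
            return pct, recall
    return None, 0.0
```

In the paper's pipeline, the cutoffs selected across the inner folds are averaged before being applied to the held-out test data; this sketch shows only a single scan.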

III. Results

Predictive performance in sequencing data
The performance of models was tested by stratified 10-fold cross validation.

Predictive performance in microarray data
The medians of AUROC values across 10 folds were computed under each model. For some of the variables, it is possible to get comparable performance using a smaller number of genes, as shown in Tables S4a and S4b. Again, the results show that there is no single model that performs the best for all four variables.

Effect of rank normalization
Gene expression data are generated by many different experimental platforms. Currently, the Gene Expression Omnibus (GEO) database [46] hosts 15 major platforms measuring gene expression (mRNA values), each with more than 10,000 samples, together with hundreds of minor platforms with smaller numbers of samples. Using commonly used normalization methods, such as quantile and global normalization, models developed using data from one platform may not be generalizable to data produced by other platforms [47].
Developing different models for data generated by different platforms may not be an ideal solution for this problem. Rank normalization (RN) is a robust normalization method since only normalized rank information is kept in the normalization process. Most normalization methods do not change the rank (relative order) of the gene expression values within a sample. Here we investigate whether RN can be used as a robust normalization method for building machine learning models, which will yield better transferability of the resulting models.
To that end, we subtracted the F1-score of either RPM (sequencing data) or quantile normalization (microarray data) from the F1-score of RN in each fold and took the average of the differences across folds. As can be seen from Figure 2, RN helps to improve the performance of LASSO and SVM in sequencing data while giving comparable results in the other cases. Predictions using rank-normalized gene expression values achieved higher F1-scores than those using RPM normalization (Figure 2 and Table S1). It is interesting to note that RN has comparable or better performance than the commonly used normalization methods (RPM for sequencing data and quantile normalization for microarray data), even though it loses some valuable information during the normalization process.

Table 3 shows the predictive performance in terms of PCAP90, as described in Section II. Numbers are the average PCAP90 over 10-fold cross-validation, where in each fold the model was selected by our model selection pipeline. All models were trained by maximizing either the F1-score or PCAP90. As shown in Table 3, models trained by optimizing PCAP90 obtained higher PCAP90 values than those optimizing F1-score, in both sequencing and microarray data.

Using predicted information in statistical inference
The goal of predicting missing metadata is to use the corresponding gene expression data in other analyses. In this section, we investigate the usefulness of the predicted data in downstream analyses. A very common task in gene expression data analysis is differential gene expression analysis (DGEA). We performed DGEA between Caucasian (CA) and African American (AA) breast cancer patients based on the race information provided in TCGA and the race information predicted using our method. In total, we have 2,221 patient tissue samples from patients who are either CA (1,994) or AA (227), which were divided into ten folds with approximately equal numbers of samples. The training data (Dataset1) comprises 8 folds, the test data (Dataset2) 1 fold, and the last fold was treated as a newly collected dataset (Dataset3, more details later). We first performed DGEA in Dataset1 by randomly sampling 1 fold of data (with the same CA:AA ratio) to generate a list of differentially expressed genes (DEGs), called DEG1. We then conducted DGEA for Dataset2 using the true race information to obtain another list of DEGs, called DEG2_t. The third DGEA was performed on Dataset2 again using predicted race, where the predictive model was built using Dataset1; this gave DEG2_p. The fourth DGEA was done using Dataset3 to produce DEG3. In this comparison, we assume that we can collect some new gene expression data with race information and perform the same DGEA. The last DGEA was done by first randomly permuting the race labels in Dataset2 to generate DEG2_r. We computed the number of overlapping DEGs between DEG1 and DEG2_t, between DEG1 and DEG2_p, between DEG1 and DEG3, and between DEG1 and DEG2_r. The overlap computation was done iteratively for each of the ten folds, and then the average number of overlapping DEGs was computed. We repeated the process 50 times and drew box-plots of the average number of overlapping DEGs in Figure 3.
The box-plots and significance tests showed that, in terms of the number of overlapping DEGs, there is no significant difference between using true race information and using predicted race. There is also no significant difference between using predicted race and using the true race of newly collected data. This indicates that datasets with metadata predicted using our method can be used reliably in other analyses.
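The overlap computation can be sketched as follows (a minimal illustration that ranks DEGs by absolute Welch t statistic instead of running a full DGEA pipeline with p-value thresholds; names are illustrative):

```python
import numpy as np

def deg_overlap(X, labels_a, labels_b, n_top=100):
    """Count overlapping top-n_top DEGs (ranked by |Welch t|) between
    two labelings (e.g. true vs. predicted race) of the same
    expression matrix X (rows = samples, columns = genes)."""
    def top_degs(labels):
        g1, g2 = X[labels == 1], X[labels == 0]
        # Welch t statistic per gene (unequal-variance two-sample test)
        t = (g1.mean(axis=0) - g2.mean(axis=0)) / np.sqrt(
            g1.var(axis=0, ddof=1) / len(g1)
            + g2.var(axis=0, ddof=1) / len(g2))
        return set(np.argsort(-np.abs(t))[:n_top])
    return len(top_degs(labels_a) & top_degs(labels_b))
```

Accurately predicted labels should give an overlap close to that obtained with the true labels, while permuted labels should give a much smaller overlap, mirroring the Figure 3 comparison.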

IV. Discussion
Genomics data are not useful if necessary metadata are not available. Unfortunately, a large amount of such important metadata is missing in public genomics datasets. In this study, we proposed a framework for maximizing the reusability of public gene expression data by predicting the missing metadata using machine learning methods. To develop and validate the framework, we used microarray and sequencing gene expression data with a total of 3,833 cancer patient tissue samples to investigate whether we can predict missing metadata, such as race, ER, PR, HER2 and treatment response. Our study has shown that gene expression profiles can be used to predict metadata accurately. Out of over 20,000 genes, we can select small numbers of genes to obtain reliable predictions. For those variables for which reliable predictions cannot be achieved for all the missing metadata, we can select a subset of reliable predictions using our pipeline and a new measure designed for maximizing the reusability of public gene expression data. We found that different variables require different methods and parameter settings to achieve optimal performance. This is consistent with the well-known notion that no single method is the best for all kinds of machine learning tasks.
In addition to machine learning algorithms, normalization methods can have a substantial effect on the performance as well. In this study, we found that the robust rank normalization (RN) [48,49] can produce better or comparable performance than commonly used normalization methods (RPM for sequencing data and quantile normalization for microarray data). Predictions from rank-normalized sequencing data resulted in higher F 1 -scores. In microarray data, RN gave comparable performances. RN is less sensitive to experimental noise and outliers. It also allows researchers to conveniently combine several datasets generated from different platforms. Our result indicates that RN can be a good choice when building machine learning models either for predicting metadata to maximize the reusability of public gene expression data or for general model building purposes using gene expression data.
Successful statistical models depend on the quality of the data used to build them. The accuracy of the values predicted for the missing metadata needs to be carefully evaluated to ensure the quality of the data used in downstream applications. While traditional evaluation metrics can be used, they are not ideal, because they aim for accurate predictions on whole datasets, and the accuracy for the whole dataset may not reach the desired threshold. When researchers reuse public genomics data, they do not need to use all the data; they can use only the part for which missing metadata can be reliably predicted. Here, a reasonable metric to optimize would be, given a certain accuracy threshold, the proportion of data that can be predicted above that threshold. The higher this proportion, the more data will be available for reuse at the desired accuracy.
With the above reasoning, we proposed the metric Proportion of Cases Accurately Predicted (PCAP) for the purpose of maximizing reusability of public gene expression data. In addition, we proposed a selection pipeline to select a model from various combinations of algorithms, normalization methods, and data balancing procedures. Our results showed that we were able to recover a high percentage of samples with the desired accuracy. It is also recommended that one should maximize PCAP, instead of traditional performance measures for the whole dataset, when building models to obtain higher percentages of usable samples.
We also demonstrated the effectiveness of the predicted metadata in downstream inference tasks. In the study, we performed differential gene expression analysis (DGEA) using predicted race and found that the effectiveness of the analysis using predicted metadata is similar to that using true metadata. This demonstrated that our framework for maximizing the reusability of gene expression data can be reliably used in the future by other researchers.
There is a possibility that the subset of the data that can be predicted with high accuracy is systematically different in some way from the whole dataset. In such cases, the conclusions one can draw will be limited to that subset. In real clinical settings, for example in biomarker discovery, one can use the predictive model to stratify the patient population and apply the discoveries (e.g., identified diagnostic biomarkers) only to that population.
It is worth noting that the metadata we have used as true information may contain some noise. For example, it is well acknowledged that self-reported race/ethnicity has high inaccuracy levels [50,51]. These inherent errors limit the upper bound of the accuracy we can achieve.

Figure 3. Box plot of the overlap of differentially expressed genes (DEGs) detected by differential gene expression analyses using five different datasets. The numbers are averages of ten runs. True: DEGs obtained using the true race information in the original TCGA data; PCAP95: DEGs obtained using a dataset where race is predicted by optimizing PCAP95; PCAP90: DEGs obtained using a dataset where race is predicted by optimizing PCAP90; CollectNew: DEGs obtained by collecting a new dataset with known race information (this dataset is part of the TCGA data and was not used in the other analyses); Random: DEGs obtained by randomly permuting the race assignments among the patients in the dataset.