Batch adjustment by reference alignment (BARA): Improved prediction performance in biological test sets with batch effects

Many biological data acquisition platforms suffer from inadvertent inclusion of biologically irrelevant variance in analyzed data, collectively termed batch effects. Batch effects can lead to difficulties in downstream analysis by lowering the power to detect biologically interesting differences and can in certain instances lead to false discoveries. They are especially troublesome in predictive modelling where samples in training sets and test sets are often completely correlated with batches. In this article, we present BARA, a normalization method for adjusting batch effects in predictive modelling. BARA utilizes a few reference samples to adjust for batch effects in a compressed data space spanned by the training set. We evaluate BARA using a collection of publicly available datasets and three different prediction models, and compare its performance to already existing methods developed for similar purposes. The results show that data normalized with BARA generates high and consistent prediction performances. Further, they suggest that BARA produces reliable performances independent of the examined classifiers. We therefore conclude that BARA has great potential to facilitate the development of predictive assays where test sets and training sets are correlated with batch.


Introduction
Data acquisition techniques designed to quantify biological signals from gene-or protein expression are often associated with batch effects. The problem with batch effects is that it leads to the inclusion of biologically irrelevant variance in the obtained data, which can lower the power of subsequent analyses or lead to false discoveries [1][2][3][4]. The variance may be due to a variety of different experimental parameters, including analysis date, sample processing or reagent quality [5]. Further, batch effects are not exclusive to high throughput acquisition methods but are also observed in data from lower throughput methods such as qPCR or Nano-String nCounter technologies [6,7]. The high incidence of batch effects in multiple biological platforms is a contributing factor to the relatively small number of diagnostic and prognostic biomarker signatures that have been implemented in clinical settings [6,8].
Some actions have been shown to reduce the impact of batch effects. One such action is to carefully design experiments to minimize the correlation between possible sources of technical variance and known biological factors. However, this action is not possible for all types of experiments. For predictive modelling, for example, the correlation between biological factors and batches cannot be eliminated. This is due to the inherent nature of these experiments, where fixed training sets are often used to infer parameter values used to predict subsequently acquired test sets. This leads to total confoundment between batches and samples in the test sets, which can result in poor predictive performances on test sets [9]. Another option to reduce the impact of batch effects is to apply analytical methods on already obtained data. Many such methods have been designed, but most require prior knowledge of the biological factors of interest and low confoundment between batches and the biological groups [10][11][12].
Examples of such methods are ComBat and surrogate variable analysis (SVA). ComBat is a supervised batch correction method that requires that the sources of batch effects are known.
It is a location and scale method that uses the empirical Bayes method to moderate the batch effect estimates, making it better equipped to handle smaller datasets [10]. In contrast to Com-Bat and other supervised batch effect adjustment methods, SVA does not require that the sources of batch effects are known. Instead, the biological sources of interest should be known and specified in the model. The initial step of SVA estimates and removes the variance associated with the known biological information. Latent structures are then identified in the residual matrix, which can either be removed to generate a cleaned dataset or be incorporated in subsequent significant analyses. Identified latent structures can contain information linked to batch effects, but they can also contain other sources of expression heterogeneity, such as biological factors not included in the initial modeling [11]. Both SVA and ComBat were originally developed for datasets in discovery studies, where biological sources of interest and possible sources of batch effects are known. Because of this, they and other methods developed for similar purposes are not directly applicable to datasets generated in predictive settings. However, by making certain assumptions, the algorithms can be modified to be used in predictive modelling. For ComBat, one must assume that the composition of test sets is similar to that in the training set. But this assumption can be violated when, for example, the size of the test set decreases, as shown in [13]. For SVA, one can assume that latent structures identified in the training set can also be identified in test sets. This assumption was used to develop the frozen SVA algorithm [14]. However, this assumption is not valid if latent structures associated with batch effects are different in the training set and test sets. This can lead to poor predictive performances as shown in [13]. In general, for a normalization method to be applicable in a wide range of prediction problems, it should allow for training sets and test sets to be correlated with batch. Further, the training set should not be altered when normalizing with different test sets. Finally, it should ideally allow test sets to be acquired without the need to include a large amount of reference samples. In this paper, we introduce Batch Adjustment by Reference Alignment (BARA) to adjust for batch effects in predictive modelling. The method has the advantage that only a few reference samples are necessary to perform batch corrections. Also, rather than attempting to clean the data by removing batch effects from both training set and test sets, BARA aims to transform the test set to make it more similar to the training set. BARA performs the adjustment in a compressed data space spanned by the training set, thereby alleviating the number of necessary batch estimates that needs to be performed. We test the BARA method on a collection of 25 publicly available datasets and show that BARA consistently aids the classifier to achieve high prediction performances. We further show that the performance of BARA is better or comparable to the performance of existing methods on the examined datasets. By reducing the negative impact of batch effects, the prediction performances observed with BARA can facilitate the development of predictive assays.

Cross-study prediction evaluation
Cross-study prediction performances were used to evaluate the performance of BARA and to compare it to existing normalization methods. The normalization methods included in this Table 1. Datasets used in the cross-study analysis.

Accession Number Sample Size Reference
E-GEOD-19722 46 [30] E-GEOD-28654 112 [31] E-GEOD-29623 65 [32] E-GEOD-39084 70 [33] E-GEOD-45216 31 [34] E-GEOD-45670 38 [35] E-GEOD-46474 40 [36] E-GEOD-48278 57 [37] E-GEOD-48350 83 [38,39] E-GEOD-48780 49 [40] E-GEOD-49243 73 [41,42] E-GEOD-50774 21 [43] E-GEOD-53224 53 [44] E-GEOD-53890 41 [45] E-GEOD-54543 30 [46] E-GEOD-54837 226 [47] E-GEOD-58697 124 [48] E-GEOD-59312 79 [49] E-GEOD-60028 24 [50] E-GEOD-61804 325 [51] E-GEOD-63626 63 [52] E-GEOD-64415 209 [53] E-GEOD-64857 81 [54] E-GEOD-67851 31 [55] E-GEOD-68720 97 [56] The  [21], fSVA exact [14], mean centering, ratio A, reference centering, reference ratio A and standardization. The reference centering method subtracts the reference samples' mean expression of each gene from all samples in the training set and the test set respectively. Similarly, the reference ratio A method scales each gene and sample in the training set and the test set by their respective reference samples mean expression. To examine the predictive performance on normalized data, each of the 25 datasets was iteratively used as a temporary training set. First, 3 samples from the same biological group were randomly selected as reference samples in the training set. Next, using all samples in the training set, the 500 most significant differentially expressed genes, comparing males to females, were identified using limma [57,58]. Because the normalization methods had all been adjusted to be used in predictive modelling, as implemented in the bapred package [21] or through implementations in R, each method was first applied to the training set. Next, the transformed training set was used to define the prediction models. Three different prediction models were examined; k-nearest neighbors (kNN), random forest, and support vector machines (SVM). The prediction models were tuned using repeated cross-validation on the training set, with 3 repeats and 10 folds. The parameters resulting in the highest mean prediction performance, evaluated using Mathews Correlation Coefficient (MCC), were selected to establish the final prediction model. To allow for variation among the samples acting as reference sample, the test set normalization and prediction procedure was repeated 10 times for each test set, using a different selection of reference samples in each iteration. More specifically, when classifying the samples in each test set, 3 samples from the same group as the reference samples in the training set were randomly selected from the test set. The normalization methods that did not rely on reference samples used all samples in the test set, including the 3 reference samples, while the reference-based normalization methods only used the reference samples to normalize the data. Because information about the group of the reference samples could be considered being leaked during the normalization procedure, the reference samples were removed from the test sets before the predictions were made. The final prediction performance for each test set was calculated as the median MCC from the 10 iterations. To obtain an overall prediction performance for each training set, the MCCs of the 24 test sets were averaged. A summarized prediction score for each normalization and prediction model was calculated as the mean MCC from all the training sets.

Assessment of BARA's dependence on the number of reference samples
To assess the performance of the BARA algorithm as the number of reference samples was varied, the cross-study prediction approach described above was repeated. The performance estimation was repeated 6 times, where the number of utilized reference samples was varied from 1 sample to 6 samples. The predictive performances were summarized as described above.

The BARA algorithm
The BARA algorithm was created specifically for predictive modelling, where a fixed training set is used to classify test samples possibly affected by batch effects. The training set is used to identify a set of directions that captures the largest part of the variance in the data, using singular value decomposition (SVD). This step allows the data to be compressed into a lower dimensional space, which reduces the number of necessary batch estimates, and simultaneously decreases the complexity of the data. The training dataset, X, contains m samples in rows and n variables in columns. Each variable in X is centered by its mean value, and the matrix is decomposed using SVD to identify the directions where the batch adjustment is performed.
where � x represents a 1 � n dimensional vector containing the column means of X, U is the left singular vectors, S the singular values, and V the right singular vectors. The number of dimensions retained, k, is an adjustable parameter that can be set by using a predetermined value, estimated with for example cross-validation, or determined by setting an acceptable loss of variance in the training set, for example 10%. The centered training data is then multiplied by the first k columns of the right singular vector to obtain a transformed training set.
The test dataset, Z, contains p samples in rows and n variables. The variables are first adjusted by subtracting the mean values of the training data, and is then projected onto the identified directions.
A batch adjustment factor, a j , is estimated for all retained dimensions, using the reference samples present in both the training set and the test set. For example, the adjustment factor of dimension j is estimated by comparing the mean value of the reference samples in the training set for dimension j, to the mean value of the reference samples in the test set for dimension j.
Where � z ref and � x ref are 1 � k dimensional vectors containing the variable means for the reference samples in the transformed test set and the transformed training set respectively. The transformed test data is then adjusted by the adjustment factors and both datasets are reconstructed to the original data space.
To achieve a level of expression to what was originally observed, the mean values estimated from the training set are finally added to the reconstructed data.
The predictions are then performed as normal, where the model is built from the compressed training dataset, X k , and the predictions are made on the adjusted test set, Z k .

Normalization methods and parameter settings
The following methods were used through their implementations in the R-package bapred; ComBat, FAbatch, fSVA exact, mean centering, ratio A and standardization. For FAbatch, the default values of the parameters were used, i.e. the number of factors to estimate for each batch was left unspecified, the preliminary probabilities were estimated using leave-one-out crossvalidation, maximum number of iterations were 100, and the maximum number of factors were 12. For fSVA exact, the algorithm parameter was changed to correspond to the exact algorithm instead of the fast, while the default values were used for the remaining parameters. For the other methods implemented in the bapred package, no additional parameter values could be specified.
The two reference-based methods, reference mean centering and reference ratio A, where implemented in R. Reference mean centering subtracted each batch's genes by the mean expression of its reference samples, and reference ratio A scaled each batch's genes expression by the mean expression of the reference samples.
BARA was implemented by specifying the loss parameter as a criterion for selecting the number of dimensions to retain. The loss parameter was set to 10%. Thus, at most 10% of the variance in the training data was lost in the normalization.

Prediction models
Three types of prediction models were used to assess the performance of the normalization methods in the cross-study analysis. The prediction models were; random forest, kNN and SVM with linear kernel. The prediction models were implemented using the R-packages ran-domForest, class and e1071 [18,22,26]. The prediction models were selected to include both linear and non-linear classifiers. For every training set, the prediction models were tuned to maximize the MCC using repeated cross-validation with 3 repeats and 10 folds. For the respective prediction model, the following parameters and parameter values were tuned:

Results
To assess the performance of BARA and to compare it to existing normalization methods, cross-study predictions were examined. Because the acquired datasets originated from separate studies, the biological annotation used for classification was sex/gender. Fig 1 shows a PCA plot of the 25 datasets, which shows clear signs of batch effects.
To assess the different normalization methods, each dataset was iteratively used as training set to define a prediction model. Batch effects between the training sets and the test sets were adjusted with the examined normalization methods. Normalized test sets were classified with the trained prediction models and MCCs were calculated to estimate the prediction performances. The resulting MCCs for the normalized data and for the unnormalized data on the 3 different prediction models are shown in Figs 2-4. Further, the mean performance for each normalization method and prediction model can be seen in Table 2. Figs 2-4 show that data normalized with BARA seems to generate consistent performances independent of the examined prediction model. Further, the estimated MCCs are high with low variance. In fact, considering the calculated performance scores in Table 2, BARA achieves the highest mean MCC compared to the other examined normalization methods. Lastly, BARA also shows an improved prediction performance compared to the unnormalized data.
Because all information used to adjust for batch effects in the BARA method is estimated from the reference samples, the performance of the method as the number of reference samples was varied was examined. The number of reference samples was varied from 1 sample to 6 samples, and the cross-study prediction approach described above was repeated. Again, the performances were summarized by calculating the mean MCC for each run. The performance for each repeat can be seen in Fig 5. The plot suggests that the performance of BARA in the examined datasets is robust to the number of reference samples utilized. However, it is also evident that the run with a single reference sample achieve the lowest performance metric. Batch adjustment method improves prediction performance in biological test sets with batch effects

Discussion
Batch effects are a widespread problem that exist in most biological data acquisition platforms and hinder the development and implementation of promising biomarker signatures [1][2][3][4][5].
Batch effects are especially difficult to account for in predictive modelling where the biological factors in the test set are completely confounded with batch. In this paper, we have introduced BARA, a normalization method for the adjustment of batch effects in predictive modelling, and compared it to already existing methods. The aim of normalization methods applied in settings of predictive modelling is to make prediction models and inferred information transferable to test sets affected by batch effects. Common strategies suggest using reference samples to estimate the perturbation induced by batch effects [9]. However, including multiple additional samples in each batch of test samples can be both time consuming and add analytical costs. Therefore, normalization methods should ideally require none or few reference samples.
Only a few reference samples are required by BARA; three were used in the comparison described above and BARA was also found to achieve robust performance metrics when only a single reference sample was used. However, the optimal number of reference samples could generally be assumed to be based on a trade-off between costs and accuracy. Because the reference samples are used to estimate mean adjustments for every batch in a compressed data space, additional reference samples should provide more certain estimates of the mean values. This was also observed when the number of reference samples used by BARA was varied, see Fig 5. Because mean values are estimated, it is also expected that the ideal number of reference samples will not be the same for different data acquisition techniques and datasets but will depend on, for example, the variation between replicates. Further, even though random assignment of reference samples does not represent an ideal selection strategy, the performance of BARA could be considered stable as indicated by the low variance in the performance metrics, see Figs 2-4. In an ideal scenario, the reference samples should represent a standardized sample where the major difference compared to other reference samples are batch effects. This could lead to better estimates with a lower number of required reference samples.
BARA estimates the batch adjustments in a compressed data space spanned by the training set. The compression is calculated with SVD and is thus a linear combination of the original variables along the directions of maximum variance. SVD was chosen because we hypothesized that it would be suitable for many datasets associated with predictive modelling where biomarker signatures have been identified to optimize prediction performances, which suggests that low-dimensional representations of the data exist that captures large fractions of the important variance. Further, SVD is a well-known operation that offers a convenient way of compressing data, performing batch adjustment and reconstructing the original variables. Because SVD is used in BARA, multiple levels of compression can be selected for a dataset depending on the number of directions that are retained. In the cross-study analyses described  Batch adjustment method improves prediction performance in biological test sets with batch effects above, a maximum of 10% of the variance in the training set was considered an acceptable loss, and the smallest number of dimensions satisfying this condition was selected in the normalization steps. However, other approaches for selecting an optimal number of dimensions to retain in a specific dataset could be pursued. For example, cross-validation performances could be compared as the number of dimensions retained are varied, or an external validation set affected by batch effects could aid the selection by better mimicking an actual prediction scenario. The ability of BARA to restore the predictive performances in datasets suffering from batch effects was assessed by cross-study predictions, and the obtained performances were compared to those obtained using existing normalization methods or no normalization. Due to the difficulty in procuring public datasets containing training sets suitable for predictive modelling where external test sets affected by batch effects exist, a collection of datasets previously compiled [13] were used, where sex/gender was used as classification label. Successively, each of the 25 datasets in the collection were designated as a training set. From the training set, the 500 most significant genes after comparing the gene expression of females to males were identified, and prediction models were defined using kNN, random forest and SVM. Batch effects between the training sets and the test sets were removed with the examined normalization methods, and the samples in the test set were classified. Figs 2-4 shows the predictive performances after applying the examined normalization methods. The figures show that BARA produces high and consistent performances for the examined datasets independent of prediction model. The results further suggest, that BARA outperforms the other examined normalization methods on the examined datasets, by reaching the highest mean MCC values and consistently show low variation in performance. Further, the performance is also improved compared to using no normalization, which indicates that the BARA algorithm mitigates some of the negative effects caused by batch effects in the studied datasets. It is also worth noting that genes associated with sex/gender are strong predictors, and the performance of the unnormalized data was often higher than those obtained by some of the normalization methods. In fact, BARA was the only normalization method that consistently resulted in improved mean prediction performance compared to the unnormalized data in all three prediction models.
In conclusion, we have introduced a novel method to adjust for batch effects in predictive modelling and compared it to already existing methods. We show that BARA improves the prediction performances in the examined datasets compared to applying no normalization. Further, the BARA-normalized datasets achieved higher or comparable prediction performances compared to datasets normalized with the other examined methods. These results suggest that BARA can be considered a useful tool to reduce the negative impact of batch effect in predictive modelling.