Single-Patient Molecular Testing with NanoString nCounter Data Using a Reference-Based Strategy for Batch Effect Correction

A major weakness in many high-throughput genomic studies is the lack of consideration of a clinical environment where one patient at a time must be evaluated. We examined generalizable and platform-specific sources of variation in NanoString gene expression data from both ovarian cancer and Hodgkin lymphoma patients. A reference-based strategy, applicable to single-patient molecular testing, is proposed for batch effect correction. The proposed protocol improved performance in an established Hodgkin lymphoma classifier, reducing batch-to-batch misclassification while retaining accuracy and precision. We suggest this strategy may facilitate development of NanoString and similar molecular assays by accelerating prospective validation and clinical uptake of relevant diagnostics.


Introduction
The use of molecular technologies in clinical assays for guiding patient diagnosis, prognosis and management is paving the way for precision medicine. Gene expression in particular has been widely used in biomedical research to identify biomarkers and genetic profiles that enable disease diagnosis, (sub)classification, and prediction of prognosis and response to therapy [1][2][3][4].

NanoString nCounter Data
The NanoString technology is based on single-molecule imaging of color-coded barcodes bound to target-specific probes [8,10]. A CodeSet is a unique, single production-run, multi-assay mix of all probes of interest; it includes six positive controls, spiked in at fixed, proportional concentrations (0.125-128 fM), and eight negative controls (probes without a corresponding target) used to assess background and non-specific binding. "Housekeeping genes" (HK) are an obligate component in the design of each CodeSet. Similar to quantitative RT-PCR, HK are expected to remain constant between biological conditions of interest and are used to control for the amount of RNA in a given reaction. A standard cartridge holds 12 lanes, allowing for 12 samples to be processed in a given run.
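For illustration, a minimal sketch of the content of a single lane is given below. The field names, gene symbols and counts are hypothetical and do not reflect the nCounter output file format; the positive-control concentration series is shown as the standard 0.125-128 fM titration described above.

```python
# Hypothetical layout of one nCounter lane: six positive controls spiked in at
# fixed concentrations spanning 0.125-128 fM, eight negative controls (no
# target), housekeeping genes, and endogenous targets. Values are invented.

positive_controls = {   # probe -> (spike-in concentration in fM, raw counts)
    "POS_A": (128.0, 24000),
    "POS_B": (32.0, 6100),
    "POS_C": (8.0, 1500),
    "POS_D": (2.0, 380),
    "POS_E": (0.5, 110),
    "POS_F": (0.125, 35),
}

negative_controls = {f"NEG_{c}": 12 for c in "ABCDEFGH"}     # background counts

housekeeping = {"ACTB": 5200, "CLTC": 1800, "RPLP0": 4100}   # HL CodeSet HK

endogenous = {"GENE_X": 900, "GENE_Y": 45, "GENE_Z": 3300}   # targets of interest
```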

Cohort Description
Datasets obtained from different NanoString experiments included 161 clinical specimens from two cancer types: Hodgkin lymphoma (HL) and ovarian cancer (OC). For the HL cohort, data were derived from Scott et al. [26], wherein a subset of samples had replicate data generated on a second CodeSet (n = 32) and an additional subset had data generated on a third CodeSet (n = 10). OC specimens (n = 129) were obtained through the OVCARE tissue bank and the Ovarian Tumour Tissue Analysis (OTTA) consortium and were also run on two CodeSets. Ovarian cancer cell lines (OVCL) [28] were also run in duplicate across CodeSets (n = 13). The last two cohorts were DNA oligonucleotides, complementary to target RNA in the HL (HLO) and the OC (OVO) CodeSets, run at different concentrations for a total of 203 runs. A detailed description of individual cohorts is given in Table 1. All specimens were collected through hospital-based research studies with approval of local research ethics and/or institutional review boards. The BCCA/UBC research ethics board further approved generation of gene expression data for investigation of tumour biology, including development of RNA-based classifiers.

Nucleic acid preparation
RNA extraction was standardized within each research group (either Hodgkin or ovarian). All samples were reviewed for the presence of tumour. Ovarian frozen specimens were cryosectioned (5-40 sections at 20 μm, depending on face size), tissue was dissociated by vortexing in Trizol reagent, and RNA was then extracted using the Qiagen miRNeasy protocol (Qiagen) as per the manufacturer's recommendations. FFPE tissues were sectioned (3 sections at 10 μm) into microfuge tubes. OTTA-sourced specimens were collected by scraping 2-4 sections of 5 μm off glass slides. FFPE specimens were de-paraffinized and processed according to the Qiagen miRNeasy FFPE kit (Qiagen) with an extended 55°C digestion (45 min). Lymphoma specimens, all FFPE derived, were processed similarly with the exception of deparaffinization using Qiagen deparaffinization solution as described previously [26].
For oligonucleotide experiments, standard desalted, single-stranded, 100-base DNA oligonucleotides were obtained (IDT), and a single pool of DNA oligonucleotides was generated by mixing all oligonucleotides at equimolar concentration, corresponding to the sum of targets in both ovarian CodeSets. As a result, each CodeSet targeted only a subset of the total pool. We refer to the "functional concentration" as the concentration of the subset of the oligonucleotide pool that is targeted in a given CodeSet hybridization. In the case of the Hodgkin lymphoma CodeSets, an oligonucleotide pool corresponding only to the prognostic genes [26] was used.

NanoString RNA expression data
The manufacturer's protocol was adhered to with the exceptions noted below. Briefly, capture probes, reporter probes, and specimen total RNA (or DNA oligonucleotides) were mixed with hybridization buffer and hybridized at 65°C overnight. Samples, wash reagents and the imaging cartridge were then processed on the nCounter Prep Station and finally imaged on the nCounter Digital Analyzer. FFPE-derived specimens used 200-500 ng of input total RNA, whereas frozen specimens used only the recommended 100 ng. Hybridization times ranged from 12-23 hours and were strictly monitored in selected cohort experiments. See Table 1 for further details.

Statistical Methods
Raw data were assessed using several quality assurance (QA) metrics to measure imaging quality, oversaturation and overall signal-to-noise. All samples satisfying QA metric checks were log-transformed (base 2) to help with distributional assumptions, and were normalized by subtracting the average expression level of HK genes (equivalent to geometric mean normalization on the raw scale). Principal Components Analysis (PCA) and Principal Variance Components Analysis (PVCA) were used to assess sources of variability in the normalized data and to measure their impact. The minimal set of principal components for PVCA [29] was selected to retain at least 60% of the explained variability. We compared intra-CodeSet to inter-CodeSet variability using the Dispersion Separability Criterion (DSC) [30], a ratio of between- to within-batch dispersion computed from batch scatter matrices; a high DSC (over 0.5) indicates large inter-batch variability. Significance of the DSC was assessed using a permutation test: under the null hypothesis of no batch effects (i.e., the data are homogeneous with respect to batches), five thousand permutations were generated from the full data (including all batches), a DSC value was computed for each permutation, and the proportion of permuted values greater than the observed DSC gives the p-value (significance level 0.05).

A series of multi-sample methods were applied to both the OC and HL data: batch mean centering (bmc) [11], mean and variance standardization (zscore) [11], a two-stage location/scale correction (ber) [11], and a parametric empirical Bayes approach (combat) [14]. In addition, we considered different types of references and compared their performance to the multi-sample methods. HL samples were calibrated using the HLO data (HLO_ref) and three clinical HL samples chosen at random (any3_ref). Similarly, OC samples were adjusted using the OVO data (OVO_ref), five samples from each carcinoma histotype [31] chosen at random (any5_ref), and the OVCL data (OVCL_ref; averaging over all OC cell lines). PVCA plots were used to compare the variability post-adjustment to the unadjusted data. Furthermore, gene-wise agreement between CodeSets was compared using Pearson's correlation coefficient (R) for precision, the coefficient of accuracy (C_a) to assess systematic bias, and Lin's concordance correlation [32] to capture both precision and accuracy simultaneously (R_c = R × C_a). The degree of BE removal was evaluated using the DSC.
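As an illustration of the agreement metrics, a minimal sketch of the decomposition R_c = R × C_a is shown below; it follows Lin's published formulas rather than any code from this study, and the function name and simulated inputs are illustrative. Here x and y would be the normalized log2 expression of one gene measured on the same samples in two CodeSets.

```python
import numpy as np

def concordance_metrics(x, y):
    """Precision (Pearson R), accuracy (C_a) and Lin's concordance (R_c = R * C_a)
    for one gene measured on the same samples in two CodeSets."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sx, sy = x.std(ddof=0), y.std(ddof=0)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    r = sxy / (sx * sy)                                         # precision
    rc = 2 * sxy / (sx**2 + sy**2 + (x.mean() - y.mean())**2)   # Lin's concordance
    ca = rc / r                                                 # accuracy (bias correction factor)
    return r, ca, rc

# Simulated replicate measurements of one gene: CodeSet 2 carries a systematic shift,
# so precision stays high while accuracy (C_a) drops.
rng = np.random.default_rng(0)
a = rng.normal(5, 1, 30)
b = a + 0.4 + rng.normal(0, 0.1, 30)
print(concordance_metrics(a, b))
```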
Adjusting for batch effects using reference samples involves running designated reference samples in every batch alongside actual samples, measuring the same set of genes. Assume we have two batches (A and B) and are interested in calibrating samples $X^B$ run in batch B to samples $X^A$ run in batch A; the reference approach requires that some number of reference samples (R) be run in both batches A and B, resulting in $R^A$ and $R^B$. These reference samples are used to calibrate between the two batches. Mathematically, we assumed the following standard additive (on the log scale) batch effect model on the normalized and log (base 2) transformed NanoString data [11]:

$$X^A_{ij} = X^0_{ij} + b^A_{ij} + \varepsilon^A_{ij},$$

where $X^A_{ij}$ is the observed mRNA expression of gene $i$ in sample $j$ run in batch A, $X^0_{ij}$ is the true gene expression, $b^A_{ij}$ is the batch effect corresponding to batch A, and $\varepsilon^A_{ij}$ is a random error assumed to be independent with expected value equal to 0. Let $R^A$ be a set of designated reference samples of size $k$, run alongside actual samples in batch A; we can then write

$$R^A_{il} = R^0_{il} + b^A_{il} + \varepsilon^A_{il}, \quad l = 1, \dots, k,$$

and to remove BE, we subtract the observed mean gene expression of the reference from the observed gene expression of the clinical samples:

$$\tilde{X}^A_{ij} = X^A_{ij} - \frac{1}{k}\sum_{l=1}^{k} R^A_{il}.$$

We note in the above equation that if a certain gene $i$, measured in the original sample, is not expressed in the reference samples, then $\sum_{l=1}^{k} R_{il} = 0$, resulting in no adjustment for that gene, and BE will not be eliminated. For this reason, it is important to ensure that the selected reference samples have a good expression level of the genes of interest; otherwise the reference will not be effective at eliminating BE.

If the interest is to combine data obtained from different batches, this can be achieved by similarly subtracting $\frac{1}{k}\sum_{l=1}^{k} R^B_{il}$ from $X^B$ prior to combining the two data sets. In certain cases a model may have been developed with data that were not batch corrected, and hence future data would need to be calibrated to the training data. This is achieved by adding the difference between the references to the new data:

$$\tilde{X}^B_{ij} = X^B_{ij} + \frac{1}{k}\sum_{l=1}^{k}\left(R^A_{il} - R^B_{il}\right).$$
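A minimal sketch of these two adjustments is given below, assuming genes x samples arrays of normalized log2 expression; the arrays, function names and simulated batch shift are hypothetical and serve only to illustrate the equations above.

```python
import numpy as np

def correct_within_batch(X, R):
    """Subtract the per-gene mean of the k reference samples run in the same batch.
    X and R are genes x samples arrays of normalized log2 expression."""
    return X - R.mean(axis=1, keepdims=True)

def calibrate_to_training(X_B, R_A, R_B):
    """Calibrate new-batch samples to an uncorrected training batch by adding the
    per-gene difference between the reference means of the two batches."""
    return X_B + (R_A.mean(axis=1, keepdims=True) - R_B.mean(axis=1, keepdims=True))

# Toy example: 4 genes, 5 clinical samples in batch B, k = 3 references per batch.
rng = np.random.default_rng(1)
truth = rng.normal(3, 1, size=(4, 5))
batch_shift = np.array([[0.5], [-0.3], [0.8], [0.1]])   # gene-wise batch B effect
X_B = truth + batch_shift
R_A = rng.normal(3, 1, size=(4, 3))                     # references run in batch A
R_B = R_A + batch_shift                                 # same references, shifted in batch B
X_B_calibrated = calibrate_to_training(X_B, R_A, R_B)   # recovers ~truth
```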

Quality Assurance Metrics
All OC clinical samples passed all QA metrics. Only one HL sample failed, based on the fields of view (FOV) metric (73% obtained, below the >75% cutoff), and was removed. Samples were flagged as imaging failures if the percentage of lane image FOV obtained was less than 75% of the requested number of fields. This may be affected by either physical problems with the cartridge or oversaturation of probes (locally or across the whole lane). Oversaturation occurs when probes compete for physical space on a lane, displacing positive control probes and resulting in loss of linearity, while probes expressed at low levels simultaneously become indistinguishable from background noise. Lane oversaturation was determined using the linearity of the positive control probes (R² < 0.95) and the ability to detect the smallest positive control. Failures were more common in the OVO cohort (12/135 failed) than in HLO (no failures); however, this was expected, as the dilution experiments were designed to stress usability limits. OVCL had a large number of genes that were not expressed (Fig 1), likely because the OC CodeSet design was based on OC histotype and molecular subtype analysis [31,33,34], with a substantial proportion of stromally-derived genes [33,34]. Taking this into account, we were less stringent in excluding cell lines that had a lower percentage of genes detected above background. Clinical samples with a signal-to-noise ratio (S/N) smaller than 100 and <50% of genes detected above the limit of detection (LOD) were considered poor quality and removed from analyses. Oligonucleotide samples with less than 95% detection were considered failed, as these synthetic samples were engineered for a 100% perfect match. The S/N, computed as the ratio of the geometric mean of HK to the LOD, was used as an overall sample quality measure. LOD is a metric used to determine the level of background noise in the system; a high LOD results in fewer genes being detected above background noise. The fraction of genes above LOD is a good measure of the quality of the RNA preparation and the quantity of RNA loaded; however, it should be noted that this is not a generalizable cut-off and must be established (empirically) for a given CodeSet and experimental design. A subset of the OVO cohort also failed the S/N metric; again, this was expected, as noted above (Fig 2). Samples with high S/N and a large percentage of genes detected are generally considered better quality; however, the cutoff for the removal of samples from analyses is application-dependent. The results are presented in Table 2 and more details are provided in S1 File.
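The sketch below shows how the per-lane QA quantities described above could be computed on raw counts. The thresholds follow the text (FOV ≥ 75%, positive-control R² ≥ 0.95, S/N ≥ 100, ≥ 50% of genes above LOD); the LOD rule shown (mean of negative controls plus two standard deviations) is a common convention and an assumption here, not necessarily the exact rule used in this study (see S1 File), and the function name and inputs are illustrative.

```python
import numpy as np
from scipy import stats

def lane_qa(fov_counted, fov_requested, pos_counts, pos_conc,
            neg_counts, hk_counts, gene_counts):
    """Per-lane QA summary on raw counts. Thresholds follow the text; the LOD
    definition (mean of negatives + 2 SD) is an assumed convention."""
    fov_pct = 100.0 * fov_counted / fov_requested
    # Linearity of the positive-control titration (log2 counts vs log2 concentration)
    r = stats.pearsonr(np.log2(pos_conc), np.log2(pos_counts))[0]
    lod = np.mean(neg_counts) + 2 * np.std(neg_counts, ddof=1)
    sn = stats.gmean(hk_counts) / lod                    # signal-to-noise
    pct_above_lod = 100.0 * np.mean(np.asarray(gene_counts) > lod)
    return {
        "fov_ok": fov_pct >= 75,
        "linearity_ok": r**2 >= 0.95,
        "sn_ok": sn >= 100,
        "detection_ok": pct_above_lod >= 50,
        "fov_pct": fov_pct, "r2": r**2, "sn": sn, "pct_above_lod": pct_above_lod,
    }
```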
Binding density (BD), a quality assurance measure suggested by the manufacturer to monitor oversaturation, was not used because of minor inconsistencies between hardware versions. BD also did not correlate well with S/N (see S1 File for details), suggesting BDs well above the recommended limit were still valid for quantitation.

Data Normalization
NanoString generates non-amplified count data, which may be right-skewed depending on gene selection. All samples satisfying QA metric checks were log-transformed (base 2) to help with distributional assumptions, as needed. NanoString recommends a three-step normalization procedure that involves adjustments relative to positive controls and HK, as well as background subtraction [35]. Different normalizations for nCounter data have also been considered previously [20]. We found that normalizing to positive controls made little difference and resulted in unnecessary processing of the data. Similarly, estimating and subtracting background resulted in a "false floor", a problem that becomes most evident when comparing data from different CodeSets. Our analysis demonstrated that a sufficient step is the subtraction of the arithmetic mean of HK (on the log scale; this is equivalent to the geometric mean on the raw scale; S2 File). Finally, for the methodology to be suitable in a single-patient clinical setting, we strongly discourage rescaling, which creates an unnecessary dependency on the level of expression of other samples processed concurrently. Using our approach, the final data can be interpreted as a relative fold change to [the mean of] HK. HK selection in the HL CodeSets was kept consistent with published work, using ACTB, CLTC and RPLP0. For OC, HK were chosen from a panel of commonly accepted invariant genes represented on the NanoString Human Reference GX panel, based on their variability in a number of OC studies [33,34]. Five genes were selected from the upper (2), middle (2), and lower (1) expression level quantiles: RPL19, ACTB, PGK1, SDHA, and POLR1B, from highest to lowest median expression, respectively. The selection of HK is important; optimal genes should have low variance across samples and an expression level that spans that of all other genes while remaining in the linear dynamic range of the assay. Moreover, although there is little formal analysis of the optimal number of genes to use as housekeepers, more is better [36]; our current recommendation is a minimum of 5.
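A minimal sketch of this normalization is shown below, assuming a genes x samples table of raw counts and a list of HK gene names; the pseudocount guarding against log2(0) is an assumption of the sketch, and the function name is illustrative.

```python
import numpy as np
import pandas as pd

def hk_normalize(counts: pd.DataFrame, hk_genes) -> pd.DataFrame:
    """Log2-transform raw counts and subtract, per sample, the arithmetic mean of
    the housekeeping genes on the log scale (equivalent to geometric-mean
    normalization on the raw scale). Resulting values are fold changes relative
    to the mean of HK. `counts` is genes x samples; +1 guards against log2(0)."""
    log2 = np.log2(counts + 1)
    hk_mean = log2.loc[hk_genes].mean(axis=0)   # one value per sample
    return log2.sub(hk_mean, axis=1)

# Example call with the OC housekeeping panel named above:
# normalized = hk_normalize(raw_counts, ["RPL19", "ACTB", "PGK1", "SDHA", "POLR1B"])
```

Note that no rescaling across samples is performed: each lane is normalized only against its own HK genes, which is what makes the procedure usable one patient at a time.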

Monitoring Sources of Variability
CodeSet variability can challenge the reproducibility of results for any third party without access to the original samples and CodeSet. Technical and true biological (e.g. histology and Epstein-Barr Virus (EBV) status) variables were considered in PVCA (Figs 3 and 4).
In both OC and HL, the variability associated with CodeSet was ~10%. HL clinical samples showed pronounced CodeSet-dependent shifts in plots of the first three principal components. Significant DSCs of 0.21 and 0.26 were found in the HL and OC datasets, respectively (p-value <0.01) (Tables 3 and 4), indicating the presence of significant but weak BE. Finally, a gene-wise percent change in the average log-expression between CodeSets was computed. The median % change was around 10% in both the OC and HL gene sets, with certain genes being more stable across CodeSets than others (all compared genes had identical probe sequences from lot to lot).

Batch Effect Correction
Common multi-sample BE adjustments in gene expression experiments [11,37,13,12,15] typically require a generous sample set (n > 30) to be processed within each batch. Empirical Bayes methods [12,14] may be useful when the number of samples in each batch is small (n < 30); however, all these methods assume that the underlying biological feature representation is equal in every batch [12], a condition seldom met in clinical practice and observational studies.

Reference-Based Strategy for Batch Effect Correction
Using reference samples for BE adjustment has been suggested as a more effective method for calibration [9,11], where the expression level of the clinical sample is taken relative to the reference (see Materials and Methods).
PVCA plots were used to compare the variability post-adjustment to the unadjusted data (Figs 5 and 6). In the HL data, the batch effect due to CodeSet was removed in all cases except when using the HLO_ref. Similarly, in the OC data, the % of variance due to CodeSet went down in all cases except OVO_ref and OVCL_ref. The results (Tables 3 and 4) showed that, for both cohorts, not correcting for BE resulted in a smaller concordance correlation coefficient: 0.85 for the HL cohort and 0.88 for the OC cohort. Batch adjustment methods had no effect on precision; in contrast, accuracy was modified in all cases, with changes to the OC cohorts being marginal in comparison to HL. Among the multi-sample corrections, all methods except combat in the OC cohort performed similarly in improving concordance between batches; DSC appeared to shrink in both cohorts after adjusting for BE. Among the reference-based methods, synthetic DNA oligonucleotides did not perform as expected, appearing to add more bias to the results.

Downstream Analysis
BE can have major implications for downstream analysis. To assess this impact using NanoString nCounter data, we evaluated a reference-based approach to correct for BE using a previously developed and published HL prognostic model [26] that uses 26 genes to predict a recurrence risk score for HL. A predictive score > 0.6235 is used to indicate a higher risk of death. Using this model, we computed risk scores for the same 31 patients represented in two HL CodeSets and compared the concordance of the scores (Fig 7a). While there was an excellent correlation between scores, a lack of accuracy (C_a = 0.81) appeared to result in a systematic shift from the identity line. If the threshold model were applied, 4/31 cases would be misclassified as low risk in the second CodeSet when in fact they were classified as high risk in the first CodeSet. This can only be attributed to the batch effect associated with CodeSet. We used the reference-based strategy, randomly selecting 3 samples and setting them as a reference (any3_ref), to correct this bias, resulting in no misclassifications (Fig 7b). To ensure that this result was not due to chance, we repeated the process of selecting 3 samples at random and setting them as the reference (in both CodeSets) 5000 times. Each time, we corrected for BE and counted the number of misclassifications. Over 99% of the time, the resulting C_a was near perfect (over 0.99). We observed no misclassifications 39% of the time, 1 misclassification 18% of the time, and a maximum of 2 misclassifications 44% of the time. It should be noted that in all cases misclassifications were attributed to the same two cases that sat very close to the threshold, and slight perturbations in the data shifted their classification.
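The random-reference resampling described above could be reproduced with a loop of the following shape. The 26-gene risk model itself is published elsewhere [26], so `risk_score` here is a hypothetical stand-in function, and only the bookkeeping around the 0.6235 threshold is shown.

```python
import numpy as np

THRESHOLD = 0.6235   # published cut-off indicating a higher risk of death [26]

def count_misclassifications(X_cs1, X_cs2, risk_score, n_iter=5000, k=3, seed=0):
    """Repeatedly pick k samples at random, treat them as the reference in both
    CodeSets, calibrate CodeSet 2 to CodeSet 1, and count cases whose risk
    classification changes between CodeSets. X_cs1/X_cs2 are genes x samples
    arrays of normalized log2 expression for the same patients; risk_score
    (hypothetical) maps a genes x samples array to one score per sample."""
    rng = np.random.default_rng(seed)
    n = X_cs1.shape[1]
    mismatches = []
    for _ in range(n_iter):
        ref = rng.choice(n, size=k, replace=False)
        shift = (X_cs1[:, ref].mean(axis=1, keepdims=True)
                 - X_cs2[:, ref].mean(axis=1, keepdims=True))
        X_cs2_cal = X_cs2 + shift                       # calibrate CodeSet 2 to CodeSet 1
        high1 = risk_score(X_cs1) > THRESHOLD
        high2 = risk_score(X_cs2_cal) > THRESHOLD
        mismatches.append(int(np.sum(high1 != high2)))
    return np.array(mismatches)
```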

Discussion
We have provided empirical evidence that our reference-based BE correction strategy is equivalent to population-based correction methods, with the advantage of being better suited to clinical applications. Though multi-sample BE correction and normalization methods may be appropriate for exploratory retrospective studies, where bulk data would likely be available, special care should be taken to ensure that samples are from homogeneous populations. One problematic example within the spectrum of OC would be comparing a North American OC cohort to a Japanese cohort, as the latter is known to have a larger proportion of the ovarian clear cell carcinoma histotype and a smaller proportion of high-grade serous [38]. While in some cases a biological signal will overpower any of the above-noted variability, this is near impossible to predict in advance. Careful experimental design is critical to maximize the utility of any study intending to generate a clinically relevant classifier, whether diagnostic, prognostic or predictive. In such studies, planning for a reference-based strategy is highly desirable. Optimally, a reference would be included on each cartridge, this being the smallest "batch". However, within-lot variability, examined in several recent studies, was found to be minimal, with negligible effects [23,27]; our data are largely consistent with these findings. This suggests most applications will lend themselves well to periodic reference runs, at least until a locked-down protocol is established for validation [17].
Variability in gene expression data can be partitioned into two sources [39]: biological, caused by differences between biological conditions, and technical, which can be introduced by virtually every experimental detail. NanoString workflow automation and the absence of enzymatic processing/amplification minimize technical variability that would otherwise be influenced by the user. For nCounter assays, large lots of reagent are typically ordered upfront and used for all experiments in a given plan. We focused our efforts on evaluating CodeSet-to-CodeSet variability, a topic that has garnered relatively little attention, as well as on deriving batch-independent quality assurance metrics and a normalization algorithm such that assays could be run in small batches, or in single-patient environments, as one would encounter in clinical practice.
Overall, we found that lot-to-lot variance on a per-probe basis varied widely and was the single greatest source of identifiable variation. Not all probes varied; however, since we do not have knowledge of the batch manufacture of specific probe sets, it is unknown whether stability is inherent to a gene/probe design or whether genes with little difference were drawn from the same manufacturing pool.
In validating our reference-based strategy for batch-independent normalization, we encountered some unexpected results. Specifically, the poor performance of the synthetic DNA oligonucleotide references was surprising. This may have been due to: i) a substantial fraction of n-1 and other incomplete oligonucleotides, known to occur in longer synthesis reactions even with high coupling efficiency [40]; this may be correctable using HPLC purification, a cost-restrictive addition during our experiments. ii) Secondary structure interaction of native RNA that is lacking in DNA; however, given the consistency of signal obtained between replicates of intact mRNA and fragmented mRNA from FFPE sources, we do not suspect significant secondary structure interaction. iii) Unexpected interaction of DNA oligonucleotides with the colour-barcode molecules; maintaining consistency in the probe-barcode combinations from batch to batch may alleviate this. iv) Other non-optimal hybridization parameters (temperature/salt/detergent) for the probe-DNA duplex compared to the probe-RNA duplex of a true sample.
Similarly, in the OC cohort, the cell line-based reference performed poorly. In retrospect, this may have been predictable, since many of the genes were not expressed in the cell lines; expression of these genes in clinical samples originated from stroma, and their absence in the cell lines hindered correction.
Finally, we were unable to test specifically whether synthetic RNA oligonucleotides had superior performance to DNA oligonucleotides (or other tested references). Synthetic RNA may be an optimal solution for clinical/commercial assays, as it could be recreated within very precise parameters. An in vitro transcribed RNA strategy appears to be the method employed for NanoString's FDA-approved "ProSigna" assay [41]. However, synthetic RNA pools may be overly expensive to establish and maintain during research and development phases.

Conclusion
Advance planning is key to any study, and identifying the goals of a NanoString-based experiment is no exception. Should a normalization reference sample be required, careful selection of such a reference should be made: (1) it must act identically to a test specimen in the assay; (2) it must be plentiful: even in the initial research setting it should last through several CodeSets' worth of reagents, so stock should be established early; (3) it must adequately represent all genes of interest. Cell lines may be suitable in some applications, though not in the above example. If they are used, it is important to avoid replenishing stock by growing more cells, as transcript levels may be affected by culture conditions including confluence, nutrient availability, oxygenation, pH, and handling. A new batch of cells is a "new" reference, albeit with similar characteristics to the original, and any new reference must be migrated into the experimental protocol appropriately. Finally, (4) normal tissue, pooled samples or synthetic RNA pools may be considered; it appears that synthetic DNA oligonucleotide pools are not suitable for this purpose. The current de facto gold standard, in vitro transcribed RNA pools, has been defined by FDA approval of NanoString's ProSigna assay. This synthetic RNA approach is unfortunately expensive to establish and maintain during the research and development phase. Large pools of high-quality RNA from real biological samples of interest may be a low-cost stand-in, could be shared with others continuing or reproducing the work, and can be migrated to replacement pools or a long-term, commercial-product solution at a later stage.
Overall, we found the NanoString gene expression platform to be an easy-to-use and highly robust technology. Experimental variability is relatively small and can be managed with good experimental design and proactive planning, keeping in mind the goals of a given project and the desired reproducibility.
Supporting Information

S1 File. Metrics for Quality Assurance. This file contains additional extensive detail on all methods and parameters related to processing and establishment of metrics for quality assurance. (DOCX)

S2 File. Normalization. This file contains additional extensive detail on the method used for normalization of the data using the reference-based strategy and example data comparing our method to the manufacturer's recommendations. (DOCX)