OCR-Stats: Robust estimation and statistical testing of mitochondrial respiration activities using Seahorse XF Analyzer

The accurate quantification of cellular and mitochondrial bioenergetic activity is of great interest in medicine and biology. Mitochondrial stress tests performed with Seahorse Bioscience XF Analyzers allow the estimation of different bioenergetic measures by monitoring the oxygen consumption rates (OCR) of living cells in multi-well plates. However, studies of the statistical best practices for determining aggregated OCR measurements and comparisons have been lacking. Therefore, to understand how OCR behaves across different biological samples, wells, and plates, we performed mitochondrial stress tests in 126 96-well plates involving 203 fibroblast cell lines. We show that the noise of OCR is multiplicative, that outlier data points can concern individual measurements or all measurements of a well, and that the inter-plate variation is greater than the intra-plate variation. Based on these insights, we developed a novel statistical method, OCR-Stats, that: i) robustly estimates OCR levels modeling multiplicative noise and automatically identifying outlier data points and outlier wells; and ii) performs statistical testing between samples, taking into account the different magnitudes of the between- and within-plate variations. This led to a significant reduction of the coefficient of variation across plates of basal respiration by 45% and of maximal respiration by 29%. Moreover, using positive and negative controls, we show that our statistical test outperforms the existing methods, which suffer from an excess of either false positives (within-plate methods), or false negatives (between-plate methods). Altogether, this study provides statistical good practices to support experimentalists in designing, analyzing, testing, and reporting the results of mitochondrial stress tests using this high throughput platform.


Introduction
Mitochondria are double-membrane-enclosed, ubiquitous, maternally inherited organelles present in most eukaryotic cells [1]. They are known as the powerhouse of the cell [2,3] due to their pivotal function in the cellular energy supply where adenosine triphosphate (ATP) is generated by the mitochondrial respiratory chain in a process referred to as oxidative phosphorylation. Furthermore, mitochondria are involved in regulating reactive oxygen species [4], apoptosis [2], amino acid synthesis [5,6], cell proliferation [6], cell signaling [7], and in the regulation of innate and adaptive immunity [8]. A decline in mitochondrial function, reflected by a diminished electron transport chain activity, is related to many human diseases ranging from rare genetic disorders [9] to common ones such as cancer [7,10], diabetes [11], neurodegeneration [12], and aging [3]. One of the most informative tests of mitochondrial function is the quantification of cellular respiration, since it directly reflects electron transport chain impairment [9] and depends on many sequential reactions from glycolysis to oxidative phosphorylation [13]. One of the last steps of cellular respiration is the oxidation of cytochrome c in complex IV, which reduces oxygen to form water. Therefore, the estimations of oxygen consumption rates (OCR) expressed in pmol/min enable drawing conclusions about the ability to synthesize ATP and about mitochondrial function, even more than the measurements of intermediates (such as ATP or nicotinamide adenine dinucleotide NADH) and potentials [14,15].
OCR was classically measured using a Clark-type electrode, which is time-consuming, limited to whole cells in suspension and high yield, and does not allow the automated injection of compounds [15]. In addition, experimentation with isolated mitochondria is ineffective because the cellular regulation of mitochondrial function is removed during isolation [16]. In the last few years, a new technology that calculates O 2 concentrations from fluorescence [17] in a microplate assay format has been developed by the company Seahorse Bioscience (now part of Agilent Technologies) [18]. It allows simultaneous real-time measurements of both OCR and extracellular acidification rate (ECAR) in multiple cell lines and conditions, reducing the amount of required sample material and increasing the throughput [18,19].
Typically, OCR and ECAR are measured using the Seahorse XF Analyzer in 96-well (or 24-well) plates at multiple time steps under three consecutive treatments (Fig 1), in a procedure known as a mitochondrial stress test [20]. Under basal conditions, complexes I-IV exploit energy derived from electron transport to pump protons across the inner mitochondrial membrane. The proton gradient generated in this manner is subsequently harnessed by complex V to generate ATP. The blockage of the proton translocation through complex V by injecting oligomycin represses ATP production and prevents the electron transport throughout complexes I-IV due to the unexploited gradient, thus, generating ATP-ase independent OCR only (Fig 1A and 1B). The administration of carbonyl cyanide-4-(trifluoromethoxy)phenylhydrazone (FCCP), an ionophor, subsequently dissipates the gradient uncoupling electron transport from complex V activity and increasing oxygen consumption to a maximum level (Fig 1A and 1B). Finally, mitochondrial respiration is completely halted using rotenone, a complex I inhibitor. There is still some remaining oxygen consumption that is independent from electron transport chain activity (Fig 1A and 1B). This approach is label-free and nondestructive, so the cells can be retained and used for further assays [21].
OCR differences in the natural scale between the various stages of this procedure lead to the estimation of six different bioenergetic measures: basal respiration, proton leak, non-mitochondrial respiration, ATP-linked respiration, spare respiratory capacity, and maximal respiration [15,19] (Table 1). An increase in proton leak and a decrease in basal or maximal respiration are indicators of mitochondrial dysfunction [15]. In addition, ATP-linked respiration, basal respiration, and spare capacity change in response to ATP demand, which is not necessarily mitochondrion-related as it may be the consequence of the dysregulation of any cellular process altering general cellular energy demand. Then, these bioenergetic measures are typically used to test two samples or conditions against each other.
The existing literature describing the Seahorse technology addresses experimental aspects regarding sample preparation [22,23], the number of cells to seed [23,24], and compound concentration in different organisms [13,22,25]. However, studies regarding statistical best practices for determining OCR levels and testing them against others are lacking. The sole definition of bioenergetic measures varies between authors, as well as the number of time points in each interval (usually three time points, but in some cases one [26], two [27], or four or more [11]), and whether differences [6,13,28], ratios [12,29], or both [24,25]  computed. Consequently, the comparison of results across studies is difficult. Moreover, often, statistical power analyses for experimental design are not provided. The differences in OCR between biological samples (e.g. patient vs. control, or gene knockout vs. WT) can be as low as 12%-30% [30][31][32]. Therefore, to design experiments with appropriate power to significantly detect such differences, it is important to know the source and amplitude of the variation within each sample, and to reduce it as much as possible.
We performed and analyzed a large dataset of 126 experiments in 96-well plate format involving 203 different fibroblast cell lines, out of which 26 were seeded in more than one plate (S1 Table). The large number of between-plate and within-plate replicates allowed us to statistically characterize the nature and magnitude of systematic and random variations in these data. We developed a statistical procedure called OCR-Stats, to extract robust and accurate oxygen consumption rates for each well, which translates into robust summarized values of the multiple replicates within and between plates. The OCR-Stats algorithm includes automatic outlier identification and controls for well and plate-interval effects, which led to a significant increase in accuracy over state-of-the-art methods.
Systematic and random variations were found to be multiplicative. This motivated us to establish bioenergetic measures based on differences in the logarithmic scale that translate into ratios and proportions in the natural scale: ETC-dependent OC proportion, ATPase-dependent OC proportion, ETC-dependent proportion of ATPase-independent OC, and maximal over initial OC fold change (Table 1).
Using an automatic outlier detection approach, we provide estimators for each instance and show empirically that they are normally distributed. This permitted the use of linear regression models for assessing the statistical significance of bioenergetic measure comparisons between two biological samples. Using positive and negative controls from individuals known to have mitochondrial respiratory defects, we show that OCR-Stats outperforms the currently used statistical tests, which suffer from an excess of either false positives (withinplate methods) or false negatives (between-plate methods).
Furthermore, our study provides experimental design guidance by i) showing that between-plate variation largely dominates within-plate variation, implying that it is important to seed the same cell lines in multiple plates, and ii) providing estimates of variances within and between plates for each bioenergetic measure allowing for statistical power computations. A free and pose source implementation of OCR-Stats in the statistical language R is provided at github.com/gagneurlab/OCR-Stats.

Experimental design and raw data
We measured the OCR, the ECAR, and the cell number of 203 dermal fibroblast cultures derived from patients suspected to suffer from rare mitochondrial diseases and control cells from healthy donors (normal human dermal fibroblasts: NHDF, Materials and methods, S1 Table). These were assayed in 126 plates, all using the same protocol (Materials and methods). Twenty-six cell lines were grown independently and were measured in multiple plates. We will refer to these growth replicates as different biological samples. The NHDF cell line was seeded in all the plates for the assessment of potential systematic plate effects. The corners of each plate were left as blank, that is, filled with media but not with cells, to control for changes in temperature [22]. One common layout of a plate is depicted in Fig 1C, showing how each biological sample is present in many well replicates. We seeded between 3 and 7 biological samples per plate (median = 4). This variation reflects typical set-ups of experiments in a lab performed over multiple years. Then, we used the standard mitochondrial stress test assay [20] leading to four time intervals, with three time points each, denoted by Int 1 (before adding any treatment), Int 2 (after oligomycin), Int 3 (after FCCP), and Int 4 (after rotenone) ( Fig 1A). In addition, we flagged wells that did not react as expected to the treatments and discarded them from the statistical analysis (Materials and methods). First, outlier data points occurred frequently. We distinguished two different types of outliers: entire series for a well (e.g., well G5 in Fig 2A) and individual data points (e.g., well B6 at time point 6 in Fig 2A). In the latter case, eliminating the entire series for well B6 would be too restrictive and would result in the loss of valuable data from the other 11 valid time points. Therefore, methods for detecting outliers taking these two possibilities into consideration must be devised.

Variations between replicates within plates
Second, we noticed a proportional dependence of OCR value and standard deviation between replicates (Fig 2B), suggesting that the error is multiplicative. Unequal variance, or heteroscedasticity, can strongly affect the validity of statistical tests and the robustness of estimations. Therefore, we propose modeling OCR on a logarithmic scale, where the dependence between the variance and the mean disappears (Fig 2B and 2C). The activities of respiratory chain enzymes such as NADH-ubiquinone reductase also obey log-normal distributions [33].
Third, we observed systematic effects in OCR between wells (e.g., OCR values of well C6 are among the highest, while OCR values of well B5 are among the lowest at all the time points; Fig 2A). Variations in cell number, initial conditions, treatment concentrations, or fluorophore sleeve calibration can lead to systematic differences between wells, which we refer to as well effects. To investigate whether well effects could be corrected using cell number to a large extent as in [26], we counted the number of cells after the experiments using Cyquant (Materials and methods). As expected, the median OCR for each interval grows linearly with cell number measured at the end of the experiment (Spearman's rho between 0.32 and 0.47, P < 2.2×10 −16 , S1A Fig). However, the relationship is not perfect, reflecting important additional sources of variations and also possible noise in measuring the cell number. Strikingly, dividing OCR by cell count led to a higher coefficient of variation (standard deviation divided by the mean) between the replicate wells than without that correction (S1B Fig). This analysis showed that normalization by the division of raw cell counts is insufficient and motivated us to derive another method to capture well effects. Finally, we found that sex does not significantly associate with OCR levels (S2 Fig), in agreement with [34].

A statistical model for OCR within plates
Building on these insights, we introduced a statistical model for OCR within plates. For a given plate, we modeled the logarithm of OCR y w,t of well w at time point t = 1,. . .,12 as a sum of time interval effects, well effects, and noise, that is: where θ biosample(w),interv(t) is the time interval effect of the biological sample in well w for interval interv(t) = 1,. . .,4 of time point t (Fig 1A), β w is the relative effect of well w compared to the reference well, and ε w,t is the error. We now present the OCR-Stats algorithm. For a given plate: 1. Fit the log linear model from Eq (1) using the least squares method, which consists of mini-   5. Compute the ratio-based metrics (Table 1), or scale back to natural scale in order to compute the bioenergetic measures [e.g.: Basal respiration ¼ expðŷ 1 Þ À expðŷ 4 Þ].
Note that the well effect is modeled independently for each plate, that is, it corresponds to the effect of a well of a given plate and not to the effect of a well position shared across plates.
We investigated whether there were positional effects and found that OCR measurements are lower in the edges by a median of up to 13.1% (S4A and S4B Fig). However, these positional effects were consistent across intervals (S4A and S4B Fig). Consequently, these positional effects are to a large extent canceled (S4C and S4D Fig) when using the metrics that we suggest (Table 1) because they involve differences of log OCRs. One exception was row A, where median differences of up to 2.2% were observed for the ETC-dependent OCR proportion and for the maximal over initial OCR fold change (S4C and S4D Fig). Practitioners could avoid all four edges and not only the four corners as typically done. However, these systematic deviations are small compared to the amplitude of biological effects typically investigated (not less than 12%-30% [30][31][32]). Altogether, this approach led to coefficients of variation between wells of the same plate of 11%, 14%, 13%, and 17% for intervals 1, 2, 3 and 4 respectively (Materials and methods).

Variations between plates
After analyzing the OCR variation among the wells inside plates, we aimed to study the variation across multiple plates. Using data from the controls NHDF, we found that the variability between plates in all the intervals is much larger than that between wells (S2 Table and  Variations between plates can arise, for example, due to differences in temperature, seeding time, growth time, growth medium, or sensor cartridge [13]. Moreover, treatment efficiencies can also vary between plates, but independently from each other. For example, the concentration of rotenone may differ in one plate. That would affect the OCR measurements of all the wells in that plate, but only in time interval 4. Next, we investigated whether our assumption of systematic plate-interval effects held. We indeed observed a similar increase in OCR in interval 1 on both biological samples on plate #20140430 with respect to plate #20140428 (Fig 3A). To test whether this tendency held across the repeated biological samples, we compared all the replicate pairings with their respective NHDF controls and found a positive correlation in all the time intervals (Fig 3B), suggesting a plate-interval effect. These observations show the importance of basing conclusions from observations across multiple plates and for seeding a control cell line on every plate.

Statistical testing for the comparison of biological samples across plates
We then set up to devise a model to statistically assess difference in OCR ratios between two biological samples across multiple plates. Since there is a remaining systematic effect across intervals at the plate level ( Fig 3C) and because of the plate-interval effects, we recommend using ratios of OCR levels (i.e. differences in the logarithmic scale) ( Table 2).
Subsequently, for any given OCR ratio (e.g., M/Ei-fold change, Tables 1 and 2), we test the differences of the OCR log-ratios ΔΔθ of a biological sample b versus a control c using the following linear model: where ΔΔθ b,p is one OCR log-ratio difference of interest (Table 2) inside a plate p. We fit this model over our complete dataset using linear regression, thus obtaining one valuem b per OCR ratio and biological sample b. Then, we tested these against the null hypothesis μ b = 0 to compute p-values and confidence intervals (Materials and methods). Fitting the linear model of Equation (Eq 2) over the complete dataset gives a robust estimate of the standard deviation of the error term. Applying this approach, we found no evidence against the normality and homoscedasticity assumption of OCR-Stats as the quantile-quantile plots of the residuals aligned well along the diagonal (Fig 4A and S6 Fig).

Benchmark
We applied OCR-Stats statistical testing, Extreme Differences plus Wilcoxon test within each plate (within-plate ED), and Extreme Differences plus Wilcoxon test across plates (acrossplate ED) to obtain the M/Ei-fold change and maximal respiration (MR) of all of the 26 cell lines that were seeded in more than one plate (Materials and methods). For every approach, we computed p-values for significant fold changes against the controls. Six of these cell lines are derived from patients with rare variants in genes associated with an established cellular respiratory defect, allowing the assessment of the sensitivity (or statistical power) of each approach (S3 Table, [35][36][37][38][39]). Additionally, two cell lines (#73901 and #91410) repeatedly showed no significant respiratory defects in earlier studies and served as negative controls [40,41].  Table 2. OCR ratio-based differences for statistical testing.

OCR ratios Tested differences ΔΔθ
For each OCR ratio from The within-plate ED method reported significantly higher or lower MR for 56 out of 69 (81.2%) biological samples with respect to the control (Fig 4B and S3 Table). Moreover, the within-plate ED method reported one or more significant differences for all the 26 cell lines, and one or more non-significant differences for 11 cell lines (Fig 4B). For two cell lines, the within-plate ED method returned significant differences with opposite signs (cell lines #78661, #83109, Fig 4B). These ambiguous results show the importance of testing using multiple plates and suggest the need for a more robust approach than the within-plate ED. One approach to evaluate samples measured in multiple plates is to perform a Wilcoxon test on the ED values averaged per plate (across-plate ED, Materials and methods). However, this requires at least five plate replicates in order to obtain significant results. Here, one cell line only, #78661, was found to have significantly impaired OCR in this way. For these data, OCR-Stats was much more conservative than within-plate ED and found only 7 out of 26 (26.9%) cell lines to have aggregated significantly lower M/Ei-fold change than the control, including all six positive control cell lines (Fig 4B and 4C, and S3 Table). Moreover, OCR-Stats did not report significant M/Ei-fold changes for the two negative controls.
Furthermore, we computed the coefficient of variation (standard deviation divided by mean) of the six bioenergetic measures in the natural scale (Table 1) of all the repeated biological samples across plates for the following methods: i) the default Extreme Differences (ED) method (Materials and methods) provided by the vendor, ii) the log linear (LL) corresponding to steps 1 and 2 of the OCR-Stats algorithm, iii) complete OCR-Stats (LL + outlier removal), and iv) OCR-Stats after correcting for plate effect (OCR-PE) using Eq (4) (Materials and methods). Each step contributed to a decrease in the coefficient of variation, obtaining a final significant reductions of 45% and 29% in basal and maximal respiration, respectively, from plate-corrected OCR-Stats (OCR-PE) with respect to ED (P < 0.012, one-sided Wilcoxon test) (Fig 5). Taken together, these results show that OCR-Stats successfully identifies and decreases the variation within and between plates, providing more stable testing results, which translates into fewer false positives.

Power analysis
Finally, we investigated the statistical power of OCR-Stats in this dataset. Specifically, we are interested in determining the minimum relative differences our method is able to significantly detect, and the minimal number of well replicates needed. We subsetted the number of wells of the repeated biological samples to 4,6,8,10,12,14, and 16 wells on each plate, and used the OCR-Stats algorithm (including outlier removal) and statistical testing to obtain the residuals b,p and their standard deviation (Fig 6). Assuming three plates per comparison and 16 wells per plate, these standard deviations allow detecting relative differences of 10% to 15% depending on the considered log OCR ratios differences for significance level of 5% (Fig 6, right yaxis, Materials and methods). Relative differences of 10% to 15% are in line with reported detected variations in the literature which we found to be as low as 12%-30% [30][31][32]. This analysis also suggests to seed at least 12 wells per biological sample per plate, since we observed increased standard deviations of the residuals for numbers of wells smaller than 12. Note that this power calculation is based on measurements performed in our laboratory only. Other laboratories might have larger or smaller measurement variations. Nonetheless, our procedure could be used as a guideline for power calculation.

Discussion and conclusion
Mitochondrial studies using extracellular fluxes, specifically the XF Analyzer from Seahorse, are gaining popularity and are finding their way into diagnostics; therefore, it is of paramount importance to have an appropriate statistical method to estimate the OCR levels from the raw data. Here, we have developed such a model, the OCR-Stats algorithm, which includes approaches to control for well and plate-interval effects, and automatic outlier identification. We found that dividing cellular OCR by cell number involved the introduction of more noise than was seen for uncorrected data. Here, we always seeded the same number of cells. Hence, the variations across wells that we observed in the cell number at the end of the experiments are largely overestimated by noise in the measurements. In other experimental settings in which different numbers of cells are seeded, we suggest the inclusion of an offset term to the model in Eq (1) equal to the logarithm of the seeded cell number to control for this variation by design. In addition, the Seahorse XF Analyzer can be used on isolated mitochondria and on isolated enzymes, where a normalization approach is to divide OCR by mitochondrial proteins or enzyme concentration [42]. However, as described here for cellular assays, robust normalization procedures require careful analysis.
We demonstrated that OCR comparisons should be performed using ratios rather than using differences, and that the cell lines must be seeded on the same plate, as this eliminates sources of variation like cell number, and well positional and plate-interval effects. We introduced a linear model, the OCR-Stats statistical testing, and showed that the results agree with previous results of patients diagnosed with mitochondrial disorders. We showed that the variation in differences of OCR log-ratios ΔΔθ b,p for the same biological sample across plates is large, and that, consequently, samples should be seeded in multiple plates. Note that a contaminated sample can increase the variability, affecting the significance of all the other samples. Therefore, it is important to detect such samples and to exclude them from further analysis. By doing power analysis, we showed that our method is able to detect relative differences of 10% -15%, and that the minimum number of well replicates per biological sample in a 96-well plate should be 12.
We encourage users to consider all five metrics (Table 1). Severely affected cell lines with strongly reduced E/I-proportion might not necessarily show a clear effect on M/Ei-fold change. For example, cell line #73387 was found to have a lower, but not significantly (P < 0.10), M/Ei-fold change (the most common metric used throughout the literature, Fig 4C  and S3 Table), but when analyzing its E/I proportion, we found that it was significantly lower  Table 1). The right y-axis corresponds to the minimal detectable relative differences using three plates at 5% significance level (Materials and methods). For every number of wells, the 10 data points correspond to each of the 10 random samplings without replacement of the wells per biological sample and per plate.
In principle, OCR-Stats should be able to estimate ECAR levels. To guarantee that the method is indeed applicable, similar analyses as performed here should be done beforehand. Preliminary investigations suggest that the nature of noise (outliers, multiplicative) is similar to that for OCR.

Biological material
All the biological samples were derived from primary fibroblast cell lines of humans suffering from rare mitochondrial diseases, established in the framework of the German and European networks for mitochondrial disorders mitoNet and GENOMIT. All the individuals or their guardians provided written informed consent for their cell lines to be used for evaluation and testing, in agreement with the Declaration of Helsinki and approved by the ethical committees of the centers participating in this study. All the assays were performed in accordance with the local approval of the ethical committee of the Technical University of Munich. The controls are primary patient fibroblast cell lines, normal human dermal fibroblasts (NHDF) from neonatal tissue, commercially available from Lonza, Basel, Switzerland.

Measure of extracellular fluxes using Seahorse XF96
We seeded 20,000 fibroblast cells in each well of a XF 96-well cell culture microplate in 80 ml of culture medium, and incubated them overnight at 37˚C in 5% CO 2 . The four corners were left only with medium for background correction. The culture medium was replaced with 180 ml of bicarbonate-free DMEM and cells were incubated at 37˚C for 30 min before measurement. Oxygen consumption rates (OCR) were measured using an XF96 Extracellular Flux Analyzer [20]. OCR was determined at four levels: with no additions, and after adding oligomycin (1 μM), carbonyl cyanide 4-(trifluoromethoxy) phenylhydrazone (FCCP, 0.4 μM), and rotenone (2 μM) (additives purchased from Sigma at highest available quality). After each assay, manual inspection was performed on all wells using a conventional light microscope. The wells for which the median OCR level did not follow the expected order, namely, median [OCR(Int 3 )] > median[OCR(Int 1 )] > median[OCR(Int 2 )] > median[OCR(Int 4 )] (Fig 1A), were discarded (977 wells, 10.47%). It is important to note that other cell lines, or cell lines under certain conditions, may not react as expected to the standard treatments; therefore, they should not be discarded. In addition, we excluded contaminated wells and wells in which the cells got detached (461 wells, 4.94%) from the analysis. All the raw OCR data are available in S4 Table.

Cell number quantification
The cell number was quantified using the CyQuant Cell Proliferation Kit (Thermo Fisher Scientific, Waltham, MA, USA), according to the manufacturer's protocol. In brief, the cells were washed with 200 μL PBS per well and frozen in the microplate at -80˚C to ensure subsequent cell lysis. The cells were thawed and resuspended vigorously in 200 μL of 1x cell-lysis buffer supplemented with 1x CyQUANT GR dye per well. The resuspended cells were incubated in the dark for 5 min at RT, whereupon fluorescence was measured (excitation: 480 nm, emission: 520 nm).

Extreme differences (default) method to compute bioenergetic measures
On every plate independently, for each well, in interval 1 take the OCR corresponding to the last measurement, in intervals 2 and 4 take the minimum, and in interval 3 the maximum OCR value [19]. Then, use the corresponding differences to estimate the bioenergetic measures. Report the results per patient as the mean across wells plus standard deviation or standard error, separately for each plate.

Outlier removal
For each sample s and well w, compute the mean across time points of its squared deviations: s w ≔mean t ðd 2 w;t Þ, thus, obtaining a vector s. Identify as outliers the wells whose s w > median(s) + 5 mad(s), where mad, median absolute deviation, is a robust estimation of the standard deviation (S3A Fig). We found that deviations by 5 mad from the median were sufficiently selective in practice. Compute the vector of estimatesθusing the remaining wells and iterate this procedure until no more wells are identified as outliers. It required eight iterations until convergence and around 16.5% of all the wells were found to be outliers (S3B Fig).
Single point outliers are identified in a similar way. After discarding the wells that were found to be outliers in the previous step, categorize as outliers single data points whose d 2 w;t > median t ðd 2 w;t Þ þ 7 mad t ðd 2 w;t Þ (S3C Fig). Iterate until no more outliers are found. It required 19 iterations until convergence and approximately 6.1% of single points were found to be outliers (S3D Fig).

Coefficient of variation between wells of the same plate
Using only the controls NHDF, we computed the standard deviation σ p,i of the logarithm of OCR across all the wells for each plate p and interval i. Then, we computed the median across plates, thus, obtaining one value " s i per interval (" s 1 ¼ 0:10; " s 2 ¼ 0:13; " s 3 ¼ 0:12; " s 4 ¼ 0:16). Coefficients of variation in the natural scale were approximated by taking the exponential of these standard deviations.

OCR-Stats statistical testing
We fitted Eq (2) using linear regression as implemented in the base R function lm(). P-values for each ratio against the null hypothesis μ b = 0 are obtained with the default test (Student's ttest) returned by the summary function on the lm fit object.

Power calculation of multi-plate experiments
Minimal detectable effects for OCR-ratio based metrics (Table 2) at 95% confidence level were estimated using the following equation: where b,p are the residuals from Eq 3 and 1.96 corresponds to the approximate value of the 97.5 percentile of the standard normal distribution. We obtain the metrics on Table 2 by setting n = 3.

Plate-interval effect benchmark
For benchmarking, we correct for the plate-interval effect using only the data from the controls NHDF c of each plate using the following model: We solved it using the least squares method and used the effectsb i;p as offsets in Eq (1). We recomputedŷ b;i values accordingly and scaled back to the natural scale to calculate the bioenergetic measures and the coefficient of variation of all repeated the biological samples, except the control (Fig 5).

Multi-plate averaging method
In the case of inter-plate comparisons, the multi-plate averaging method takes the mean and standard error of the bioenergetic measures obtained using the Extreme Differences (ED) method of all the repeated biological samples across plates [44].
Supporting information S1   (Table 1) across rows (x-axis, C) and columns (x-axis, D). All the location effects get canceled, except for row A where it remains relatively low. (PNG)

S5 Fig. Variation between plates is larger than variation within plates.
Boxplot showing OCR in time interval 3 (x-axis) of NHDF seeded in 10 randomly selected plates (y-axis) reflecting that the variation between is larger than the variation within plates. Red line: mean of OCR across all plates. This trend was observed across all the plates and for all the intervals (S2 Table).