Effect of Population Heterogenization on the Reproducibility of Mouse Behavior: A Multi-Laboratory Study

In animal experiments, animals, husbandry and test procedures are traditionally standardized to maximize test sensitivity and minimize animal use, assuming that this will also guarantee reproducibility. However, by reducing within-experiment variation, standardization may limit inference to the specific experimental conditions. Indeed, we have recently shown in mice that standardization may generate spurious results in behavioral tests, accounting for poor reproducibility, and that this can be avoided by population heterogenization through systematic variation of experimental conditions. Here, we examined whether a simple form of heterogenization effectively improves reproducibility of test results in a multi-laboratory situation. Each of six laboratories independently ordered 64 female mice of two inbred strains (C57BL/6NCrl, DBA/2NCrl) and examined them for strain differences in five commonly used behavioral tests under two different experimental designs. In the standardized design, experimental conditions were standardized as much as possible in each laboratory, while they were systematically varied with respect to the animals' test age and cage enrichment in the heterogenized design. Although heterogenization tended to improve reproducibility by increasing within-experiment variation relative to between-experiment variation, the effect was too weak to account for the large variation between laboratories. However, our findings confirm the potential of systematic heterogenization for improving reproducibility of animal experiments and highlight the need for effective and practicable heterogenization strategies.


Introduction
Experimental results that cannot be reproduced are scientifically worthless and a nuisance if published in the literature where they may create uncertainty and hinder scientific progress. Poor reproducibility and lack of external validity are an issue throughout laboratory research from mass spectrometry proteomic profiling [1] and microarray analysis [2][3][4][5] to the social and behavioral sciences [6,7]. In animal experiments, however, where the lives of animals are highly valuable, poor reproducibility is also an ethical issue. Thus, animal care and use regulations require scientists not to unnecessarily duplicate previous experiments [8][9][10]. This explicitly assumes that animal results are reproducible by different laboratories, and that duplication therefore represents unnecessary animal use. However, a review of the scientific literature casts serious doubt on this assumption, indicating that poor reproducibility may be rather widespread [11][12][13][14][15][16][17][18][19][20][21][22].
In animal experiments, animals, housing and experimental conditions are traditionally standardized to render the animals' responses to experimental treatments more homogeneous, thereby reducing within-experiment variation and increasing test sensitivity [23,24]. Because higher test sensitivity allows a reduction of sample size, standardization is also promoted for ethical reasons as a mean to reduce animal use [25,26]. Moreover, standardization across experiments is assumed to reduce between-experiment variation, thereby improving reproducibility among laboratories [24,27]. However, by reducing within-experiment variation, standardization may limit inference to the specific experimental conditions [28,29]. Given that most biological traits exhibit environmental plasticity [30], different experimental conditions may produce different experimental outcomes. Because laboratories inherently vary in many experimental features (e.g. experimenter, room architecture), conditions are generally more homogenous within than between laboratories. Therefore, standardization inevitably induces disparity between results from different laboratories. In contrast, controlled variation of experimental conditions may render the animals within experiments more heterogeneous, thereby improving the external validity and hence the reproducibility of experimental results [28,29,31]. Indeed, we have recently shown in mice that standardization may increase the incidence of spurious results in behavioral tests, accounting for poor reproducibility between replicate experiments, while systematic variation of experimental conditions (heterogenization) attenuated spurious results, thereby improving reproducibility [32]. However, our findings were challenged because they were based on retrospective analysis, and because heterogenization may be logistically unfeasible [33]. We therefore tested standardization against a simple form of heterogenization for reproducibility across four independent replicate experiments. Systematic variation of only two factors was sufficient to mimic the range of differences between the replicate experiments, resulting in almost perfect reproducibility [34].
In a real multi-laboratory situation, however, between-experiment variation might be considerably greater. Recent multilaboratory studies revealed large effects of the laboratory as well as strong interactions between genotype and the laboratory environment [13,17,35,22]. To investigate whether simple forms of heterogenization within laboratories render populations of mice sufficiently heterogeneous to guarantee robust results across laboratories, we designed a multi-laboratory study involving six laboratories, and compared the effect of standardization against heterogenization on the reproducibility of behavioral differences between two common inbred strains of mice. Although heterogenization significantly increased within-experiment variation relative to between-experiment variation, the effect was too weak to account for the large variation between laboratories and improve reproducibility substantially. Thus, further research is needed to establish effective and practicable heterogenization strategies.

Experimental design
Each of six laboratories used 64 female mice of two inbred strains (C57BL6NCrl, DBA/2NCrl, n = 32 each) and examined them for strain differences in five commonly used behavioral tests (barrier test, vertical pole test, elevated zero maze, open field test, novel object test). To test heterogenization against standardization, each laboratory successively conducted the same experiment twice, using two different experimental designs with half of the mice allocated to each design. In the standardized design, experimental conditions were standardized, while they were systematically varied in the heterogenized design. For heterogenization, we selected two experimental factors (test age, cage enrichment) that typically vary between experiments in different laboratories, and chose three factor levels A, B and C for each factor (age: A = 12 weeks old, B = 8 weeks old, C = 16 weeks old; cage enrichment: A = nesting material, B = shelter, C = climbing structures). Within each laboratory, the two factors were standardized to factor level A in the standardized design and systematically varied across B and C using a 262 factorial design in the heterogenized design ( Fig. 1). Because both age and enrichment have been demonstrated to affect and interact with a wide variety of potential outcome measures [36][37][38][39][40][41][42], heterogenization across these two factors was expected to create a range of different phenotypes within experiments, thereby increasing the external validity and thus the reproducibility of the results across the six laboratories.
Besides the two experimental factors that were standardized or heterogenized depending on the experimental design, the following factors were controlled and standardized in both experimental designs and all six laboratories: order of tests, test protocols, animal supplier, and housing protocols (number of animals/cage, housing period prior to testing, position of cages Figure 1. Experimental design. Each of six laboratories used 64 female mice of two inbred strains (C57BL6NCrl, DBA/2NCrl) ordered in two consecutive batches (n = 16 per batch and strain), with each batch being allocated to one experimental design. Upon arrival of a batch, the 16 mice per strain were randomly assigned to four cages in groups of four. To test heterogenization against standardization, we selected two experimental factors (test age, cage enrichment) and chose three factor levels A, B and C for each factor. Within each laboratory, the two factors were either standardized to factor level A (standardized design, uniform grey) or systematically varied across B and C using a 262 factorial design (heterogenized design, varying grey). According to the 262 factorial design of the heterogenized condition, study populations were divided into four blocks that were also characterized by similar microenvironmental differences due to cage position within the rack. doi:10.1371/journal.pone.0016461.g001 within the rack, interval of cage changes). All other variables varied between laboratories depending on laboratory standards. These included: details of the housing conditions (e.g. local tap water, food type, local bedding material, cage size), physical arrangement of housing and testing rooms (e.g. local room architecture, humidity, lighting, temperature), test apparatuses, tracking software (e.g. ANYmaze or EthoVision), experimenter, handling method (e.g. with/without gloves), identification method (e.g. ear punctures, fur markings) arrival and test dates, and test time (see Tables 1 and 2).

Laboratories
The study was conducted in the following six laboratories: (1) Animal Welfare and Ethology, University of Giessen (H. Würbel), . Each lab provided space in a conventional colony room for animal housing and a test room for behavioral testing. Animal care was provided by each lab's animal care staff together with the designated experimenter of each laboratory who also implemented cage enrichments and conducted behavioral testing throughout the two test weeks. Experimenters were a PhD student in Giessen, a PhD student in Muenster, a postdoctoral research fellow and a student assistant in the Munich lab, a technician and a postdoctoral research fellow in Mannheim, a biotechnician in the Utrecht lab, and a postdoctoral research fellow and a biotechnician in Zürich. All experimenters were adept in working with mice and conducting behavioral tests. Experimenters were not blinded to strain, age, housing conditions and experimental design. However, because our outcome measure was reproducibility across laboratories and each laboratory had its own experimenter, the experimenters' knowledge about the animals and their expectations about the outcome of the study could not bias our outcome measure.

Between-laboratory standardized conditions and procedures
Animals and housing conditions. The 384 female mice (C57BL/6NCrl, DBA/2NCrl, n = 192 each) were obtained from Charles River Laboratories (Sulzfeld, Germany) aged nine weeks for the standardized condition, and aged five and thirteen weeks for the heterogenized condition (Fig. 2). Each lab independently ordered 32 females per strain that were supplied consecutively in two batches (n = 16/strain), one for the standardized design and one for the heterogenized design. The order of supply was balanced across laboratories with three laboratories (Giessen, Mannheim, Munich) starting with the standardized design and three laboratories (Muenster, Zürich, Utrecht) starting with the heterogenized design. Upon arrival, the mice were randomly assigned to samestrain groups of four and housed in conventional polycarbonate cages with sawdust, standard mouse diet and tap water ad libitum. Depending on the experimental design, cages contained additional equipment: Cages of the standardized design (A) additionally contained two soft tissue papers (Tork, SCA Hygiene Products GmbH, Wiesbaden, Germany), while half of the cages of the heterogenized design (B) contained a mouse house (MouseHouse, Tecniplast, Italy) and one tissue paper and the other half (C) a climbing structure (18 cm long, 10 cm high) [43], a wooden ladder (3 rungs, each 5 cm long, 14 cm high; Trixie Heimtierbedarf, Tarp, Germany) and one tissue paper. Cages were cleaned once per week, except for the test week to minimize disruption due to cage cleaning before testing. Mice were housed under these conditions for three weeks before the onset of the test phase (Fig. 2). Temperature and relative humidity were stable within laboratories, but differed between them (see Table 1). Similarly, all mice were held under a constant 12 h light-dark cycle, but the time schedules differed between laboratories (see Table 1). Depending on the position in the rack, cages may differ in local environmental conditions (e.g. temperature, humidity, lighting, and disturbance) due to variation in proximity to ventilation, lights and human traffic. To avoid position bias, we controlled for cage position in the experimental design [44]. Thus, the eight cages of one design were stacked in two horizontal lines of four cages in one rack, with cages of DBA/2NCrl and C57BL/6NCrl mice balanced for horizontal and vertical position in the rack, and each vertical pair of cages of C57BL/6NCrl and DBA/2NCrl mice was treated as a block, assuming greater microenvironmental similarity within blocks than between blocks.
All procedures complied with the regulations covering animal experimentation within the EU (European Communities Council Directive 86/609/EEC) and in the countries in which the experiments were conducted (Germany: Deutsches Tierschutzgesetz; The Netherlands: Dutch Animal Welfare Act; Switzerland: Schweizerisches Tierschutzgesetz). They were conducted in accordance with the institutions' animal care and use guidelines and, where necessary, approved by the national and local authorities. The German labs (Giessen, Munich, Münster and Mannheim) did not need formal approval of the study by governmental authorities, because the study did not involve any harmful procedures. In the Utrecht lab, the study was approved by the Dutch Ethical Commission (Lely-DEC) under license number DPh-09-04, and the lab's permission to conduct animal experiments was granted by their general license number 24900 provided by the Dutch Government. In the Zürich lab, the study was approved by the Swiss Federal Veterinary Office under license number 204/2008. Moreover, all efforts were made to minimize the number of animals used and the severity of procedures applied in this study. Behavioral testing. Mice were subjected to five behavioral tests that are commonly performed in drug-screening or behavioral phenotyping studies. They were conducted in the same order in all six laboratories: day 1: barrier test (BT), vertical pole test (VPT), day 2: elevated zero maze (EZM), day 3: open field test (OFT) and day 4: novel object test (NOT). To monitor health status, mice were weighed prior to and after testing (Fig. 2).
Testing order of cages was balanced across strain and rack position, and the mice of one cage were tested either simultaneously (OFT, NOT) or successively (BT, VPT, EZM). Apparatuses were cleaned with water or alcohol solution between trials.
Barrier test. To test the exploratory drive, mice were individually placed into an unfamiliar, empty type III Macrolon cage, divided in two halves by a plastic hurdle (see Table 2). At the beginning of each trial, a mouse was placed into one of the compartments according to a pseudo-random schedule. The test was finished when the mouse either crossed the barrier (all four paws on the other side of the barrier) or a maximum time of 300 s elapsed without the mouse climbing over the barrier. The latency to cross the barrier was used as measure of exploratory behavior.
Vertical pole test. The vertical pole test is a measure of motor coordination and balance that requires minimal equipment [45]. A wooden pole, approximately 2 cm in diameter and 40 cm long, was wrapped with cloth tape for improved traction. The mouse was placed on the centre of the pole that was held in a horizontal position by hand. The pole was then gradually lifted to a vertical position. The test was finished when the mouse either fell off the pole or held fast to it for 180 s. The latency to fall off the pole was used as dependent variable.
Elevated zero maze. On an EZM, the exploratory drive of mice is competing with their natural avoidance of heights and open spaces [45]. The EZM is a modification of the elevated plus maze that was first introduced and pharmacologically validated in rats [46,47]. The advantage of the EZM is that it lacks the ambiguous central square of the traditional plus maze. The apparatus consisted of a circular platform, elevated 40-50 cm above the floor, divided into two open and two closed sectors enclosed by walls of about 20 cm height. At the beginning of each trial, the mouse was placed in one of the two closed sectors and behavior was recorded for 300 s. By using specialized tracking software, the total path moved on the maze as well as the path moved within, the time spent in, and the number of entries into the open and closed sectors, were automatically recorded. Moreover, head dips, stretched postures and rearing behavior were manually calculated according to the following definitions: Open field test. The open field test is the most widely used behavioral test since it was developed by Hall [48,49]. It has been validated pharmacologically as a test of anxiety [50], but is also used to measure exploratory and locomotor drive in laboratory rodents [45]. The apparatus consisted of an open box, 40 cm640 cm minimum size, virtually divided into various zones (corners, 5 cm wall zone, centre). Mice were placed into the centre of the empty open field arena and videotracked for 10 min. The time spent in, the distance travelled within, and the number of entries into each zone, were calculated. In addition, the total distance moved during the 10 min session was analyzed and the number of fecal boli dropped was counted at the end of each trial.
Novel object test. In combination with an open field test, the novel object test serves to discriminate between approach and avoidance tendencies towards novel stimuli [51]. Twenty-four hours after the open field test, the animals were re-exposed for 10 min to the same arena with a novel object (black pine cone, autoclaved, about 7 cm high and 6 cm in diameter mounted on a metal plate; Miroflor, Greiz, Germany) placed upright in the centre of the arena. In addition to the zones defined for the open field test, two zones surrounding the object (15 cm, 25 cm diameter) were defined as exploration zones. The zone defined by the object itself was excluded from the exploration zones to avoid confounding ''sitting on the object'' with ''object exploration''. Again, time spent in, distance travelled within, and the number of entries into the various zones were calculated. Object exploration time and frequency were assessed using the Figure 2. Experimental procedure followed by each laboratory. The 64 mice per laboratory, aged nine weeks for the standardized design, and five and thirteen weeks for the heterogenized design, were supplied in two independent batches (n = 16/strain). Upon arrival, the mice were group-housed in conventional polycarbonate cages for three weeks. Cages of the standardized design (red) contained two pieces of tissue paper (nesting material), while half of the cages of the heterogenized design (blue) contained a mouse house and the other half a climbing structure and a wooden ladder. Subsequent to the three-week housing phase, mice were subjected to a battery of five behavioral tests. The whole experimental procedure lasted five weeks, including a three-week housing phase, a one-week test phase and one week shift between the behavioral tests of the standardized and the heterogenized design. The order was balanced across the six laboratories with three laboratories starting with the standardized and three laboratories with the heterogenized design. doi:10.1371/journal.pone.0016461.g002 time spent in, and the frequency of entering the exploration zones. Moreover, the total distance moved during the 10 min session was analyzed and the number of fecal boli dropped was counted at the end of each trial.

Statistical analysis
The aim of the present study was to compare a standardized design with a heterogenized design to examine whether they differ with respect to the reproducibility of strain differences across laboratories, and sample size was determined by the minimum number of animals need for this purpose. For each factor combination of the heterogenized design we used only one cage per strain (the absolute minimum), with 4 mice in each cage (in total n = 16 mice per strain, experimental design, and laboratory). We considered 4 mice per cage the absolute minimum to allow us to compare within-cage variance with between-cage variance as an important control measure to assess whether heterogenization had worked. Moreover, we considered 6 labs sufficient to obtain a reasonable estimate of the effect of heterogenization on reproducibility of behavioral strain differences.
Except for Utrecht, where one animal had died during the test phase, data recordings were complete. However, for 21 out of the 384 animals we had to exclude single values from the final analysis. Reasons for exclusion were (i) mice jumping out of the apparatus (especially in the BT), (ii) mice performing stereotypic circling in the open-field arena, and (iii) problems with video tracking. These missing values were replaced by series means.
All data were analyzed using General Linear Models (GLM). To meet the assumptions of parametric analysis, residuals were graphically examined for homoscedasticity and outliers, and, when necessary, the raw data were transformed using square-root, logarithmic or angular transformations (for a list of transformations see Table 3). For the analysis we selected 29 behavioral measures from the five behavioral tests, including common measures of activity, anxiety and exploratory drive, 20 of which were automatically recorded using specialized software (Table 3).
In a first step, we determined mean strain differences ( = mean C57BL/6NCrl mice -mean DBA/2NCrl mice) for all 29 behavioral measures to compare variation among the six laboratories for the standardized and the heterogenized design.
Next, we analyzed the results of each laboratory separately (laboratory-specific analysis) as if each experiment had been conducted independently and assessed the main effect of 'strain' on each of the 29 behavioral measures using a GLM split by experimental design. Based on the 262 factorial design of the heterogenized condition, and to account for microenvironmental differences due to cage position in the rack, each experiment was divided into the four blocks of cage pairs, and 'block' included as a blocking factor in the GLM: y = strain + block. Including 'block' as a blocking factor in the GLM allowed us to control for betweenblock variation, thereby reducing variance in the data and increasing test sensitivity [52,53].
To explore the difference between the two experimental designs in the variation among laboratories (lab) further, we analyzed the two experimental designs separately using the GLM: y = strain + lab + strain6 lab. We then compared the resulting F-ratios of the 'strain-by-lab' interaction term between the two experimental designs using a second GLM blocked by behavioral measure: y = experimental design+behavioral measure.
The rationale for using F-ratios for this comparison was twofold, namely (i) that F-ratios are scale invariant, so the different scales of the different test measures became unimportant, and (ii) that F-ratios in a GLM have a discrete null hypothesis (F = 1) which we could test against. The latter is because F-ratios reflect 'variance components', so we can think of the true variance of any factor as being = variance due to that factor+residual variance. Thus, in a GLM, if the variance due to the factor under test is 0, then the F-ratio will ideally be 1 (because F-ratio = (variance due to the factor + residual variance)/residual variance). Therefore, if the average F-ratio of the 'strain-by-laboratory' interaction term were equal to 1, this would mean that strain differences did not vary between laboratories, which would essentially be the same as perfect reproducibility. To test this statistically, we used a post-hoc t-test of the null hypothesis that F equals 1.
In the GLM used to determine variation among the six laboratories (y = strain + lab + strain6 lab), the residual variance accounts for all the within-laboratory variance (except variance due to 'strain'). However, the residual variance of this model reflects various aspects of within-laboratory variance, including variation due to the heterogenization factors, cage position in the rack, and individual differences. To determine how exactly Table 3. Complete list of behavioral measures used for the analysis and the transformations applied to meet the assumptions of parametric analysis (NT = no transformation, log = log 10 (y+1)-transformed, sqrt = square-root-transformed, angular = arcsin(square-root(y))-transformed). heterogenization influenced between-experiment variation, we therefore calculated an additional GLM that included 'block' and the interaction between 'strain' and 'block' as factors: y = strain + lab + block(lab) + strain6 lab + strain6 block(lab). Including 'block' and the 'strain-by-block' interaction (with 'block' nested within 'lab') in the GLM, allowed us to calculate an additional Fratio by dividing the mean squares (MS) of the 'strain-by-lab' interaction by the MS of the 'strain-by-block' interaction (F = MS(strain6lab)/MS(strain6block(lab))). This F-ratio reflects the partitioning of the strain-by-block variance among all 24 blocks in the six laboratories into variance among blocks of different laboratories (i.e. between-laboratory variation), and variance among blocks within the same laboratory (i.e. withinlaboratory variation). It therefore represents an ideal measure to determine how heterogenization affected within-experiment variation relative to between-experiment variation [34]. Our prediction was that this ratio will be smaller for the heterogenized design, and ideally = 1. If it were equal to 1 or lower, this would mean that heterogenization generated as much or even more variance between the four blocks within a laboratory as exists between laboratories. All statistical tests were conducted using the software package SPSS/PASW (version 17.0 for Windows).

Effects of strain and laboratory
Regardless of the experimental design, significant and, in some cases, large main effects of 'laboratory' and 'strain' were found for nearly all variables ( Table 4). As expected, the absolute values measured were quite variable among laboratories (see Fig. 3 and Fig. 4). Such additive effects of the laboratory, however, occurred in all five behavioral tests, including measures of activity (e.g. total path moved in the open field), exploration (e.g. novel object exploration time and frequency), and anxiety (e.g. time spent in and frequency of entering the centre in the open field; Table 4). For example, mice tested in Muenster were, on average, less active than those tested in other labs (measured by 'total path moved in the open field test' or 'total path moved in the novel object test'). In particular, the number of stretched postures on the elevated zero maze varied most remarkably among the six laboratories (see Fig. 3).

Standardization versus heterogenization
Between-experiment variation. To explore the effect of the experimental design on the reproducibility of behavioral strain differences, the effect of 'strain' on each of the 29 behavioral measures was assessed separately for each laboratory. Although the average effect of 'strain' varied considerably among the six laboratories in the heterogenized design, the standardized design produced even more variable outcomes (Fig. 5). Moreover, the average F-ratios of the 'strain' effect were considerably larger in the standardized design (Fig. 5).
To confirm these findings statistically, we used the GLM y = strain + lab + strain 6lab (see Statistical Analysis) to determine the F-ratios of the 'strain-by-lab' interaction term for each of the 29 behavioral measures that were then compared between the two experimental designs. Indeed, these F-ratios were significantly smaller in the heterogenized design (F 1,28 = 4.222, p = 0.049), indicating improved reproducibility of strain differences among laboratories in the heterogenized design (Fig. 6). However, including 'block' in the GLM (y = strain + lab + block(lab) + strain6 lab + strain 6 block(lab)) weakened this effect to a nonsignificant trend (F 1,28 = 3,405, p = 0.076), indicating that part of the effect was due to cage position, independent of the heterogenization factors. Moreover, in both designs the average F-ratio was significantly different from 1 (t-test of the null hypothesis that F = 1: standardized design: T 28 = 7.660, p,0.001; heterogenized design: T 28 = 8.214, p,0.001), demonstrating that strain effects varied substantially among laboratories in both designs (Fig. 6). Further graphical examination of the mean strain differences across the six laboratories confirmed this, although strain differences were somewhat more consistent in the heterogenized design (Fig. 7).
Within-experiment variation. To assess whether improved reproducibility in the heterogenized design was caused by heterogenization shifting variation from between-experiment variation to within-experiment variation, within-experiment variances were averaged across the six laboratories and compared between the two designs for each of the 29 behavioral measures. The average within-experiment variance was larger in 23 out of 29 measures in DBA/2NCrl mice and in 18 out of 29 measures in C57BL/6NCrl mice, suggesting that heterogenization systematically shifted variance from between-experiment to within-experiment variation.
To confirm this statistically, we used the GLM y = strain + lab + block(lab) + strain6 lab + strain 6 block(lab) and calculated the Fratio of the 'strain-by-lab' interaction term divided by the 'strainby-block' interaction term (see Statistical Analysis). These F-ratios were significantly smaller in the heterogenized design (F 1,28 = 4.678, p = 0.039, Fig. 8), demonstrating that heterogenization did indeed increase within-experiment variation (variance among blocks of the same laboratory) relative to between-experiment variation (variance among blocks of different laboratories).

Strain effects
C57BL/6 and DBA/2 mice, two of the most widely used inbred strains of laboratory mice, are known to differ markedly in many behavioral tasks [54][55][56][57][58][59]. Therefore, it was not surprising to find significant and often large strain differences in almost all behavioral measures assessed in the present study. In line with previous studies, C57BL/6NCrl mice generally showed less anxiety-related behavior than DBA/2NCrl mice [55,56]. General levels of locomotor activity as measured by the total path moved in the tests, however, did not differ much between the two strains, although C57BL/6 mice are considered to be more active than DBA/2 mice [55,56].

The impact of the laboratory environment
We also found considerable differences in the absolute values measured in different laboratories, confirming previous findings [13,22,35]. In particular, the frequency of stretched postures on the elevated zero maze differed markedly among laboratories. Such large differences in the absolute values may be typical for manually recorded measures and reflect experimenter-dependent variability, highlighting the importance of inter-observer reliability training [60,61] and the value of automated data recording [62,63]. However, such additive differences among laboratories do not normally threaten the validity of strain differences. For example, Munich and Muenster used ANYmaze (Stoelting Co.) for video-tracking of open field and elevated zero maze performance, while the other four laboratories used two different versions of EthoVision (Noldus Information Tecnology). Differences in software functioning may indeed explain some variation in the absolute values measured, but should not affect the size and direction of strain differences.
Despite the marked phenotypic differences between these two strains, however, we also found variation in the direction of strain differences among laboratories in some measures, indicating that the same test conducted in different laboratories may lead to fundamentally different conclusions. Such dramatic strain-bylaboratory interactions may arise when different strains respond differently to the specific environmental or testing conditions of the different laboratories. When using strains that are phenotypically less distinct as is often the case when transgenic strains are compared with wild-type strains [64], this may actually be the norm rather than an exception given that many phenotypic states are highly dependent on environmental conditions [11,19,30,57,65]. In the present study, some aspects of the housing and testing conditions were equated between laboratories (e.g. supplier, testing order, position of cages within the rack), while others remained laboratory-specific (e.g. local room architecture, tracking software, experimenter, time of testing, handling and identification method). However, because both additive and non-additive Table 4. F-ratios of all behavioral measures for the main effects of 'strain' and 'laboratory' based on the GLM (split by experimental design): y = strain+laboratory+strain6laboratory; p#0.001***, p#0.01**, p#0.05*, p.0.05 NS. laboratory effects may arise from any or all of these laboratoryspecific aspects, any further explanation of these effects in terms of single factors is impossible.

Reproducibility of the results
Poor reproducibility is typically caused by interactions of genotype with the specific laboratory conditions. To avoid this, scientists are generally advised to strengthen efforts of standardization both within and between laboratories [23,27,66,67]. However, attempts to avoid poor reproducibility by more rigorous standardization are misleading. If fully effective, standardization within laboratories would decrease variation within study populations to zero [28], and therefore, each experiment would turn into a single-case study with zero information gain, producing statistically significant, but irrelevant results that lack generality under even slightly different conditions [28,29]. Indeed, the average F-ratios of the 'strain' effect were considerably larger in the standardized design, indicating that standardization may systematically overestimate strain main effects. The obvious reason for this is that interactions between strain and the laboratory-specific conditions are mistaken for strain main effects [32,34]. Instead of rigorous standardization, we proposed systematic variation of experimental conditions to render populations of experimental animals more heterogeneous, thereby improving the external validity of results across the unavoidable variation among laboratories [32,34]. The findings reported here are somewhat ambiguous with respect to the efficacy of heterogenization in improving reproducibility. Thus, although heterogenization did have an effect in the predicted direction, this effect was rather weak, and both heterogenization and standardization resulted in relatively poor reproducibility. The reason for this might be that either heterogenization did not work with our selection of behavioral measures or that the type of heterogenization employed here was not effective enough.
Both the reproducibility of behavioral measures and the effect of heterogenization on their reproducibility may vary depending on the exact selection of measures. However, heterogenization should have the weakest effect on those measures that are least sensitive to environmental conditions. Such measures should also be highly reproducible under both standardized and heterogenized conditions. The present analysis was based on a selection of 29 behavioral measures from five tests that are widely used in behavioral phenotyping or drug screening studies. The fact that nearly all of these measures varied considerably among laboratories suggests that they were highly sensitive to environmental conditions. Therefore, our selection of measures is unlikely to account for the relatively weak improvement of reproducibility by heterogenization. Instead, our findings suggest that the study populations generated within laboratories by the form of heterogenization employed here did not adequately represent the range of variation between the six laboratories. The reason for this may be that age and cage enrichment were poor heterogenization factors, or that the specific levels of these factors were not different enough to induce sufficient variation in behavioral phenotypes.

Efficacy of heterogenization
Our choice of heterogenization factors was based on practical considerations and on studies demonstrating that both age and enrichment affect, and interact with, a variety of potential outcome measures [36][37][38][39][40][41][42]68,69]. It is possible that more distinct levels of these factors would have produced stronger effects. Moreover, the pretest housing period was limited to three weeks for logistic reasons. Perhaps a longer exposure of the mice to the laboratory-specific conditions would have strengthened the effects of the heterogenization factors. On the other hand, more extreme variation of age and enrichment, or a longer housing period would have rendered heterogenization less practicable. This raises the question whether other factors might be more effective in heterogenization.
Many of the measures obtained from behavioral tests are highly sensitive to test conditions. Paylor [33] suggested running experiments in several batches tested on different days. While this may be an effective strategy since test conditions are likely to vary from day to day, it is not a well controlled strategy, and efficacy For each laboratory and experimental design, the main effect of 'strain' was separately calculated and displayed in terms of the mean F-ratio (+ s.e.m., square-root-transformed) across all 29 behavioral measures. Although the strain effect varied considerably among laboratories in the heterogenized design, the standardized design produced even more variable outcomes. Moreover, average F-ratios for 'strain' were considerably higher in the standardized design, indicating that treatment effects may be systematically overestimated by standardization. doi:10.1371/journal.pone.0016461.g005 Figure 6. Variation between laboratories in the standardized and in the heterogenized design. The variation in strain differences is displayed as mean F-ratios (+ s.e.m.) of the 'strain-by-laboratory' interaction term calculated for 29 behavioral measures. F-ratios were determined separately for the two experimental designs, square-roottransformed to meet the assumptions of parametric analysis, and then compared using a GLM blocked by 'behavioral measure'. F-ratios of the 'strain-by-laboratory' interaction terms were significantly lower in the heterogenized design (F 1,28 = 4.222, p = 0.049), indicating lower between-experiment variation. doi:10.1371/journal.pone.0016461.g006 may vary greatly both within and between laboratories. Alternatively, specific factors of the test conditions may be used for systematic heterogenization similar to age and enrichment in the present study. For example, test time, background noise, and illumination level have all been shown to affect test responses [70][71][72][73]. It is possible that heterogenization through factors of the test conditions would be more effective because their effects on the animals' test responses are more immediate. Taken together, the fact that the results varied greatly between laboratories in both designs confirms the need for effective heterogenization strategies to guarantee reproducible test results. Therefore, further research is needed to identify and validate factors that exert sufficiently strong effects on behavioral phenotypes. Because poor reproducibility occurs throughout animal experimentation, this research should aim at heterogenization strategies that are either applicable to a wide range of different studies or are specifically tailored to specific types of studies.

Conclusions
Despite strong effects of the laboratory on nearly all behavioral measures in both designs, the findings of this study confirm our earlier findings [32,34], and indicate that systematic heterogenization may also improve reproducibility in a real multi-laboratory situation. By systematically increasing within-experiment variation relative to between-experiment variation, heterogenization tended to improve reproducibility compared to standardization. However, the ratio of between-experiment to within-experiment variation was far greater than 1 in both designs, indicating that betweenlaboratory variation was substantially greater than withinlaboratory variation. This underscores the need for more powerful heterogenization strategies to guarantee reproducibility of results across the large variation among different laboratories.
Author Contributions Figure 8. Between-experiment variation versus within-experiment variation. To assess the relative weight of between-laboratory variation versus within-laboratory variation, an F-ratio was calculated that reflects the partitioning of the 'strain-by-block' variance between all 24 blocks of one experimental design into variance due to variation between laboratories and variance due to variation within laboratories. For this, the mean squares of the 'strain-by-laboratory' interaction term were divided by the mean squares of the 'strain-by-block' interaction term. Data are displayed as mean F-ratios (+ s.e.m.; square-roottransformed) across all 29 behavioral measures for both conditions. Fratios were significantly smaller in the heterogenized design (F 1,28 = 4.678, p = 0.039), demonstrating that heterogenization increased within-experiment variation relative to between-experiment variation. doi:10.1371/journal.pone.0016461.g008