Abstract
The reproducibility of studies involving insect species is an underexplored area in the broader discussion about poor reproducibility in science. Our study addresses this gap by conducting a systematic multi-laboratory investigation into the reproducibility of ecological studies on insect behavior. We implemented a 3 × 3 experimental design, incorporating three study sites, and three independent experiments on three insect species from different orders: the turnip sawfly (Athalia rosae, Hymenoptera), the meadow grasshopper (Pseudochorthippus parallelus, Orthoptera), and the red flour beetle (Tribolium castaneum, Coleoptera). Using random-effect meta-analysis, we compared the consistency and accuracy of treatment effects on insect behavioral traits across replicate experiments. We successfully reproduced the overall statistical treatment effect in 83% of the replicate experiments, but overall effect size replication was achieved in only 66% of the replicates. Thus, though demonstrating sufficient reproducibility in some measures, this study also provides the first experimental evidence for cases of poor reproducibility in insect experiments. Our findings further show that reasons causing poor reproducibility established in rodent research also hold for other study organisms and research questions. We believe that a rethinking of current best practices is required to face reproducibility issues in insect studies but also across disciplines. Specifically, we advocate for adopting open research practices and the implementation of methodological strategies that reduce bias and problems arising from over-standardization. With respect to the latter, the introduction of systematic variation through multi-laboratory or heterogenized designs may contribute to improved reproducibility in studies involving any living organisms.
Citation: Mundinger C, Schulz NKE, Singh P, Janz S, Schurig M, Seidemann J, et al. (2025) Testing the reproducibility of ecological studies on insect behavior in a multi-laboratory setting identifies opportunities for improving experimental rigor. PLoS Biol 23(4): e3003019. https://doi.org/10.1371/journal.pbio.3003019
Academic Editor: Ingrid A. Fetter-Pruneda, Universidad Nacional Autonoma de Mexico Instituto de Investigaciones Biomedicas, MEXICO
Received: December 9, 2024; Accepted: March 25, 2025; Published: April 22, 2025
Copyright: © 2025 Mundinger et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The code and data for all modeling are available at Zenodo https://zenodo.org/records/14002690.
Funding: This research was funded by the German Research Foundation (DFG) as part of the CRC TRR 212 (NC³) – Project numbers 316099922 (CRC overall), 396776123 (SHR), 396777467 (CM), 396780003 (JK), and 396776775 (HS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Abbreviations: CI, confidence interval; DMD, 4,8-dimethyl decanal; EDA, The Experimental Design Assistant; GLMM, generalized linear mixed models; OLRE, observation-level random effect; PCI, post-contact immobility
1. Introduction
Reproducibility, i.e., the ability of a result to be replicated by an independent experiment in the same or a different laboratory [1] (also referred to as replicability [2]), is a cornerstone of the scientific method. Results that cannot be independently reproduced cause scientific uncertainty, hinder scientific progress, and incur costs to science and society. Attempts to replicate scientific findings have produced mixed to rather discouraging results [3–5], raising concerns about whether reproducibility is as high as is desirable and coining the term “reproducibility crisis” (e.g., [3,5–7]). Furthermore, more than 70% of researchers who responded to a survey reported having tried and failed to reproduce another scientist’s experiments, and more than half admitted having failed to reproduce their own experiments [5]. Interestingly, this crisis does not seem to be restricted to a specific research discipline but may affect disciplines as diverse as human medicine [8], psychology [6,9], and economics [10,11].
Historically, the discussion about poor reproducibility was sparked by a multi-laboratory study on mouse phenotyping [12]. In this landmark study, eight different mouse strains were simultaneously investigated in a battery of six behavioral paradigms at three different sites. Although the test set-up and the environmental conditions were rigorously standardized across the three laboratories, Crabbe and colleagues (1999) detected strikingly different results in the three labs, with some behavioral tests even yielding contradictory findings. The authors therefore concluded that “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory” [12].
In subsequent years, a series of single- and multi-laboratory as well as systematic replication studies supported these findings, uncovering further problems with reproducing previous findings (e.g., [13–16]). Consequently, researchers started to search for causes of poor reproducibility and ways out of the crisis. Along these lines, diverse threats to reproducibility have been proposed, with a lack of scientific rigor, low statistical power, publication bias, and analytical flexibility being among the most discussed ones (e.g., [17–21]). To solve these issues, a number of strategies have been developed that aim to overcome biases in the experimental design (e.g., The Experimental Design Assistant—EDA [22]) and/or to improve reporting of the study setup (e.g., ARRIVE guidelines [23,24]). Likewise, the pre-registration of studies as well as the publication of registered reports have been emphasized to address biases at the publication level (e.g., [25,26]).
When considering animal research in particular, yet another problem has been identified, namely the systematic neglect of biological variation (e.g., [1,27–30]). The underlying idea here is that the response of an animal to an experimental treatment depends not only on the properties of the treatment but is a product of the animal’s genotype, parental effects, and its past and present environmental conditions (see also “reaction norm perspective” [31–33]). As laboratory experiments are usually conducted under highly standardized conditions, only a very narrow range of environmental conditions is represented, thereby limiting the inference space of the whole study. Efforts taken to increase reproducibility (i.e., through rigorous standardization) therefore compromise external validity, simply because they restrict the range of environmental conditions to a specific “local set”. This apparent increase in reproducibility at the expense of external validity has repeatedly been highlighted as “standardization fallacy” [34,35] and explains why results can differ when experiments are replicated. Interestingly, however, this reasoning has been almost exclusively used to explain poor reproducibility of preclinical animal research with laboratory rodents. This is particularly surprising as the underlying logic should apply to all living organisms and might explain why discrepant findings arise, whenever animals are studied under standardized laboratory conditions, independent of the study species, the experimental purpose, or the research area.
We here explore whether similar problems arise in experimental studies on insects. The use of insects for laboratory experiments is common in many disciplines but, despite the multitude and diversity of research approaches, has attracted far less attention in terms of reproducibility. To our knowledge, there have been no systematic (meta-) research projects on reproducibility of insect studies to date. Moreover, we know little about the extent to which the reproducibility crisis affects research in behavior, ecology, and evolution, although it appears that the field is not impervious to reproducibility issues (e.g., [36–38]).
The present study aimed to bridge this gap, by systematically exploring the reproducibility of simple ecological studies on insect behavior. In a multi-laboratory approach, we conducted three experiments at three locations with three different study species, each following the same standardized protocol. The study species included the turnip sawfly Athalia rosae (Hymenoptera), the meadow grasshopper Pseudochorthippus parallelus (Orthoptera), and the red flour beetle Tribolium castaneum (Coleoptera), thus representing three frequently studied insect species across three different orders. Each experiment had already been conducted in a previous study in a single laboratory, ensuring that the treatments had produced significant effects in the past (see for A. rosae: [39]; for P. parallelus: [40]; for T. castaneum: [41]). This implies that one laboratory had extensive experience with one of the study species and the experiment, while the other two laboratories were inexperienced and relied on protocols. Consequently, we expected a higher reproducibility of treatment effects in laboratories that had already conducted the respective pilot studies compared to inexperienced laboratories new to both the respective study organism and the experimental set-up. Furthermore, we predicted that manual handling during testing would introduce more between-laboratory variation than assays relying on observation alone.
2. Animals, materials, and methods
Using a multi-laboratory approach, we investigated the reproducibility of three experiments involving three insect species: the turnip sawfly (A. rosae), the meadow grasshopper (P. parallelus), and the red flour beetle (T. castaneum). These organisms exemplify three types of model systems. Individuals of P. parallelus were collected from the wild for this study. In contrast, T. castaneum serves as a classic example of a laboratory-adapted insect model, having been bred exclusively in laboratory conditions for over a decade [42]. The species A. rosae represents an intermediate state, as its populations originated from a laboratory culture that was annually supplemented with wild-caught individuals. Each experiment was designed by one of the participating laboratories to address specific evolutionary and ecological questions relevant to the study species (Fig 1).
PCI: post-contact immobility. Photo credit: Athalia: Pragya Singh; Pseudochorthippus: Holger Schielzeth; Tribolium: Tobias Prueser.
In the first experiment, we examined the effects of starvation on larval behavior in A. rosae. Starvation, a common ecological stressor, can profoundly affect behavioral responses. In this experiment, we specifically measured post-contact immobility (PCI) and activity following a simulated attack. Based on previous findings [39], we hypothesized that, compared to non-starved larvae, starved larvae would exhibit shorter PCI durations and increased activity levels, an adaptive strategy to enhance foraging success and food location under nutritional stress. The PCI quantifications required manual handling of the larva, whereas evaluating activity required little human intervention and handling. A comparison of these two tests presents an opportunity to investigate which behavioral test is more prone to experimental variability introduced by handling. We predicted higher between-laboratory variation in results of the PCI trait compared to results on activity.
In the second experiment, we investigated the relevance of color polymorphism for substrate choice in P. parallelus. The experiment used two color morphs (green and brown) as the statistical treatment effect of interest to test for morph-dependent microhabitat choice and crypsis [40]. We assessed the preference of grasshoppers for green or brown patches to evaluate whether color morphs selectively choose backgrounds (= substrates) that match their own color and thus enhance their camouflage. We predicted that each morph would preferentially select a substrate that matches its body color.
The third experiment focused on T. castaneum, where we assessed niche preference by offering beetles a choice between flour types conditioned by beetles with or without functional stink glands [41,43]. In this species, adult beetles secrete quinone-based compounds that not only alter the microflora [44] and provide external immunity [45] but also serve as signals of population density [46]. These secretions can create microhabitats of varying quality, potentially guiding beetles in selecting optimal habitats. We predicted that larvae and adult beetles would differ in their niche choice, with larvae showing a preference for conditioned flour containing antimicrobial secretions and adults avoiding this conditioned flour.
To ensure consistency across the participating laboratories, each experiment followed a standardized protocol. Behavioral assays were conducted using predefined methodologies to minimize between-laboratory variation. Data collected from the three laboratories were then analyzed to assess the reproducibility of the experimental results. Environmental conditions, such as temperature, humidity, and light cycles, were controlled and kept as consistent as possible across all laboratories (see S1–S3 Tables). Diets were standardized to a large degree but were not completely identical, as each laboratory procured the food for the insects itself by buying, growing, or collecting the necessary dietary components. In detail, cabbage for A. rosae and grass blades for P. parallelus were freshly sourced locally. For T. castaneum, organic wheat flour type 550 with 5% brewer’s yeast was used. While the yeast came from the same batch, flour was bought from different distributors. All laboratory-specific conditions are listed in the supplementary tables (see S1–S3 Tables).
2.1. Effect of starvation on behavior of Athalia rosae
2.1.1. Study organism.
We investigated the effects of starvation on PCI and activity levels in larvae of A. rosae. The larvae of the species feed on various Brassicaceae plants and can be an agricultural pest [47]. In the larval stage, individuals can swiftly defoliate their host plants, consequently facing periods of starvation [48]. Besides finding food, larvae also need to manage predation hazards. PCI is a behavioral response to physical interaction with a predator. During PCI individuals remain motionless for a certain duration [39], a phenomenon that is also referred to as post-predation immobility, tonic immobility, thanatosis, or “death-feigning” behavior [49]. PCI is hypothesized to function as an antipredator tactic [49] and is also used as a proxy for boldness or risk-taking behavior in animals [50,51]. We chose the behavioral traits of PCI and activity because we previously documented a pronounced starvation effect on these traits, with starved larvae exhibiting shorter PCI duration and higher activity levels [39].
The sawflies used in this study originated from the Bielefeld laboratory stock population, initially established from adults collected in the vicinity of Bielefeld, Germany. This stock population was annually supplemented with wild-caught A. rosae adults. Stock population sawflies were housed in mesh cages (60 × 60 × 60 cm) within a laboratory environment at a 16:8-h light-to-dark cycle, approximately 60% relative humidity, and room temperature (15–25°C). For this study, multiple male and female adults were introduced into a cage with mustard (Sinapis alba, Brassicaceae) plants for oviposition, giving rise to male and female offspring. After 1 week, the newly hatched larvae were reared on non-flowering plants of cabbage (Brassica rapa var. pekinensis, Brassicaceae). Mustard and cabbage had been cultivated from seeds in a climate chamber and a greenhouse, respectively. Third- and fourth-instar larvae were collected from the cage, put individually in Petri dishes (5.5 cm diameter) lined with slightly moistened filter paper and provided with cabbage leaf discs, and were sent via mail to all three laboratories (70 larvae to each laboratory). Upon arrival, the larvae were transferred to fresh Petri dishes and were given ad libitum access to cabbage leaf discs obtained locally. Some larvae molted during the experimental assay and were excluded from the experiment. The experiment was performed within 4 days after the arrival of the larvae.
2.1.2. Larval behavioral traits: PCI assay and activity.
For the experiment, the larvae were moved to clean Petri dishes with moist filter paper (1 larva per dish) and randomly allocated to a control or starvation treatment (N = 30 per treatment). In the starvation treatment, larvae had no access to cabbage leaves, while in the control treatment, larvae were provided cabbage leaves ad libitum (see S1 Fig). The positions of the Petri dishes of the two treatments were alternated to avoid any spatial effects. After 3 h of treatment, the PCI duration of all larvae was measured. For measuring PCI, a clean Petri dish (5.5 cm diameter) was prepared and a sheet of graph paper was placed beneath it. To induce PCI, a larva was grasped in the middle of the body using soft-tip spring-steel forceps and dropped from a height of 5 cm onto the clean Petri dish. The larva was considered to exhibit PCI if it curled up its body and stayed motionless in this curled-up posture for at least one second. If the larva did not show PCI, the forceps stimulation was repeated up to two more times. The PCI duration was measured starting from the time the larva showed PCI until the time it uncurled its body and moved at least 1 cm. Each larva was monitored for a maximum of 10 min. After the PCI assay, all larvae were returned to their Petri dishes. One hour after measuring PCI (= after 4 h of starvation treatment), the activity level of each larva was measured. Larvae were moved individually to clean, empty Petri dishes (5.5 cm diameter), and their behavior was recorded and tracked for 10 min using a video camera. Six Petri dishes were recorded in parallel. Using tracking software (the software differed between laboratories, see S1 Table), the distance moved by each individual was extracted from the videos.
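The allocation step described above can be sketched in a few lines of Python. This is only an illustration of balanced random allocation with alternating dish positions, not the procedure the laboratories actually used; the larva labels, the seed, and the function names `allocate_treatments` and `alternate_positions` are assumptions made for the example.

```python
import random


def allocate_treatments(larva_ids, seed=None):
    """Randomly split larvae into equally sized control and starvation groups."""
    if len(larva_ids) % 2 != 0:
        raise ValueError("need an even number of larvae for equal groups")
    rng = random.Random(seed)  # seed is optional and only makes the sketch repeatable
    shuffled = list(larva_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "starvation": shuffled[half:]}


def alternate_positions(groups):
    """Interleave dishes so that neighbouring positions alternate between treatments."""
    layout = []
    for control_id, starved_id in zip(groups["control"], groups["starvation"]):
        layout.append((control_id, "control"))
        layout.append((starved_id, "starvation"))
    return layout


# Hypothetical labels for the 60 larvae tested in one laboratory.
larvae = [f"larva_{i:02d}" for i in range(1, 61)]
groups = allocate_treatments(larvae, seed=1)
layout = alternate_positions(groups)
print(len(groups["control"]), len(groups["starvation"]))
print(layout[:4])
```

Shuffling once and cutting the list in half guarantees the N = 30 per treatment stated in the protocol, which independent coin flips per larva would not.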
2.2. Substrate choice in Pseudochorthippus parallelus
2.2.1. Study organism.
Pseudochorthippus parallelus is a species of grasshopper that is widespread across most of Europe and occurs in multiple discrete color variants that co-occur in local populations [52]. This raises the evolutionary question of why and how this phenotypic polymorphism is maintained within populations. It has been hypothesized that a trade-off between crypsis and thermoregulation results in balancing selection and thus a maintenance of color-morph diversity [53]. Experimental evidence suggests that in the laboratory, greener individuals have a stronger preference to perch on green backgrounds than brown individuals [40], supporting the hypothesis of morph-differential microhabitat choice. The original study had been conducted with three color morphs and we simplified the experiment to the two most extreme morphs (uniform green and uniform brown) for the purpose of our multi-laboratory replication study.
2.2.2. Substrate choice in green and brown morphs.
Experimental individuals were wild-caught from a grassland near Jena, Germany. Two days after the mature grasshoppers were caught, they were transported in small vials by train to the other laboratories, with experimental animals for Jena also being transported by train for about the same duration of time. After arrival, experimental animals were stored in refrigerators overnight. The next day, we transferred individuals to experimental plastic cages (37 × 22 × 24.5 cm) with one individual per cage. The floor of each cage was covered by a laminated sheet which contained a 2 × 2 checkerboard pattern of brown and green patches (see S2 Fig). As a modification to the original experiment [40], brown and green patches were matched for brightness. Experimental individuals were supplied with freshly cut grass leaves potted in small water-filled vials placed at the center of the cage, and females were also provided with a small sand pot for egg laying. Cages, vials, and lining for the floor were provided by the Jena laboratory, while food was collected locally and consisted of blades of wild-growing Poaceae, mainly Dactylis glomerata. After the last measurement of each day, food was replaced if needed.
We arranged cages in a way that individuals of the same sex, but different morphs were next to each other and distributed these paired cages randomly across the room. After one day of acclimatization, the assay started. To record substrate choice, we manually documented the color of the patch that each individual was sitting on (using the position of the head if the individual was sitting on the border of two patches, as in [40]). Observations were repeated on 10 consecutive days, with five measurements taken per day. All measurements were performed between 9 am and 4 pm, to ensure sufficient light and relatively warm temperatures, and at least 1 h elapsed between recordings. Data from individuals that were not sitting on the floor of the cage were excluded from the analysis.
2.3. Niche preference in Tribolium castaneum
2.3.1. Study organism.
Red flour beetles are group-living insects and a global pest to stored grain. The beetles live in overlapping generations within their food source. Adults provide external immunity via quinone-rich stink gland secretions to their offspring and conspecifics [54–56]. However, quinones in high concentrations can have toxic effects [57]. Red flour beetles can exhibit cannibalistic behavior, wherein larvae and mainly female adults feed on eggs and pupae [58]. Therefore, individuals face a trade-off in their niche choice. Due to their high feeding rate, which enables rapid growth, larvae are especially prone to oral infections by entomopathogens. However, only mature adults can produce protective stink gland secretions, which implies that larvae given a choice should prefer flour containing stink gland secretions. A previous choice assay supported this prediction by demonstrating larval preference for conditioned flour containing antimicrobial secretions [41]. In contrast, virgin females showed a preference for flour conditioned by beetles without functional stink glands and thus drastically reduced secretions [41]. Females might choose this niche because it signals a lower population density, and thus a lower risk of cannibalism. Furthermore, the absence of quinones allows higher concentrations of the aggregation pheromone 4,8-dimethyl decanal (DMD) to persist [59], which attracts adult beetles.
The animals used in this study were from the CRO1 strain [42], which is derived from 165 mating pairs originally caught in Croatia in 2010. The population has been kept in the laboratory in non-overlapping generations on organic wheat flour (DM, type 550) with 5% brewer’s yeast. The substrate was heat sterilized at 75°C, and the beetles were kept at 30°C and 70% humidity at a 12 h/12 h light/dark cycle.
The experiment was carried out using virgin adult females and larvae of unknown sex. The Münster laboratory provided a parental population of beetles to each laboratory, along with RNAi-treated knockdown and control beetles, which were used to condition the flour [41]. Beetles that receive a knockdown for the Drak gene during the pupal stage do not develop functioning stink glands; flour conditioned by them therefore hardly contains any of the characteristic quinone-containing secretions. The flour conditioned this way is referred to as “Drak flour” below. The second flour type was conditioned by beetles from the control group, which received a Gfp RNAi control injection and produced normal levels of stink gland secretions. The successful knockdown was confirmed via photometric measurements of quinones at specific wavelengths. Beetles were sent via mail to all three laboratories.
2.3.2. Niche choice assay in larvae and adults.
In the behavioral assay, we gave the beetles the choice between two different flour types, either containing quinone-rich stink gland secretions or having these drastically reduced. We conducted the assay in Münster and Bielefeld in late spring of 2023, and in Jena in May 2024. At the time of the assay, larvae were 17 days old and virgin females were 40–45 days. The sex was determined during the pupal stage, and females were separated from males to ensure their virgin status.
For the behavioral assay, we prepared Petri dishes (9 cm diameter) by marking a center line and spreading 0.3 g of each flour type on either side, leaving a 1 cm section in the middle blank (see S3 Fig). The order of the Petri dishes regarding side orientation was randomized and the observer was blind to the identity of the different flour types. We conducted the experiment in two blocks separated by two weeks. Each block had a sample size of N = 40 or 60 Petri dishes per life stage. Three virgin female adults or three larvae were used per Petri dish. Conducting the assay with three individuals reduced the amount of missing data because individual arenas could still be scored even if one or two individuals had fallen on their backs and could not make an active choice. Also, multiple adults in the same Petri dish can flip each other back on their feet through random interactions. At the start of the experiment, animals were placed with soft-tip spring-steel forceps in parallel orientation along the central line of the dish at the distal end of the flour. We took the first measurement 30 min after all animals were put into the arenas. For 6 h, we then recorded every 30 min the position of each individual in each Petri dish, and for the adults whether each was on its feet or lying on its back. One additional measurement was taken 24 h after the start. Only data from individuals on their feet were used in the analysis. Thus, some of the individuals did not produce data for all 14 time points.
2.4. Statistical analysis
All analyses were done in R version 4.3.2 [60]. The code and data for all modeling are available at Zenodo (https://zenodo.org/records/14002690). Data were analyzed using univariate generalized linear mixed models (GLMM; package “lme4” version 1.1-35.1 [61] and package “lmerTest” version 3.1-3 [62]). Model residuals were inspected using the DHARMa package version 0.4.6 [63].
For A. rosae, we modeled the duration of PCI and the distance moved. Variables were appropriately transformed to improve the distribution and variance homogeneity of model residuals (y’ = log (y + 1) for PCI duration and y’ = sqrt(y) for distance moved). Both transformed variables were modeled using Gaussian error distributions. For P. parallelus, the choice was modeled as the probability of an individual sitting on a green patch in the form of a binary response using binomial models with logit link function. As we did not find any difference in patch preference between the two sexes in P. parallelus (GLMM with the structure green preference ~ morph + sex + (1|lab) + (1|id); β = 0.05 ± 0.11, z = 0.40, p = 0.69; estimate refers to difference in patch choice of males relative to females; see also S4 Fig), we excluded sex as a factor from the analysis to keep the models aligned with those of A. rosae and T. castaneum. For T. castaneum, beetles were always recorded as three indistinguishable individuals grouped on a single Petri dish. Therefore, we calculated the preference as the number of individuals found in quinone-reduced flour relative to the number of individuals in control flour. The response was thus modeled as proportion data using binomial models with logit link function. For an overview of all biological variables modeled, see Table 1.
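As a minimal illustration of the response scales described above, the following Python sketch applies the two transformations used for A. rosae and shows the logit link used for the binomial models. All numeric values are hypothetical, and the function names are introduced only for this example; the models themselves were fitted in R with lme4.

```python
import math


def transform_pci(seconds):
    """log(y + 1) transform, as described for PCI duration."""
    return math.log(seconds + 1.0)


def transform_distance(distance):
    """Square-root transform, as described for distance moved."""
    return math.sqrt(distance)


def logit(p):
    """Link function of the binomial models: probability -> log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Back-transform a model estimate from log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw values (seconds and arbitrary distance units).
pci_seconds = [0.0, 12.0, 200.0, 600.0]
distances = [0.0, 9.0, 144.0]
print([round(transform_pci(y), 3) for y in pci_seconds])
print([transform_distance(y) for y in distances])

# Hypothetical Petri dish: 2 of 3 scorable beetles sit in the quinone-reduced flour.
in_reduced, in_control = 2, 1
proportion = in_reduced / (in_reduced + in_control)
print(round(logit(proportion), 3), round(inv_logit(logit(proportion)), 3))
```

The + 1 in the log transform keeps larvae with a PCI duration of zero in the data set, since log(0) is undefined.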
P. parallelus models fitted “individual identity” and T. castaneum models fitted “Petri dish identity” as random effects, to account for the non-independence of repeated measures. To cope with overdispersion in count data, we further added an observation-level random effect (OLRE; [64]) to the T. castaneum models. We tested for overdispersion using the “performance” package version 0.12.0 [65].
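To make the notion of overdispersion concrete, the sketch below computes a Pearson dispersion ratio for grouped binomial counts under a deliberately simplified intercept-only model. It is an assumption-laden illustration, not a reproduction of the check run on the fitted GLMMs with the “performance” package; the dish counts and the function name `pearson_dispersion_ratio` are hypothetical.

```python
def pearson_dispersion_ratio(successes, trials):
    """Pearson chi-square divided by residual degrees of freedom.

    Assumes an intercept-only binomial model, i.e. one common probability
    for all dishes. Values well above 1 suggest overdispersion.
    """
    if len(successes) != len(trials) or len(successes) < 2:
        raise ValueError("need matching inputs with at least two dishes")
    p_hat = sum(successes) / sum(trials)  # maximum-likelihood estimate of the common probability
    if p_hat in (0.0, 1.0):
        raise ValueError("dispersion is undefined when all outcomes are identical")
    chi_square = sum(
        (y - n * p_hat) ** 2 / (n * p_hat * (1.0 - p_hat))
        for y, n in zip(successes, trials)
    )
    residual_df = len(successes) - 1  # one estimated parameter
    return chi_square / residual_df


# Hypothetical dishes with three scorable beetles each.
# All-or-nothing dishes vary more than a binomial model with p = 0.5 expects.
clumped = pearson_dispersion_ratio([0, 3, 0, 3], [3, 3, 3, 3])
# Dishes that all show the same proportion vary less than expected.
uniform = pearson_dispersion_ratio([1, 1, 1], [2, 2, 2])
print(clumped, uniform)
```

An observation-level random effect gives each observation its own random intercept, which absorbs this kind of extra-binomial variation.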
We evaluated the reproducibility of the treatment effects across all three laboratories, by following the approach of von Kortzfleisch and colleagues (2020) [66]: Firstly, we estimated the consistency of treatment effects across the laboratories. Secondly, we evaluated how accurately each laboratory was able to predict the overall effect and effect size.
To evaluate the consistency of treatment effects, we fitted GLMMs with only the treatment as fixed effect separately for each laboratory within each experiment (plus random effects for P. parallelus and T. castaneum as described above). In a second analysis, the data of each experiment were pooled across laboratories and “laboratory” as well as the interaction between “treatment” and “laboratory” were fitted as fixed effects. We then used likelihood ratio tests (car-package version 3.1-2 [67]) to assess the statistical significance of the “treatment-by-laboratory” interaction terms.
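The logic of a likelihood ratio test for the interaction term can be sketched as follows. With two treatment levels and three laboratories, the interaction adds (2 − 1) × (3 − 1) = 2 parameters, so the test statistic is compared against a chi-square distribution with 2 degrees of freedom, whose upper tail has the closed form exp(−x/2). The log-likelihood values below are hypothetical and the function names are assumptions for this example; the actual tests were run in R.

```python
import math


def lrt_statistic(loglik_reduced, loglik_full):
    """Twice the log-likelihood gain of the full model over the reduced model."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-square distribution with exactly 2 df."""
    if x < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-x / 2.0)


# Hypothetical log-likelihoods of nested models fitted to the same data:
# reduced = treatment + laboratory; full = treatment * laboratory.
loglik_without_interaction = -250.0
loglik_with_interaction = -245.0

statistic = lrt_statistic(loglik_without_interaction, loglik_with_interaction)
p_value = chi2_upper_tail_df2(statistic)
print(statistic, p_value)
```

A small p-value here means that the treatment effect differs between laboratories by more than chance would explain, which is the sense in which the interaction term indicates impaired reproducibility.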
For the evaluation of how accurately the laboratories were able to predict the overall treatment effect, three measurements were calculated: the coverage probability (Pcov), the proportion of consistently significant results (Psig), and proportion of accurate results (Pacc). Pcov represents the proportion of experiments in which the 95% confidence interval (CI95) of the estimate covered the meta-analytic average effect. The meta-analytic overall effect size was estimated by a random-effect meta-analysis using the metafor-package (version 4.4-0 [68]) based on the individual treatment effect sizes and standard errors of all laboratories.
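The pooling step can be illustrated with a compact random-effects meta-analysis. The sketch below uses the DerSimonian-Laird estimator of between-laboratory variance because it has a simple closed form; this is a swapped-in simplification, and the estimator actually used with the metafor package may differ. The effect sizes, standard errors, and the function name `random_effects_meta` are hypothetical.

```python
import math


def random_effects_meta(estimates, std_errors, z=1.96):
    """Random-effects pooled estimate with a DerSimonian-Laird tau^2.

    Returns (pooled estimate, its standard error, tau^2, 95% CI).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-laboratory variance, truncated at zero
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    pooled_se = math.sqrt(1.0 / sum(re_weights))
    ci = (pooled - z * pooled_se, pooled + z * pooled_se)
    return pooled, pooled_se, tau2, ci


# Hypothetical treatment effects and standard errors from three laboratories.
lab_estimates = [-1.2, -0.3, -0.6]
lab_std_errors = [0.2, 0.25, 0.15]
pooled, pooled_se, tau2, ci = random_effects_meta(lab_estimates, lab_std_errors)
print(round(pooled, 3), round(pooled_se, 3), round(tau2, 3), ci)
```

When the laboratories disagree more than their standard errors predict, tau² becomes positive, the weights even out, and the confidence interval of the overall effect widens.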
The proportion of consistently significant results (Psig) quantifies the proportion of laboratory-specific experiments that are congruent with the overall significance (P < 0.05 versus P > 0.05) and, if significant, also in sign. We did not find cases of significance in opposite directions, so we do not discuss this situation further. To determine overall statistical significance, we examined whether the CI95 of the overall effect as estimated by meta-analyses (as described above) contained 0 (i.e., indicating an overall not significant effect) or not (i.e., indicating an overall significant effect).
For the most conservative measure, the proportion of accurate results (Pacc), we then combined both measures of Pcov and Psig. To evaluate whether a laboratory accurately predicted the overall effect and effect size, two conditions needed to be met: the 95% confidence interval (CI95) of the estimate covered the meta-analytic average effect (see Pcov), and the statistical significance had to agree in direction with the overall effect (see Psig; compare [69] and [66] for graphical illustrations of this notion).
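The three measures can be written down compactly. The sketch below assumes that each laboratory contributes one effect estimate with a standard error, that CI95 is computed as estimate ± 1.96 × SE, and that significance is read off from whether an interval excludes zero; these computational details, the numbers, and the function names are assumptions for illustration rather than the authors' code.

```python
def ci95(estimate, std_error, z=1.96):
    """Normal-approximation 95% confidence interval."""
    return (estimate - z * std_error, estimate + z * std_error)


def significance_sign(interval):
    """+1 or -1 if the interval excludes zero (sign of the effect), else 0."""
    if interval[0] > 0:
        return 1
    if interval[1] < 0:
        return -1
    return 0


def reproducibility_measures(estimates, std_errors, overall_estimate, overall_ci):
    """Return (Pcov, Psig, Pacc) for a set of replicate experiments."""
    overall_sign = significance_sign(overall_ci)
    covered, consistent, accurate = [], [], []
    for est, se in zip(estimates, std_errors):
        interval = ci95(est, se)
        cov = interval[0] <= overall_estimate <= interval[1]  # covers the overall effect
        sig = significance_sign(interval) == overall_sign     # same significance and sign
        covered.append(cov)
        consistent.append(sig)
        accurate.append(cov and sig)                          # both conditions met
    n = len(estimates)
    return sum(covered) / n, sum(consistent) / n, sum(accurate) / n


# Hypothetical example: overall effect -0.7 with CI95 (-1.1, -0.3), three laboratories.
p_cov, p_sig, p_acc = reproducibility_measures(
    estimates=[-1.2, -0.3, -0.6],
    std_errors=[0.2, 0.25, 0.15],
    overall_estimate=-0.7,
    overall_ci=(-1.1, -0.3),
)
print(p_cov, p_sig, p_acc)
```

In this hypothetical case the first laboratory is significant but overestimates the effect, the second covers the overall effect but is not significant, and only the third meets both conditions, so Pacc is lower than either Pcov or Psig.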
3. Results
3.1. Experiment-specific results
3.1.1. Activity of Athalia rosae.
We examined the effects of starvation on larval behavior in A. rosae. Overall, we recorded 360 measurements across the two behavioral activity measures of 180 individual sawfly larvae. We found a significant correlation between the two activity measurements: the longer the distance moved, the shorter the PCI duration (ρ = −0.57, p < 0.001).
Across laboratories, we found treatment-specific differences for both behaviors (PCI and distance moved). Starved A. rosae larvae remained in PCI for a significantly shorter time (two-sided Wilcoxon rank-sum test, W = 6,144, p < 0.001; see Fig 2A) and were more active, moving a greater distance (W = 1,475, p < 0.001; see Fig 2C).
When comparing the results for each of the laboratories, we found that the overall effect of starvation on PCI duration was reproduced in only two out of the three replicates (Lab A: W = 871.5, p < 0.001; Lab C: W = 621, p = 0.01; see Fig 2B). While the direction of the effect was similar for those experiments, the length of the PCI duration of the control group was much more prolonged and less variable in Lab A (on average 425 ± 182 s) compared to Lab C (200 ± 213 s). Only Lab C covered the overall effect size, while Lab A overestimated it (see Fig 3A). In contrast, the third laboratory did not detect a statistically significant difference between the starved and control larvae (Lab B: W = 549.5, p = 0.14; see Fig 3A; but please see critical discussion on the use of p-values in paragraph 4.1.3). This impaired reproducibility between laboratories was further reflected in a significant “treatment-by-laboratory” interaction term (LRT: χ2(2) = 30.12, p = 0.006).
For the distance moved, we found significant treatment differences in each of the three laboratories (Wilcoxon signed-rank test, Lab A: W = 67, p < 0.001; Lab B: W = 239, p < 0.01; Lab C: W = 147, p < 0.001; see Fig 2D). Nevertheless, we found a significant “treatment-by-laboratory” interaction term in the pooled analysis (LRT: χ2(2) = 51.25, p = 0.024). Only two laboratories (Lab A and Lab C) recovered the overall effect size, while the third laboratory (Lab B) underestimated the magnitude (see Fig 3B).
3.1.2. Substrate choice in Pseudochorthippus parallelus.
We tested whether grasshoppers' color morphs differed in their preference for either green or brown substrate patches. Overall, we recorded 8,784 positions of 185 individual grasshoppers. In 15.1% of those observations (1,329 instances), grasshoppers were sitting on the floor and could thus be assigned to one of the two patch colors (see S5 Fig). In the remaining 84.9% of observations (7,455 instances), grasshoppers were sitting on the cage walls, under the cage lids, or on the bundles of grass and were consequently not assigned any patch preference.
Among the records of grasshoppers sitting on one of the colored patches, there were no significant morph-specific differences in substrate choice, neither across all laboratories (two-sided Wilcoxon signed-rank test, W = 3626.5, p = 0.33; see Fig 4A) nor within each laboratory (Lab A: W = 358, p = 0.34, Lab B: W = 490, p = 0.28; Lab C: W = 361.5, p = 0.09; see Fig 4B). All replicate CIs covered the overall effect size but also included zero effect (see Fig 3C). The consistency of these results across the replicates with the overall effect was reflected in the non-significant “treatment-by-laboratory” interaction term (LRT: χ2(2) = 1.55, p = 0.46).
3.1.3. Niche preference in Tribolium castaneum.
We tested whether adult and larval flour beetles differ in their niche preference for flour with protective quinone-rich secretions or flour with drastically reduced quinone content. Overall, we recorded 24,661 positions of Tribolium larvae (12,328) and adults (all virgin females, 12,333) across all Petri dishes. In 33.4% of the observations (4,116 instances), adult beetles were on their feet, so that a preference could be recorded.
We found a significantly higher preference for the secretion-less flour in adults (56.8%) compared to larvae (53.3%; two-sided Wilcoxon signed-rank test, W = 56,790, p = 0.02; see Fig 5A). When studying the preference within each laboratory, however, only two out of three laboratories replicated the differences in flour preference between the larvae and adult beetles (Lab A: W = 3914.5, p = 0.03; Lab C: W = 8,494, p = 0.02; see Fig 5B). Out of these two cases, the confidence intervals of only one replicate (Lab A) covered the overall effect size, while the other lab (Lab C) observed a larger treatment effect than the overall effect (see Fig 3D). The third laboratory did not replicate significant group differences (Lab B: W = 7699.5, p = 0.30). Consistent with these findings, the significant ‘treatment-by-laboratory’ interaction (LRT: χ2(2) = 10.51, p = 0.005) echoed the significant differences among laboratories.
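As a worked check of how a likelihood-ratio statistic on two degrees of freedom maps to a p-value, the following sketch uses the closed form of the chi-square survival function for df = 2, P(X > x) = exp(−x/2). The function name is hypothetical, and the sketch only converts a reported statistic into a p-value; it does not refit the underlying models.

```python
import math

def lrt_p_value_df2(chi_square):
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom.

    For df = 2 the chi-square distribution is exponential with mean 2,
    so the survival function is exp(-x / 2).
    """
    if chi_square < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-chi_square / 2.0)

# Statistics reported for the T. castaneum and P. parallelus interaction terms
p_tribolium = lrt_p_value_df2(10.51)
p_grasshopper = lrt_p_value_df2(1.55)
print(round(p_tribolium, 3), round(p_grasshopper, 2))
```

Applied to the values reported for T. castaneum (χ2(2) = 10.51) and P. parallelus (χ2(2) = 1.55), this gives approximately 0.005 and 0.46, respectively.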
3.2. Consistency and accuracy of the treatment effects across replicate experiments
When replicating a total of four outcome measures within three unique experimental set-ups, we found overall statistically significant treatment effects in three out of the four outcomes (see Table 3). For the fourth outcome measure, the substrate choice in the grasshopper P. parallelus, the treatment effect was neither significant across nor within laboratories (see Table 3). Across all findings, we found the direction of the overall treatment effect (Psig) reproduced in 83% of cases (10 out of 12 replicates), while in 58% of cases, both the effect size and significance were accurately predicted (Pacc; 7 out of 12; see Table 3).
Since the overall treatment effect for the grasshopper experiment did not reproduce the treatment effect of the original study [40], we focused the further reproducibility assessment on the two experiments that successfully replicated their original study effects. For these three outcomes, we were able to reproduce the overall statistical effect (Psig) in 77% of the replicates (seven out of nine). But in none of the outcomes did we have perfect reproducibility across all three laboratories. Whenever the overall significant treatment effect was not reproduced, it was always in the same location (Lab B; see Table 3). The replicate experiments covered the overall effect size of the three significant outcome measures in only 55% of cases (Pcov; five out of nine). In three cases, the effect was overestimated; in one replicate, it was underestimated. When evaluating how well the replicate experiments were able to accurately predict both the overall significant treatment effect and the effect size (Pacc), we found this to be true in 44% of cases (four out of nine replicates).
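The percentages in this section follow directly from the replicate counts. The short sketch below recomputes them from the counts stated in the text; the dictionary labels are chosen here for illustration. It shows each proportion to one decimal place next to its integer part, which suggests that the reported percentages are truncated rather than rounded (for example, seven out of nine is 77.8%).

```python
import math

# (successes, total replicates) as stated in the text
counts = {
    "Psig, all four outcomes": (10, 12),
    "Pacc, all four outcomes": (7, 12),
    "Psig, three significant outcomes": (7, 9),
    "Pcov, three significant outcomes": (5, 9),
    "Pacc, three significant outcomes": (4, 9),
}

def as_percent(successes, total):
    """Return the percentage truncated to an integer and rounded to one decimal."""
    exact = 100.0 * successes / total
    return math.floor(exact), round(exact, 1)

summary = {label: as_percent(*pair) for label, pair in counts.items()}
for label, (truncated, one_decimal) in summary.items():
    print(f"{label}: {one_decimal}% (integer part: {truncated}%)")
```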
When comparing the reproducibility between the individual outcomes, the distance moved in A. rosae, an experiment with automated data collection, had the highest reproducibility. Here all three laboratories reproduced the overall significant treatment effect (Psig), and two out of three laboratories also covered the overall effect size as well as predicted the treatment effect (Pacc; see Table 3). The niche choice of T. castaneum (manual data collection) showed the lowest reproducibility, with none of the measures (Psig, Pcov, or Pacc) being unanimous across all three laboratories.
In addition, the laboratory that originally designed and conducted the experiment did not necessarily cover the overall effect best. More specifically, we found that in one case the inexperienced laboratory replicated the results as well as the experienced laboratory (A. rosae, distance moved). In two cases (T. castaneum and the PCI duration of A. rosae), the inexperienced locations even achieved better reproducibility than the laboratory that had developed the experiment.
4. Discussion
The reproducibility of studies involving insect species is an underexplored area. Ecological and evolutionary processes often operate over large spatial and temporal scales, making a re-collection of appropriate data difficult and in some cases impossible. Consequently, systematic investigations of reproducibility are scarce in eco-evolutionary studies, rendering it even more important to approach the issue experimentally. Our study aimed to fill this gap by conducting a multi-laboratory study on the reproducibility of three behavioral ecological experiments. Within this framework, we compared the consistency and accuracy with which the replicate experiments predicted the overall treatment effects, both in terms of statistical significance and effect size.
Across our three experiments, we documented an overall prevalence rate of irreproducibility of 17–42%. The 42% thereby reflect the proportion of irreproducibility according to the strictest criterion (Pacc), namely the correct estimation of the effect size in combination with the replication of the significance of the overall treatment effect, while only 17% of the results did not reproduce the same direction of significance (Psig). With respect to our third experiment (P. parallelus), none of the laboratories were able to reproduce the previously described significant treatment effect.
With these findings, this study documents challenges in achieving reproducibility of ecological studies on insect behavior. In comparison to other systematic replication studies [3,6,7], however, we observed higher reproducibility rates, suggesting that while reproducibility issues do exist in insect studies, they might be less pronounced than in other areas of science.
4.1. Causes of poor reproducibility
4.1.1. Standardization and reproducibility.
According to the previously postulated “standardization fallacy” [34,35], different laboratories might produce increasingly idiosyncratic results as standardization within laboratories becomes more rigorous (see also [1]). In the present study, potential effects of “over-standardization” became evident on three levels: First, we found significant “treatment-by-laboratory” interactions for A. rosae and T. castaneum experiments, indicating that the detected treatment effect might have been idiosyncratic to the specific laboratory, which would match the findings of Crabbe and colleagues [12]. This is even more surprising, as protocols were harmonized across laboratories, thereby reaching a higher standard than is usually reached in replicate studies. In fact, a recent study on pharmacological effects in mice clearly demonstrated that harmonization of protocols across laboratories reduced between-laboratory variation substantially compared to a situation where each laboratory uses its own local protocol [70]. In contrast, a recent analysis of coordinated distributed experiments in ecology suggests that reducing methodological heterogeneity across sites does not consistently reduce variation in observed effect sizes, possibly due to higher intrinsic biological variability among locations that predominates over methodological variance [71]. Thus, the effectiveness of standardization in minimizing between-laboratory variability may depend upon the inherent biological heterogeneity characteristic of the study system.
Second, we covered different levels of standardization by including three species that inherently varied in their degrees of genetic diversity and habituation to the laboratory: P. parallelus was the least standardized population, with unknown life histories and no habituation to laboratory conditions. In contrast, T. castaneum represented the most standardized model, with all individuals stemming from a homogenous, laboratory-adapted population that had been controlled for age and life history. A. rosae fell in between, descending from a laboratory stock population that was annually augmented with wild-caught individuals, but individuals used in the experiment were already from the 12th generation of purely laboratory-reared cohorts. Interestingly, the results of the A. rosae and T. castaneum experiments were not as consistent across our three laboratories as might be expected by such a comparatively high degree of standardization. By contrast, when focusing on the results of our study and neglecting the comparison with the original study, the results of the P. parallelus experiment were the most consistent across replicates. This might indicate that greater heterogeneity of the individuals within a study species could potentially benefit the representativeness of the study population and hence lead to better reproducibility across studies [29]. Similarly, an ecological study on grass grown in microcosms showed that a controlled systematic increase in genetic variation reduced variation among laboratories [72]. Apart from that, differences in reproducibility might also stem from variability in parental environmental conditions. While all A. rosae individuals were taken from one generation and population, T. castaneum individuals came from different generations, and wild-caught P. parallelus very likely also belonged to different cohorts. Finally, the lack of a significant treatment effect in the P. parallelus experiments could also have made it easier to achieve reproducibility.
Third, with regard to reproducing results from the original studies, the A. rosae and T. castaneum experiments performed markedly better than the P. parallelus experiment, as none of our laboratories found any substrate preference as a function of color morph. Thus, although the reproducibility seen within our multi-laboratory approach may be considered high for this experiment, we failed to reproduce the effect seen in the original study [40]. Notably, out of the three experiments, the experimental set-up of the P. parallelus experiment deviated the most from the design of the original study. For instance, our experiment differed in terms of cage size, substrate design, and age structure, potentially explaining the observed discrepancies. At the same time, however, such conflicting findings highlight problems arising from over-standardization within experiments, as this makes it even harder to reproduce findings across the variation that inevitably exists between experiments and laboratories [29,30].
4.1.2. Specific experimental variables and reproducibility.
In rodent studies, significant research has been conducted to identify and rank sources of experimental variation. Along these lines, it has been repeatedly highlighted that one of the most confounding factors is the experimenter [73–75]. Precisely what differentiates experimenters remains unknown, but there is some evidence that aspects such as the sex [76,77] or the familiarity and training of the experimenter play a decisive role [78]. Until now, however, it is unclear whether similar patterns also hold for insect studies. Theoretically, the effect of the experimenter depends on an organism’s sensitivity to environmental cues. Zebrafish models, for instance, appear to be more resilient to experimenter effects compared to rodents, likely because they are less exposed to experimenter-specific pheromones and odor [79]. In the present study, we standardized the experience levels of the lead experimenters, ensuring that none had prior experience with the specific study organisms or behavioral assays. In two of the laboratories, however, they were assisted by researchers experienced with insect species and bioassays related to their laboratory. Adopting this approach ensured similar conditions across laboratories, thereby reducing but not excluding the risk of introducing any experimenter effects.
Reproducibility also varied with the levels of manual handling and scoring required: The variables measured by automated tracking software, i.e., the distance moved in the sawfly larvae, yielded the most consistent results across laboratories, while the behavior requiring manual handling in the same experiment, i.e., triggering PCI, showed substantially greater variation. In line with these findings, automated systems have been shown to yield more reproducible results across laboratories than manually conducted tests in rodents [80–82]. Likewise, it has been argued that computer algorithms, once programmed and trained, lead to more consistent and unbiased measurements [83,84], suggesting that the absence of human interference is a prominent advantage (cf. [85]).
Furthermore, reproducibility varied by trait: effects of starvation on the PCI duration in A. rosae larvae and substrate preferences of T. castaneum showed high variation between laboratories. Similarly, research on rodents found highly reproducible strain differences for ethanol preference and locomotor activity, while strain differences for anxiety-like behavior strongly depended on the local conditions [86]. Most likely, such findings indicate that processes underlying the different constructs might be more or less sensitive to environmental fluctuations, hence yielding more or less idiosyncratic findings.
Lastly, species-specific characteristics might also contribute to differences in reproducibility. In this respect, the T. castaneum experiment highlights how age can impact reproducibility. Whereas adult behavior was consistent with the original study, larvae here showed a preference for secretion-reduced flour, unlike the original study [41]. This difference might be explained by our use of older larvae (17 vs. 12 days), with the older aged larvae likely reaching a different instar, closer to pupation. While younger larvae feed constantly [87] and likely depend on external immunity provided by the antimicrobial secretions in the flour, old larvae nearing their pre-pupal phase might prefer a niche with lower densities to avoid dangers of cannibalism. Here, the secretion-reduced flour could act as a signal of lower density of conspecifics [88]. Similarly, some A. rosae larvae may have been near molting in certain laboratories. During the molting period, larvae do not feed, and thus starvation may not have had a strong effect on them. Standardizing age in larvae might, however, be particularly challenging when larval stages are hard to discriminate.
4.1.3. Sample size, statistical significance, and reproducibility.
Larger sample sizes are known to increase the statistical power of a study, i.e., the probability of rejecting a false null hypothesis [89,90]. It has thus been argued that poor reproducibility of rodent studies is at least partly due to a lack of sufficiently powered experiments [91]. In invertebrate studies, sample sizes are typically considerably higher and might include hundreds or even thousands of animals [92,93]. The sample sizes included in our experiments (see Table 1 and S4–S7 Tables) were well in the range to be considered sufficiently powered [91]. Despite this, however, we still observed limited reproducibility of some findings, indicating that increasing the power is one important step toward improvement, but certainly not the only one.
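The relationship between sample size and power can be made concrete with a minimal sketch based on a normal approximation for a two-sided comparison of two equally sized groups. This is an illustration only: the standardized effect size of 0.5 and the group sizes are hypothetical, and the approximation does not correspond to the rank-based tests used in this study.

```python
import math
from statistics import NormalDist

def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size_d is the standardized mean difference; both groups are
    assumed to have n_per_group observations and equal known variance.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the z statistic under the alternative hypothesis
    shift = effect_size_d * math.sqrt(n_per_group / 2.0)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

power_small = approx_power(0.5, 20)   # hypothetical small experiment
power_large = approx_power(0.5, 64)   # hypothetical larger experiment
print(round(power_small, 2), round(power_large, 2))
```

Under these assumptions, 20 animals per group give a power of roughly 0.35, whereas 64 per group give roughly 0.8 for the same effect size.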
Apart from that, the reproducibility crisis has also led to growing concerns about the sole use of p-values and statistical significance for reporting findings. In particular, it has been argued that degrading p-values into significant and non-significant findings contributes to making studies irreproducible, or to making them seem irreproducible [94]. In the present study, we used three different measures to compare the reproducibility of results across replicate experiments. These measures increased in their rigor from evaluating reproducibility based solely on the significance of p-values, to the replication of the overall effect sizes and, finally, to the combination of both conditions. In contrast to p-values, effect sizes are independent of sample size. Adding effect sizes and their confidence intervals to publications has therefore been proposed to contribute to a more nuanced interpretation of experimental data [95,96]. We were able to reproduce the overall statistical significance of the treatment effect in 83% of the replicate experiments; however, reproducing the overall effect size was successful in only 66%, while both criteria were achieved in 58% of cases. Such a discrepancy between replicating statistical significance and effect size has been observed in other fields, too. For example, in cancer biology, 79% of replication studies had the same statistical significance, while only 18% reproduced the original effect size [7]. Whether effect sizes or statistical significance are easier to reproduce appears to be field-specific (in psychology: 47% versus 36% [6]; in economics: 61% versus 67% [10]).
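The contrast between p-values and effect sizes can be illustrated with a short sketch under simplifying assumptions: a two-sided z-test for two equally sized groups with known variance, and a hypothetical standardized mean difference of 0.5 that is held fixed while only the sample size changes.

```python
import math

def z_test_p_value(effect_size_d, n_per_group):
    """Two-sided p-value for a standardized mean difference (normal approximation)."""
    z = effect_size_d * math.sqrt(n_per_group / 2.0)
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability
    return math.erfc(abs(z) / math.sqrt(2.0))

d = 0.5  # hypothetical standardized effect size, identical in both scenarios
p_small_sample = z_test_p_value(d, 10)
p_large_sample = z_test_p_value(d, 100)
print(f"d = {d}: p = {p_small_sample:.3f} (n = 10), p = {p_large_sample:.4f} (n = 100)")
```

The same observed effect size is classified as non-significant with 10 animals per group and as significant with 100, which is why a dichotomous reading of p-values can make two replicates with identical effect sizes appear inconsistent.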
4.2. Toward better reproducibility in insect studies
Identifying sources of poor reproducibility represents only the first step toward improvement. It is just as important to foster ways out of the crisis. In this respect, a number of strategies have already been developed that address methodological shortcomings or aim to improve the overall publication culture [23,24,26]. Although most of these strategies have been tailored towards rodent studies, they can be easily broadened to also encompass insect experiments. For instance, identifying additional reliable outcome measures in insect behavior could enhance the robustness of insect studies. This approach mirrors successful techniques used in rodent research, where the use of computer algorithms has expanded the repertoire of commonly used behavioral paradigms to include newly established measures, such as specific movement patterns [97–99]. Likewise, to minimize potential experimenter effects, several methodological strategies, such as blinding or the use of automated test systems, could be implemented more systematically [78,85,100,101]. Also, replicating studies independently and increasing sample sizes per single experiment would be more easily possible in insect studies, as invertebrates are comparatively easy to keep and breed and experiments are less constrained by ethical considerations and regulatory restrictions. Furthermore, to address the problem of over-standardization, systematic heterogenization of experimental conditions within laboratories has been proposed as a tool to deliberately incorporate known sources of biological variation into the experimental design (e.g., [1,27–29,102]). According to this idea, the introduction of variation on a systematic and controlled basis predicts increased external validity and hence improved reproducibility [30]. For example, one could imagine introducing biological variation by using insect populations from multiple geographic locations or from different cohorts or generations. 
For some insects, in particular Drosophila, different levels of genetic homogenization can also be included (DGRP lines; [103]). Regarding experimental conditions, studies could be conducted across different rooms, spread across weeks or seasons [66], or, depending on the biology of the model species, at different times of the day [104]. Lastly, an implementation of open research practices throughout the research cycle is strongly advocated, independent of the research area [105,106].
5. Conclusions
With the present study, we aimed to systematically explore the reproducibility of ecological studies on insect behavior across independent replicate experiments. By means of a multi-laboratory approach, we uncovered difficulties in reproducing the overall estimate of the effect size as well as the direction of significance, documenting that reproducibility problems might also exist in insect studies. However, parts of the results are also encouraging, since reproducibility in our experiments was higher than in replication studies from other fields. In this way, we wish to raise awareness of the topic and encourage the implementation of potential improvement strategies. We believe that addressing the reproducibility crisis requires a comprehensive solution involving methodological improvements, better experimental design, and a collaborative effort across research communities. Specifically, adopting open research practices and introducing systematic variation through multi-laboratory or heterogenized designs as well as implementing preregistrations may further enhance reproducibility in insect studies.
Supporting information
S1 Fig. Arrangement of Petri dishes for the A. rosae experiment with starvation or control treatment.
Photo credit: Maximilian Schurig.
https://doi.org/10.1371/journal.pbio.3003019.s001
(TIF)
S2 Fig. Top-down view of a single cage for the P. parallelus experiment.
Photo credit: Maximilian Schurig.
https://doi.org/10.1371/journal.pbio.3003019.s002
(TIF)
S3 Fig. Setup of a single Petri dish containing (A) three female virgin adults and (B) three larvae for the T. castaneum experiment.
Photo was taken 24 h after the start of the experiment. Photo credit: Maximilian Schurig.
https://doi.org/10.1371/journal.pbio.3003019.s003
(TIF)
S4 Fig. Substrate preference in P. parallelus morphs across all labs, split by sex.
Data are presented as boxplots showing medians, 25% and 75% percentiles, and 5% and 95% percentiles. Statistics: Wilcoxon signed-rank test, two-sided, on the untransformed data; *p < 0.05, **p < 0.01, ***p ≤ 0.001. The data and code needed to reproduce this Figure can be found in https://zenodo.org/records/14002690. The data summarized in the Figures can be found in S7 Table.
https://doi.org/10.1371/journal.pbio.3003019.s004
(TIF)
S5 Fig. Choice of P. parallelus across all laboratories for sitting on one of the substrate patches (brown or green), or on neither patch (instead sitting, e.g., on the walls or lid of the cage).
In 15.1% of those observations (1,329 instances), grasshoppers were sitting on the floor and could thus be assigned to one of the two patch colors. In the remaining 84.9% of observations (7,455 instances), grasshoppers were sitting on the cage walls, under the cage lids, or on the bundles of grass. Data are presented as boxplots showing medians, 25% and 75% percentiles, and 5% and 95% percentiles. The data and code needed to reproduce this Figure can be found in https://zenodo.org/records/14002690. The data summarized in the Figures can be found in S8 Table.
https://doi.org/10.1371/journal.pbio.3003019.s005
(TIF)
S1 Table. Details on housing conditions, animals, preparation of materials and setup, experimental phase, and experimenter-specific characteristics for the Athalia experiment for each laboratory.
https://doi.org/10.1371/journal.pbio.3003019.s006
(DOCX)
S2 Table. Details on housing conditions, animals, preparation of materials and setup, experimental phase, and experimenter-specific characteristics for the Pseudochorthippus experiment for each laboratory.
https://doi.org/10.1371/journal.pbio.3003019.s007
(DOCX)
S3 Table. Details on housing conditions, animals, preparation of materials and setup, experimental phase, and experimenter-specific characteristics for the Tribolium experiment for each laboratory.
https://doi.org/10.1371/journal.pbio.3003019.s008
(DOCX)
S4 Table. (A) Descriptive Statistics of the outcome measure “PCI duration” [sec] in the Athalia experiment for each group across all labs. (B) Descriptive Statistics of the outcome measure “PCI duration” [sec] in the Athalia experiment within each lab and group. (C) Descriptive Statistics of the outcome measure “distance moved” [cm] in the Athalia experiment for each group across all labs. (D) Descriptive Statistics of the outcome measure “distance moved” [cm] in the Athalia experiment within each lab and group.
https://doi.org/10.1371/journal.pbio.3003019.s009
(DOCX)
S5 Table. (A) Descriptive statistics of the outcome measure “substrate choice” as percent of individuals on green patch [%] in the Pseudochorthippus experiment for each morph type across all labs. (B) Descriptive statistics of the outcome measure “substrate choice” as percent of individuals on green patch [%] in the Pseudochorthippus experiment within each lab and morph type.
https://doi.org/10.1371/journal.pbio.3003019.s010
(DOCX)
S6 Table. (A) Descriptive statistics of the outcome measure “niche choice” as percent of individuals in dark-flour [%] in the Tribolium experiment for each group across all labs. (B) Descriptive statistics of the outcome measure “niche choice” as percent of individuals in dark-flour [%] in the Tribolium experiment within each lab and group.
https://doi.org/10.1371/journal.pbio.3003019.s011
(DOCX)
S7 Table. Descriptive statistics of the outcome measure “substrate choice” as percent of individuals on green patch [%] in the Pseudochorthippus experiment within each lab, sex, and morph type.
https://doi.org/10.1371/journal.pbio.3003019.s012
(DOCX)
S8 Table. Descriptive statistics of the recorded location [%] across all individuals in the Pseudochorthippus experiment across all labs.
https://doi.org/10.1371/journal.pbio.3003019.s013
(DOCX)
References
- 1. von Kortzfleisch VT, Richter SH. Systematic heterogenization revisited: Increasing variation in animal experiments to improve reproducibility? J Neurosci Methods. 2024;401:109992. pmid:37884081
- 2. Plesser HE. Reproducibility vs. replicability: a brief history of a confused terminology. Front Neuroinform. 2018;11:76. pmid:29403370
- 3. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124. pmid:16060722
- 4. Freedman LP, Cockburn IM, Simcoe TS. The economics of reproducibility in preclinical research. PLoS Biol. 2015;13(6):e1002165. pmid:26057340
- 5. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4. pmid:27225100
- 6. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. pmid:26315443
- 7. Errington TM, Mathur M, Soderberg CK, Denis A, Perfito N, Iorns E, et al. Investigating the replicability of preclinical cancer biology. eLife. 2021;10:e71601. pmid:34874005
- 8. Niven DJ, McCormick TJ, Straus SE, Hemmelgarn BR, Jeffs L, Barnes TRM, et al. Reproducibility of clinical research in critical care: a scoping review. BMC Med. 2018;16(1):26. pmid:29463308
- 9. Nosek BA, Hardwicke TE, Moshontz H, Allard A, Corker KS, Dreber A, et al. Replicability, robustness, and reproducibility in psychological science. Annu Rev Psychol. 2022;73:719–48. pmid:34665669
- 10. Camerer CF, Dreber A, Forsell E, Ho T-H, Huber J, Johannesson M, et al. Evaluating replicability of laboratory experiments in economics. Science. 2016;351(6280):1433–6. pmid:26940865
- 11. Christensen G, Miguel E. Transparency, reproducibility, and the credibility of economics research. J Econ Literature. 2018;56(3):920–80.
- 12. Crabbe JC, Wahlsten D, Dudek BC. Genetics of mouse behavior: interactions with laboratory environment. Science. 1999;284(5420):1670–2. pmid:10356397
- 13. Mandillo S, Tucci V, Hölter SM, Meziane H, Banchaabouchi MA, Kallnik M, et al. Reliability, robustness, and reproducibility in mouse behavioral phenotyping: a cross-laboratory study. Physiol Genomics. 2008;34(3):243–55. pmid:18505770
- 14. Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10(9):712. pmid:21892149
- 15. Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research. Nature. 2012;483(7391):531–3. pmid:22460880
- 16. Perrin S. Preclinical research: make mouse studies work. Nature. 2014;507(7493):423–5. pmid:24678540
- 17. Sena ES, van der Worp HB, Bath PMW, Howells DW, Macleod MR. Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biol. 2010;8(3):e1000344. pmid:20361022
- 18. Ioannidis JPA, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383(9912):166–75. pmid:24411645
- 19. Macleod MR, Lawson McLean A, Kyriakopoulou A, Serghiou S, de Wilde A, Sherratt N, et al. Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 2015;13(10):e1002273. pmid:26460723
- 20. Forstmeier W, Wagenmakers E, Parker TH. Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev. 2017;92:1941–68.
- 21. Karp NA, Fry D. What is the optimum design for my animal experiment? BMJ Open Sci. 2021;5(1):e100126. pmid:35047700
- 22. Percie du Sert N, Bamsey I, Bate ST, Berdoy M, Clark RA, Cuthill I, et al. The experimental design assistant. PLoS Biol. 2017;15(9):e2003779. pmid:28957312
- 23. Kilkenny C, Browne WJ, Cuthill IC, Emerson M, Altman DG. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol. 2010;8(6):e1000412. pmid:20613859
- 24. Percie du Sert N, Hurst V, Ahluwalia A, Alam S, Avey MT, Baker M, et al. The ARRIVE guidelines 2.0: updated guidelines for reporting animal research. PLoS Biol. 2020;18(7):e3000410. pmid:32663219
- 25. Parker TH, Forstmeier W, Koricheva J, Fidler F, Hadfield JD, Chee YE, et al. Transparency in ecology and evolution: real problems, real solutions. Trends Ecol Evol. 2016;31(9):711–9. pmid:27461041
- 26. Bert B, Heinl C, Chmielewska J, Schwarz F, Grune B, Hensel A, et al. Refining animal research: the animal study registry. PLoS Biol. 2019;17(10):e3000463. pmid:31613875
- 27. Richter SH, Garner JP, Würbel H. Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat Methods. 2009;6(4):257–61. pmid:19333241
- 28. Richter SH, Garner JP, Auer C, Kunert J, Würbel H. Systematic variation improves reproducibility of animal experiments. Nat Methods. 2010;7(3):167–8. pmid:20195246
- 29. Richter SH. Systematic heterogenization for better reproducibility in animal experimentation. Lab Anim (NY). 2017;46(9):343–9. pmid:29296016
- 30. Voelkl B, Altman NS, Forsman A, Forstmeier W, Gurevitch J, Jaric I, et al. Reproducibility of animal research in light of biological variation. Nat Rev Neurosci. 2020;21(7):384–93. pmid:32488205
- 31. Dingemanse NJ, Kazem AJN, Réale D, Wright J. Behavioural reaction norms: animal personality meets individual plasticity. Trends Ecol Evol. 2010;25(2):81–9. pmid:19748700
- 32. Voelkl B, Würbel H. Reproducibility crisis: are we ignoring reaction norms? Trends Pharmacol Sci. 2016;37(7):509–10. pmid:27211784
- 33. Voelkl B, Würbel H. A reaction norm perspective on reproducibility. Theory Biosci. 2021;140(2):169–76. pmid:33768464
- 34. Voelkl B, Würbel H, Krzywinski M, Altman N. The standardization fallacy. Nat Methods. 2021;18:3.
- 35. Würbel H. Behaviour and the standardization fallacy. Nat Genet. 2000;26(3):263.
- 36. Fidler F, Chee YE, Wintle BC, Burgman MA, McCarthy MA, Gordon A. Metaresearch for evaluating reproducibility in ecology and evolution. Bioscience. 2017;67(3):282–9. pmid:28596617
- 37. Fraser H, Parker T, Nakagawa S, Barnett A, Fidler F. Questionable research practices in ecology and evolution. PLoS One. 2018;13(7):e0200303. pmid:30011289
- 38. Kelly CD. Rate and success of study replication in ecology and evolution. PeerJ. 2019;7:e7654. pmid:31565572
- 39. Singh P, Wolthaus J, Schielzeth H, Müller C. State dependency of behavioural traits is a function of the life stage in a holometabolous insect. Anim Behav. 2023;203:29–39.
- 40. Heinze P, Dieker P, Rowland HM, Schielzeth H. Evidence for morph-specific substrate choice in a green-brown polymorphic grasshopper. Behav Ecol. 2021;33(1):17–26. pmid:35197804
- 41. Lo LK. The role of niche construction for evolvability in the red flour beetle, Tribolium castaneum. PhD Thesis, Universität Münster; 2024. Available from: https://katalogplus.uni-muenster.de/discovery/fulldisplay?docid=alma991045061752806449&context=L&vid=49HBZ_ULM:VU2&lang=de&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,The%20Role%20of%20Niche%20Construction%20for%20Evolvability%20in%20the%20Red%20Flour%20Beetle,%20Tribolium%20castaneum&offset=0
- 42. Milutinović B, Stolpe C, Peuß R, Armitage SAO, Kurtz J. The red flour beetle as a model for bacterial oral infections. PLoS One. 2013;8(5):e64638. pmid:23737991
- 43. Lehmann S, Atika B, Grossmann D, Schmitt-Engel C, Strohlein N, Majumdar U, et al. Phenotypic screen and transcriptomics approach complement each other in functional genomics of defensive stink gland physiology. BMC Genomics. 2022;23(1):608. pmid:35987630
- 44. Unruh LM, Xu R, Kramer KJ. Benzoquinone levels as a function of age and gender of the red flour beetle, Tribolium castaneum. Insect Biochem Mol Biol. 1998;28(12):969–77.
- 45. Joop G, Roth O, Schmid-Hempel P, Kurtz J. Experimental evolution of external immune defences in the red flour beetle. J Evol Biol. 2014;27(8):1562–71. pmid:24835532
- 46. Khan I, Prakash A, Issar S, Umarani M, Sasidharan R, Masagalli JN, et al. Female density-dependent chemical warfare underlies fitness effects of group sex ratio in flour beetles. Am Nat. 2018;191(3):306–17.
- 47. Benson RB. An introduction to the natural history of British sawflies (Hymenoptera Symphyta). Trans Soc Brit Ent. 1950;10:45–142.
- 48. Riggert E. Untersuchungen über die Rübenblattwespe Athalia colibri Christ (A. spinarum F.). Zeitschrift für Angewandte Entomologie. 2009;26(3):462–516.
- 49. Humphreys RK, Ruxton GD. A review of thanatosis (death feigning) as an anti-predator behaviour. Behav Ecol Sociobiol. 2018;72(2):22. pmid:29386702
- 50. Edelaar P, Serrano D, Carrete M, Blas J, Potti J, Tella JL. Tonic immobility is a measure of boldness toward predators: an application of Bayesian structural equation modeling. Behav Ecol. 2012;23(3):619–26.
- 51. Tremmel M, Müller C. Insect personality depends on environmental conditions. Behav Ecol. 2012;24(2):386–92.
- 52. Köhler G, Samietz J, Schielzeth H. Morphological and colour morph clines along an altitudinal gradient in the meadow grasshopper Pseudochorthippus parallelus. PLoS One. 2017;12(12):e0189815. pmid:29284051
- 53. Köhler G, Schielzeth H. Green-brown polymorphism in alpine grasshoppers affects body temperature. Ecol Evol. 2019;10(1):441–50. pmid:31988736
- 54. Prendeville HR, Stevens L. Microbe inhibition by Tribolium flour beetles varies with beetle species, strain, sex, and microbe group. J Chem Ecol. 2002;28(6):1183–90. pmid:12184396
- 55. Yezerski A, Ciccone C, Rozitski J, Volingavage B. The effects of a naturally produced benzoquinone on microbes common to flour. J Chem Ecol. 2007;33(6):1217–25. pmid:17473960
- 56. Otti O, Tragust S, Feldhaar H. Unifying external and internal immune defences. Trends Ecol Evol. 2014;29(11):625–34.
- 57. Verheggen F, Ryne C, Olsson POC, Arnaud L, Lognay G, Högberg HE, et al. Electrophysiological and behavioral activity of secondary metabolites in the confused flour beetle, Tribolium confusum. J Chem Ecol. 2007;33(3):525–39. pmid:17265176
- 58. Flinn PW, Campbell JF. Effects of flour conditioning on cannibalism of T. castaneum eggs and pupae. Environ Entomol. 2012;41(6):1501–4. pmid:23321098
- 59. Faustini DL, Burkholder WE. Quinone-aggregation pheromone interaction in the red flour beetle. Anim Behav. 1987;35(2):601–3.
- 60. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria; 2023. Available from: https://www.R-project.org/
- 61. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Soft. 2015;67(1).
- 62. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest package: tests in linear mixed effects models. J Stat Soft. 2017;82(13).
- 63. Hartig F. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models; 2022. Available from: http://florianhartig.github.io/DHARMa/
- 64. Harrison XA. Using observation-level random effects to model overdispersion in count data in ecology and evolution. PeerJ. 2014;2:e616. pmid:25320683
- 65. Lüdecke D, Ben-Shachar M, Patil I, Waggoner P, Makowski D. performance: an R package for assessment, comparison and testing of statistical models. JOSS. 2021;6(60):3139.
- 66. von Kortzfleisch VT, Karp NA, Palme R, Kaiser S, Sachser N, Richter SH. Improving reproducibility in animal research by splitting the study population into several “mini-experiments”. Sci Rep. 2020;10(1):16579. pmid:33024165
- 67. Fox J, Weisberg S. An R Companion to Applied Regression. Thousand Oaks, CA: Sage; 2019. Available from: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
- 68. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Soft. 2010;36(3).
- 69. Voelkl B, Vogt L, Sena ES, Würbel H. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol. 2018;16(2):e2003693. pmid:29470495
- 70. Arroyo-Araujo M, Voelkl B, Laloux C, Novak J, Koopmans B, Waldron A-M, et al. Systematic assessment of the replicability and generalizability of preclinical findings: impact of protocol harmonization across laboratory sites. PLoS Biol. 2022;20(11):e3001886. pmid:36417471
- 71. Bebout J, Fox JW. Coordinated distributed experiments in ecology do not consistently reduce heterogeneity in effect size. Oikos. 2024;2024(6).
- 72. Milcu A, Puga-Freitas R, Ellison AM, Blouin M, Scheu S, Freschet GT, et al. Genotypic variability enhances the reproducibility of an ecological study. Nat Ecol Evol. 2018;2(2):279–87. pmid:29335575
- 73. Bohlen M, Hayes ER, Bohlen B, Bailoo JD, Crabbe JC, Wahlsten D. Experimenter effects on behavioral test scores of eight inbred mouse strains under the influence of ethanol. Behav Brain Res. 2014;272:46–54. pmid:24933191
- 74. Chesler EJ, Wilson SG, Lariviere WR, Rodriguez-Zas SL, Mogil JS. Influences of laboratory environment on behavior. Nat Neurosci. 2002;5(11):1101–2. pmid:12403996
- 75. Chesler EJ, Wilson SG, Lariviere WR, Rodriguez-Zas SL, Mogil JS. Identification and ranking of genetic and laboratory environment factors influencing a behavioral trait, thermal nociception, via computational analysis of a large data archive. Neurosci Biobehav Rev. 2002;26(8):907–23. pmid:12667496
- 76. Georgiou P, Zanos P, Mou T-CM, An X, Gerhard DM, Dryanovski DI, et al. Experimenters’ sex modulates mouse behaviors and neural responses to ketamine via corticotropin releasing factor. Nat Neurosci. 2022;25(9):1191–200. pmid:36042309
- 77. Sorge RE, Martin LJ, Isbester KA, Sotocinal SG, Rosen S, Tuttle AH, et al. Olfactory exposure to males, including men, causes stress and related analgesia in rodents. Nat Methods. 2014;11(6):629–32. pmid:24776635
- 78. Gulinello M, Mitchell HA, Chang Q, O’Brien WT, Zhou Z, Abel T, et al. Rigor and reproducibility in rodent behavioral research. Neurobiol Learn Mem. 2019;165:106780. pmid:29307548
- 79. de Abreu MS, Kalueff AV. Of mice and zebrafish: the impact of the experimenter identity on animal behavior. Lab Anim (NY). 2021;50(1):7. pmid:33299171
- 80. Krackow S, Vannoni E, Codita A, Mohammed AH, Cirulli F, Branchi I, et al. Consistent behavioral phenotype differences between inbred mouse strains in the IntelliCage. Genes Brain Behav. 2010;9(7):722–31. pmid:20528956
- 81. Lipp H, Litvin O, Galsworthy M, Vyssotski D, Vyssotski AL, Zinn P, et al. Automated behavioral analysis of mice using INTELLICAGE: inter-laboratory comparisons and validation with exploratory behavior and spatial learning. In: Noldus LPJJ, editor. Proceedings of Measuring Behavior 2005: 5th International Conference on Methods and Techniques in Behavioral Research. 2005. p. 66–9.
- 82. Robinson L, Spruijt B, Riedel G. Between and within laboratory reliability of mouse behaviour recorded in home-cage and open-field. J Neurosci Methods. 2018;300:10–9. pmid:29233658
- 83. Spruijt BM, Peters SM, de Heer RC, Pothuizen HHJ, van der Harst JE. Reproducibility and relevance of future behavioral sciences should benefit from a cross fertilization of past recommendations and today’s technology: “Back to the future”. J Neurosci Methods. 2014;234:2–12. pmid:24632384
- 84. Spruijt BM, DeVisser L. Advanced behavioural screening: automated home cage ethology. Drug Discov Today Technol. 2006;3(2):231–7. pmid:24980412
- 85. Richter SH. Automated home-cage testing as a tool to improve reproducibility of behavioral research? Front Neurosci. 2020;14:383. pmid:32390795
- 86. Wahlsten D, Metten P, Phillips TJ, Boehm SL 2nd, Burkhart-Kasch S, Dorow J, et al. Different data from different labs: lessons from studies of gene-environment interaction. J Neurobiol. 2003;54(1):283–311. pmid:12486710
- 87. Sokoloff A. The Biology of Tribolium with Special Emphasis on Genetic Aspects. Volume I. Oxford University Press; 1972.
- 88. Park T, Nathanson M, Ziegler JR, Mertz DB. Cannibalism of pupae by mixed-species populations of adult Tribolium. Physiol Zool. 1970;43(3):166–84.
- 89. Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J. 2003;20(5):453–8. pmid:12954688
- 90. Serdar CC, Cihan M, Yücel D, Serdar MA. Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochem Med (Zagreb). 2021;31(1):010502. pmid:33380887
- 91. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365–76. pmid:23571845
- 92. Bassetto M, Reichl T, Kobylkov D, Kattnig DR, Winklhofer M, Hore PJ, et al. No evidence for magnetic field effects on the behaviour of Drosophila. Nature. 2023;620(7974):595–9. pmid:37558871
- 93. Pointer MD, Spurgin LG, Vasudeva R, McMullan M, Butler S, Richardson DS. Traits underlying experimentally evolved dispersal behavior in Tribolium castaneum. J Insect Behav. 2024;37(3–4):220–32. pmid:39553468
- 94. Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ. 2017;5:e3544.
- 95. Nakagawa S, Cuthill IC. Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc. 2007;82(4):591–605. pmid:17944619
- 96. Sullivan GM, Feinn R. Using effect size-or why the P value is not enough. J Grad Med Educ. 2012;4(3):279–82. pmid:23997866
- 97. Benjamini Y, Lipkind D, Horev G, Fonio E, Kafkafi N, Golani I. Ten ways to improve the quality of descriptions of whole-animal movement. Neurosci Biobehav Rev. 2010;34(8):1351–65. pmid:20399806
- 98. Kafkafi N, Mayo CL, Elmer GI. Mining mouse behavior for patterns predicting psychiatric drug classification. Psychopharmacology (Berl). 2014;231(1):231–42. pmid:23958942
- 99. Lipkind D, Sakov A, Kafkafi N, Elmer GI, Benjamini Y, Golani I. New replicable anxiety-related measures of wall vs center behavior of mice in the open field. J Appl Physiol (1985). 2004;97(1):347–59. pmid:14990560
- 100. Karp NA, Pearl EJ, Stringer EJ, Barkus C, Ulrichsen JC, Percie du Sert N. A qualitative study of the barriers to using blinding in in vivo experiments and suggestions for improvement. PLoS Biol. 2022;20(11):e3001873. pmid:36395326
- 101. Freeberg TM, Benson SA, Burghardt GM. Minimizing observer bias in animal behavior studies revisited: improvement, but a long way to go. Ethology. 2024;130(6).
- 102. Richter SH, Garner JP, Zipser B, Lewejohann L, Sachser N, Touma C, et al. Effect of population heterogenization on the reproducibility of mouse behavior: a multi-laboratory study. PLoS One. 2011;6(1):e16461. pmid:21305027
- 103. Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu D, et al. The Drosophila melanogaster genetic reference panel. Nature. 2012;482(7384):173–8. pmid:22318601
- 104. Bodden C, von Kortzfleisch VT, Karwinkel F, Kaiser S, Sachser N, Richter SH. Heterogenising study samples across testing time improves reproducibility of behavioural data. Sci Rep. 2019;9(1):8247. pmid:31160667
- 105. Cuff JP, Barrett M, Gray H, Fox C, Watt A, Aimé E. The case for open research in entomology: reducing harm, refining reproducibility and advancing insect science. Agri Forest Entomol. 2024;26(3):285–95.
- 106. Wittman JT, Aukema BH. A guide and toolbox to replicability and open science in entomology. J Insect Sci. 2020;20(3):6. pmid:32441307