Table 1.
Typology of quality criteria for in vivo studies.
Table 2.
Scale used for expert elicitation.
Figure 1.
Quality assessment of Tyl et al. (2002), using Qualichem with eight respondents.
Of the 45 criteria applied to the study of Tyl et al. (2002), the figure shows only the 35 controversial ones. The remaining 10 criteria were not controversial according to our definition; they received scores of 5 or 6 and were considered to be of high aggregated quality. The figure is divided into three colored areas: red (scores and medians <3), orange (scores and medians between 3 and 4) and green (scores and medians >4). A line covers the full range, from the lowest to the highest score in the group of responding experts. The median of the scores is represented by an “x” and the interquartile range by a rectangle. If the median (x) is in the red area, the aggregated quality of the criterion is low. If the median is in the orange area, the aggregated quality is average. If the median is in the green area, the aggregated quality is high. The interquartile range is an indicator of inter-expert heterogeneity. Thirty of the 35 controversial criteria were of high aggregated quality (median in the green area). Of the five remaining criteria, three were of average aggregated quality (median in the orange area) and two were of low aggregated quality (median in the red area).
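The per-criterion summary described in this legend (range, median, interquartile range, and colored band) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the example scores, the choice of the "inclusive" quartile method, and the rule that a criterion is non-controversial when every score is 5 or 6 are all assumptions made here. The legend does not say which band the boundary values 3 and 4 belong to, so the sketch places both in the orange band.

```python
from statistics import median, quantiles


def summarize(scores):
    """Summarize one criterion's expert scores (hypothetical helper)."""
    # Assumption: quartiles computed with the "inclusive" method; the
    # source does not state which quartile convention was used.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "low": min(scores),       # lower end of the line
        "high": max(scores),      # upper end of the line
        "median": median(scores), # the "x" marker
        "q1": q1,                 # lower edge of the rectangle
        "q3": q3,                 # upper edge of the rectangle
    }


def band(m):
    """Map a median to the colored area named in the legend."""
    # Assumption: the boundary values 3 and 4 are treated as orange.
    if m < 3:
        return "red"      # low aggregated quality
    if m <= 4:
        return "orange"   # average aggregated quality
    return "green"        # high aggregated quality


def is_controversial(scores):
    """Assumed reading: non-controversial means every score is 5 or 6."""
    return not all(s >= 5 for s in scores)


# Hypothetical scores from eight respondents for a single criterion.
example = [2, 5, 5, 6, 6, 4, 5, 6]
summary = summarize(example)
print(summary, band(summary["median"]), is_controversial(example))
```

A wide rectangle (large difference between `q3` and `q1`) corresponds to the inter-expert heterogeneity mentioned in the legend.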
Figure 2.
Quality assessment of Stump (2009), using Qualichem with four respondents.
Of the 45 possible criteria applied to the report of Stump (2009), the figure shows only the 16 controversial ones. All the other criteria, which are not controversial according to our definition, received scores of 5 or 6 and were considered to be of high aggregated quality. The figure is divided into three colored areas: red (scores and medians <3), orange (scores and medians between 3 and 4) and green (scores and medians >4). A line covers the full range, from the lowest to the highest score in the group of responding experts. The median of the scores is represented by an “x” and the interquartile range by a rectangle. If the median (x) is in the red area, the aggregated quality of the criterion is low. If the median is in the orange area, the aggregated quality is average. If the median is in the green area, the aggregated quality is high. The interquartile range is an indicator of inter-expert heterogeneity. Nine of the 16 controversial criteria were of high aggregated quality (median in the green area). The remaining seven criteria were all of average aggregated quality (median in the orange area).
Figure 3.
Relative importance of the Qualichem quality criteria to the global quality of the study.
Eight of the twelve respondents agreed to select a subset of up to 15 criteria that they considered the most important for the quality of in vivo study results. The figure shows the combined set of 39 criteria chosen by these eight experts. The vertical axis lists the 39 criteria, and the horizontal axis gives the percentage of respondents who selected each criterion.
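The percentages on the horizontal axis can be derived from the individual selections as in the sketch below. The function name and the example data are hypothetical; the only thing taken from the legend is the quantity computed, namely the share of respondents who selected each criterion.

```python
from collections import Counter


def selection_percentages(selections):
    """Return {criterion: percent of respondents who selected it}.

    `selections` holds one collection of criterion labels per respondent.
    """
    n = len(selections)
    # set(chosen) ensures a respondent counts at most once per criterion.
    counts = Counter(c for chosen in selections for c in set(chosen))
    # most_common() orders criteria from most to least frequently chosen.
    return {c: 100 * k / n for c, k in counts.most_common()}


# Hypothetical selections from four respondents.
picks = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
print(selection_percentages(picks))
```

The number of keys in the returned dictionary is the size of the combined set of criteria, which is the quantity the legend reports as 39.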
Table 3.
Comparison between Qualichem and other reporting and/or quality assessment frameworks.
Figure 4.
Disciplines of Qualichem respondents.
The vertical axis lists the disciplinary areas chosen by the experts. The horizontal axis gives the number of experts who chose each disciplinary area to describe their work.
Figure 5.
Quality assessment by two endocrinologists using Qualichem to evaluate Tyl et al. (2002).
Of the 45 criteria applied to the study of Tyl et al. (2002), the figure shows only the 30 criteria that were controversial for the two respondents who included “endocrinology” or “endocrine toxicology” among their fields of competence. The figure is divided into three colored areas: red (scores and medians <3), orange (scores and medians between 3 and 4) and green (scores and medians >4). A line covers the full range, from the lowest to the highest score in the group of responding experts. The median of the scores is represented by an “x” and the interquartile range by a rectangle. If the median (x) is in the red area, the aggregated quality of the criterion is low. If the median is in the orange area, the aggregated quality is average. If the median is in the green area, the aggregated quality is high. The interquartile range is an indicator of inter-expert heterogeneity. Many more criteria fell in the orange or red areas for these two respondents than for all eight respondents together: 16 vs 3 in the orange area, 9 vs 2 in the red area, and only 5 vs 30 in the green area. The two endocrinologists therefore assigned lower aggregated quality to these criteria than the full group of eight respondents did.
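A subgroup comparison of this kind can be sketched as follows: restrict the respondents to those who list a given field of competence, take the median score per criterion, and count how many criteria fall in each colored band. The data layout, function names, and example scores are hypothetical. As a simplification, the sketch tallies every scored criterion rather than only the controversial ones, and it treats the boundary values 3 and 4 as orange.

```python
from collections import Counter
from statistics import median


def band(m):
    """Colored area for a median; boundaries 3 and 4 assumed orange."""
    if m < 3:
        return "red"
    if m <= 4:
        return "orange"
    return "green"


def band_counts(responses, fields=None):
    """Count criteria per band, optionally for a subgroup of respondents.

    Each response is a dict with "fields" (set of str) and
    "scores" ({criterion: score}); this layout is an assumption.
    """
    # Keep everyone when no filter is given, otherwise keep respondents
    # who list at least one of the requested fields.
    group = [r for r in responses if fields is None or r["fields"] & fields]
    criteria = set().union(*(r["scores"].keys() for r in group))
    tally = Counter()
    for c in criteria:
        scores = [r["scores"][c] for r in group if c in r["scores"]]
        tally[band(median(scores))] += 1
    return tally


# Hypothetical responses from three experts on two criteria.
responses = [
    {"fields": {"endocrinology"}, "scores": {"c1": 2, "c2": 5}},
    {"fields": {"toxicology"}, "scores": {"c1": 6, "c2": 6}},
    {"fields": {"endocrine toxicology", "toxicology"},
     "scores": {"c1": 3, "c2": 4}},
]
endo = {"endocrinology", "endocrine toxicology"}
print(band_counts(responses))        # all respondents
print(band_counts(responses, endo))  # endocrinology subgroup only
```

With only two respondents in a subgroup, the median of a criterion is simply the midpoint of the two scores, so a single low score pulls the aggregate down more strongly than it does in a larger group.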