Reader Comments

Do these results generalise to what researchers really do?

Posted by BBTK on 14 Mar 2014 at 12:41 GMT

In 2001 we carried out a similar study using the common experiment generators of the day. We constructed a set of agreed basic benchmarks and even let the authors of the packages comment on them beforehand. We used clean equipment to test presentation, response, and synchronization timing accuracy.

Within a given range, all performed fairly well. However, when we checked real researchers' experiments on their own hardware, they didn't match the benchmarks.

The reason for this is that flashing a bitmap over and over on idealised equipment is not representative of what real researchers do in the field! Their equipment is never ideal, their coding is never as good as yours, their experiments are more complicated, they link to different equipment from yours (e.g. EEG, fMRI), or they are using a different version of the software from you.

We abandoned this approach and think that researchers should be required by publishers and funders to self-validate their own timing accuracy on their own equipment when running in situ. In any other field, if you couldn't prove the accuracy of what you did, you wouldn't be published, period! Ditto for methods sections that lack detail.

In this specific paper they used one of our Black Box ToolKits to check timing. What this tells you is how those experiments performed on that equipment, at that time, with those scripts/code/software, and nothing more. Hence all researchers should check their own studies. Generalising more widely could lead to false assumptions and perpetuate bad science, in my view…

No competing interests declared.

RE: Do these results generalise to what researchers really do?

garaizar replied to BBTK on 14 Mar 2014 at 20:28 GMT

Dear Robert,

We do agree with you when you doubt the generalizability (or ecological validity, in terms of what researchers really do) of these results. We also agree with you that researchers should be required by publishers and funders to self-validate their own accuracy on their own equipment when running in situ. Unfortunately, that is not the case right now. For this reason, we believe that a series of standardized tests conducted by third-party researchers (i.e., not involved in the development of the analysed software) could provide some insight into the accuracy and precision of the software. As you mentioned, this does not necessarily mean that our tests certify or validate the precision and accuracy of the software used. However, it works like a "health quality control" inspection in a restaurant or an annual air-conditioning system maintenance visit: it does not ensure that you will not suffer food poisoning or become infected with Legionella, but it is a good thing to do from time to time.

In this particular case, it helped to start a fruitful collaboration with the PsychoPy development team to better assess its accuracy and precision. The preliminary results are very promising (PsychoPy is able to present 1-frame stimuli accurately in our new tests), and the collaboration also helped to fix some minor bugs (not related to PsychoPy's accuracy) and to improve both the tested software and the benchmarks. In our view, this is very positive both for researchers and for experiment software developers.
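For illustration only, a minimal sketch of what a 1-frame presentation test can look like in PsychoPy is shown below. This is not the benchmark script used in the paper; the window settings, stimulus, and trial count are assumptions, and physical onset/offset times would still need to be verified with external hardware such as a photodiode or a Black Box ToolKit.

    # Hypothetical sketch of a 1-frame stimulus test in PsychoPy (not the
    # benchmark code from the paper). Assumes a full-screen window on the
    # primary display; the probe should be visible for exactly one refresh.
    from psychopy import visual, core

    win = visual.Window(fullscr=True, units='pix')
    probe = visual.Rect(win, width=200, height=200, fillColor='white')

    for trial in range(100):      # repeat the 1-frame presentation 100 times
        probe.draw()
        win.flip()                # probe appears on this vertical refresh
        win.flip()                # next refresh is blank: 1-frame duration
        core.wait(0.5)            # inter-trial interval

    win.close()
    core.quit()

Whether the probe really lasted one refresh on a given machine is exactly the kind of question that external measurement, rather than the script itself, has to answer.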

So, yes, generalising these results could lead to false assumptions and bad science, but we did not intend them to be used in that way. We provided the dataset of our tests in the hope that it will be included in a larger and more comprehensive dataset that could give us a general idea of the accuracy and precision of experiment software.

Thank you very much for your insights.

Best,
Pablo Garaizar

Competing interests declared: I am a co-author of this paper