Reader Comments

Importance of Verification

Posted by gfariello on 20 Jun 2012 at 17:10 GMT

I believe this is an important paper in that it appears to have reawakened recognition of a phenomenon that is well understood in computer science and systems engineering, but which has apparently come as a surprise to many in the neuroimaging community. Reproducing computational results across different systems is difficult, but usually quite doable. FreeSurfer may be more affected than many neuroimaging tools, but it is far from the only one affected. Any software that relies on iterative algorithms will show compounded differences across environments.
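To make the compounding concrete, here is a minimal, purely illustrative sketch in plain Python (nothing to do with FreeSurfer's actual code) of how a one-ulp difference in an input value -- the kind of discrepancy a different math library or compiler can introduce -- grows over the iterations of a nonlinear update rule:

  import math  # math.nextafter requires Python 3.9+

  def iterate(x, n=60, r=3.9):
      # Logistic map: a stand-in for any iterative, nonlinear update.
      for _ in range(n):
          x = r * x * (1.0 - x)
      return x

  a = 0.4
  b = math.nextafter(0.4, 1.0)          # the same value, perturbed by one ulp
  print(iterate(a), iterate(b))
  print(abs(iterate(a) - iterate(b)))   # no longer a negligible difference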

It concerns me, however, that a few in the community seem to be singling out the FreeSurfer developers for things which are largely outside of their control. It would be good to know how some of the larger differences came about, but differences are to be expected.

Below I share the essence of an email I posted to the FreeSurfer mailing list recently in the hope that others find it useful.

Sincerely,

-Gabriele

-- Email follows --

Hi Bruce et al,

I may be late to the discussion, but I wanted to share some insights, given that we've had some headaches trying to get identical results on presumably identical systems for FreeSurfer and other tools. Ultimately, of course, the systems were not 100% identical, and the differences resulted in some downstream libraries using 32-bit code instead of 64-bit. Some differences had been observed in other software packages to a lesser extent, but I'm confident that had we compared any other software relying on complex iterative algorithms and using these libraries, the differences would have been more pronounced. Re-imaging the systems to ensure that all libraries were identical resulted in identical output. Later, when migrating from RHEL 4.x to 5.x to 6.x, we did similar library checking and were again able to produce identical results.

It is important for the neuroimaging community to understand that reproducibility is in large part their responsibility, not that of the software package or operating system developers. It is effectively impossible to guarantee identical results on non-identical systems. Unfortunately, for some analyses "identical" leaves little room for differences (perhaps even down to the random seed, as Michael Harms just suggested), but barring hardware-related interference, bugs, or errors, identical systems should generally produce identical results.
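As a small, hypothetical illustration of the random seed point (the function and numbers below are made up for this sketch, not taken from any real pipeline):

  import random

  def analysis_step(seed=None):
      # Stand-in for any analysis step that uses random initialization.
      rng = random.Random(seed)
      return sum(rng.gauss(0.0, 1.0) for _ in range(1000))

  print(analysis_step() == analysis_step())      # almost certainly False
  print(analysis_step(42) == analysis_step(42))  # True: seed pinned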

I forwarded the paper that sparked this thread to friends, asking in jest "why can't things be simple?", but in reality, as Bruce mentioned, this is not in the least a surprise. I've been involved in testing and reconciling output across software and system updates in clinical and research settings since 1999, and I'm pretty sure I was not the first. There is a reason the FDA does not want you upgrading even Minesweeper on a validated Windows system without re-validating. Researchers need to think similarly (if not quite as extremely).

Some additional notes for those of you who may not be aware:

1) The USER environment can affect results. On GNU/Linux systems, for example, modifying the $PATH or $LD_LIBRARY_PATH variables may result in different output from the same executable on the same system for different users (or for the same user under different shells); see the sketch after this list. Mac and Windows can have similar issues, particularly where "power users" are concerned.

2) Statically compiling software does not eliminate the use of dynamically loaded libraries (see a good explanation at "Linking libstdc++ statically"). So even statically compiled software can be affected by other libraries on the system.

3) All other things being identical, using Intel vs. AMD x86-compatible chips should not affect the output; however, moving to ARM, other RISC, or GPU hardware, where floating-point implementations are IEEE-compliant but not identical, would virtually guarantee different results for all but the simplest calculations, even if all libraries are the same. This means that, unfortunately, you'll likely never be able to reproduce your 1,000-node cluster results on your iPad -- no matter how cool or powerful it gets.
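As a sketch of point 1, the same command name can resolve to different executables (and, analogously via $LD_LIBRARY_PATH, to different shared libraries) depending on the caller's environment. The directories below are hypothetical; substitute ones from your own system:

  import os
  import shutil

  # Two users with different search paths; /opt/tools/bin is hypothetical.
  user_a_path = os.pathsep.join(["/usr/bin", "/bin"])
  user_b_path = os.pathsep.join(["/opt/tools/bin", "/usr/bin", "/bin"])

  print(shutil.which("python3", path=user_a_path))
  print(shutil.which("python3", path=user_b_path))
  # If /opt/tools/bin ships its own python3, the two users silently run
  # different binaries from the same command line.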

All that being said, if you do run your entire experiment twice, on two systems that differ only in their IEEE-compliant double-precision floating-point implementation, and the results are significantly different (e.g., on a Xeon cluster the hippocampal volumes of group A and group B differ, while on an NVIDIA GPU cluster they do not), that would bring into question the validity/reliability of the analysis. I have not seen any evidence of that.
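For what it's worth, here is a minimal illustration of why two IEEE-compliant platforms can legitimately disagree: floating-point addition is not associative, so anything that changes the order of evaluation (different hardware, a GPU's parallel reduction, a compiler optimization) can change the result even though every individual operation is correctly rounded:

  values = [1e16, 1.0, -1e16, 1.0]   # exact sum is 2.0

  left_to_right = 0.0
  for v in values:
      left_to_right += v             # yields 1.0: the first 1.0 is absorbed by 1e16

  reordered = sum(sorted(values))    # same numbers, different order: yields 0.0

  print(left_to_right, reordered)    # two different answers, both IEEE-correct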

That may have been more than two cents.

-Gabriele

P.S. I'm not going to chime in on differences between versions, since I can't imagine how a segmentation algorithm (for example) could have become more accurate and yet produced the same results.

No competing interests declared.