Correction: Independent re-analysis of alleged mind-matter interaction in double-slit experimental data

[This corrects the article DOI: 10.1371/journal.pone.0211511.]


The hypothesis of a mind-matter interaction, that is, the possibility that human intention may have an impact on matter at a distance, is usually regarded by most physicists as a highly controversial concept. It is nonetheless related to von Neumann's interpretation [2] of the quantum measurement problem, namely that consciousness causes the collapse of the wave function when a quantum system in a superposition of states is observed. Even if this interpretation has been and still is considered by many leading minds of quantum mechanics [2][3][4], it is today largely disregarded by a majority of physicists [5], partly because it flirts with the overwhelmingly complex mind/body problem. This mysterious link between consciousness and matter appears indeed to have an infinite number of uncontrollable parameters, and therefore does not seem to lend itself to rigorous scientific inquiry. Moreover, von Neumann's interpretation being by all means only one out of many possible interpretations of quantum mechanics [6] (most of which keep consciousness aside), physicists generally prefer mathematically controlled objective concepts such as quantum decoherence [7] or Everett's many-worlds interpretation [8].

The working hypothesis under test is that a human subject's attention towards a quantum system may be modeled as an extremely weak measurement of the system, which should in turn imply a proportionally weak but still measurable collapse of its wave function. The authors propose to test this hypothesis using one of the simplest quantum apparatuses: the double-slit optical interferometer. In this context, it is well-known [10] that if the path taken by photons through the interferometer (called "which-way information") is recorded, then photons behave like particles (they don't interfere); otherwise they behave like waves (they interfere).
It has also been verified that the strength of the observed interference pattern is inversely proportional to the amount of which-way information one gathers [11,12]. Keeping that in mind, and according to the working hypothesis previously stated, a human subject's attention towards a double-slit system, if it really acts as a weak measurement of the which-way information, should very slightly attenuate the interference pattern. Other working hypotheses can be thought of that do not require a gain in which-way information while still accounting for a decrease in fringe visibility. For instance, Pradhan [14] proposes another theoretical background based on a small modification of the Born rule. We will not delve here into the technicalities of these theoretical approaches and refer the interested reader to the debates and ideas in [14][15][16][17]. In this paper, we will essentially concentrate on data and analyze it as carefully as possible to identify anomalies if they exist, regardless of the precise potential mechanism underlying them.

Ibison and Jeffers reported contradictory and inconclusive results from their pioneering experiments [9]. In the last few years, Radin and collaborators [1,18,19] reproduced their experiment at a larger scale. In their work, the fringe visibility of the interference pattern is monitored while human subjects are asked to periodically shift their attention towards or away from the optical system. In [1], the authors analyze a two-year-long experiment with thousands of subjects, claim to find small but statistically significant shifts of the fringe visibility, and interpret them as evidence of mind-matter interaction.
Note that Baer [20] proposed a partial re-analysis of the data and concluded that the data "lead to a possibility, but certainly not a proof, that a psychophysical effect exists", and pointed out that physical noise in the system was too high to draw further conclusions.

In this paper, we independently re-analyze the dataset presented in [1]. We i/ show that the trimming-based statistical procedure used in [1] is flawed and leads to false positives, as was pointed out to us by Von Stillfried and Walleczek, the authors of a recent article [24] reporting a commissioned replication study of Radin's double-slit experiment; ii/ provide a bigger picture of the statistical analysis and explore its robustness with respect to several preprocessing choices. As in [1], we observe fringe visibility shifts towards the direction predicted by the mind-matter hypothesis. However, our analysis shows that these shifts are not statistically significant, with no p-value under 0.05.

In an effort for reproducible research, the ∼80 Gb of raw data are publicly available on the Open Science Framework platform at https://osf.io/ywktp/.

The apparatus consists of a laser, a double-slit, and a camera recording the interference pattern; it is located in IONS' laboratory in Petaluma, California. Details are in [1]. The apparatus is always running, even though data is only recorded when somebody connects to the system via the Internet. A participant in the experiment connects online to the server (accessible through IONS' research website) and receives alternating instructions every 30 seconds, to either "now concentrate" or "now relax". During concentration epochs, the participant's task is to mentally influence the optical system in order to increase a real-time feedback signal, displayed as a dynamic line on the screen. For people who prefer to close their eyes during the experiment, the feedback is also transmitted as a whistling wind tone.

In 2013, the feedback was inversely proportional to a sliding 3-second average of the fringe visibility: the higher the line, or the higher the pitch of the tone, the lower the fringe visibility, and the closer the system was to "particle-like" behaviour.

In 2014, due to a coding error, the feedback was inverted: it now increased when the fringe visibility increased. The participant's task was still to increase the feedback, but this time the higher the line, or the higher the pitch of the tone, the higher the fringe visibility, and the closer the system was to "wave-like" behaviour.

As controls, a Linux machine connects to the server via the Internet at regular intervals. The server does not know who it is dealing with: it computes and sends feedback, and records interference data just as it would for a human participant.

Each session always starts and finishes with a relaxation epoch. A total of 10 concentration and 11 relaxation epochs are recorded per session, which makes the whole session last about 10 minutes and 30 seconds. Some sessions end before all epochs are completed, due to Internet connection issues or to participants' impatience. One possible bias could come from participants' self-selection: it could be argued that participants with poor results quit the experiment earlier than participants performing well. To avoid this bias, we need to take as many sessions as possible into account. On the other hand, very short sessions do not enable a precise estimation of any measurable difference between the two types of epochs. We decide to keep only sessions containing more than τ = 1000 camera frames, which corresponds to sessions approximately completed half-way and containing 8 alternating epochs. We will see in Section 2.7 how the value of τ changes the results.

The camera records at 4 Hz a line of 3000 pixels, an example of which is shown in Fig 1, which also displays the maximum (noted env_M) and minimum (noted env_m) envelopes of the interference pattern, computed with cubic spline interpolation between local extrema. Local extrema are automatically detected after a Savitzky-Golay filter of order 2 on a 29-pixel moving window that smooths the interference pattern in order to remove the pixel jitter that appears on some camera frames. For a better signal-to-noise ratio, we consider the 19 middle fringes of the pattern. For each camera frame, we extract one scalar. The choice of this scalar is not straightforward and we will explore different choices throughout the paper.
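The envelope extraction described above can be sketched as follows. This is a simplified, hypothetical reimplementation on a synthetic frame: a moving average and linear interpolation stand in for the Savitzky-Golay filter and cubic splines of the actual pipeline.

```python
import numpy as np

def envelopes(frame, window=29):
    """Smooth a camera frame, detect local extrema, and interpolate
    the maximum (env_M) and minimum (env_m) envelopes."""
    # Moving-average smoothing (stand-in for the Savitzky-Golay filter).
    kernel = np.ones(window) / window
    smooth = np.convolve(frame, kernel, mode="same")
    # Local maxima/minima of the smoothed pattern.
    interior = np.arange(1, len(smooth) - 1)
    maxima = interior[(smooth[interior] > smooth[interior - 1]) &
                      (smooth[interior] > smooth[interior + 1])]
    minima = interior[(smooth[interior] < smooth[interior - 1]) &
                      (smooth[interior] < smooth[interior + 1])]
    x = np.arange(len(frame))
    # Linear interpolation between extrema (the paper uses cubic splines).
    env_M = np.interp(x, maxima, smooth[maxima])
    env_m = np.interp(x, minima, smooth[minima])
    return env_M, env_m

# Synthetic interference pattern: fringes under a Gaussian envelope.
x = np.arange(3000)
pattern = np.exp(-((x - 1500) / 900) ** 2) * (1 + 0.8 * np.cos(2 * np.pi * x / 120))
env_M, env_m = envelopes(pattern)
# One possible per-frame scalar: mean contrast over the central fringes.
mid = slice(1000, 2000)
visibility = np.mean((env_M[mid] - env_m[mid]) / (env_M[mid] + env_m[mid]))
```

The per-frame scalar shown here is only one of the choices explored in the paper.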
For each session, we extract a single scalar value: the difference between the median of the fringe visibility during concentration epochs and the median of the fringe visibility during relaxation epochs. Medians are used as they are more robust to outliers than averages. Formally, given the fringe visibility time series fv, define fv_c (resp. fv_r) as the restriction of fv to the concentration (resp. relaxation) epochs, and ∆ν as the difference in median fringe visibility:

∆ν = median(fv_c) − median(fv_r).

∆ν is the statistic we will use in the following analyses. The statistical test relies on a trimmed bootstrap. Given the set X of the n values of ∆ν (one per session):

1. Generate a bootstrap sample X*_1 of size n by drawing n values from X uniformly with replacement.
2. Trim the bootstrap sample: denoting by r_q the integer closest to qn/2, remove the r_q lowest and r_q highest values from X*_1, obtaining X*_{1,q} of size n − 2r_q.
3. Compute the sample mean x̄*_{1,q} of X*_{1,q}.
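The per-session statistic ∆ν can be computed as in this minimal sketch. The session layout is synthetic; the 120 frames per epoch follow from the 30-second epochs sampled at 4 Hz described above.

```python
import numpy as np

FRAMES_PER_EPOCH = 120  # 30 s epochs sampled at 4 Hz

def delta_nu(fv, labels):
    """Difference of median fringe visibility between concentration
    and relaxation epochs; medians are robust to outliers."""
    fv_c = fv[labels == "c"]   # restriction to concentration epochs
    fv_r = fv[labels == "r"]   # restriction to relaxation epochs
    return np.median(fv_c) - np.median(fv_r)

# Synthetic session: 11 relaxation epochs interleaved with 10 concentration ones.
rng = np.random.default_rng(0)
labels = np.array((["r"] * FRAMES_PER_EPOCH + ["c"] * FRAMES_PER_EPOCH) * 10
                  + ["r"] * FRAMES_PER_EPOCH)
fv = rng.normal(0.9, 0.01, size=labels.size)  # fringe visibility time series
dnu = delta_nu(fv, labels)
```

With visibility values drawn from the same distribution in both epoch types, ∆ν is of course expected to be close to zero.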
Steps 1 to 3 are repeated for each of the B bootstrap samples, and the resulting trimmed sample means are compared to 0 with significance level α. The probability that a bootstrap trimmed sample mean verifies x̄*_q < 0 is readily estimated by A/B, where A is the number of bootstrap samples whose trimmed sample mean is below 0. The associated p-value is thus estimated by p = 2 min(A/B, 1 − A/B).
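The bootstrap-then-trim test just described can be sketched as follows. This is a minimal illustration on synthetic data; the number of bootstrap samples B and the sample sizes are our own choices, not values from the paper.

```python
import numpy as np

def trimmed_bootstrap_pvalue(X, q=0.20, B=2000, rng=None):
    """Two-sided test of E(X) = 0: resample first, trim each bootstrap
    sample, then compare the trimmed means to 0."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X)
    n = X.size
    r_q = round(q * n / 2)  # number of values trimmed on each side
    means = np.empty(B)
    for b in range(B):
        star = np.sort(rng.choice(X, size=n, replace=True))  # step 1: resample
        trimmed = star[r_q:n - r_q]                          # step 2: trim
        means[b] = trimmed.mean()                            # step 3: mean
    A = np.sum(means < 0)  # bootstrap trimmed means below zero
    return 2 * min(A / B, 1 - A / B)

rng = np.random.default_rng(1)
X_null = rng.normal(0.0, 1.0, size=500)   # zero-mean data
X_shift = rng.normal(0.5, 1.0, size=500)  # clearly shifted data
p_null = trimmed_bootstrap_pvalue(X_null, rng=np.random.default_rng(2))
p_shift = trimmed_bootstrap_pvalue(X_shift, rng=np.random.default_rng(2))
```

A strongly shifted sample yields a p-value near zero, while zero-mean data typically does not.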
Note that this normalized shift is only computed for illustration purposes (in order to observe in which direction potential shifts of the mean appear): it is not used for the statistical test. Also note that in the study by Radin et al. [1], the trimming is performed before generating the bootstrap samples (steps 1 and 2 are inverted), which creates false positives as soon as q > 0, as illustrated in the Supporting information. In this first analysis, q is set to 20%. We will see later in Section 2.6 how this choice affects the results.

170
A time lag l is expected between the fringe visibility and the alternating instructions of concentration and relaxation. Indeed, a lag could occur for three main reasons: first, due to the time needed to switch one's attention from one concentration state to another; second, due to the finite (and possibly slow) speed of the Internet connection; and third, due to the 3-second span of the sliding window on which the feedback is computed. In the following, we will consider lags between 0 and 25 seconds.
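Testing a lag l amounts to shifting the visibility time series relative to the instruction labels before computing ∆ν; at the camera's 4 Hz rate, a lag of l seconds corresponds to 4l frames. A minimal sketch, with synthetic labels and helper names of our own:

```python
import numpy as np

RATE_HZ = 4  # camera frame rate

def apply_lag(fv, labels, lag_seconds):
    """Delay the visibility series by `lag_seconds` relative to the
    instructions: frame t is attributed to the instruction that was
    displayed `lag_seconds` earlier."""
    shift = int(round(lag_seconds * RATE_HZ))
    if shift == 0:
        return fv, labels
    # Drop the first `shift` frames of fv and the last `shift` labels.
    return fv[shift:], labels[:-shift]

labels = np.array(["r"] * 120 + ["c"] * 120 + ["r"] * 120)
fv = np.arange(labels.size, dtype=float)
fv_l, labels_l = apply_lag(fv, labels, lag_seconds=9)  # 9 s = 36 frames
```

∆ν is then computed on the lagged pair (fv_l, labels_l) for each candidate lag.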

176
The null hypothesis we are testing is therefore H_0: considering any time lag, E(∆ν) is null. Indeed, common sense suggests that whatever the concentration state of a participant, there is no reason that the fringe visibility of the optical system should be affected. This hypothesis involves multiple testing (m = 26 tests precisely): one for each time lag l. For each time lag l we test the null hypothesis H_0^l: considering time lag l, E(∆ν) is null, which outputs a p-value p_l. We then apply the Holm-Bonferroni method [22] to adjust for multiple comparisons, and obtain an overall p-value p_H0 for H_0. To this end, write p_(1) ≤ p_(2) ≤ ... ≤ p_(m) for the values of {p_l} sorted in ascending order; the overall p-value p_H0 is then obtained from the Holm-Bonferroni adjustment of these sorted p-values. This method is regarded as pessimistic in our context of correlated tests [23]. But in this controversial field of research, it is safer to use pessimistic estimations.

We now propose to make a very different choice in the analysis of this data than the one originally proposed. The authors in [1] propose to aggregate the data from both years after correcting for 2014's accidental sign inversion. We argue in this paper that aggregating the data is confusing and makes the results' interpretation more difficult. In this preliminary analysis, 2014's data slightly shift towards positive values, but within chance expectations. Given that there was no reason to believe before the experiment that such a positive shift would be observed, one could argue that aggregating the data after a sign inversion is using a possibly random fluctuation to one's advantage. Another possibility is to aggregate the data without the sign inversion. This is not reasonable given the fact that experimental conditions (specifically the feedback, which seems to be very important) were different for both years.
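The Holm-Bonferroni step-down adjustment of the m per-lag p-values can be sketched as follows; taking the smallest adjusted value as the overall p_H0 for the global null is our reading of the procedure, shown here on hypothetical p-values.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm-Bonferroni step-down adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1);
        # the running maximum enforces monotonicity of the adjustment.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

p_lags = [0.01, 0.02, 0.50]        # e.g. one p-value per tested lag
p_H0 = holm_adjust(p_lags).min()   # overall p-value for the global null
```

For these three hypothetical p-values the adjusted values are 0.03, 0.04 and 0.50, giving p_H0 = 0.03.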
The most reasonable decision regarding both years' analyses is to keep them separate, at the cost of lower statistical power.

Another fundamental difference between our analysis and the one proposed in [1] is prior knowledge regarding the time lag to consider. The authors in [1] build upon their previous (and independent) experiment [19], which indicated a time lag of 9 seconds as a good parameter to discriminate humans from controls (as long as the experiment used to learn this parameter and the experiment used to test it are independent, this is perfectly acceptable). In our independent re-analysis, we prefer the safer choice of no prior knowledge, thereby necessarily testing several time lags followed by constraining adjustments due to multiple testing, at the cost, once again, of lower statistical power. Note that, for the sake of completeness, we will later show (in Figure 12, with the discussion in Section 3) the results obtained by aggregating both years' data after sign inversion and/or supposing prior knowledge of the time lag. For now, however, we keep both years' data separate, and test against several time lags.

In the next four sections (Section 2.5 to Section 2.8), we look at the robustness of these results.

Fringe number 9 is an arbitrary choice and it is necessary to look at other fringes. Fig 5 shows results obtained for fringe number 7: the shifts observed for the human sessions are in the same direction as for fringe number 9, with a less (resp. more) significant result for 2013 (resp. 2014), with a corrected p-value of p_H0 = 3 × 10^−1 (resp. 6 × 10^−1). The big surprise comes from the 2013 control sessions, which show a significant shift.

To look at all fringes at once, Fig 6 shows the corrected p-values p_H0 as a function of the fringe number for all four session types. We see how a particular choice of fringe for the analysis is problematic: depending on this choice, one may obtain different outcomes of the statistical test! For instance, one could p-hack and choose a posteriori fringe number 14 as a good candidate to discriminate humans from controls; or choose fringe number 19 to conclude that one cannot discriminate one from the other.

To go further, and in order to prevent ourselves from choosing the fringe number(s) that serve one hypothesis or the other, we propose two strategies that both take into account information from all fringes. Looking at a single fringe leads to a signal-to-noise ratio (SNR) that is too small for our task. In order to increase the SNR, we define fv_µ as the average of fv over all fringes between 10 − µ and 10 + µ (with µ an integer between 0 and 9). We choose to concentrate on intervals centered around fringe 10 as it is the one with the best SNR. We could of course choose other intervals to average over, but we would encounter the very same problem we are trying to avoid: different intervals will serve different hypotheses, and a particular choice of interval would be difficult to justify. Here, we rely on the (strong) SNR argument to choose to look at all intervals centered around fringe 10.

We now investigate if these results are robust to i/ the trimming intensity q in Section 2.6, ii/ the length threshold τ in Section 2.7, and iii/ the fringe visibility estimation method in Section 2.8.

An alternative definition of the fringe visibility compares, for a fringe number n, its local maximum M_n and its preceding local minimum m_n:

fv_n = (M_n − m_n) / (M_n + m_n).

Results obtained with this definition on fringe 9, and with q = 20% and τ = 1000, are shown in Fig 11 (top). We observe significant anomalies (even though much less significant than in [1]) in the human data of both years, especially around l = 9 seconds, and insignificant results for the controls.

For a fringe number n and its associated local maximum M_n, there is no reason to define its visibility by comparing M_n to its preceding local minimum m_n rather than its succeeding local minimum m_{n+1}. If one instead defines the visibility using m_{n+1}, one obtains similar results (not shown).
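Both per-fringe definitions (preceding vs. succeeding minimum) amount to a Michelson-type contrast; a minimal sketch with synthetic extrema values:

```python
import numpy as np

def visibility_prev(M, m, n):
    """Fringe visibility of fringe n using the preceding minimum m_n."""
    return (M[n] - m[n]) / (M[n] + m[n])

def visibility_next(M, m, n):
    """Same, using the succeeding minimum m_{n+1}."""
    return (M[n] - m[n + 1]) / (M[n] + m[n + 1])

# Synthetic local maxima/minima of an interference pattern.
M = np.array([1.00, 1.10, 1.05])
m = np.array([0.20, 0.22, 0.21, 0.19])
v_prev = visibility_prev(M, m, n=1)
v_next = visibility_next(M, m, n=1)
```

For a reasonably regular pattern, the two definitions give close values, consistent with the similar results reported above.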

One concludes that the results summarized at the end of Section 2.5 are robust with respect to the fringe visibility estimation method.

Our analysis differs from the one in [1] on three main points: i/ the trimming-based statistical test used in [1] is flawed; ii/ in [1], the data from both years are aggregated after sign inversion for 2014; iii/ in [1], a time lag of 9 seconds is chosen from the start based on a previous (and independent) experiment [19] that indicated that such a time lag was a good parameter to discriminate humans from controls.

In this paper, we corrected point i/ and we argued that points ii/ and iii/ were not solid choices from our statistical re-analysis point of view; we preferred a more conservative approach, keeping both years' data separate and testing several time lags before correcting for multiple comparisons, both of these choices necessarily inducing a lower statistical power. For completeness, we show in Figure 12 the results one would have obtained instead of Fig 9 in three different scenarios, in which we set the time lag at 9 seconds from the start and/or combine both years after sign inversion for 2014. We observe that the results look more convincing in these scenarios, with large p-values (> 0.7) for the controls, and slightly significant deviations for the humans. However, all p-values in these three scenarios are larger than 2 × 10^−3: they cannot be interpreted as strong evidence of mind-matter interaction, but may motivate further replication attempts. These additional results seem to point out that the erroneous statistical test used in [1] led to an underestimation of the p-value by 5 orders of magnitude (they reported a p-value of ∼10^−8 instead of the ∼10^−3 that we find here), which further led the authors to erroneous conclusions.

Before we conclude, let us make an important statement. We have made many statistical tests, and to prevent p-hacking, one needs to look at all these tests as a whole. Extracting one test or another from the whole is not recommended. Note that, on top of the tests discussed in the paper, we have also performed tests with two other fringe visibility definitions: the average of Eqs (4) and (5), and the fringe visibility extracted by spline interpolation as in Eq (1) but sampled only at the extrema instead of averaged over each fringe as presented here. None of these tests showed results significantly different from the ones shown in the paper.

The thorough analysis pursued in this paper contradicts the results previously published in [1]. On the one hand, we observe shifts of the fringe visibility in the direction predicted by the mind-matter interaction hypothesis, as in [1]. On the other hand, these shifts are not deemed significant by our analysis.

Supporting information

Let 0 < α < 1 be a significance level. We illustrate here that false positives are uncontrolled in the test used in [1], and under control of α in the correct test described in Section 2.3. To do so, consider the following framework:

i/ Create a synthetic set X by drawing its n iid elements from N(0, 1), the Gaussian distribution with zero mean and unit variance. n is set to 10^3.

ii/ Generate N = 5 × 10^4 independent realisations of such a set X. All N sets thus have a true zero mean by construction.

iii/ Test each set: obtain a p-value per set and, given the significance level α, a rejection decision per set.

We then consider both the probability of type I error, estimated by p̂_I = R/N, where R counts the number of rejected sets X (for the given α), and the α-quantile p*_α of the N p-values obtained: the value under which lie αN of the p-values. If the test used in iii/ is correct, both p̂_I and p*_α should be very close to α.
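This comparison can be reproduced in miniature as follows. N, n and the number of bootstrap samples are drastically reduced here for runtime (the paper uses N = 5 × 10^4 and n = 10^3), so the estimated rates are noisy, but the inflation of the type I error under the flawed ordering is already visible.

```python
import numpy as np

def bootstrap_pvalue(X, q=0.20, B=200, trim_first=False, rng=None):
    """Two-sided trimmed-bootstrap p-value for E(X) = 0.
    trim_first=True reproduces the flawed order of [1]: the sample is
    trimmed once, then bootstrapped."""
    rng = rng or np.random.default_rng(0)
    n0 = X.size
    r_q = round(q * n0 / 2)
    if trim_first:
        X = np.sort(X)[r_q:n0 - r_q]  # flawed: trim before resampling
    n = X.size
    means = np.empty(B)
    for b in range(B):
        star = rng.choice(X, size=n, replace=True)
        if not trim_first:
            star = np.sort(star)[r_q:n - r_q]  # correct: trim each resample
        means[b] = star.mean()
    A = np.sum(means < 0)
    return 2 * min(A / B, 1 - A / B)

rng = np.random.default_rng(0)
alpha, N, n = 0.05, 300, 100
reject_correct = reject_flawed = 0
for _ in range(N):
    X = rng.normal(0.0, 1.0, size=n)  # true mean is zero
    reject_correct += bootstrap_pvalue(X, rng=rng) < alpha
    reject_flawed += bootstrap_pvalue(X, trim_first=True, rng=rng) < alpha
rate_correct = reject_correct / N  # should stay close to alpha
rate_flawed = reject_flawed / N    # inflated type I error
```

The inflation arises because bootstrapping an already-trimmed sample underestimates the sampling variability of the trimmed mean, so extreme p-values occur too often under the null.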