Causal inference regulates audiovisual spatial recalibration via its influence on audiovisual perception

To obtain a coherent perception of the world, our senses need to be in alignment. When we encounter misaligned cues from two sensory modalities, the brain must infer which cue is faulty and recalibrate the corresponding sense. We examined whether and how the brain uses cue reliability to identify the miscalibrated sense by measuring the audiovisual ventriloquism aftereffect for stimuli of varying visual reliability. To adjust for modality-specific biases, visual stimulus locations were chosen based on their perceived alignment with auditory stimulus locations for each participant. During an audiovisual recalibration phase, participants were presented with bimodal stimuli with a fixed perceptual spatial discrepancy; they localized one modality, cued after stimulus presentation. Unimodal auditory and visual localization was measured before and after the audiovisual recalibration phase. We compared participants' behavior to the predictions of three models of recalibration: (a) Reliability-based: each modality is recalibrated based on its relative reliability; less reliable cues are recalibrated more; (b) Fixed-ratio: the degree of recalibration for each modality is fixed; (c) Causal-inference: recalibration is directly determined by the discrepancy between a cue and its estimate; the estimate, in turn, depends on the reliability of both cues and on the inference about how likely it is that the two cues derive from a common source. Vision was hardly recalibrated by audition. Auditory recalibration by vision changed idiosyncratically as visual reliability decreased: the extent of auditory recalibration either decreased monotonically, peaked at medium visual reliability, or increased monotonically. The latter two patterns cannot be explained by either the reliability-based or the fixed-ratio model. Only the causal-inference model of recalibration captures the idiosyncratic influences of cue reliability on recalibration.
We conclude that cue reliability, causal inference, and modality-specific biases guide cross-modal recalibration indirectly by determining the perception of audiovisual stimuli.


Cross-modal integration
In our daily lives, we continuously estimate properties of the environment, such as the location of a barking dog. Usually, multiple sensory cues for each property arrive in the brain: a glimpse of the dog's wagging tail and its barking both give away its location. However, due to external noise in the environment and internal noise in our sensory systems, two cues hardly ever agree perfectly. To still form a coherent percept, the brain relies on a weighted mixture of the cues. This strategy becomes evident when the cues are in conflict. For example, when a ventriloquist speaks without moving her lips, the auditory signal indicates that the speech originates from the ventriloquist, while the visual signal indicates the "dummy". Typically, vision dominates the combined spatial estimate of the sound source; in this example, we perceive the "dummy" to be speaking. This phenomenon is called the ventriloquism effect [1][2][3]. However, when visual reliability, the inverse of the average variability of a cue, is degraded enough to be lower than auditory reliability, the combined spatial estimate is no longer dominated by the visual but instead by the auditory cue, indicating that cue integration depends on the cues' relative reliability [4][5][6]. This integration strategy maximizes the precision of the estimate, i.e., reduces its variability. In addition to audiovisual spatial perception, reliability-based cue integration has also been found for visual and auditory cues to temporal rate [7][8][9], visual and haptic cues to size and shape [10][11][12] as well as numerosity [13], and visual and vestibular cues to heading direction [14][15][16][17][18][19].
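In its simplest form, the reliability-weighted integration rule described above can be written in a few lines. The following Python sketch (with made-up stimulus values and noise levels, not data or code from any study cited here) illustrates why the more reliable cue dominates the combined estimate:

```python
def integrate(s_v, s_a, sigma_v, sigma_a):
    """Reliability-weighted average of a visual and an auditory cue.

    Reliability is the inverse variance; the more reliable cue receives
    the larger weight, which minimizes the variance of the combined
    estimate.
    """
    r_v = 1.0 / sigma_v ** 2   # visual reliability
    r_a = 1.0 / sigma_a ** 2   # auditory reliability
    w_v = r_v / (r_v + r_a)    # visual weight
    return w_v * s_v + (1.0 - w_v) * s_a

# When vision is much more reliable (sigma_v << sigma_a), the combined
# spatial estimate lies close to the visual cue (the ventriloquism effect):
print(integrate(s_v=10.0, s_a=0.0, sigma_v=1.0, sigma_a=4.0))  # ≈ 9.41
```

Degrading visual reliability (increasing `sigma_v`) shifts the weight, and hence the combined estimate, toward the auditory cue.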
If, during a ventriloquist's show, the speech sounds originate from someone standing behind the stage, or if the dummy's mouth moves out of sync with the speech sounds, the audience is unlikely to experience a ventriloquism effect. In other words, integration breaks down when the two cues are too different to be perceived as coming from a common source. Such a breakdown of integration with spatial and temporal cue conflicts has been found not only in auditory-visual spatial integration [20][21][22][23][24][25], but also in other cross-modal combinations. In sum, discrepant cues can lead to very different percepts, depending on each cue's reliability and on inference about whether they have common or separate origins. Cross-modal recalibration might take this perceptual inference into account. Indeed, causal-inference models of cross-modal recalibration have successfully predicted visual-auditory [63] and visual-tactile [32] ventriloquism aftereffects. In these models, recalibration is not based on a mere comparison of two sensory cues, but rather relates the cues to the perceptual estimates and by doing so incorporates causal inference and cue reliability. According to these models, conflicting findings regarding the influence of cue reliability on recalibration might reflect differences in the perceptual estimates rather than diverging underlying mechanisms.

Preview
In this study, we contrasted all three accounts of cross-modal recalibration by fitting the models to data from an audiovisual ventriloquism-aftereffect study. Across sessions, we manipulated cue reliability, a determinant of reliability-based and causal-inference-driven cross-modal recalibration. Additionally, we controlled for the effect of modality-specific biases on the spatial perception of the two sensory cues by choosing visual stimulus locations that matched perceptually with the locations of the auditory stimuli for each participant.
With decreased visual reliability, many participants showed either increasing or non-monotonically changing auditory recalibration. No clear pattern of visual recalibration was found. These results cannot be explained by either the reliability-based or the fixed-ratio model. The causal-inference model, on the other hand, is able to capture these idiosyncratic effects based on individual differences in cue reliability, modality-specific biases, and an a priori belief about how often visual and auditory cues come from a common source. Thus, the model comparison suggests that cross-modal recalibration is driven by a comparison between sensory cues and perceptual estimates, which in turn are determined using causal inference.

Cue reliability
In the first part of the study, spatial reliability for one auditory stimulus and three visual stimuli was estimated for each participant using a unimodal spatial-discrimination task (Fig 1B; see Materials and methods). Visual reliability was manipulated by varying the horizontal spread of a random collection of ten Gaussian blobs (Fig 1A). For each stimulus condition, we fitted a cumulative Gaussian distribution to the responses as a function of test stimulus location (Fig 2A, mean adjusted R² = 0.953, range = 0.751–0.997). To compare spatial-discrimination performance across stimulus conditions, we computed the just-noticeable difference (JND; Fig 2B and 2C). Statistical analysis of the JNDs (Section S1.1 in S1 Appendix) confirmed that the three visual stimulus reliabilities were (i) smaller than, (ii) comparable to, and (iii) larger than the auditory reliability.
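Once a cumulative Gaussian has been fitted, the JND follows directly from its parameters. A minimal sketch (the 75% criterion here is a common convention assumed for illustration; the study's exact criterion is specified in Materials and methods):

```python
from statistics import NormalDist

def jnd_from_fit(mu, sigma, criterion=0.75):
    """JND derived from a cumulative-Gaussian psychometric fit.

    The JND is taken as the stimulus offset between the 50% point
    (the PSE, mu) and the `criterion` point of the fitted function.
    """
    psy = NormalDist(mu=mu, sigma=sigma)
    return psy.inv_cdf(criterion) - mu

# A flatter psychometric function (larger sigma, i.e., lower spatial
# reliability) yields a larger JND:
print(jnd_from_fit(mu=0.0, sigma=2.0))   # ≈ 1.35 deg
print(jnd_from_fit(mu=0.0, sigma=6.0))   # ≈ 4.05 deg
```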

Modality-specific spatial biases
In the second part of the study, we measured participants' modality-specific biases in the spatial perception of auditory stimuli relative to a visual stimulus (Fig 1A) using a bimodal spatial-discrimination task (Fig 3A). We did so at four auditory stimulus locations and for visual stimuli with high spatial reliability. Four separate psychometric functions were fitted, one for each auditory stimulus location (Fig 3B, mean adjusted R² = 0.909, range = 0.751–0.981). From each psychometric function, we calculated the point of subjective equality (PSE) and then described the four PSEs as a linear function of the underlying auditory location (Fig 4A).
The estimated slopes for five out of six participants were significantly larger than 1 (Fig 4B), indicating that the auditory stimuli were perceived as shifted towards the periphery relative to the visual stimuli. Four participants showed significant negative y-intercepts; that is, they perceived the auditory stimuli as shifted to the left relative to the visual stimuli. From the linear regression line through the PSEs, we extracted the four visual stimulus locations that were perceived to be co-located with the four auditory stimulus locations and used these locations in all subsequent tasks.
[Fig 1. (B) Task timing (blue: auditory stimuli; pink: visual stimuli). Participants were successively presented with a standard stimulus (located straight ahead) and a test stimulus (located to the left or right) in random order. After stimulus presentation, they used a keypad to report which interval contained the stimulus located farther to the right. Feedback was provided.]
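The two steps above, regressing PSEs on auditory location and reading matched visual locations off the regression line, can be sketched as follows. The PSE values are hypothetical examples, chosen so that the slope exceeds 1 and the intercept is negative, as observed for most participants:

```python
def fit_line(xs, ys):
    """Ordinary least squares for PSE = slope * s_A + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical PSEs at four auditory locations (deg). A slope > 1 means
# auditory stimuli are perceived as shifted toward the periphery, and a
# negative intercept as shifted to the left, relative to vision.
auditory_locs = [-12.0, -4.0, 4.0, 12.0]
pses = [-16.0, -6.5, 3.0, 12.5]          # made-up example values
slope, intercept = fit_line(auditory_locs, pses)

# The visual locations perceived as co-located with the auditory ones
# follow directly from the regression line:
matched_visual = [slope * s + intercept for s in auditory_locs]
```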

Localization response precision
In the next part of the study, we measured participants' localization noise unrelated to spatial perception (e.g., noise due to holding a location in memory and errors in indicating the intended location). To this aim, participants performed a direct localization task with maximally reliable visual stimuli. At the same time, they were familiarized with our custom-made device used to move a visual cursor to the stimulus location. We assumed that localization errors were unbiased and independent of stimulus location and fitted them with a Gaussian distribution centered at zero to estimate the extent of noise corrupting localization responses (see Section S2 in S1 Appendix for a more complex model and results of a model comparison). Participants' perception-unrelated localization noise was 1.85° on average (range: 1.55–2.29°).
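For a zero-mean Gaussian, the maximum-likelihood estimate of the noise SD reduces to the root mean square of the errors. A minimal sketch with invented pointing errors:

```python
import math

def response_noise_sd(errors):
    """Maximum-likelihood SD of a zero-mean Gaussian fitted to
    localization errors (mean fixed at 0, as assumed in the text)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical pointing errors in degrees:
print(response_noise_sd([1.2, -2.1, 0.4, -1.8, 2.0]))  # ≈ 1.63
```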

Recalibration effects
In the final part of the study, participants completed six sessions of the audiovisual recalibration experiment (2 recalibration directions × 3 reliability levels). Each session consisted of three phases: (1) pre-recalibration: participants localized unimodally presented visual and auditory stimuli (Fig 5A); (2) recalibration: participants were presented with audiovisual stimulus pairs with a constant spatial discrepancy in perceptual space; they localized one modality, cued after stimulus presentation (Fig 5B and 5C); (3) post-recalibration: the unimodal localization task was repeated.
To statistically examine recalibration of each modality, we fitted linear regressions separately to pre- and post-recalibration localization responses as a function of stimulus location (Fig 6A). Recalibration effects were computed as the difference between pre- and post-recalibration regression intercepts; shifts compensating for the audiovisual discrepancy present during the recalibration phase were coded as positive. As no significant effects of recalibration direction on the recalibration effect were found in our statistical analysis (Section S1.2 in S1 Appendix), we averaged the recalibration effect across the two recalibration directions (visual to the right/left of auditory) for display (Fig 6B). For half of the participants (three out of six), auditory recalibration was a non-monotonic function of visual stimulus reliability. Two participants showed a monotonic increase in auditory recalibration with decreasing visual stimulus reliability. One participant showed a monotonic decrease in auditory recalibration as visual reliability decreased. Given these opposing individual patterns, it is unsurprising that our statistical analysis of the recalibration effect revealed no significant main effects or interactions involving visual stimulus reliability (Section S1.2 in S1 Appendix).
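The intercept-difference measure can be sketched as follows, with hypothetical localization responses (auditory localization before and after exposure to a visual cue displaced to the right):

```python
def intercept(xs, ys):
    """OLS intercept of localization responses regressed on stimulus
    location."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return my - slope * mx

# Hypothetical auditory responses (deg) at four stimulus locations:
locs = [-12.0, -4.0, 4.0, 12.0]
pre  = [-11.5, -4.2, 3.8, 11.9]
post = [-10.0, -2.8, 5.2, 13.4]

# Positive values indicate a shift compensating for the discrepancy:
recalibration_effect = intercept(locs, post) - intercept(locs, pre)
```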

Recalibration models
To understand the mechanisms of cross-modal recalibration, we compared participants' behavior to three existing models of cross-modal recalibration: (1) a reliability-based, (2) a fixed-ratio, and (3) a causal-inference model (for details see Models of audiovisual recalibration). All three models conceptualize recalibration as constant updating of a modality-specific shift applied to the measurements before the estimate is derived. However, these three models differ in their assumptions about the way in which the amount of recalibration for each modality is determined. As a consequence, these three models make different predictions for the influence of visual reliability on audiovisual recalibration.
The reliability-based model. According to this model, the brain determines the amount of recalibration (i.e., the measurement shift) for each modality based on the reliabilities of both measurements. The shift for one modality is updated by a fraction of the difference between the two measurements that is proportional to the other modality's relative reliability. Across trials, the amount of recalibration depends not only on stimulus reliability, but also on a common learning rate α for both modalities. This model predicts increasing visual and decreasing auditory recalibration effects with decreasing visual stimulus reliability (Fig 7, dot-dashed line).
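A single trial of this update rule can be sketched as follows (an illustration of the principle with made-up parameter values, not the fitted model):

```python
def reliability_based_update(shift_v, shift_a, m_v, m_a,
                             sigma_v, sigma_a, alpha):
    """One trial of the reliability-based model: each measurement shift
    moves by a fraction of the cross-modal measurement difference that
    is proportional to the OTHER modality's relative reliability, scaled
    by a common learning rate alpha."""
    r_v, r_a = 1.0 / sigma_v ** 2, 1.0 / sigma_a ** 2
    diff = m_a - m_v                                  # auditory minus visual
    shift_v += alpha * (r_a / (r_v + r_a)) * diff     # vision pulled toward audition
    shift_a -= alpha * (r_v / (r_v + r_a)) * diff     # audition pulled toward vision
    return shift_v, shift_a
```

With a reliable visual cue (small `sigma_v`), the auditory shift changes far more than the visual shift, so the model predicts that degrading visual reliability should weaken auditory recalibration.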
The fixed-ratio model. According to this model, the brain determines the amount of recalibration based solely on the modality of each sensory measurement, not on its reliability. The measurement shift for each sense is updated by a fraction of the difference between the two measurements. This fraction is determined only by modality-specific learning rates. Therefore, this model predicts modality-specific recalibration effects that are not influenced by stimulus reliability (Fig 7, dashed line).
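The corresponding per-trial update is even simpler; reliability does not enter at all (again an illustrative sketch, with hypothetical learning rates):

```python
def fixed_ratio_update(shift_v, shift_a, m_v, m_a, alpha_v, alpha_a):
    """One trial of the fixed-ratio model: modality-specific learning
    rates alone determine how much each measurement shift moves,
    regardless of stimulus reliability."""
    shift_v += alpha_v * (m_a - m_v)   # vision pulled toward audition
    shift_a += alpha_a * (m_v - m_a)   # audition pulled toward vision
    return shift_v, shift_a
```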
The causal-inference model. According to this model, the brain determines the amount of recalibration for each modality based on the difference between a measurement and the corresponding location estimate. The measurement shifts are updated by a fraction of this difference, either implemented as modality-specific learning rates or as a supra-modal learning rate. In this model, cross-modal recalibration depends on stimulus reliability, modality-specific localization biases, and inference about a common cause through their influences on the location estimates. The model can predict various effects of visual reliability on auditory recalibration (Fig 7, solid line).
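The key ingredient, the causal-inference location estimate, can be sketched using a standard Gaussian-prior formulation from the causal-inference literature. Note the deliberate simplifications: the paper's model assumes a flat location prior plus modality-specific biases, whereas this sketch uses a Gaussian spatial prior centered at 0° with SD `sigma_p`; all parameter values are invented. It illustrates the principle, not the fitted model:

```python
import math

def causal_inference_estimate(m_v, m_a, sigma_v, sigma_a,
                              sigma_p=30.0, p_common=0.5):
    """Auditory location estimate from model averaging over the two
    causal scenarios (common source vs. independent sources)."""
    v_v, v_a, v_p = sigma_v ** 2, sigma_a ** 2, sigma_p ** 2
    # Likelihood of both measurements given a single common source:
    var1 = v_v * v_a + v_v * v_p + v_a * v_p
    like1 = math.exp(-0.5 * ((m_v - m_a) ** 2 * v_p
                             + m_v ** 2 * v_a + m_a ** 2 * v_v) / var1) \
        / (2 * math.pi * math.sqrt(var1))
    # Likelihood given two independent sources:
    like2 = math.exp(-0.5 * (m_v ** 2 / (v_v + v_p)
                             + m_a ** 2 / (v_a + v_p))) \
        / (2 * math.pi * math.sqrt((v_v + v_p) * (v_a + v_p)))
    # Posterior probability that the cues share a common cause:
    post_c1 = like1 * p_common / (like1 * p_common + like2 * (1 - p_common))
    # Auditory estimate if integrated (common cause) vs. segregated:
    s_c1 = (m_v / v_v + m_a / v_a) / (1 / v_v + 1 / v_a + 1 / v_p)
    s_c2 = (m_a / v_a) / (1 / v_a + 1 / v_p)
    return post_c1 * s_c1 + (1 - post_c1) * s_c2

def update_auditory_shift(shift_a, m_a, estimate_a, alpha_a):
    """Recalibration step: the auditory shift moves toward the
    causal-inference estimate, by a fraction alpha_a of the
    measurement-estimate discrepancy (not of the raw cue conflict)."""
    return shift_a + alpha_a * (estimate_a - m_a)
```

Because the estimate depends jointly on reliabilities and on the common-cause posterior, the resulting recalibration can increase, decrease, or peak as visual reliability is varied.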

Modeling results
Model predictions. The causal-inference model is the only model that can capture the idiosyncratic influence of visual stimulus reliability on auditory recalibration across participants (Fig 8A). Data from five out of six participants were best fitted by the causal-inference model, either with a supra-modal learning rate (see Section S3 in S1 Appendix for model predictions) or with modality-specific ones. For most participants, neither the reliability-based nor the fixed-ratio model can reproduce the diverse influence of visual stimulus reliability on auditory recalibration (Fig 8B; Sections S4-S5 in S1 Appendix). The three models do not differ in terms of predicting visual recalibration and modality-specific biases (Section S6 in S1 Appendix).
Model comparison. To compare model performance quantitatively, we computed the Akaike information criterion (AIC) for all three models [64] and then calculated relative model-comparison scores, ΔAIC, which relate the AIC value of the best-fitting model (the model with the lowest AIC) to that of each of the other models (a high value of ΔAIC indicates stronger evidence for the best-fitting model; Fig 8C). ΔAIC values comparing both the reliability-based and the fixed-ratio model to the causal-inference model were large for the majority of participants, revealing substantial evidence for the causal-inference model of cross-modal recalibration.
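The ΔAIC computation can be sketched as follows; the negative log-likelihoods and parameter counts below are invented for illustration, not the study's fitted values:

```python
def aic(neg_log_likelihood, n_params):
    """Akaike information criterion from a model's maximized
    log-likelihood (passed here as a negative log-likelihood)."""
    return 2 * n_params + 2 * neg_log_likelihood

# Hypothetical fits for the three models: (NLL, number of parameters).
fits = {"reliability-based": (520.0, 9),
        "fixed-ratio":       (518.0, 10),
        "causal-inference":  (470.0, 11)}

aics = {name: aic(*f) for name, f in fits.items()}
best = min(aics.values())
delta_aic = {name: a - best for name, a in aics.items()}
# Large delta values indicate substantially weaker support than the
# best-fitting model; the best-fitting model itself scores 0.
```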

Discussion
In this study, we investigated the mechanism underlying cross-modal recalibration. To this aim, we measured the effects of visual stimulus reliability, a potential determinant of cross-modal recalibration, on audiovisual spatial recalibration. To induce recalibration, we repeatedly exposed participants to audiovisual pairs with a perceptually constant spatial discrepancy. To measure recalibration, we compared unimodal auditory and visual localization responses before and after the exposure. Auditory localization was recalibrated by vision; yet, the influence of visual stimulus reliability on auditory recalibration differed qualitatively across participants. To scrutinize the mechanisms of cross-modal recalibration, we compared participants' behavior to three models of recalibration: (1) a reliability-based model, which assumes that the amount of recalibration depends on the relative reliability of the cues that are in conflict; (2) a fixed-ratio model, which assumes that the amount of recalibration is fixed, dependent only on the modalities in conflict and independent of cue reliability; and (3) a causal-inference model, which ties recalibration to the percept of a cue. This percept depends on the other cue, on causal inference about whether the two cues come from a common source, on cue reliabilities, and on modality-specific spatial biases. Only the causal-inference model captured the idiosyncratic influences of cue reliability on cross-modal recalibration.

Only the causal-inference model captures the diverse influences of visual reliability on auditory recalibration by vision
Our results demonstrated diverse influences of visual stimulus reliability on auditory recalibration. For half of the participants, auditory recalibration was maximal at medium visual stimulus reliability. For some other participants, auditory recalibration increased with decreasing visual stimulus reliability. Neither of these patterns can be replicated by models of recalibration that assume the amount of recalibration relies directly on cue reliability [53,61]. These models can only predict decreases in recalibration as the stimulus reliability of the other modality decreases, as has been found previously [59,61]. Thus, the best prediction these models could produce for either monotonically or non-monotonically increasing auditory recalibration effects with decreasing visual reliability was no influence of stimulus reliability (Fig 8B, right panel). The observed influences of cue reliability on recalibration are also at odds with models of recalibration that assume the amount of recalibration relies only on the identity of the two modalities in conflict [62,65]. These models predict no influence of stimulus reliability on recalibration. Crucially, the causal-inference model of cross-modal recalibration [32,63] captures all the observed influences of stimulus reliability on cross-modal recalibration and is able to replicate all previous patterns of results [61,62] based on individual differences in cue reliability, the common-cause prior, and modality-specific spatial biases.
It is remarkable that the causal-inference model of cross-modal recalibration is capable of producing qualitatively different patterns of results for the amount of cross-modal recalibration as stimulus reliability changes [32]. Based on our experience with the model, we next provide an intuition of how individual differences in the sensory reliabilities, biases in spatial perception of both modalities, and the common-cause prior influence cross-modal recalibration according to the causal-inference model.
The role of cue reliability for cross-modal recalibration. Cue reliability influences the degree of cross-modal recalibration by influencing the posterior probability that both cues arose from a common cause as well as the integrated location estimate for the common-cause scenario. Both of them determine the final location estimate and in turn the amount of recalibration. When the visual cue is extremely reliable, the posterior probability that two discrepant cues originated from the same source is low (Fig 9B). If the common-cause scenario is unlikely, the auditory location estimate is mostly based on the estimate given separate sources and therefore located close to the auditory cue (Fig 9A, top panel), which leads to small recalibration effects.
However, the amount of auditory recalibration does not increase monotonically with decreasing visual reliability. With decreasing visual reliability, the integrated location estimate for the common-cause scenario is increasingly dominated by the auditory measurement (Fig 9C). The closer the integrated estimate is to the auditory measurement, the closer the final location estimate is to the auditory measurement, and the smaller the auditory recalibration effect will be. Importantly, the effect of variation in reliability depends on the tested reliability range (Fig 9D). Thus, studies that use different types of stimuli will likely obtain different results regarding the influence of stimulus reliability on cross-modal recalibration. The contradictions that emerged between previous studies [61,62] can be explained by the causal-inference model of recalibration.
The role of common-cause prior assumptions for cross-modal recalibration. The common-cause prior impacts the posterior probability of a common cause and hence the amount of recalibration, independent of stimulus reliability and discrepancy (Fig 10). As the common-cause prior increases, the posterior probability of a common cause increases, leading the final auditory location estimate to be farther away from the auditory measurement and hence yielding a larger measurement-shift update. We used a post-stimulus cue in the bimodal localization task during the recalibration phase to foster audiovisual recalibration, as the common-cause prior is strengthened when participants attend to both modalities [32].
Modality-specific spatial biases. Our model differs from previous causal-inference models of cross-modal perception in that we assumed the existence of modality-specific biases but not modality-specific spatial priors over stimulus location [32,42,66]. Both can account for the typically observed biases in localization, but modality-specific priors exert a stronger influence when visual reliability is reduced. In contrast, the influence of modality-specific biases does not vary as stimulus reliability changes. Thus, we conducted a control experiment examining whether the tendency to perceive visual stimuli as shifted towards the central fixation increased with decreasing visual reliability (Section S7 in S1 Appendix). The results showed no systematic influence of visual reliability. Thus, to avoid overfitting the model, we omitted modality-specific priors in our models and fitted only modality-specific biases.
The existence of fixed biases in the spatial perception of visual and auditory stimuli seems to be at odds with the concept of cross-modal recalibration. If discrepancies between the senses consistently lead to recalibration, why do these biases persist? Perceptual biases could reflect adjustments for sensory discrepancies during an early sensitive period of development, as has recently been shown for cross-modal biases in temporal perception [31]. After the sensitive period, it might become impossible to fully compensate for newly developing differences between the senses. In terms of our model, this would mean that a limit on the shift updates is set during early infancy.
Our study differs from previous spatial-recalibration studies in that we adjusted for perceptual biases in auditory relative to visual spatial perception by selecting visual locations perceptually aligned with pre-selected auditory locations. During piloting we presented stimuli with a constant physical spatial audiovisual discrepancy during the recalibration phase, and did not find significant recalibration effects. We attributed this to the modality-specific biases we had observed combined with the small spatial discrepancy used in our study. Indeed, our simulations with the causal-inference model (Fig 11, top panel) show that there are many combinations of proportional and constant biases that lead to minuscule or even negative recalibration effects when these biases are not adjusted for. Additionally, our simulations reveal a complex interaction between these biases and the spatial discrepancy. Recalibration through exposure to a constant and relatively large physical discrepancy, as done in most previous studies [43-45, 47-50, 67], would produce positive recalibration effects on average but not necessarily in all participants given that humans differ in their spatial biases. In contrast, keeping the discrepancy constant in perceptual space makes the recalibration effects less prone to individual differences in modality-specific biases and works better with smaller spatial discrepancies (Fig 11, bottom panel).

Fitting the causal-inference model
Here, we fitted, for the first time, the causal-inference model of recalibration to observed data. To achieve this, we fitted the outcome of the recalibration process rather than its build-up. The audiovisual recalibration phase itself is characterized by sequential dependence of the measurement shifts across trials and the lack of a closed-form solution for the model. Fitting such models is computationally expensive as the required number of simulations increases exponentially with each recalibration trial due to the sequential nature of the model. There might be a way to address these sequential dependencies using particle filters [68]. However, we concentrated on fitting the outcome of the recalibration process by repeatedly simulating the audiovisual recalibration phase. In this way, we obtained an approximation of the distribution of measurement-shifts given a set of parameters. The good match between the observed and predicted unimodal localization data as well as our checks of the fitting procedure (Section S8 in S1 Appendix) confirm the validity of our approach. One negative consequence of not fitting the recalibration process itself is that parameters reflecting stimulus reliability under bimodal conditions are not directly constrained by the data. We incorporated different reliabilities for unimodal and bimodal presentation conditions because previous studies indicated differences in one [32,37,69] or the other [70,71] direction. Yet, we remain cautious in interpreting the estimated bimodal reliabilities in our study.
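The repeated-simulation idea can be illustrated with a minimal Monte Carlo sketch. For brevity, the per-trial update below is the fixed-ratio rule rather than the causal-inference update, and all parameter values are made up; the paper's procedure applies the same repeated-simulation logic to approximate the distribution of measurement shifts under its model:

```python
import random

def simulate_recalibration_phase(n_trials, discrepancy, sigma_v, sigma_a,
                                 alpha_a, n_runs=1000, seed=1):
    """Monte Carlo approximation of the distribution of the auditory
    measurement shift at the end of the recalibration phase."""
    rng = random.Random(seed)
    final_shifts = []
    for _ in range(n_runs):
        shift_a = 0.0
        for _ in range(n_trials):
            m_v = rng.gauss(discrepancy, sigma_v)    # visual measurement
            m_a = shift_a + rng.gauss(0.0, sigma_a)  # shifted auditory measurement
            shift_a += alpha_a * (m_v - m_a)         # per-trial update toward vision
        final_shifts.append(shift_a)
    return final_shifts

shifts = simulate_recalibration_phase(n_trials=100, discrepancy=4.0,
                                      sigma_v=1.0, sigma_a=2.0, alpha_a=0.05)
mean_shift = sum(shifts) / len(shifts)   # approaches the 4 deg discrepancy
```

Each run simulates the sequential build-up of the shift across trials; the collection of final shifts approximates the distribution needed to evaluate the likelihood of the post-recalibration localization data.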
We additionally remain cautious with respect to the interpretation of other parameter estimates. We assumed a flat supra-modal prior over stimulus location as we found indications for trade-offs between modality-specific biases and a supra-modal prior over stimulus locations (Section S9.1 in S1 Appendix). As a consequence, the estimated modality-specific biases might have been underestimated and sensory reliabilities might have been overestimated. Additionally, the prior probability of a common cause and the modality-specific learning rate trade off and thus might be misestimated, because an increase in either factor can lead to a greater amount of recalibration (Section S9.2 in S1 Appendix).
Importantly, even though the possibility of biases in our parameter estimates exists, the mechanisms outlined at the beginning of the discussion explain the idiosyncratic influence of visual reliability on auditory recalibration, whereas the other models cannot qualitatively reproduce the observed results. Thus, our conclusion that causal-inference-based percepts regulate cross-modal recalibration stands independent of the parameter estimates.
[Fig 11. The influence of spatial discrepancy and modality-specific biases on the amount of auditory recalibration. Top row: the auditory recalibration effect (color key) as a function of the proportional and constant shift of visual relative to auditory locations when the spatial discrepancy (panels) is constant in physical space (i.e., s_V = s_A + spatial discrepancy). Bottom row: the auditory recalibration effect when the spatial discrepancy is constant in perceptual space (i.e., visual stimulus locations are selected to adjust for the perceptual biases in auditory relative to visual spatial perception, s_V = proportional shift × (s_A + spatial discrepancy) + constant shift).]
Future work might involve an experimental design that allows for better estimation of model parameters to enable a determination of the underlying cause of these individual differences and to relate them to behavior in other tasks. For example, more constraints on the experimental parameters could be obtained in a design in which unimodal trials are interspersed with recalibration trials, allowing the time course of recalibration to be measured.

Conclusion
This study examined the mechanism underlying cross-modal recalibration. To this aim, we measured audiovisual spatial recalibration while varying visual stimulus reliability. Stimulus reliability has been described as one plausible determinant the brain uses to decide which sensory modality should be recalibrated when there is a cue conflict. We found that visual stimulus reliability influenced auditory recalibration in qualitatively different ways across participants. Neither the reliability-based model nor an alternative model that assumes a fixed degree of recalibration for each modality and completely ignores stimulus reliability could replicate the data. Yet, a causal-inference model was able to capture all the observed diverse influences of reliability on recalibration, including two patterns found in previous studies. In this model, recalibration is not based on a mere comparison of two sensory cues, but rather relates each cue to its corresponding perceptual estimate, and by doing so incorporates causal inference of a common source for a cross-modal stimulus pair, cue reliability, and modality-specific perceptual biases into cross-modal recalibration.

Ethics statement
Experimental protocols were approved by the Institutional Review Board at New York University (protocol number FY2016-595). All participants gave informed written consent prior to the beginning of the experiment and five of them were compensated $10 per hour for participation.

Participants
Six participants (three females; aged 22-29 years, mean: 25 years; all six right-handed), recruited from New York University and naive to the purpose of the study, participated in the experiment. All stated that they were free of visual, auditory, or motor impairments. The data of one additional participant (female, 21 years, ambidextrous) were excluded from data analysis and model fitting due to conflicting modality-specific spatial biases found in the bimodal spatial-discrimination and unimodal localization tasks (Section S10 in S1 Appendix).

Apparatus and stimuli
The experiment was conducted in a dark and semi-sound-attenuated room. Participants were seated 1 m from an acoustically transparent, white screen (1.36 × 1.02 m, 68 × 52° visual angle). An LCD projector (Hitachi CP-X3010N, 1024 × 768 pixels, 60 Hz) was mounted above and behind participants to project visual stimuli onto the screen. The visual stimuli were clusters of 10 randomly placed low-contrast (36.55 cd/m²) Gaussian blobs (SD: 3.6°) added to a grey background (29.33 cd/m²). Blob locations were drawn from a two-dimensional Gaussian distribution (vertical SD: σy = 5.4°; horizontal SD: (1) σx = 1.1° for the high-reliability condition, (2) σx = 5.4° for the medium-reliability condition, and (3) σx = 8.7° for the low-reliability condition). We recorded the centroid of the cluster, rather than the center parameter of the two-dimensional Gaussian used for cluster generation, as the visual stimulus location. Each visual stimulus was presented for 100 ms (6 frames) and followed by a 900 ms backward masker (54 frames of randomly black or white checks of 4 × 4 pixels filling the screen) to erase any visual memory of the stimulus.
Behind the screen, a loudspeaker (20 W, 4 Ω full-range speaker; Adafruit, New York) was mounted on a sledge attached to a linear rail (1.5 m long, 23 cm above the table, 5 cm behind the screen). The rail was hung from the ceiling using elastic ropes, perpendicular to the line of sight. The position of the sledge on the rail was controlled by a microcomputer (Arduino Mega 2560; Arduino, Somerville, MA, USA). The microcomputer controlled a stepper motor that rotated a threaded rod (OpenBuilds, www.openbuildspartstore.com). This way the speaker was moved to the auditory stimulus location. The auditory stimulus was a 100 ms broadband noise burst (0-20.05 kHz, 60 dB), windowed using the first 100 ms of a sine wave with a period of 200 ms. To control audiovisual synchrony in bimodal trials, we adjusted audiovisual latencies in the presentation software and confirmed their synchrony by recording their relative latencies using a microphone and photodiode.
We were concerned that participants might infer the position of the speaker from the sounds produced by sledge movements. We tried to foil this strategy by playing a masking sound from an additional speaker behind the center of the screen during each movement of the speaker. The masking sound (55 dB) was a recording of a randomly chosen speaker movement plus white noise. Additionally, the speaker moved from its last position to the target location through a stopover location. The stopover was randomly chosen under the constraint that the total distances the speaker moved were approximately equal across trials. We carried out a control experiment, which indicated that participants could not infer the speaker position based on sounds arising from the speaker movements (Section S11 in S1 Appendix).
Responses were given using a numeric keypad in spatial-discrimination tasks, and a pointing device in localization tasks. The pointing device was custom built using a 10 kΩ linear-taper rotary potentiometer (Uxcell) with a plastic ruler (5 × 17 cm) securely fixed perpendicular to the shaft of the potentiometer. Participants placed their hands on either side of the ruler and rotated it so that it pointed at the perceived location of the stimulus. A visual cursor (an 8 × 8 pixel white square) was displayed to indicate the selected location. The pointing device was covered by a black box (42 × 30 × 15 cm) so that participants could not see their hands while using the device. A foot pedal was placed on the floor and used to confirm the current position of the pointing device as the response.

Procedure
At the beginning of the study, participants completed a unimodal spatial-discrimination task, once for the auditory stimulus and once for each of the three visual stimuli (low, medium, and high spatial uncertainty). Participants then completed a bimodal spatial-discrimination task followed by a pointing practice task. The last part of the study consisted of six recalibration sessions, one for each condition (2 recalibration directions × 3 reliability levels). Each session started with a unimodal localization task, followed by an audiovisual recalibration phase in which participants localized one modality of a spatially discrepant audiovisual stimulus, and ended with a repetition of the unimodal localization task.

Unimodal spatial-discrimination task. Participants' spatial-discrimination thresholds were measured using a 2-interval, forced-choice (2IFC) procedure. In each trial, a standard and a test stimulus were presented in random order. Each stimulus was preceded by a fixation cross, presented at the center of the screen for 1,000 ms, followed by a 2,000 ms-long period of blank screen during which the loudspeaker moved to its position. The actual stimulus lasted 100 ms, followed by either a 900 ms-long backward masker (visual stimuli) or a blank screen for 900 ms (auditory stimuli). After the second stimulus period was over, a response probe was displayed and participants indicated by button press which interval contained the stimulus that was farther to the right (Fig 1B). Visual feedback was provided for 500 ms immediately after the response was given. The inter-trial interval was 500 ms.
The standard stimulus was located straight-ahead at the center of the screen; the location of the test stimulus was controlled by four interleaved staircases, two of which had the test stimulus start at 12.5° (to the right of straight-ahead), and the other two at −12.5° (12.5° to the left of straight-ahead). For the two staircases starting at one side, one followed the two-down-one-up rule (with down being defined as moving the test stimulus leftwards), converging to a probability of 71% [75] of perceiving the test stimulus as farther to the right than the standard stimulus; the other staircase followed the one-down-two-up rule, converging to a probability of 29%. The initial step size was 1.9°, decreased to 1.0° after the first staircase reversal and to 0.5° after the third reversal. Each staircase consisted of 40 trials, resulting in a total of 160 trials. To improve the estimation of the lapse rate, a test stimulus was presented distant from the center (±12.5°) once every 10 trials. The resulting total of 176 trials was evenly split into 4 blocks.
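As an illustration, the staircase logic described in the text (two-down-one-up, with step sizes shrinking from 1.9° to 1.0° after the first reversal and to 0.5° after the third) can be sketched as follows. The simulated observer and its parameters (`true_pse`, `sigma`) are hypothetical stand-ins for illustration, not the fitted model:

```python
import math
import random

def simulate_staircase(true_pse=0.0, sigma=3.0, start=12.5, n_trials=40, seed=1):
    """Sketch of one two-down-one-up staircase from the 2IFC task.

    The hypothetical observer reports "test farther right" with probability
    Phi((x - true_pse) / sigma); the staircase then moves the test stimulus
    left after two consecutive "right" responses and right otherwise.
    """
    rng = random.Random(seed)
    x = start
    track = []
    n_reversals = 0
    consec_right = 0
    last_move = 0
    for _ in range(n_trials):
        track.append(x)
        p_right = 0.5 * (1.0 + math.erf((x - true_pse) / (sigma * math.sqrt(2.0))))
        if rng.random() < p_right:                 # judged "test farther right"
            consec_right += 1
            move = -1 if consec_right == 2 else 0  # two-down-one-up: move left
            if move:
                consec_right = 0
        else:
            consec_right = 0
            move = 1                               # one "left" judgment: move right
        if move:
            if last_move and move != last_move:    # direction change = reversal
                n_reversals += 1
            last_move = move
            # step size shrinks after the 1st and 3rd reversals
            step = 1.9 if n_reversals < 1 else (1.0 if n_reversals < 3 else 0.5)
            x += move * step
    return track
```

Converging pairs of such staircases (71% and 29% rules) bracket the point of subjective equality from both sides.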
Participants completed the spatial-discrimination task for each of the three levels of visual stimulus reliability in random order; the auditory stimulus was always tested last. Participants took about an hour to complete one stimulus condition; typically, they spread the four stimulus conditions across two days.

Bimodal spatial-discrimination task. Participants' biases in auditory relative to visual spatial perception were measured using a spatial 2IFC procedure. An auditory and a high-reliability visual stimulus were presented in random order, and participants indicated by button press whether the visual stimulus was located to the left or right of the auditory stimulus. The procedure was otherwise identical to that of the unimodal spatial-discrimination task (Fig 3A). No feedback was provided.
The auditory stimulus was presented at one of four locations (±2.5° or ±7.5°). We used an adaptive staircase procedure to effectively sample the visual stimulus space for each participant, as the range of meaningful stimulus locations varied considerably across participants due to individual perceptual biases. Specifically, the location of the visual stimulus was controlled by eight interleaved staircases, two for each of the four auditory locations. Of the two staircases per auditory location, one started the test stimulus at 15° relative to the auditory location, and the other at −15°. Staircases with the visual stimulus starting from the right of the auditory stimulus followed the one-down-two-up rule, converging to a probability of 29% of judging the visual stimulus as located to the right of the auditory stimulus. Staircases with the visual stimulus starting from the left of the auditory stimulus followed the two-down-one-up rule, converging to a probability of 71%. Staircase step size was updated as described above. Each staircase consisted of 36 trials. Trials with the visual stimulus located at ±15° relative to the auditory stimulus were inserted once every nine trials, resulting in a total of 320 trials. The session was divided into six blocks. Usually, participants took about two hours to complete all trials.
Pointing practice task. Participants' localization precision, independent of spatial perception, was measured using a localization task with visual stimuli of maximal spatial reliability. In each trial, a white square (8 × 8 pixels ≈ 0.6° × 0.6°) was displayed on the screen for 100 ms. The stimulus was followed by a 900 ms-long backward masker and 500 ms of blank screen.
Subsequently, the response cursor, a green square of the same size as the white square, appeared on the screen. Participants used the pointing device to move the cursor to the stimulus position and confirmed their response with the foot pedal. Response times were unrestricted. The cursor location was shown during adjustment, but error feedback was not provided. There were eight possible horizontal positions for the stimulus, evenly spaced from −17.5° to 17.5° in steps of 5°. Each stimulus location was visited 30 times in random order, resulting in a total of 240 trials. The inter-trial interval was 500 ms. This task took 30 minutes to complete and was typically administered after the bimodal spatial-discrimination task on the same day.
Unimodal localization task (pre- and post-recalibration phase). Participants' baseline and post-recalibration spatial perception were measured using a unimodal spatial-localization task. In each trial, either an auditory or a visual stimulus was presented, again preceded by a fixation cross and a blank screen. The inter-trial interval was 100 ms; otherwise, timing was identical to that of the discrimination tasks. Auditory stimulus locations were the same as in the bimodal spatial-discrimination task; visual stimulus locations were the four locations identified as perceptually co-located with those four auditory locations using the bimodal spatial-discrimination task. Participants again responded by moving a visual cursor to the stimulus location (Fig 5A). There was no time limit for the response. The location of the visual cursor was shown during the adjustment, but feedback about the localization error was not provided. Each of the four target locations per modality was tested 12 times, resulting in a total of 96 trials administered in pseudorandom order. These trials were split into four blocks. Usually, participants took about 25 min to complete all 96 trials; they did so once at the beginning of the session (pre-recalibration phase) and once again after the recalibration phase (post-recalibration phase).
Bimodal localization task (recalibration phase). During the audiovisual recalibration phase, participants were presented with temporally synchronous but spatially discrepant audiovisual stimuli. We asked them to localize either the auditory or the visual component, with the localization modality cued after stimulus presentation (Fig 5B), a procedure that has been associated with larger recalibration effects than non-spatial tasks [32]. All other trial parameters were identical to the unimodal localization task. Three audiovisual stimulus pairs were chosen such that each auditory stimulus location was paired with the visual location perceived as aligned with the neighboring auditory location: the location to its left in the visual-left-of-auditory condition, and the location to its right in the visual-right-of-auditory condition (Fig 5C). Each of the three audiovisual pairs was repeated 40 times in random order, resulting in a total of 120 trials. These trials were split into four blocks. Usually, participants took about 30 min to complete all 120 trials. A full session (pre-recalibration, recalibration, and post-recalibration phases) took 80 min. Participants completed the six sessions on separate days.

Data preparation and statistical analysis
Unimodal spatial-discrimination task. For the unimodal spatial-discrimination task, the data were coded as the probability of identifying the test stimulus as located farther to the right than the standard stimulus as a function of test stimulus location, separately for each stimulus type (Fig 2A). These data were fitted with a cumulative Gaussian distribution centered at 0 and with a lapse rate constrained to be less than or equal to 6% [76]. The JND was calculated as half the distance between the stimulus locations corresponding to probabilities of 0.25 and 0.75 according to the fitted cumulative Gaussian distribution (unscaled by the lapse rate). To measure goodness of fit, we computed adjusted R² values based on the binned data (bin size = 1.8°). To derive error bars, we randomly resampled the raw data with replacement 1,000 times, fitted psychometric functions to each resampled dataset, calculated the JND, and took the 2.5 and 97.5 percentiles of the 1,000 JNDs as the bootstrapped confidence interval.

Bimodal spatial-discrimination task. Data from the bimodal spatial-discrimination task were coded as the probability of identifying the visual stimulus as located to the right of the auditory stimulus as a function of visual stimulus location. We fitted four cumulative Gaussian distributions to these data, one for each auditory standard stimulus. Again, we included a lapse rate, constrained to be less than or equal to 6% [76]. Adjusted R² values were calculated based on binned data (bin size = 3°). The point of subjective equality (PSE) was defined as the visual stimulus location corresponding to a probability of 0.5 according to the unscaled psychometric function. The four PSEs were modeled as a linear function of auditory stimulus location. From this linear regression of the PSEs, we computed the locations of the visual stimulus perceived as co-located with the four auditory locations.
In the subsequent unimodal and bimodal spatial-localization tasks, we presented visual stimuli at these locations rather than those directly indicated by the PSEs to reduce effects of random noise during the bimodal spatial-discrimination task. 95% confidence intervals for each parameter were obtained as before.
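For a cumulative Gaussian centered at 0, the JND defined above (half the distance between the 25% and 75% points of the unscaled function) reduces to σ · Φ⁻¹(0.75). A stdlib-only sketch of this computation (the bisection-based inverse CDF is merely an implementation convenience, not part of the authors' pipeline):

```python
import math

def norm_ppf(p):
    """Inverse standard-normal CDF via bisection on erf (stdlib-only helper)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def jnd_from_sigma(sigma):
    """JND of a cumulative Gaussian centered at 0: half the distance between
    the locations at probabilities 0.25 and 0.75, i.e. sigma * PPF(0.75)."""
    x75 = sigma * norm_ppf(0.75)
    x25 = sigma * norm_ppf(0.25)
    return 0.5 * (x75 - x25)
```

For example, a psychometric SD of 3° corresponds to a JND of about 2.02°.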
Pointing practice task. Data from the pointing practice task, used to measure localization response precision, were not statistically analyzed but were filtered before the model fitting. For each stimulus location, we z-transformed the data by subtracting the mean localization response per stimulus location and then dividing by the standard deviation of all demeaned responses. Localization responses with a z-score outside of [−3, 3] were identified as outliers (0-1.67% of trials) and excluded from the model fitting.
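The z-transformation used for outlier screening (demean per stimulus location, then divide by the SD of all demeaned responses) can be sketched as follows; the dictionary-based data layout is an assumption for illustration:

```python
import statistics

def filter_outliers(responses_by_location, z_max=3.0):
    """Drop localization responses whose z-score falls outside [-z_max, z_max].

    Responses are demeaned per stimulus location, then all demeaned values
    are pooled to compute a single standard deviation, mirroring the
    procedure described in the text.
    """
    demeaned = []
    for loc, resp in responses_by_location.items():
        m = statistics.fmean(resp)
        demeaned.extend((loc, r, r - m) for r in resp)
    sd = statistics.pstdev([d for _, _, d in demeaned])
    return [(loc, r) for loc, r, d in demeaned if abs(d / sd) <= z_max]
```

A response far from its location's mean, relative to the pooled spread, is excluded before model fitting.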
Unimodal localization task (pre- and post-recalibration phase). Data from the unimodal localization task were filtered separately for each modality, stimulus location, and phase. To compute the means for the z-transformation, auditory and visual localization responses in the pre-recalibration phase were pooled across all six sessions, as the day of testing should not influence localization performance. In contrast, the means of auditory and visual localization responses in the post-recalibration phase were calculated separately for each of the six conditions, as each recalibration condition should influence localization differently. Then, we computed the standard deviation of the demeaned auditory localization responses pooled across all six sessions and both phases. The standard deviation of demeaned visual localization responses was calculated separately for each visual-reliability condition, given that stimulus reliability should influence localization precision. Localization responses identified as outliers (z-scores outside of [−3, 3]; 0.78-1.82% of trials) were excluded from all further analyses.
For statistical analysis and display, we summarized localization responses as a linear function of stimulus location, separately for each modality and phase. For each modality, localization responses in the pre-recalibration phase were pooled across all six sessions (visual reliability should influence the precision but not the accuracy of the localization responses; Section S7 in S1 Appendix). Responses from the post-recalibration phase were regressed separately for each condition, because each condition should influence localization differently. The seven regression lines per modality were fit with the constraint that all have the same slope. The amount of recalibration of one modality by the other was defined as the distance between the intercepts of the regression lines for pre- and post-recalibration localization responses. It was coded as positive if localization responses in the post-recalibration phase were shifted to compensate for the audiovisual discrepancy in the preceding recalibration phase. For the statistical analysis, we calculated the amount of recalibration for each modality and recalibration condition (2 recalibration directions × 3 reliability levels). To derive confidence intervals, localization responses were resampled separately for each location, task, session, and modality.
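The recalibration measure, an intercept difference between regression lines constrained to share one slope, can be sketched for a single pre/post pair (the full analysis fits seven lines per modality jointly; this reduced version is for illustration):

```python
import numpy as np

def recalibration_amount(pre_x, pre_y, post_x, post_y):
    """Fit pre- and post-recalibration localization responses with two
    regression lines sharing a single slope; the recalibration amount is
    the difference between the two intercepts (post minus pre)."""
    x = np.concatenate([pre_x, post_x]).astype(float)
    y = np.concatenate([pre_y, post_y]).astype(float)
    is_post = np.concatenate([np.zeros(len(pre_x)), np.ones(len(post_x))])
    # design matrix columns: shared slope, common intercept, post-phase offset
    A = np.column_stack([x, np.ones_like(x), is_post])
    slope, intercept_pre, shift = np.linalg.lstsq(A, y, rcond=None)[0]
    return shift
```

If post-recalibration responses are uniformly displaced by 2° at every location, the fitted shift is 2°.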
Data from the bimodal spatial-localization task conducted during the audiovisual recalibration phase were not analyzed. Data analysis was done using Python 3.7, R 4.0.2, and MATLAB 2019a.

Models of audiovisual recalibration
In this section we lay out the definition of recalibration underlying all models reported here and then describe the recalibration process during the audiovisual recalibration phase according to each of the models. Finally, we provide a formalization of each of the tasks used to constrain the model parameters followed by the details of how the models were fit to the data.

Definition of recalibration
Each stimulus at location $s$ in the world leads to a sensory measurement $m'$ in an observer's brain. This measurement is corrupted by Gaussian-distributed sensory noise. Thus, with repeated presentations of stimuli at location $s$, the sensory measurements correspond to scattered spatial locations, $m' \sim \mathcal{N}(s', \sigma'^2)$. The variability of the measurements is determined by the stimulus reliability $1/\sigma'^2$. To allow integration of information from different modalities, measurements are remapped into a common internal reference frame. Hence, the measurement distribution is centered on $s'$, the remapped location of $s$. As part of the remapping process, spatial discrepancies between the senses are accounted for by shifting the measurements by a modality-specific amount $\Delta$. We model recalibration as the process of updating these shifts following each encounter with a cross-modal stimulus pair [32,63] and probed this updating process by misaligning the physical visual and auditory stimuli to create an artificial sensory discrepancy.
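The measurement model can be simulated directly; the remapped location and noise SD below are illustrative values, not fitted parameters:

```python
import random
import statistics

def sample_measurements(s_remapped, sigma, n=10000, seed=0):
    """Repeated presentations of a stimulus with remapped location s' yield
    Gaussian-scattered measurements m' ~ N(s', sigma'^2); the stimulus
    reliability is 1 / sigma'^2."""
    rng = random.Random(seed)
    return [rng.gauss(s_remapped, sigma) for _ in range(n)]

# illustrative values: remapped location 5 deg, measurement SD 2 deg
m = sample_measurements(5.0, 2.0)
center, spread = statistics.fmean(m), statistics.stdev(m)
```

The empirical mean recovers the remapped location and the empirical SD the measurement noise, as the model asserts.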
We assumed that observers were calibrated as far as possible at the beginning of each experimental session. Thus, the remapped stimulus locations in internal space were modeled as linear functions of the physical stimulus locations $s_A$ and $s_V$, that is, $s'_A = a_A s_A + b_A$ and $s'_V = a_V s_V + b_V$. However, given that we can only measure relative biases, there is no way to empirically isolate the remapping of one modality. As a consequence, and without loss of generality, we set $s'_V$ equal to $s_V$ (i.e., $a_V = 1$ and $b_V = 0$). We use the variables $\Delta_{A_i}$ and $\Delta_{V_i}$ to exclusively capture the update in measurement shifts after encounters with the spatially discrepant audiovisual stimulus pairs with visual reliability $i$ ($i \in \{1, 2, 3\}$) during the audiovisual recalibration phase. In addition, we assumed that $\Delta_{A_i}$ and $\Delta_{V_i}$ are updated on every trial of the phase, so that the location-independent shifts at the end of the task equal the sum of the initial shifts and the shift updates accumulated over the 120 trials. The final shifts, $\Delta_{A_i}(121)$ and $\Delta_{V_i}(121)$, were assumed to be maintained throughout the subsequent post-recalibration task, as observers were not exposed to spatially aligned audiovisual pairs after the recalibration phase (Section S12 in S1 Appendix).
We further assumed that stimulus reliability differed between unimodal ($1/\sigma'^2_A$) and bimodal ($1/\sigma'^2_{AV,A}$) stimulus presentations. Note that we denote the visual-reliability condition for variables associated with the auditory modality (i.e., with a subscript $A_i$) when the value of that variable can be affected by visual measurements (e.g., shifts $\Delta_{A_i}$ or, below, sensory estimates $\hat{s}'_{A_i}$), but not otherwise (e.g., measurements $m'_A$ or measurement variances $\sigma'^2_A$).

Models of the recalibration process
The reliability-based model of cross-modal recalibration. According to this model, each modality should be recalibrated in the direction of the other modality by an amount that is proportional to the other modality's relative reliability [53]. In other words, after every trial, the measurement shifts $\Delta_{A_i}$ and $\Delta_{V_i}$ are updated in the direction of the discrepancy between the visual and auditory measurements by an amount proportional to the two modalities' relative reliabilities:
$$\Delta_{A_i}(t+1) = \Delta_{A_i}(t) + \alpha\, w_{V_i}\, \big(m'_{V_i,l_V(t)} - m'_{A,l_A(t)}\big), \quad \text{where } w_{V_i} = \frac{1/\sigma'^2_{AV,V_i}}{1/\sigma'^2_{AV,A} + 1/\sigma'^2_{AV,V_i}},$$
and analogously
$$\Delta_{V_i}(t+1) = \Delta_{V_i}(t) + \alpha\, w_A\, \big(m'_{A,l_A(t)} - m'_{V_i,l_V(t)}\big), \quad \text{where } w_A = \frac{1/\sigma'^2_{AV,A}}{1/\sigma'^2_{AV,A} + 1/\sigma'^2_{AV,V_i}},$$
where $l_A$ and $l_V$ index the auditory and visual locations ($l_A, l_V \in \{1, 2, 3, 4\}$), $\alpha$ denotes a supra-modal learning rate, and $t$ denotes the trial number.
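A minimal sketch of one trial of the reliability-based update (variable names and the example values are illustrative):

```python
def reliability_based_update(delta_a, delta_v, m_a, m_v, var_a, var_v, alpha):
    """One trial of the reliability-based model: each shift moves toward the
    other modality's measurement by the supra-modal learning rate alpha times
    the OTHER cue's relative reliability times the measured discrepancy."""
    rel_a, rel_v = 1.0 / var_a, 1.0 / var_v
    w_v = rel_v / (rel_a + rel_v)   # visual weight drives the auditory update
    w_a = rel_a / (rel_a + rel_v)   # auditory weight drives the visual update
    delta_a = delta_a + alpha * w_v * (m_v - m_a)
    delta_v = delta_v + alpha * w_a * (m_a - m_v)
    return delta_a, delta_v
```

Lowering visual reliability (raising `var_v`) shrinks the auditory update, which is exactly the monotonic pattern this model predicts.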
The fixed-ratio model of cross-modal recalibration. According to this model, after every trial, $\Delta_{A_i}$ and $\Delta_{V_i}$ are updated in the direction of the discrepancy between the visual and auditory measurements by a fixed ratio of this discrepancy. The ratio of the update depends solely on the identity of the modality and thus is independent of stimulus reliability [62]. $\Delta_{A_i}$ and $\Delta_{V_i}$ are updated according to the following equations:
$$\Delta_{A_i}(t+1) = \Delta_{A_i}(t) + \alpha_A \big(m'_{V_i,l_V(t)} - m'_{A,l_A(t)}\big)$$
and
$$\Delta_{V_i}(t+1) = \Delta_{V_i}(t) + \alpha_V \big(m'_{A,l_A(t)} - m'_{V_i,l_V(t)}\big),$$
where $\alpha_A$ and $\alpha_V$ are modality-specific learning rates.
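The corresponding one-trial sketch for the fixed-ratio model (names and values illustrative):

```python
def fixed_ratio_update(delta_a, delta_v, m_a, m_v, alpha_a, alpha_v):
    """One trial of the fixed-ratio model: each shift moves toward the other
    modality's measurement by a fixed, modality-specific fraction of the
    discrepancy, irrespective of stimulus reliability."""
    delta_a = delta_a + alpha_a * (m_v - m_a)
    delta_v = delta_v + alpha_v * (m_a - m_v)
    return delta_a, delta_v
```

Because the fractions are fixed, this model predicts the same auditory recalibration at every visual-reliability level.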
The causal-inference model of recalibration. In this model, the shift updates are determined by the discrepancy between a measurement and the corresponding perceptual estimate for each modality [63]. The spatial discrepancy between auditory and visual measurements and the relative reliabilities of both stimuli influence the shift updates only indirectly, by means of their influence on the location estimates, $\hat{s}'_{A_i,l_A(t)}$ and $\hat{s}'_{V_i,l_V(t)}$. Additionally, the location estimates and thus the shift updates are contingent on the degree to which the brain infers a common cause or separate causes for the two measurements [30,63].
The location estimates are a mixture of two conditional location estimates, one for each causal scenario (common audiovisual source, $C = 1$, or different auditory and visual sources, $C = 2$). In the case of a common source, the location estimates of the audiovisual stimulus pair are identical and equal to the reliability-weighted average of the two measurements and the mean of the supra-modal spatial prior:
$$\hat{s}'_{A_i,l_A(t),C=1} = \hat{s}'_{V_i,l_V(t),C=1} = \frac{m'_{A,l_A(t)}/\sigma'^2_{AV,A} + m'_{V_i,l_V(t)}/\sigma'^2_{AV,V_i} + \mu'_P/\sigma'^2_P}{1/\sigma'^2_{AV,A} + 1/\sigma'^2_{AV,V_i} + 1/\sigma'^2_P}.$$
In the case of two separate sources, the location estimates of the auditory and the visual stimulus, $\hat{s}'_{A_i,l_A(t),C=2}$ and $\hat{s}'_{V_i,l_V(t),C=2}$, are equal to the reliability-weighted averages of $m'_{A,l_A(t)}$ and $\mu'_P$ for the auditory estimate, and $m'_{V_i,l_V(t)}$ and $\mu'_P$ for the visual estimate, respectively:
$$\hat{s}'_{A_i,l_A(t),C=2} = \frac{m'_{A,l_A(t)}/\sigma'^2_{AV,A} + \mu'_P/\sigma'^2_P}{1/\sigma'^2_{AV,A} + 1/\sigma'^2_P}, \qquad \hat{s}'_{V_i,l_V(t),C=2} = \frac{m'_{V_i,l_V(t)}/\sigma'^2_{AV,V_i} + \mu'_P/\sigma'^2_P}{1/\sigma'^2_{AV,V_i} + 1/\sigma'^2_P}.$$
The final location estimates are derived by model averaging (see alternative decision strategy in Section S3 in S1 Appendix). Specifically, the final location estimate $\hat{s}'_{A_i,l_A(t)}$ is the average of the conditional location estimates, $\hat{s}'_{A_i,l_A(t),C=1}$ and $\hat{s}'_{A_i,l_A(t),C=2}$, with each estimate weighted by the posterior probability of its causal structure:
$$\hat{s}'_{A_i,l_A(t)} = P(C=1 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)})\, \hat{s}'_{A_i,l_A(t),C=1} + P(C=2 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)})\, \hat{s}'_{A_i,l_A(t),C=2},$$
and analogously for the visual location estimate. The posterior probability of a common source for the auditory and visual measurements in trial $t$, $P(C=1 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)})$, is proportional to the product of the likelihood of a common source for these measurements in trial $t$ and the prior probability of a common source for visual and auditory measurements in general, $P(C = 1)$:
$$P(C=1 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)}) = \frac{P(m'_{A,l_A(t)}, m'_{V_i,l_V(t)} \mid C=1)\, P(C=1)}{P(m'_{A,l_A(t)}, m'_{V_i,l_V(t)} \mid C=1)\, P(C=1) + P(m'_{A,l_A(t)}, m'_{V_i,l_V(t)} \mid C=2)\, \big(1 - P(C=1)\big)}.$$
The posterior probability of two separate sources, one for the auditory and one for the visual measurement, is $P(C=2 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)}) = 1 - P(C=1 \mid m'_{A,l_A(t)}, m'_{V_i,l_V(t)})$. The likelihood of different sources for the visual and auditory measurements in trial $t$, $P(m'_{A,l_A(t)}, m'_{V_i,l_V(t)} \mid C=2)$, is the product of the likelihoods of the internally represented auditory and visual stimulus locations $s'_A$ and $s'_V$ given the auditory measurement $m'_{A,l_A(t)}$ and the visual measurement $m'_{V_i,l_V(t)}$, and the supra-modal prior.
Given that the measurements in this causal scenario stem from different sources, the product is integrated over all possible remapped visual and auditory stimulus locations, $s'_V$ and $s'_A$:
$$P(m'_{A,l_A(t)}, m'_{V_i,l_V(t)} \mid C=2) = \int P(m'_{A,l_A(t)} \mid s'_A)\, P(s'_A)\, ds'_A \int P(m'_{V_i,l_V(t)} \mid s'_V)\, P(s'_V)\, ds'_V,$$
which evaluates to the product of two Gaussian densities, $\mathcal{N}\big(m'_{A,l_A(t)};\, \mu'_P,\, \sigma'^2_{AV,A} + \sigma'^2_P\big)\, \mathcal{N}\big(m'_{V_i,l_V(t)};\, \mu'_P,\, \sigma'^2_{AV,V_i} + \sigma'^2_P\big)$. The updates of the shifts are determined by the discrepancy between each measurement and its final location estimate, scaled by two modality-specific learning rates, $\alpha_A$ and $\alpha_V$ [63]:
$$\Delta_{A_i}(t+1) = \Delta_{A_i}(t) + \alpha_A \big(\hat{s}'_{A_i,l_A(t)} - m'_{A,l_A(t)}\big)$$
and
$$\Delta_{V_i}(t+1) = \Delta_{V_i}(t) + \alpha_V \big(\hat{s}'_{V_i,l_V(t)} - m'_{V_i,l_V(t)}\big).$$
We also tested a version of the model with one supra-modal learning rate ($\alpha = \alpha_A = \alpha_V$).
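A numeric sketch of one causal-inference update, assuming the standard closed-form common-source likelihood for Gaussian measurements and a Gaussian spatial prior; all names and the example parameter values are illustrative, not the fitted model:

```python
import math

def gauss(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def causal_inference_update(delta_a, delta_v, m_a, m_v, var_a, var_v,
                            mu_p, var_p, p_common, alpha_a, alpha_v):
    """One trial: posterior over C, model-averaged location estimates, and
    shift updates that move each measurement toward its own estimate."""
    ra, rv, rp = 1.0 / var_a, 1.0 / var_v, 1.0 / var_p
    # conditional estimates (reliability-weighted averages incl. the prior)
    s_c1 = (m_a * ra + m_v * rv + mu_p * rp) / (ra + rv + rp)   # common source
    s_a_c2 = (m_a * ra + mu_p * rp) / (ra + rp)                 # separate sources
    s_v_c2 = (m_v * rv + mu_p * rp) / (rv + rp)
    # likelihoods of the measurement pair under each causal structure
    denom = var_a * var_v + var_a * var_p + var_v * var_p
    num = ((m_a - m_v) ** 2 * var_p + (m_a - mu_p) ** 2 * var_v
           + (m_v - mu_p) ** 2 * var_a)
    like_c1 = math.exp(-0.5 * num / denom) / (2.0 * math.pi * math.sqrt(denom))
    like_c2 = gauss(m_a, mu_p, var_a + var_p) * gauss(m_v, mu_p, var_v + var_p)
    # posterior probability of a common source
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1.0 - p_common))
    # model-averaged final estimates
    s_a = post_c1 * s_c1 + (1.0 - post_c1) * s_a_c2
    s_v = post_c1 * s_c1 + (1.0 - post_c1) * s_v_c2
    # shift updates: estimate minus measurement, scaled by learning rates
    delta_a = delta_a + alpha_a * (s_a - m_a)
    delta_v = delta_v + alpha_v * (s_v - m_v)
    return delta_a, delta_v, post_c1
```

Because the posterior of a common cause falls as the measured discrepancy grows, the update can be non-monotonic in cue reliability, which is the signature behavior this model captures.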

Formalization of the tasks
Unimodal spatial-discrimination task. The unimodal spatial-discrimination task was conducted to estimate the one auditory ($1/\sigma'^2_A$) and three visual ($1/\sigma'^2_{V_i}$) stimulus reliabilities under unimodal presentation conditions, as well as to constrain the estimate of the bias parameter $a_A$ introduced by the remapping process. We begin by describing the auditory version. The standard stimulus was presented straight ahead, at location $s_{A,0}$, and the test stimulus was presented at one of $N_A$ locations, $s_{A,n}$, determined by an adaptive procedure. For each pair, the probability, $p_{A,n}$, of estimating the test stimulus to be located to the right of the standard stimulus is a function of the physical distance between the two stimuli.
We assume that the observer makes the decision by comparing the internal location estimates $\hat{s}'_{A,n}$ and $\hat{s}'_{A,0}$. To further specify $p_{A,n}$, we have to derive the probability distributions of the internal location estimates. Each physical stimulus at location $s_{A,n}$ results in an internal measurement, $m'_{A,n}$. The measurement distribution is Gaussian, $m'_{A,n} \sim \mathcal{N}(s'_{A,n}, \sigma'^2_A)$, and for a given measurement, the estimate of the remapped location of the stimulus is the average of the measurement and the mean of the spatial prior, $\mu'_P$, each weighted by its relative reliability:
$$\hat{s}'_{A,n} = \frac{m'_{A,n}/\sigma'^2_A + \mu'_P/\sigma'^2_P}{1/\sigma'^2_A + 1/\sigma'^2_P}.$$
Thus, the probability distribution of the location estimates of a test stimulus is Gaussian, with mean $\mu_{\hat{s}'_{A,n}}$ and variance $\sigma^2_{\hat{s}'_{A,n}}$ obtained by applying this linear transformation to the measurement distribution. The probability distribution of the difference between the two location estimates, $\hat{s}'_{A,n} - \hat{s}'_{A,0}$, is also Gaussian, with mean $\mu_{\hat{s}'_{A,n}} - \mu_{\hat{s}'_{A,0}}$ and variance $\sigma^2_{\hat{s}'_{A,n}} + \sigma^2_{\hat{s}'_{A,0}}$. Taken together, the probability of perceiving an auditory test stimulus at location $s_{A,n}$ to the right of an auditory standard stimulus at location $s_{A,0}$ is
$$p_{A,n} = 1 - F\big(0;\, \mu_{\hat{s}'_{A,n}} - \mu_{\hat{s}'_{A,0}},\, \sigma^2_{\hat{s}'_{A,n}} + \sigma^2_{\hat{s}'_{A,0}}\big),$$
where $F(x; \mu, \sigma^2)$ is the cumulative Gaussian distribution. However, as experimenters we only have access to response probabilities as a function of the stimulus locations in physical space. Given that the remapped location $s'_A$ is a function of the physical stimulus location $s_A$ (i.e., $s'_A = a_A s_A + b_A$), the expression above can be rewritten in terms of physical stimulus locations, and analogously for the visual sessions. Finally, the model includes occasional response lapses (i.e., random button presses) at rate $\lambda$, so that the probability of reporting the test stimulus as located farther to the right than the standard ($r_{A,n} = 1$) is $p_{r_{A,n}=1} = 0.5\lambda_A + (1-\lambda_A)\, p_{A,n}$, and analogously $p_{r_{V_i,n}=1} = 0.5\lambda_{V_i} + (1-\lambda_{V_i})\, p_{V_i,n}$.

Bimodal spatial-discrimination task. The bimodal spatial-discrimination task was conducted to estimate the relative bias of auditory compared to visual spatial perception, i.e., to estimate $a_A$ and $b_A$. Auditory stimuli were presented at four different locations in physical space, $s_{A,l}$, where $l$ indexes the auditory location.
Guided by a staircase procedure, on each trial $t$, an auditory stimulus at location $s_{A,l}$ was paired with a visual stimulus of high spatial reliability ($i = 1$) at one of $N$ test locations $s_{V,n}$, where $n$ indexes the finer grid of locations at which visual stimuli were presented during the task. For each pair, the model predicts $p_{l,n}$, the probability of judging the visual stimulus at location $s_{V,n}$ as located to the right of the auditory stimulus at location $s_{A,l}$.
Analogous to the unimodal spatial-discrimination task, we specify the probability distributions of the internal auditory and visual location estimates, $\hat{s}'_{A,l}$ and $\hat{s}'_{V_1,n}$, as Gaussian distributions derived from the respective measurement distributions and the supra-modal prior. The probability of the observer perceiving a visual stimulus at physical location $s_{V,n}$ as located to the right of an auditory stimulus at physical location $s_{A,l}$ depends on the difference between the two estimates, whose distribution is Gaussian with mean $\mu_{\hat{s}'_{V_1,n}} - \mu_{\hat{s}'_{A,l}}$ and variance $\sigma^2_{\hat{s}'_{V_1,n}} + \sigma^2_{\hat{s}'_{A,l}}$. The probability of perceiving the visual stimulus to the right of the auditory one can then be expressed as
$$p_{l,n} = 1 - F\big(0;\, \mu_{\hat{s}'_{V_1,n}} - \mu_{\hat{s}'_{A,l}},\, \sigma^2_{\hat{s}'_{V_1,n}} + \sigma^2_{\hat{s}'_{A,l}}\big).$$
As in the unimodal spatial-discrimination task, the model includes occasional lapses at rate $\lambda_{AV}$. Therefore, the probability of reporting a visual stimulus at $s_{V,n}$ as located to the right of an auditory stimulus at location $s_{A,l}$ ($r_{l,n} = 1$) is equal to
$$p_{r_{l,n}=1} = 0.5\lambda_{AV} + (1-\lambda_{AV})\, p_{l,n}. \quad (31)$$

Pointing practice task. The pointing practice task was used to estimate localization response variability, $\sigma^2_r$, due to sources unrelated to the spatial perception of the stimuli. Visual stimuli were presented at eight different locations, $s_{V,o}$, where $o$ indexes the stimulus location ($o \in \{1, 2, \ldots, 8\}$). Localization responses (i.e., confirmed cursor positions) in each trial were modeled as perturbed by Gaussian-distributed noise and centered on the physical stimulus location: $r_{V,o} \sim \mathcal{N}(s_{V,o}, \sigma^2_r)$. By doing so, we assume that (1) the location of the visual cursor in physical space maps directly to its location in perceptual space and (2) the stimulus location estimate is unbiased. This is based on our general assumption of identity remapping for visual stimuli as well as on the high spatial reliability of the visual cursor and the visual stimulus, which should safeguard their estimates against the influence of spatial priors. See Section S2 in S1 Appendix for a model without these assumptions.
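The 2IFC decision probabilities derived above (shrinkage of each noisy measurement toward the prior mean, then a comparison of two independent Gaussian estimates, plus a lapse term) can be sketched for the unimodal case with equal measurement variance for both intervals; all parameter values are illustrative:

```python
import math

def p_test_right(s_test, s_std, var_m, mu_p=0.0, var_p=100.0 ** 2, lapse=0.0):
    """Probability of judging the test stimulus as right of the standard.

    Each location estimate shrinks the measurement toward the prior mean
    mu_p with weight c; the decision compares two independent estimates,
    so their difference is Gaussian with variance 2 * c^2 * var_m.
    """
    c = (1.0 / var_m) / (1.0 / var_m + 1.0 / var_p)  # weight of the measurement
    mean_diff = c * (s_test - s_std)                 # mean of est_test - est_std
    var_diff = 2.0 * c * c * var_m                   # both estimates carry noise
    p = 0.5 * (1.0 + math.erf(mean_diff / math.sqrt(2.0 * var_diff)))
    return 0.5 * lapse + (1.0 - lapse) * p
```

With a near-flat prior (large `var_p`), the shrinkage weight `c` approaches 1 and the function reduces to a standard cumulative-Gaussian psychometric function of the physical separation.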
Unimodal localization task-Pre-recalibration phase. The unimodal localization task was conducted before and after the recalibration phase to measure shifts in auditory and visual localization responses as a consequence of exposure to spatially discrepant audiovisual stimuli during the recalibration phase. Stimuli were presented at four locations for each modality, $s_{A,l}$ and $s_{V,l}$.
As for the pointing practice task, we assumed that the probability distributions of the localization responses in physical space, $r_{A,l}$ and $r_{V_i,l}$, are centered on the location estimates $\hat{s}'_{A,l}$ and $\hat{s}'_{V_i,l}$ in perceptual space and corrupted by additional unbiased noise. As in the spatial-discrimination tasks (and unlike in the pointing practice task, which used a different, maximally reliable stimulus), the stimulus location estimates are assumed to be biased due to the remapping process and the incorporation of the supra-modal spatial prior. It follows that the probability distributions of the localization responses are Gaussian, centered on the means of the location-estimate distributions, with variance equal to the sum of the estimate variance and the response variance $\sigma^2_r$, where the terms are defined in Eqs 24-27.

Unimodal localization task-Post-recalibration phase. In the post-recalibration phase, the remapping from physical to perceptual space had been updated so that the additional shifts $\Delta_{A_i}$ and $\Delta_{V_i}$, accumulated after 120 exposures to discrepant audiovisual stimuli, were incorporated: $m'_{A,l} \sim \mathcal{N}\big(a_A s_{A,l} + b_A + \Delta_{A_i}(121),\, \sigma'^2_A\big)$ and $m'_{V_i,l} \sim \mathcal{N}\big(s_{V,l} + \Delta_{V_i}(121),\, \sigma'^2_{V_i}\big)$. This change in the measurement distributions shifts the centers of the location estimates' probability distributions accordingly. We assumed that the probability distributions of the localization responses are centered on these updated values; that is, we did not implement the updated remapping for the location estimates of the cursor. In sum, localization responses to unimodally presented visual and auditory stimuli have a Gaussian probability distribution that, after the audiovisual recalibration task, additionally depends on the final shift updates $\Delta_{A_i}(121)$ and $\Delta_{V_i}(121)$.

Model fitting
All models were fit using a maximum-likelihood procedure. That is, a set of free parameters $\Theta$ was chosen to maximize the log likelihood of the data given a model $M$. Our model-fitting strategy aimed to reduce the number of free parameters estimated at once. We split the set of free parameters into three subsets $\Theta_i$, $i = 1, 2, 3$, each fit to a subset of the data $X_i$, $i = 1, 2, 3$, and maximized the log likelihood of each subset $X_i$ separately, where $X_1, X_2, X_3 \subseteq X$ and $\Theta_1, \Theta_2, \Theta_3 \subseteq \Theta$. The first dataset, $X_1$, refers to the unimodal spatial-discrimination task, which was used to constrain the parameter subset $\Theta_1$, comprising the unimodal stimulus reliabilities as well as the lapse rates in the different sessions of this task. $X_2$ refers to the pointing practice task used to estimate the parameter subset $\Theta_2$, which comprised only the variability in localization responses due to factors other than spatial perception. The third dataset comprised data from three tasks, $X^1_3, X^2_3, X^3_3 \subseteq X_3$: the bimodal spatial-discrimination task, and the unimodal spatial-localization task for the pre- and post-recalibration phases, respectively. These three datasets constrained overlapping sets of parameters, $\Theta^1_3, \Theta^2_3, \Theta^3_3 \subseteq \Theta_3$; thus, they were fit jointly. $\Theta^1_3$ and $\Theta^2_3$ comprised only localization bias parameters as well as task-specific lapse rates; stimulus reliabilities and response-noise parameter estimates were taken from $\Theta_1$ and $\Theta_2$. Only $\Theta^3_3$, the parameter set constrained by participants' post-recalibration localization responses, included parameters specific to each of the three models of cross-modal recalibration, such as the learning rate and the common-cause prior.
As outlined before, we did not fit the localization responses from the audiovisual recalibration task, i.e., the build-up of the recalibration effect (Section S13 in S1 Appendix), because the shifts Δ_{A_i}(t) and Δ_{V_i}(t) are serially dependent: the size of the shift in trial t depends on the size of the shift in trial t − 1. Given that there is no closed-form solution for the causal-inference model, we would have needed Monte Carlo simulations to approximate the probability distribution of the location estimates. Yet the location estimates depend on the serially dependent shifts, and consequently the number of necessary samples would have grown exponentially from trial to trial. Thus, it was computationally challenging to estimate the likelihood of the parameters and the model given the data from the audiovisual recalibration task. Instead, we used Monte Carlo simulations to approximate the probability distribution of the shift updates Δ_{A_i}(121) and Δ_{V_i}(121) accumulated by the end of the audiovisual recalibration task, i.e., we fit the final recalibration effect rather than its build-up.
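The simulation of the final shift updates can be sketched as follows. This is a minimal illustration only: the update rule (shift a fraction of the measured discrepancy on each trial), the learning rate, the noise levels, and the discrepancy value are all hypothetical stand-ins, not the paper's fitted quantities.

```python
import numpy as np

def simulate_final_shift(n_trials=120, n_sims=1000, alpha=0.02,
                         disparity=15.0, sigma_a=6.0, sigma_v=2.0, seed=0):
    """Monte Carlo sketch of the accumulated auditory shift Delta_A(121).

    Hypothetical rule: on each trial the auditory map shifts toward the
    visual measurement by a fraction alpha of the measured discrepancy.
    Because each trial's shift builds on the previous one, the shifts are
    serially dependent, which is why only the final value is simulated.
    """
    rng = np.random.default_rng(seed)
    delta_a = np.zeros(n_sims)
    for _ in range(n_trials):
        # noisy auditory and visual measurements of the discrepant stimuli
        m_a = delta_a + rng.normal(0.0, sigma_a, n_sims)
        m_v = disparity + rng.normal(0.0, sigma_v, n_sims)
        # serially dependent update: trial t depends on trial t - 1
        delta_a += alpha * (m_v - m_a)
    return delta_a  # sample of accumulated shifts after 120 exposures

samples = simulate_final_shift()
```

Running many independent replicates of the 120-trial sequence yields a sample of final shifts whose empirical distribution can then be approximated, as described below for each model.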
Model log-likelihood-Unimodal spatial-discrimination task. In the unimodal spatial-discrimination task (auditory session), participants indicated whether the test stimulus was located to the left, r_{A,n(t)} = 0, or to the right, r_{A,n(t)} = 1, of the standard stimulus. For each such trial, the likelihood of the model parameters given the response r_{A,n(t)} is the Bernoulli likelihood

P(r_{A,n(t)} | M, Θ₁) = p_{r=1}^{r_{A,n(t)}} (1 − p_{r=1})^{1 − r_{A,n(t)}}.

The log likelihood is summed across all four sessions. The set of free parameters constrained by the binary responses in this task, Θ₁, comprises the unimodal stimulus reliabilities and the session-specific lapse rates; T₂ and the sum do not include outlier trials.
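A per-trial Bernoulli log likelihood of this form can be written compactly. The cumulative-Gaussian psychometric function and the even split of lapses below are illustrative assumptions, not the paper's Eqs; only the Bernoulli bookkeeping mirrors the text.

```python
import numpy as np
from scipy.stats import norm

def bernoulli_loglik(responses, test_locs, standard_loc, sigma, lapse):
    """Log likelihood of binary left/right responses (a sketch).

    p("right") is modeled as a cumulative Gaussian of the test-standard
    separation, mixed with a lapse rate; all parameter names here are
    illustrative stand-ins for the paper's psychometric model.
    """
    p_right = norm.cdf((np.asarray(test_locs) - standard_loc) / sigma)
    p_right = lapse / 2 + (1 - lapse) * p_right  # lapses split evenly
    r = np.asarray(responses, dtype=float)
    # sum of r*log(p) + (1-r)*log(1-p) over trials
    return np.sum(r * np.log(p_right) + (1 - r) * np.log(1 - p_right))

ll = bernoulli_loglik([1, 0, 1], [2.0, -2.0, 1.0], 0.0, 1.0, 0.02)
```

Outlier trials would simply be excluded from the arrays before the sum, matching the exclusion noted in the text.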
Model log-likelihood-Bimodal spatial-discrimination task. In the bimodal spatial-discrimination task, on each trial t participants indicated whether the visual test stimulus at location s_{V_1,n(t)} was located to the left, r_{l(t),n(t)} = 0, or to the right, r_{l(t),n(t)} = 1, of the auditory standard stimulus presented at s_{A,l(t)}. For each such trial, the likelihood of the model parameters given the response r_{l(t),n(t)} is

P(r_{l(t),n(t)} | M, Θ₃^(1)) = p_{r_{l(t),n(t)}=1}^{r_{l(t),n(t)}} (1 − p_{r_{l(t),n(t)}=1})^{1 − r_{l(t),n(t)}}, (41)

where p_{r_{l,n}=1} is defined in Eq 31. Thus, the log likelihood given the responses across all T₃^(1) trials is

Σ_t [r_{l(t),n(t)} log p_{r_{l(t),n(t)}=1} + (1 − r_{l(t),n(t)}) log(1 − p_{r_{l(t),n(t)}=1})]. (42)

p_{r_{l,n}=1} is a function of p_{l(t),n(t)}, which in turn depends on the bias parameters a_A and b_A, the parameters of the supra-modal prior over locations, μ′_P and σ′_P, as well as the measurement variances σ′_A and σ′_{V_i}. Fitting both the bias parameters and the supra-modal prior at once was impossible, as they effectively traded off. Thus, we implemented a non-informative supra-modal prior over stimulus locations by setting σ′_P to 100 and μ′_P to 0. σ′_A, σ′_{V_1}, σ′_{V_2}, and σ′_{V_3} were estimated based on the forced-choice responses from the unimodal spatial-discrimination task. The final set of free parameters constrained by the binary responses in this task was Θ₃^(1) = {a_A, b_A, λ_VA}. The bias parameters a_A and b_A were jointly estimated using the data from this task as well as the pre- and post-recalibration responses from the unimodal localization task.

Model log-likelihood-Unimodal localization task-Pre-recalibration phase. In this task, each localization results in cursor location settings r_{A,i,j,l(t)} and r_{V_i,j,l(t)} on trial t of session (i, j), where i indicates the visual-reliability condition and j the recalibration direction in the subsequent recalibration phase.
The localization responses from this task were modeled as Gaussian-distributed. From these distributions, we can compute the likelihood of a model M and the parameter set Θ₃^(2) as the Gaussian probability density function in Eq 33 evaluated at the observed localization responses r_{A,i,j,l(t)} and r_{V_i,j,l(t)}. The log likelihood is the sum of the log likelihoods across the trials of all six sessions. It depends on μ_{ŝ_{A,i,j,l(t_A)}} and μ_{ŝ_{V_i,j,l(t_{V_i})}}, which in turn depend on the bias parameters a_A and b_A, the parameters of the supra-modal prior, μ′_P and σ′_P, as well as the measurement variances σ′_A and σ′_{V_i} (see Eq 34), and the response noise σ_r. We chose a flat prior over stimulus locations; the (scaled) measurement variances (√2 σ′_A / a_A and σ′_{V_i}) were estimated based on the unimodal spatial-discrimination task, and σ_r was estimated based on the pointing practice task. Consequently, the actual set of parameters constrained by the localization responses from the pre-recalibration task was Θ₃^(2) = {a_A, b_A}. Here, the values of the T variables and the sums do not include outlier trials.
Model log-likelihood-Unimodal localization task-Post-recalibration phase. Localization responses in the post-recalibration phase additionally depend on the visual and auditory shift updates accumulated over the 120 trials of the recalibration phase, Δ_{A_i}(121) and Δ_{V_i}(121) (Eq 34). Since these accumulated shift updates are not accessible to the experimenter, we marginalized over them to calculate the log likelihood. For each of the six experimental sessions (i, j), the log likelihood of a model M and its parameter set Θ₃^(3) is the integral over Δ_{A_i} and Δ_{V_i} of the likelihood of the observed data given the final shift updates, weighted by the probability of those shift updates:

log P(X₃^(3)_{i,j} | M, Θ₃^(3)) = log ∫∫ P(X₃^(3)_{i,j} | Δ_{A_i,j}, Δ_{V_i,j}, M, Θ₃^(3)) P(Δ_{A_i,j}, Δ_{V_i,j} | M, Θ₃^(3)) dΔ_{A_i,j} dΔ_{V_i,j}.

We describe in the following sections how the joint probability P(Δ_{A_i,j}, Δ_{V_i,j} | M, Θ₃^(3)) and the log likelihood log P(X₃^(3) | M, Θ₃^(3)) were derived for each of the three models of cross-modal recalibration.
Reliability-based model of cross-modal recalibration. In this model, auditory and visual shift updates have a constant ratio Δ_{A_i}(t)/Δ_{V_i}(t) = −σ′²_{AV,A}/σ′²_{AV,V_i}, the (negative) ratio of the measurement noise variances. Therefore, Δ_{A_i}(t) can be rewritten as (−σ′²_{AV,A}/σ′²_{AV,V_i}) Δ_{V_i}(t), and we can express the likelihood given a single auditory localization response r_{A_i,j,l(t)} as a function of Δ_{V_i,j} alone (Eq 47). The visual response likelihoods and means are defined analogously (Eq 34). Thus, the joint likelihood P(X₃^(3)_{i,j} | Δ_{A_i,j}, Δ_{V_i,j}, M_RB, Θ₃^(3)_RB) can be written as P(X₃^(3)_{i,j} | Δ_{V_i,j}, M_RB, Θ₃^(3)_RB). Given that the likelihood depends only on Δ_{V_i,j}, we only need to integrate over Δ_{V_i,j}, and the log likelihood simplifies accordingly (Eq 48). The shift updates Δ_{V_i,j} are stochastic because the visual and auditory measurements in each trial of the audiovisual recalibration task are stochastic. We cannot derive their probability distribution P(Δ_{V_i,j} | M_RB, Θ₃^(3)_RB) in closed form. Instead, we used Monte Carlo simulation to approximate this probability distribution. Given the reliability-based model, for each candidate set of parameters Θ₃^(3)_RB, visual-reliability condition i, and recalibration direction j, we simulated 120 recalibration trials analogous to the audiovisual recalibration task. We repeated this simulation 1,000 times, resulting in a sample of 1,000 shift updates Δ_{V_i}, and checked whether the distribution of the 1,000 samples was well fit by a Gaussian with mean and standard deviation equal to the corresponding empirical parameters of the sampled distribution. To do so, we binned the simulated shift updates into 100 bins of equal size and computed the correlation between the observed and predicted number of samples per bin. The resulting value of R² was greater than 0.925 in all cases (Section S8 in S1 Appendix). The approximated probability distribution of the shift updates is denoted P̂(Δ_{V_i,j} | M_RB, Θ₃^(3)_RB).
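The Gaussian-approximation check (100 equal-width bins, correlation between observed and Gaussian-predicted bin counts) can be sketched as below. The synthetic samples in the usage line stand in for simulated shift updates; the bin count matches the text, everything else is illustrative.

```python
import numpy as np
from scipy.stats import norm

def gaussian_fit_r2(samples, n_bins=100):
    """R^2 between binned sample counts and Gaussian-predicted counts.

    The Gaussian uses the sample mean and SD, mirroring the check in the
    text; R^2 near 1 indicates the samples are well fit by a Gaussian.
    """
    counts, edges = np.histogram(samples, bins=n_bins)
    mu, sd = samples.mean(), samples.std()
    # predicted count per bin = N * Gaussian probability mass in that bin
    predicted = len(samples) * np.diff(norm.cdf(edges, mu, sd))
    r = np.corrcoef(counts, predicted)[0, 1]
    return r ** 2

rng = np.random.default_rng(1)
r2 = gaussian_fit_r2(rng.normal(3.0, 1.5, 1000))
```

With genuinely Gaussian samples the value sits close to 1; a low value would flag that the Monte Carlo distribution needs a richer approximation.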
We approximated the integral in Eq 48 by numerical integration over a region discretized into 100 bins. To ensure that we included enough of the tails of the probability distribution of the shift updates, we set the integration region to be three times larger than the range of the Δ_{V_i} samples and centered it on that range. Thus, the lower bound is lb = Δ_min − (Δ_max − Δ_min) and the upper bound is ub = Δ_max + (Δ_max − Δ_min). The numerical integration region was derived separately for each session. The resulting log likelihood depends on the bias parameters a_A and b_A, as well as on Δ_{V_i,j}, which depends on the measurement variances σ′_{AV,A} and σ′_{AV,V_i} given bimodal presentation and on the common learning rate α. Note that σ′_{AV,A} and σ′_{AV,V_i} are not directly constrained by data from bimodal trials (because these trials were not included in the model fitting) but are estimated based on their influence on the shift updates. Specifically, σ′_{AV,A} and σ′_{AV,V_i} affect the spread of the measurements and, as a consequence, the width of the predicted probability distribution of the shift updates, which in turn affects the log likelihood of the model. The set of free parameters for this model is Θ₃^(3)_RB = {a_A, b_A, σ′_{AV,A}, σ′_{AV,V_1}, σ′_{AV,V_2}, σ′_{AV,V_3}, α}. σ′_{AV,V_i} was constrained to be a non-decreasing function of visual-reliability condition i, and σ′_{AV,A} and σ′_{AV,V_i} were constrained to be no greater than five times the average values of σ′_A and σ′_{V_i} across participants (Section S14 in S1 Appendix). The values of the T variables and the sums do not include outlier trials.
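The one-dimensional marginalization with the tripled integration region can be sketched as follows. Here `loglik_given_delta` is a hypothetical callable standing in for a session's data likelihood, and P(Δ) is approximated by a Gaussian fit to the simulated samples, as in the text; the grid resolution matches the 100 bins described above.

```python
import numpy as np

def marginal_loglik(loglik_given_delta, delta_samples, n_bins=100):
    """Numerically marginalize a data likelihood over the shift update.

    Bounds follow the text: lb = dmin - (dmax - dmin) and
    ub = dmax + (dmax - dmin), i.e., a region three times the sample
    range, centered on it. A sketch, not the paper's exact routine.
    """
    d_min, d_max = delta_samples.min(), delta_samples.max()
    span = d_max - d_min
    lb, ub = d_min - span, d_max + span
    grid = np.linspace(lb, ub, n_bins)
    width = grid[1] - grid[0]
    # Gaussian approximation to P(Delta) from the sample moments
    mu, sd = delta_samples.mean(), delta_samples.std()
    p_delta = np.exp(-0.5 * ((grid - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    lik = np.array([np.exp(loglik_given_delta(d)) for d in grid])
    return np.log(np.sum(lik * p_delta) * width)

rng = np.random.default_rng(0)
delta_samples = rng.normal(4.0, 1.0, 1000)
# with a flat data likelihood, the marginal just integrates P(Delta) to ~1
ll = marginal_loglik(lambda d: 0.0, delta_samples)
```

The flat-likelihood usage line is a sanity check that the tripled region captures essentially all of the Gaussian's probability mass.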
Log-likelihood-fixed-ratio model of cross-modal recalibration. In this model, auditory and visual shift updates have a fixed ratio Δ_{A_i}(t)/Δ_{V_i}(t) = −α_A/α_V (Section S15 in S1 Appendix). Thus, we can express the likelihood for the fixed-ratio model and parameter set Θ₃^(3)_FR given an auditory localization response r_{A_i,j,l(t)} in a form similar to Eq 47. The approximation P̂(Δ_{V_i,j} | M_FR, Θ₃^(3)_FR) was generated in the same way as for the reliability-based model. The set of free parameters for this model is Θ₃^(3)_FR = {a_A, b_A, σ′_{AV,A}, σ′_{AV,V_1}, σ′_{AV,V_2}, σ′_{AV,V_3}, α_A, α_V}. Note that even though the shift updates in the fixed-ratio model do not depend on the stimulus reliabilities, the log likelihood does, due to the influence of stimulus reliability on the estimates in the localization task (Eq 51) and to the influence of the spread of the simulated measurements on the spread of the estimated distribution P̂(Δ_{V_i,j} | M_FR, Θ₃^(3)_FR). As in the reliability-based model, σ′_{AV,V_i} was constrained to be a non-decreasing function of visual-reliability condition i, and σ′_{AV,A} and σ′_{AV,V_i} were constrained to be no greater than five times the average values of σ′_A and σ′_{V_i} across participants.

Log-likelihood-causal-inference model of cross-modal recalibration. For this model, the joint likelihood P(X₃^(3)_{i,j} | Δ_{A_i,j}, Δ_{V_i,j}, M_CI, Θ₃^(3)_CI) is truly two-dimensional. Thus, we approximated the joint probability of the auditory and visual shift updates, P(Δ_{A_i,j}, Δ_{V_i,j} | M_CI, Θ₃^(3)_CI), by drawing 1,000 samples of shift-update pairs and comparing the set of sample pairs to a two-dimensional Gaussian with the sample mean and covariance as parameters. We again tested whether the two-dimensional Gaussian distribution provided a good fit to the simulated density (defined as R² > 0.925).
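The two-dimensional Gaussian check with a kernel-density fallback might look like this. The bin counts and the R² computation are illustrative choices, and `scipy.stats.gaussian_kde` plays the role of the Gaussian kernel smoother cited in the text (note its calling convention differs from the frozen Gaussian's `pdf`).

```python
import numpy as np
from scipy.stats import multivariate_normal, gaussian_kde

def approx_joint_density(samples_2d, r2_threshold=0.925, n_bins=20):
    """Approximate the joint density of (Delta_A, Delta_V) sample pairs.

    Fits a bivariate Gaussian with the sample mean and covariance; if the
    R^2 between binned sample counts and Gaussian-predicted counts falls
    below the threshold, falls back to a Gaussian KDE with automatic
    bandwidth. Bin counts here are illustrative, not the paper's values.
    """
    mu = samples_2d.mean(axis=0)
    cov = np.cov(samples_2d.T)
    counts, xe, ye = np.histogram2d(samples_2d[:, 0], samples_2d[:, 1],
                                    bins=n_bins)
    xc, yc = (xe[:-1] + xe[1:]) / 2, (ye[:-1] + ye[1:]) / 2
    xx, yy = np.meshgrid(xc, yc, indexing="ij")
    cell = (xe[1] - xe[0]) * (ye[1] - ye[0])
    # predicted count per cell = N * cell area * Gaussian density
    pred = len(samples_2d) * cell * multivariate_normal(mu, cov).pdf(
        np.dstack([xx, yy]))
    r2 = np.corrcoef(counts.ravel(), pred.ravel())[0, 1] ** 2
    if r2 >= r2_threshold:
        return multivariate_normal(mu, cov).pdf, r2
    return gaussian_kde(samples_2d.T), r2

rng = np.random.default_rng(2)
pairs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], 2000)
density, r2 = approx_joint_density(pairs)
```

The threshold mirrors the R² > 0.925 criterion in the text; in practice the Gaussian branch is taken almost always, matching the paper's observation.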
If the Gaussian fit was poor, we used a kernel density estimate (a Gaussian kernel smoother with σ chosen automatically) of the distribution based on the 2-d density of the samples [77,78]. Overall, the simulated auditory and visual shift updates were very well fit by a bivariate Gaussian, and we rarely used a kernel density estimate (Section S8 in S1 Appendix). We additionally used simulations to verify that our estimates of the partial model log likelihood, log P(X₃^(3) | M_CI, Θ₃^(3)_CI), had reasonably small bias (Section S8 in S1 Appendix). For the causal-inference model, we approximated the log likelihood by numerical integration over a two-dimensional region of (Δ_A, Δ_V) space discretized into 100 × 100 bins. The upper and lower bounds were determined for both dimensions in the same way as before. The set of free parameters for this model is Θ₃^(3)_CI = {a_A, b_A, σ′_{AV,A}, σ′_{AV,V_1}, σ′_{AV,V_2}, σ′_{AV,V_3}, α, p_{C=1}}.

Parameter estimation. For each model, we approximated the sets of parameters Θ₁ and Θ₂ that maximized the likelihood using the MATLAB function fmincon and Python's scipy.optimize [79], and approximated Θ₃ using the BADS toolbox [80]. To deal with the possibility that the returned parameter values might correspond to a local minimum, we ran BADS multiple times with different starting points, randomly chosen from a D-dimensional grid, where D is the number of free parameters in Θ₃ (see Table 1 for a summary of the free parameters for each model), with three evenly spaced values chosen for each dimension. The final parameter estimates were those with the maximum likelihood across all runs of the fitting procedure.
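A multi-start strategy of this kind can be sketched with SciPy; `scipy.optimize.minimize` is a stand-in for the MATLAB-only BADS toolbox, and the deterministic full grid of start points (rather than random draws from it) is a simplification.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def multistart_fit(neg_loglik, bounds, n_per_dim=3):
    """Minimize a negative log likelihood from a grid of start points.

    Uses n_per_dim evenly spaced interior values per free parameter,
    mirroring the three-values-per-dimension grid in the text, and keeps
    the run with the lowest negative log likelihood.
    """
    grids = [np.linspace(lo, hi, n_per_dim + 2)[1:-1] for lo, hi in bounds]
    best = None
    for start in itertools.product(*grids):
        res = minimize(neg_loglik, np.array(start), bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best

# toy negative log likelihood with its minimum at (1, -2)
nll = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2
fit = multistart_fit(nll, [(-5.0, 5.0), (-5.0, 5.0)])
```

Keeping the best result across runs guards against local minima, exactly the failure mode the repeated BADS runs are meant to address.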
Model comparison. To compare model performance quantitatively, we computed the Akaike information criterion (AIC) for all four models [64] and calculated relative model-comparison scores, Δ_AIC, which relate the AIC value of the best-fit model to that of each of the other models (a higher Δ_AIC value indicates stronger evidence for the best-fit model). Models with 0 < Δ_AIC < 2 are weakly supported; models with 4 < Δ_AIC < 7 have considerably less support; models with Δ_AIC > 10 have essentially no support [81].
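The AIC comparison reduces to a few lines; the log-likelihood values and parameter counts in the usage line are made-up illustrations, not the paper's fits.

```python
import numpy as np

def aic(log_lik, n_params):
    """Akaike information criterion: AIC = 2k - 2 log L."""
    return 2 * n_params - 2 * log_lik

def delta_aic(log_liks, n_params):
    """Delta_AIC: each model's AIC minus the best (lowest) AIC.

    The best-fit model scores 0; larger values mean less support.
    """
    scores = np.array([aic(ll, k) for ll, k in zip(log_liks, n_params)])
    return scores - scores.min()

# illustrative values: three models with different fits and complexities
d = delta_aic([-100.0, -103.0, -120.0], [7, 8, 8])
```

Under the thresholds quoted above, the second illustrative model (Δ_AIC = 8) would have little support and the third (Δ_AIC = 42) essentially none.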
Supporting information

S1 Appendix. Supplemental control experiments, analyses, and figures. (PDF)