Entrained audiovisual speech integration implemented by two independent computational mechanisms: Redundancy in left posterior superior temporal gyrus and Synergy in left motor cortex

Information integration is fundamental to many aspects of human behavior, and yet its neural mechanism remains to be understood. For example, during face-to-face communication we know that the brain integrates the auditory and visual inputs but we do not yet understand where and how such integration mechanisms support speech comprehension. Here we show that two independent mechanisms forge audiovisual representations for speech comprehension in different brain regions. With a novel information theoretic measure, we found that theta (3-7 Hz) oscillations in the posterior superior temporal gyrus/sulcus (pSTG/S) code speech information that is common (i.e. redundant) to the auditory and visual inputs whereas the same oscillations in left motor and inferior temporal cortex code synergistic information between the same inputs. Importantly, redundant coding in the left pSTG/S and synergistic coding in the left motor cortex predict behavior - i.e. speech comprehension performance. Our findings therefore demonstrate that processes classically described as integration effectively reflect independent mechanisms that occur in different brain regions to support audiovisual speech comprehension.


Introduction
While engaged in a conversation, we effortlessly integrate auditory and visual speech information into a unified perception. Such integration of multisensory information is a key aspect of audiovisual speech processing that has been extensively studied [1][2][3][4]. Studies of multisensory integration have demonstrated that, in face-to-face conversation, especially in adverse conditions, observing lip movements of the speaker can improve speech comprehension [4][5][6]. In fact, lip movements during speech typically contain sufficient information to understand speech without corresponding auditory information [1,7].
Turning to the brain, we know that specific brain regions are involved in audiovisual integration. In particular, the superior temporal gyrus/sulcus (STG/S) responds to the conjunction of auditory and visual stimuli, and its disruption leads to reduced McGurk fusion [8][9][10][11][12]. However, these classic studies present two shortcomings. First, their experimental designs typically contrasted two conditions: unisensory (i.e. audio or visual cues) and multisensory (congruent or incongruent audio and visual cues). Such a contrast does not dissociate effects of integration per se from those arising from differences in stimulation complexity (i.e. one or two sources), e.g. possible modulations of attention, cognitive load and even arousal. A second shortcoming is that previous studies typically investigated (changes of) regional activation and not information integration between audiovisual stimuli and brain signals. Here, we address these two shortcomings and study the specific mechanisms of audiovisual integration from brain oscillations. We used a novel methodology (speech-brain entrainment) and novel information theoretic measures (the Partial Information Decomposition) to quantify the interactions between audiovisual stimuli and dynamic brain signals.
Our methodology of speech-brain entrainment builds on recent studies suggesting that rhythmic components in brain activity that are temporally aligned to salient features in speech, most notably the syllable rate [5,6,13,14,15], facilitate processing of both the auditory and visual speech inputs. The main advantage of speech-brain entrainment is that it replaces unspecific measures of activation with measures that directly quantify the coupling between the components of continuous speech (e.g. syllable rate) and frequency-specific brain activity, thereby tapping more directly into the brain mechanisms of speech segmentation and coding [14].
Our information theoretic measure of Partial Information Decomposition (PID; see Fig 1A and Methods for details) [16][17][18] addresses a perennial question in multisensory processing: the extent to which each sensory modality contributes uniquely to sensory representation in the brain versus how different modalities (e.g. audio and visual) jointly contribute via a form of interaction. There are two main forms of interaction to consider. In redundant coding, the brain signal reflects common coding of auditory and visual speech. In synergistic coding, the brain signal reflects a new representation of auditory and visual speech that is over and above representations based on each modality alone (i.e. a cross-modal enhancement of speech representation). Interestingly, both redundant and synergistic coding can be interpreted as reflecting different mechanisms of audiovisual information integration. The PID framework therefore provides a principled approach to investigate different cross-modal integration mechanisms (redundant and synergistic) in the human brain during naturalistic audiovisual speech processing. That is, it allows us to understand how neural representations of dynamic auditory and visual speech signals interact in the brain to form a unified perception.

Fig 1E. Unique information of visual speech and auditory speech was compared. Stronger unique information for auditory speech was found in bilateral auditory, temporal, and inferior frontal areas, and stronger unique information for visual speech was found in bilateral visual cortex (P < 0.05, false discovery rate (FDR) corrected).

Specifically, we recorded brain activity using MEG while participants attended to continuous audiovisual speech to entrain brain activity. We applied the PID to reveal where and how speech-entrained brain activity in different regions reflects different types of auditory and visual speech integration. In the first experimental condition, we used naturalistic audiovisual speech where attention to visual speech was not critical ('All congruent' condition). In the second condition, we added a second, interfering auditory stimulus that was incongruent with the congruent audiovisual stimuli ('AV congruent' condition), requiring attention to visual speech to suppress the competing incongruent auditory input. In the third condition, neither auditory stimulus was congruent with the visual stimulus ('All incongruent'). This allows us to see how the congruence of audiovisual stimuli changes integration. We contrasted redundancy and synergy coding between the conditions to reveal differential effects of attention and congruence on multisensory integration mechanisms and behavioral performance.

Results
We first studied partial information decomposition (PID) in an 'All congruent' condition (binaural presentation of speech with matching video) to understand multisensory representational interactions in the brain during processing of natural audiovisual speech. We used mutual information (MI) to quantify the overall dependence between the full multisensory dynamic stimulus time-course (broadband speech amplitude envelope and lip area for auditory and visual modalities respectively) and the recorded brain activity. This multisensory audiovisual MI (total mutual information I(MEG; A, V), Fig 1B) includes unique unimodal as well as redundant and synergistic multisensory effects, which we can separate with the PID. The unique components are shown in Figs 1C-E. The total mutual information map (Fig 1B) shows multimodal stimulus entrainment in bilateral auditory/temporal areas and to a lesser extent in visual cortex. Auditory unique information I(MEG; A) is present in bilateral auditory areas (Fig 1C), where it accounts for a large proportion of the total mutual information I(MEG; A, V). Visual unique information I(MEG; V) is present in both visual and auditory areas (Fig 1D), but overall visual entrainment is weaker than auditory entrainment (see S1 Fig for more details). Comparing the auditory unique information to visual unique information across subjects revealed stronger visual entrainment in bilateral visual cortex and stronger auditory entrainment in bilateral auditory, temporal, and inferior frontal areas (paired two-sided t-test, df: 43, P < 0.05, false discovery rate (FDR) corrected; Fig 1E).
To identify a frequency band where auditory and visual speech signals show significant dependencies, we computed MI between both signals and compared it to MI between non-matching auditory and visual speech signals for frequencies from 0 to 20 Hz. As expected, only matching audiovisual speech signals show significant MI, peaking at 5 Hz (Fig 2A), consistent with previous results based on coherence (see Fig 2C in [5]). We therefore focus our further analysis on the 3-7 Hz frequency band, which is known to correspond to the syllable rate in continuous speech [13] and within which the amplitude envelope of speech is known to reliably entrain auditory brain activity [15,20].

Fig 2. Mutual information between auditory and visual speech signals.
A. To investigate Partial Information Decomposition (PID) in the 'AV congruent' condition, mutual information between auditory speech and visual speech signals was first computed separately for matching and non-matching signals. Mutual information for matching auditory-visual speech signals shows a clear peak around 5 Hz (red line), whereas mutual information for non-matching signals is flat (blue line).
B. Analysis of PID is shown for the 'AV congruent' condition, in which both matching and non-matching auditory-visual speech signals are present for the same brain response (MEG data).
Two external speech signals (auditory speech envelope and lip movement signal) and brain signals were used in the PID computation. Each signal was band-pass filtered and then Hilbert transformed.

Redundancy in left pSTG/S and synergy in left motor cortex
Next, we investigated how multimodal representational interactions are modulated by attention and congruence in continuous audiovisual speech. Here we focus on an 'AV congruent' condition where a congruent audiovisual stimulus pair is presented monaurally, together with an interfering non-matching auditory speech stimulus to the other ear (Fig 2B). This condition is of particular interest because visual speech (lip movement) is used to disambiguate the two competing auditory speech signals. Furthermore, it is ideally suited for our analysis because we can directly contrast representational interactions quantified with the PID in matching and non-matching audiovisual speech signals in the same data set (see Fig 2B).

Fig 3. Redundant and synergistic information of matching audiovisual speech signals in the brain compared to non-matching signals. Each map (matching or non-matching, for each type of information) was first subjected to a regression analysis against speech comprehension, then transformed to a standard Z map, and the two maps were subtracted.

Fig 4. The comparison between matching and non-matching audiovisual speech signals in the 'AV congruent' condition entails both attention and congruence effects. To separate these effects, we first analyzed the contrast for congruence ('AV congruent' > 'All incongruent').
A. Redundancy for congruence effect is observed in left inferior frontal region and posterior superior temporal gyrus/sulcus (pSTG/S) and right posterior middle temporal cortex (Z-difference map at P < 0.005).
B. Synergistic information for congruence effect is found in superior part of motor/sensory cortices in left hemisphere (Z-difference map at P < 0.005).
The attention effect ('AV congruent' > 'All congruent') shows higher redundant information in left auditory and temporal areas (superior and middle temporal cortices and posterior superior temporal gyrus/sulcus) and in right inferior frontal and superior temporal cortex (Fig 5A; Z-difference map at P < 0.005). Higher synergistic information was localized in left motor cortex, inferior temporal cortex, and parieto-occipital areas (Fig 5B; Z-difference map at P < 0.005). In summary, theta-band activity in left posterior superior temporal gyrus/sulcus (pSTG/S) represents redundant information about audiovisual speech significantly more strongly in experimental conditions with higher attention and congruence. In contrast, synergistic information in the left motor cortex is more prominent in conditions requiring increased attention. Therefore, the increased relevance of visual speech in the 'AV congruent' condition leads to a differential increase of different integration mechanisms in different brain areas, namely increased redundancy in left pSTG/S and increased synergy in left motor cortex. For detailed local maps of interaction between predictors (auditory and visual speech signals) and target (MEG response)

Performance scales with redundancy in left pSTG/S and synergy in left motor cortex
Next we investigated whether the differential pattern of redundancy and synergy is of behavioral relevance in our most important condition, 'AV congruent', where visual speech is particularly informative. To this end, we extracted raw values of redundancy for the location showing strongest redundancy in the left pSTG/S (Fig 5A)

Discussion
In this study, we investigated how multisensory audiovisual speech rhythms are represented in the brain and how they are integrated for speech comprehension. We propose to study multisensory integration using information theory for the following reasons: First, it is a principled way to quantify the interactions between different stimuli. Second, interactions can be measured directly without resorting to statistical contrasts between conditions (e.g. AV > A+V). Third, information theory affords measures of interaction that cannot be computed in other ways (particularly specifying unique information and synergy) (Fig 1A) [16,19]. The partial information decomposition (PID) allowed us to break down the relationships between the representations of auditory and visual speech inputs in the brain. We found that the left posterior superior temporal region conveys speech information common to both auditory and visual modalities (redundancy), while left motor cortex conveys information that is greater than the linear summation of individual information (synergy). These results are obtained from low-frequency theta rhythm (3-7 Hz) signals, corresponding to the syllable rate in speech components. Importantly, redundancy in pSTG/S and synergy in left motor cortex predict behavioral performance (speech comprehension accuracy) across participants.
A critical hallmark of multisensory integration in general, and audiovisual integration in particular, is the behavioral advantage conveyed by both stimulus modalities as compared to each single modality. Here we have shown that this process relies on at least two different mechanisms in two different brain areas, reflected in different representational interaction profiles revealed with synergy and redundancy.

What do redundancy and synergy mean? Linking to audiovisual integration in fMRI studies
In fMRI studies, audiovisual speech integration has been studied by manipulating experimental conditions (e.g. [11,22]). Changes in BOLD responses elicited by congruent audiovisual stimuli (AV) were compared to auditory-only (AO) or visual-only (VO) responses, their sum (AO + VO) or their conjunction (AO ∩ VO). Greater activation for the audiovisual condition (AV) compared to the others was interpreted as audiovisual speech integration. Comparison to the auditory-only (AO) or visual-only (VO) condition was regarded as a less conservative criterion for integration, and comparison to their summation (AO + VO) was considered an integration effect with supra-additivity. However, these contrasts are potentially confounded by condition-specific differences in attention, stimulus complexity, individual preference for a stimulus modality (e.g. auditory over visual) and others. Similarly, this approach cannot identify how the brain makes use of the similar or complementary aspects of information in auditory and visual inputs.
Importantly, the PID can quantify the representational interactions between multiple sensory signals and the associated brain response in a single experimental condition where both sensory modalities are simultaneously present. In the PID framework, the unique contribution of a single (e.g. auditory) sensory modality to brain activity is directly quantified instead of relying on the statistical contrast between different conditions. Furthermore, the PID method allows the quantification of both redundant and synergistic interactions. In the context of audiovisual integration, both types of interaction can be seen as integration effects. Redundant information quantifies the overlapping information content of the predictor variables (auditory and visual speech signals), and synergistic information quantifies the additional information gained from simultaneous observation of the two predictor variables compared to observation of each one alone. If we force a comparison between activation and information, we could argue that redundancy is more related to the conjunction of each modality (AV ≈ AO ∩ VO) and synergy more related to supra-additivity (AV > (AO + VO)). However, it is important to keep in mind that activation and information are conceptually and computationally very different.

Left posterior superior temporal region extracts common features from auditory and visual speech rhythms
The posterior superior temporal region (pSTG/S) has been implicated as an audiovisual speech integration area by functional [23][24][25] and anatomical [26] neuroimaging. A typical finding in fMRI studies is that pSTG/S shows stronger activation for audiovisual (AV) compared to auditory-only (AO) and/or visual-only (VO) conditions. This was confirmed by a combined fMRI-TMS study in which the likelihood of McGurk fusion was reduced when TMS was applied to the individually fMRI-localized pSTS, suggesting a critical role of pSTS in auditory-visual integration [12].
The redundant information in the same left superior temporal region in this study matches this notion that this region processes shared information from both modalities. We found this region not only in the congruence effect ('AV congruent' > 'All incongruent'; Fig 4A) but also in the attention effect ('AV congruent' > 'All congruent'; Fig 5A).

Left motor cortex activity reflects synergistic information in audiovisual speech processing
Interestingly, we found that the left motor cortex shows increased synergy for the matching vs non-matching audio stimuli of the 'AV congruent' condition (Fig 3B). However, further analysis optimized for effects of attention and congruence revealed two slightly different precentral areas, with the area that shows the strongest synergy change with attention (Fig 5B) located more lateral and anterior compared to the area identified in the congruence contrast (Fig 4B). The motor region in the attention contrast is consistent with the area in our previous study that showed entrainment to lip movements during continuous speech that correlated with speech comprehension [5]. In another study we identified this area as the source of top-down modulation of activity in the left auditory cortex [20]. The definition of synergistic information in our context refers to more information gained from the simultaneous observation of auditory and visual speech compared to the observation of each alone. When it comes to the attention effect ('AV congruent' > 'All congruent'), the 'AV congruent' condition requires paying more attention to auditory and visual speech than the 'All congruent' condition does, even though the speech signals to be attended match the visual stimulus in both conditions. Thus, this synergy effect in the left motor cortex can be explained by a net attention effect at the same level of stimulus congruence. This effect is likely driven by stronger attention to visual speech, which is informative for the disambiguation of the two competing auditory speech streams [5]. This notion is supported by directional information analysis, which shows that the left motor cortex better predicts upcoming visual speech in the 'AV congruent' condition where attention to visual speech is crucial (S2 Fig B,D).
In summary, we demonstrate how information theoretic tools can provide a new perspective on audiovisual integration by explicitly quantifying both redundant and synergistic cross-modal representational interactions. This reveals two distinct profiles of audiovisual integration that are supported by different brain areas (left motor cortex and left pSTG/S) and are differentially recruited under different listening conditions.

Participants
Data from 44 subjects were analyzed (26 females; age range: 18-30 years; mean age: 20.54 ± 2.58 years). Another analysis of these data was presented in a previous report [5]. All subjects were healthy, right-handed and had normal or corrected-to-normal vision and normal hearing. None of the participants had a history of developmental, psychological, or neurological disorders. They all provided informed written consent before the experiment and received monetary compensation for their participation. The study was approved by the local ethics committee (CSE01321; College of Science and Engineering, University of Glasgow) and conducted in accordance with the ethical guidelines in the Declaration of Helsinki.

Stimuli and Experiment
We used audiovisual video clips of a professional male speaker talking continuously (7-9 minutes) which were used in our previous study [5]. The talks were originally taken from TED talks (www.ted.com/talks/) and edited to be appropriate to the stimuli we used (e.g. editing words referring to visual materials, the gender of the speaker etc.).
High-quality audiovisual video clips were filmed by a professional filming company with a sampling rate of 48 kHz for audio and 25 fps (frames per second) for video at 1920 x 1080 pixels.
Questionnaires for each talk were validated in a separate behavioral study (16 subjects; 13 females; aged 18-23 years; mean age: 19.88 ± 1.71 years). These questionnaires are designed to assess the level of speech comprehension. Each questionnaire consists of 10 questions about a given talk to test general comprehension (e.g., "What is the speaker's job?") and was validated in terms of accuracy (the same level of difficulty), response time, and length (word count).
Experimental conditions used in this study were 'All congruent', 'All incongruent', 'AV congruent'.
In each condition (7-9 min), one video recording was presented and two (matching or non-matching) auditory recordings were presented to the left and the right ear, respectively. The 'All congruent' condition is a natural audiovisual speech condition where the auditory stimuli to both ears and the visual stimulus are congruent. In the 'All incongruent' condition, all three stimuli are from different videos and participants are instructed to attend to the auditory information presented to one ear. In the 'AV congruent' condition, only one of the auditory stimuli matches the visual information, and the speech presented to the other ear serves as a distraction. Participants attend to the talk that matches the visual information. Participants were instructed to fixate on the speaker's lips at all times in all experimental conditions. In the 'All congruent' condition (natural audiovisual speech), they were instructed to ignore the color of the fixation cross and just to attend to both sides naturally.
A fixation cross (either yellow or blue) was overlaid on the speaker's lips to indicate the auditory stimulus to pay attention to (left or right ear, e.g. "If the color of fixation cross is yellow, please attend to left ear speech"). The color was counterbalanced across subjects. For the recombination and editing of audiovisual talks we used Final Cut Pro X (Apple Inc., Cupertino, CA).
Half of the 44 participants attended to speech in the left ear and the other half attended to speech in the right ear. There was no significant difference in comprehension accuracy between groups (two sample t-test, df: 42, P > 0.05). In this study, we pooled across both groups for data analysis.
The stimuli were presented with Psychtoolbox [28] in MATLAB (MathWorks, Natick, MA). Visual stimuli were delivered with a resolution of 1280 x 720 pixels at 25 fps (mp4 format). Auditory stimuli were delivered at 48 kHz sampling rate via a sound pressure transducer through two 5 meter-long plastic tubes terminating in plastic insert earpieces.
A comprehension questionnaire was administered about the attended speech separately for each condition.

Data acquisition
Cortical neuromagnetic signals were recorded using a 248-magnetometer whole-head MEG (magnetoencephalography) system (MAGNES 3600 WH, 4-D Neuroimaging) in a magnetically shielded room. The MEG signals were sampled at 1,017 Hz and were denoised with information from the reference sensors using the denoise_pca function in the FieldTrip toolbox [29].
For statistics and visualization, we used the FieldTrip Toolbox [29] and in-house MATLAB codes.
We followed the suggested guidelines [31] for MEG studies.
MEG-MRI co-registration. Structural MR images of each participant were co-registered to the MEG coordinate system using a semi-automatic procedure. Anatomical landmarks (nasion, bilateral pre-auricular points) were identified before the MEG recording and also manually identified in the individual's MR images. Based on these landmarks, both MEG and MRI coordinate systems were initially aligned. Subsequently, numerical optimization was achieved by using the ICP algorithm [32].
Source localization. A head model was created for each individual from their structural MRI using normalization and segmentation routines in FieldTrip and SPM8. Leadfield computation was performed based on a single shell volume conductor model [33] using an 8-mm grid defined on the template provided by MNI (Montreal Neurological Institute). The template grid was linearly transformed into individual head space for spatial normalization. Cross-spectral density matrices were computed using the Fast Fourier Transform on 1-s segments of data after applying multitapers. Source localization was performed using the DICS beamforming algorithm [34], and beamformer coefficients were computed.
Auditory speech signal processing. The amplitude envelope of auditory speech signals was computed following the approach reported in [35]. We constructed eight frequency bands in the range 100-10,000 Hz to be equidistant on the cochlear map [36]. The auditory speech signals were band-pass filtered in these bands using a fourth-order forward and reverse Butterworth filter. The Hilbert transform was then applied to obtain the amplitude envelope for each band. These envelopes were averaged across bands to yield a wideband amplitude envelope. For further analysis, signals were downsampled to 250 Hz.
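The envelope pipeline described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the code used in the study: the function name `wideband_envelope` is hypothetical, logarithmically spaced band edges stand in for the cochlear-map spacing of [36], "fourth-order" is interpreted as `N=4` in SciPy's `butter`, and polyphase resampling is used for the downsampling step.

```python
import numpy as np
from scipy.signal import butter, hilbert, resample_poly, sosfiltfilt


def wideband_envelope(audio, fs=48_000, n_bands=8, f_lo=100.0, f_hi=10_000.0, fs_out=250):
    """Average band-wise Hilbert envelopes and downsample the result to fs_out."""
    # ASSUMPTION: log-spaced edges approximate "equidistant on the cochlear map".
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # ASSUMPTION: N=4 per band edge (SciPy then builds an 8th-order band-pass).
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)          # forward and reverse filtering (zero phase)
        envelopes.append(np.abs(hilbert(band)))  # amplitude envelope of this band
    wideband = np.mean(envelopes, axis=0)        # average across bands
    # ASSUMPTION: polyphase resampling (with built-in anti-alias filter) for downsampling.
    return resample_poly(wideband, fs_out, fs)
```

Applied to a tone that is amplitude-modulated at a syllable-like rate (e.g. 5 Hz), the returned signal should follow the modulator rather than the carrier, which is the property the entrainment analysis relies on.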

Visual speech signal processing. A lip movement signal was computed using an in-house MATLAB script. We first extracted the lip contour of the speaker for each frame of the movie stimuli. From the lip contour we computed the frame-by-frame lip area (area within the lip contour). This signal was resampled at 250 Hz to match the sampling rate of the preprocessed MEG signal and the auditory envelope signal. We reported the first demonstration of visual speech entrainment using this lip movement signal [5].
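As a rough illustration of the lip area computation, the sketch below assumes that each frame's lip contour is available as a closed polygon of (x, y) vertices and uses the shoelace formula for the enclosed area. The function names and the linear interpolation used for resampling from 25 fps to 250 Hz are assumptions made for this example; the in-house script may differ in both respects.

```python
def polygon_area(contour):
    """Area enclosed by a closed contour given as [(x, y), ...] (shoelace formula)."""
    n = len(contour)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def lip_area_signal(contours_per_frame):
    """One area value per video frame."""
    return [polygon_area(c) for c in contours_per_frame]


def resample_linear(values, fs_in=25.0, fs_out=250.0):
    """ASSUMPTION: linear interpolation between frames (needs at least two frames)."""
    n_out = int(round((len(values) - 1) * fs_out / fs_in)) + 1
    out = []
    for k in range(n_out):
        pos = k * fs_in / fs_out             # position in units of input frames
        i = min(int(pos), len(values) - 2)   # index of the left neighbour
        frac = pos - i
        out.append(values[i] * (1.0 - frac) + values[i + 1] * frac)
    return out
```

For example, `resample_linear(lip_area_signal(frames))` turns a list of per-frame contours into a 250 Hz lip area time series that can be aligned with the MEG and envelope signals.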

Estimating mutual information (MI) and other information theoretic quantities.
Shannon's Information Theory [37]. Information theory was originally developed to study man-made communication systems; however, it also provides a theoretical framework for practical statistical analysis. It has become popular for the analysis of complex systems in a range of fields, and has been successfully applied in neuroscience to spike trains [38,39], LFPs [40,41], EEG [42,43], and MEG time-series data [15,20]. Mutual information is a measure of statistical dependence between two variables, with a meaningful effect size measured in bits (see [30] for a review). MI of 1 bit corresponds to a reduction of uncertainty about one variable by a factor of 2 after observation of another variable. Here we estimate MI and other quantities using Gaussian-Copula Mutual Information (GCMI) [30]. This provides a robust semi-parametric lower bound estimator of mutual information by combining the statistical theory of copulas with the closed-form solution for the entropy of Gaussian variables. Crucially, this method performs well for higher dimensional responses, as required for measuring three-way statistical interactions (see below), and allows estimation over circular variables like phase.
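To make the idea behind GCMI concrete, the following minimal sketch covers only the simplest case of two one-dimensional variables. Each variable is rank-transformed and mapped to standard-normal quantiles (the copula normalization), and the closed-form Gaussian expression MI = -0.5 log2(1 - r^2) is then applied to the correlation r of the transformed data. The function names are hypothetical, ties are assumed to be absent (continuous data), and no bias correction is applied, so this should not be read as the reference implementation of [30].

```python
import math
from statistics import NormalDist


def copnorm(x):
    """Rank-transform x, then map the ranks to standard-normal quantiles."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def gcmi_1d(x, y):
    """Lower-bound MI estimate in bits; undefined if |r| == 1 after normalization."""
    r = pearson(copnorm(x), copnorm(y))
    return -0.5 * math.log2(1.0 - r * r)
```

Because only ranks enter the estimate, it is unchanged by any strictly monotonic rescaling of either variable, which is one reason the approach is robust to the marginal distributions of the signals.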

Mutual information (MI) between auditory and visual speech signals.
Following the GCMI method [30], we normalized the complex spectrum by its amplitude to obtain a 2d representation of the phase as points lying on the unit circle. We then rank-normalized the real and imaginary parts of this normalized spectrum separately, and used the multivariate GCMI estimator to quantify the dependence between these two 2d signals. This gives a lower bound estimate of the MI between the phases of the two signals.
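The phase-based estimate described above can be sketched as follows, assuming the complex (analytic) band-limited signals are already available as NumPy arrays. The function names are hypothetical, the Gaussian MI uses the standard log-determinant formula without the bias correction that a full estimator would typically include, and the code is meant as an illustration of the steps rather than the implementation of [30].

```python
from statistics import NormalDist

import numpy as np

_INV_CDF = np.vectorize(NormalDist().inv_cdf)


def copnorm(x):
    """Rank-normalize each column of x (n_samples, n_dims) to a standard normal."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return _INV_CDF(ranks / (x.shape[0] + 1))


def phase_2d(analytic):
    """Normalize a complex signal by its amplitude: points on the unit circle, shape (n, 2)."""
    unit = analytic / np.abs(analytic)
    return np.column_stack([unit.real, unit.imag])


def gaussian_mi(x, y):
    """MI in bits for jointly Gaussian x (n, dx) and y (n, dy); no bias correction."""
    cov = np.cov(np.hstack([x, y]), rowvar=False)
    dx = x.shape[1]
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy) / np.log(2)


def phase_mi(sig_a, sig_b):
    """Lower-bound MI between the phases of two complex signals."""
    return gaussian_mi(copnorm(phase_2d(sig_a)), copnorm(phase_2d(sig_b)))
```

Because each signal is divided by its own amplitude before rank normalization, the estimate depends only on phase; rescaling either signal leaves it unchanged.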
To determine the frequency of interest for the main analysis (partial information decomposition; PID), we computed MI between auditory (A) and visual (V) speech signals for the matching AV and non-matching AV signals from all the stimuli we used. As shown in Fig 2A, there was no relationship between non-matching auditory and visual stimuli, but there was a frequency-dependent relationship for matching stimuli, peaking in the 3-7 Hz band. This is consistent with previous results using coherence measures [5,35]. This frequency band corresponds to the syllable rate and is known to show robust phase coupling between speech and brain signals.

Partial Information Decomposition (PID) theory.
We seek to study the relationships between the neural representations of auditory (here amplitude envelope) and visual (here dynamic lip area) stimuli during natural speech. Mutual information can quantify entrainment of the MEG signal by either or both of these stimuli, but cannot address the relationship between the two entrained representations, that is, their representational interactions. The existence of significant auditory entrainment revealed with MI demonstrates that an observer who saw a section of auditory stimulus would be able to, on average, make some prediction about the MEG activity recorded after presentation of that stimulus (this is precisely what is quantified by MI, without the need for an explicit model). Visual MI reveals the same for the lip area. However, a natural question is then whether these two stimulus modalities provide the same information about the MEG, or provide different information.
If an observer saw the auditory stimulus, and made a corresponding prediction for the MEG activity, would that prediction be improved by observation of the concurrent visual stimulus, or would all the information about the likely MEG response available in the visual stimulus already be available from the related auditory stimulus? Alternatively, would an observer who saw both modalities together perhaps be able to make a better prediction of the MEG, on average, than would be possible if the modalities were not observed simultaneously? This is conceptually the same question that is addressed with techniques such as Representational Similarity Analysis (RSA) [44] or cross-decoding [45]. RSA determines similar representations by comparing the pairwise similarity structure in responses evoked by a stimulus set, usually consisting of many exemplars with hierarchical categorical structure. If the pattern of pairwise relationships between stimulus-evoked responses is similar between two brain areas, it indicates there is a similarity in how the stimulus ensemble is represented. Cross-decoding works by training a classification or regression algorithm in one experimental condition or time region, and then testing its performance in another experimental condition or time region. If it performs above chance on the test set, this demonstrates that some aspect of the representation in the data that the algorithm learned in the training phase is preserved in the second situation. Both these techniques address the same conceptual issue of representational similarity, which is measured with redundancy in the information theoretic framework, but they have specific experimental design constraints and are usually used to compare different neural responses (recorded from different regions, time periods or with different experimental modalities).
The information theoretic approach is more flexible, and can be applied both to simple binary experimental conditions, as well as continuous valued dynamic features extracted from complex naturalistic stimuli such as those we consider here. Further, it allows us to study representational interactions between stimulus features (not only neural responses), and provides the ability to quantify synergistic as well as redundant interactions.
We can address this question with information theory through a quantity called 'Interaction Information' [16,46], which is defined as follows:

I(MEG; A; V) = I(MEG; [A, V]) - I(MEG; A) - I(MEG; V)

Interaction information is the difference between synergy and redundancy [17] and therefore measures a net effect. It is possible to have zero interaction information even in the presence of strong redundant and synergistic interactions (for example over different ranges of the stimulus space) that cancel out in the net value. The methodological problem of fully separating redundancy and synergy has recently been addressed with the development of a framework called the Partial Information Decomposition (PID) [17][18][19][47]. This provides a mathematical framework to obtain a decomposition of mutual information into unique, redundant and synergistic components. The PID requires a measure of information redundancy. Here we use a recently proposed measure of redundancy based on pointwise common change in surprisal, Iccs [16].
This approach starts from interaction information, but breaks down the conflated redundant and synergistic effects by only counting pointwise terms in the interaction information calculation that unambiguously correspond to a redundant interaction. This is the only redundancy measure which corresponds to an intuitive notion of overlapping information content and is defined for more than 2 variables and for continuous systems. We use it here in a continuous Gaussian formulation together with the rank-normalization approach of GCMI.
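As an illustration of the net effect captured by interaction information, the following minimal sketch evaluates it for jointly Gaussian variables directly from a covariance matrix. The helper names (`gaussian_mi`, `interaction_information`) and the two toy covariance matrices are assumptions made for this example, not part of the analysis pipeline; the sketch does not implement Iccs.

```python
import numpy as np

def gaussian_mi(cov, ix, iy):
    """Mutual information (bits) between variable groups ix and iy
    of a joint Gaussian with covariance matrix cov."""
    ix, iy = list(ix), list(iy)
    ixy = ix + iy
    _, ld_x = np.linalg.slogdet(cov[np.ix_(ix, ix)])
    _, ld_y = np.linalg.slogdet(cov[np.ix_(iy, iy)])
    _, ld_xy = np.linalg.slogdet(cov[np.ix_(ixy, ixy)])
    return 0.5 * (ld_x + ld_y - ld_xy) / np.log(2)

def interaction_information(cov, meg, a, v):
    """I(MEG; [A, V]) - I(MEG; A) - I(MEG; V): positive for net synergy,
    negative for net redundancy."""
    joint = gaussian_mi(cov, meg, list(a) + list(v))
    return joint - gaussian_mi(cov, meg, a) - gaussian_mi(cov, meg, v)

# Variable order in both toy covariances: [MEG, A, V] (hypothetical values).
# Case 1: A and V independent, MEG = A + V + unit-variance noise.
cov_syn = np.array([[3.0, 1.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0]])
# Case 2: A and V correlated (0.9), MEG = A + unit-variance noise.
cov_red = np.array([[2.0, 1.0, 0.9],
                    [1.0, 1.0, 0.9],
                    [0.9, 0.9, 1.0]])

ii_syn = interaction_information(cov_syn, [0], [1], [2])  # net synergy
ii_red = interaction_information(cov_red, [0], [1], [2])  # net redundancy
print(ii_syn, ii_red)
```

In the first case the two inputs jointly predict the response better than the sum of their individual contributions; in the second, the visual input only repeats what the auditory input already provides.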
The PID allows us to separate the redundant and synergistic contributions to the interaction information, as well as the unique information in each modality (Fig 1). For clarity, we restate the interpretation of these terms in this experimental context.

Unique Information (MEG; A) - This quantifies that part of the MEG activity that can be explained or predicted only from the auditory speech envelope. This necessarily means it represents entrainment to speech envelope features that are not common to the lip movement (i.e., the relationships quantified in Fig 1C).

Unique Information (MEG; V) - This quantifies that part of the MEG activity that can be explained or predicted only from the visual lip area. This necessarily means it represents entrainment to lip movement features that are not common to the speech envelope (i.e., the relationships quantified in Fig 1D).

Redundancy (MEG; A V) - This quantifies the information about the MEG signal that is common to or shared between the two modalities. Alternatively, this quantifies the representation in the MEG of the variations that are common to both signals.

Synergy (MEG; A V) - This quantifies the extra information that arises when both modalities are considered together. It indicates that prediction of the MEG response is improved by considering the dynamic relationship between the two stimuli, over and above what could be obtained from considering them individually.
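The four terms above are tied together by the standard two-source PID bookkeeping: once a redundancy value is fixed, the unique and synergistic terms follow from the three mutual information values. The sketch below shows only that arithmetic. As a loudly labelled assumption, it defaults to the minimum of the two unimodal MI values as a stand-in redundancy; this is not the Iccs measure used in the analysis, and the input numbers are hypothetical.

```python
def pid_two_sources(mi_a, mi_v, mi_joint, redundancy=None):
    """Two-source PID bookkeeping for a target (MEG) and sources A and V.

    mi_a     : I(MEG; A)
    mi_v     : I(MEG; V)
    mi_joint : I(MEG; [A, V])
    redundancy : value from a chosen redundancy measure. If None, the
        minimum of the unimodal MIs is used as a placeholder
        (an assumption for illustration, NOT Iccs).
    """
    if redundancy is None:
        redundancy = min(mi_a, mi_v)
    unique_a = mi_a - redundancy
    unique_v = mi_v - redundancy
    synergy = mi_joint - unique_a - unique_v - redundancy
    return {"redundancy": redundancy, "unique_a": unique_a,
            "unique_v": unique_v, "synergy": synergy}

# Hypothetical MI values in bits.
res = pid_two_sources(mi_a=0.30, mi_v=0.10, mi_joint=0.45)
print(res)

# The four terms always sum to the joint MI, and synergy minus
# redundancy recovers the interaction information.
total = sum(res.values())
net = res["synergy"] - res["redundancy"]
```

Whatever redundancy measure is plugged in, the decomposition keeps the total equal to I(MEG; [A, V]), which is why the choice of redundancy measure fully determines the other three terms.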

Partial Information Decomposition (PID) analysis.
For brain signals, frequency-specific brain activation time-series were computed by applying the beamformer coefficients to the MEG data filtered in the same frequency band (fourth order Butterworth filter, forward and reverse, center frequency ± 2 Hz). The auditory and visual speech signals were filtered in the same frequency band. MEG signals were shifted by 100 ms as in previous studies [5,15] to compensate for delays between stimulus presentation and cortical responses. Then, each PID map was computed from the auditory and visual speech signals and the source-localized brain signal, for each voxel and each frequency band, across 1-s-long data segments overlapping by 0.5 s.
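The lag compensation and segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling rate `fs`, the function name `shifted_segments`, and the direction of the shift (speech at time t paired with MEG at time t + 100 ms) are choices made for the example, and the band-pass filtering step is omitted.

```python
import numpy as np

def shifted_segments(meg, speech, fs, lag_s=0.1, win_s=1.0, step_s=0.5):
    """Pair speech at time t with MEG at time t + lag, then cut both
    into windows of win_s seconds that start every step_s seconds."""
    lag = int(round(lag_s * fs))
    meg_al = meg[lag:]                        # MEG delayed by the lag
    speech_al = speech[:len(speech) - lag]    # speech trimmed to match
    win = int(round(win_s * fs))
    step = int(round(step_s * fs))
    starts = range(0, len(meg_al) - win + 1, step)
    meg_seg = np.stack([meg_al[s:s + win] for s in starts])
    speech_seg = np.stack([speech_al[s:s + win] for s in starts])
    return meg_seg, speech_seg

fs = 250                                      # assumed sampling rate (Hz)
meg = np.arange(10 * fs, dtype=float)         # sample indices stand in for data
speech = np.arange(10 * fs, dtype=float)
meg_seg, speech_seg = shifted_segments(meg, speech, fs)
print(meg_seg.shape)                          # (n_segments, samples per segment)
```

Using sample indices as data makes the alignment easy to inspect: the first MEG sample in each window sits 25 samples (100 ms at 250 Hz) after the corresponding speech sample.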
As described above (MI between auditory and visual speech signals), the complex spectra obtained from the Hilbert transform were amplitude normalized, and the real and imaginary parts were each rank-normalized. The covariance matrix of the full 6-dimensional signal space was then computed, which completely describes the Gaussian-Copula dependence between the variables. The PID was applied with redundancy measured by pointwise common change in surprisal (Iccs) [16] for Gaussian variables.
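A minimal sketch of the normalization steps in this paragraph is given below, assuming complex-valued analytic signals are already available for the MEG voxel and the two speech signals. The function names (`copnorm`, `six_dim_covariance`) and the random test signals are assumptions for illustration; the sketch stops at the covariance matrix and does not compute Iccs.

```python
import numpy as np
from statistics import NormalDist

def copnorm(x):
    """Rank-normalize a 1-D array: replace each value by the standard
    normal quantile of its rank divided by (n + 1)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1     # ranks 1..n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(r / (len(x) + 1)) for r in ranks])

def six_dim_covariance(meg_c, aud_c, vis_c):
    """Covariance of [Re, Im] of each amplitude-normalized complex signal
    after rank-normalization (2 dimensions x 3 signals = 6)."""
    rows = []
    for z in (meg_c, aud_c, vis_c):
        z = z / np.abs(z)                     # amplitude normalization
        rows.append(copnorm(z.real))
        rows.append(copnorm(z.imag))
    return np.cov(np.vstack(rows))

rng = np.random.default_rng(0)
n = 500

def random_complex():
    # Hypothetical stand-in for a Hilbert-transformed signal.
    return rng.standard_normal(n) + 1j * rng.standard_normal(n)

cov6 = six_dim_covariance(random_complex(), random_complex(), random_complex())
z3 = copnorm([3.0, 1.0, 2.0])
print(cov6.shape, z3)
```

Because rank-normalization makes every marginal approximately standard normal, the covariance matrix alone carries the dependence structure that the Gaussian estimators then use.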
This calculation was performed independently for each voxel, resulting in volumetric maps for the four PID terms (redundant information, unique information of auditory speech, unique information of visual speech, synergistic information) for each frequency band in each individual. This computation was performed for all experimental conditions: 'All congruent', 'All incongruent', 'AV congruent'.
In addition, surrogate maps were created by computing the same decomposed information maps between brain signals and time-shifted speech signals for each of the four experimental conditions in each individual. Visual speech signals were shifted by 30 s and auditory speech signals were shifted by 60 s. This surrogate data provides an estimate of each information map that can be expected by chance for each condition. This surrogate data is not used to create a null distribution but to estimate analysis bias. The surrogate data is used in analysis for Figs 1B-D, 5C-D, and S1, S2, S4 Figs.
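The role of the time-shifted surrogate as a bias estimate can be illustrated with a toy example. This is a sketch under several assumptions: a circular shift (`np.roll`) stands in for the shifting procedure, a one-dimensional Gaussian MI estimate replaces the PID terms, the sampling rate is arbitrary, and the signals are white noise (real speech is autocorrelated).

```python
import numpy as np

def gaussian_mi_1d(x, y):
    """Gaussian MI estimate (bits) between two 1-D signals."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log2(1.0 - r ** 2)

def time_shift(signal, shift_s, fs):
    """Circularly shift a signal by shift_s seconds (assumed procedure)."""
    return np.roll(signal, int(round(shift_s * fs)))

fs = 50                                        # assumed sampling rate (Hz)
n = 200 * fs                                   # 200 s of toy data
rng = np.random.default_rng(0)
speech = rng.standard_normal(n)
meg = speech + rng.standard_normal(n)          # toy "entrained" response

real_mi = gaussian_mi_1d(meg, speech)
surrogate_mi = gaussian_mi_1d(meg, time_shift(speech, 30, fs))
corrected_mi = real_mi - surrogate_mi          # bias-corrected estimate
print(real_mi, surrogate_mi, corrected_mi)
```

The shifted signal keeps its own statistics but loses its temporal alignment with the brain signal, so whatever information remains reflects estimator bias rather than entrainment.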

Delayed Mutual Information analysis.
We used Delayed Mutual Information to investigate to what extent brain areas predict upcoming auditory or visual speech. Delayed mutual information refers to mutual information between two signals offset with different delays. If there is significant MI between brain activity at one time and the speech signal at a later time, this shows that brain activity contains information about the future of the speech signal. We calculated delayed MI between each voxel and the two speech stimuli, from 0 ms to 500 ms with a 20 ms step (S2 Fig). Directed Information or Transfer Entropy [48,49] is based on the same principle but additionally conditions out the past of the speech signal, to ensure the delayed interaction is providing new information over and above that available in the past of the stimulus. Here, since the delayed MI peaks are clear and well isolated from the 0 lag, we present the simpler measure, but transfer entropy calculations revealed similar effects (results not shown).
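The delay scan can be sketched as a loop over lags in which brain activity at time t is paired with speech at time t + delay. The example below is a simplified illustration: it uses a one-dimensional Gaussian MI estimate rather than the phase-based estimator of the analysis, and the sampling rate, function names, and synthetic signals are assumptions.

```python
import numpy as np

def gaussian_mi_1d(x, y):
    """Gaussian MI estimate (bits) between two 1-D signals."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log2(1.0 - r ** 2)

def delayed_mi(brain, speech, fs, delays_ms):
    """MI between brain activity at time t and speech at time t + delay."""
    values = []
    for d in delays_ms:
        k = int(round(d * fs / 1000))          # delay in samples
        values.append(gaussian_mi_1d(brain[:len(brain) - k], speech[k:]))
    return np.array(values)

fs = 100                                       # assumed sampling rate (Hz)
delays_ms = np.arange(0, 501, 20)              # 0 to 500 ms in 20 ms steps

rng = np.random.default_rng(0)
brain = rng.standard_normal(20000)
# Toy speech signal that follows the brain signal by 200 ms (20 samples).
speech = np.roll(brain, 20) + rng.standard_normal(20000)

mi = delayed_mi(brain, speech, fs, delays_ms)
peak_delay = delays_ms[np.argmax(mi)]
print(peak_delay)
```

In this toy case the scan peaks at the delay built into the data; averaging `mi` across delays gives a single summary value per region and condition.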

Selection of brain regions for Partial Information Decomposition (PID) analyses predictive of speech.
We selected eight brain regions to test a predictive mechanism, i.e., does MEG activity predict

Partial Information Decomposition (PID) analysis predictive of speech.
The PID analysis described above was computed to investigate cross-modal AV representational interactions in an individual brain region. However, both PID and interaction information can also be applied to consider representational interactions between two brain regions with respect to a single stimulus feature (as RSA is normally applied). To understand representational interactions between brain regions predictive of speech signals, we computed PID values with activity from two brain regions as the predictor variables and a unimodal speech signal (auditory or visual) as the target variable.
For this, we used the eight brain regions (see above) and computed the PID for each pair of the eight regions predictive of visual speech or auditory speech. This resulted in 28 pairwise computations (n(n-1)/2 with n = 8).
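The pair count can be checked by enumerating the unordered pairs directly. The region labels below are hypothetical placeholders, not the regions used in the analysis.

```python
from itertools import combinations

# Hypothetical placeholder labels for the eight selected regions.
regions = [f"region_{i}" for i in range(1, 9)]
targets = ["auditory", "visual"]

pairs = list(combinations(regions, 2))         # unordered pairs, no repeats
jobs = [(r1, r2, t) for (r1, r2) in pairs for t in targets]

print(len(pairs))   # 28
print(len(jobs))    # 56
```

Each of the 28 region pairs serves as the two predictor variables, evaluated once per unimodal speech target.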
Here, redundant information between two brain regions means they both provide the same prediction of the upcoming speech signal. A synergistic interaction demonstrates that the particular dynamic relationship between the neural activity in the two regions is itself predictive of speech, in a way that the directly recorded MEG in each region alone is not (S4 Fig).

Statistics.
Group statistics were performed on the data of all 44 participants using FieldTrip. First, individual volumetric maps for each calculation (Mutual Information, Unique Information, Redundancy, Synergy) were smoothed with a 10-mm Gaussian kernel. They were then subjected to dependent t-statistics using nonparametric randomization (Monte Carlo randomization) for comparisons between experimental conditions or to surrogate data. Results are reported after multiple comparison correction using FDR (False Discovery Rate) [51].
For the maps relating synergistic and redundant PID to behavior, each information map (unique, redundant, and synergistic map) was subjected to regression analysis. In the regression analysis, we detected brain regions that were positively correlated with comprehension accuracy using nonparametric randomization (Monte Carlo randomization). Regression t-maps were then converted to standard Z-maps (Z-transformation) and subtracted between conditions (P < 0.005).
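The logic of a Monte Carlo randomization test on a dependent (paired) t-statistic can be sketched for a single voxel as follows. This is an illustrative stand-in, not the FieldTrip implementation: the function names, the number of randomizations, and the synthetic data are assumptions, and swapping condition labels within a participant is implemented as a random sign flip of the paired difference.

```python
import numpy as np

def paired_t(diff):
    """Dependent-samples t-statistic for a vector of paired differences."""
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

def sign_flip_test(cond_a, cond_b, n_perm=1000, seed=0):
    """Two-sided Monte Carlo p-value for the paired t-statistic."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(cond_a, dtype=float) - np.asarray(cond_b, dtype=float)
    t_obs = paired_t(diff)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diff)))
    t_null = np.array([paired_t(s * diff) for s in signs])
    p = (np.sum(np.abs(t_null) >= abs(t_obs)) + 1) / (n_perm + 1)
    return t_obs, p

# Synthetic values for one voxel in 44 participants (hypothetical data).
rng = np.random.default_rng(1)
cond_b = rng.standard_normal(44)
cond_a = cond_b + rng.normal(1.0, 0.5, size=44)

t_obs, p_val = sign_flip_test(cond_a, cond_b)
print(t_obs, p_val)
```

In a whole-brain analysis the same randomization would be applied at every voxel, followed by the multiple comparison correction described above.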

S1 Fig. Neural decomposition of natural audiovisual speech ('All congruent' condition).
In order to define characteristics of the decomposed information in the naturalistic audiovisual speech condition ('All congruent'), we used predefined ROI maps from the SPM Anatomy Toolbox (version 2.1) [52] and Automated Anatomical Labeling (AAL) [53]. The SPM Anatomy Toolbox provides probabilistic cytoarchitectonic maps, which provide stereotaxic information on the location and variability of cortical areas in MNI (Montreal Neurological Institute) space. AAL maps provide an anatomical parcellation of the spatially normalized single-subject high-resolution T1 in MNI space. We first transformed each ROI map to the dimensions of our source space data, then extracted each information measure (unique unimodal information for auditory and visual speech, redundancy and synergy) of the bandpass-filtered (low frequencies 1-7 Hz) phase data from each ROI and averaged each information value within the ROI. This was performed for the 'All congruent' condition and for time-shifted surrogate data. For each participant, the surrogate value was subtracted from each information value, and the result was averaged across all participants (mean ± s.e.m). Data are shown for each hemisphere (LH, RH). Comparisons against surrogate data were also performed, and significant results are marked with an asterisk on the corresponding bar (paired two-sided t-test; P < 0.05).

First, for unimodal unique information (UI-A, UI-V), as expected, auditory unique information (A, B) was stronger in primary auditory cortices, while visual unique information (C, D) was stronger in visual cortices as well as in auditory areas, suggesting that auditory cortex might be involved in visual speech processing.
Next, for Redundancy (E, F) and Synergy (G, H), redundant information is overall stronger than synergistic information in all main areas in both hemispheres. This was expected given the matching audiovisual inputs in this condition ('All congruent'). In auditory/temporal areas, both redundant and synergistic information were significantly different from time-shifted surrogate data. This pattern is the same in both hemispheres. However, in visual areas, only redundant information is significant, whereas almost none of the synergistic information in visual cortices remains significant. In motor/sensory and language-related areas, redundant information is strongly significant in both hemispheres. Synergistic information in inferior frontal regions was also significant, with a more left-lateralized pattern. It should be noted that this pattern arises when the perceived speech is natural, unlike when the task is challenging as in the 'AV congruent' condition (the main analysis), so redundant information is highly likely to be stronger than synergistic information.

S2 Fig. Left pSTG and left motor cortex differentially predict visual speech.
A potential benefit of speech-entrained brain activity is the facilitation of temporal prediction of upcoming speech. We therefore investigated to what extent the different integration mechanisms in pSTG and motor cortex (reflected by differences in redundancy versus synergy) lead to differences in prediction. Since informativeness changed most strongly for the visual (speech) input signal (informative for 'AV congruent', less informative for 'All congruent'), we expected the strongest prediction effects for visual speech. We investigated prediction by means of delayed MI (see Methods for details) between theta phase in each brain area and later theta phase in visual speech. We tested delays from 0 ms to 500 ms in steps of 20 ms and then averaged the values across these delays. Interestingly, prediction of visual speech varied between conditions in both brain areas, but in different ways. Left pSTG predicts visual speech more strongly in the 'All congruent' and 'AV congruent' conditions than in the incongruent condition (A; t = 2.99, P = 0.004 in 'All congruent' > 'All incongruent'; t = 2.32, P = 0.02 in 'AV congruent' > 'All incongruent'). Left motor cortex predicts visual speech more strongly for 'AV congruent' than for 'All congruent' (B; attention effect; t = 2.24, P = 0.03).
When we unfolded these patterns in the temporal domain, a more interesting pattern emerged. The prediction mechanism in left pSTG operates at shorter temporal delays of 150-300 ms (C; P < 0.05), whereas left motor cortex is involved at longer temporal delays of 350 ms and above (D; P < 0.05).
These findings suggest that left pSTG is mostly sensitive to congruent audiovisual speech (as demonstrated by redundancy in Fig 4) and best predicts visual speech when congruent audiovisual speech is available in the absence of distracting input. This happens fast at shorter delays with visual speech. However, this pattern is different for left motor cortex. Here, we see better prediction in the 'AV congruent' condition, when visual speech information is informative, attended and useful to resolve a challenging listening task. Thus it has rather slow temporal dynamics at delays greater than 350 ms.
The prediction of auditory speech was not different between conditions. This is expected because the level of auditory attention is similar across conditions.
Overall, this suggests that integration mechanisms in left pSTG are optimized for congruent audiovisual speech. This is consistent with results in Figs

In the 'L pSTG' plot, when MEG response < 0 (C), the redundancy comes from above median values of both auditory and visual speech. This shows that when both auditory and visual speech signals are high (above median), they redundantly suggest that MEG response is below the median value in the band. However, when both auditory and visual speech signals are below their median values, they redundantly suggest an above median MEG response (D).
The 'L Motor Cortex' plot shows that when auditory and visual signals have opposite signs (i.e. an above median value in one signal occurring with a below median value of the other signal, red diagonal nodes) (G, H), they synergistically inform about a co-occurring MEG response value. That is, knowing that A is high and V is low together (or knowing that V is high and A is low together) provides a better prediction of a specific MEG response value than would be expected if the evidence was combined independently. Here the negative node (blue) indicates a negative synergistic contribution to mutual information (sometimes called misinformation).
Above/below median values can be interpreted as loud/quiet auditory speech and large/small lip movement, so low sound amplitude combined with small lip movement can produce a larger response in the left pSTG, whereas high sound amplitude combined with large lip movement can produce a smaller response in the left pSTG. This seems a highly plausible mechanism given the left pSTG's role in AV integration: it has greater involvement when neither speech signal is physically strong enough. However, synergy shows a different pattern that regardless of high or low MEG

To further understand the integration mechanism of audiovisual speech processing observed in the redundant and synergistic interactions between multisensory speech signals predictive of brain activity, we computed the PID differently, quantifying redundant and synergistic interactions between brain regions predictive of speech signals (auditory or visual). We selected eight brain regions from the statistical contrasts for attention and congruence effects shown in Figs 4 and 5 (see Methods for details). The signals at the maximum coordinates were extracted from each region, and the PID was computed for each pair of brain regions predictive of auditory or visual speech signals. Each condition was compared to time-shifted surrogate data and between conditions. We found interesting results for synergistic interaction (but not for redundant information) between brain regions on visual speech for the attention effect ('AV congruent' > 'All congruent'). Precuneus, SMA-Precuneus were observed to be predictive of visual speech (paired two-sided t-test; P < 0.05). However, we could not find any significant interaction (either redundancy or synergy) between these regions predictive of auditory speech.
These results suggest that synergistic information interaction between the regions centered on left inferior frontal gyrus (BA44/BA45) and motor areas, which matches the dorsal stream in speech processing [21], plays an important role in attention to speech, particularly visual speech, when the task is challenging.