Sensitivity of occipito-temporal cortex, premotor and Broca’s areas to visible speech gestures in a familiar language

When looking at a speaking person, the analysis of facial kinematics contributes to language discrimination and to the decoding of the time flow of visual speech. To disentangle these two factors, we investigated behavioural and fMRI responses to familiar and unfamiliar languages when observing speech gestures with natural or reversed kinematics. Twenty Italian volunteers viewed silent video-clips of speech shown as recorded (Forward, biological motion) or reversed in time (Backward, non-biological motion), in Italian (familiar language) or Arabic (non-familiar language). fMRI revealed that language (Italian/Arabic) and time-rendering (Forward/Backward) modulated distinct areas in the ventral occipito-temporal cortex, suggesting that visual speech analysis begins in this region, earlier than previously thought. Left premotor ventral (superior subdivision) and dorsal areas were preferentially activated with the familiar language independently of time-rendering, challenging the view that the role of these regions in speech processing is purely articulatory. The left premotor ventral region in the frontal operculum, thought to include part of the Broca’s area, responded to the natural familiar language, consistent with the hypothesis of motor simulation of speech gestures.

Observers can discriminate fairly reliably between silent video-clips of a speaker played as recorded (Forward mode) or time-reversed (Backward mode) [1]. It was argued that natural kinematics (recognition of biological motion) rather than linguistic competences had a role in this task [1]. Visual speech stimuli have been shown to engage cortical motor areas involved in speech production, such as the left inferior frontal gyrus (IFG) that includes Brodmann's Area BA 44 and BA 45 (pars opercularis and triangularis of IFG, respectively) thought to overlap with Broca's region [43] and the ventral premotor cortex (PMv), but also more dorsal regions of the premotor cortex (PMd) [27,40,[44][45][46][47][48][49]. Importantly, part of the frontal areas implicated in the control of movements and speech are connected with the visual cortex [50,51]. The motor theory of speech perception [52][53][54] proposed that the activation of motor speech areas during the observation of speech might represent an implicit motor simulation of the observed gestures conducive to speech understanding [44,[55][56][57].
However, several authors questioned the idea that an automatic engagement of motor areas, such as IFG, during perceptual or cognitive task is evidence of a specific involvement of the motor system in perceptual or cognitive processes [28, [58][59][60][61][62][63][64]. The dorsal premotor cortex, rather than the Broca's area (BA 44/45), seems to be engaged both in the execution and the observation of speech gestures [62]. Conversely, it was found that the activity in the IFG correlates with hit-rate and response bias during speech perception tasks [27,65]. Since response bias and hit rate are characteristic indexes of the decisional process, these findings might suggest that high level processes related to the generation of the response decision (e.g. whether to respond Yes or No), rather than motor simulations occurred in the IFG during visual speech [66,67]. In summary, the specific role of IFG and Broca's area in the functional architecture of speech perception remains open to debate [43,68].
In the present study, we investigated the neural circuits engaged by language familiarity (Italian vs Arabic) and natural kinematics of biological motion (Forward vs Backward) of visible speech. Italian observers viewed silent video-clips of the mouth movements of Italian and Arabic actors speaking in their native language. Stimuli were rendered either in normal (Forward) mode or after a time reversal (Backward). During an fMRI session, participants were asked to identify the rendering mode (Forward or Backward). The brain regions sensitive to language familiarity and those sensitive to natural mouth movements of speech were identified through the fMRI contrasts Italian vs Arabic (main effect of language) and Forward vs Backward (main effect of rendering mode), respectively. We also computed the interaction between the two main factors by means of the contrast "language x rendering mode" [(Italian Forward vs Italian Backward) vs (Arabic Forward vs Arabic Backward)]. The latter contrast should identify the areas where the effects of Familiarity (Italian vs Arabic) was larger for natural (Forward) than non-natural (Backward) motion (i.e biological motion).

Participants
Forty healthy right-handed Italian volunteers took part in this study. Twenty participants (14 females, 6 males; mean age: 25 years; age range: 20-42 years) were tested in a preliminary experiment and twenty different participants (13 females, 7 males; mean age: 23 years; age range: [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35] in the main study (i.e. fMRI and in a follow-up experiment, see later). All participants had normal or corrected to normal vision. None of them had any familiarity with the Arabic language or experience with lip-reading. Written informed consent to procedures approved by the Institutional Review Board of Fondazione Santa Lucia was obtained from each participant. Experimental protocols complied with the Declaration of Helsinki on the use of human subjects in research.

Stimuli
Ten adults (5 females, 5 males) native speakers of Arabic and 10 adults (5 females, 5 males) native speakers of Italian volunteered as actors for generating the stimuli. We chose the Arabic rather than other European languages in order to ensure that the participants would have not been exposed more than occasionally to this language before, and we verified that this was the case. The choice of the Arabic was also motivated because it differed from the Italian more than did most other European languages. Indeed, articulatory movements for the production of words in Arabic and Italian languages are different [18,20].
Two of the authors (P.V. and V.M.) selected Italian speakers so that, after careful visual inspection, the general features of the lower part of the face were roughly similar to those of the Arabic speakers previously selected. None of the actors participated in the main or in the preliminary study. Each actor read four texts excerpted from newspapers in his/her native language (Arabic, A or Italian, I). The preparation of the stimuli involved three steps. First, we recorded the lower part of the face (including the upper/lower lips and the chin) of each actor with a digital camera (25 frames/s, Sony HDR-SR-8E), and stored the results as a sequence of single frames (1024 x 768 pixel, RGB TIFF format). Second, static frames were processed with Photoshop CS6 to equalize for luminance and chromatic spectrum, and cropped to the size of 972 x 694 pixels in order to display only the mouth movements. Finally, by using Virtual Dub, we transformed the sequences of frames into silent AVI video-clips lasting 14 s. The experimental stimuli consisted in the video-clips rendered either as recorded (Forward mode, F) or after reversing the frames order (Backward mode, B Additionally, we checked whether Arabic and Italian video-clips differed in motion energy. For each two consecutive frames of each video-clip, we calculated the mean of the squared differences in the red, green, and blue channels in every pixel [69][70][71][72]. The motion energy was estimated as the average of these values across all pixels and frames (350 frames) of each video. We found that the motion energy was not significantly different (unpaired t-test; p = 0.7; tvalue (78) = 0.38) between Italian (mean ± SD: 34.0 ± 11.8 pixel 2 /frame 2 ) and Arabic videos (34.9 ± 7.8 pixel 2 /frame 2 ).

General outline of the study
The study involved three successive experiments: a preliminary, purely behavioural session with the first group of 20 participants in which we estimated the ability to discriminate presentation modes (Backward/Forward); a main session with the second group of 20 participants in which this ability was estimated while measuring brain activity with fMRI; a follow-up session with the same participants of the fMRI session in which we estimated the ability to discriminate language familiarity (Italian/Arabic).

Main task: Identification of rendering mode
In both fMRI and preliminary experiments, participants were informed that the video-clips being shown could be either a faithful or a time-reversed rendering of actual speech movements. They were not informed that in half of the videos the actor's language was Italian (I) and in remaining half was Arabic (A). The task (2-Alternative Forced Choice: 2-AFC) was to indicate whether the video was displayed as recorded (Forward) or reversed in time (Backward). Participants had to wait until the end of the stimulus before responding. Responses were entered by pressing with the right index finger one of two buttons marked "F" (Forward) and "B" (Backward), respectively. Between trials, the display was uniformly grey and participants fixated a central point (0.5˚visual angle). No constraints were imposed on oculomotor behaviour during the presentation of the stimuli. Before each experimental session, participants were administered eight warm-up trials, which included at least one example for each combination of actor gender, native language, and rendering mode. The results of these trials were not analysed.

Follow-up task: Identification of language
The participants in the fMRI experiment were retested 10 (± 2) days later in a follow-up experiment outside the scanner. They viewed the same 160 silent video-clips described above. Participants were informed that the actor's language could be either Italian or another, unspecified language, but no further information was provided. Participants were asked to wait until the end of the stimulus before responding (2-AFC). Reponses were entered by pressing with the right index finger a button marked either I (Italian) or NI (not Italian). The aim of this experiment was to gauge the accuracy with which viewers could discriminate a familiar language (Italian) from an unfamiliar one (Arabic) using only visual cues, and to test whether the time-arrow (forward vs. backward rendering mode) affects the judgment of language familiarity. We hypothesized that the visual cues used to discriminate languages (e.g., temporal variability of vowels) are different from those used for discrimination of rendering mode (e.g., the acceleration profiles of opening/closing movements of the mouth). If so, the sensitivity index (see below) should be uncorrelated between the two tasks.

General procedure
In each experiment, the total number of stimuli (160) was divided in 5 runs (32 stimuli/run) with the constraint that successive stimuli never involved the same actor. Stimuli were pseudorandomized and presented using the Presentation software (Neurobehavioural system 1 ). Within runs, interstimulus intervals (ISI) followed a uniform distribution (range: 2 s-4 s; mean: 3 s). The five runs were administered in a single session and were separated by brief pauses. Additionally, in the fMRI experiment, to estimate more accurately the shape of BOLD impulse response [73][74][75], we pseudo-randomly inter-mixed null events (N = 35, duration 8 s). Thus, the duration of each run was 10' 50" during fMRI and 9' 54" in the follow-up and preliminary experiments.

fMRI experiment: Set-up
Participants lay supine in the MR scanner with the head immobilized with foam cushioning and wore earplugs and headphones to suppress ambient noise. A digital projector (NEC LT158, refresh rate: 60-Hz) projected the stimuli through an inverted telephoto lens onto a semi-opaque Plexiglas screen mounted vertically inside the scanner bore, behind the participant's head. The back-projected image was then viewed via a mirror mounted on the head coil positioned at about 4.5 cm from the eyes. The eye-to-screen equivalent distance was 66 cm, and the angular size of the projected image was 9˚(width) × 6.4˚(height). Responses were acquired with an MR-compatible response box (fORP, Current Designs).

Follow-up experiment and preliminary experiment: Set-up
The follow-up and preliminary experiments were performed in a quiet, dimly illuminated room. Participants sat in front a 19" LCD monitor and viewed the silent video-clips (9˚x 6.4v isual angle) in a pseudo-random order at a distance of about 80 cm. Responses were entered via a high-speed button box (Empirisoft 1 ).

Behavioural data analysis
In the fMRI experiment and in the preliminary experiments, where the task was to identify the rendering mode (see above), responses were collated by language, rendering mode and actor.
The number of responses "Forward" to Forward and Backward stimuli are indicated as N F|F and N F|B , respectively, while the number of responses "Backward" to Forward and Backward stimuli for each language are indicated as N B|F and N B|B , respectively. For each participant, the sample size was N T = 80 for each language, thus a total of 1600 trials for each language was collected (N T x 20 participants). For each participant, we computed a sensitivity index d' = Z {Hit} − Z{False Alarm} and a response bias c = −0.5 � (Z{Hits} + Z{False Alarm}), where Z{Hit} and Z{False Alarm} are the z-scored transformed values of P{Hit} = P{F|F} = N F|F /N T , and P {False Alarm} = P{F|B} = N F|B /N T , respectively [76]. Moreover, we calculated the probability of correct responses as P{C} = (N F|F + N B|B ) / N T .
Similarly, in the follow-up experiment where the task was to identify the actor's language (see above), responses were collated by language, rendering mode and actor. The number of responses "Italian" to Italian and Arabic stimuli are denoted as N I|I and N I|A , respectively, and N A|I and N A|A are the number of responses "Not Italian" to Italian and Arabic stimuli, respectively. For each participant, the sample size was N T = 80 for each rendering mode, thus and a total of 1600 trials for each rendering mode was collected (N T x 20 participants). We estimated sensitivity and response bias through d' and c indexes, respectively, based on the convention that, in this case, Z{Hit} and Z{False Alarm} are the z-scored transformed values of P{Hit} = P {I|I} = N I|I / N T and P{False Alarm} = P{I|A} = N I|A / N T , respectively. Moreover, we calculated the probability of correct responses: P{C} = (N I|I + N A|A ) / N T .
We considered d' and response bias in addition to the probability of correct responses, since the latter might be inflated by response bias and lead to misleading interpretations [76].
We expected that responses to the stimuli depended on whether participants had to discriminate between rendering mode (in the first two experiments) or languages (in the followup experiment). Thus, within-subject responses to the rendering-mode and language discrimination task should show different patterns, and the sensitivity index (d') should be uncorrelated between tasks. To verify these points, we calculated the correlation coefficient of participants' sensitivity index between the main task and the follow-up experiment task.

fMRI data acquisitions
MR images were acquired with a Siemens Magnetom Allegra 3T head-only scanning system (Siemens Medical Systems, Erlangen, Germany), equipped with a quadrature volume RF head coil. Whole brain BOLD echoplanar imaging (EPI) functional data were acquired with a 3Toptimized gradient echo pulse-sequence (TR = 2.47 s, TE = 30 ms; flip angle = 70˚; FOV = 192mm, fat suppression). 38 slices of BOLD images (volumes) were acquired in ascending order (64 x 64 voxels, 3 x 3 x 2.5 mm 3 , distance factor: 50%; inter-slice gap = 1.25 mm; slice thickness = 2.5 mm), covering the whole brain. For each participant, a total of 1315 volumes of functional data were acquired in five consecutive runs. At the end of each run, the acquisition was paused briefly. Structural MRI data were acquired using a standard T1-weighted scanning sequence of 1 mm 3 resolution (MPRAGE; TE = 2.74 ms, TR = 2500 ms, inversion time = 900 ms; flip angle = 8˚; FOV = 256 × 208 × 176 mm 3 ).

fMRI data preprocessing
Data and statistical analyses were performed using the SPM12 software (Wellcome Trust Centre for Neuroimaging, London, UK) implemented in MATLAB R2013 (The MathWorks Inc., Natick, MA) using standard procedures [77,78]. After discarding the first four volumes of each run, images were corrected for head movements, realigned to the mean image, coregistered to the structural image, and normalized to Montreal Neurological Institute (MNI) space using unified segmentation [79], including resampling to 2 × 2 × 2 mm voxels, and spatially smoothed with a 8 mm full-width at half maximum (FWHM) isotropic Gaussian kernel. Voxel time series were processed to remove autocorrelation using a first-order autoregressive model and high-pass filtered (128-s cut-off).

fMRI analysis
Patterns of brain activations were computed using the general linear model and a Finite Impulse Response (FIR) set of base functions. Here, the FIR approach is ideal to fit brain activity, because it can identify changes of activity over time without making any assumptions about the profile of these changes [80]. Accordingly, for each participant, the FIR estimated the level of activation in 12 successive time-bins. Each time-bin consisted of 1 TR (2.47 s), thus fitting 30 s of the fMRI data for each stimulus. We modelled 5 different event-types: Italian Forward rendering (I F ), Italian Backward rendering (I B ), Arabic Forward rendering (A F ), Arabic Backward rendering (A B ), time-locked to stimulus onset, thus obtaining 12 images (one for each time bin) for each correct trial of the 4 conditions, plus an additional event corresponding to errors trials irrespective of condition. Motion correction parameters were also included as effects of no interest. We analysed the activity related only to stimuli correctly identified (correct trials), since error trials (stimuli not correctly identified) may introduce confounding activation (i.e. contamination of the activation related to poorer performance by increased errors [81,82]. However, to evaluate the effect of error trials on the fMRI activity, we did a supplementary analysis (not reported here) with all trials (correct and error trials). We found that the brain sites activated in the main fMRI analysis (see Results and Table 1) were also activated in the supplementary fMRI analysis, although at an uncorrected level (puncorr < 0.05), thus indicating that error trials decrease the signal-to-noise ratio.
At single-subject level, we estimated four effects of interest. . The resulting parameters for each contrast (corresponding to 12 images, one for each time bin) in each participant were then entered into second-level group analyses [83].
Four separate one-way ANOVAs with 12 levels (each corresponding to one time-bin) were performed at the second (group) level. We used F-contrasts to highlight brain areas showing differential activity over the 12 time-bins, separately for each of the four ANOVAs. In particular F-contrasts subtracted the activity of the first bin (i.e. one TR at stimulus onset) from each of the other bins, thus capturing the changes of activity over-time. All analyses included appropriate corrections for non-sphericity. Statistical thresholds were set at p-FWE < 0.05, familywise error corrected for multiple comparisons at cluster level (hereafter, p-corr < 0.05), using a voxel-wise threshold set at p< 0.001 [84,85]. Furthermore, post-hoc t-tests on each time bin were false-discovery-rate (FDR) corrected for n multiple comparisons at p < 0.05 across the number of bins (n = 12).

Regions of interest
In addition to the previous whole-brain analysis, we also performed an analysis based on regions of interest (ROIs). In particular, we defined regions as spheres of 8 mm radius centred on premotor areas that respond to visual speech (Premotor Ventral inferior PMvi/Broca's xyz = −48 12 [90]. We applied family-wise-error small-volume-correction (FWE-SVC) to each ROI [91,92]. We retained results as significant at p < 0.05 FWE-SVC, further Bonferroni-corrected for the number of regions (n = 12).

Behavioural results
Main task: Identification of the rendering mode (Forward or Backward). During the preliminary and fMRI experiments, observers had to indicate whether the video-clip was played in Forward or Backward mode.
Observers detected the rendering mode (Forward or Backward) of video-clips with an overall probability of correct responses P{C} = 0.590 and P{C} = 0.556 for the preliminary and fMRI experiments respectively (pooled across participants and stimuli), significantly higher .05 one-sample ttest, for preliminary and fMRI experiments respectively) in favour of the response "Forward" for the Italian video-clips (Fig 1c), underlying the higher proportion of correct response in this task. By contrast, there was no response bias for the Arabic video-clips (c = 0.024 ± 0.06, p = 0.66, t(19) = 0.43 and c: -0.01 ± 0.08, p = 0.9, t(19) = 0.12, one-sample t-test, for preliminary and fMRI experiments respectively). The comparison of sensitivity (d') and response bias (c) indexes between the fMRI and the preliminary experiments, computed for both Italian and Arabic video clips, did not show significant differences (t-test, all t(19) < 1.28, p>0.21).
Follow-up experiment: Language identification. All the participants in the fMRI experiment were retested in a follow-up experiment to ascertain their ability to recognize the language by means of visual-only cues. In this experiment, volunteers had to indicate if the actor's language in the silent video clips was Italian or not. Fig 1a (white bars) reports for each condition the probability of correct responses (P{C} = 0.679) pooled across stimuli and participants. In all conditions, P{C} was significantly higher than chance level (two-tailed binomial test, p < 0.001). The average d' was not significantly different between video-clips played in forward (d': 1.05 ± 0.12) and backward mode (d': 0.96 ± 0.15) (paired t-test; p = 0.5; t(19) = 0.65) and in both cases d' > 0 (one sample t-test; all p <0.001; t(19) >7.02) (Fig 1d). There was a significant response bias in favour of the response "Italian" in the case of Forward video-clip (c: -0.21 ± 0.05, p < 0.001, t(19) = 4.43, one-sample t-test). Conversely, in the case of Backward video-clips, the response bias was in favour of the response "not Italian" (c: 0.12 ± 0.034, p < 0.01, t(19) = 2.95, one-sample t-test).
Comparison between tasks. We expected that stimuli were classified differently depending on the task, and that the two tasks had different response patterns across subjects. To verify this hypothesis, we compared the participants' sensitivity index (d') in the fMRI main task (rendering mode discrimination) and in the follow-up experiment (language discrimination). The analysis of d' showed a greater sensitivity to stimuli in the follow-up experiment task compared to stimuli sensitivity in the main task (paired t-test, t(19) = 6.15, p<0.001). An alternative possibility is that the higher d' in the second experiment could be due to a learning process occurring after the first experiment. In this case, we should expect a correlation across participants between tasks. However, the sensitivity indexes of the two tasks were not correlated (Pearson's r = 0.27, p = 0.23), suggesting that the two tasks rely on different processing.

fMRI results
Brain areas engaged by visual speech. We mapped the cortical regions activated by all visual speech stimuli, irrespective of the parameters manipulated experimentally, (i.e., all stimuli vs. rest) by a differential F-test across the 12 time-bins (see Methods). This test highlights regions having different amplitude and/or time-course of the BOLD response between conditions examined in the contrast image (in this case, all stimuli vs. rest condition). As shown in Fig 2, significant effects were observed in occipital and temporal cortices, i.e. regions that are typically involved in audio-visual processing, as well as in parieto-frontal cortices, which are generally engaged by the vision of speech movements (Bernstein et al 2014), in the insula, cingulate cortex, motor and premotor areas. Activity in left motor/premotor areas was presumably related, at least in part, to the right-hand motor responses.
Main effect of actor's language. Whole brain analysis. The differential F-test comparing Italian (familiar) vs. Arabic (unfamiliar) stimuli (irrespective of rendering mode) across the 12 time-bins revealed significant activations (p-corr < 0.05, whole brain) bilaterally in the posterior fusiform gyrus (FG a , at coordinates almost coinciding with those of FFA), extending to the inferior occipital gyrus (IOG) and right occipito-temporal sulcus (OTS, at coordinates near to those of pITS, a biological motion sensitive area), and in the precuneus (Fig 3, green area, Table 1). The time profiles of the estimated BOLD activity in these regions are plotted in Fig 4a, 4b and 4c. It is important to note that direct inspection of these activity patterns is necessary before any conclusion can be drawn, since significant differences obtained through our statistical analysis (i.e., F-test) could be due to a modulation in amplitude and/or to a timeshift. Time bins presenting a different activity level (post-hoc t-test, p-FDR corrected for multiple comparison < 0.05 across bins) between Italian and Arabic are filled in green. FG a showed enhanced earlier activity (see bins 1 th -3 th ) for Italian stimuli independently of rendering mode, and later activity for Arabic stimuli (6 th -8 th bins, compare grey and black lines in Fig 4). IOG and OTS, belonging to the same cluster of FG a , had a similar temporal profile (e.g., OTS in Fig 4). The precuneus showed the opposite trend, with a decreased activity in the earlier bins for Italian stimuli (see black lines in Fig 4c, 3 th -4 th bins) and in the later bins for Arabic stimuli (7 th -8 th bins).
ROI analysis. A significant effect of the familiar language independently of rendering mode (p-corr < 0.05, FWE-SVC Bonferroni) was also found in the left PMvs/PMd (Table 1, Fig 3, green area). These regions showed increased activity for the Italian stimuli independently of rendering mode only in late bins (6 th -9 th bins) (Fig 5a).
Also left FFA and right pITS showed a main effect of language, already reported in the whole brain analysis. By contrast PMvi, MT+/V5 and TVSA did not show a significant main effect of language.
Main effect of rendering mode. Whole brain analysis. Regions with differential responses to normal kinematics (i.e., video-clips played forward) and to implausible kinematics (i.e., video-clips played backward) were identified by the contrast Forward vs. Backward mode (main effect of rendering mode), irrespective of language (Italian or Arabic). The regions significantly sensitive to this contrast (orange regions in Fig 3) were found in the intraparietal sulcus (IPS) bilaterally, and in the left posterior-middle fusiform gyrus (FG b ) (p-corr < 0.05, whole brain). Fig 4d, 4e and 4f show the time-course of activity in these regions. Bins in which the activity differed significantly between normal and reversed video-clips are filled in orange (post-hoc t-test, p<0.05 FDR corrected for multiple comparisons across bins). In particular, the right IPS (1 th -2 th bins) and FG b (1 th -2 th bins) responded more to Forward than Backward rendering mode (compare continuous and dotted lines in Fig 4e and 4f, respectively). Left IPS showed a similar trend in the early bins (2 th -3 th bins) at a lower statistical threshold (puncorr < 0.05).
Finally, the image resulting from the intersections (logical AND) between the cluster image of the left FG a (reported above) and the cluster image of left FG b showed that these two clusters were sharply separated (no voxel in common).
ROI analysis. A significant effect of the rendering mode independently of language (pcorr < 0.05, FWE-SVC Bonferroni) was also found in the PMvi, in the pars opercularis of IFG (p-corr < 0.05, FWE-SVC Bonferroni) (see Table 1, Figs 3 and 5b, blue). PMvi responded more during late bins to the rendering modality (8 th -9 th bins), but selectively to Italian language, so that also the interaction calculated on the peak was significant (see also below).
By contrast none of the posterior ROIs (MT+/V5, FFA, pITS, TVSA) nor PMvs/PMd showed a main effect of rendering.   Table 1). Abscissa: time in TR's unit (TR = 2.47s), T = 0: trial onset. Continuous and dotted lines indicate forward and backward rendering mode, respectively. Black and grey lines indicate Arabic and Italian video-clips, respectively. Green, orange and blue filled rectangles indicate time bins showing a significant difference in BOLD activity (post-hoc t-test, p-value FDR-corrected for multiple comparison < 0.05) due to Actor's language, rendering mode or interaction, respectively. I F : Italian forward, A F : Arabic forward, I B : Italian Backward, A B : Arabic Backward.
https://doi.org/10.1371/journal.pone.0234695.g004 triang) and in lingual gyrus (LG, see blue regions in Fig 3). The temporal profile of BOLD responses (Fig 4g) showed that IFG-triang differentiated the Arabic video-clips played in Backward and Forward mode in the earlier bins, in particular Arabic backward stimuli strongly deactivated IFG-triang (3 th -4 th bins) (post-hoc t-test, p-FDR corrected for multiple comparison < 0.05 across bins). In a similar way, IFG-triang differentiated Forward from Backward Italian video-clips, but at later times compared to Arabic stimuli discrimination, namely between the 8 th and 10 th bins (Fig 4g). Moreover, Italian stimuli played backward also showed a marked negative pattern in these bins. Overall, IFG-triang responded similarly to Italian and Arabic stimuli, although the temporal patterns were shifted. Indeed, neither the difference between the maximum peaks for Italian Forward stimuli (bin 8) and Arabic Forward stimuli (bin 3) (t(19) = 0.93; p = 0.36), nor the difference between Forward and Backward condition of Italian stimuli at bin 8 and that of Arabic stimuli at bin 3 (t(19) 0.12; p > 0.9) were significantly different (compare the differences between continuous and dotted grey lines in bin 8 and between continuous and dotted black lines in bin 3, respectively). In sum, the BOLD patterns showed that IFG-triang does not have a clear preferential response to the speech gestures most frequently performed by participants (i.e., Italian Forward stimuli).
LG showed a general de-activation in all four conditions versus rest. In particular, Italian Backward and Arabic Forward stimuli involved a very similar time-course, as did Arabic video-clips played backwards and Italian video-clips played forwards. These two latter conditions were also the two most deactivating (i.e., negative B [93]OLD patterns) conditions in this site (see 3 th -4 th bins filled in blue in Fig 4h) (post-hoc t-test, p-FDR corrected for multiple comparison < 0.05 across bins).
ROI analysis. This analysis revealed a significant interaction between language and rendering mode in PMvi (IFG pars opercularis), a region that was selective for the Italian language in late bins (8 th -9 th bins) ( Table 1, Figs 3 and 5b, blue). In particular, there was a single peak of higher response to the familiar language with respect to the other three conditions, indicating a clear preferential response to the speech gestures most frequently performed by participants.
By contrast, none of the posterior ROIs (MT+/V5, FFA, pITS, TVSA) nor PMvs/PMd showed an interaction between language and rendering mode. Mean time course (± s.e.m) of estimated BOLD signal at the peak voxel (see Table 1

Discussion
We reported differential brain responses to visual speech kinematics, language familiarity and their interaction. Neuroimaging data showed that language familiarity and temporal rendering of silent speech video-clips modulated two distinct areas in the ventral occipito-temporal cortex. Furthermore, language familiarity modulated the left dorsal premotor cortex, while natural familiar language activated the left ventral premotor cortex in the frontal operculum. These results may indicate that phono-articulatory regions resonate in response to the visemes (visual equivalents of phonemes) of a familiar language. Since in our experiments participants generally did not decode the semantic and syntactic content of visual speech, we propose that these results are confined to the visual equivalent of the phonemic axis. Indeed, our results are in agreement with the definition of a phonological pathway more dorsal with respect to the lexical and semantic pathways, which includes IPS, the dorsal premotor region and the pars opercularis of IFG [96,97].

Sensitivity to the time-arrow of visual speech
Participants were able to discriminate above chance level visual speech gestures rendered forwards from those rendered backwards. Behavioural results were consistent across experiments. Noteworthy, the sensitivity index d' estimated in the preliminary and in the main fMRI experiments were not statistically different. These results indicate a sensitivity of the central nervous system for temporal features (i.e., time arrow) of the visible speech, in agreement with the results obtained with a familiar language in a previous study with a similar task [1].
Although lip-reading accuracy of hearing people is generally low and idiosyncratic [98], one cannot rule out a priori that observers were occasionally able to lip-read excerpts of Italian texts in the forward mode, and to use these instances as a cue for discriminating the rendering mode. However, the fact that sensitivity was not significantly different for Italian and Arabic stimuli suggests that lexical competence and speech intelligibility did not play a significant role in the task. The evidence suggests instead that better than chance performance was achieved mainly by a kinematic analysis of movements. If so, the performance reflected the ability to discriminate the motor sequences that are visually perceived as plausible from those that are perceived as implausible from the motoric point of view. This assumption can be sharpened by taking into account the response bias, which describes the position, along the decision axis, of the internal threshold for discriminating the stimuli [76]. In our experiment, there was a response bias in favour of the response "Forward" for the Italian but not for Arabic stimuli, indicating a corresponding shift of the threshold to higher values. This invites the inference that in order to classify a movie reversed in time as 'Backward', it is necessary to detect more motoric incongruences in Italian than in Arabic stimuli. This inference is in keeping with the suggestion [99,100] that a high threshold for detecting speech kinematic anomalies favours the stability of speech perception in environments where such anomalies in a familiar language occur due to inter-individual differences or phonetic peculiarities typical of particular social environments (regional inflexions, slang, etc.). Indeed, the participants to both the preliminary and fMRI experiments had no reason to suspect that in half of the video-clips the language being spoken was not Italian.
In a follow-up experiment, participants were asked to identify the language (Italian or not Italian) spoken by the actors in the silent movies. If participants benefitted from speech intelligibility, then the sensitivity to discriminate languages should be higher for Forward than that for Backward rendered stimuli [101]. The results of the follow-up experiment do not conform with this scenario, because the sensitivity index d' was comparable for Forward and Backward rendered video-clips (see Fig 1), making unlikely that speech intelligibility or lexical processes occurred in our tasks.
The lack of significant correlation across participants between the sensitivity in the fMRI and follow-up experiments suggests that different processes, likely taking into account nonoverlapping sets of cues, underlie language and rendering mode discrimination. Conversely, the finding that response bias during rendering mode discrimination depended on the language (and vice-versa) might indicate that, during a late stage of analysis, these signals are merged. This merging might take place in the premotor cortex, where the PMvs/PMd selected the familiar language independently of rendering, while the PMvi responded to the Italian language selectively in the natural kinematics condition.

Ventral occipito-temporal cortex (vOTC)
The main effects of language and rendering mode activated distinct regions in ventral occipito-temporal cortex. In particular, the comparison of Forward with Backward conditions showed a differential pattern of BOLD responses in the left posterior-middle fusiform gyrus (FG b ), while the posterior fusiform bilaterally (FG a ), the inferior occipital gyri and the occipito-temporal sulcus (OTS) were differentially involved with Italian versus Arabic videoclips. The fusiform sites are located posteriorly to the visual areas engaged by semantic and/ or lexical processes in the vOTC, which are typically reported at y-coordinates < -50 mm, in the anterior part of the fusiform gyrus [see 102]. Conversely, the fusiform face area (FFA), a region responding selectively to static faces, is located in the posterior region of the fusiform gyrus [35]. The stereotaxic coordinates of FFA centre of mass ([34 - 62 -15], [89]) roughly correspond to those of the peak of FG a reported here, but are posterior to those we found for the main effect of rendering mode (FG b ). Previous studies reported that multiple sites in FG respond to faces [32, 89,103]. It is likely that different foci in FG encompass distinct functional modules, as suggested by early PET studies [104,105] showing that gender and face identification activated distinct regions in posterior and middle fusiform gyrus, respectively.
It has been shown that the kinematics of biological movements [30, [106][107][108], as well as the temporal unfolding of faces that express an emotional state [31] engage ventral OTC. In particular, observing facial speech gestures activates FG, although it is unclear whether these activations are specific for speech because some control stimuli, such as gurning faces, activated this region more than talking faces [41]. Indeed, the difference between speech and control stimuli may have been due to differences in low-level features, such as visual motion speed [38,41]. In our study, all four experimental conditions (showing exclusively the lower portion of faces) were comparable in terms of low-level features, such as mean luminance and motion speed. By contrast, time reversal of visual speech stimuli violates motor constraints, and hence produces movements with an implausible kinematics, never occurring during real speech. Previous studies showed that coherent sequences of facial expressions engage the posterior fusiform gyrus more than a scrambled sequence [109]. A possibility is that ventral OTC processed specific kinematic cues embedded in visible speech. Therefore, we speculate that the posteriormiddle fusiform site (i.e., FG b ) was sensitive to the kinematic plausibility of speech gestures. Conversely, the more posterior site FG a was involved mainly in processing kinematic features related to the familiarity of speech, such as the rhythm of speech that is invariant under time reversal but differs across languages (see Introduction).
The previous observations challenge one of the most prominent models about face processing, namely the model proposing that static and dynamic face features are processed separately in ventral OTC and STS, respectively [110]. In fact, our data suggest that ventral OTC has foci sensitive to spatio-temporal (i.e. changeable) characteristics of speech lip movements.
Italian-speech stimuli evoked a higher activity peak than Arabic-speech stimuli in the more posterior site of fusiform gyrus (FG a ), whereas the sustained post-peak activity was greater for Arabic than Italian stimuli. The Arabic-speech stimuli were unfamiliar, and thus they were presumably unexpected. It is thought that a specific class of neurons (the so-called error neurons) responds selectively to unexpected or unusual stimuli [111][112][113]. These neurons compare the sensory input with an internally generated (prediction) signal coding what is expected in a given context [114]. In case of a mismatch between the predicted and the incoming sensory signal, error units enhance their activity. We surmise that the greater activity for Arabic than Italian stimuli in the post-peak period might be related to the activity of error units. Therefore, depending on the language (familiar or unfamiliar), this class of neurons contributed differently to the overall neural activity in FG a , so the pattern of neural activity changes according with language. However, this mechanism is not specific to ventral OTC, but is a widespread mechanism governing several brain processes (see Friston 2010). For instance, we recently surmised that this kind of neural processing occurs also in lateral OTC when observing unfamiliar walking movements [107,108], and it might even bias balance control [115].

Lingual gyrus
The interaction between language and rendering mode showed significant activations in the lingual gyrus. Previous reports have already suggested that visually presented speech gestures engage this site [36,116]. The novel finding reported here is that LG responds differently depending on the language. In the follow-up experiment (language discrimination), the d' was similar between Forward and Backward stimuli, while there was a significant bias toward Arabic-speech or Italian-speech response in the Backward or Forward condition, respectively. The temporal profile of LG activity distinguishes the experimental conditions. In particular, the conditions Arabic Backward and Italian Forward were the two most deactivating conditions. Therefore, the response bias in language discrimination task could be related to a decrease of LG activity. Interestingly, in the auditory domain, LG activity has been found to depend on the familiarity of the spoken language [117]. In the latter study, hearing a speech segment in a second language modulated LG activity differently depending on the participant's proficiency in that language. Because LG is involved, together with fronto-parietal regions, in speech control [118], these findings suggest a supra-modal effect of speech language in LG, probably due to feedback from high-order centres.

Premotor prefrontal activity reflects motor simulation of speech gestures
The interaction of rendering mode and language showed a significant effect in the left IFG, comprising the pars opercularis (BA 44), which we have labelled as PMvi, and the pars triangularis (BA45). Most authors agree that the Broca's area includes both BA 44 and BA 45 of the left hemisphere [43,[119][120][121]. Broca's area was initially thought to be involved only in speech production, but current research shows that it has a more complex role possibly involving also speech comprehension [43,64,122,123].
In particular, the activation of IFG during speech perception has been interpreted by some authors as the occurrence of a motor simulation of the observed movements. This idea is in accordance with the motor theory of language holding that a simulation of speech gestures in the motor regions is instrumental for speech perception and understanding [53,[124][125][126].
However, it is still an open issue whether the IFG activation during speech perception is related to a language-specific process, as the putative motor simulation of speech gestures, or represents a general-domain cognitive mechanism [127]. It has also been suggested that distinct IFG foci have different roles during speech perception. Specifically, articulatory rehearsal of speech gestures would occur in pars opercularis, while pars triangularis and orbitalis could be related to cognitive-control mechanisms, such as decisional or working-memory processes [68,128,129]. The rehearsal function of pars opercularis generalizes across different types of movement, as this region was found to respond also to observation of hand movements [130].
The time-course of the BOLD signal that we observed in the pars opercularis of IFG fit with the predictions of the motor simulation hypothesis (Figs 3 and 5b). According to this hypothesis, the familiar stimuli (i.e., Italian Forward) should elicit a higher level of activity than unfamiliar, implausible gestures difficult to reproduce (e.g., Italian Backward or Arabic Forward and Backward stimuli). Conversely, in the pars triangularis of IFG, Italian-speech stimuli and Arabic-speech stimuli, for which latter participants had no motoric expertise, evoked comparable responses although shifted in time (Figs 3 and 4g). Thus, the results do not suggest a specific sensitivity for Italian-speech stimuli in the pars triangularis of IFG, but rather sensitivity for the kinematics of natural mouth movements, a kind of biological motion. Our data, limited to the case of a familiar language (Italian), are also in agreement with those reported by Paulesu et al. [27] in which IFG activity was greater for forward than backward silent movies of a speech in a familiar language.
The issue of the role of intelligibility of silent visual speech should be further investigated, as one could argue that motor simulations occur only or mainly when linguistic competences are required, as with a lexical discrimination task [27, 86,131]. However, we believe that the ability of participants to speech-read might be a confound when trying to disentangle motor from higher cognitive functions of Broca's area. In our case, it appears that motor simulation occurs in absence of comprehension of the content of the speech.

Summary and conclusions
Previous studies focused mainly on the role of temporal auditory regions [37,48,132] and frontal regions [86,131,133] in processing visual speech. More recently, it has been shown that a region in the left posterior temporal cortex, the so-called temporal visual speech area (TVSA), is activated in visual phonetic discrimination [38], possibly integrating information coming from high-level visual areas in OTC [3,41,134]. We did not find significant effects in TSVA, as verified through a specific ROI drawn in this region. Our data suggest that the ventral occipito-temporal cortex has a sensitivity to visual speech gestures, contrary to the view that the peculiar analysis of visual speech starts at higher cortical levels [27,135]. Our results support the hypothesis that kinematic cues embedded in visible speech can be extracted through the visual pathways [136], outside the classical areas related to auditory speech and audio-visual integration [36,37,116,137,138]. Finally, the selective responses of PMvs / PMd to the familiar language and of PMvi to the natural familiar language support the hypothesis that motor simulation drives premotor activity during visible speech perception.