Peer Review History
Original Submission: January 25, 2022
Dear Dr. Hjortkjær,

Thank you very much for submitting your manuscript "Modulation transfer functions for audiovisual speech" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, provided that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Frédéric E. Theunissen, Associate Editor, PLOS Computational Biology
Samuel Gershman, Deputy Editor, PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK]

Dear authors, please respond carefully to the comments of reviewer 1. Best, Frederic Theunissen

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: This impressive manuscript presents results from the use of regularized canonical correlation analysis on speech acoustics and video to learn temporal envelope filters that correlate with motion on different parts of the face. In both machine and human speech recognition, integrating audio and visual information from talkers enhances performance, and this research provides an innovative and rigorous methodology to address part of this problem: what is the linear correspondence between the two signals? The strengths of the work include that it analyzes a large AV data set as well as a more controlled smaller data set, a broad literature review considering the many issues (though a few key references noted below should be considered), and rigorous methods that jointly reduce the dimensionality of the auditory and video signals to compute the correlated regions and timescales.
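[Editorial illustration] A minimal sketch of the kind of regularized CCA the reviewer describes, written against generic NumPy; the function name `rcca`, the ridge penalties, and the whitening-plus-SVD formulation are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def rcca(X, Y, reg_x=1e-2, reg_y=1e-2, n_components=5):
    """Ridge-regularized CCA between audio features X (frames x bands)
    and facial-motion features Y (frames x landmarks). Illustrative only."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg_x * np.eye(X.shape[1])   # regularized auto-covariances
    Cyy = Y.T @ Y / n + reg_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n                                 # cross-covariance

    def inv_sqrt(C):                                  # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # SVD of the whitened cross-covariance yields canonical pairs and correlations
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]                     # audio canonical weights
    Wy = Ky @ Vt[:n_components].T                     # video canonical weights
    return Wx, Wy, s[:n_components]                   # s ~ canonical correlations
```

Projecting band envelopes through the audio weights would yield the temporal filters whose outputs maximally correlate with the corresponding facial-motion components, which is the linear correspondence the reviewer refers to.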
References to consider:

Jiang, J., Alwan, A., Keating, P. A., Auer, E. T., and Bernstein, L. E. (2002). "On the relationship between face movements, tongue movements, and speech acoustics," EURASIP Journal on Advances in Signal Processing.

Vilela Barbosa, A., Déchaine, R.-M., Vatikiotis-Bateson, E., et al. (2012). "Quantifying time-varying coordination of multimodal speech signals using correlation map analysis," The Journal of the Acoustical Society of America.

Grant, K. W., and Seitz, P. F. (2000). "The use of visible speech cues for improving auditory detection of spoken sentences," J. Acoust. Soc. Am. 108(3), 1197-1208.

Yehia, H., Rubin, P., and Vatikiotis-Bateson, E. (1998). "Quantitative association of vocal-tract and facial behavior," Speech Commun. 26, 23-43.

The weaknesses of the manuscript include:

A) In spite of the large data set and improved methods, the results are not really surprising. As the omitted references suggest, there are visual correlations between the acoustic envelope and parts of the moving head and face. The partition shown in this paper is nice but is partially anticipated by other studies that show additional correlation features. However, demonstrating something that is not surprising is still worth doing.

B) The authors discuss a number of limitations, but one that is inherent to all approaches like this is that components are dominated by larger-variance sources. However, information that is perceptually salient can be quite small and statistically infrequent, and that information can make all the difference. For example, good lipreaders take advantage of high-spatial-frequency visual information in some contexts, even though low-spatial-frequency aspects of the signal are generally fine.

C) It seems that the correlations are computed on segments, thus with stationary AV temporal alignment. As the authors suggest, temporal synchrony is a complex problem, but perhaps more complex than they suggest. There are differences in the physical transmission of speech sight and sound, as well as differences in neural transmission and processing in the two modalities. But the two speech signals also differ greatly in character. Acoustic speech can be segmented with more or less accuracy because it is punctuated by syllabic nuclei and silences. Visual speech is far more continuous, and segmenting it is more similar to the problem of visual event perception (e.g., Zacks, 2020). In the past, this problem has been hacked by omitting the signals associated with silences (e.g., Yehia). This just avoids the problem.

D) The temporal signal compression to match video and acoustic signals predetermines that only low-frequency envelope filters will be observed. Granted, there can't be much spectral power in the movements that is perceptually important (see de Paula, H., Yehia, H. C., Shiller, D., Jozan, G., Munhall, K. G., and Vatikiotis-Bateson, E. (2003). "Linking production and perception through spatial and temporal filtering of visible speech information," Proceedings of the 6th International Seminar on Speech Production, Sydney, Australia, 7-10 December 2003, pp. 37-42). But there is information in the acoustics above this low-frequency filter level that can be predictive of movement and vice versa (e.g., in a Bayesian manner). In other words, correlation of temporally matched signals is not the only kind of binding.

Some additional comments:

Line 250: This is a possible null distribution, but of all the control distributions that could be computed this is certainly the weakest. Given the differences in speaking style and rate between talkers (as the authors show), this will produce a very uncorrelated distribution. But if you wanted to control for differences in speaking rate and emphatic variance in speech, you could use utterances from the same talker, or the same utterance played in reverse. I am not suggesting that this be done, but be aware that this is a very null null. Perhaps the null distribution and the data correlations could be shown graphically in an appendix; this would help me understand the sentence on lines 254-255. By the way, how many matching components were discarded as a result of that criterion?
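[Editorial illustration] As a hedged sketch of the kind of surrogate (null) distribution under discussion, the code below pairs the audio features of one talker with the video features of a different, randomly chosen talker before re-computing the component correlations. The helper names (`correlate_components`, `audio_feats`, `video_feats`) are hypothetical placeholders, not the authors' code.

```python
import numpy as np

def mismatched_null(audio_feats, video_feats, correlate_components,
                    n_perm=1000, seed=0):
    """Null distribution of component correlations obtained by pairing the audio
    of each talker with the video of a different talker. audio_feats and
    video_feats are lists indexed by talker. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n_talkers = len(audio_feats)
    null_corrs = []
    for _ in range(n_perm):
        perm = rng.permutation(n_talkers)
        # re-draw until no talker is paired with itself (simple derangement check)
        while np.any(perm == np.arange(n_talkers)):
            perm = rng.permutation(n_talkers)
        corrs = [correlate_components(audio_feats[i], video_feats[j])
                 for i, j in enumerate(perm)]
        null_corrs.append(np.mean(corrs, axis=0))
    return np.array(null_corrs)   # shape: (n_perm, n_components)
```

Components whose observed correlation does not exceed, say, the 95th percentile of this null could then be discarded; this is one way such a selection criterion might be implemented, not necessarily the criterion the manuscript used.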
Line 277 (correlations above 1%): What does that mean? And what percentage of the variance does the correlation account for overall?

S3 Fig: Thus, the shape and peak of the group filters are partly determined by smearing from averaging. Note that the individual filters shown here are also averages over each individual's speech sample, and people change rate dynamically even within a phrase.

Lines 359-361: I am happy to see this demonstration that GRID-type tasks are different from more natural speech. I presume that the lack of envelope rates in the 1-2 Hz range for the GRID tasks has something to do with them having no variation in prosody. The TED talks are more natural but are heavily scripted, with edited retakes.

Lines 392-394: I am surprised that the envelope modulations in the two ranges were mutually uncorrelated. They should be related, as stress and emphasis signalled by head motions should correlate with mouth aperture size. The different event rates and information spans must wash that out.

Lines 376-379: This supports my view that analyses such as this are biased towards uncovering large movement variance, as the comment on heart rate and breathing indicates.

In summary, this is a well carried out and thoughtful piece of work. It has sophisticated methodology, and a number of good controls have been carried out to test alternative explanations. I feel that this manuscript will be of interest to researchers in speech recognition and those who study the behavioral and neural concomitants of multisensory speech.

Reviewer #2: Review of "Modulation transfer functions for audiovisual speech"

In this excellent study, Pedersen et al. revisit the question of the link between auditory and visual cues in speech. This is a well-studied and important field that has seen an extensive amount of work. The known result is that visual and auditory cues are roughly correlated in the 3-8 Hz frequency bands, and that head movement is often in the 1-2 Hz region and is important for speech perception. What is missing, however, is a precise quantification of the rates at which envelope information is synchronized with motion in different parts of the face. To address this question, the authors combined a novel approach that first used deep learning to identify face landmarks and a gammatone filter bank to filter the auditory signals. They then used regularized canonical correlation analysis (rCCA) to learn speech envelope filters whose outputs correlate with motion in different parts of the speaker's face.
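[Editorial illustration] A minimal sketch of the audio side of the pipeline the reviewer describes, assuming SciPy's gammatone filter design and a Hilbert envelope; the band center frequencies, the IIR filter choice, and the resampling to the video frame rate are illustrative assumptions rather than the authors' exact processing.

```python
import numpy as np
from scipy.signal import gammatone, filtfilt, hilbert, resample

def band_envelopes(audio, fs, center_freqs, video_rate):
    """Filter the waveform with a gammatone filterbank, take the Hilbert
    envelope in each band, and downsample to the video frame rate so that
    audio and facial-motion features share a common time base."""
    n_frames = int(round(len(audio) / fs * video_rate))
    envs = []
    for cf in center_freqs:
        b, a = gammatone(cf, 'iir', fs=fs)      # 4th-order IIR gammatone band
        band = filtfilt(b, a, audio)            # zero-phase band-pass filtering
        env = np.abs(hilbert(band))             # Hilbert magnitude envelope
        envs.append(resample(env, n_frames))    # align to video frames
    return np.stack(envs, axis=1)               # (frames, bands)

# Hypothetical usage: band_envelopes(wav, 16000, np.geomspace(100, 8000, 24), 25)
```

The resulting (frames x bands) matrix would play the role of the audio block in an rCCA against landmark-motion features extracted from the video frames.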
In general, I found the paper to be well written, with some interesting conclusions, albeit consistent with prior work. I had a few comments which will hopefully improve the manuscript.

Statistical significance of components: Could the authors add what correlation is explained by each of the CCs? The authors say they report CCs where the correlation was above 1%. That seems quite low. Second, I feel the authors are not doing justice to their incredible 4000-speaker dataset by only showing the average correlation. I feel S1 Fig should perhaps be a side panel to Figure 1, and Figure 1 should have error bars. Reporting the correlation coefficient next to the canonical components would help the reader. The normalization also makes it hard to assess whether some components are larger or smaller. I would say the authors should also include statistics and error bars for the GRID corpus dataset.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exceptions (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct.
If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.
Revision 1
Dear Dr. Hjortkjær,

We are pleased to inform you that your manuscript 'Modulation transfer functions for audiovisual speech' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be coordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Frédéric E. Theunissen, Associate Editor, PLOS Computational Biology
Samuel Gershman, Deputy Editor, PLOS Computational Biology

***********************************************************

Thank you for your thorough response to the reviewers. Nice contribution. F
Formally Accepted
PCOMPBIOL-D-22-00122R1
Modulation transfer functions for audiovisual speech

Dear Dr Hjortkjær,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Marianna Bach
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
Open letter on the publication of peer review reports
PLOS recognizes the benefits of transparency in the peer review process. Therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. Reviewers remain anonymous, unless they choose to reveal their names.
We encourage other journals to join us in this initiative. We hope that our action inspires the community, including researchers, research funders, and research institutions, to recognize the benefits of published peer review reports for all parts of the research system.
Learn more at ASAPbio.