Modulation transfer functions for audiovisual speech

doi:10.1371/journal.pcbi.1010273

Fig 1.

Analysis procedure.

Regularized CCA (rCCA) combines speech envelope filter outputs and 3D landmarks of the speaker’s face. Resulting pairs of canonical components (CCs) are linear combinations of envelope filter outputs for audio (CC_A) and facial landmarks for video (CC_V). Image source: commons.wikimedia.org/wiki/File:Dr_H._L._Saxi_18_April_2013.jpg.
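
As a rough illustration of the procedure in this caption, the following is a minimal NumPy sketch of regularized CCA applied to two synthetic feature matrices. It is not the authors' implementation: the function name `rcca`, the ridge parameters `reg_x` and `reg_y`, the feature dimensions, and the synthetic "audio" and "video" data are all assumptions chosen for illustration. The sketch whitens each regularized covariance matrix and takes the SVD of the whitened cross-covariance to obtain paired weights (analogous to those producing CC_A and CC_V).

```python
import numpy as np

def rcca(X, Y, reg_x=1e-3, reg_y=1e-3, n_components=2):
    """Regularized CCA sketch (hypothetical helper, not the paper's code).

    X: (n_samples, p) e.g. envelope filter outputs
    Y: (n_samples, q) e.g. flattened facial landmark coordinates
    Returns weight matrices Wx (p, k), Wy (q, k) and the k leading
    singular values of the whitened cross-covariance.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Ridge-regularized auto-covariances and the cross-covariance.
    Cxx = X.T @ X / (n - 1) + reg_x * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg_y * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root via symmetric eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)

    Wx = Kx @ U[:, :n_components]
    Wy = Ky @ Vt.T[:, :n_components]
    return Wx, Wy, s[:n_components]

# Synthetic stand-ins (assumed): one shared latent signal drives both views.
rng = np.random.default_rng(0)
n = 2000
latent = rng.standard_normal(n)
audio = np.outer(latent, np.linspace(0.5, 1.5, 6)) + 0.5 * rng.standard_normal((n, 6))
video = np.outer(latent, np.linspace(-1.0, 1.0, 9)) + 0.5 * rng.standard_normal((n, 9))

Wx, Wy, s = rcca(audio, video)

# Project each view onto its first pair of canonical weights.
cc_a = (audio - audio.mean(axis=0)) @ Wx[:, 0]
cc_v = (video - video.mean(axis=0)) @ Wy[:, 0]
r = np.corrcoef(cc_a, cc_v)[0, 1]
print(f"correlation of first CC pair: {r:.3f}")
```

Larger values of `reg_x` and `reg_y` shrink the weights and trade some correlation for stability, which matters when one view (here, the landmark coordinates) has many correlated dimensions.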

Fig 2.

CCA results for the LRS3 dataset.

Left: CCA-derived temporal modulation filters for the first 5 significant canonical components (CCs). Right: corresponding facial landmark loadings. Darker red indicates higher weights. The 3D landmarks are shown in 2D projection, and the colorbar indicates the relative contribution of the x (blue), y (orange), and z (green) directions.

Fig 3.

CC1 and CC3 for an example speaker.

The CC time series for the speech envelope are shown in blue, and the CCs for the facial landmarks are shown in orange. Vertical lines indicate word onsets. CC1 represents speech envelope fluctuations corresponding to the onset of individual syllables, while CC3 tracks slower variations corresponding to words or phrases.

Fig 4.

CCA results for the GRID dataset.

CCA-derived envelope filters (left) and corresponding face loadings (right) for the GRID dataset. Unlike in-the-wild recordings of natural speech such as LRS3, the GRID corpus is composed of simple, syntactically identical six-word sentences.
