Fig 1.

Articulatory-based speech synthesizer.

Using a DNN, articulatory features of the reference speaker are mapped to acoustic features, which are then converted into an audible signal using the MLSA filter and an excitation signal.
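Below is a minimal sketch of such a frame-wise articulatory-to-acoustic DNN mapping. The framework (PyTorch), the layer sizes, the activation functions and the input dimensionality are illustrative assumptions, not the authors' exact architecture; only the 25-dimensional mel-cepstrum output follows Fig 2C.

```python
# Hedged sketch of a frame-wise articulatory-to-acoustic DNN mapping.
# Layer sizes, activations and input dimensionality are assumptions.
import torch
import torch.nn as nn

N_ARTIC = 18   # hypothetical input dimensionality (sensor X/Y coordinates)
N_MEL = 25     # 25 mel-cepstrum coefficients, as in Fig 2C

class ArticToAcousticDNN(nn.Module):
    def __init__(self, n_in=N_ARTIC, n_hidden=256, n_out=N_MEL):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_out),   # frame-wise mel-cepstrum prediction
        )

    def forward(self, artic_frames):
        # artic_frames: (n_frames, n_in) articulatory features
        # returns:      (n_frames, n_out) acoustic (mel-cepstrum) features
        return self.net(artic_frames)

# One synthetic articulatory frame mapped to 25 mel-cepstrum coefficients.
model = ArticToAcousticDNN()
mel = model(torch.randn(1, N_ARTIC))
print(mel.shape)  # torch.Size([1, 25])
```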

Fig 2.

Articulatory and acoustic data.

A–Positioning of the sensors on the lip corners (1 & 3), upper lip (2), lower lip (4), tongue tip (5), tongue dorsum (6), tongue back (7) and velum (8). The jaw sensor was glued at the base of the incisors (not visible in this image). B–Articulatory signals and corresponding audio signal for the sentence “Annie s’ennuie loin de mes parents” (“Annie gets bored away from my parents”). For each sensor, the horizontal caudo-rostral X coordinate and, below it, the vertical ventro-dorsal Y coordinate, projected onto the midsagittal plane, are plotted. Dashed lines show the phone segmentation obtained by forced alignment. C–Acoustic features (25 mel-cepstrum coefficients, MEL) and corresponding segmented audio signal for the same sentence as in B.
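For readers who want to reproduce acoustic features comparable to the 25 mel-cepstrum coefficients shown in panel C, the sketch below extracts frame-wise mel-cepstra with pysptk. The frame length, hop size and warping coefficient are assumed values, not parameters taken from the paper.

```python
# Hedged sketch: frame-wise mel-cepstrum extraction with pysptk.
# Frame length, hop size and the warping coefficient alpha are illustrative
# assumptions; only the 25-coefficient order mirrors the caption.
import numpy as np
import pysptk
from scipy.io import wavfile

FRAME_LEN = 512      # samples per analysis frame (assumed)
HOP = 80             # frame shift in samples (assumed, 5 ms at 16 kHz)
ORDER = 24           # mcep order 24 -> 25 coefficients per frame
ALPHA = 0.42         # mel-warping factor typical for 16 kHz (assumed)

sr, x = wavfile.read("sentence.wav")          # hypothetical recording
x = x.astype(np.float64)
x += 1e-6 * np.random.randn(len(x))           # tiny dither so silent frames
                                              # do not break the cepstral fit
frames = [
    x[i:i + FRAME_LEN] * np.hamming(FRAME_LEN)
    for i in range(0, len(x) - FRAME_LEN, HOP)
]
mcep = np.stack([pysptk.mcep(f, order=ORDER, alpha=ALPHA) for f in frames])
print(mcep.shape)    # (n_frames, 25) mel-cepstrum trajectory, as in Fig 2C
```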

Fig 3.

Articulatory-acoustic database description.

A–Occurrence histogram of all phones of the articulatory-acoustic database. Each bar shows the number of occurrences of a specific phone in the whole corpus. B–Spatial distribution of all articulatory data points of the database (silences excluded) in the midsagittal plane. The positions of the 7 different sensors are plotted with different colors. The labeled positions correspond to the mean positions of the 7 main French vowels.
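An occurrence histogram like the one in panel A can be computed directly from the forced-alignment labels. The sketch below assumes the segmentation is available as a list of (phone, start, end) tuples, which is a hypothetical format, not the paper's actual file layout.

```python
# Hedged sketch: counting phone occurrences from a forced-alignment label list.
# The (phone, start_s, end_s) tuple format is an assumed representation.
from collections import Counter

segmentation = [
    ("a", 0.00, 0.08), ("n", 0.08, 0.15), ("i", 0.15, 0.24),
    ("s", 0.24, 0.33), ("a", 0.33, 0.41),   # toy labels for illustration
]

counts = Counter(phone for phone, _, _ in segmentation if phone != "sil")
for phone, n in counts.most_common():
    print(f"{phone}: {n}")     # bar heights of the occurrence histogram
```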

Fig 4.

Real-time closed-loop synthesis.

A) Real-time closed-loop experiment. Articulatory data from a silent speaker are recorded and converted into articulatory input parameters for the articulatory-based speech synthesizer. The speaker receives the auditory feedback of the produced speech through earphones. B) Processing chain for real-time closed-loop articulatory synthesis, where the articulatory-to-articulatory (left part) and articulatory-to-acoustic (right part) mappings are cascaded. Items that depend on the reference speaker are in orange, while those that depend on the new speaker are in blue. The articulatory features of the new speaker are linearly mapped to articulatory features of the reference speaker, which are then mapped to acoustic features using a DNN; these are finally converted into an audible signal using the MLSA filter and the template-based excitation signal. C) Experimental protocol. First, sensors are glued on the speaker’s articulators, then articulatory data for calibration are recorded in order to compute the articulatory-to-articulatory mapping, and finally the speaker articulates a set of test items during the closed-loop real-time control of the synthesizer.
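The cascade in panel B can be made explicit as a per-frame processing loop. In the sketch below, read_ema-style acquisition, the DNN of Fig 1 and the MLSA vocoder are stood in by hypothetical placeholders; the frame dimensions and the linear calibration parameters W and b are assumptions used only to show the data flow.

```python
# Hedged sketch of the closed-loop processing chain of Fig 4B.
# dnn_artic_to_mel() and mlsa_synthesize_frame() are hypothetical placeholders
# for the DNN mapping and the MLSA vocoder; W and b are the linear
# articulatory-to-articulatory calibration parameters.
import numpy as np

def closed_loop_step(new_frame, W, b, dnn_artic_to_mel, mlsa_synthesize_frame):
    """Convert one articulatory frame of the new speaker into audio samples."""
    ref_frame = W @ new_frame + b          # articulatory-to-articulatory (linear)
    mel = dnn_artic_to_mel(ref_frame)      # articulatory-to-acoustic (DNN)
    return mlsa_synthesize_frame(mel)      # MLSA filter + template excitation

# Toy stand-ins so the sketch runs end to end.
W, b = np.eye(18), np.zeros(18)            # identity calibration (assumed dims)
dnn = lambda x: np.zeros(25)               # dummy DNN output (25 MEL coeffs)
vocoder = lambda mel: np.zeros(80)         # 80 audio samples per frame (assumed)
audio = closed_loop_step(np.random.randn(18), W, b, dnn, vocoder)
print(audio.shape)                         # (80,) audio chunk for playback
```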

Fig 5.

Offline reference synthesis example.

Comparison of the spectrograms of the original audio and the corresponding audio signals produced by the 5 different offline articulatory synthesis conditions, for the sentence “Le fermier est parti pour la foire” (“The farmer went to the fair”). Dashed lines show the phonetic segmentation obtained by forced alignment.

Fig 6.

Subjective evaluation of the intelligibility of the speech synthesizer (offline reference synthesis).

A–Recognition accuracy for vowels and consonants for each of the 5 synthesis conditions. The dashed lines show the chance level for vowels (blue) and VCVs (orange). B–Recognition accuracy of the VCVs by vocalic context, for the 5 synthesis conditions. The dashed line shows the chance level. C–Recognition accuracy for the consonants of the VCVs, for the 5 synthesis conditions. The dashed line shows the chance level. See text for statistical comparison results.

Fig 7.

Confusion matrices of the subjective evaluation of the intelligibility of the speech synthesizer (offline reference synthesis).

Confusion matrices for vowels (left) and consonants (right), for each of the three conditions FixedPitch_27, Pitch_27 and Pitch_14. In the matrices, rows correspond to ground truth while columns correspond to user answers. The last column indicates the number of errors made on each phone. Cells are colored by their values, while text color is for readability only.
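Confusion matrices of this kind can be built directly from (ground truth, answer) pairs logged during the listening test. The sketch below assumes a simple pair-list representation of those logs; the phone set and answers shown are toy values.

```python
# Hedged sketch: building a phone confusion matrix from listening-test answers.
# The (truth, answer) pair list is a toy stand-in for the real test logs.
import numpy as np

phones = ["a", "e", "i", "o", "u"]
answers = [("a", "a"), ("a", "o"), ("e", "e"), ("i", "i"), ("o", "a")]  # toy data

idx = {p: k for k, p in enumerate(phones)}
conf = np.zeros((len(phones), len(phones)), dtype=int)
for truth, answer in answers:
    conf[idx[truth], idx[answer]] += 1      # rows: ground truth, cols: answer

errors_per_phone = conf.sum(axis=1) - np.diag(conf)   # the "errors" column
print(conf)
print(errors_per_phone)
```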

Fig 8.

Subjective evaluation of the intelligibility of the speech synthesizer on sentences (offline reference synthesis).

Word recognition accuracy for the sentences, for both the Pitch_27 and Pitch_14 conditions.

Fig 9.

Articulatory-to-articulatory mapping.

A–Articulatory data recorded on a new speaker (Speaker 2) and corresponding reference audio signal for the sentence “Deux jolis boubous” (“Two beautiful booboos”). For each sensor, the X (rostro-caudal), Y (ventro-dorsal) and Z (left-right) coordinates are plotted. Dashed lines show the phonetic segmentation of the reference audio, which the new speaker was asked to silently repeat in synchrony. B–Reference articulatory data (dashed line), and articulatory data of Speaker 2 after articulatory-to-articulatory linear mapping (predicted, plain line) for the same sentence as in A. Note that X, Y, Z data were mapped onto X, Y positions in the midsagittal plane. C–Mean Euclidean distance between reference and predicted sensor positions in the reference midsagittal plane for each speaker and each sensor, averaged over the duration of all speech sounds of the calibration corpus. Error bars show the standard deviations, and “All” refers to the mean distance error when pooling all sensors together.
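The linear articulatory-to-articulatory mapping of panel B and the distance metric of panel C can be written compactly as a least-squares regression from the new speaker's X, Y, Z sensor coordinates to the reference speaker's midsagittal X, Y coordinates. In the sketch below, the array dimensions and the random data are illustrative assumptions, not the paper's recordings.

```python
# Hedged sketch: fitting the articulatory-to-articulatory linear mapping by
# least squares and computing the mean Euclidean error of Fig 9C.
# Dimensions (9 sensors x 3D input, 9 sensors x 2D reference) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_frames = 1000
new = rng.standard_normal((n_frames, 27))   # new speaker: X, Y, Z per sensor
ref = rng.standard_normal((n_frames, 18))   # reference: midsagittal X, Y per sensor

# Affine least-squares fit: ref ~ [new, 1] @ M
A = np.hstack([new, np.ones((n_frames, 1))])
M, *_ = np.linalg.lstsq(A, ref, rcond=None)
pred = A @ M

# Mean Euclidean distance per sensor (pairs of X, Y columns), as in Fig 9C.
err = pred - ref
dist = np.sqrt(err[:, 0::2] ** 2 + err[:, 1::2] ** 2)   # (n_frames, n_sensors)
print(dist.mean(axis=0))    # mean distance error for each sensor
print(dist.mean())          # pooled "All" error
```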

Fig 10.

Real-time closed-loop synthesis examples.

Examples of audio spectrograms for anasynth, reference offline synthesis and real-time closed-loop synthesis (Speaker 2), A) for the vowels /a/, /e/, /i/, /o/, /u/, /œ/ and /y/, and B) for the consonants /b/, /d/, /g/, /l/, /v/, /z/ and /ʒ/ in /a/ context. The thick black line under the spectrograms corresponds to 100 ms.

Fig 11.

Results of the subjective listening test for real-time articulatory synthesis.

A–Recognition accuracy for vowels and consonants, for each subject. The grey dashed line shows the chance level, while the blue and orange dashed lines show the corresponding recognition accuracy for the offline articulatory synthesis, for vowels and consonants respectively (on the same subsets of phones). B–Recognition accuracy for the VCVs by vowel context, for each subject. C–Recognition accuracy for the VCVs, by consonant and for each subject. D–Confusion matrices for vowels (left) and consonants from VCVs in /a/ context (right). Rows correspond to ground truth while columns correspond to user answers. The last column indicates the number of errors made on each phone. Cells are colored by their values, while text color is for readability only.
