Fig 1.
Simplified schematic view of the DIVA model, showing the combination of feedforward and feedback control loops.
Fig 2.
DiffWave supervised training process.
Top: Process for training in the original DiffWave model. Bottom: Modified DiffWave training, using a deep CNN upsampler to match the conditioner in DiffWave’s reference upsampler.
Fig 3.
Normalized RMSE of motor command during training.
Normalized root-mean-square error (RMSE) in motor command output of TorchDIVA vs. DIVA over 20 repetitions during the training process with the speech target ‘u’.
Fig 4.
Normalized RMSE of motor command after training.
Normalized root-mean-square error (RMSE) in motor command output of TorchDIVA vs. DIVA over 20 repetitions with a trained speech target ‘u’.
Table 1.
Normalized root-mean-square error (RMSE) of TorchDIVA motor signal.
Fig 5.
DIVA and TorchDIVA spectrogram comparison.
Comparison of output audio for the speech production ‘happy’. The first subplot shows DIVA, the second shows TorchDIVA, and the bottom shows the difference between the two output signals.
Fig 6.
DIVA and TorchDIVA auditory excitation pattern comparison.
Comparison of auditory excitation patterns (AEPs) for the speech production ‘happy’ in DIVA and TorchDIVA. The first subplot shows the AEP; the second subplot shows the difference between the two AEPs.
Fig 7.
TorchDIVA and DiffWave speech quality metrics.
Speech quality metric comparison between TorchDIVA and DiffWave-enhanced TorchDIVA samples. The original human speech sample is the reference signal for all metric calculations. H_DIVA: human reference vs. TorchDIVA output. H_DW: human reference vs. DiffWave-enhanced output. a) Perceptual evaluation of speech quality (PESQ). b) Predicted rating of speech distortion (CSIG). c) Predicted rating of background distortion (CBAK). d) Predicted rating of overall quality (COVL). e) Segmental signal-to-noise ratio (segSNR).
Table 2.
Paired-sample t-test and Cohen’s d for TorchDIVA and DiffWave speech quality metrics.
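As a rough illustration of the statistics named in Table 2, the sketch below computes a paired-sample t statistic and a paired-sample effect size using only the Python standard library. It assumes the effect size is the mean of the paired differences divided by their standard deviation (often written d_z); the caption does not say which variant of Cohen's d was used, so this is an assumption. The function name `paired_t_and_cohens_d` and the example scores are hypothetical and are not taken from the paper.

```python
import math
import statistics


def paired_t_and_cohens_d(scores_a, scores_b):
    """Return (t statistic, degrees of freedom, d_z) for paired samples.

    Assumption: the effect size is d_z = mean(diff) / sd(diff), one common
    convention for paired designs. Other conventions exist.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")

    # Per-pair differences (second condition minus first condition).
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")

    # Paired t statistic: mean difference over its standard error.
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    # Paired effect size under the d_z assumption stated above.
    d_z = mean_diff / sd_diff
    return t_stat, n - 1, d_z


# Hypothetical metric scores for the same samples under two conditions.
h_diva = [1.0, 2.0, 3.0, 4.0]
h_dw = [2.0, 4.0, 3.0, 7.0]
t_stat, dof, d_z = paired_t_and_cohens_d(h_diva, h_dw)
print(f"t = {t_stat:.3f}, df = {dof}, d_z = {d_z:.3f}")
```

The p-value would then come from a t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` performs the same paired test and reports it directly.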