TorchDIVA: An extensible computational model of speech production built on an open-source machine learning library

doi:10.1371/journal.pone.0281306

Fig 1.

DIVA model architecture.

Simplified schematic view of the DIVA model, showing the combination of feedforward and feedback control loops.

Fig 2.

DiffWave supervised training process.

Top: Process for training in the original DiffWave model. Bottom: Modified DiffWave training, using a deep CNN upsampler to match the conditioner in DiffWave’s reference upsampler.

Fig 3.

Normalized RMSE of motor command during training.

Normalized root mean-square error (RMSE) in motor command output of TorchDIVA vs DIVA over 20 repetitions during the training process with the speech target ‘u’.
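
The normalized RMSE between two motor command trajectories can be illustrated with a minimal sketch. The caption does not say which normalization is used, so dividing by the range of the reference (DIVA) signal is an assumption here, and the function name and sample values are hypothetical.

```python
import math


def normalized_rmse(reference, estimate):
    """RMSE between two equal-length sequences, divided by the reference range.

    Assumption: range normalization. Other conventions (dividing by the mean
    or the standard deviation of the reference) are equally common.
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    span = max(reference) - min(reference)
    if span == 0:
        raise ValueError("reference signal is constant; range normalization is undefined")
    return math.sqrt(mse) / span


# Hypothetical motor command trajectories for a single articulator dimension.
diva_command = [0.0, 0.5, 1.0, 0.5, 0.0]
torchdiva_command = [0.0, 0.4, 1.1, 0.5, 0.0]

nrmse = normalized_rmse(diva_command, torchdiva_command)
print(f"normalized RMSE: {nrmse:.4f}")
```

A value of zero means the two trajectories coincide. With range normalization, the result is expressed as a fraction of the reference signal's excursion.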

Fig 4.

Normalized RMSE of motor command after training.

Normalized root mean-square error (RMSE) in motor command output of TorchDIVA vs DIVA over 20 repetitions with a trained speech target ‘u’.

Table 1.

Normalized root mean-square error (RMSE) of TorchDIVA motor signal.

Fig 5.

DIVA and TorchDIVA spectrogram comparison.

Comparison of output audio for the speech production ‘happy’. The top subplot shows DIVA, the middle subplot shows TorchDIVA, and the bottom subplot shows the difference between the two output signals.

Fig 6.

DIVA and TorchDIVA auditory excitation pattern comparison.

Comparison of the auditory excitation pattern (AEP) for the speech production ‘happy’ in DIVA and TorchDIVA. The first subplot shows the AEP; the second shows the difference between the two AEPs.

Fig 7.

TorchDIVA and DiffWave speech quality metrics.

Speech quality metric comparison between TorchDIVA and DiffWave-enhanced TorchDIVA samples. The original human speech sample is the reference signal for all metric calculations. H_DIVA: human reference vs. TorchDIVA output. H_DW: human reference vs. DiffWave-enhanced output. a) Perceptual evaluation of speech quality (PESQ). b) Predicted rating of speech distortion (CSIG). c) Predicted rating of background distortion (CBAK). d) Predicted rating of overall quality (COVL). e) Segmental signal-to-noise ratio (segSNR).
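
Of the metrics in panels a) to e), segSNR is the simplest to write out. The sketch below follows one common definition, and may not match the implementation used for this figure: the per-frame SNR in dB is averaged over non-overlapping frames without windowing, and each frame is clamped to the range -10 dB to 35 dB. The frame length, clamp limits, and test signals are assumptions chosen for illustration.

```python
import math


def segmental_snr(reference, degraded, frame_len, min_db=-10.0, max_db=35.0):
    """Mean per-frame SNR in dB between a reference and a degraded signal.

    Assumptions: non-overlapping rectangular frames and clamping of each
    frame's SNR to [min_db, max_db]; published implementations often use
    overlapping, windowed frames instead.
    """
    eps = 1e-10  # guards against division by zero and log of zero
    n = min(len(reference), len(degraded))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - d) ** 2 for r, d in zip(ref, deg))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        frame_snrs.append(min(max(snr_db, min_db), max_db))
    if not frame_snrs:
        raise ValueError("signals are shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


# Hypothetical signals: the degraded copy is the reference scaled by 0.9.
human_reference = [1.0, -1.0, 1.0, -1.0] * 2
synthesized = [0.9 * x for x in human_reference]

seg_snr = segmental_snr(human_reference, synthesized, frame_len=4)
print(f"segSNR: {seg_snr:.2f} dB")
```

Clamping keeps silent or near-identical frames from dominating the average, which is why identical inputs return the upper limit rather than infinity.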

Table 2.

Paired sample t-test and Cohen’s D for TorchDIVA and DiffWave speech quality metrics.
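
As a rough illustration of the statistics named in this table, the sketch below computes a paired-sample t statistic and an effect size from per-sample metric scores. The variant of Cohen's d used in the table is not given here, so the mean difference divided by the standard deviation of the differences is an assumption. The function name and the scores are hypothetical and are not data from the study.

```python
import math
import statistics


def paired_t_and_cohens_d(scores_a, scores_b):
    """Return (t statistic, Cohen's d, degrees of freedom) for paired samples.

    Assumption: Cohen's d is computed as mean(diff) / stdev(diff); a
    pooled-standard-deviation variant would give a different value.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff
    return t_stat, cohens_d, n - 1


# Hypothetical per-sample metric scores for the two conditions.
h_diva = [1.0, 1.2, 1.1, 1.3]
h_dw = [1.5, 1.6, 1.8, 1.7]

t_stat, cohens_d, dof = paired_t_and_cohens_d(h_diva, h_dw)
print(f"t({dof}) = {t_stat:.3f}, Cohen's d = {cohens_d:.3f}")
```

The p-value follows from the t statistic and the degrees of freedom through the Student t distribution, which a statistics library would normally supply.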
