On Short-Time Estimation of Vocal Tract Length from Formant Frequencies

doi:10.1371/journal.pone.0132193

Fig 1.

Three randomly-generated area functions (a) from the simulated speech data set and their corresponding coefficient values (b) that define their shape according to Eqs (17) and (18).

These example area functions were found to be closest, in terms of RMS error, to the area functions presented by Wood [35] for the English vowels /u/, /i/ and /a/.


Table 1.

Sentences used as stimuli for eliciting the analyzed utterances.

All subjects spoke these same five sentences.


Fig 2.

Midsagittal rtMR image of subject al1 (a) with vocal tract outlines and Voronoi diagram overlaid.

Mean images of the five subjects (b–f), each representing the average vocal tract posture assumed by that subject during production of the formants in the data set. Vocal tract outlines (dashed lines) are overlaid together with the vocal tract midlines (solid lines), which were calculated from those outlines using Voronoi skeletons. The length of each midline was used as the measure of vocal tract length for that subject. The vocal tract lengths were: b: 18.0 cm, c: 16.4 cm, d: 14.5 cm, e: 17.3 cm, f: 17.6 cm.
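The midline-length measurement can be illustrated with a minimal sketch, assuming the midline is available as an ordered list of (x, y) points in centimeters. The function name `polyline_length` and the sample coordinates are hypothetical and are not taken from the study; extraction of the Voronoi skeleton itself is not shown.

```python
import math


def polyline_length(points):
    """Sum the straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Hypothetical midline samples in cm (illustrative only, not study data).
midline = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(polyline_length(midline))  # 5 + 6 = 11.0
```

A denser sampling of the midline makes the polyline length a closer approximation of the curve length.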


Table 2.

Estimation accuracies of several estimators on simulated speech data in terms of RMS error, along with the parameters that define the estimator.

For estimators with no parameter β₀, the coefficients β₁…β₄ correspond to those in Eq (4). For estimators with all five parameters listed, those parameters correspond to those in Eq (20). Note that the coefficient values for the maximum likelihood estimator, the proposed estimator, the maximum a posteriori estimator and the extended proposed estimator are mean values across all resamplings. The standard deviations of the coefficients β₁…β₄ across all resamplings were < 0.005 for all estimators, and the standard deviations of β₀ were < 5.
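Eqs (4) and (20) are not reproduced in this listing, so the following is only a sketch of how a formant-weighted length estimate and its RMS error might be computed. It assumes a quarter-wavelength form in which each formant F_n is divided by (2n − 1), the results are combined with weights β₁…β₄, and the combination Φ is converted to a length via L = c / (4Φ). That form, the function names, the speed of sound and the example weights are all assumptions for illustration, not the estimators tabulated here.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed value, in cm/s


def estimate_vtl(formants_hz, betas, c=SPEED_OF_SOUND_CM_S):
    """Assumed estimator: L = c / (4 * sum(beta_n * F_n / (2n - 1)))."""
    phi = sum(
        b * f / (2 * n - 1)
        for n, (b, f) in enumerate(zip(betas, formants_hz), start=1)
    )
    return c / (4.0 * phi)


def rms_error(estimates, references):
    """Root-mean-square difference between paired values."""
    if len(estimates) != len(references):
        raise ValueError("sequences must have equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))


# Formants of an ideal uniform tube of 17.5 cm: F_n = (2n - 1) * c / (4 * L).
formants = [500.0, 1500.0, 2500.0, 3500.0]
length = estimate_vtl(formants, betas=[0.25, 0.25, 0.25, 0.25])
print(length, rms_error([length], [17.5]))
```

For a uniform tube, every term F_n / (2n − 1) equals c / (4L), so any weights that sum to one recover the true length exactly; the weights only matter once articulation perturbs the formants.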


Table 3.

Estimation accuracies of several estimators on human speech data in terms of RMS error, along with the parameters that define the estimator.

For estimators with no parameter β₀, the coefficients β₁…β₄ correspond to those in Eq (4). For estimators with all five parameters listed, those parameters correspond to those in Eq (20). Note that the coefficient values for the maximum likelihood estimator, the proposed estimator, the maximum a posteriori estimator and the extended proposed estimator are mean values across all resamplings. The standard deviations of the coefficients β₁…β₄ across all resamplings were < 0.015 for all estimators, and the standard deviations of β₀ were < 10.


Table 4.

Empirical comparison of the sensitivity of formant frequencies to speech articulation in the simulated and human speech data sets.

The measure presented is normalized standard deviation: , where n is the formant number. Lower numbers indicate less sensitivity to articulation. Note that sensitivity decreases as higher formants are considered.


Fig 3.

Sensitivity functions predicted by perturbation theory, showing the sensitivity of the first four formants to a vocal tract constriction whose center is placed at all possible locations along the length of the vocal tract.

This example uses a vocal tract constriction of uniform area, with a length that is 25% of total vocal tract length. Note that the range of the sensitivity functions decreases as higher formants are considered. The range of sensitivity functions can be used as a measure of formant sensitivity to constrictions of a given length, regardless of their location, as in the sensitivity range functions shown in Figs 4 and 5.
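The idea of a sensitivity range can be sketched numerically. The sketch assumes an idealized uniform tube, closed at one end and open at the other, whose single-perturbation sensitivity function for formant n is proportional to cos((2n − 1)πx), with x the normalized position along the tract. It also assumes that the effect of a uniform constriction is the mean of that function over the constricted section. Both assumptions, the sign convention and all names are illustrative and are not the paper's Eqs.

```python
import math


def sensitivity(n, x):
    """Assumed sensitivity of formant n at normalized position x in [0, 1]."""
    return math.cos((2 * n - 1) * math.pi * x)


def sensitivity_range(n, rel_length, steps=500, samples=100):
    """Range of the mean sensitivity under a constriction of relative length
    rel_length as its center sweeps every position where it fits in the tract."""
    half = rel_length / 2.0
    values = []
    for i in range(steps + 1):
        center = half + (1.0 - rel_length) * i / steps
        start = center - half
        # Midpoint-rule average of the sensitivity over the constriction.
        mean = sum(
            sensitivity(n, start + rel_length * (j + 0.5) / samples)
            for j in range(samples)
        ) / samples
        values.append(mean)
    return max(values) - min(values)


# Constriction length of 25% of the tract, as in the example above.
for n in range(1, 5):
    print(f"F{n}: {sensitivity_range(n, 0.25):.3f}")
```

Under these assumptions the printed ranges shrink from F1 to F4, in line with the trend the caption describes for a constriction of this length.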


Fig 4.

Sensitivity range functions predicted by perturbation theory, showing the range of sensitivity of the first four formants to vocal tract constrictions of different lengths, regardless of their location.

This example uses uniform vocal tract constrictions with lengths varying from 0 to 50% of total vocal tract length. Note that, in general, the sensitivity range decreases as higher formants are considered. When constrictions are very short (i.e., less than 5% of total vocal tract length), there is little difference between the sensitivity ranges of different formants. There are also exceptions to the decreasing trend: when constrictions are between approximately 34% and 48% of total vocal tract length, F4 is more sensitive than F3.


Fig 5.

Sensitivity range functions predicted by the multitube model, showing the range of sensitivity of the first four formants to vocal tract constrictions of different lengths, regardless of their location.

This example uses uniform vocal tract constrictions with lengths varying from 0 to 50% of total vocal tract length. Note the striking similarity to the results predicted by perturbation theory, as shown in Fig 4, which provides corroborating evidence that the proposed extensions to perturbation theory, and the assumptions behind them, are reasonable.


Fig 6.

Frequency response of the filter specified by Eq (19), which reflects an interpretation of vocal tract constrictions, presented in Eq (16), as a summation over some section of the sensitivity function defined for single, short perturbations.

The transfer function depends on the length of the constriction, tΔl, which is why frequency responses for several constriction lengths are shown. Overlaid are lines indicating the fundamental frequencies of the single-perturbation sensitivity functions corresponding to the first four formant frequencies. When the constriction is long (d; tΔl = 50% of L), the magnitude of the frequency response generally decreases as the spatial frequency of the sensitivity function increases, corresponding to the overall reduction in the sensitivity of higher formants. For very short constrictions (a; tΔl = 6% of L), this sensitivity difference is marginal. It is also possible to obtain a dramatic reduction in sensitivity for higher formants (b; tΔl = 28% of L), or to maximally attenuate the sensitivity of a formant other than the highest (c; tΔl = 41% of L). These results are highly consistent with the sensitivity range functions from the formant sensitivity experiments, as presented in Figs 4 and 5.
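The filter interpretation can be illustrated with a continuous idealization. Eq (19) is not reproduced in this listing, so the sketch assumes that summing over a constricted section acts as a moving-average filter of width w = tΔl/L. It also assumes that the sensitivity function of formant n in a uniform tube has spatial frequency (2n − 1)/2 cycles per tract length. The gain at that frequency is then |sin(u)/u| with u = (2n − 1)πw/2. The function name and this closed form are assumptions for illustration, not the paper's filter.

```python
import math


def filter_gain(n, rel_length):
    """Moving-average gain |sin(u) / u| at the assumed spatial frequency of
    formant n, with u = (2n - 1) * pi * rel_length / 2."""
    u = (2 * n - 1) * math.pi * rel_length / 2.0
    return 1.0 if u == 0 else abs(math.sin(u) / u)


# Relative constriction lengths quoted for panels a to d.
for rel_length in (0.06, 0.28, 0.41, 0.50):
    gains = [filter_gain(n, rel_length) for n in range(1, 5)]
    print(rel_length, [round(g, 3) for g in gains])
```

Under these assumptions the gain for F4 nearly vanishes at 28% of L and the gain for F3 nearly vanishes at 41% of L, which agrees qualitatively with the panels described above.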
