Table 1.
Evaluation of segmentation methods for spontaneous speech.
Fig 1.
Boundaries of phrases are often signaled by discontinuities in speech rate.
(a) An example of boundaries (red lines) of phrases set at word initiations (grey diamonds) that correspond to a peak in relative speech rate. (b) The distributions of durations of middle (i.e., neither first nor last) and of the last words in phrases containing at least 3 but no more than 20 words. (c) The distributions of durations of middle (i.e., neither first five nor last 10) and of last phones in phrases containing 3–20 words. N = 60 audio files.
Fig 2.
Durations of final words and phones in phrases are extended.
(a) Durations of words grouped by their positions, from the last word to the first, in manually obtained intonation units (IUs; red) or in automatically identified phrases (blue). (b) Same as panel (a), but depicting durations of phones. (c) Relative speech rate is lower at the beginning of manually segmented IUs (red bars). Blue bars confirm the expected trend for phrases that were automatically identified using speech rate. (d) A sketch of a ‘typical’ IU in the Santa Barbara Corpus. In panels (a, b) phrases containing 2–20 words were considered. In panels (a-c) circles denote mean values and error bars correspond to ± s.e.m. N = 60 audio files.
Fig 3.
Automatic and manual taggings yield similar distributions of durations and length.
(a) The distributions of the number of words per IU for automatic (blue) and manual (red) tagging. Mean lengths were (4.26±0.08) and (4.10±0.07) words, respectively. (b) The distributions of durations of IUs for automatic (blue) and manual (red) tagging. Mean durations: (1.08±0.03) sec and (1.14±0.03) sec, respectively. Dashed lines denote exponential fits to the tails of the distributions; time constants: t = 0.73 sec and 0.68 sec, respectively; goodness of fit: R² = 0.993 and 0.998, respectively. In both panels, the calculation was performed for each audio file individually; error bars correspond to ± s.e.m. N = 60 audio files.
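The caption does not say how the exponential fits to the tails were obtained. As a minimal sketch of one common approach, the code below fits a straight line to the logarithm of the binned density above a cutoff and reports the time constant and R². The function name `fit_exponential_tail`, the cutoff `t_min`, the binning, and the choice to compute R² in log space are all assumptions made for illustration, not the authors' procedure.

```python
import math

def fit_exponential_tail(centers, densities, t_min):
    """Least-squares line through log(density) for bins with center >= t_min.

    Returns (time_constant, r_squared). Hypothetical helper: the cutoff and
    the log-space fit are illustrative assumptions, not the paper's method.
    """
    # Keep only tail bins with positive density so the logarithm is defined.
    pts = [(t, math.log(p)) for t, p in zip(centers, densities)
           if t >= t_min and p > 0]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    slope = sxy / sxx                      # slope = -1 / time constant
    intercept = mean_y - slope * mean_t
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in pts)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pts)
    r_squared = 1 - ss_res / ss_tot
    return -1 / slope, r_squared

# Synthetic check: a noiseless exponential with a 0.7 s time constant.
centers = [0.1 * k + 0.05 for k in range(40)]
densities = [math.exp(-t / 0.7) for t in centers]
tau, r2 = fit_exponential_tail(centers, densities, t_min=1.0)
```

On real histograms the result depends on the cutoff and bin width, so both would need to be reported alongside the fitted time constant.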
Fig 4.
Automatic and manual tagging exhibit pitch reset.
(a) Mean normalized pitch as a function of normalized time exhibits a peak near the initiation of a phrase. Blue: automatic phrase boundary detection. Red: manual boundary detection. Inset: the average pitch at time intervals t = 0.15–0.25 (beginning) and t = 0.85–0.95 (end). Asterisks denote that the average was significantly higher at the beginning: p = 2×10⁻⁹ (automatic) and p = 10⁻¹⁶ (manual). (b) The standard deviation of the pitch as a function of normalized time is higher near the termination of phrases. Blue: automatic boundary detection. Red: manual boundary detection. Inset: the average standard deviation (STD) at normalized time intervals t = 0–0.5 (first half) and t = 0.9–1 (end). Asterisks denote that the STD was significantly higher at the end: p = 0.0004 (automatic) and p = 0.017 (manual). In both panels, the calculation was performed for each audio file individually: N = 60 audio files. Lines and shaded areas represent mean and ± s.e.m., respectively.
Table 2.
Fig 5.
Frequent words are over-represented at beginnings of automatically identified phrases.
For each of the first four positions of the phrases and for each of the five most frequent words found in that position, the probability of appearing at that particular position was calculated. This was done by dividing the number of times that a word appears at a particular position by the total number of times this position was found in the dataset. To evaluate the errors, the calculation was performed for each of the N = 3 groups (of 20 audio files each) individually. Lines and shaded areas represent mean and ± s.e.m., respectively. M1-M4: curves based on manual boundary detection. A1-A4: curves based on automatic boundary detection.
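The positional probability described in this caption can be sketched as follows. The function name `positional_probabilities`, its parameters, and the toy phrases are hypothetical; the sketch assumes each phrase is already available as a list of lowercase word tokens and that phrases shorter than a given position simply do not contribute to that position's total.

```python
from collections import Counter

def positional_probabilities(phrases, n_positions=4, top_k=5):
    """Map each position (1-based) to its top_k words and their probabilities.

    Probability = (times the word occurs at that position)
                  / (number of phrases that have that position).
    """
    result = {}
    for pos in range(n_positions):
        # Words found at this position; shorter phrases are skipped.
        words = [p[pos] for p in phrases if len(p) > pos]
        total = len(words)
        counts = Counter(words)
        result[pos + 1] = [(w, c / total) for w, c in counts.most_common(top_k)]
    return result

# Toy input (invented for illustration, not taken from the corpus).
phrases = [
    ["and", "i", "went"],
    ["and", "then"],
    ["i", "know"],
    ["you", "know", "what"],
]
probs = positional_probabilities(phrases)
```

To reproduce error bars of the kind shown in the figure, the same function would be run separately on each group of files and the per-group probabilities averaged.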
Fig 6.
Pauses comparable to (or longer than) the duration of a word mark boundaries.
For each threshold value of the minimal duration of meaningful pauses, the phrase boundaries were identified using criteria of both speech rate and of pauses longer than the threshold value. For each resulting boundary detection, the precision was calculated as compared to manual boundary detection. Precision as a function of the threshold values peaked at 300 ms (denoted by vertical dashed line), a value comparable to the mean duration of a word. The calculation was performed for each audio file individually: N = 60 audio files. Lines and shaded areas represent mean and ± s.e.m., respectively. The point at an infinite threshold represents the mean precision obtained when pauses were not used, showing that including pauses increased the agreement between automatic and manual boundary detection by 2.5%.
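The threshold sweep described in this caption can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: boundaries are represented as word indices, an automatic boundary counts as correct only if it coincides exactly with a manual one, precision is taken as the fraction of automatic boundaries that are correct, and the names `precision_vs_threshold`, `rate_boundaries`, and `pauses` are hypothetical.

```python
import math

def precision_vs_threshold(rate_boundaries, pauses, manual_boundaries, thresholds):
    """Precision of combined (speech-rate + pause) boundaries per threshold.

    rate_boundaries: word indices flagged by the speech-rate criterion.
    pauses: dict mapping word index -> duration (s) of the preceding pause.
    manual_boundaries: word indices of manually tagged boundaries.
    """
    manual = set(manual_boundaries)
    precision = {}
    for thr in thresholds:
        # Pauses strictly longer than the threshold add boundaries.
        pause_boundaries = {i for i, d in pauses.items() if d > thr}
        auto = set(rate_boundaries) | pause_boundaries
        precision[thr] = len(auto & manual) / len(auto) if auto else float("nan")
    return precision

# Toy data (invented); math.inf disables the pause criterion entirely.
rate_boundaries = {0, 4, 9}
pauses = {2: 0.12, 4: 0.45, 7: 0.35, 12: 0.50}
manual_boundaries = {0, 4, 7, 12}
result = precision_vs_threshold(
    rate_boundaries, pauses, manual_boundaries, [0.1, 0.3, math.inf]
)
```

In this toy case a very short threshold admits a spurious boundary and lowers precision, while an infinite threshold leaves only the speech-rate boundaries, mirroring the comparison made at the rightmost point of the figure.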
Table 3.
Detection rate in conversational vs audience-oriented files.