Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is the degree to which the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgments of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.


Introduction
Automatic speech processing (ASP) technology has been used increasingly in a wide variety of scientific and practical applications. Preliminary work in speech recognition began in the 1960s, with talker-independent automatic speech recognition and ASP work gaining a foothold by the 1980s [1] and the first attempt at child speech not coming until the mid-1990s [2,3]. The majority of the literature considers ASP as it is applied to healthy adult speech, although there has been some attention to the application of ASP with child speech [4,5,6,7] and disordered populations [8,9]. One ASP system used with children and disordered children is the Language ENvironment Analysis (LENA; LENA Research Foundation, Boulder, CO).
The LENA system is the first and, to date, only of its kind to allow for massive-scale naturalistic speech data to be collected and analyzed with ASP methods.

Description of the device
The LENA system consists of a body-worn audio recorder designed to be worn unobtrusively by a child and proprietary ASP software [13]. The system hardware is designed to collect unprocessed whole-day audio recordings (up to 16 h) [28]. After the audio is uploaded to a computer, LENA software processes the audio off-line. The result of processing is a time-aligned record of the segmentation at centisecond resolution and assignment of one of about 60 a priori labels to each segment, providing details of the auditory environment of the child wearing the recorder. The labels are broadly divided into environmental and live human-vocal categories. The environmental labels include tags such as noise, electronic media, silence, and uncertain/fuzzy. The live human-vocal category includes labels such as key-child vegetative, key-child speech-like, adult male, adult female, and other child.

Description of the technology
The ASP used by the LENA system is a set of algorithms designed to assess child and family speech [29]. The system relies on standard, modern ASP methods, using Gaussian mixture and hidden Markov models to segment and assign labels [13,28,30]. It uses an optimized dynamic programming search algorithm to compare acoustic templates with diphones in the observed segment to achieve a maximum likelihood match, ultimately assigning one label to that segment [31,32,33].

Reliability of the technology
Empirical studies on the reliability of the LENA labels and adult word counts compared to human coders have been reported for typically-developing (TD) children learning English [14,15,24,28,34], Spanish [35], and French [36]. In these reports, 82% of segments coded by humans as adult were similarly coded by machine; 73-76% of segments coded by humans as child were similarly coded by machine. Of the segments the ASP labeled as adult and child, humans similarly labeled 68% and 64-70%, respectively.
One recent, detailed study looked at the validation of the LENA machine labels compared with human transcriptions in a dataset representing 94 family recordings of children 2-48 months of age [33]. They found greater than 72% machine-human agreement in segments identified as "clear key child." This finding was shown to be similar to the 76% agreement the same research group found in a subset sample of 70 family recordings with children 2-36 months of age [30]. Xu and colleagues [33] further analyzed a subset of about 2374 child utterances for phonetic content validity to assess the robustness of the machine classifier. Phonetic units in the child utterances were analyzed by open source Sphinx speech recognition software based on well-pronounced adult speech models. The machine recognizer segmented and assigned one of 46 labels in categories of consonant, vowel, nonspeech, and silence. Human transcribers coded the same child utterances, and the machine and human transcriptions were then compared. The correlations between machine and human coders' average count of consonants and vowels per utterance were high, 0.85 and 0.82, respectively. The machine recognition algorithms significantly underestimated the total number of vowels and consonants, but overall the results gave robust reliability estimates of the ASP software in this domain.

Research questions
A persistent issue of ASP, especially as it is applied to natural speech, child speech, and disordered speech, is the reliability of its labeling. The goal of this report is to provide an empirical examination of LENA ASP output using a large sample of machine-labeled segments of interest to speech and language work evaluated by human judges comprising a generalizable gold standard with which to compare machine labels. The specific research questions are as follows:
1. How does LENA machine ASP speaker classification performance compare with human coders? In particular, how do the machine labels for target child, adult female, and adult male compare with human judge assessments?
2. How are LENA machine classification errors organized? That is, are certain types of errors or confusions more common than others?
3. How are acoustic signal characteristics known to be important for speech (duration, fundamental frequency, and amplitude) associated with human and LENA machine classification performance?

Materials and Method
Participants
Twenty-three judges evaluated the same stimuli to assess inter-judge reliability. All judges were formally trained in speech and hearing sciences and had familiarity with judgement testing procedures. All but two judges were female.

Materials
Twenty-six recordings, one recording from each of 26 families, were used for this study. All families lived together full-time, and all children were typically-developing. Demographic factors of the families such as socio-economic status, ethnicity, or race were not examined in the present work. The recordings from each family were collected and organized based on the age of the child wearing the recording device, averaging 29.1 months (SD = 2.7 months). Twenty-three independent judges evaluated the recordings. Each daylong acoustic recording was analyzed with the LENA system. To obtain the daylong audio recordings, families were given a shirt with a custom chest pocket designed to hold the small, self-contained audio recorder unobtrusively on the child's chest. The family was instructed to turn the recorder on in the morning, place it into the chest pocket, and turn it off in the evening. Families recorded on days that were typical (not the child's birthday, for example) and on a day when family members including the father or an adult male were present. Raw recordings were uploaded to a computer and processed offline using LENA software. Processed recordings generated a daylong audio file (16-bit, 16 kHz, lossless PCM, WAV format) and an XML-coded record of the segmentation onset/offset points with the segment label for every segment. The present work does not consider the segmentation accuracy of the machine algorithms. It is assumed that the results of the segmentation procedures are sufficient to evaluate the labeling procedures described here.
The LENA software identifies the onset and offset times of segments determined probabilistically to be live vocal segments belonging to an adult female (FAN segments), an adult male (MAN segments), or the child wearing the recording device (CHN segments). For each of the 78 talker-by-recording combinations (three talker labels, FAN, MAN, and CHN, in each of 26 family recordings), 30 segments were excised from the daylong recording in the following manner. For each of the three talker labels of interest, the total number of segments with that label for that recording was determined, then divided by 30 to yield an integer value, n. Using a custom MATLAB script, each nth instance of that label was then excised from the raw, daylong recording and stored as an individual computer sound file. This distribution was used here to ensure a relatively even spread of talker segments throughout the daylong recording and to avoid over- or under-sampling from certain environmental situations (e.g., bath or meal times), times of (vocal) fatigue such as later in the day, or contextual variability (e.g., regularities of family members, events, or conversation topics). Thus, stimuli consisted of 30 audio segments from each of three talker labels (FAN, MAN, CHN) collected from 26 family recordings, totaling 2340 unique audio files. There were a total of 53820 stimulus presentations, 17940 each from the machine-classified categories of adult-female, adult-male, and target-child.
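The every-nth sampling scheme and the resulting stimulus counts can be illustrated with a short sketch. This is not the authors' MATLAB script: the function name `every_nth_segment` is hypothetical, and floor division is assumed as the way the segment count is "divided by 30 to yield an integer value", which the text does not specify.

```python
def every_nth_segment(segments, per_label=30):
    """Return `per_label` evenly spaced items from a time-ordered list.

    Assumption: n is the floor of len(segments) / per_label; the source
    only says the division yields an integer value n.
    """
    n = len(segments) // per_label
    if n < 1:
        raise ValueError("fewer segments than requested samples")
    # Zero-based indices n-1, 2n-1, ... pick the nth, 2nth, ... instances.
    return [segments[i] for i in range(n - 1, n * per_label, n)]


# Stimulus counts reported in the text.
FAMILIES, LABELS, PER_LABEL, JUDGES = 26, 3, 30, 23
unique_files = FAMILIES * LABELS * PER_LABEL   # 2340 unique audio files
presentations = unique_files * JUDGES          # 53820 stimulus presentations
per_category = presentations // LABELS         # 17940 per machine label

# Hypothetical recording with 1000 segments of one label (indices stand in
# for the segment records).
picked = every_nth_segment(list(range(1000)))
```

Because the selected instances are spread at a fixed stride across the ordered list, they span the whole day rather than clustering in any one activity.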

Ethics Statement
This study was specifically approved by the Washington State University Institutional Review Board. Information about the experiment was provided and written informed consent was collected prior to participation by both the families contributing audio recordings and the judges. In terms of the minors/children in the study, written informed consent was obtained from the parents on behalf of minors/children prior to participation.

Acoustic characteristics of the stimuli
Acoustic characteristics known to be important to speech and language include duration, f0, f0 trajectory, amplitude, and amplitude variability over time [37,38]. Acoustic characteristics of the stimuli are given in Table 1 for segment duration, f0, and RMS (relative) amplitude. Descriptive statistics are given for the pooled stimulus set and for each grouping designation (FAN, MAN, and CHN segments). Following previous studies that have examined the relationship between classifications and the acoustic factors associated with those decisions [14], here we examine the relationship between ten similar a priori acoustic features and the classification decisions of the judges. For this report, the ten features, shown in Table 2 below, give a first approximation of the underlying acoustic signal information that drives classification decisions, including possible differences between human and machine processes.

Procedure
Written informed consent was obtained from all participants prior to data collection. The 23 judges evaluated the same 2340 stimuli. A session consisted of the judge evaluating all the recordings from one age group, totaling 2340 decisions. Prior to each session, a custom MATLAB script randomized the presentation order of all stimuli for that session.
Participants were instructed to listen to each stimulus and select from 1-child, 2-mother, 3-father, or 4-other by entering the number from a standard keyboard. Judges could replay the audio stimulus an unlimited number of times, and session percent complete was shown in real time. After several practice trials, stimuli were presented to the judges via a nearfield monitor loudspeaker (model 8030A, Genelec, Iisalmi, Finland) adjusted by the judge to a comfortable listening level in a quiet, sound-treated room. There are two notable aspects of the task. First, judges were given a four-alternative forced-choice (4AFC) task, but the response option other did not correspond to any label assigned to the stimuli by the LENA ASP. That is, only three actual LENA labels were evaluated, FAN, MAN, and CHN, but there were four alternatives possible for the human judges to label. Second, the labels used by the LENA ASP (FAN, MAN, CHN) are not strict analogs of the options offered to the judges in the forced-choice task (mother, father, child). That is, the tacit task of the ASP was to assign a nominal label corresponding to an adult female (i.e., FAN), to an adult male (i.e., MAN), or to the child wearing the recorder (i.e., CHN), while the task of the human judge demanded identification of mother, father, and child. There is no guarantee that an adult woman, for example, would be unambiguously also a mother. It was assumed that there is a high correspondence between any FAN label, for instance, and that person being the actual mother of the child wearing the recorder. In the event that this assumption is not borne out, however, it is unlikely to have a material effect on the overall goal of estimating language learning from a child's perspective whether the adult female, for example, is in fact that child's actual mother or another adult female within auditory range of the child. Judges were given short breaks as needed and completed the task in 90-200 minutes.
No feedback was given to the judges, and data acquisition was controlled via the custom MATLAB script. A short debriefing followed participation.

Data analysis and statistical approach
In order to assess the relationship between ASP and human coder judgments, Fleiss' kappa and Cohen's kappa reliability coefficients were calculated. Fleiss' kappa provides an overall measure of agreement between the ASP system and all of the human coders. Cohen's kappa provides a separate measure of agreement between the ASP system and each human coder. Because the human coders had the option to label a stimulus other, while all tokens were coded father, mother, or child by the ASP system, the other-coded tokens were excluded from calculation of Fleiss' kappa and Cohen's kappa. To mitigate the effects of any deviation between the ASP and human coders' classifications, weighted kappa coefficients are also reported. The weighted kappa coefficients are calculated as the raw kappa coefficients multiplied by the proportion of father, mother, or child judgments (i.e., one minus the proportion of other responses).
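As an illustration of the per-judge calculation, the following is a minimal sketch of unweighted Cohen's kappa between the ASP labels and one judge, followed by the weighting described above (raw kappa multiplied by one minus the proportion of other responses). The function names and the toy label lists are hypothetical. Note that "weighted" here follows this report's definition, not the ordinal weighted kappa found in some statistics packages.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def kappa_vs_asp(asp, judge, other="other"):
    """Return (raw, weighted) kappa for one judge against the ASP labels.

    Tokens the judge coded as `other` are excluded from the raw kappa;
    the weighted value scales it by the proportion of non-other responses.
    """
    kept = [(m, h) for m, h in zip(asp, judge) if h != other]
    raw = cohen_kappa([m for m, _ in kept], [h for _, h in kept])
    return raw, raw * len(kept) / len(judge)


# Hypothetical labels for eight tokens (not data from this study).
asp = ["child", "child", "mother", "mother",
       "father", "father", "child", "mother"]
judge = ["child", "child", "mother", "father",
         "father", "father", "other", "other"]
raw, weighted = kappa_vs_asp(asp, judge)
```

The weighting penalizes a judge-machine pairing in proportion to how often the judge rejected all three machine categories.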
The patterns of agreement and disagreement are analyzed informally by analysis of a matrix indicating the proportions of tokens given each combination of classification by the ASP system and the human coders.
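A matrix of this kind can be assembled from paired labels as sketched below. The sketch assumes joint proportions (each cell divided by the total number of judged tokens); the function name and the four example tokens are hypothetical.

```python
from collections import Counter

ASP_LABELS = ("child", "mother", "father")
HUMAN_LABELS = ("child", "mother", "father", "other")


def agreement_matrix(asp, human):
    """Proportion of tokens in each (ASP label, human label) cell.

    Rows are ASP classifications and columns are human classifications.
    """
    counts = Counter(zip(asp, human))
    total = len(asp)
    return {a: {h: counts[(a, h)] / total for h in HUMAN_LABELS}
            for a in ASP_LABELS}


# Hypothetical paired labels (not data from this study).
matrix = agreement_matrix(
    ["child", "child", "mother", "father"],
    ["child", "mother", "mother", "other"],
)
```

Cells on the diagonal hold agreements; off-diagonal cells show which confusions occur and in which direction.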
Classification trees were used to analyze the relationship between the acoustic data and the ASP and human coder judgments, using the RPART and RPART.PLOT packages in R [39,40,41].
The classification tree employs an iterative procedure in which, at each stage, each acoustic variable is partitioned to find the best criterion, and the best partition for the best variable is retained, where best is defined with respect to goodness of fit between the model and the input judgments (i.e., how closely the tree's classifications match either the ASP system's or the human coders' judgments). This procedure repeats until additional partitions of the acoustic variables do not provide statistically useful increases in the goodness of fit. Data are available without restriction at Harvard Dataverse (V1) via the following URL: http://dx.doi.org/10.7910/DVN/7CW9KO.
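One stage of the partitioning procedure described above can be sketched as an exhaustive search over variables and thresholds. The sketch assumes Gini impurity as the goodness-of-fit criterion; the text does not state which criterion was used, and this is not the RPART implementation. The feature names and toy values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the single (feature, threshold) split with the lowest impurity.

    rows: list of dicts mapping feature name to numeric value.
    Returns (feature, threshold, weighted impurity of the two children).
    """
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[feature] < t]
            right = [y for r, y in zip(rows, labels) if r[feature] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, t, score)
    return best


# Hypothetical tokens (not data from this study).
rows = [
    {"max_f0": 450, "duration_ms": 400},
    {"max_f0": 420, "duration_ms": 900},
    {"max_f0": 230, "duration_ms": 500},
    {"max_f0": 210, "duration_ms": 1000},
]
labels = ["child", "child", "father", "father"]
split = best_split(rows, labels)
```

A full tree applies the same search recursively to each child node and stops when no further split improves the fit enough to be retained.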

Results
Descriptive aggregate values of responses are given for all categories in Table 3. Valid responses were collected for 99.91% (53772 of 53820) of stimuli presented to judges; the 48 missing responses were likely due to response entry errors during testing.
Fleiss' kappa for the full data set of human coders was 0.79. Only 13% of human coder judgments were other, so the weighted Fleiss' kappa was 0.79 × 0.87 = 0.68. Both values indicate a high degree of agreement among the human coders. Fig 1 shows the unweighted (circles) and weighted Cohen's kappa coefficients. The level of agreement between the ASP system and the human coders was consistent both between the machine and any individual human judge and between all individual human judges, with a significant correlation between the weighted and unweighted decisions (r = .36, p < .05).
Table 4 provides the proportions of each combination of ASP and human coder classifications for the full data set, with ASP classifications given in the rows and human coder classifications in the columns. As suggested by the high kappa coefficients presented above, most tokens were classified the same by the ASP system and the human coders. It is also clear that disagreements between the ASP system and human coders were not random. Most disagreements occurred between either child and mother judgments or between mother and father judgments, with a pronounced asymmetry between these types of disagreements.
For tokens classified by human coders as child, when the ASP system disagreed, it was far more likely to classify a token as mother than father (leftmost column). However, for tokens classified by human coders as mother, when the ASP system disagreed, it was far more likely to classify a token as father than child (second column from left). For tokens classified by humans as father, the ASP system was also more likely to disagree by classifying a token as mother than child. Finally, the tokens classified as father by the ASP system were approximately twice as likely to be classified as other by human coders than were tokens classified as either child or mother by the ASP system.
The classification tree fit to the ASP system's judgments is illustrated in Fig 2. The illustrated trees depict a decision procedure wherein a stimulus is evaluated according to the stated inequalities in each node, beginning at the top and descending until a terminal node is reached. So, for example, the classification tree fit to the ASP system judgments begins by evaluating the maximum f0 of a given stimulus. If the maximum f0 is greater than or equal to 251 Hz, the left branch is followed. If the maximum f0 is greater than or equal to 399 Hz, the stimulus is classified as child, whereas if it is less than 399 Hz, the duration of the stimulus is evaluated. If the duration is less than 595 ms, the stimulus is classified as child, whereas if it is greater than or equal to 595 ms, it is classified as mother. If the maximum f0 is less than 251 Hz at the first node, the mean f0 is evaluated. If the mean f0 is less than 202 Hz, the stimulus is classified as father, otherwise it is classified as mother. A ten-fold cross-validation was performed on the classification tree fit of the machine decisions. In this process, 90% of the data was used to train the classification tree model, with an error term computed on the held-out 10%. This process was repeated with ten arbitrary, unique training-error sets. The overall cross-validation error rate for this data set is 0.124.
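The decision procedure just described for the tree fit to the ASP judgments can be written as nested conditionals. This is a direct transcription of the thresholds stated in the text (251 Hz, 399 Hz, 595 ms, 202 Hz); the function and argument names are hypothetical, and the function describes the fitted tree, not the internal LENA algorithm.

```python
def asp_tree_label(max_f0_hz, mean_f0_hz, duration_ms):
    """Label predicted by the tree fit to the ASP system's judgments."""
    if max_f0_hz >= 251:
        if max_f0_hz >= 399:
            return "child"
        # Maximum f0 between 251 and 399 Hz: duration decides.
        return "child" if duration_ms < 595 else "mother"
    # Maximum f0 below 251 Hz: mean f0 decides.
    return "father" if mean_f0_hz < 202 else "mother"


# Hypothetical token: moderately high maximum f0, long duration.
example = asp_tree_label(max_f0_hz=300, mean_f0_hz=250, duration_ms=700)
```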
The classification tree fit to the human coders' judgments is illustrated in Fig 3. In this tree, the decision procedure also begins by evaluating the maximum f0 of a given stimulus. If the maximum f0 is greater than or equal to 399 Hz, the stimulus is classified as child. If the maximum f0 is less than 399 Hz, the mean f0 is evaluated. If the mean f0 is less than 190 Hz, the stimulus is classified as father, whereas if the mean f0 is greater than or equal to 190 Hz, the duration of the stimulus is evaluated. If the duration is less than 995 ms, the stimulus is classified as child, otherwise it is classified as mother. The same ten-fold cross-validation procedure described above was performed on the classification tree fit of the human decisions, except decisions from all 23 judges were entered into the model (thus, the data set was about 23 times larger). The overall cross-validation error rate for this data set is 0.353.
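The tree fit to the human judgments can likewise be transcribed from the thresholds stated in the text (399 Hz, 190 Hz, 995 ms). The function and argument names are hypothetical. The function summarizes the fitted tree only and makes no claim about how any individual judge decided.

```python
def human_tree_label(max_f0_hz, mean_f0_hz, duration_ms):
    """Label predicted by the tree fit to the human coders' judgments."""
    if max_f0_hz >= 399:
        return "child"
    if mean_f0_hz < 190:
        return "father"
    # Mean f0 of at least 190 Hz: duration separates child from mother.
    return "child" if duration_ms < 995 else "mother"


# Hypothetical token: moderate f0 values, duration just over one second.
example = human_tree_label(max_f0_hz=300, mean_f0_hz=250, duration_ms=1100)
```

Written this way, the tree makes visible that a token with a mean f0 of at least 190 Hz is assigned to mother only when it lasts 995 ms or longer.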

Discussion
This work examined the accuracy of ASP labels in a large sample of speech collected from natural family settings using LENA technology. This work is motivated by a need to establish the accuracy of ASP technology in common use in the current research literature and clinical practice. We found a high degree of agreement between the ASP system and human coders, as indicated by high unweighted Fleiss' and Cohen's kappa coefficients and moderate-to-high weighted kappa coefficients [42]. Consideration of the pattern of disagreements between the ASP system and human coders indicates an asymmetry on the part of the ASP system relative to human coders. When the ASP system and human coders disagreed, human child classifications were mostly classified as mother by the ASP system, whereas human mother classifications were more often classified as father by the ASP system, as were human father classifications.
Analysis of the relationships between ten acoustic measures and the observed classifications indicates that both the ASP system's and the human coders' judgments correspond most closely to differences in maximum f0, mean f0, and duration. The classification tree models fit to the ASP and human coder judgments are very similar, both indicating that tokens classified as child exhibited high (maximum and/or mean) f0 values and shorter durations, while tokens classified as father exhibited low (maximum and/or mean) f0 values, and tokens classified as mother exhibited moderate to high f0 values and longer durations. Taken as a whole, these findings show certain structural similarities between the human decision processes known to be important for speech perception (namely, f0, amplitude, and temporal characteristics) and the results of the machine algorithms.
Overall, the machine performance found here is consistent with ASP performance [43,44] given the naturalistic, open-set acoustic data the system takes as input.

Practical application
Results from this work could be used to improve the algorithms and ASP procedures generating output. This is certainly not a straightforward task and the current results give rather abstract areas for improvement. Future work might explore concrete methods for improving the technology. For example, human classification decisions are demonstrated here to be influenced by spectral and temporal aspects of the acoustic signal with little influence from amplitude characteristics. This finding might guide researchers to focus on parameters likely to be useful for human applications such as speech. Another application might use the results of the present work directly to interpret future application of the ASP output. In particular, this work provides a fairly detailed estimate of the error (broadly defined) of the system output. This error, detailed by the label types examined here, could be straightforwardly interpreted alongside the input to better understand the results. For example, error estimates of the label outputs could be input into a model as the likelihood that a given label is correctly assigned, a sort of confidence coefficient for the classification results. This work is a first step in that direction, giving likelihood estimates for labels most likely to be useful for speech research, namely the target child, adult female, and adult male vocalizations.
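One way such a confidence coefficient might be derived is sketched below: for each ASP label, the share of human judgments that assigned the same label. The function name is hypothetical, and the counts are invented purely for illustration; they are not the values in Table 4.

```python
def label_confidence(counts):
    """Share of each ASP label's tokens that human judges labeled the same.

    counts[asp_label][human_label] is a number of judgments.
    """
    confidence = {}
    for asp_label, row in counts.items():
        total = sum(row.values())
        if total > 0:
            confidence[asp_label] = row.get(asp_label, 0) / total
    return confidence


# Invented counts for illustration only (not results from this study).
hypothetical_counts = {
    "child":  {"child": 80, "mother": 15, "father": 1, "other": 4},
    "mother": {"child": 10, "mother": 75, "father": 5, "other": 10},
    "father": {"child": 2, "mother": 18, "father": 60, "other": 20},
}
confidence = label_confidence(hypothetical_counts)
```

A downstream model could then weight each machine-labeled segment by the coefficient for its label instead of treating every label as equally trustworthy.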

Limitations
We do not account for segmentation in this work, but instead simply assume that segmentation is meaningful. It is unknown if changes or improvements in segmentation would alter the performance of the ASP or the judges.
The ASP labels (FAN, MAN, CHN) and the choices presented to judges (mother, father, child) are not necessarily commensurate. The LENA labels are intended to indicate a relatively high similarity between the model for adult female, adult male, or the key child wearing the device, respectively, and the sampled audio segment. The LENA model does not assume the segment bears a specific relationship such as father or mother.
The specific acoustic features used in this study could be profitably expanded. This study was designed to examine relatively coarse features because of the preliminary nature of this investigation and the broad variability in labels ostensibly identifying 78 individuals (mothers, fathers, and children in 26 families). Other work, notably Oller and colleagues [14], examined the role of 12 acoustic features from four general speech categories (the rhythmic/syllabification group, the low spectral tilt and high pitch control group, the wide formant bandwidth and low pitch control group, and the duration group) known to be relevant for child vocalizations. Examining the automatically coded vocalizations of children, they showed associations between the a priori acoustic features/feature classes and group classifications into typical and disordered classes. Future work could benefit from a wider application of acoustic features to better understand the underlying mechanism of classification for talker or group classification.
The ASP and human judges make decisions based on different factors. Although the conclusions of one may be comparable to, or interpretable in terms of, the other, it is not clear how much insight the ASP offers into the human decision process. For example, human judges certainly used semantic content in the decision process, a detail inaccessible to the ASP. The ASP may also have made use of details, such as fine-grained representations of energy in the signal, that may not be used in the same way by human listeners. Similarly, there is no guarantee that the acoustic factors considered here or in other post-hoc analyses such as Oller and colleagues [14] have analogs to those used by the inaccessible processing algorithms of the LENA system. Unless or until those processing techniques are made transparent, the acoustic correlates described here are, at best, secondary to the actual system performance.