Abstract
The aim of the study was to find out whether certain meaningful moments in the learning process are noticeable through features of voice, and how acoustic voice analyses can be utilized in learning research. The material consisted of recordings of nine university students as they were completing tasks concerning direct current (DC) circuits as part of their course of teacher education in physics. Prosodic features of voice—fundamental frequency (F0), sound pressure level (SPL), acoustic voice quality measured by LTAS, and pausing—were investigated. The results showed that instances of confusion and understanding were manifested in acoustic parameters. F0 was significant in characterizing both kinds of learning instances. Confusion had lower SPL and Alpha ratio, indicating that the voice quality was softer than in understanding. The degree of voice breaks was lower in understanding, suggesting less hesitation or need for clarification compared to confusion. Voice research adds to the research of learning, as the speaker’s voice is affected by the different instances in the process of learning. This research approach can be used to identify important instances of learning and to direct these instances to closer analysis of content or interaction, in order to further understand learning processes. Therefore, this study is a novel contribution to the study of learning, as it adds acoustic voice and speech analyses to the discipline.
Citation: Järvinen K, Laukkanen A-M, Kähkönen A-L, Nieminen P, Mäntylä T (2025) Investigating the moments of “aha” and “hmm” through acoustic analysis of voice and speech in pre-service physics teacher education–A novel method for identifying significant learning moments. PLoS ONE 20(1): e0314344. https://doi.org/10.1371/journal.pone.0314344
Editor: Gaoxia Zhu, Nanyang Technological University, SINGAPORE
Received: June 12, 2024; Accepted: November 9, 2024; Published: January 24, 2025
Copyright: © 2025 Järvinen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant numerical data are within the manuscript and its Supporting Information files. The voice samples are not available since they contain personal data.
Funding: This study was supported by the Academy of Finland through Grant 341558 (to TM). https://www.aka.fi The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The “aha” and “hmm” moments in learning situations
Problem-based learning is a type of pedagogical setup used by teachers and a popular approach in science education [e.g., 1]. In problem-based learning, the learners orient to and analyse the problem, engage in self-directed problem solving, and end with a reporting activity, such as explaining their answer to teachers or peers. Among its other goals, problem-based learning is used to develop group work skills and self-directing skills, which are viewed as the recognition of missing knowledge and the ability to reapply existing knowledge in a new, flexible way. Problem-based learning is thus a very suitable context for studying explanations and the restructuring of knowledge [1, 2].
In the process of problem solving, people restructure available information. This can lead to the sudden emergence of new understanding: insights. This kind of substantive reorganization of knowledge structure has been dubbed radical conceptual change (as opposed to incremental development) [3, 4]. An insight has explanatory power, as opposed to a single fact, as it addresses the “why” or “how” of a situation, not only the “what”, and it can indicate a conflict between current understanding and intuition. Insights have also been described as information that can restructure previous assumptions and can result in an “aha” experience [5]. The “aha” experience involves positive emotions, the release of tension upon resolving the situation, and recognition of the impasse that preceded it [6].
Gopnik [7] argues that explanation is a goal-directed human activity which combines some of the properties of cognitive and motivational phenomenology. Explanation acts as the phenomenological mark of a cognitive process. This phenomenology comprises the search for an explanation and the recognition that an explanation has been reached. Gopnik calls these instances the “hmm” moments and the “aha” moments: the “aha” describes the feeling when a causal representation is established, either by applying the theory or by revising it, while the “hmm” indicates the feeling when evidence is presented to which the theory has not yet assigned a causal representation. Gopnik discusses the “aha” moments in the context of explanation. More generally, the “aha” moment, or insight, refers to a sudden moment of comprehension, realization, or understanding, e.g., in the context of problem solving [8, 9]. Often, before the occurrence of the “aha” moment, the elements of the problem or the situation have been restructured [8], and the “aha” moment often involves positive emotions [5, 6]. There is also evidence that explanations, knowledge, or solutions obtained through “aha” moments are better remembered [6]. The “hmm” moments, in turn, compel learners towards resolution and action and can thus act as a driving force for developing the learner’s conceptual understanding [7, 10, 11].
In general, Pekrun et al. [12] have classified academic emotions into four categories: 1. achievement emotions, whose stimulus or object is success or achievement; 2. topic emotions, relating to the actual topic being studied; 3. social emotions, which relate to social relationships between students and teachers or among peers in an educational context; and 4. epistemic emotions, which relate to the learning process. Tyng et al. [13] also pinpoint the importance of emotions in the learning process. Epistemic emotions focus on knowledge or knowledge construction; they include, e.g., surprise, curiosity, confusion, and boredom, and they occur in situations of conflicting information where a new understanding emerges [14]. According to Pearce et al. [5], the definition of insight, the “aha” moment, is summarized in four key characteristics: subjectivity, suddenness, certainty, and emotions. They argue that insights have mainly been characterized by their cognitive nature, when their affective qualities should be included, too. Thus, the affective moments of insight are a certain kind of epistemic emotion.
Vilhunen et al. [14] underline the importance of epistemic emotions in educational settings and the complexity of the interplay between cognitive and affective factors in learning situations. Schneider et al. [15] suggest that by generating understanding of emotional experiences in scientific sensemaking, students’ learning, interest, and engagement can be enhanced. Tyng et al. [13] summarize the studies of changes in emotional states into three approaches: 1. subjective approaches investigating subjective feelings and experiences, 2. behavioral investigations focusing on gestural changes and facial and vocal expressions, and 3. objective approaches via physiological responses.
Since students’ emotions and learning processes are intertwined [16], it is justifiable to identify these “aha” and “hmm” moments in learning. The “aha” moments are insights: sudden moments of understanding something or of realizing how to move forward from a cognitive impasse [8]. “Hmm” moments are moments of cognitive impasse or confusion [10]. These instances are referred to as moments of confusion and understanding in this study. Moments of understanding or confusion have previously been detected or identified, for example, from affective responses such as facial expressions [7], through neuroimaging methods [8], or from students’ self-evaluations [5, 6, 10, 16].
In the study of learning and of interaction in learning situations, video analysis is today a widely used method for finding significant moments in the learning process [see e.g., 17, 18] and for revealing behavioral changes. A common approach is to transcribe the talk and nonverbal information and work with this intermediate data. From this approach, it is not easy to decipher the emotions embedded in the process. Also, as Lodge et al. [11] point out, when face-to-face interaction is not present, teachers can have difficulties in detecting students’ emotional responses, especially confusion. They argue that means for detecting the emotions that students experience while learning are necessary. As online and remote learning grow, methods for detecting students’ emotions are becoming more essential. Tyng et al. [13] present evidence of emotion modulating attention and the retrieval of information through a review of neuroimaging studies and conclude with a call for education research to pick up the ball. However, neuroimaging techniques are impractical for classroom research. The goal of deciphering the interplay of emotion and learning progress requires new methodologies. The present study focuses on the emotions in the learning process, the “aha” and “hmm” moments, as it aims to clarify whether these moments can be studied and recognized with acoustic analyses.
Verbal and non-verbal communication
Interpersonal communication has traditionally been divided into two types: verbal and non-verbal communication [19], and classroom talk is traditionally investigated by analyzing teacher-student speech and interaction through linguistic ethnography and sociocultural research [20]. However, linguistic content and representations do not take the individual characteristics of voice into account, nor do they sufficiently represent the paralinguistic information in speech. A textual representation of a spoken utterance can, therefore, be an insufficient indicator of both its content and its functional intent. Voice characteristics add extralinguistic information on top of the linguistic content and thus give the listener information about the speaker’s individual characteristics, such as age, gender, and physiological and psychological status. This means that when speakers talk, they reveal something about themselves not only by what is said but also by how it is said, as speech utterances convey non-lexical interpersonal and discourse-related information to a large extent [21–25].
In the context of automatic analysis and speech synthesis, Campbell [22, 23] proposes that speech utterances can be categorized into two main types: 1. the primarily information-bearing and 2. the primarily affect-expressing. The first can be characterized by transcriptions alone, but the latter requires knowledge of the speaker’s prosody before an interpretation of the meaning can be made.
Characteristics of non-verbal communication in speech
Speech prosody has many functions in interpersonal communication, and one of the most important is to indicate changes in the speaker’s emotional state. A speaker can express the same linguistic content with different emotions by changing the prosodic features of speech. These changes in prosody can be intentional or unintentional [24]; likewise, the perception of changes in prosodic features is subconscious to a large extent [26]. Prosody helps the listener to better understand the discourse [27].
Acoustic parameters that can be directly affected by emotional prosody are mean fundamental frequency (F0) and its variation, mean intensity, and segment and pause duration [e.g., 28, 29]. Changes in arousal level and valence can be reflected in changes in pitch and/or intensity [e.g., 30–32]. Epistemic emotions, too, can be categorized by their valence and activation: enjoyment and curiosity are experienced as pleasant or positive and are therefore considered positive activating emotions, associated with high arousal, while confusion is considered a negative activating emotion, entailing negative valence but an activating nature [14]. Additionally, voice quality can be affected by the speaker’s emotional state; for example, Laver [19] summarizes that breathy voice is associated with intimacy, whispery voice with confidentiality, harsh voice with anger, and creaky voice with boredom. According to previous studies [see e.g., 33–35], anger, joy, and fear are manifested in tense voice, and sadness, surprise, enthusiasm, intimacy, and contentment in breathy voice.
When analyzing classroom interaction between teacher and students, Hämäläinen et al. [36] found that voice research complements such research. They state that prosodic features, such as intonation, volume, and pace, can add to the analysis, since it is important to know how things are said in addition to what is said.
The present study aims to investigate whether voice research can be extended to the study of learning by seeking answers to the following research questions:
- RQ1: Are the “aha” and “hmm” moments recognizable through certain characteristics in the speaking voice?
- RQ2: What can acoustic voice analyses provide to learning research?
Material and methods
Ethical statement
The participants were volunteers and gave their written informed consent for the collected material to be used for research. The material was collected and stored, and the results of the analyses were published, in accordance with the Finnish Data Protection Act (Chapter 5, Section 31). The Finnish procedure requires that the researchers and the PI are trained in research ethics and have the competence to perform their own ethical evaluation and to assess the risks. The researchers in this study have all completed the University’s training course in research ethics and data management.
The Ethical Committee of Research in Humanities at the University of Jyväskylä follows the directives of TENK and provides pre-evaluation statements only for studies that pose a substantial risk to participants. The Finnish National Board on Research Integrity (TENK) connects such risks to five conditions: deviation from informed consent, intervening in the physical integrity of participants, exceptionally strong stimuli, a greater-than-everyday probability of causing mental harm, or a threat to the safety of participants or their close ones (Publications of TENK 3/2019, ISSN 2490-161X; https://tenk.fi/en/advice-and-materials/guidelines-ethical-review-human-sciences). Permission for the study was not sought from the University’s Ethical Committee, as the study took place within the frame and conduct of a typical study session and applied the procedures for informed consent, triggering none of the aforementioned risk conditions.
Selection of participants
Twenty-three pre-service physics teachers from the course “Teaching of Physics at School” answered the preliminary questionnaire and completed the conceptual understanding instrument DIRECT [37], translated into Finnish, which measured their conceptual knowledge of direct current (DC) electric circuits. The DIRECT instrument is a research-based questionnaire consisting of 29 multiple-choice questions and is a validated tool for measuring conceptual understanding of DC circuits. The questionnaire included questions about personal information, subjective notions of voice, and the Voice Handicap Index (VHI) [38]. The VHI score reflects the subjective assessment of a handicap due to voice disorders and can be seen as a screening tool for distinguishing vocal health from vocal dysfunction between individuals [39, 40]; it has been validated for Finnish speakers (Alaluusua & Johansson 2003 [Unpublished]). The participants selected for the recordings were third- or fourth-year students (five females and five males) who had already studied the basics of electricity, but whose conceptual understanding was still somewhat developing.
The ten participants with the lowest VHI scores were selected for the voice-related part of the research. They were divided into five pairs based on the DIRECT instrument results, using varied pairing (pairs with high and low, high and high, and low and low scores). One pair was female-female, one was male-male, and three pairs were female-male. There were altogether three recorded sessions. One participant attended only the first session; her partner joined another pair, leaving three pairs and one trio to be investigated for the rest of the study.
Recordings
The recordings took place at the university on three consecutive Fridays from 10 a.m. to 12 noon. Before each recording, the participants answered a short questionnaire about their voice production (easier than usual—as usual—needs more effort than usual) and voice quality (better than usual—usual—worse than usual) based on their notions of that day.
The recordings were carried out with AKG C 111 headsets (AKG Harman, Stamford, USA), with the microphone placed 2 cm from the corner of the mouth. The sampling frequency was 44.1 kHz, and Revolabs HD Countryman adapters (Yamaha UC, Inc., Sudbury, USA) and a Revolabs HD Dual Channel System 2-Ch (Yamaha UC, Inc., Sudbury, USA) connected to a Zoom Livetrak L-12 mixer (Zoom Corp., Tokyo, Japan) were used in the recordings. Each voice was calibrated for SPL using an AZ 8922 digital sound level meter (AZ Instrument Corp., Taichung City, Taiwan) (Fig 1). The recordings were approximately 1 hour and 30 minutes in total duration per pair in each session. All participants wore a surgical face mask due to the Covid-19 situation. Although it has been noted that a face mask can have an effect on voice [e.g., 41, 42], it was imperative to use masks during the recordings. The masks were worn on every occasion; therefore, the influence of the mask was the same in every situation throughout the recordings, and thus the mask should not affect the results, as comparisons were made individually within sessions.
All pairs were present at the same time in the same location.
After the recordings the participants answered a question on a 100-point VAS line about the stressfulness of the situation (0 = not at all stressful—100 = very stressful).
The tasks
The teaching sessions were structured for problem-based learning in pairs or small groups. All the pairs were given the problem “Rank the light bulbs in DC circuits in the order of their relative brightness” (Fig 2). During the first two recordings, the tasks became progressively more complex. In the third recording, the students were given some of the same tasks as in the first two recordings, in order to evaluate the progress of the learning or understanding that had taken place. The directions for the problem-based sessions followed the Predict-Observe-Explain structure: first with pen and paper only, analysing the problem individually and then discussing the prediction as a pair and possibly reaching a consensus; then constructing the circuits and observing whether the prediction was confirmed or revisions were needed. Based on the observation, the pair formed their final explanation, which was reported to the teacher. The tasks were adapted from McDermott & Shaffer [43]. An example of a task is shown in Fig 2. Even for university students, the tasks are difficult, as they are designed to evoke known misconceptions about electric current, voltage, and resistance. Such misconceptions include assigning properties of energy to the concept of current [37].
From the recordings, one of the researchers extracted samples indicating either confusion, i.e., speech turns where students expressed uncertainty or an impasse in resolving the task (category 1); understanding, i.e., moments where students expressed an insight into an idea or a conception (category 2); or explaining, i.e., speech turns where students provided information to a peer (category 3). The categorization was also based on the context of the discussion. Explaining was chosen as one category since it can represent speech and voice with fewer emotional connotations than understanding and confusion [36]. Also, teachers’ explanative turns can affect students’ learning [44]. The criterion for the samples was that each sample was at least one sentence long; thus, short interjections and one-word utterances were not included. After categorizing the samples, the ones with overlapping speech, laughter, or other disturbances were excluded, and a total of 328 analyzable samples were selected for the analysis (94 in category 1, 147 in category 2, and 87 in category 3). Additionally, a sample of normal conversational speech was extracted for reference from each session.
To validate the classification of the samples, another researcher categorized 55 randomly selected samples, 16.8 percent of the whole set, which is an adequate proportion for multiple coding. Interrater reliability was analyzed by crosstabulation, and agreement was measured with Cohen’s Kappa, which estimates the degree of consensus between two judges. Cohen’s Kappa between the two listeners was 0.72, which reads as substantial agreement [45].
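Cohen’s Kappa corrects the observed agreement for the agreement expected by chance from each rater’s marginal label frequencies. A minimal pure-Python sketch of the computation (the label lists below are illustrative, not the study’s data):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same samples."""
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    # Observed agreement: share of samples given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels: 1 = confusion, 2 = understanding, 3 = explaining
r1 = [1, 2, 2, 3, 1, 2, 3, 2, 1, 2]
r2 = [1, 2, 2, 3, 2, 2, 3, 2, 1, 1]
print(round(cohens_kappa(r1, r2), 2))  # 0.68
```

Kappa equals 1.0 for perfect agreement and 0 for agreement no better than chance, which is why it is preferred over raw percent agreement for this kind of validation.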
Acoustic analyses
In spoken language, pitch is an important characteristic of voice. It contributes to the perception of intonation in all languages, and to the lexical identity of words in some languages. Through its linguistic and paralinguistic functions, pitch contributes to the identification of speech acts, the recognition of speaker states, and the perception of prosodic structuring, as well as to many other characteristics related to discourse and dialogue. The distinction between pitch and fundamental frequency (F0) is that pitch corresponds to the subjective perception of voiced sounds, while F0 corresponds to the physiological parameter of the frequency of vibration of the vocal folds [46]. The most commonly used indicator of the acoustic amplitude of a sound wave is the sound pressure level (SPL), which correlates with the human perception of loudness [47].
In determining voice quality, two main factors are present: vocal fold vibration and vocal tract resonance, which are controlled by the speaker’s phonatory and articulatory behaviors. The voice organ includes the respiratory system, the larynx, and the vocal tract. Movements in the voice organ control the quality of voice sounds [48]. The long-term average spectrum (LTAS) provides information about the sound energy and its distribution across different frequency areas [49], and it is used for voice quality analyses. The Alpha ratio is a manifestation of voice quality, as it refers to the sound level difference between low and high frequency ranges; here, the level difference was calculated between 50–1000 Hz and 1000–5000 Hz. A summary of the acoustic parameters and their physiological and perceptual correlates is given in Table 1.
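As an illustration of how an Alpha ratio of this kind can be obtained from a spectrum, the sketch below estimates the level difference between the 1000–5000 Hz and 50–1000 Hz bands of a synthetic signal. The band limits are from the text; everything else (the function name, the synthetic sinusoid signals) is an assumption for illustration only, since the study itself used Praat for the analyses:

```python
import numpy as np

def alpha_ratio_db(signal, fs):
    """Alpha ratio sketch: level difference (dB) between the 1000-5000 Hz
    and 50-1000 Hz bands of the power spectrum; lower = steeper slope."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    low = spectrum[(freqs >= 50) & (freqs < 1000)].mean()
    high = spectrum[(freqs >= 1000) & (freqs <= 5000)].mean()
    return 10 * np.log10(high / low)

# Two synthetic 1-second signals: a "soft" one with weak high-frequency
# energy and a "firmer" one with relatively more of it.
fs = 44100
t = np.arange(fs) / fs
soft = np.sin(2 * np.pi * 200 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)
firm = np.sin(2 * np.pi * 200 * t) + 0.2 * np.sin(2 * np.pi * 3000 * t)
print(alpha_ratio_db(soft, fs) < alpha_ratio_db(firm, fs))  # True
```

The comparison at the end mirrors the qualitative pattern described later in the text: the softer signal, with proportionally less high-frequency energy, yields the lower Alpha ratio.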
Acoustic analyses of voice provide an objective tool for studying the human voice. The acoustic analyses were carried out with Praat 6.0.49 [50] (Fig 3).
Picture A: A female speaker producing the vowel /a:/; the blue line represents F0 and the green line SPL. On the left, F0 = 188.4 Hz and SPL = 68.7 dB; on the right, F0 = 333.2 Hz and SPL = 76.9 dB. Picture B: LTAS of a female speaker’s 20-second-long text reading; solid line: breathy voice, dotted line: pressed voice.
Previous studies have shown that the most prominent prosodic features in studying vocal characteristics during learning are pitch- and intensity-related characteristics, as well as voice quality and speech tempo [27, 51–54]. The parameters investigated per turn were fundamental frequency (F0), F0 variation and standard deviation (in semitones), sound pressure level (SPL) and its range, the Alpha ratio, and pausing (the degree of voice breaks in percentages, measured automatically as the duration of the breaks in the signal divided by the total duration of the signal). The fundamental frequency was measured on a linear Hz scale, but F0 variation and the standard deviation of F0 are expressed on a logarithmic scale (as semitones, with a reference point at 100 Hz) in order to adapt the results to human pitch perception and to make male and female speakers better comparable with each other [46]. An equal (perceived) pitch difference of, say, 2 semitones comprises different amounts of Hz depending on the actual pitch level. For example, for a male it can be a difference from 98 Hz to 110 Hz (i.e., 12 Hz), while for a female it can be from 196 Hz to 220 Hz (i.e., 24 Hz), as the female pitch range is on average higher than the male one. The linear Hertz scale should be transformed to a logarithmic scale (e.g., semitones) when differences in frequencies are measured, for example, when the span of a speaker’s pitch range or pitch movements are analysed [46]. Changes in vocal loudness can affect the Alpha ratio, since an increase in vocal loudness leads to a decrease in the overall slope of the long-term average spectrum [55–57]; the Alpha ratio thus reflects both voice quality and intensity. In a soft and breathy voice, the spectral slope is steep and the Alpha ratio low, while in a firmer and louder voice the slope is less steep and the Alpha ratio higher (see Fig 3).
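The Hz-to-semitone conversion used for F0 variation follows the standard formula st = 12 · log2(f / fref) with fref = 100 Hz. A small sketch shows why the 2-semitone interval in the example above spans about 12 Hz for the male range but about 24 Hz for the female range:

```python
import math

def hz_to_semitones(f_hz, ref_hz=100.0):
    """Frequency in Hz expressed in semitones relative to a 100 Hz reference."""
    return 12 * math.log2(f_hz / ref_hz)

# The same perceived interval spans different Hz amounts at different pitch levels:
male_span = hz_to_semitones(110) - hz_to_semitones(98)     # 98 -> 110 Hz (12 Hz)
female_span = hz_to_semitones(220) - hz_to_semitones(196)  # 196 -> 220 Hz (24 Hz)
print(round(male_span, 2), round(female_span, 2))  # 2.0 2.0
```

Because the semitone scale is logarithmic, the two frequency ratios (110/98 and 220/196) are identical, so both intervals come out as the same number of semitones despite the different Hz spans.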
The difference between each category (confusion, understanding, or explaining) and normal conversational speech was calculated as a subtraction: the value in the category minus the value in normal conversational speech. The outcome is thus negative when the value in the category is lower than in normal conversational speech. The difference between the categories and normal conversational speech was expressed in percentages, since the number of samples varied between the categories and between the sexes.
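This comparison can be sketched as follows; the function name and the example values are hypothetical, chosen only to mirror the sign convention described above (negative result = category value below the conversational-speech baseline):

```python
def relative_change_percent(category_value, baseline_value):
    """Category value minus the normal-conversational-speech baseline,
    expressed as a percentage of the baseline (negative = below baseline)."""
    return (category_value - baseline_value) / baseline_value * 100

# E.g. a hypothetical SPL of 97.35 dB in a category vs. a 100 dB baseline:
print(round(relative_change_percent(97.35, 100.0), 2))  # -2.65
```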
Statistical analyses
IBM SPSS 26 software was used for the statistical analyses. The Kolmogorov-Smirnov test was used to test the distribution of the data. Since the distributions were not normal, non-parametric tests were used for the comparisons: the Friedman test for comparisons between categories and the Wilcoxon signed-rank test for pairwise comparisons. Due to the relatively small number of participants, a generalized linear mixed model was used to examine possible individual effects on the changes. The significance level was set at .05.
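For readers who want to follow the rank-based logic of the Friedman test outside SPSS, a minimal pure-Python sketch of the test statistic (the data values are illustrative, not the study’s measurements):

```python
def friedman_statistic(data):
    """Friedman chi-square statistic for N related samples measured
    under k conditions. data: list of N rows, each with k values."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        # Rank the k values within the row, averaging ranks for ties.
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # 1-based average of tied positions
            for m in range(i, j + 1):
                rank_sums[order[m]] += avg_rank
            i = j + 1
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Three "speakers", three conditions; the third condition is always highest:
print(round(friedman_statistic([[1, 2, 3], [2, 3, 4], [1, 3, 5]]), 6))  # 6.0
```

The statistic is then compared against a chi-square distribution with k − 1 degrees of freedom, a step a statistics package such as SPSS performs internally.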
Results
The participants’ mean age was 23.8 years (SD 1.93), and their mean VHI score was 5.3 (SD 3.81). Subjectively experienced stress was highest in session two. Only one participant evaluated their voice quality as worse than usual on one occasion (Table 2).
Recognizing the “aha” and “hmm” moments
Samples of normal speech and explaining had on average longer durations than samples in the categories of confusion and understanding (Table 3).
F0 was 5.80 percent higher in understanding and 0.64 percent higher in confusion compared to normal conversational speech, while in explaining it was 5.13 percent lower. Compared to normal conversational speech, the F0 standard deviation was 14.41 percent smaller in explaining and slightly smaller in confusion and understanding (4.51 and 4.42 percent, respectively) (Fig 4).
SPL was 2.65 percent lower in confusion than in normal conversational speech, while in understanding it was 3.05 percent higher and in explaining 0.17 percent higher than in normal conversational speech (Fig 5).
The Alpha ratio was 7.33 percent lower in confusion, while in understanding and explaining it was higher (0.54 and 3.79 percent, respectively). Pausing was 12.08 percent lower in understanding, while in confusion it was 1.45 percent and in explaining 1.59 percent higher than in normal conversational speech (Fig 6).
Means and standard deviations for each parameter in the three categories are given in Table 4.
Between categories, statistically significant differences were found in all parameters except F0 variation and SPL range (Table 5).
Confusion differed significantly from understanding in F0 (Z = -4.35, p < .001, r = -0.45), F0 SD (Z = -2.36, p = .018, r = -0.24), SPL (Z = -6.50, p < .001, r = -0.67), Alpha ratio (Z = -2.75, p = .006, r = -0.28), and pausing (Z = -3.17, p = .002, r = -0.32), and from explaining in F0 (Z = -4.52, p < .001, r = -0.49), F0 SD (Z = -3.38, p < .001, r = -0.36), SPL (Z = -3.26, p = .001, r = -0.35), and Alpha ratio (Z = -4.50, p < .001, r = -0.49). Understanding differed from explaining in F0 (Z = -7.17, p < .001, r = -0.77), SPL (Z = -3.53, p < .001, r = -0.38), Alpha ratio (Z = -2.00, p = .045, r = -0.22), and pausing (Z = -3.96, p < .001, r = -0.43) (Table 6, Fig 7).
LTAS (averaged) for females (left) and males (right). The black line represents confusion, the red line understanding, and the blue line explaining.
Wilcoxon signed ranks test, significance level .05.
The generalized linear mixed model did not show significant individual effects on the results (Table 7).
Generalized linear mixed model, significance level .05.
Acoustic voice analyses and learning research
The significant learning moments were manifested differently in the speaking voice, and the differences were detectable by acoustic analyses. The acoustic analyses provided significant results between the three categories investigated (Table 5). The results show that acoustic analyses are a suitable tool for studying learning moments from the learner’s speech and voice.
Discussion
Speakers modify their prosody for communicative reasons; for example, through these modifications, the same linguistic content can express different emotions [24]. The emotions can be recognized through vocal characteristics [31, 32]. According to Gopnik [7], the “hmm” moments of learning are closely related to the basic emotions of surprise and interest, and the “aha” moments are accompanied by expressions of joy. These moments are, then, often expressed, for example, by facial expression. According to the results of this study, these moments are recognizable also in the speaking voice. In this study, confusion was expressed with a softer voice quality than understanding and explaining, manifested in lower SPL and Alpha ratio (see Figs 5 and 6). This is in line with previous findings that the LTAS, and the Alpha ratio derived from it, are affected by voice quality or phonation type (i.e., breathy, modal, or pressed) and by SPL [55–57].
The speech samples of this study did not have strong emotional content, and it is thus understandable that differences were found in the voice quality related Alpha ratio. In signaling emotions, voice quality and pitch variables may have at least partially different functions: voice quality makes distinctions in general speaker states, moods, and attitudes, whereas pitch variables are more critical in signaling stronger emotions [35]. Understanding can represent stronger emotions, such as excitement [30–32] or joy [7], than confusion or explaining. Therefore, understanding was manifested with a higher F0 and a louder voice. This is in line with previous studies which suggest that valence and activation are important characteristics of epistemic emotions [14].
While explaining had on average longer samples than confusion and understanding, its standard deviation of F0 was smaller. This is in line with previous findings that increased activity can be manifested in greater pitch variation [58]. Accordingly, both confusion and understanding would involve more activity than explaining, as it is possible that when explaining things to another person, the speaker does not express as strong emotional states as when expressing confusion and/or understanding. Hämäläinen et al. [36] found different voice patterns in teachers’ classroom situations. They suggest that pitch variation plays an important role in these patterns and that a teacher’s presentative speech has the lowest pitch variation compared to, for example, disputational and promotive speech. Explanatory speech in this study is comparable with presentative speech, and, therefore, our result is in line with their findings. However, further study combining acoustic analyses with content analysis is required to understand the context and, for example, the effects of subject matter understanding on the clarity of delivery or assertiveness of the speech.
Additionally, understanding involved less pausing than confusion or explaining, indicating more fluent speech when demonstrating understanding. Speakers can manipulate their use of pausing to structure information, and increased pausing can either serve cognitive processing for the speaker or act as a communicative-interactive marker giving the interlocutor more time to process the explanatory content [27, 59, 60]. This was seen in this study in the increased degree of pausing in confusion and explaining: in confusion, increased pausing can indicate hesitation, while in explaining it can indicate clarification. For further research, speech rate would be worth adding to address fluency, hesitation, and clarification markers, since speaking rate tends to increase and pause duration to decrease as uncertainty decreases [27].
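As an illustration of how a degree-of-pause measure can be derived, the following sketch flags low-energy frames as pauses and reports their share of the sample's duration. The frame length, hop, and relative threshold are illustrative assumptions, not the parameters used in this study.

```python
import numpy as np

def degree_of_pause(signal, fs, frame_s=0.025, hop_s=0.010, rel_threshold=0.1):
    """Fraction of frames whose RMS energy falls below a relative threshold.

    Frames quieter than `rel_threshold` times the loudest frame are
    counted as pauses; the parameter values here are illustrative.
    """
    frame = int(frame_s * fs)
    hop = int(hop_s * fs)
    rms = np.array([
        np.sqrt(np.mean(signal[i:i + frame] ** 2))
        for i in range(0, len(signal) - frame, hop)
    ])
    return float(np.mean(rms < rel_threshold * rms.max()))

# Synthetic example: 1 s of tone, 1 s of silence, 1 s of tone,
# so roughly one third of the frames are pauses.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)
speech_like = np.concatenate([tone, np.zeros(fs), tone])
```

In this sketch, a higher value would correspond to the more hesitant or clarifying speech observed in confusion and explaining.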
Lodge et al. [11] state that emotions such as confusion can be difficult to detect, especially when face-to-face interaction is absent. With the growing use of digital learning environments, a deeper understanding of students' difficulties should be pursued. This study showed that confusion can be detected from the speaking voice, which can provide teachers with tools for recognizing these emotions during, for example, online teaching.
Recording speech samples in authentic learning situations posed certain challenges for the acoustic analyses, since several samples could not be analyzed due to overlapping speech and other disturbances, such as laughter. However, the procedure was necessary for gathering authentic samples of the learning dialogues. The inter-rater reliability in categorizing the samples was moderate, which can be explained by the fact that the first categorization was made from the whole recordings, whereas the second researcher listened only to the selected samples. The first categorization could therefore have been influenced by the whole conversation and the contexts in which the utterances were spoken.
The participants were required to wear surgical face masks during the recordings, which can affect especially the spectral characteristics of the voice. According to previous studies [41, 42], wearing a face mask can attenuate the higher frequencies (over 1000 Hz), producing a steeper spectral tilt (which would result in a lower alpha ratio), and reduce intensity. In this study, comparisons were made within individuals between normal conversational speech and the categories (i.e., confusion, understanding, and explaining) in each recording session; the effect of the mask was thus the same in each category and in conversational speech. Therefore, the use of a mask should not affect the differences found between the categories.
The number of participants was quite small due to the recording circumstances. However, the number of samples per participant was quite large, and the statistical analyses can therefore be considered adequate, which is supported by the effect sizes. The generalized linear mixed model also supports the results, as no significant individual effects were found. Nevertheless, a larger number of participants should be included in future research.
The present study pilots new research methods for learning research, and for generalizing the results, other situations and environments for learning discussions should be studied. Additionally, this study was conducted in a Finnish cultural and language context. The results may be affected by culture-specific use of the voice, and the characteristics of voice in other languages and cultural contexts should thus be investigated. This study also concentrated on peer discussions; teacher-student dialogues were not considered. Future research could investigate teachers' influence on students' vocal manifestations of the learning moments.
Recommendations for future work
Identifying learning occasions through speech requires further study. One possible measure is the syllabic prosodic index (SPI) introduced by Tavi [24]. This phonetic measure analyzes prosodic emphasis in syllables, combining the main aspects of prosody (pitch, rhythm, and intensity), and it can be used as an addition in analyzing emotional states from the voice. SPI could give information about the different ways learning moments are manifested in speakers' voices. It should also be studied whether the crucial moments of learning, or of confusion, can be found automatically from speech patterns. This would benefit research in the field of learning, since voice parameters can offer objective information regardless of the lexical content of speech and can speed up other analyses by providing quick, automatic access to moments of interest through, for example, data mining and machine learning.
It would also be beneficial to study whether machine learning could be trained to recognize the significant moments of the learning process from students' speech, which could help teachers detect these moments during online teaching. A next step is to study whether the significant moments can be recognized from audio data by extracting samples based on the changes in the parameters that this study found significant in distinguishing different learning moments.
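One simple way such automatic extraction could work, consistent with this study's within-speaker comparisons, is to flag segments whose parameters (e.g., mean F0 or SPL) deviate markedly from the speaker's own conversational baseline. The z-score threshold of 2 below is an illustrative assumption, not a validated criterion, and the numeric values are hypothetical.

```python
import numpy as np

def flag_candidate_moments(baseline, segments, z_threshold=2.0):
    """Return indices of segments deviating from a speaker's baseline.

    `baseline` and `segments` are arrays of a per-segment parameter
    (e.g., mean F0 in Hz or SPL in dB). The threshold is illustrative.
    """
    mu = np.mean(baseline)
    sigma = np.std(baseline)
    z = (np.asarray(segments) - mu) / sigma
    return [i for i, zi in enumerate(z) if abs(zi) > z_threshold]

# Hypothetical per-segment mean F0 values (Hz): the raised-F0 segment
# stands out against the speaker's conversational baseline.
baseline_f0 = np.array([195.0, 200.0, 205.0, 198.0, 202.0])
segment_f0 = np.array([201.0, 235.0, 199.0])
candidates = flag_candidate_moments(baseline_f0, segment_f0)
```

Segments flagged this way could then be passed to closer content or interaction analysis, in line with the workflow proposed above.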
For voice quality analysis, inverse filtering could be added in future research. In inverse filtering [see e.g., 61, 62], the glottal flow signal is estimated from the corresponding acoustic speech pressure signal by removing the effects of the vocal tract and lip radiation from the microphone signal. Inverse filtering can reveal more information about voice quality changes between learning instances than the spectrum-based alpha ratio. A larger number of samples and participants is also required in future research.
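Inverse filtering typically relies on an all-pole (LPC) model of the vocal tract: filtering the speech signal with the inverse of this model leaves an estimate of the glottal excitation. The sketch below shows the core idea on a synthetic second-order example; real systems such as those in [61, 62] use higher model orders, pre-emphasis, and closed-phase analysis, none of which are included here.

```python
import numpy as np

def lpc_coefficients(signal, order):
    """Estimate LPC coefficients via the autocorrelation method."""
    n = len(signal)
    r = np.array([np.dot(signal[: n - k], signal[k:]) / n
                  for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:]
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)]
                         for i in range(order)])
    return np.linalg.solve(toeplitz, r[1:])

def inverse_filter(signal, a):
    """Remove the modeled (vocal-tract-like) resonance, leaving the residual."""
    return np.convolve(signal, np.concatenate(([1.0], -a)), mode="valid")

# Synthetic "vocal tract": a known 2-pole resonance driven by noise.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(20000)
y = np.zeros_like(excitation)
for i in range(2, len(y)):
    y[i] = 0.75 * y[i - 1] - 0.5 * y[i - 2] + excitation[i]

a_hat = lpc_coefficients(y, order=2)    # should approach [0.75, -0.5]
residual = inverse_filter(y, a_hat)     # approximates the excitation
```

In real inverse filtering the residual is interpreted as the glottal source, whose shape carries the voice quality information that the alpha ratio only summarizes.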
Acoustic analyses provide a possibility to study how the voice conveys understanding in the learning process. It should be studied further how speakers modify their voice and speech in interaction and how these changes construct understanding.
As significant moments of the learning process can be recognized through acoustic analyses of the voice, future research should combine the acoustic analyses with interaction and content analyses. This will give information on emotional reactions in the learning process, for example at different stages of problem-solving or with problems of differing difficulty. This novel way of studying the learning process can provide new insight into important events of understanding. Possible patterns of dissonance, such as frustration, and their manifestation through voice and speech should also be studied further.
Conclusions
A speaker's voice is affected by the different emotive moments in the learning process, which are not necessarily detectable in mere written transcripts. Instances indicating confusion, understanding, and explaining were manifested differently in vocal features. The activity level may be higher in understanding than in confusion, leading to higher pitch, greater loudness, and a firmer voice quality in understanding. Explaining was manifested with a smaller F0 standard deviation than confusion. Understanding involved less pausing than confusion and explaining; a larger degree of pausing may have reflected hesitation in confusion and served as a tool for clarification in explaining.
We find voice research a promising addition to the traditional research of learning. The successful linking of voice features to moments relevant to the learning process is a promising start. The work continues as the voice features are studied alongside the students' conceptual learning progressions. We value voice research both for its ability to triangulate findings from traditional interaction analyses and for its standalone value in understanding how the voice is used for multiple purposes in interactive learning situations.
References
- 1. Hmelo-Silver CE. Problem-based learning: What and how do students learn?. Educational psychology review. 2004;16:235–266.
- 2. Yew EH, Goh K. Problem-based learning: An overview of its process and impact on learning. Health professions education. 2016;2(2):75–79.
- 3. Asterhan CS, Schwarz BB. Argumentation and explanation in conceptual change: Indications from protocol analyses of peer‐to‐peer dialog. Cognitive science. 2009;33(3):374–400. pmid:21585475
- 4. de Leeuw N, Chi MTH. Self-explanation: Enriching a situation model or repairing a domain model? In Sinatra G, Pintrich P. (Eds.). Intentional conceptual change (pp. 55–78). Hillsdale, NJ: Erlbaum; 2003. https://doi.org/10.4324/9781410606716
- 5. Pearce BJ, Deutsch L, Fry P, Marafatto FF, Lieu J. Going beyond the AHA! moment: insight discovery for transdisciplinary research and learning. Humanities and Social Sciences Communications. 2022;9:1–10.
- 6. Danek AH, Fraps T, von Müller A, Grothe B, Öllinger M. It's a kind of magic—what self-reports can reveal about the phenomenology of insight problem solving. Frontiers in Psychology. 2014;5:1408. pmid:25538658
- 7. Gopnik A. Explanation as orgasm and the drive for causal knowledge: The function, evolution, and phenomenology of the theory formation system. In Keil F. C. & Wilson R. A. (Eds.), Explanation and cognition (pp. 299–323). The MIT Press; 2000.
- 8. Kounios J, Beeman M. The Aha! moment: The cognitive neuroscience of insight. Current Directions in Psychological Science. 2009;18(4):210–216.
- 9. Kounios J, Beeman M. The cognitive neuroscience of insight. Annual Review of Psychology. 2014;65:71–93. pmid:24405359
- 10. D'Mello S, Lehman B, Pekrun R, Graesser A. Confusion can be beneficial for learning. Learning and Instruction. 2014;29:153–170.
- 11. Lodge J, Kennedy G, Lockyer L, Arguel A, Pachman M. Understanding Difficulties and Resulting Confusion in Learning: An Integrative Review. Frontiers in Education. 2018;3(49):1–10.
- 12. Pekrun R, Muis KR, Frenzel AC, Götz T. Emotions at School. England, UK: Routledge; 2018. https://doi.org/10.4324/9781315187822
- 13. Tyng CM, Amin HU, Saad MNM, Malik AS. The Influences of Emotion on Learning and Memory. Front Psychol [Internet]. 2017 [cited 2022 Nov 29];8. pmid:28883804
- 14. Vilhunen E, Turkkila M, Lavonen J, Salmela-Aro K, Juuti K. Clarifying the Relation Between Epistemic Emotions and Learning by Using Experience Sampling Method and Pre-posttest Design. Frontiers in Education. 2022;7:826852.
- 15. Schneider B, Krajcik J, Lavonen J, Salmela-Aro K. Learning science: The value of crafting engagement in science environments. Yale University Press; 2020.
- 16. Vilhunen E, Chiu MH, Salmela-Aro K, Lavonen J, Juuti K. Epistemic Emotions and Observations Are Intertwined in Scientific Sensemaking: A Study among Upper Secondary Physics Students. International Journal of Science and Mathematics Education. 2023;21:1545–1566. pmid:36090464
- 17. Derry SJ, Pea RD, Barron B, Engle RA, Erickson F, Goldman R, et al. Conducting Video Research in the Learning Sciences: Guidance on Selection, Analysis, Technology, and Ethics. Journal of the Learning Sciences. 2010;19(1):3–53.
- 18. Xu L, Aranda G, Widjaja W, Clarke D (Eds). Video-based Research in Education: Cross-disciplinary Perspectives. London: Routledge; 2018. 302 p. https://doi.org/10.4324/9781315109213
- 19. Laver J. Principles of Phonetics. Cambridge: Cambridge University Press; 1994.
- 20. Mercer N. The analysis of classroom talk: Methods and methodologies. British Journal of Educational Psychology. 2010;80:1–14. pmid:20092680
- 21. Barr BJ. Paralinguistic correlates of conceptual structure. Psychonomic Bulletin & Review. 2003;10(2):462–467. pmid:12921425
- 22. Campbell N. Extra-Semantic Protocols; Input Requirements for the Synthesis of Dialogue Speech. In Andre E, Dybkjaer L, Minker W, Heisterkamp P. (Eds.). Affective Dialogue Systems. Berlin: Springer Verlag; 2004. https://doi.org/10.1007/978-3-540-24842-2_22
- 23. Campbell N. Perception of affect in speech—towards an automatic processing of paralinguistic information in spoken conversation. Proceedings Interspeech. 2004:881–884.
- 24. Tavi L. Prosodic Cues of Speech Under Stress—Phonetic Exploration of Finnish Emergency Calls [academic dissertation]. Joensuu: University of Eastern Finland; 2020. Publications of the University of Eastern Finland, Dissertations in Education, Humanities, and Theology, No 154.
- 25. Waaramaa T, Kankare E. Acoustic and EGG analyses of emotional utterances. Logopedics Phoniatrics Vocology. 2013;38(1):11–18. pmid:22587654
- 26. Zald DH. The human amygdala and the emotional evaluation of sensory stimuli (review). Brain Research Reviews. 2003;41:88–123. pmid:12505650
- 27. Swerts M. Prosodic features at discourse boundaries of different strength. Journal of the Acoustical Society of America. 1997;101(1):514–521. pmid:9000742
- 28. Alinezhad B. A Study of the Relationship between Acoustic Features of "bæle" and the Paralinguistic Information. The Journal of Teaching Language Skills (JTLS). 2010;29(1):1–25.
- 29. Belin P, Fecteau S, Bédard C. Thinking the voice: neural correlates of voice perception. Trends in Cognitive Sciences. 2004;8(3):129–135. pmid:15301753
- 30. Waaramaa T, Palo P, Kankare E. Emotions in freely varying and mono-pitched vowels, acoustic and EGG analyses. Logopedics Phoniatrics Vocology. 2014;40(4):156–170. pmid:24998780
- 31. Waaramaa T, Laukkanen A-M, Airas M, Alku P. Perception of Emotional Valences and Activity Levels from Vowel Segments of Continuous Speech. Journal of Voice. 2010;24(1):30–38. pmid:19111438
- 32. Laukkanen A-M, Vilkman E, Alku P, Oksanen H. On the perception of emotions in speech: the role of voice quality. Logopedics Phoniatrics Vocology. 1997;22(4):157–168.
- 33. Scherer KR. Vocal affect expression: A review and a model for future research. Psychological Bulletin. 1986;99:143–165. pmid:3515381
- 34. Laukkanen A-M, Vilkman E, Alku P, Oksanen H. Physical variation related to stress and emotional state: a preliminary study. Journal of Phonetics. 1996;24:313–335.
- 35. Gobl C, Ní Chasaide A. The role of voice quality in communicating emotion, mood and attitude. Speech Communication. 2003;40:189–212.
- 36. Hämäläinen R, De Wever B, Waaramaa T, Laukkanen A-M, Lämsä J. It's Not Only What You Say, But How You Say It: Investigating the Potential of Prosodic Analysis as a Method to Study Teacher's Talk. Frontline Learning Research. 2018;6(3):204–227.
- 37. Engelhardt PV, Beichner RJ. Students' understanding of direct current resistive electrical circuits. American Journal of Physics. 2004;72(1):98–115.
- 38. Jacobson BH, Johnson A, Grywalski C, Silbergleit A, Jacobson G, Benninger MS, et al. The Voice Handicap Index (VHI): development and validation. American Journal of Speech-Language Pathology. 1997;6(3):66–70.
- 39. Gräßel E, Hoppe U, Rosanowski F. Grading of the Voice Handicap Index. HNO. 2008;56:1221–1228. pmid:17676287
- 40. Ohlsson AC, Dotevall H. Voice handicap index in Swedish. Logopedics Phoniatrics Vocology. 2009;34:60–66. pmid:19308791
- 41. Nguyen DD, McCabe P, Thomas D, Purcell A, Doble M, Novakovic D, et al. Acoustic voice characteristics with and without wearing a facemask. Scientific Reports. 2021;11:5651. pmid:33707509
- 42. Knowles T, Badh G. The impact of face masks on spectral acoustics of speech: Effect of clear and loud speech styles. The Journal of the Acoustical Society of America. 2022;151:3359–3368. pmid:35649889
- 43. McDermott LC, Shaffer PS. Research as a guide for curriculum development: An example from introductory electricity. Part I: Investigation of student understanding. American Journal of Physics. 1992;60(11):994–1003.
- 44. Michener CJ, Proctor CP, Silverman RD. Features of instructional talk predictive of reading comprehension. Reading and Writing: An Interdisciplinary Journal. 2018;31(3):725–756.
- 45. O’Connor C, Joffe H. Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines. International Journal of Qualitative Methods. 2020;19.
- 46. Hirst DJ, de Looze C. Fundamental frequency and pitch. In Knight R-A & Setter J. (Eds.) Cambridge Handbook of Phonetics. Cambridge University Press; 2021. 336–361. https://doi.org/10.1017/9781108644198
- 47. Long M. Fundamentals of Acoustics. In Architectural Acoustics (2nd edition). Academic Press; 2014. https://doi.org/10.1016/B978-0-12-398258-2.00002-7
- 48. Sundberg J, Patel S, Björkner E, Scherer KR. Interdependencies among Voice Source Parameters in Emotional Speech. IEEE Transactions on Affective Computing. 2011;2(3):162–174.
- 49. Laver J. The Phonetic Description of Voice Quality. Cambridge University Press; 1980.
- 50. Boersma P, Weenink D. Praat: Doing phonetics by computer. Version 6.0.49. https://uvafon.hum.uva.nl/praat/, accessed Oct 5th 2021.
- 51. Aubanel V, Nguyen N. Speaking to a common tune: Between-speaker convergence in voice fundamental frequency in a joint speech production task. PLoS ONE. 2020;15(5):e0232209. pmid:32365075
- 52. Lubold N, Pon-Barry H. Acoustic-prosodic Entrainment and Rapport in Collaborative Learning Dialogues. MLA '14: Proceedings of the 2014 ACM workshop on Multimodal Learning Analytics Workshop and Grand Challenge. 2014:5–12.
- 53. Lubold N, Pon-Barry H. A Comparison of Acoustic-Prosodic Entrainment in Face-to-Face and Remote Collaborative Learning Dialogues. Proceedings of the IEEE Workshop on Spoken Language Technologies. 2014.
- 54. Frick RW. Communicating emotion: The role of prosodic features. Psychological Bulletin. 1985;97(3):412–429.
- 55. Master S, De Biase N, Chiari BM, Laukkanen A-M. Acoustic and Perceptual Analyses of Brazilian Male Actors' and Nonactors' Voices: Long-term Average Spectrum and the "Actor's Formant". Journal of Voice. 2008;22(2):146–154. pmid:17134874
- 56. Sundberg J, Nordenberg M. Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech. The Journal of the Acoustical Society of America. 2006;120:453–457. pmid:16875241
- 57. Hammarberg B, Fritzell B, Gauffin J, Sundberg J. Acoustic and perceptual analysis of vocal dysfunction. Journal of Phonetics. 1986;14:533–547.
- 58. Breitenstein C, Van Lancker D, Daum I. The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cognition & Emotion. 2001;15(1):57–79.
- 59. Hazan V, Pettinato M. The emergence of rhythmic strategies for clarifying speech: variation of syllable rate and pausing in adults, children and teenagers. Conference paper: 10th International Speech Production Seminar. Cologne, Germany, 2014;178–181.
- 60. Smiljanić R, Bradlow AR. Speaking and Hearing Clearly: Talker and Listener Factors in Speaking Style Changes. Language and Linguistics Compass. 2009;3(1):236–264. pmid:20046964
- 61. Alku P. Glottal inverse filtering analysis of human voice production—A review of estimation and parameterization methods of the glottal excitation and their applications. Sadhana. 2011;36:623–650.
- 62. Airas M. TKK Aparat: An environment for voice inverse filtering and parameterization. Logopedics Phoniatrics Vocology. 2008;33(1):49–64. pmid:18344143