High variability phonetic training in adaptive adverse conditions is rapid, effective, and sustained

Christine Xiang Ru Leong; Jessica M. Price; Nicola J. Pitchford; Walter J. B. van Heuven

doi:10.1371/journal.pone.0204888

Abstract

This paper evaluates a novel high variability phonetic training paradigm that involves presenting spoken words in adverse conditions. The effectiveness, generalizability, and longevity of this high variability phonetic training in adverse conditions were evaluated using English phoneme contrasts in three experiments with Malaysian multilinguals. Adverse conditions were created by presenting spoken words against background multi-talker babble. In Experiment 1, the adverse condition level was set at a fixed level throughout the training and in Experiment 2 the adverse condition level was determined for each participant before training using an adaptive staircase procedure. To explore the effectiveness and sustainability of the training, phonemic discrimination ability was assessed before and immediately after training (Experiments 1 and 2) and 6 months after training (Experiment 3). Generalization of training was evaluated within and across phonemic contrasts using trained and untrained stimuli. Results revealed significant perceptual improvements after just three 20-minute training sessions and these improvements were maintained after 6 months. The training benefits also generalized from trained to untrained stimuli. Crucially, perceptual improvements were significantly larger when the adverse conditions were adapted before each training session than when they were set at a fixed level. As the training improvements observed here are markedly larger than those reported in the literature, this indicates that the individualized phonetic training regime in adaptive adverse conditions (HVPT-AAC) is highly effective at improving speech perception.

Citation: Leong CXR, Price JM, Pitchford NJ, van Heuven WJB (2018) High variability phonetic training in adaptive adverse conditions is rapid, effective, and sustained. PLoS ONE 13(10): e0204888. https://doi.org/10.1371/journal.pone.0204888

Editor: Claude Alain, Baycrest Health Sciences, CANADA

Received: May 24, 2017; Accepted: September 17, 2018; Published: October 9, 2018

Copyright: © 2018 Leong et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This research was partly supported by a summer bursary from the Experimental Psychology Society awarded to WJBVH. CLXR was supported by a funded PhD scholarship from the University of Nottingham Malaysia campus. The development of the HVPT-AAC paradigm was supported by funding from the University of Nottingham, which included an Innovation Fellowship awarded to NJP and WJBVH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Non-native English speakers often have difficulty discriminating phonemic contrasts that involve phoneme categories that are not present in their native language. For example, Japanese speakers have difficulties distinguishing between the English phonemes /r/ and /l/ [1–4], Mandarin speakers find it difficult to discriminate between the English voiced-unvoiced stops /t/ and /d/ in word-final positions [5, 6], and Spanish speakers have difficulties with the English vowels /æ/ and /ʌ/ [5]. These difficulties are explained by a number of theories and models of non-native speech perception. According to Best’s Perceptual Assimilation Model [7], difficulties arise when phonemes from the non-native language are assimilated inappropriately into native language phonetic categories. Kuhl’s [8] Native Language Magnet Theory proposes that native speech sound categories serve as magnets that increase perceived similarity of the phonetic sounds close to the phonetic representations. This perceptual magnet effect makes it difficult for non-native speakers to discriminate acoustically similar, but linguistically distinctive, non-native phonetic sounds. According to Flege’s Speech Learning Model [9], the discrimination difficulty of non-native speakers prevents the formation of appropriate phonetic categories to represent the non-native phonetic sounds. Therefore, phonemic pairs are typically assimilated into one native category by the non-native speakers and overcoming the native language interference may be critical to successfully master these non-native distinctions.

A large number of studies have investigated perceptual phonetic training methods to improve speech perception in the non-native language using either synthetic (e.g., [10–12]) or natural (e.g., [12–14]) speech stimuli. The most common task used in perceptual training studies is the identification task, in which participants determine the identity of the auditory stimuli presented by deciding which of the written words presented on the screen matches the word heard and are provided with feedback after each response [1, 2, 4, 13–22]. However, some studies have used an identification task and a discrimination task in which participants determine whether auditory stimuli presented are identical or not [3, 5, 23–25]. An auditory discrimination task has the advantage that participants do not need to know how to read words in the language trained.

One of the most successful perceptual training methods was introduced by Lively and colleagues who showed the beneficial effects of training with highly variable material (later referred to as High Variability Phonetic Training or HVPT) with Japanese native speakers learning the English /r/-/l/ contrast [13–15]. A typical HVPT paradigm trains listeners with highly variable speech tokens which are usually produced by multiple speakers. Furthermore, target phonemes in this paradigm are presented in different phonetic contexts (i.e., at word initial, middle or final positions). The majority of studies in the literature have reported significant (weak to moderate) training benefits in perceptual phonetic training (e.g., [6, 13]), that lasted up to six months in some studies [15, 17]. The training benefits observed usually generalized to novel/untrained stimuli (of the same phonemic contrasts) that had not been used as training material, and to stimuli produced by novel speakers who had not been heard during training (e.g., [4, 19]). When participants in Lively et al. [14] were trained with single speaker stimuli (low variability), training did not generalize to novel stimuli and speakers because participants developed highly detailed representations of trained stimuli. Lively et al. [15] therefore concluded that high variability training materials promote the generalization of the training. Although phonetic training benefits are well-established, training procedures used in the literature often take days if not weeks to be completed before improvements in phonetic perception are observed. Thus, it is important to find ways to improve the effectiveness of perceptual training.

A large number of studies have demonstrated that compared to native speakers, speech perception in adverse conditions is particularly challenging for non-native speakers (e.g., [26, 27]). Adverse conditions affect the intelligibility of the speech and they occur because of, for example, environmental or transmission degradation (for an overview, see [28]). Environmental degradation can be created by presenting speech in noise (e.g., white or pink noise, which produces predominantly energetic masking) or by presenting the speech with background talkers (which produce predominantly informational masking). The detrimental effect of energetic masking is greater on non-native speakers compared to native speakers in word discrimination tasks involving white noise [29] and with comprehending connected speech in sentences presented in pink noise (e.g., [30, 31], for a review, see [32]).

A series of studies conducted by Bradlow and colleagues showed that the non-native deficit in speech-in-noise perception may be attributed to a perceptual strategy that differs from that of native speakers. An example of native speakers’ strategic perceptual approach can be seen in the clear speech effect, which is the intelligibility difference between clear speech and normal conversational speech. In clear speech production, the acoustic salience of the speech signals is enhanced. Native speakers in the study by Bradlow and Bent [33] benefited more from clear speech production than non-native speakers when access to the target speech signals was impeded by background noise. Bradlow and Bent attributed this native speaker advantage to the strategic allocation of attention to language-specific acoustical cues that facilitate speech comprehension. Non-native speakers showed a smaller clear speech effect because they benefited only from the enhancement of the overall acoustic salience. The authors pointed out that extensive experience with the target language was crucial to develop a native perceptual strategy. Furthermore, non-native speakers also need greater signal clarity (e.g., with clear speech production) before other compensatory information (e.g., semantic cues to predict word identity in a sentence) can be used for speech perception in noise [34]. Another disadvantage that non-native speakers have was revealed by Bent, Kewley-Port and Ferguson [35] who showed that non-native speakers are more affected by across-talker variation when identifying vowels in noise, even though the variation was within the normal range of native speakers.

When speech signals are degraded by an informational masker (e.g., multi-talker babble), non-native speakers’ difficulty in understanding speech can be attributed to the greater interference of the native phonetic system when there is ambiguity during second language (L2) processing (e.g., disrupted signal perceived due to the noisy environment), the increasing cognitive load in non-native speakers when processing two sources of non-native auditory information, or the non-native speakers’ inexperience in separating the signal from the masker language [32, 36], especially when they share similar spectrotemporal characteristics [37].

Although daily life conversations are often held in suboptimal or adverse conditions, very few phonetic training studies have been conducted with non-native normal-hearing listeners in difficult listening conditions. As far as we are aware, only Jamieson and Morosan [38] used background noise for some of the training materials to increase perceptual difficulty during the latter phase of their phonetic training; although no further attention was given to the implications of using background noise. In another related phonetic training study by Lengeris and Hazan [21], background noise was incorporated only in pre- and post-tests to examine the effect of training in quiet situations on non-native speech-in-noise perception. Furthermore, the Speech Perception Assessment and Training System for English as Second Language speakers (SPATS-ESL) developed by Miller and colleagues [39] trained English second language speakers with speech presented in multi-talker babble, but they focused on the syllable constituent perception and word identification in spoken sentences.

Training in noise, however, is common in perceptual training studies involving hearing-impaired listeners. In some of these studies, speech-shaped noise was used to simulate hearing loss in normal-hearing native listeners (e.g., [40]). A study by Burk and colleagues [41] with young normal-hearing and older hearing-impaired listeners showed that training in background noise improved word recognition performance in both groups, but their single talker training involved many hours (7 hours) and showed limited transfer from isolated presentation of trained words to presentation of these words in sentences.

Interestingly, novel research conducted by Hazan and colleagues (e.g., [42]) focused on the effects of audiovisual perceptual training. The study by Hazan, Kim and Chen [42] suggested that the reduction of information loaded on one channel (i.e., auditory) encouraged a heavier or more efficient weighting of another information processing channel (i.e., visual cues). Likewise, in adverse auditory training conditions where information from peripheral channels experiences interference or is not available (i.e., no visual cues), participants are required to rely more on other cognitive processes to perform the perceptual task, such as a better selective attention strategy [28, 32]. Several studies have shown that the native speakers’ ability to identify and attend to critical language-specific acoustical cues that are more resistant to the adverse effects of noise leads to the perceptual difference in speech-in-noise perception between native and non-native speakers [33–36]. Therefore, auditory training under adverse conditions might help inexperienced non-native listeners to acquire these native-like selective attention strategies.

The present study

The purpose of the present experiments was to examine the effectiveness of high variability phonetic training in adaptive adverse conditions (HVPT-AAC). The initial concept of HVPT-AAC was developed by the late Richard Pemberton, Kathy Conklin, Nicola Pitchford, and Walter van Heuven at the University of Nottingham. The implementation of HVPT-AAC for the research presented here was developed by the last two authors of this paper. HVPT-AAC uses natural speech stimuli (e.g., English minimal pairs) spoken by multiple speakers (e.g., with different English accents). Critical features of HVPT-AAC include presentation of spoken words in background noise in the form of multi-talker babble (adverse conditions) and an adaptive level of adverse conditions whereby an optimal training level can be set for each listener through an adaptive staircase procedure (see Experiment 2 for further details). Adverse conditions in perceptual training increase the speech perception difficulties not only for non-native speakers but also for highly proficient non-native speakers and native speakers whose disadvantage in speech perception may only manifest itself in the presence of noise (see [32] for a review). Studies incorporating background noise in perceptual training have found significant training effects (e.g., [38, 39, 41]). Therefore, HVPT-AAC is expected to improve speakers’ performance in perceiving phonetic distinctions.

HVPT-AAC involves a discrimination task in which participants determine whether the stimuli presented are identical or not. Spoken word identification and discrimination tasks involve different cognitive mechanisms. Identification training engages top-down processing of speech signals, in which participants respond based on their categorized or phonetic representations in memory, whereas discrimination training primarily influences the bottom-up processing of speech signals, engaging lower-level and sensory-based information in speech signals. Discrimination training improves speakers’ sensitivity to detect minor differences between similar sounding stimuli [11].

Flege’s Speech Learning Model [9] hypothesizes that the more dissimilar speech sounds are, the higher the chance is that they would be encoded into two distinctive phonological categories and identified as distinctive phonemic sounds. Similar to training studies that used discrimination training (more common in training studies that used synthetic training stimuli, e.g., [10–12]), HVPT-AAC training is designed in particular to improve participants’ sensitivity towards meaningful cues in non-native speech signals, to facilitate discrimination of between-category differences [43]. Perceptual sensitivity development was targeted based on the assumption that this ability reflects one of the native speakers’ advantages in native speech sound perception and that detection of such between-category differences is the fundamental limitation for non-native speakers. Another advantage of using a discrimination task is that it does not require prior knowledge of the training language and can therefore be ideal for novice language learners. For similar reasons, Giannakopoulou et al. [44] chose to use an oddity discrimination task to examine phonetic learning from their auditory word to pictures identification training. This discrimination task allowed them to test both real and nonsense words in young children, without orthographic interference. In the present study, discrimination improvements as a result of discrimination training are expected to facilitate the formation of higher-level linguistic representations of non-native phonetics, which can be assessed in the word identification task used in pre- and post-tests. Importantly, studies such as Handley, Sharples and Moore [3] and Shinohara and Iverson [45] support the use of a discrimination task in perceptual phonetic training, as the task was found to be as effective as the identification task.

To evaluate HVPT-AAC, the current experiments were conducted with moderately to highly proficient Malaysian English speakers receiving a university education delivered in English whilst resident in Malaysia. The three experiments reported below explore the effectiveness of HVPT conducted in adverse conditions (Experiment 1) and the effectiveness of adapting the level of adverse conditions in HVPT (HVPT-AAC) versus a fixed level of adverse conditions (Experiment 2). The longevity of training in adverse conditions was investigated in Experiment 3.

Experiment 1

In this experiment we investigated whether perceptual training in two different levels of adverse conditions modulated the training results of non-native speakers, and whether the training generalized to untrained stimuli and untrained contrasts. Two levels of adverse conditions were created by manipulating the volume of the multi-talker babble (low vs. high) relative to the target stimulus volume level. We expected to see greater training benefits in the more adverse training condition (high volume) because listeners have to learn how to engage more effectively the cognitive processes used in the task (e.g., selective attention) when target auditory information is masked in the higher level of adverse conditions.

Method

Participants.

A group of 28 participants (aged 17–22; mean age 18.75; 16 females) who spoke Mandarin as their first language (L1) were recruited from the University of Nottingham, Malaysia Campus. All participants reported having normal or corrected-to-normal vision, and no history of any hearing, speech, or reading problems. Participants were paid for their participation.

Participants completed a questionnaire to obtain information about their language background. Table 1 provides an overview of their mean age and their overall subjective proficiency scores for relevant languages (the overall scores were calculated by averaging reading, writing, speaking and listening ratings, scale: from 1 = very poor to 7 = fluent), as well as the age at which they first acquired Malaysian English (AoA).

Table 1. Participants’ demography, mean self-rated language proficiency in the two languages and English language test scores for the participants in the two adverse condition levels (low and high).

https://doi.org/10.1371/journal.pone.0204888.t001

The students' English Language test scores as required for admission onto their university course were converted to IELTS standard scores using the University of Nottingham English language qualification equivalencies. All participants spoke Malaysian English and Malay as their L2, and learnt both languages from a young age (range: 0–7 years; all except one reported 5 and below for age of Malaysian English acquisition). All participants had learnt the two languages for at least 11 years through formal education and were pursuing their tertiary education in English during the study. Twenty-two participants also spoke at least one other Chinese language (generally Hokkien, Cantonese and/or Hakka). In addition, 3 participants reported having learned a foreign language (Japanese or German) at a low proficiency (mean proficiency score < 2.2 on the 7-point scale).

Participants were assigned randomly to either the low or high adverse condition groups. Between-subjects t-tests were then conducted to compare their linguistic experience and ability, so that the groups were matched in terms of their self-rated language proficiency, AoA and IELTS score.

The mean self-rated English and Mandarin proficiency did not differ between the two groups. Participants also did not differ in terms of their age of English acquisition (AoA) and IELTS standardized test scores.

Design and materials.

The stimuli consisted of three groups of English minimal pairs: sixteen /t/-/d/ minimal pairs of which eight differed at the initial position (e.g., tame–dame) and eight at the final position (e.g., sat–sad), and sixteen /ε/-/æ/ minimal pairs (e.g., leg–lag). These two phonemic contrasts were chosen because they are difficult for L1 Mandarin speakers (/t/-/d/ at word final position [5, 6] and /ε/-/æ/ [46]). The sixty-four English words consisted of nouns and verbs (complete list of stimuli can be found in the Supporting Information). The minimal pairs across the three groups were similar in word frequency (/t/-/d/ initial: 117.72, /t/-/d/ final: 101.32, /ε/-/æ/: 172.34 occurrences per million based on SUBTLEX-US [47]), number of syllables (/t/-/d/ initial: 1.0, /t/-/d/ final: 1.0, /ε/-/æ/: 1.3) and number of phonemes (/t/-/d/ initial: 3.38, /t/-/d/ final: 3.25, /ε/-/æ/: 3.5). The /t/-/d/ final minimal pairs were shorter than the other minimal pair groups in terms of the number of letters (/t/-/d/ initial: 3.75, /t/-/d/ final: 3.50, /ε/-/æ/: 4.09, F(2,61) = 3.84, p < .05). For the training sessions, the sixteen /ε/-/æ/ minimal pairs were split into two sets of stimuli (trained and untrained). Half of the participants were trained with one set of stimuli, whereas the other participants were trained with the other set. Four speakers (2 females) with different English accents (female British English, male Southern Irish English, female American English, and male Irish English) recorded the spoken word stimuli. The stimuli were recorded in an anechoic chamber using an AKG Perception 400 microphone connected to a Presonus FireBox, which was linked to an Apple MacBook Pro. Speech was recorded at 44.1 kHz (16 bit) using Amadeus Pro (version 2). Recordings were edited using Amadeus Pro and the volume of the recordings was normalised by amplifying the sound recordings of each speaker to an average root mean square (RMS) power of -25 dB (200 ms window).
Stimuli spoken by the male Southern Irish English speaker were presented in the first training session, stimuli spoken by the female American English speaker were used in the second training session, and the stimuli spoken by the male Irish English speaker were presented in the third training session. The stimuli spoken by the female British English speaker were used in the pre- and post-tests.

The background noise consisted of 6-talker babble and was created by combining the audio recordings of 6 native English speakers (3 females) taken from six BBC Radio 4 interviews in which the interviewees talked about their life and work. The interviewer's voice was edited out and the volume of each speaker was normalised using Amadeus Pro by amplifying the sound to an average root mean square (RMS) power of -25 dB (200 ms window). The resulting 6 audio files were combined into a single 6-talker babble mono audio file (44.1 kHz, 16 bit) of 6 minutes.
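The normalise-and-combine step described above can be illustrated with a minimal Python sketch. It is not the Amadeus Pro procedure used in the study: it assumes a single whole-signal RMS rather than a 200 ms windowed average, treats -25 dB as dB relative to full scale, sums the talkers without rescaling the mix, and uses sine tones as stand-ins for speech. All function names are hypothetical.

```python
import math

def rms(samples):
    """Root mean square of float samples in the range [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalise_rms(samples, target_db=-25.0):
    """Scale samples so the whole-signal RMS equals target_db (assumed dB re full scale)."""
    target = 10 ** (target_db / 20)
    gain = target / rms(samples)
    return [s * gain for s in samples]

def mix_babble(talkers, target_db=-25.0):
    """Normalise each talker, then sum sample by sample over the shortest length."""
    levelled = [normalise_rms(t, target_db) for t in talkers]
    n = min(len(t) for t in levelled)
    return [sum(t[i] for t in levelled) for i in range(n)]

# Toy "talkers": one second of sine tones at 44.1 kHz with different amplitudes.
sr = 44100
talkers = [
    [a * math.sin(2 * math.pi * f * i / sr) for i in range(sr)]
    for f, a in [(110, 0.9), (150, 0.5), (190, 0.2), (230, 0.7), (270, 0.1), (310, 0.4)]
]
babble = mix_babble(talkers)
level_db = 20 * math.log10(rms(normalise_rms(talkers[0])))  # level of one talker after normalisation
```

Because each talker is levelled before summing, no single voice dominates the mix regardless of how loud the original recording was.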

During pre-test, post-test and training with a high level of background babble, the multi-talker babble was played continuously during the task at half the stimulus level with a mean signal-to-noise ratio (SNR) of -2.6 dB (range -8.4 to 1.9 dB). During training with a low level of background babble, the multi-talker babble was played at one-tenth of the stimulus level with a mean SNR of 11.3 dB (range 5.6 to 15.9 dB). Praat [48] was used to obtain the RMS of the audio files in order to calculate the SNR. The experiment was approved by the University of Nottingham Malaysia Campus Research Ethics committee.
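As a rough illustration of the SNR calculation, the sketch below applies the standard definition, 20·log10 of the ratio of signal RMS to noise RMS. The RMS values in the study were obtained with Praat; the Python functions and the toy sine-wave signals here are assumptions for illustration only. The sketch also shows that lowering the babble from half to one-tenth of the stimulus level shifts the SNR by 20·log10(0.5/0.1) ≈ 14.0 dB, which is in line with the difference between the two reported means (11.3 − (−2.6) = 13.9 dB).

```python
import math

def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from the RMS amplitudes of signal and noise."""
    return 20 * math.log10(rms(signal) / rms(noise))

def scale(samples, factor):
    """Multiply every sample by an amplitude factor."""
    return [s * factor for s in samples]

# Toy stand-ins for a spoken word and a babble segment (not real speech).
word = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(8000)]
babble = [math.sin(2 * math.pi * 330 * i / 8000) for i in range(8000)]

snr_high_babble = snr_db(word, scale(babble, 0.5))  # babble at half level
snr_low_babble = snr_db(word, scale(babble, 0.1))   # babble at one-tenth level
shift_db = snr_low_babble - snr_high_babble         # equals 20*log10(5) whatever the signals are
```

The absolute SNR values depend on the actual recordings, so only the shift between the two babble levels is meaningful in this toy example.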

Procedure.

All participants completed five sessions, one session per day. On the first day, participants completed the pre-test, followed by three training sessions spread across the following three consecutive days and then on the final day the participants completed the post-test. A 14-inch laptop (HP EliteBook 8460p) was used to run the training program and a mouse was used to record responses.

The pre- and post-tests consisted of a two-alternative forced-choice (2AFC) identification task. The stimuli (thirty-two minimal pairs: sixteen /ε/-/æ/ and sixteen /t/-/d/) in this task were repeated four times resulting in a total of a hundred and twenty-eight experimental trials. In each trial, the two words of a minimal pair were visually presented side by side at the center of the computer screen and at the same time one of the words was presented auditorily. Participants were asked to indicate which word on the computer screen matched the word they heard. Auditory stimuli were presented through Sony headphones (MDR-NC8/WHI) at a comfortable listening level set by each participant.

Participants completed eight practice trials in order to familiarize themselves with the task. Presentation of the minimal pairs was randomized. No feedback was provided after each trial. The total percentage correct was presented only at the end of the practice trials and after each block of 32 experimental trials. The pre- and post-tests each took approximately 15 minutes.
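The structure of the test (32 minimal pairs, four repetitions, 128 trials, percentage correct shown after each block of 32) can be sketched as follows. Several details are assumptions not stated in the text: that each member of a pair served as the spoken target on two of the four repetitions, and that trials were shuffled across the whole list before being split into blocks. Apart from sat–sad and leg–lag, the word pairs are dummy placeholders.

```python
import random

def build_trials(pairs, repetitions=4, block_size=32, seed=0):
    """Return shuffled 2AFC trials split into blocks of block_size."""
    rng = random.Random(seed)
    trials = []
    for first, second in pairs:
        for rep in range(repetitions):
            # Assumption: alternate which member of the pair is the spoken target.
            target = first if rep % 2 == 0 else second
            trials.append({"options": (first, second), "target": target})
    rng.shuffle(trials)
    return [trials[i:i + block_size] for i in range(0, len(trials), block_size)]

def percent_correct(responses, block):
    """Percentage of responses that match the spoken target in a block."""
    hits = sum(r == t["target"] for r, t in zip(responses, block))
    return 100.0 * hits / len(block)

# Two example pairs from the paper plus 30 dummy placeholder pairs.
pairs = [("sat", "sad"), ("leg", "lag")] + [(f"w{i}a", f"w{i}b") for i in range(30)]
blocks = build_trials(pairs)
perfect_score = percent_correct([t["target"] for t in blocks[0]], blocks[0])
```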

In the training sessions participants performed a Same-Different word discrimination task with background babble presented either at one-tenth or half of the stimulus level. Participants heard pairs of words and had to decide whether the words were the same or different by clicking on one of the two response buttons ("Same" or "Different") presented on the computer screen using a computer mouse. The second word was played 1000 ms after the first word. Participants received feedback after each response. After a correct response, the response button with the correct answer turned green and then the next trial started. After an incorrect response, the button with the incorrect answer turned red and the correct answer turned green. The word pair was played again before the next trial was presented and no response was needed.

Each training session lasted approximately 20 minutes. Participants were instructed to focus on the auditory words they heard and try to ignore the multi-talker babble played in the background. All participants gave written informed consent prior to the start of the experiment.

Results and discussion

The mean percentage of correct identification was calculated for each phonemic contrast in the pre- and post-tests respectively (see Table 2). Data from the /t/-/d/ phonemic contrast at word initial position was excluded from the data analysis due to the near ceiling identification accuracy in pre-test for both participant groups (mean above 98%). Effect sizes (eta-squared, generalized eta-squared, Hedges’ g_av and Hedges’ g_s) were calculated for significant findings using the spreadsheet provided by Lakens [49] and reported in the results sections of this and the following experiments. F and p values are only reported for significant effects.

Table 2. Mean percentage of correct identification in pre- and post-tests for each phonemic contrast and fixed level of adverse conditions (low and high level of background multi-talker babble) with standard error in parentheses.

https://doi.org/10.1371/journal.pone.0204888.t002

The impact of training in adverse conditions with a low and high fixed level of background babble was examined in a 2 x 2 x 2 mixed ANOVA with the level of background multi-talker babble (low vs. high SNR) as the between-subject factor, and time of test (pre-test vs. post-test) and contrasts (/t/-/d/ final vs. /ε/-/æ/) as the within-subject factors. Overall, participants' perceptual performance was 3.0% better in the post-test (M = 75.8%, SE = 1.80) than in the pre-test (M = 72.8%, SE = 1.96), F(1,26) = 10.42, p < .01, η²_p = 0.29, 95% CI [1.09, 4.93], η²_G = 0.02. There was no interaction between the level of background babble and the time of test, indicating that both levels of background babble yielded similar levels of improvement in identification accuracy. Furthermore, there was no interaction between type of contrast and the time of test, which indicates that the training generalized to the stimuli of the untrained contrast /t/-/d/ final.

The above analysis included trained and untrained stimuli from the /ε/-/æ/ contrast. To assess whether the effect of the trained /ε/-/æ/ stimuli also generalized to untrained /ε/-/æ/ stimuli another mixed ANOVA was conducted (means are presented in Table 3). Results revealed that, as expected, the accuracy in the post-test (M = 77.7%, SE = 2.40) was significantly higher than in the pre-test (M = 72.6%, SE = 1.89), F(1,26) = 8.91, p < .01, η²_p = 0.26, 95% CI [1.60, 8.67], η²_G = 0.04 (5.1% difference). Importantly, there was no interaction between stimulus set (trained vs. untrained /ε/-/æ/) and the time of test, which indicates that the perceptual improvements generalized to the untrained stimuli.
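The partial eta-squared values reported for the two main effects of time of test can be cross-checked from the F statistics and their degrees of freedom using the standard identity η²_p = (F × df_effect) / (F × df_effect + df_error). The short sketch below is only such a consistency check; it is not the Lakens spreadsheet used by the authors, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effect of time of test, all contrasts: F(1,26) = 10.42
eta_all = partial_eta_squared(10.42, 1, 26)

# Main effect of time of test, trained vs. untrained /ε/-/æ/: F(1,26) = 8.91
eta_vowel = partial_eta_squared(8.91, 1, 26)
```

Rounded to two decimals, both values agree with the reported η²_p of 0.29 and 0.26.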

Table 3. Mean percentage of correct identification in pre- and post-tests for each stimulus set and level of background multi-talker babble (with standard error in parentheses).

https://doi.org/10.1371/journal.pone.0204888.t003

HVPT training with a fixed level of background multi-talker babble successfully improved participants’ perceptual performance after a total of just one hour of training. Although participants identified all stimuli with relatively high accuracy in the pre-test (mean 72.8%), the HVPT training in adverse conditions was able to further improve their overall perceptual performance by 3.0%.

Similar to findings from other training studies [1, 2, 24], identification accuracy improvements generalized to the untrained/novel words in post-test. Importantly, improvement was also generalized to stimuli of the untrained contrast /t/-/d/ final (see general discussion for further discussion). As far as we are aware, there is no other HVPT study that has reported generalization of training effects across phonemic contrasts, except for the study by Callan et al. [2]. Their training with the English /r/-/l/ contrast using HVPT benefited identification accuracy of the /b/-/v/ contrast as well.

The current findings also indicate that learning from the training transferred to a different speaker because the speaker of the post-test spoke a different variety of English (i.e., British English). Participants’ performance improved after training regardless of the level of adverse conditions used during training. This suggests that training in adverse conditions improves perceptual performance of non-native speakers and increasing the level of adverse conditions does not seem to influence the training outcomes. However, it is important to note that there were large individual differences in the current study. The percentage identification accuracy in the pre-test varied between 56% and 93%. In this experiment, the level of background multi-talker babble was not adapted to the participants’ individual performance level. Therefore, the participants' auditory system might not have been stressed sufficiently to maximize training benefits.

Experiment 2

In our second experiment, we examined the impact of HVPT training in adaptive adverse conditions (HVPT-AAC). The level of background multi-talker babble (i.e., SNR) was determined before each training session using an adaptive staircase procedure. This individually determined SNR was then used in the subsequent training session with stimuli of the same speaker as used in the adaptive staircase procedure.

An adaptive staircase procedure is a psychometric method used to measure a person’s sensory capabilities and is used to determine the person’s threshold or limit to detect and discriminate similar and confusable physical stimuli [50]. This procedure has been used in a small number of phonetic training studies using acoustically manipulated synthetic stimuli [12, 51]. The acoustic properties of stimuli in these studies were carefully manipulated to produce speech stimuli that systematically differed from each other. As far as we are aware, no natural speech training studies have used an adaptive staircase procedure in a similar way because it would affect the naturalness and variability of the training materials. The adaptive staircase procedure in our HVPT-AAC, however, manipulates the volume of the background multi-talker babble (and thus the SNR) to increase or decrease phonemic discrimination difficulty. Thus, this does not affect the naturalness of the speech stimuli.
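To make the general idea concrete, the following is a minimal sketch of a transformed up-down staircase that adjusts SNR: the babble gets louder (lower SNR) after a run of correct responses and quieter (higher SNR) after an error, and the threshold is estimated from the reversal points. Every parameter here (the 2-down/1-up rule, the 2 dB step, the starting SNR, stopping after six reversals) is an assumption chosen for illustration and should not be read as the procedure used in HVPT-AAC. The simulated listener is likewise hypothetical.

```python
def run_staircase(respond, start_snr=10.0, step=2.0, n_down=2,
                  max_reversals=6, max_trials=200):
    """Estimate an SNR threshold with an n_down-down/1-up rule.

    respond(snr) must return True for a correct response at that SNR.
    """
    snr = start_snr
    streak = 0       # consecutive correct responses at the current level
    direction = 0    # -1 = last move made the task harder, +1 = easier
    reversals = []   # SNR values at which the direction changed
    for _ in range(max_trials):
        if len(reversals) >= max_reversals:
            break
        if respond(snr):
            streak += 1
            if streak < n_down:
                continue     # stay at this level until n_down correct in a row
            streak = 0
            move = -1        # lower SNR: louder babble, harder task
        else:
            streak = 0
            move = +1        # raise SNR: quieter babble, easier task
        if direction and move != direction:
            reversals.append(snr)
        direction = move
        snr += move * step
    return sum(reversals) / len(reversals) if reversals else snr

# Hypothetical deterministic listener: correct whenever the SNR is 0 dB or above.
threshold = run_staircase(lambda snr: snr >= 0.0)
```

With this idealised listener the track descends from the starting level and then alternates between the two levels that straddle the listener's limit. A real listener responds probabilistically, so the estimate would vary from run to run.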

By combining the strengths of existing training methods and individualized levels of adverse conditions, HVPT-AAC is expected to be more effective than HVPT with a fixed level of background multi-talker babble (as used in Experiment 1) in terms of its training improvements and generalizability. Participants in Experiment 2 had more diverse Malaysian English proficiency levels (L1 and L2 Malaysian English speakers) to investigate whether the perceptual performance improvements were modulated by Malaysian English proficiency. It was expected that the training would benefit all Malaysian speakers, irrespective of their English proficiency level, due to the adaptiveness of the adverse conditions in HVPT-AAC.