A Tale of Two Features: Perception of Cantonese Lexical Tone and English Lexical Stress in Cantonese-English Bilinguals

This study investigated the similarities and differences in perception of Cantonese tones and English stress patterns by Cantonese-English bilingual children, adults, and English monolingual adults. All three groups were asked to discriminate pairs of syllables that minimally differed in either Cantonese tone or in English stress. Bilingual children’s performance on tone perception was comparable to their performance on stress perception. By contrast, bilingual adults’ performance on tone perception was lower than their performance on stress perception, and there was a similar pattern in English monolingual adults. Bilingual adults tended to perform better than English monolingual adults on both the tone and stress perception tests. A significant correlation between tone perception and stress perception performance was found in bilingual children but not in bilingual adults. All three groups showed lower accuracy in the high rising-low rising contrast than any of the other 14 Cantonese tone contrasts. The acoustic analyses revealed that average F0, F0 onset, and F0 major slope were the critical acoustic correlates of Cantonese tones, whereas multiple acoustic correlates were salient in English stress, including average F0, spectral balance, duration and intensity. We argue that participants’ difficulty in perceiving high rising-low rising contrasts originated from the contrasts’ similarities in F0 onset and average F0; indeed the difference between their major slopes was the only cue with which to distinguish them. Acoustic-perceptual correlation analyses showed that although the average F0 and F0 onset were associated with tone perception performance in all three groups, F0 major slope was only associated with tone perception in the bilingual adult group. These results support a dynamic interactive account of suprasegmental speech perception by emphasizing the positive prosodic transfer between Cantonese tone and English stress, and the role that level of bilingual language experience and age play in shaping suprasegmental speech perception.


Introduction
Most models of second language speech perception emphasize how differences in phonetic inventories-i.e., the specific vowels and consonants used in a language-affect bilinguals' segmental speech perception [1][2][3]. However, languages also differ considerably in the way they use suprasegmental features, such as the use of lexical tone in Cantonese and the use of lexical stress in English to differentiate one word from another [4]. Cantonese lexical tones are pitch patterns used to minimally distinguish word meanings [5]. For example, the monosyllable /fu/ can represent six different meanings in Cantonese, each with a different tone: /fu55/膚(skin), /fu25/虎 (tiger), /fu33/褲 (trousers), /fu21/符 (symbol), /fu23/婦 (woman) and /fu22/父 (father) (The numerical notational system developed by Chao [6] is used to transcribe six different tones. The numbers 1 (lowest) to 5 (highest) are used to indicate relative height, shape and duration of pitch contour. High, middle and low indicate the relative average F0 while level, rising, falling specify the slope of the F0 contours). English lexical stress can also be used to convey differences in meaning; for example, PREsent/'pre-zənt/ (gift) versus preSENT/pri-'zent/ (offer) (The capitalized letters represent stressed syllables). Like Cantonese tone, English stress uses pitch to convey such differences, but in contrast with tone, it only does so in conjunction with other acoustic parameters such as duration and intensity. These pitch-based differences, in single syllables in the case of tone and between contiguous syllables in the case of stress, have been largely unexplored in speech perception, especially for bilinguals of a tonal language (Cantonese) and a stress language (English). Thus, the goal of the present study is to examine similarities and differences in the perception of Cantonese tones and English stress patterns by Cantonese-English bilinguals, comparing bilingual children, bilingual adults, and English monolingual adults.
The relative weighting of each of these acoustic features in Cantonese tone perception varies across different tone contrasts [9][10][11]. For example, Lee et al. showed that 3-year-old Cantonese children's performance in discriminating the high rising (25, T2)-low falling (21, T4) tone contrast was significantly lower than their performance in discriminating high level (55, T1)-high rising (25, T2) and high level (55, T1)-low falling (21, T4) tone contrasts [10]. Notably, T2 and T4 have similar F0 onset values but different F0 offset values, whereas the tone contrasts that children were better able to discriminate (i.e., T1-T2 and T1-T4) have large differences in F0 onset. The authors concluded that the similarity of two tones' F0 onset plays a much more important role than F0 offset similarity in predicting the relative ease with which tone contrasts can be perceived. Tong and colleagues extended this line of work by showing that 5-to 6-year-old Cantonese children attended to F0 onset for the perception of similarly-contoured tones such as high rising (25, T2)-low rising (23, T5), while using F0 contour to distinguish tones with different contours such as high level (55, T1)-high rising (25, T5) [11].
Similarly to Cantonese tone, English lexical stress involves modulation in pitch. As shown in Fig 1B, stressed syllables tend to have a higher average F0 compared to unstressed syllables. However, unlike Cantonese tones, which use F0 alone to convey phonetic distinctions, the realization of English stress depends on F0 as well as other acoustic features such as duration, intensity, spectral balance (the distribution of intensity over different frequency bands) and vowel quality [12][13][14][15][16].
There is perceptual evidence showing that some of these acoustic parameters-specifically, F0, duration, and intensity-appear to be more important than others in distinguishing stressed syllables from unstressed ones [15,17,18]. For example, Fry found that among the three acoustic correlates of English lexical stress, i.e., average F0, duration, and intensity, F0 is the strongest cue in perceiving English lexical stress, followed by syllable duration and intensity [18]. Bolinger also suggested that average F0 is the primary cue for English lexical stress perception, whereas syllable duration and intensity are secondary cues [17].
Taken together, these findings suggest that the perception of Cantonese tone and English lexical stress rely on some similar acoustic concomitants. This raises the possibility that certain acoustic cues, such as F0, may be employed by Cantonese-English bilinguals to perceive both Cantonese tones and English stress patterns. Given that these previous findings are derived exclusively from empirical studies of monolingual speakers of Cantonese and English, there is a remarkable lack of evidence addressing this question in bilingual speakers. Empirical work is needed to determine how different acoustic correlates-including F0 onset, average F0, F0 contour, duration, intensity, and spectral balance-are employed by Cantonese-English bilingual speakers to perceive Cantonese tones and English stress patterns, and furthermore to identify the relationships between bilinguals' sensitivity to acoustic cues and their ability to perceive tones and stress patterns. Thus, the first aim of the present study is to examine Cantonese-English bilingual speakers' relative use of different tone-and/or stress-related acoustic correlates in perceiving Cantonese tones and English stress patterns, compared to the use of these cues by English monolinguals. We also examine the relationships between bilinguals' sensitivity to acoustic cues and their performance in perceiving tone and stress contrasts. Normalized F0 traces of (a) six Cantonese tones and (b) two English stress patterns in the stimuli /sula/. Both tone and stress stimuli were produced by a bilingual female Cantonese speaker. /su1la1/ = high level (55), /su2 la/ = high rising (25), /su3la1/ = mid level (33), /su4la1/ = low falling (21), /su5la1/ = low rising (23), /su6la1/ = low level (22). /'sula/ = initial stressed syllable, /su'la/ = final stressed syllable.
that experience with tonal Cantonese influences the use of acoustic cues in non-native English stress perception [27]. Specifically, native Cantonese speakers used F0 as the primary acoustic cue to perceive English stress patterns, whereas English speakers used spectral balance as the primary cue.
Collectively, findings from these studies on cross-language perception of suprasegmental speech contrasts have alluded to the possibility that Cantonese lexical tone and English lexical stress are processed interactively, at least in the very early stage of acoustic-phonetic processing, because they involve processing some of the same acoustic features, such as fundamental frequency (F0). This interdependent and interactive processing of suprasegmental speech features may be even more evident for Cantonese-English bilinguals who speak one language that employs tones (Cantonese) and another language that employs stress patterns (English), as these bilinguals may build a holistic, common representation of suprasegmental features for their two languages. Thus, the second aim of the present study is to explore this possibility by examining the similarities and differences between Cantonese tone perception and English stress perception in Cantonese-English bilinguals.
These previous studies pointing to the effects of one's native language on non-native suprasegmental speech perception are almost exclusively done with skilled adults. However, a very recent study by Quan & Swingley has shown that age affects native English lexical stress perception among monolingual English speakers [28]; specifically, young children's English lexical stress perception was worse than skilled English-speaking adults. The authors argued that adults had far more language experience than children, leading to differences between the groups in English stress perception. There is also compelling evidence showing auditory capabilities mature with age, with children using different acoustic cues than adults in auditory processing tasks [29,30]. Findings from these previous studies suggest that children and adults have different levels of linguistic knowledge about their languages and that children have different phonological processing capabilities than adults. Thus, from a theoretical viewpoint, a comparison between bilingual children and adults in the perception of Cantonese tones and English stress patterns sheds new light on the role of linguistic knowledge in the perception of suprasegmental speech. Furthermore, studying differences in the use of acoustic cues by bilingual children and bilingual adults offers insights into the acoustic cues underlying the perception of Cantonese tone and English stress across different ages of bilinguals. Given that no studies to date have examined or compared the similarities and differences between developing bilingual children and skilled bilingual adults in perceiving Cantonese tones and English stress patterns, the third aim of the present study is to determine the effect of linguistic knowledge on Cantonese tone and English stress perception by comparing children with skilled adults.

The Present Study
An overarching goal of the present study is to test a dynamic interactive theory of suprasegmental speech processing in bilinguals. This theory would argue that Cantonese-English bilinguals process tone and stress interactively in a way that varies as a function of the development of their linguistic knowledge of each language's prosodic system. The dynamic interactive theory of suprasegmental speech processing incorporates both between-language and within-language interactions by assuming an interaction between tone and stress, and an interaction between top-down (unconscious expectation) and bottom-up (acoustic analysis of the stimuli) processes within each language. These interactions evolve and develop over an extended period of time, and therefore are expected to vary as a function of developmental differences in bilinguals' linguistic knowledge of the two prosodic systems. To test this account, the current study evaluates the differences and similarities between perception of tone and stress in Cantonese-English bilingual children and adults, as well as English monolingual adults. There are three related but distinct hypotheses relevant to this account. The first hypothesis is that if tone and stress are interdependently and interactively processed, there will be a significant positive correlation between Cantonese tone perception and English stress perception in both bilinguals and monolinguals. This hypothesis rests on the acoustic and perceptual evidence showing that Cantonese tone and English stress share similar acoustic concomitants [8,18], and that the perception of tone and stress involve processing of the same acoustic feature, i.e., F0 [16,27].
The second hypothesis is that if tone and stress are interactively and dynamically processed, we will find differences between bilingual children and adults in their perception of Cantonese tone and English stress due to age-related differences in their acoustic-phonetic processing capabilities. Given their more extensive L1 and L2 exposure, bilingual adults have much richer experience in the use of Cantonese tone and English stress than bilingual children do, and their understanding of the languages' tone and stress systems is much more sophisticated and finegrained than that of bilingual children. They can generate expectations about acoustic cues that are crucial to distinguishing tone and stress contrasts much more quickly than bilingual children, and they can make more efficient use of acoustic features to verify those expectations. Thus, the interactive processing of tone and stress is much faster and more efficient in bilingual adults than in bilingual children, and the interaction between top-down expectations and bottom-up acoustic analyses is likewise more efficient. Conversely, due to their developmental immaturity and their less extensive linguistic knowledge of L1 and L2, bilingual children may not preserve fine-grained acoustic details of Cantonese tone and English stress. Thus, they may not yet have developed subtle sensitivity to the acoustic cues that distinguish tone and stress contrasts, particularly less dominant cues (such as major slope for lexical tone). This would be evidenced by differences in both tone perception and stress perception performance, and also in the relative use of different acoustic cues to distinguish difficult tone contrasts, such as high rising (25)-low rising (23).
The third hypothesis is that if Cantonese tones and English stress patterns are processed interactively, we would expect prosodic transfer between Cantonese tones and English stress in Cantonese-English bilinguals but not in English monolinguals. That is, we would expect that Cantonese-English bilinguals, at least bilingual adults, will show better performance in perceiving Cantonese tones and English stress patterns than English monolingual adults, because under the dynamic interactive account, tones and stress would be able to mutually facilitate each other during the course of suprasegmental processing. Moreover, we would expect our three participant groups (both bilinguals and monolinguals) to use acoustic correlates common to tones and stress, such as F0, in perceiving both Cantonese tones and English stress, and we would expect significant positive associations between F0 acoustic correlates and tone perception in the three groups.
These three hypotheses were examined with an AXB perceptual experiment, as well as an extensive analysis of all acoustic correlates of the large set of Cantonese tones and English stress stimuli used in our perceptual experiment. Most previous studies on suprasegmental speech perception have used a relatively small number of perceptual stimuli, and as such have lacked the type of systematic acoustic analyses that might shed light on the acoustic features that underlie the perception of suprasegmental speech features. By including a large set of suprasegmental stimuli and carrying out such analyses in the current study, we will be better able to draw detailed conclusions about the processes by which Cantonese-English bilinguals' perceive the suprasegmental features of speech. By testing these hypotheses in Cantonese-English bilinguals for whom one language employs lexical tone (Cantonese) and the other employs lexical stress (English), we are able to examine the extent to which one's native language experience (e.g., to a tonal language) influences the perception of suprasegmental features of another language (e.g., stress). This will help us to better understand the cross-language representation and transfer of prosodic features.

Participants
Three groups of participants were included in the present study: 30 Cantonese-English bilingual children (20 girls, 10 boys; mean age = 11.70 yrs, SD = 2.28 yrs), 30 Cantonese-English bilingual adults (11 female, 19 male; mean age = 21.06 yrs, SD = 1.22 yrs); and 15 English monolingual adults (13 female, 2 male; mean age = 24 yrs, SD = 4.54 yrs). Eleven-year-old children were selected for this study because of clear evidence showing that children at this age perform as well as adults in perceiving Cantonese tones [31] and English lexical stress [32]. Cantonese-English bilingual children were recruited from Hong Kong International Primary Schools or English-medium schools. Based on the parent-report questionnaire data, the children included in our study were all from families with a medium to high income level (the monthly family income range in Hong Kong is 3,867-6,445 USD according to the Hong Kong Census and Statistics Department [33]). All children's mothers had attained education at a bachelor's level or above. The Cantonese-English bilingual adults in our study were all undergraduate students at the University of Hong Kong, an English-medium university, and the English monolingual adults were either university students or staff in Hong Kong. We obtained written informed consent from all adult participants and from the parents of all participating children. The research protocol was approved by the Human Research Ethics Committee for Non-Clinical Faculties of The University of Hong Kong (EA040711).
Participants' bilingual or monolingual status was assessed with a revised Language, Social and Music Background Questionnaire that was adopted from Bialystok and Depape [34]. The questionnaire was administered to the parents of participating children (parent-report) and the adult participants (self-report) prior to testing, and it collected information about the participant's first and second language exposure, level of fluency in both languages, and home literacy practices. The questionnaire also gathered information about the participant's parents' occupation, income, and education level, which was then indexed as the participant's socioeconomic status. The children's parents and the adult participants were asked to rate their oral and aural ability in Cantonese and English on a 5-point scale ranging from poor (1) to excellent (5). Our analyses of parents' reported data showed that the mean percentage of time that their child spoke in English and Cantonese each day was 53.17% and 46.83%, respectively. Parents' ratings of spoken language proficiency suggested that the Cantonese-English bilingual children's Cantonese spoken proficiency (M = 3.80, SD = 1. Parents reported that Cantonese-English bilingual children start learning English at birth. Cantonese-English bilingual adults reported that they started learning English at 3.73 years old (SD = 2.16 yrs). English monolingual adults reported English as their native language, and they had no experience learning Chinese or any other tone language.
According to parents' reports and adults' self-report, our participants had no history of hearing, language and learning difficulties. All of them passed a pure-tone hearing screen (250, 500, 1000, 2000, 4000HZ at 20dB SPL).

Materials
Fifty-five pseudowords were constructed as stimuli (for examples, see Fig 1A and 1B). These pseudowords complied with the phonotactic rules of both Cantonese and English, but none of them carried any lexical meanings. The use of pseudowords eliminates any influence of the properties of lexicons and intrinsic knowledge of pitch, loudness and duration of vowels of real words on the perception of linguistic pitch [15]. All of the pseudowords were disyllabic with each syllable bearing a Consonant-Vowel (CV) structure. The CV syllable was used because it is the core syllable in both Cantonese and English, and it is universal in all languages [35][36][37]. The CVCV structure has also been successfully used to assess the perception of suprasegmental features of speech in different languages [16,27,38].
The 55 pseudowords were created by using 11 consonants (/b/, /p/, /d/, /t/, /g/, /k/, /m/, /h/, /l/, /s/, and /f/) and five vowels (/a/, /ε/, /i/, /ɔ/, and /u/). There are three theoretical reasons for selecting these consonants and vowels. First, they have similar phonetic realization in both Cantonese and English, and they are commonly used in both languages [7,39]. Thus, the use of these consonants and vowels can control for the familiarity of phonetic segments across languages, which recent research has shown to influence perception of suprasegmental contrasts [11]. Second, the eleven consonants cover five different places of articulation, (i.e., bilabial /b/, /p/ and /m/, labial-dental /f/, alveolar /d/, /t/, /s/ and /l/, velar /g/, /k/ and glottal /h/), as well as four manners of articulation (i.e., plosive /b/, /p/, /d/, /t/, /g/ and /k/, fricative /f/, /s/ and /h/, nasal /m/, and lateral approximant /l/), and the five vowels selected are commonly used in both Cantonese and English. The combination of these consonants and vowels can create large sets of stimuli containing most of the possible acoustic variability existing in lexical tone and stress perception outside the laboratory. Third, language acquisition research has shown that the acquisition order of these consonants is the same between Cantonese and English, despite differences in acquisition rate [40,41]. Furthermore, according to the norms in the Hong Kong Cantonese Articulation Test, these consonants are all acquired before school age. Thus, the use of these consonants and vowels eliminates the possible influences of the acquisition order of phonetic segments on perception of suprasegmental features.
The consonants and vowels were assigned randomly into CVCV structures to form disyllabic pseudowords (e.g., /sula/, /hɔmi/, /lamu/), with the restriction that none of the CVCV structures were allowed to contain any repeating consonants or vowels. These pseudowords were used to carry Cantonese lexical tones and English lexical stress.
Cantonese tone perception test. There were two types of CVCV Cantonese stimuli; they consisted of the same phonetic segments and differed only in the location of constant tone. In the first type, the tone of the first syllable remained constant, i.e., high level tone (55, T1), while the second syllable carried six tones, varying from tone 1 to 6 (see Fig 1A). In the second type, the tone of the first syllable varied from tone 1 to 6, while the second syllable carried constant tone. In both cases, the high level tone (55, T1) was used as the constant tone because, firstly, it has the highest occurrence in Cantonese [42]; both the speakers recording pseudowords and the participants should therefore be most familiar with this tone. Secondly, Hong Kong Cantonese speakers find tone contrasts with high level tones easier to distinguish, as the high level tone is well separated from the other five tones in terms of its pitch value [43,44]. Therefore, it should exert the least influence on production in the stimulus recordings and the participants' perception in the experiment.
A total of 15 tone contrasts were formed by pairing the six distinct Cantonese tones: T1-T2,  T1-T3, T1-T4, T1-T5, T1-T6, T2-T3, T2-T4, T2-T5, T2-T6, T3-T4, T3-T5, T3-T6, T4-T5,  T4-T6, and T5-T6. These 15 tone contrasts were carried by our 55 CVCV pseudoword sequences. A balanced latin-square design was used to control for the variation of tone contrasts and CVCV structures. The 15 tone contrasts and 55 CVCV sequences were both randomly grouped into five blocks, with each block comprising 11 CVCV sequences carrying three tone contrasts. With this design, we generated five versions of the stimulus set, each having 165 trials (3 tone contrasts × 11 CVCV sequences × 5 blocks). This design enabled the inclusion of all 55 CVCV stimuli and all 15 tone contrasts in each version. In addition, to counterbalance the order of the position of the constant tone, two types of Cantonese stimuli (1 st syllable with the constant tone; 2 nd syllable with the constant tone) were used. Thus, a total of 10 versions of the Cantonese stimuli were used for the tone perception test.
English stress perception test. The same 55 CVCV sequences were used to construct English stress stimuli (see Fig 1B). Stress was placed on either the first or the second syllable, producing initial-stressed (IS) and final-stressed (FS) pseudowords, respectively. This yielded total of 110 stimuli (55 CVCV × 2 stressed conditions). An example of six Cantonese tones and two English stress patterns that were carried by the CVCV sequence /sula/ is shown in Fig 1A and 1B, respectively.

Recording and Segmentation
The items carrying both Cantonese tones and English stress were recorded three times each by a female bilingual who had received phonetic training and had a linguistic background in both Cantonese and English. She was born in Hong Kong, but lived in an English-speaking country for 10 years, and her Cantonese and English had both reached native-like levels of proficiency (according to ratings of her production of Cantonese and English words by native speakers of Cantonese and English, respectively). She was informed about the background of the experiment and was aware that the items recorded would be used as stimuli in the experiment.
All utterances were recorded in a soundproof booth and were digitized at a sampling rate of 44.1 kHz (16 bit, mono) using the phonetic analysis software Praat [45]. The tone and stress stimuli were written on slides and presented to the speaker separately, with each slide showing one written stimulus in IPA transcription. The speaker was asked to produce each stimulus three times, and was asked to count '1, 2' after each stimulus to confirm that she was ready for the next stimulus. The pace of the slide shows was controlled by the experimenter. Tone and stress stimuli were recorded separately, while the order of stimuli within each recording session was randomized. Practice trials were provided before recording began. The speaker was asked to produce the stimuli at a tone level she was comfortable with. A total of 2310 stimuli were recorded (55 CVCV × 6 tones × 2 types × 3 repetitions + 55 CVCV × 2 stress conditions × 3 repetitions = 1980 + 330 = 2310). The recordings were made over two separate days.
All recordings were edited using Praat [45]. We excluded all recorded stimuli that contained disfluencies, that were unclear or had unstable loudness, or that were of extreme durations (longer than 1000ms or shorter than 700ms). All segmentations were made at zero crossings, as determined auditorily and by visual inspection of waveforms and spectrograms. The selected stimuli for both the Cantonese tone and English stress perception tests were between 800ms and 1000ms, and the intensity of all recordings were between 50dB to 66dB. A total of 660 (55 CVCV × 2 types × 6 tones) tone stimuli and 110 (55 CVCV × 2 stress conditions) stress stimuli were selected for the perception tests.

Procedure
The experiment was conducted in a soundproof room at the University of Hong Kong. We used the AXB discrimination paradigm that has been widely used in previous research on speech perception with both adults and children [46][47][48]. A trial consisted of three auditory stimuli (with either Cantonese tone or English stress) presented sequentially to participants, separated by 400ms. In each trial, either the first (e.g., tone trial: /ku1ma5/, /ku1ma5/, /ku1ma2/; stress trial: /kuˈma/, /kuˈma/, /ˈkuma/) or the third stimulus (e.g., tone trial: /la1mu2/, /la1mu4/, /la1mu4/; stress trial: /laˈmu/, /ˈlamu/, /ˈlamu/) in the sequences shared the same tone/stress pattern as the second stimulus. The participants were asked to decide whether the second stimulus sounded more like (or the same as) the first or the last stimulus within 3000ms by pressing the "1" or "3" button on the keyboard, respectively. A new trial automatically started if the participants did not make any response within 3000ms [49,50]. There were four AXB orders (AAB, ABB, BAA, BBA) in terms of target position and choice position. In order to avoid any bias due to choice positions, these four orders of choice and target arrangement were randomly assigned to the stimuli. Testing began with practice trials, followed immediately by the test trials. The inter-trial interval was 400ms. We used 3000ms as the interstimuli interval and 400ms as the inter-trial interval because the auditory span was 200-300ms and the processing time of 250ms should be sufficient for recognizing a speech signal [51].
All stimuli were played automatically in a randomized order by the Java program on an IBM Thinkpad Lenovo laptop computer. Both Cantonese tone and English stress stimuli were set to have average level of 63dB SPL. The participants listened to the stimuli via headphones connected to the computer. Accuracy and reaction time were automatically recorded by the program. The Cantonese tone and English stress perception tests each took 10-15 minutes, and a break was given between the two sessions.

Results
One lexical tone stimulus (LUki4) was excluded from all the analyses due to experimental manipulation error. Preliminary data analyses of the lexical tone perception task showed no significant main effect of the position of the constant tone (1 st syllable vs. 2 nd syllable) for either accuracy, F(1, 73) = 0.00, p = .995 or response latency, F(1, 73) = 1.42, p = .238. Thus, the two types of Cantonese tone stimuli were collapsed in the following data analyses.

Differences in Perception of Cantonese Tone and English Stress by the Three Groups
The three participant groups' mean proportion of correct responses (i.e., accuracy rates) and mean response latencies (i.e., reaction times) for Cantonese tone contrasts and English stress contrasts are shown in Fig 2A and 2B, respectively. The mean response latencies reported here are based on items that were responded to correctly. We excluded trials on which participants made extremely fast or slow responses (i.e., more than 2SDs above or below the mean per condition per subject) from our latency analyses; this amounted to less than 5% of all trials.
As shown in Fig  Meanwhile, the simple main effect of group was also significant for both tone perception, F(2, 72) = 4.76, p < .05, η 2 p = .12, and stress perception, F(2, 72) = 16.95, p < .001, η 2 p = .32. A Fisher's least significant difference (LSD) post hoc test revealed that Cantonese-English bilingual adults had significantly higher accuracy rates than Cantonese-English bilingual children on tone (p < .05) and stress (p < .001) perception. Cantonese-English bilingual adults performed better than English monolingual adults at tone perception (p < .01), and their performance on stress perception was marginally higher than English monolingual adults (p = .061). Cantonese-English bilingual children had lower accuracy on stress perception than English monolingual adults (p < .05), with no difference in tone perception (p = .446).
The mean response latencies for Cantonese tone contrasts and English stress contrasts were submitted to a similar two-way ANOVA. There were significant main effects of pitch contrast, F(1, 72) = 19.17, p < .001, η 2 p = .21, and group, F(2, 72) = 15.98, p < .001, η 2 p = .31. The interaction between pitch contrast and group was also significant, F(2, 72) = 8.13, p < .01, η 2 p = .18. As shown in Fig 2B, further analysis of the interaction effect showed that Cantonese-English bilingual children's response latencies to Cantonese tone contrasts (M = 446ms) were statistically indistinguishable from their latencies to English stress contrasts (M = 450ms), t(29) = -.19, p = .848. However, Cantonese-English bilingual adults showed significantly longer response latencies toward Cantonese tone contrasts (M = 359ms) than English stress contrasts (M = 250ms), t(29) = 7.48, p < .001. Similarly, English monolingual adults tended to show slower responses to Cantonese tone contrasts (M = 466ms) than to English stress contrasts (M = 396ms), p = .070. Additionally, a Fisher's LSD post hoc test revealed that Cantonese-English adults responded more quickly than Cantonese-English bilingual children on the tone perception (p < .01), and stress perception (p < .001) tests. Similarly, bilingual adults responded more quickly than English monolingual adults on both the tone perception (p < .01), and stress perception (p < .01) tests. However, there was no difference between the response latencies of bilingual children and English monolingual adults on either the tone perception test (p = .530) or the stress perception test (p = .192).
Next, we conducted correlational analyses on the accuracy of Cantonese tone and English stress perception, exploring their association separately for each group. We found a significant correlation between Cantonese tone perception and English stress perception accuracy in Cantonese-English bilingual children (r = .73, p < .001) and English monolingual adults (r = .79, p < .01). However, there was no significant correlation between tone perception and stress perception accuracy in Cantonese-English bilingual adults (r = .13, p = .49). To examine if the patterns of tone perception vary across three groups, we first conducted an overall omnibus test on the effects of group and tone contrast by using a 3 × 15 mixeddesign analysis of variance (ANOVA) with group as a between-subjects factor and tone contrast as a within-subjects factor. There was a significant main effect of group, F(2, 72) = 5.16, p < .01,  25. The crucial result was a significant interaction between group and tone contrast, F(28, 1008) = 2.46, p < .001, η 2 p = .06. Thus, in the subsequent analyses, we examined group differences in each individual tone contrast. We also examined the perceptual differences of 15 tone contrasts separately for each group.

Perceptual Saliency of Different Tone Contrasts Across Groups
The simple main effect analysis of group difference in each individual tone contrast showed that the three groups' accuracy rates differed for the tone contrasts T2-T5 [F(2, 72) = 7.61, p < .01, η 2 p = .17]; T2-T4 [F(2, 72) = 3.76, p < .05, η 2 p = .10]; T1-T3 [F(2, 72) = 8.28, p < .01, η 2 p = .19]. Post-hoc tests with Bonferroni correction revealed that Cantonese-English bilingual adults' accuracy scores were higher than those of English monolingual adults for all of these tone contrasts (ps < .05). However, there were no differences between Cantonese-English bilingual children and English monolingual adults on accuracy for these tone contrasts (ps > .05), with the T1-T3 contrast being the only exception (p < .05). Cantonese-English bilingual adults' accuracy on the T2-T5 contrast was significantly higher than that of Cantonese-English bilingual children (p < .05); however, there were no other significant differences between the bilingual adults and children.
There was a trend that three groups' accuracy rates differed in Post-hoc tests with Bonferroni correction revealed that Cantonese-English bilingual adults' accuracy rates tended to be higher than that of English monolingual adults (ps < .10) only; there were no significant differences in these tone contrasts between Cantonese-English bilingual adults and children (ps > .05), as well as Cantonese-English bilingual children and English monolingual adults(ps > .05).
We also conducted a separate ANOVA for each group to compare the mean accuracy scores for all 15 tone contrasts. For Cantonese-English bilingual children, there was a significant difference in accuracy rates across the 15 tone contrasts, F(14, 406) = 12.03, p < .001, η 2 p = .29. Pairwise comparisons with Bonferroni correction revealed that their performance on T2-T5 (M = .75) was significantly lower than their performance on all other tone contrasts (ps < .05). There were no significant differences in performance between other contrasts (ps > .05).
To summarize, bilingual children reached adult levels of accuracy on most of the 15 tone contrasts (see Fig 3), but not all-accuracy on the T2-T5 contrast, specifically, was low. Moreover, bilingual children had better performance than English monolingual adults on T1-T3, but this was the only tone contrast on which they outperformed the adults.

Acoustic Analyses of Cantonese Lexical Tone and English Stress
Our two perception tests each included a large set of tone (660 tokens) and stress (110 tokens) stimuli, which allowed us to conduct a systematic analysis of all possible acoustic correlates of Cantonese tones and English stress patterns to examine the critical acoustic correlates that determine the perceptual saliency of Cantonese tone and English stress. Using Praat acoustic software [45], we extracted a wide range of acoustic parameters: vowel duration (in ms), fundamental frequency (F0, in Hz), average intensity (in dB), spectral balance, and formant frequency (F1 and F2), all of which have been demonstrated in different studies to be used by speakers in perceiving or producing tone or stress [15,16].
In our analyses, there were a total of 660 Cantonese tone tokens and 110 English stress tokens. Each token was measured according to all the above listed acoustic parameters. For Cantonese tone tokens, we analyzed all tone syllables with tone placing on the first syllable given that there was no perceptual difference in perceiving tones in the first or the second syllable. The analyses were based on the acoustic measures of all tokens of the same tone. There were 110 tokens for each tone, whereas there were 55 tokens for each type of stress pattern. The analyses of tone are reported first, followed by analyses of stress. Table 1 shows means and standard deviations of duration, average F0, F0 onset, F0 offset, F0 general slope, F0 displacement, F0 major slope, intensity, F1, and F2 for all six Cantonese tones. An ANOVA was carried out to examine differences among the six Cantonese tones in terms of each of those acoustic parameters.

Acoustic Correlates of Six Cantonese Tones
Duration. For our purposes, duration refers specifically to vowel duration in milliseconds. Results of the ANOVA of average duration across the six tones showed a significant main effect of tone category, F(5, 270) = 131.03, p < .001, η 2 p = .71. Pairwise comparisons with Bonferroni correction suggested that T1, T4, and T6 duration values were statistically indistinguishable from one another (ps = 1.00). By contrast, the duration values for T2, T5, and T3 were statistically different from one another (ps < .001), with high rising T2 having the longest duration (M = 311.62ms) followed by the low rising tone T5 (M = 288.05ms) and then by the level tone  Tone Contour. We first used one simple measure to quantify tone contour: F0 general slope. F0 general slope was defined as the general gradient between F0 onset and F0 offset. It was calculated by dividing the difference between F0 offset and F0 onset by duration. F0 general slope is a directional measure that shows the F0 movement of tones. It was expected that the level tones (T1, T3, T6) would have similar general slopes while rising tones (T2, T5) and falling tone (T4) would have more positive and more negative general slopes, respectively. Also, we expected that the high rising tone (T2) might have a greater positive general slope than the low rising tone (T5).
We noticed that T3, a mid-level tone, had a significantly different F0 general slope than the two other level tones, T1 and T6. Since T3 was shown to have longer duration (M = 250.68ms) than both T1 (M = 230.64ms) and T6 (M = 229.16ms), the question arose as to whether such duration differences led to the difference in F0 general slope between the three level tones. Thus, we took out the time factor from our F0 general slope measure and examined F0 displacement. F0 displacement was a directional measure of a change from F0 onset to F0 offset.
The results presented up until now showed that high level T1 and low falling T4 had similar F0 general slopes and F0 displacements, which indicated similar tone contour. We speculated that these two measures of tone contour, i.e., both F0 general slope and F0 displacement, were influenced by tonal context effects resulting from the disyllabic pseudowords used in the present study. In those disyllabic stimuli, the first syllable carried the experimental tones ranging from tone 1 to 6, whereas the second syllable always carried the high level tone T1 (a control tone). Since T4 had the lowest F0 offset (182.01Hz) among all the tones, its F0 offset would be substantially affected by the following T1, which has the highest F0 onset among all other tones (315.99 Hz). This design feature may lead to a tonal context effect in the F0 offset of T4, thereby making the F0 general slope value and F0 displacement of T4 less negative than it might be otherwise.
Indications of such a tonal context effect are evident in Fig 1A. According to the F0 traces depicted in Fig 1A, the contours of the three level tones, i.e., high level T1, mid level T3, and low level T6, were linear. But the contour of T4 was a concave upward curve: the F0 movement tendency of T4 gradually changes from falling to rising when it comes closer to its end. This could be due to the influence of the following high level T1. Similarly, the two rising tones T2 and T5 also exhibited an obvious concave upward curve. On the basis of our results, it seems that the F0 general slope measure was not sensitive to a concave function for tone contours. Therefore, we conducted a fine-grained analysis of the tone contours of the rising tones and falling tone by examining the falling and rising portions of the F0 traces for T2, T4, and T5.
Falling Contour for T2, T4, and T5. The falling portions of T2, T4, and T5's contours were examined using three different acoustic parameters: F0 fall displacement, F0 fall time, and F0 fall slope. F0 fall displacement was defined as a change in F0 from F0 onset to F0 minimum (in Hz). F0 fall time was the duration between F0 onset and F0 minimum (in ms). F0 fall slope referred to the gradient between F0 minimum and F0 onset (in Hz/ms). It was calculated by dividing F0 fall displacement by F0 fall time. Table 2 displays the means and standard deviations of F0 fall displacement, F0 fall time, and F0 fall slope for T2, T4, and T5.
Up to this point, our results indicate that T2 and T5 shared similar falling portions, and that they were both different from T4. As shown in our earlier analyses, T2 and T5 also had similar F0 onset, F0 fall displacement, F0 fall time, and F0 fall slope. Thus, we further analyzed the rising contours of these three contour tones in order to examine whether rising contour was the key determiner of the perceptual distinction between T2 and T5.
Rising Contour for T2, T4 and T5. We used three similar measures to examine the rising portions of the contours of T2, T4, and T5: F0 rise displacement, F0 rise time, and F0 rise slope. F0 rise displacement referred to a change in F0 from F0 minimum to F0 offset (in Hz). F0 rise time referred to the duration between F0 minimum and F0 offset (in ms). F0 rise slope referred to the gradient between the F0 offset and F0 minimum (in Hz/ms) and it was calculated by dividing F0 rise displacement by F0 rise time. Table 2 shows means and standard deviations of F0 rise displacement, F0 rise time, and F0 rise slope for T2, T4, and T5.
Results of the analysis of F0 rise displacement showed a significant main effect of tone category, Further, we calculated the ratio of F0 rise displacement to F0 fall displacement, referred to as the R-F D-ratio, and the ratio of F0 rise time to F0 fall time, referred to as the R-F T-ratio, in order to directly compare the rising portions to the falling portions of T2, T4, and T5's contours. Table 2 shows means and standard deviations of the R-F D-ratios and R-F T-ratios for T2, T4, and T5. Results of an ANOVA on the R-F D-ratio showed a significant main effect of tone category, F(2, 108) = 64.26, p < .001, η 2 p = .54. Pairwise comparisons with Bonferroni correction suggested that there was no difference between T2 (M = 2.40) and T5 (M = 2.14) in R-F D-ratio (p = .76), but each of them was significantly larger than T4 (M = .24) in R-F D-ratio (ps < .001). Similarly, the main effect of tone category was also significant in an analysis of the R-F T-ratio, F (2, 108) = 119.37, p < .001, η 2 p = .69. Pairwise comparisons with Bonferroni correction suggested that the two rising tones T2 (M = 2.01) and T5 (M = 1.90) were statistically indistinguishable from each other (p = .96), and that each of them was significantly larger than T4 (p < .05).
To summarize, our analyses of the falling and rising portions of tones' contours showed that T2 and T5 were significantly different from each other in the rising portions but not in the falling portions, and that the length of the rising portions of T2 and T5 were about twice the length of their own falling portion. These results raised the possibility that a listener in the present study might just focus on the rising portion in their perceptual discrimination of T2 and T5. Conversely, for T4, a listener might just focus on the falling portion in perceiving the contextually-affected T4 tone in this study.
To examine this possibility, we used another F0 slope measure, F0 major slope, to compare the most distinctive tone contour among the six Cantonese lexical tones. The F0 major slope referred to the F0 rising slope for the rising tones T2 and T5, while the F0 major slope for the falling tone T4 was its falling slope. For the level tones T1, T3, and T6, their F0 major slopes were the same as their F0 general slopes.
Intensity. An ANOVA of average intensity (measured in dB) showed a significant main effect of tone category, Spectral Balance. Consistent with Sluijter and van Heuven's (1996) study of spectral balance on English lexical stress, in this study spectral balance was measured in four contiguous frequency bands: Band 1 (0-500Hz), Band 2 (500-1000Hz), Band 3 (1000-2000Hz), and Band 4 (2000-4000Hz). Spectral balance was investigated only on the vowels of the CVCV stimuli. Band energy of each frequency band for a vowel was obtained from Praat by analyzing its stable portion of no less than 70ms in duration. To remove the factor of duration in the analysis of spectral balance, band energy in squared pressure seconds (Pa 2 sec) was converted to sound pressure level (dB) by the following formula: Sound pressure level in dB ¼ 10log 10 Band energy in Pa2 sec time in second 4 Â 10 À10 in Pa2 sec 0 @ 1 A Table 3 shows the means and standard deviations of B1, B2, B3, and B4 for the six Cantonese tones. To examine differences in the spectral balance of the tones, we carried out a two-way repeated measures ANOVA with both tones (T1, T2, T3, T4, T5, T6)  For Band 2, the main effect of tone category was significant, F(5, 270) = 51.32, p < .001, η 2 p = .49. Pairwise comparisons with Bonferroni correction suggested that three pairs of tones were statistically indistinguishable from one another: T1 and T3 (p = 1.00), T2 and T4 (p = .16), T2 and T6 (p = 1.00); all other pairwise comparisons of overall spectral energy were statistically significant (ps < .05).
For Band 3, the main effect of tone category was significant, F(5, 270) = 94.95, p < .001, η 2 p = .64. Pairwise comparisons with Bonferroni correction suggested that only two pairs of tones, T1 and T3 (p = 1.00) and T5 and T6 (p = .06), were statistically indistinguishable from each other, while all other pairwise comparisons of overall spectral energy were statistically significant (ps < .05).
For Band 4, the main effect of tone category was significant, F(5, 270) = 35.84, p < .001, η 2 p = .40. Pairwise comparisons with Bonferroni correction suggested that three pairs of tones, T1 and T3 (p = 1.00), T2 and T6 (p = .12), T5 and T6 (p = 1.00), were statistically indistinguishable from one another, while all other pairwise comparisons of overall spectral energy were statistically significant (ps < .05). Thus, only T1 and T3 shared the same spectral energy across the four bands, as depicted in Fig 4. Estimated Importance of Different Acoustic Correlates of Cantonese Tone. To further estimate the overall percentage variation of attributable to different acoustic correlates of Cantonese tone, we divided the sum of the absolute differences between the mean of the acoustic correlates of all 15 tone contrasts by the sum of the absolute mean of the acoustic correlates of all six Cantonese tones. The equation is shown below. Using that equation, the estimated overall variation in all tone contrasts accounted for by the different acoustic correlates, in descending order, were: average F0 (56%), F0 onset (48%), F0 offset (45%), duration (40%), F0 major slope (38%), B3 (38%), B4 (22%), B2 (21%), B1 (9%), Intensity (6%).
Estimated overall variation ¼ jX T2 À X T5 j 1 þ jX T3 À X T6 j 2 þ Á Á Á þ jX T1 À X T2 j 15 jX T1 j þ jX T2 j þ jX T3 j þ jX T4 j þ jX T5 j þ jX T6 j Of all 10 possible acoustic correlates of Cantonese tones, then, intensity had the lowest estimated overall variation. Since the intensity cue is very subtle compared to other acoustic cues in terms of estimated overall variation, a native Cantonese speaker may attend to other more salient and more reliable cues. Also, intensity is a weak cue for stress perception, suggesting that it may not be useful to native English speakers. Thus, the intensity cue was not significantly used by any of the three groups in this study.
In addition, it is interesting that although the tone contour measure of F0 major slope had very large estimated overall variation, and although 14 out of 15 tone contrasts (93%) had statistically different F0 major slope, only adult bilinguals could use tone contour as one of the general perceptual mechanisms to process tones across all tone contrasts.
Duration. As in the Cantonese tone analysis described above, duration refers specifically to vowel duration. The ANOVA showed that there was a significant main effect of stress condition,  Intensity. Results of the ANOVA on intensity showed a significant main effect of stress condition, F(3, 162) = 169.10, p < .001, η 2 p = .76. Pairwise comparisons with Bonferroni correction suggested that the intensity values of all stress conditions were statistically different from one another (ps < .001). The intensity values for these four stress conditions, in descending order, were: TS1 (M = 66.60dB), IS2 (M = 64.12dB), IS1 (M = 62.57dB) and TS2 (M = 59.69dB).
Spectral Balance. Fig 5 shows spectral balance of TS1, TS2, IS1 and IS2 for English lexical stress. To examine differences in spectral balance, a two-way repeated measures ANOVA was conducted with stress condition (TS1, TS2, IS1 and IS2) and band category (B1, B2, B3, and B4) as within-subject factors. There was a significant main effect of stress condition,  The interaction between stress category and band category was also significant, F(9, 486) = 2.54, p < .01, η 2 p = .05. To unpack the interaction, we performed simple main effect analyses of stress conditions in each of the different bands. A series of one-way repeated measures ANO-VAs with stress conditions as a between-subject factor were conducted. For Band 1, the main effect of stress condition was significant, F (

Association between Tone Perception Accuracy and Acoustic Correlates
We further examined the relationship between perceptual performance and acoustic parameters by conducting correlational analyses. Table 5 shows the correlations between the tone perception accuracy of all three groups and all acoustic parameters including duration, average F0, F0 onset, F0 offset, F0 general slope, F0 major slope, intensity, F1, F2, B1, B2, B3, and B4. It is evident that adult bilinguals' perceptual accuracy was significantly correlated with both child bilinguals' accuracy (r = .86) and adult English monolinguals' accuracy (r = .87), and child bilinguals' accuracy was also strongly correlated with English monolinguals' accuracy (r = .84), suggesting sufficient shared variance in tone perception accuracy among three groups.

Discussion
The present study demonstrates that, although Cantonese tone and English stress differ in their acoustic composition and linguistic function, both are perceived equally well by Cantonese-English bilingual children who simultaneously learned one tone language (Cantonese) and one stress language (English). In contrast, Cantonese-English bilingual adults and English monolingual adults exhibited superior performance in perceiving English stress than in perceiving Cantonese tone, with the bilingual adults tending to outperform the monolingual adults in both tone and stress perception. Cantonese tone perception and English stress perception were significantly associated with each other in Cantonese-English bilingual children and English monolingual adults, but not in bilingual adults. The high rising (25)-low rising (23) contrast was perceived with the lowest accuracy among all 15 Cantonese tone contrasts across the three groups; acoustic analyses indicate that the high rising (25) and low-rising (23) tones are similar in F0 onset and in average F0 but they differ in their major slopes. The correlation analyses between tone perception performance and acoustic correlates showed that average F0 and F0 onset were associated with tone performance in all groups, but F0 major slope was only associated with tone perception in the bilingual adult group. The correlation analyses also showed that English monolinguals' sensitivity to spectral balance was significantly associated with their Cantonese tone performance. These results are discussed in terms of the mechanism employed by bilinguals in the perception of tone and stress.

Dynamic Interactive Processing of Tone and Stress in Cantonese-English Bilinguals
There are at least two possible pathways by which Cantonese-English bilinguals may perceive the two suprasegmental features of their languages, i.e., tone and stress. One is that tone and stress are processed independently by two separate systems. This view is primarily based on the important differences in how tone and stress are assigned. As described in the introduction, although the phonetic realizations of Cantonese tone and English stress both involve the manipulation of vocal pitch, there are striking differences between Cantonese tone and English stress. First, tone involves variation in pitch at the syllable level whereas English stress is related to vocal pitch, duration and intensity between syllables [18]. Second, Cantonese tone is used to distinguish a broad set of words in Cantonese, while only a limited subset of content words are distinguished by stress in English [52]. Additionally, the rate of fluctuation in fundamental frequency is higher in tone marking than in stress marking [53]. These differences raise the possibility that tone and stress are processed independently. The alternative possibility is that tone and stress are interdependently and interactively processed within a single perceptual system [54,55].
If tone and stress are indeed processed independently, we would not observe any significant correlation between tone perception and stress perception in bilinguals. Nor would we observe a developmental difference between children and adults in tone and stress perception, given the existing evidence that 10-year-old children perform as well as native adults in tone perception [31] and stress perception [32]. Importantly, we would also not expect any facilitative effect of bilingualism when processing suprasegmental features of language, which we saw, for example, in the fact that Cantonese-English bilingual adults outperformed English monolingual adults on stress perception. Therefore, our results do not support the view that tone and stress are processed fully independently.
Instead, our results support the claim that tone and stress are interdependently and interactively processed by Cantonese-English bilingual children and adults, at least in the acousticphonetic processing stage. This view does not completely deny the existence of some independent processing for tone and stress. Rather, we suggest that Cantonese-English bilinguals may have a common suprasegmental representation that embodies the shared acoustic cues relevant to both tone and stress (i.e., average F0 and F0 onset), as well as separate representations that characterize tone-specific cues (e.g., major slope) and stress-specific cues (e.g., spectral balance) respectively. Thus, as an utterance unfolds, both the common suprasegmental representation and the relevant language-specific suprasegmental representations are activated simultaneously. This activation causes no competition between tone and stress processing, resulting in the observed comparable and correlated performance in Cantonese-English bilingual children. As we will discuss below, we found differences between bilingual adults and children in the present study; however, we suggest that this might be accounted for by the possibility that use of suprasegmental representations, both common and language-specific, are constrained by one's level of language experience and phonological processing capabilities. This reflects the dynamic nature of suprasegmental speech processing.
Our view that tone and stress are processed interactively is consistent with models of second language perception [1,2,[56][57][58]. According to these models, a listener's perceptual strategy arises from the interaction between their linguistic knowledge and their speech perception system. Specifically, listeners' linguistic knowledge imposes considerable demands on their speech perception system; the system either assimilates novel sounds into existing categories or alters its structure to accommodate the novel speech sound. Moreover, these assimilation and accommodation processes are constrained by features of listeners' native languages [1,59].
Of relevance to the present study, our group of Cantonese-English bilingual children grew up in highly bilingual families and in a bilingual society. Cantonese lexical tones and English lexical stress are both present in and permeate their daily language environment. Such extensive dual-language input must have an impact on their speech perception systems, as well as on the phonetic perception of prosodic cues. Our acoustic analyses show that certain acoustic correlates are common across tone and stress, i.e., average F0 and F0 onset. Thus, it seems very likely that Cantonese-English bilingual children developed a common suprasegmental representation that records those language-general acoustic correlates and makes them available for both tone and stress processing. In other words, the concurrent learning of two different types of suprasegmental features leads to the formation of a general suprasegmental prototype that is used for both Cantonese and English. Meanwhile, our analyses also showed that some acoustic correlates are only available to tone or stress processing, such as major slope for tone and spectral balance for stress; this results in the development of separate tone-specific and stress-specific representations in bilingual children.
Whereas bilingual children showed comparable performance on tone and stress perception, both bilingual and monolingual adults exhibited superior performance in perceiving stress than in perceiving tone. This was to be expected for the English monolingual adults, as they are exposed to stress-based English but not tonal Cantonese. However, the finding was surprising for the bilingual adults, in that Cantonese-English bilingual adults had better perceptual performance on non-native English stress than native Cantonese tone. Below, we will discuss the possibility that different task demands contributed in part to this finding, but it is important to note that the bilingual adults tended to perform significantly better on both tone contrast perception test and stress perception test than English monolingual adults. Taken together, these findings with bilingual adults are consistent with a dynamic interactive processing view of tone and stress, suggesting that interdependence between the two may create a facilitative interaction between Cantonese tone and English stress that results from the shared underlying factors that link them together. We argue that Cantonese-English bilingual adults' stress perception performance being stronger than that of English monolingual adults may be partly due to their having learned to perceive complex tone contrasts, a skill that transfers to and benefits stress perception. Our results imply that learning complex prosodic perception skills, such as those used for Cantonese tones, can generalize to processing simpler prosodic contrasts, such as English stress patterns.
In contrast, it seems unlikely that learning the simple prosodic contrasts of English stress generalize to complex prosodic processing of Cantonese tones. As evident in the results of our acoustic analyses, only a restricted set of acoustic correlates relevant to F0 (i.e., average F0, F0 onset, major slope) are critical to distinguishing tone contrasts. On the other hand, there are many acoustic correlates that can be used to distinguish English stress, i.e., average F0, F0 onset, duration and spectral balance. In other words, many of the tone-related acoustic correlates can be used to distinguish stress, but not all stress-related correlates are helpful to tone perception. Thus, the higher stress perception performance that we see in bilingual adults could be due to the additive effects of the language-general suprasegmental cues, such as F0 average, and the stress-specific cues. We note that in our study, bilingual adults' tone perception accuracy was not correlated with their stress perception accuracy, although this can be explained by their ceiling performance in both tone and stress. In posing this explanation, we are certainly aware that the present study focuses on perception of Cantonese tone and English stress at the acoustic-phonetic level. Future research is needed to test this explanation by examining suprasegmental speech processing at the lexical level.
While our explanation for the facilitative effect of Cantonese tone on English stress perception may account for bilingual adults' superior performance on stress perception than tone perception, we must also consider why bilingual children performed similarly on the two tasks. We suggest that although bilinguals have both a common suprasegmental representation and separate tone-and stress-specific representations, the effective use of these representations may be constrained by the bilingual's level of language knowledge and their acoustic-phonetic processing capabilities. Bilingual adults have extensive linguistic knowledge of Cantonese tones and English stress patterns, affording them more opportunity to develop a sophisticated understanding of the acoustic correlates of tone and stress. In contrast, young bilingual children have only recently acquired all six Cantonese tones and English stress contrasts [31], As noted in the introduction, a very recent study showed that young children's suprasegmental speech perception (specifically English lexical stress) was not as good as that of skilled adults, which can be attributed to children's comparatively limited language experience [28]. Moreover, individuals' acoustic-phonetic processing and auditory capabilities mature with age [29,30].
Another likely possibility is that the common suprasegmental representation posited by the interactive account of suprasegmental processing is primarily used during the earliest stages of perceptual development for tone and stress, yielding comparable tone and stress perception performance in our bilingual children, with the separate representations of tone and stress being strengthened over time through language learning. Note that our child participants were balanced Cantonese-English bilinguals who had received balanced input in both languages and whose parents rated them as having equal proficiency in both Cantonese and English. In contrast, our adult participants were Cantonese-dominant bilinguals who rated themselves as being more proficient in Cantonese than in English, despite their early exposure to English (around the age of 3). Given the evidence showing that different amounts of language input from bilinguals' two languages modulates the perception of lexical stress patterns [60], it is very likely that tone-specific and stress-specific representations are strengthened in bilingual adults through extensive language inputs, leading to the development of separate perceptual strategies to distinguish tone and stress. That is, Cantonese-dominant bilingual adults use F0-related acoustic correlates as the primary cue to perceive tone contrasts, whereas they use stress-related acoustic correlates, such as spectral balance, to perceive stress contrasts. In this way, they are able to achieve a relatively high level of performance in both tone and stress perception.
Although bilingual adults outperformed monolingual adults on stress and tone perception, we observed an asymmetric pattern of performance favouring stress in both groups of adults. This may be due to differences in the relative perceptual salience of tone and stress, as well as in part to differences in perceptual load. In the stress perception test, there were two patterns of consonant-vowel bisyllabic stimuli: either strong stress on the first syllable followed by weaker stress on the second syllable (e.g., SUla), or weak stress on the first syllable followed by strong stress on the second syllable (e.g., suLA). These two patterns resulted in one minimal pair differing only in stress location (e.g., SUla vs. suLA) (see Fig 1B). Our acoustic analyses showed that each stressed syllable tends to have higher frequency and longer duration than the unstressed syllables. By contrast, in the tone perception test, there were six contrastive tones, and they combined into 15 tonal contrasts that varied in pitch height, pitch contour, or both. Moreover, the relative perceptual saliency of some tone contrasts is very low, for example, high rising (25, T2)-low rising (23, T5) (see Fig 1A). Thus, it seems that the task difficulty of the English stress perception test is relatively less, and stress patterns are relatively more predictable, than is the case for the Cantonese tone perception test. However, as noted earlier, the finding that Cantonese-English bilingual adults responded more accurately and more quickly than English monolingual adults to both Cantonese tone and English stress perceptions suggests that task difficulty alone cannot explain superior performance on stress perception compared to tone in bilingual adults. Rather, the facilitative interaction between Cantonese tone and English stress (as discussed above) is responsible for this effect.
The significant perceptual differences observed between bilingual children and adults also suggest that bilingual children's perceptual abilities in tone and stress improve even after the age of 11 years old. This finding is partially consistent with previous work showing that, by the age of 10, Cantonese children achieved adult-like accuracy in tone perception [31]. Our finding also shows that Cantonese-English bilingual children performed as well as bilingual adults on most, but not all, tone contrasts (e.g., T2-T5 performance was lower). Nevertheless, we must be cautious of the possibility that these developmental differences could be partly due to the effect of bilingualism on tone perception, wherein dual-language exposure to one tone language and one stress language may impact the perceptual organization of suprasegmental speech categories at the early stage of processing. Specifically, unlike Cantonese-dominant children in previous research [31], our balanced Cantonese-English bilingual children have to negotiate the differences between tone and stress, and to characterize two types of prosodic regularities in both of their native languages. The interaction between tone and stress could be facilitative, as we observed in bilingual adults' superior performance on stress perception. However, it could be also inhibitory at the early stage of processing, for the processing of complex suprasegmental contrasts like Cantonese tones in particular. The possible inhibitory effects of bilingualism on Cantonese tone perception need to be further investigated in future research.

Acoustic Correlates of Cantonese Tone and English Stress
Another important contribution of the present study is that we systematically analyzed all possible acoustic correlates that could potentially contribute to variability in the perceptual saliency of English lexical stress patterns and of all 15 tone contrasts for the six Cantonese tones. Our acoustic analyses of Cantonese tone suggested that average F0, F0 onset and major slope are the critical acoustic correlates in determining the relative ease of Cantonese tone perception. Cantonese-English bilingual children, adults, and monolingual English-speaking adults could all pick up on average F0 as a cue to discriminate tones. However, when average F0 and F0 onset are similar, as between T2 and T5 in particular, these similarities cause the three groups difficulty in perceiving tone contrasts.
The magnitude of major slopes of the six tones, in descending order, were: T2 > T5 > T3 > T6 = T1 > T4, implying that the perception of the T2-T5 contrast depended on the F0 major slope (in the rising portion). Our perceptual results showed that bilingual adult's accuracy in distinguishing the T2-T5 contrast was significantly higher than that of bilingual children and English monolingual adults, therefore further supporting that bilingual adults were much more sensitive to F0 major slope.
Although duration, intensity, the formant frequencies F1 and F2, and spectral balance were not critical in accounting for tones' perceptual saliency, we noticed that English monolingual adults appeared to use spectral balance as a perceptual cue to perceive Cantonese tones. This was indicated by the signification correlations of English monolingual adults' tone perception performance with spectral energy across bands 1, 3 and 4. Also, English monolingual adults performed worse than both bilingual adults and bilingual children in the T1-T3 contrast alone; recall that only T1 and T3 shared the same spectral energy across the first three bands. This further reinforced our inference that spectral balance was only a predominant cue for perceiving Cantonese tones among English monolingual adults.
The results of our acoustic analyses of English stress patterning suggested that average F0, F0 onset, F0 offset, F0 general slope, duration, intensity, and spectral balance are important acoustic cues to distinguish between stress patterns. These findings are consistent with previous research suggesting that English lexical stress is a multidimensional construct involving F0, duration, and intensity [15,16]. More importantly, we have demonstrated that among all of these cues, spectral balance may be one of the most reliable acoustic correlates of English stress.

Implications, Limitations and Future Directions
The present study provides compelling evidence, from both a perceptual experiment and detailed acoustic analyses, for the interactive processing of Cantonese tone and English stress in bilinguals. Specifically, the study provides the most comprehensive analyses available to date of acoustic correlates of tone and stress, enabling us to make inferences about the possible perceptual cues employed by bilingual and monolingual listeners in perceiving tone and stress. More importantly, the present study proposes an interactive account of suprasegmental speech processing to account for the performance differences in Cantonese tone and English stress observed among Cantonese-English bilingual children, adults, and English monolingual adults, as well as the acoustic-perceptual associations of stress and tone. Our interactive account of suprasegmental speech processing extends the interactive account of speech perception [61] to suprasegmental speech perception by putting forward the possibility of a facilitative interaction between Cantonese tone and English stress.
However, we should note that the interactive account assumes that bilinguals may build a common suprasegmental representation that encodes both Cantonese tone and English stress. We must be cautious that the common representation assumption is derived from perceptual performance differences between bilinguals and monolinguals, as well as the correlations between Cantonese tone and English stress. We are aware that the perceptual correlation seen here could be due to the fact that there is an underlying common factor that ties Cantonese tone and English stress together, such as the sensitivity to the common acoustic cues (i.e., F0). Thus, it is important for future research to verify whether the perceptual correlation between Cantonese tone and English stress reflects a common suprasegmental representation or the general sensitivity to acoustic cues; studies might do so by adding a perceptual control condition (e.g., perception of other contrasts that do not exist in either Cantonese or English) and examining whether perception of novel contrasts is correlated with tone or stress.
It should also be noted that our tone and stress stimuli were pseudoword CVCV sequences without any lexical meaning. This limits the current study from finding any lexical expectation effects, which have been seen in previous research [11]. It would be worthwhile for future work to consider using real words as perceptual stimuli in order to examine how lexical level information may impact perception of tone and stress. The AXB discrimination task has limitations due to the way in which it has participants classify stimuli. It could be that participants made their "different" or "same" decisions on the basis of auditory sensitivity instead of linguistic information. Like previous tone perception research, we used 400ms as the inter-trial interval to allow the auditory features of the previous trial's stimuli to fade. However, the question of how fast auditory memory fades, or even if it fades at all, is debatable [62,63]. Additionally, the AXB design may place significant demands on working memory/attention resources, which may underlie individual differences. In particular, our adult participants may have more stability in executive function than the children. Thus, future research might consider taking steps to address these potential limitations. For example, including non-linguistic pitch stimuli would help to determine whether the pattern of results obtained for non-linguistic pitch might be similar to what we found here. Varying the inter-trial interval would also be worthwhile as a way to examine the effects of inter-trial interval on perception of suprasegmental speech. Including working memory and auditory attention measures in future work will enable researchers to remove the possible effect of individual differences in working memory and executive function on performance on an AXB discrimination task. Such research would provide further insight into the perceptual mechanisms employed by bilinguals in perceiving tone and stress.
Nevertheless, our results add to the existing body of work on cross-language speech perception. We show that, despite their differences on acoustic and linguistic dimensions, tone and stress are processed interactively in Cantonese-English bilinguals. Moreover, the relative ease of tone perception is determined by average F0, F0 onset, and F0 major slope. On the other hand, distinguishing stress patterns relies on multiple acoustic correlates, including spectral balance, F0-related features, duration, and intensity. The most robust acoustic correlate of English stress is spectral balance. Listeners' prevalent difficulty in perceiving the high rising (T2)-low rising (T5) tone contrast arises from their similarities in F0 onset and average F0, and the primary cue to distinguish this contrast is F0 major slope. These findings emphasize the interactive processing mechanism of suprasegmental speech perception, and also suggest a role of one's level of linguistic knowledge in the development of perceptual strategies used in speech perception.