Focus perception in Japanese: Effects of lexical accent and focus location

Albert Lee; Faith Chiu; Yi Xu

doi:10.1371/journal.pone.0274176

Abstract

This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a 4AFC identification task, we compared native Japanese listeners’ focus identification accuracy in different lexical accent × focus location conditions using resynthesised speech stimuli, which varied only in fundamental frequency. Experiment 1 compared the identification accuracy in lexical accent × focus location conditions using both natural and resynthesised stimuli. The results showed that focus identification rates were similar with the two stimulus types, thus establishing the reliability of the resynthesised stimuli. Experiment 2 explored these conditions further using only resynthesised stimuli. Narrow foci bearing the lexical pitch accent were always more correctly identified than unaccented ones, whereas the identification rate for final focus was the lowest among all focus locations. From these results, we argue that the difficulty of focus perception in Japanese is attributed to (i) the blocking of PFC by unaccented words, and (ii) similarity in F0 contours between lexical pitch accent and narrow focus, including in particular the similarity between downstep and PFC. Focus perception is therefore contingent on other concurrent communicative functions which may sometimes take precedence in a +PFC language.

Citation: Lee A, Chiu F, Xu Y (2022) Focus perception in Japanese: Effects of lexical accent and focus location. PLoS ONE 17(9): e0274176. https://doi.org/10.1371/journal.pone.0274176

Editor: Masatoshi Koizumi, Tohoku University, JAPAN

Received: September 16, 2021; Accepted: August 22, 2022; Published: September 22, 2022

Copyright: © 2022 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This work was partially supported by an EdUHK internal grant (RG79/2018-19R) awarded to AL. Preliminary results have appeared in Lee, Chiu, and Xu (2017) and Lee and Xu (2015). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Focus is a communicative function for directing the listener’s attention to information that the speaker believes is especially important [1, 2]. For those languages which employ fundamental frequency (F0) as a cue to mark focus, a natural question arises as to whether there is a conflict or competition between focus and other communicative functions (e.g., lexical tone / accent) which are also expressed mainly by F0. A further question is whether the effectiveness of F0 as a prominent cue in conveying focus varies across different focus locations. This question stems from the fact that, in many languages, focus is conveyed by multiple markers (e.g. prosodic, syntactic, morphological), each of which can employ one or multiple phonetic alterations of various cues such as duration, intensity, and F0 [e.g. 3 on Finnish]. However, a good understanding of the interaction of these surface cues is only possible after investigating the role of each of these phonetic cues when independently manipulated. The question of how focus may interact with other functions which also control F0 could be answered by a perception task in which F0 is manipulated while other aspects of speech are held constant. This can be achieved through a speech resynthesis tool called PENTAtrainer that we have developed [4].

Focus markers in Japanese

To mark narrow focus in Japanese, native speakers can use a combination of syntactic [see 5 for a review], morphological (i.e. using the focus particles dake ‘only’ or mo ‘too’), and prosodic strategies. Prosodic cues to narrow focus include, acoustically, on-focus F0 range expansion and post-focus F0 range compression [PFC henceforth, 6–9] alongside the modification of non-F0 cues such as duration and formant frequency [10]. These prosodic cues were identified in production experiments in which participants were not allowed to use any of the other aforementioned non-prosodic focus-marking strategies.

On the other hand, how native listeners perceive narrow focus is not as well understood. While it is easy to instruct speakers to produce focus, it is next to impossible to get participants to produce it with only one of the possible prosodic cues. It is well known that besides F0, focus also affects duration [10, 11], voice quality [12] and formant frequency [10], all of which could also serve as secondary cues in focus perception. Because F0 is involved in a wide range of communicative functions (e.g., focus, emotion, sentence type), and a given prosodic pattern (e.g., a raised F0 peak) could be associated with a number of different meanings, how well native listeners perceive focus with only F0 cues available warrants investigation. This is interesting because when all the secondary cues are held constant, it is possible that F0 patterns associated with focus alone would not be very effective.

There are very few studies that have systematically investigated focus identification in Japanese, with exceptions such as [13]. To verify the naturalness of their production data from three speakers, they had 20 native listeners participate in a 6AFC identification task in which narrow focus was one option; the other options were ‘admiration’, ‘suspicion’, ‘disappointment’, ‘indifference’, and ‘neutral’. Their results showed that correct identification of focus varied greatly across the three speakers, ranging from 23% to 77% (ibid.). As all three speakers were experienced native teachers of the Japanese language, it is surprising that narrow focus in their production would be so poorly identified by the native listeners. Also, the huge discrepancy in focus identification between the productions of different speakers suggests that rather different acoustic cues may have been employed. It is thus necessary to consider using resynthesised speech in research on focus perception, so that it is possible to control one acoustic parameter while others are held constant.

Effect of lexical prosody

F0 is an acoustic dimension that has been shown to be involved in cuing multiple communicative functions [see, for example, introduction by 14]. For languages that use F0 to mark both lexical prosody (e.g., tone or lexical pitch accent) and focus, it is an intriguing question how listeners simultaneously decode the multiple pieces of information from the F0 signal.

For example, the role of lexical prosody in focus perception has been reported for a tone language like Mandarin. Mandarin has four lexical tones (High, Rising, Low, and Falling), each differing from the others in terms of F0 movement direction (alongside other cues). Theoretically, these four ‘full’ tones are considered equal in prominence, as opposed to the Neutral Tone which is produced with weaker articulatory effort [15]. In a perception study, [16] showed that native Mandarin listeners identified focus much less accurately when it was on the Low tone than on the other tones. They attributed this discrepancy to the fact that the Low tone in Mandarin has a smaller capacity for F0 range expansion and a relatively weaker intensity. This is interesting because unlike culminative word prosody systems (e.g., stressed vs. unstressed syllables in English and accented vs. unaccented mora in Japanese), the ‘full’ tones in Mandarin are presumably equal in prominence. If a given tone category can stand out as being more poorly identified for focus in a language where every syllable is specified for tone, it would be interesting to ask how large the discrepancy would be in a culminative word prosody system where one tone category is naturally more prominent than the other. The lexical pitch accent system of Japanese offers a perfect test case for this question.

In Japanese, a word can be either lexically accented (accented henceforth) or unaccented; for an accented word, the pitch accent could fall on any syllable. The lexical pitch accent (pitch accent henceforth) in Japanese, or the lack thereof, serves to mark lexical contrasts. Acoustically, it bears a high falling F0 pattern [17]. For example, in ha’shi ‘chopsticks’ the pitch accent falls on the first syllable, and pitch shows a high-low pattern; in contrast, hashi ‘edge’, which is unaccented, is phonologically assigned a LH pitch pattern. Unlike lexical tones, of which all members are deemed equal in prominence within a language [except, for example, the Neutral Tone in Mandarin which is ‘weaker’ than other tones, 18], an accented mora in Japanese stands out among unaccented ones (which bear a relatively level F0 pattern, ‘H-’ in J-ToBI, the prevailing annotation convention for Japanese prosody [17]). Acoustically, accented words differ from unaccented words in having a higher F0 peak followed by a steep fall [8, 19]. The F0 movement of the pitch accent allows more room for F0 range and intensity variation compared with unaccented words, much like the other Mandarin tones compared with the Low tone in [16].

Between accented and unaccented words, there are both theoretical and phonetic reasons to consider the former more prominent. Within J-ToBI, unaccented words are marked as bearing the default melody of a prosodic word (%L H-), whereas accented words are additionally marked by H*+L, i.e. %L H- H*+L. Acoustically, the H* tone is perceptually salient with both higher F0 scaling and stable alignment [20]. As such, it is reasonable to assume that in a neutral (i.e. broad) focus utterance where an accented word is surrounded by unaccented words (i.e. unaccented-accented-unaccented, henceforth UAU), the accented word would stand out and be more prone to being misperceived as bearing a narrow focus. It is thus likely that a UAU utterance under neutral focus would be the most easily confused with medial focus and yield the lowest focus identification accuracy among all the accent conditions.

Effect of focus location

Another likely source of difficulty in focus perception is focus location. In the literature on Japanese focus production, various narrow focus conditions have been either reported or predicted to be confusable with neutral focus:

Initial focus.

In their review of prominence marking in Japanese prosody, [21] argued that initial focus and neutral focus might be ambiguous because ‘there has to be at least one IP-initial rise at the beginning of every well-formed utterance (in Japanese). That is, when there is no narrower focus prompting an IP break and reset later on, the rise from the utterance initial [%L] makes the next immediate [H] target (whether a phrasal [H–] or the [H] of a [H*+L]) the highest (most prominent) peak in the utterance’ [21] (in J-ToBI, the IP (Intonational Phrase) ‘is the prosodic domain within which pitch range is specified…’ [17]).

The left panel in Fig 1 [data adapted from 9] illustrates this scenario: when compared with neutral focus (Fig 1, solid blue line), the initial narrow focus contour (dashed blue) shows clear evidence of on-focus raising at the beginning of the utterance and PFC in the middle and ending parts of the utterance. However, when inspected individually, both of these two contours are characterised by a high utterance-initial peak which could mark either narrow focus or pitch accent, and by a lowered peak in the middle of the utterance which could be either caused by PFC or by downstep (‘downstep’ refers to the lowering effect of a low tone on a following high tone, such that a new, lower, ceiling is set on all subsequent high tones in a given domain [22]). In other words, where F0 is not reset utterance-medially by a later narrow focus, the highest peak will always be on the first word of the utterance. Meanwhile, when narrow focus is utterance-initial, on-focus expansion will raise the first peak, but will not change the fact that it is the highest in the first place. Thus this case likely leads to ambiguity for the listener as they cannot be sure if the initial peak is raised by focus or is intrinsically high due to normal realisation of an early pitch accent.

Fig 1. Averaged F0 contours of initial / penultimate / final narrow vs. neutral × accented vs. unaccented focus [data from 9].

https://doi.org/10.1371/journal.pone.0274176.g001

Penultimate focus.

Unlike initial focus, there is evidence that for Japanese (and other subject-object-verb, or SOV, languages) there is no PFC after a penultimate focus, leaving on-focus raising as the only cue available [6, 7, 9]. This is considered to be due to the ‘focus projection’ principle. Focus projection predicts that placing prosodic focus on the object noun phrase (NP) leads to two possible interpretations: narrow focus on the NP and broader focus on the verb phrase (VP). It follows that for an SVO language like English, final focus and broad focus on the VP would be ambiguous [23–25], whereas for an SOV language like Japanese, broad focus on the VP would be indistinguishable from narrow focus on the object NP, i.e. penultimate focus [6, 7, 26] (see also the middle panel of Fig 1). The same has also been observed in Turkish, another SOV language [27]. The distinction between the two focus conditions when produced in laboratory speech is marked by on-focus F0 raising, and PFC appears to be absent (overlapping blue contours towards the end). Thus, compared to initial focus, listeners have one less cue to rely on in penultimate focus. Because of this, focus perception may be more difficult in this position. See [5] for a review of relevant literature on the syntax-prosody interface in Japanese.

Final focus.

Across languages it has been shown that final focus is prosodically expressed much less effectively than an earlier focus [11, 28–31]. In English, for example, an utterance-final word bearing narrow focus is produced with less relative emphasis [30]. For SVO languages, part of the reason would be complications due to focus projection, as discussed above. Meanwhile, [32] suggested that this could be the result of the conflicting needs to encode both sentence type (questions vs. statements) and focus in the sentence-final word. As Japanese also marks questions with an utterance-final F0 pattern (boundary tone) [8], an overladen utterance-final word would have reduced space for F0 modification for focus, possibly leading to ambiguity. If acoustic cues in production are ambiguous in the first place, it is reasonable to expect that listeners are also easily confused in perception. Fig 1 (right panel) shows that although there is clear evidence of on-focus raising that separates narrow from neutral focus, the pre-focus portions of the F0 contours largely overlap. How sensitive listeners are to the F0 difference in the final word alone would hence determine their ability to identify narrow final focus.

Given the above-listed issues, the first goal of this study is to find out how Japanese listeners’ perception of narrow focus can be affected by pitch accent. Secondly, we want to determine if focus location has a clear impact on focus perception in Japanese, and if so, whether it is initial, medial or final focus that can be most affected. Finally, we are interested in how well listeners can identify focus when F0 is the only cue. These questions will be answered by a series of perception experiments.

Experiment 1: Pilot study

Experiment 1 explores the possibility of using resynthesised stimuli for focus perception experiments. This is because resynthesised speech is better controlled and free from cross-repetition variations unlike naturally produced stimuli, and as such, it would be theoretically better to test focus identification using resynthesised stimuli. However, it is unclear if listeners would perceive focus on resynthesised stimuli differently from natural ones. We thus compared listeners’ focus perception across natural and resynthesised stimuli in this experiment.

Our ultimate goal is to test the effects of focus location and accent condition on focus identification using resynthesised stimuli. To achieve this goal, it is necessary first to establish that resynthesised and natural stimuli do not differ significantly in focus perception. In this pilot experiment, we therefore compared how participants performed with the two types of stimuli.

Method

Natural stimuli.

Both naturally produced and resynthesised stimuli were used in this experiment. The target sentences, adapted from [9], were designed to elicit quasi-minimal contrasts in F0 patterns in a production experiment (see Table 1). In choosing these target sentences, several factors were taken into consideration: (i) they should be as similar to one another as possible in terms of segmental contents (e.g. same vowel height, consonant manner) so as to directly test the effect of F0; for the same reason, (ii) they should not contain any non-F0 cues to focus such as the marker dake ‘only’ that modifies noun phrases or -noda attached to final verb phrases, and (iii) they should be identical in length, which can affect F0 range due to soft pre-planning [33]. While yielding semantically less natural sentences, our design ensured strict experimental control that allowed us to assess the effects of focus on F0 contours as well as the effects of F0 variations as cues for focus perception. As will also be explained in the General Discussion section, these target sentences have elicited responses in line with comparable studies in the focus prosody literature.

Table 1. Stimuli used in the present study [adapted from 9].

For easy illustration, here the accusative case marker -o (which collocates with mita ‘saw’) and the dative case marker -ni (which collocates with nita ‘resembled’) are presented as though belonging to Word III; syntactically they are part of Word II.

https://doi.org/10.1371/journal.pone.0274176.t001

In the original corpus (N = 6,400), each utterance was either eight (short) or 11 (long) morae in length so as to compare the course of F0 movement under different utterance length conditions. For each word location, an initially accented (e.g., HLL) word and an unaccented (e.g. LHH) word were compared, yielding eight possible combinations of pitch accent condition (two accent conditions at each of three word locations, i.e. 2 × 2 × 2). There were four possible focus conditions for each target sentence, namely initial, medial (i.e., penultimate), final, and neutral (i.e., broad). The sentence types were yes/no questions vs. statements. Narrow focus was elicited by having the speaker produce a given sentence first as a question and then as a (corrective) statement, as a pair.

The natural stimuli used in this study were produced by a 33-year-old female native Japanese speaker from Greater Tokyo (born in Tokyo, grew up in Kanagawa) who worked as a professional voice-over actress in London. Recording took place in a sound-attenuated booth in University College London, using a RØDE NT1-A microphone. The sampling rate was 44,100 Hz. The speaker was seated in front of a computer screen, on which stimuli were displayed one by one in random order. From Table 1, one utterance of each of the long target sentences (N = 64, i.e., eight accent conditions × four focus conditions × two sentence types) was randomly chosen. Short utterances were not included in order to reduce the total number of trials. For details of the acoustic analysis of the original corpus, please refer to Fig 1 for averaged F0 contours in some of the accent conditions and [9] for full details.
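The factorial structure of the long natural stimuli can be sketched as follows. This is only an illustration of the counting, not the authors' stimulus-preparation script; the accent labels ("A" for accented, "U" for unaccented, one letter per word position) and all variable names are assumptions introduced here.

```python
from itertools import product

# Hypothetical labels: one letter per word position, "A" = accented, "U" = unaccented.
ACCENT_CONDITIONS = ["".join(p) for p in product("AU", repeat=3)]
FOCUS_CONDITIONS = ["initial", "medial", "final", "neutral"]
SENTENCE_TYPES = ["question", "statement"]


def build_design():
    """Return every accent x focus x sentence-type cell for the long sentences."""
    return list(product(ACCENT_CONDITIONS, FOCUS_CONDITIONS, SENTENCE_TYPES))


design = build_design()
print(len(ACCENT_CONDITIONS))  # number of accent conditions
print(len(design))             # number of natural stimuli (one utterance per cell)
```

Under these assumptions the enumeration gives eight accent conditions and 64 cells, matching the N = 64 reported above.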

The natural stimuli (N = 64) were analysed using ProsodyPro [34]. Speech data were first segmented into morae (where a light syllable is one, e.g., ta ‘field’ and a heavy syllable is two, e.g., tan ‘phlegm’). Vocal pulses detected by Praat were manually checked and rectified. Because of the controlled nature of the experimental setting, we were able to obtain consistently produced utterances with highly comparable F0 patterns and good accuracy. The speaker produced on-focus raising of F0 on the whole word, rather than on the following case marker only [see discussion in 21]. In general, unless there is a later narrow focus (i.e., medial or final), each utterance constitutes one Major Phrase with no evidence of subsequent pitch reset. Paired samples t-tests showed that the word produced under narrow focus was significantly higher in mean F0 than its neutral focus counterpart in initial and medial positions. For initial focus, it was 19.69 Hz (SD = 17.99) higher, t(15) = 4.38, p = .001 (two-tailed); for medial focus, it was 17.05 Hz (SD = 26.13) higher, t(15) = 2.61, p = .020; for final focus, the difference was non-significant.
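As a consistency check, the reported t statistics can be recovered from the reported mean differences and standard deviations using the standard paired-samples formula t = mean / (SD / sqrt(n)). The sketch below assumes n = 16 pairs, which is inferred here from the reported 15 degrees of freedom, and that the reported SD is the SD of the paired differences; the function name is illustrative.

```python
import math


def paired_t_from_summary(mean_diff, sd_diff, n_pairs):
    """t statistic of a paired-samples t-test from the mean and SD of the differences."""
    standard_error = sd_diff / math.sqrt(n_pairs)
    return mean_diff / standard_error


# n_pairs = 16 is an assumption inferred from df = 15.
t_initial = paired_t_from_summary(19.69, 17.99, 16)
t_medial = paired_t_from_summary(17.05, 26.13, 16)
print(round(t_initial, 2), round(t_medial, 2))
```

Rounded to two decimals, the two values agree with the t(15) = 4.38 and t(15) = 2.61 reported above.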

Resynthesised stimuli.

The resynthesised stimuli were generated using PENTAtrainer [4]. PENTAtrainer is a semi-automatic software package for analysis and synthesis of speech melody based on an articulatory-functional model [35]. The following steps were taken during stimulus generation: (i) data preparation, (ii) functional labelling, (iii) model training, and (iv) F0 synthesis, as will be described in more detail below.

Based on the Parallel Encoding and Target Approximation (PENTA) Model [35], PENTAtrainer extracts function-specific underlying pitch targets (target height, target slope, and target strength) by means of analysis-by-synthesis [4]. The pitch targets are articulatory goals that are approached within a user-defined tone-bearing unit, which is always the syllable in our own practice [e.g. 36]. The articulatory strength of a target specifies how fast the target is approached. Users annotate communicative functions in tiers in the form of Praat TextGrid interval labels. The programme then automatically learns the pitch targets through analysis-by-synthesis controlled by simulated annealing, a stochastic machine learning algorithm [37]. The learned pitch targets each correspond to a unique combination of multiple communicative functions (e.g., H + Question + pre-focus + Left Edge of Sentence), which can be used to generate F0 contours that can be directly compared with natural utterances [4]. The accuracy of synthesis (measured in terms of Pearson’s r and root-mean-square error) of PENTAtrainer has been reported to be high [e.g. 36, 38], rendering it particularly suitable for our purpose, namely to test focus identification using accurately resynthesised, natural-sounding stimuli. In fact, [39] reported that PENTAtrainer could resynthesise the original corpus on which the present study was based with an accuracy of Pearson’s r > .90 (i.e. comparing F0 data of natural utterances and corresponding resynthesised utterances). This high level of synthesis accuracy led us to choose PENTAtrainer to generate the stimuli used in this study.
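The two accuracy measures mentioned above, Pearson’s r and root-mean-square error, can be computed for a pair of time-aligned F0 contours as in the following minimal sketch. It is not PENTAtrainer's own evaluation code; the function names and the F0 values are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of F0 samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(ss_x * ss_y)


def rmse(x, y):
    """Root-mean-square error between two equally long sequences of F0 samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical time-normalised F0 samples (Hz), not data from the study.
natural = [210.0, 235.0, 260.0, 240.0, 200.0, 180.0]
synthesised = [212.0, 233.0, 255.0, 243.0, 204.0, 178.0]
print(pearson_r(natural, synthesised), rmse(natural, synthesised))
```

The correlation captures similarity in contour shape, whereas the RMSE captures the absolute distance between the contours, so the two measures are usually reported together.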

Firstly, to obtain accurate F0 trajectories, vocal pulses were manually checked and rectified with ProsodyPro for all the natural utterances produced by the aforementioned speaker (N = 640, i.e., eight accent conditions × four focus conditions × two sentence types × two lengths × five occurrences). This step was necessary because F0 estimation can be imprecise, particularly during creakiness. The recordings were then segmented by the mora in TextGrid files. In this case, heavy syllables (i.e. CVV and CVn) were labelled as two intervals equal in duration by inserting an interval boundary in the middle of the syllable.
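The treatment of heavy syllables described above amounts to inserting a boundary at the temporal midpoint of the syllable. A minimal sketch of that operation on plain (start, end) times in seconds is given below; the function name is hypothetical, and the sketch does not read or write actual TextGrid files.

```python
def split_heavy_syllable(start, end):
    """Split one heavy-syllable interval into two mora intervals of equal duration."""
    if end <= start:
        raise ValueError("Interval end must be later than its start.")
    midpoint = start + (end - start) / 2
    return [(start, midpoint), (midpoint, end)]


# Hypothetical heavy syllable spanning 1.0 s to 1.5 s.
print(split_heavy_syllable(1.0, 1.5))
```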

Then, the resultant data were labelled in terms of communicative functions [35], each in a separate tier in the TextGrid. In this approach, the labels of speech recordings are blind to actual F0 contours, unlike the more common practice of annotation based on phonetic realisation [17]. It is based on the assumption that communicative functions, such as ‘tone’, ‘focus’, ‘sentence type’, and ‘emotion’, are the underlying categories that generate surface F0 contours through an articulatory process that can be simulated by the target approximation model [40] at the core of the PENTA model. In PENTAtrainer, communicative functions as well as their internal components are treated as hypothetical categories whose phonetic values are learned from natural speech data through computational optimisation. Researchers can continually refine their labelling schemes to find the optimal combination of communicative functions for a given corpus.

Fig 2 illustrates how functional labelling was performed in this corpus. For the present corpus, four communicative functions were labelled, namely Tone [with the labels “H” and “L” for accented words and “M” for unaccented words, following 35], Sentence Type (“Question” and “Statement”), Focus [“pre-focus”, “on-focus”, and “post-focus”, following 41, and “neutral”] and Demarcation (“Left Edge of Sentence”, “Right Edge of Sentence”, “Left Edge of Word”, “Right Edge of Word”, and “Medial”). As in [36], “H” and “L” in the Tone tier are associated with the pitch accent, where “H” marks the accented mora, and “L” represents the tone following “H”. Meanwhile, “M” indicates the tones in an unaccented word. Note that sentence length was not included in the model as the effect of length on F0 realisation is considered to be predictable and determined by the Target Approximation mechanism [40]. The order of the tiers in Fig 2 is irrelevant as communicative functions are considered to be parallel to each other [35] and are implemented accordingly in PENTAtrainer [4]. See [39] for more details regarding resynthesis procedures and [36] for a comparable study using a different corpus (single-word Japanese utterances).

Fig 2. Example of functional labelling using PENTAtrainer.

https://doi.org/10.1371/journal.pone.0274176.g002

In the next step, PENTAtrainer extracted the pitch target parameters (in terms of target height, target slope, and articulatory strength) for each combination of the four communicative functions through analysis by synthesis. This means that from our training corpus (N = 640), 72 sets of parameters were extracted. With these parameters, PENTAtrainer then generated F0 contours which were imposed onto the segmental materials of the natural utterance to form the resynthesised stimuli.
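The notion of one parameter set per combination of communicative functions can be illustrated by joining the parallel tier labels of each interval into a composite label and counting the distinct composites. The sketch below is not PENTAtrainer code; the function names and the three-interval fragment are hypothetical, and only the label vocabulary is taken from the description above.

```python
from collections import Counter


def combine_tiers(tone, sentence_type, focus, demarcation):
    """Join the four parallel tier labels of each interval into one composite label."""
    tiers = (tone, sentence_type, focus, demarcation)
    if len({len(tier) for tier in tiers}) != 1:
        raise ValueError("All tiers must contain the same number of intervals.")
    return [" + ".join(labels) for labels in zip(*tiers)]


def count_combinations(composite_labels):
    """Count how often each distinct combination of functions occurs."""
    return Counter(composite_labels)


# Hypothetical fragment: the three morae of an accented, pre-focus word in a question.
tone = ["H", "L", "L"]
sentence_type = ["Question"] * 3
focus = ["pre-focus"] * 3
demarcation = ["Left Edge of Sentence", "Medial", "Right Edge of Word"]

labels = combine_tiers(tone, sentence_type, focus, demarcation)
print(labels[0])
print(len(count_combinations(labels)))  # distinct combinations in this fragment
```

Applied to a whole labelled corpus, the number of distinct composite labels corresponds to the number of parameter sets to be learned.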

To ensure that the resynthesised stimuli of different focus conditions differ only in F0, we had 16 base sentences on which to impose F0 contours generated by PENTAtrainer. These sentences consisted of the eight accent conditions × two lengths in Table 1. This means that for a given focus condition, non-F0 acoustic cues such as duration and intensity were held constant for all resynthesised stimuli. This is in contrast to [39], where each resynthesised stimulus was based on its respective natural utterance counterpart. Fig 3 illustrates the high synthesis accuracy of PENTAtrainer based on a neutral focus natural utterance vs. its synthesised counterpart (same base sentence in this case). The F0 contours closely overlapped each other, showing that in this example the synthesised utterance was highly similar to the natural one. Since some resynthesised stimuli do not share the same base sentence with their natural stimulus counterparts (to ensure minimal contrasts in F0), a direct assessment of synthesis accuracy as in [39] was not possible; instead, we justify the suitability of our resynthesised stimuli with a naturalness judgment task, as will be reported below (Experiment 2).

Fig 3. F0 contour of a natural vs. corresponding synthesised utterance (me’i-ga mo’mo-o mi’ta) in neutral focus.

X-axis shows actual time.

https://doi.org/10.1371/journal.pone.0274176.g003

Participants.

We recruited seven native listeners of Japanese (four male) for this pilot study. Their age range was 20 to 42 (M = 30.4, SD = 9.5). All were students who had been living in Hong Kong or England for less than six months at the time of the experiment. One participant had also lived in the USA for four years. None reported any history of speech or hearing impairment. No participant in this task took part in Experiment 2, which will be reported below. Written informed consent was obtained from all participants in this experiment and in Experiment 2. All experiments reported in this paper were approved by the UCL Research Ethics Committee (SHaPSetXU002).

Procedures.

The experiment took place in a quiet room. Participants were seated in front of a laptop computer, which displayed the Praat ExperimentMFC interface (see S3 File). They wore circumaural headphones and listened to the stimuli one after another. The entire experiment was conducted in Japanese. Participants were instructed to ‘determine which word was being emphasised’ with four options, namely ‘Word 1’, ‘Word 2’, ‘Word 3’, and ‘No emphasis’, which respectively corresponded to initial, medial (or penultimate), final, and neutral focus. They were also asked to respond as quickly as possible. There were 384 trials altogether (eight accent conditions × four focus conditions × two sentence types × three occurrences × two types of stimuli, i.e. natural vs. resynthesised). Each stimulus could be replayed up to three times.
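The trial count and the scoring of the 4AFC responses can be sketched as follows. This is an illustration only, not the script used to run or score the experiment; the function names and the example responses are hypothetical, while the response options and the design factors are those described above.

```python
# Mapping from the four response options to the focus condition each one indicates.
RESPONSE_TO_FOCUS = {
    "Word 1": "initial",
    "Word 2": "medial",
    "Word 3": "final",
    "No emphasis": "neutral",
}


def total_trials(accent_conditions=8, focus_conditions=4, sentence_types=2,
                 occurrences=3, stimulus_types=2):
    """Number of trials in the fully crossed design."""
    return (accent_conditions * focus_conditions * sentence_types
            * occurrences * stimulus_types)


def identification_accuracy(responses, intended_foci):
    """Proportion of trials on which the response matches the intended focus."""
    if len(responses) != len(intended_foci):
        raise ValueError("Each response needs exactly one intended focus.")
    correct = sum(RESPONSE_TO_FOCUS[response] == intended
                  for response, intended in zip(responses, intended_foci))
    return correct / len(responses)


print(total_trials())
# Hypothetical responses from four trials, not data from the study.
print(identification_accuracy(
    ["Word 1", "No emphasis", "Word 3", "Word 2"],
    ["initial", "neutral", "medial", "final"],
))
```

With four response options, chance-level accuracy in this task is 25%.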

Result

The overall accuracy of focus identification was highly similar between natural (M = 49.1%, SD = 24.6%) and resynthesised (M = 47.6%, SD = 25.9%) stimuli. For final (natural = 42.3%, resynthesised = 45.8%) and neutral (natural = 50.3%, resynthesised = 63.4%) foci, resynthesised stimuli even appeared to yield better accuracy than natural stimuli, although the differences were not significant. A paired samples t-test showed that identification accuracy rates did not differ between the two types of stimuli, t(27) = 1.238, p = .227. These results thus indicate that focus cues in Japanese are mostly carried by F0, as the other cues that natural stimuli may contain did not provide clear advantages over F0. Given this finding, Experiment 2 will use resynthesised stimuli alone to explore the effects of pitch accent and focus location on focus perception in Japanese.

Experiment 2

This experiment investigated the effects of focus location and accent condition on focus identification accuracy using only resynthesised stimuli. The main research question is which word location under narrow focus is the least distinguishable from neutral focus. We also asked whether a UAU utterance with no narrow focus would yield the lowest identification accuracy, since the lexical pitch accent may sound like medial narrow focus. We started by checking whether natural and resynthesised stimuli were equally natural-sounding to the participants, and then analysed their focus identification accuracy using only resynthesised stimuli.