
Do ‘leaders’ in change sound different from ‘laggers’? The perceptual similarity of New Zealand English voices

  • Elena Sheard ,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review & editing

    elena.sheard@canterbury.ac.nz

    Affiliation New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand

  • Jen Hay,

    Roles Conceptualization, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Department of Linguistics, University of Canterbury, Christchurch, New Zealand

  • Joshua Wilson Black,

    Roles Conceptualization, Formal analysis, Methodology, Software, Writing – review & editing

    Affiliation New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand

  • Lynn Clark

    Roles Conceptualization, Funding acquisition, Writing – original draft, Writing – review & editing

    Affiliations New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, Department of Linguistics, University of Canterbury, Christchurch, New Zealand

Abstract

Work on covariation in New Zealand English has revealed groups of speakers characterised by their back vowel spaces and status as ‘leaders’ or ‘laggers’ across a set of ongoing vowel changes. We investigate whether listeners hear speakers from different groups as perceptually distinct. We conduct a perception task in which New Zealanders rate the similarity of pairs of speakers. We use the results to create a two-dimensional perceptual similarity space by means of Multi-Dimensional Scaling, and test if speakers are organised within this space according to their back vowels, leader-lagger status, speed, or mean pitch. Results indicate higher pitched and faster speakers are perceptually distinct from lower pitched and slower speakers. Leaders are perceptually distinct from laggers if they are not markedly higher pitched. A Generalised Additive Mixed Model fit to the trial-by-trial ratings shows order effects, revealing that perception of similarity is not symmetrical. The trial-by-trial results also support the perceptual relevance of speaker speed, pitch and leader-lagger status.

1. Introduction

What information do listeners use to make social judgements about voices? The conventional starting point for perceptual research within the variationist paradigm is the sociolinguistic variable. Considerable work has shown how sociolinguistic variables vary across different speakers in production, and it is assumed that listeners can also interpret this variation as socially meaningful. Indeed, we have substantial evidence for the potential of different (combinations of) variants to affect listener evaluations of speaker macro- or micro-social characteristics [e.g., 1,2–5], and of (purported) social information about a speaker to influence listener categorisation of linguistic variables [e.g., 6–8].

However, researchers are often starting with variables known to (co)vary in production but with unknown perceptual relevance. This can pose significant methodological questions. Is it reasonable to assume they are even perceptible to listeners in spontaneous speech? Which social characteristics might they be associated with, and how can we test this association in perception? Rather than assuming answers to these questions, an alternative approach is to start with the speaker and work bottom-up. By focusing on who listeners perceive to sound similar and different to one another first, and the (socio)linguistic variables or social characteristics that contribute to the perceptual differentiation of speakers second, it is possible to gain direct insight into listener perception without having to make prior assumptions about the variables in question. Here, we start with speakers of New Zealand English (NZE) and ask what structural relationships emerge among these speakers when listeners assess their similarity based on their spontaneous speech. We then further examine the degree to which our variables of particular interest, covarying vowel patterns in NZE, are relevant to this structure.

The analysis was carried out using the R Programming Language [9]. All code and anonymised data to reproduce the findings are publicly available in a GitHub repository (https://github.com/nzilbb/qb-pairwise-public). The supplementary materials also contain the preregistration for the experiment and additional methodological and analytical details. However, readers do not need to refer to any external content to follow the manuscript, and we specify which elements of the reported analysis differ from the preregistration.

2. Connecting the production of covarying New Zealand English monophthongs to their perception

Sociolinguistic research in the 21st century has shown increasing interest in how linguistic variables pattern together at both the individual and community level [e.g., 10–12]. There is now growing evidence that linguistic features do not work in isolation from each other but can exist within systematic patterns of covariation and coherence. Such patterns may be more socially motivated (e.g., shifts away from dialectal variants) or shaped by linguistic pressures (e.g., vowel chain-shifts). Brand et al. [13] is illustrative of a shift towards multivariate analyses, focusing on a set of 10 NZE monophthongs (fleece, kit, dress, trap, start, strut, lot, nurse, thought, goose) in data from the Origins of New Zealand English (ONZE) corpus [14]. Brand et al. [13] investigate the degree to which a given speaker’s realisation of a single vowel is predictive of their realisations of other vowels, relative to the population, once known physiological and socio-demographic sources of variation are controlled for. Their analysis revealed the existence of structured, systematic patterns in vowel realisations, with no monophthong produced independently of all other vowels in the set [13].

Brand et al. [13] implemented a novel statistical methodology to track potential vowel systems or clusters which had the following pipeline:

  1. Generalised Additive Mixed Models (GAMMs) are fit to F1 and F2 measures for each monophthong
    ◦ Each GAMM includes relevant fixed effects (e.g., speech rate, gender, age) and random effects (e.g., word, speaker)
  2. Speaker random intercepts are extracted from each GAMM [cf. 15]
  3. The extracted speaker random intercept values for each variable (F1/F2 for each monophthong) are used as input for a Principal Component Analysis (PCA)

PCA is a multivariate dimension-reduction technique that reduces many variables (i.e., F1/F2 speaker intercepts for 10 vowels) to fewer Principal Components (PCs), onto which multiple, covarying vowels are loaded [see also 16]. PCA identified three main PCs, or distinct clusters of covarying vowels, in the ONZE data [13]. One cluster related to the back vowels thought, start and strut (the back-vowel configuration). Individual speakers with lower and fronter realisations of thought have backer realisations of start and strut, and vice versa. The second cluster related to sound change, with individual speakers consistently “leading” or “lagging” in the ongoing changes for kit, fleece, dress, trap, nurse and lot (the leader-lagger continuum). The final cluster captured two pairwise relationships: start and lot, and dress and goose.
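The dimension-reduction step can be made concrete with a toy example. The published pipeline is implemented in R; the sketch below, in Python with only the standard library, applies PCA to just two hypothetical per-speaker intercept variables (e.g., F2 intercepts for two vowels) using the closed-form eigendecomposition of a 2x2 covariance matrix. The data values are invented for illustration, not taken from the corpus.

```python
import math

def pca_2d(xs, ys):
    """First principal component (unit vector) and the proportion of
    variance it explains, for two variables measured on the same speakers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    xs = [x - mx for x in xs]
    ys = [y - my for y in ys]
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x in xs) / (n - 1)
    b = sum(x * y for x, y in zip(xs, ys)) / (n - 1)
    c = sum(y * y for y in ys) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    lam1, lam2 = (a + c + disc) / 2, (a + c - disc) / 2
    # Eigenvector paired with the largest eigenvalue
    if abs(b) > 1e-12:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam1 / (lam1 + lam2)

# Perfectly covarying intercepts load equally onto a single component
(load_x, load_y), var_share = pca_2d([1.0, 2.0, 3.0, 4.0],
                                     [1.0, 2.0, 3.0, 4.0])
```

In the actual analyses the input is the full set of F1/F2 speaker intercepts for 10 monophthongs, and the resulting components are interpretable clusters such as the leader-lagger continuum.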

Hurring et al. [17] replicated and extended Brand et al. [13], analysing covariation of the same monophthongs with a different corpus of contemporary NZE, the QuakeBox [18,19]. The original QuakeBox project (QB1) collected high quality audio and video recordings of earthquake stories in multiple locations across Christchurch, New Zealand in 2011–2012. The QuakeBox 2 (QB2) project then re-recorded stories from a subset of the QB1 speakers in 2019–2020. Hurring et al. [17] applied the GAMMs-to-PCA methodology developed by Brand et al. [13] to the QB data from each time point, resulting in two PCs for QB1 and QB2. The first two vowel clusters in Brand et al. [13] (the back vowel configuration and leader-lagger continuum) were identified in both the QB1 and QB2 principal components. trap, dress, fleece, kit, and nurse are consistently loaded onto the leader-lagger continuum, start, thought, and strut onto the back-vowel configuration (we note goose and strut F1 in Hurring et al. [17] are also loaded onto the former and lot is loaded onto the latter). Moreover, both vowel clusters in the QuakeBox data have remained stable over time.

The results of Hurring et al. [17] uphold the results of the original ONZE analysis and highlight the persistence of covarying NZE vowel patterns on both the collective and individual level. The same speakers maintain their position in the leader-lagger continuum over time, even when the production of individual vowels may be changing in the community. The combined results of Hurring et al. [17] and Brand et al. [13], then, provide evidence for NZE monophthongs working together as part of a complex system within which speakers can be leaders of structurally unrelated changes (although some changes in the leader-lagger continuum, such as the Short Front Vowels (kit, dress and trap), are likely structurally related, there are no comparably clear structural explanations for their relationships to the other vowels loaded onto the same Principal Component). Brand et al. [13] suggest that the patterns within this system may reflect clusters of speakers with shared social characteristics, and/or subsystems of sounds that carry shared social meaning. Here, we take a first step towards connecting the spontaneous production of covarying NZE monophthongs to their perception by NZE listeners.

2.1. Analysing sociolinguistic perception

How do we approach listener perception within the variationist paradigm? Perceptual studies on how different (combinations of) variants affect listener evaluations of speakers tend to employ the Matched Guise Technique [20] or a verbal guise format [e.g., 21–23]. Listeners hear different “guises” of the same speakers containing different (frequencies of) variants of the variables in question in the former, and excerpts from different speakers in the latter. In both formats, listeners either rate speakers along a scale [e.g., 4,23–28], or make a categorical forced choice [e.g., 2,3,29] for prespecified social characteristics. Listeners can assess speakers in relation to macro-social characteristics such as age, ethnicity or race [e.g., 2,21,24,27], or more localised speaker attributes, styles and social personae [e.g., 3,30–32].

While some research has explored attitudes towards NZE accents relative to other English varieties [e.g., 33,34], very little work has looked at sociolinguistic evaluations of variables within NZE. Szakay [21] has shown that voice quality and speech rhythm are used by listeners in tasks involving perception of ethnicity. Bayard and Bartlett [35] have demonstrated a perceptual association between rhoticity and region. Gordon’s [36] work eliciting responses to three NZE voices shows that there are variables that listeners hear as socially distinct but does not reveal which ones. Perceptual dialectology work, in which participants label maps of NZ, shows consistent use of labels evoking social class, suggesting a general orientation toward a relationship between accent and social class [37]. Some analyses have shown that listeners use social information to adjust how they listen to vowels [e.g., 1,38], which suggests a relationship between these variables and social evaluation but does not explicitly demonstrate it. In sum, despite the extensive sociolinguistic work on the production of NZE, we have very little clear evidence about the social judgement or perception of specific sociolinguistic variables, either in isolation or combination.

The leader-lagger continuum and back-vowel configuration present, therefore, a methodological challenge to investigating the social meaning(s) they may carry for listeners. Both guise techniques require researchers to specify relevant social characteristics; however, we do not yet know whether listeners perceive the leader-lagger continuum and back-vowel configuration at all. One alternative approach is to start with listener perceptions of speaker similarity and then test if listeners differentiate between speakers based on specific social characteristics or linguistic features, such as covarying vowel patterns. For example, work on the perception of regional dialects in the United States has used both listener ratings of the likelihood of pairs of speakers coming from the same region [39] and free classification groupings of speakers listeners perceive to be from the same regions [40] to show speakers are perceptually differentiated by both their region and accent markedness. Explorations of perceived voice similarity in psychology and forensic linguistics have also explored the acoustic correlates that drive judgements of perceived similarity, producing consistent evidence for the role of fundamental frequency, laryngeal differences and formant values in differentiating between speakers perceptually [e.g., 41–43]. There are few investigations of sociolinguistic perception that seek to first differentiate between speakers based on their perceived similarity, and then test whether a given variable is relevant to how they are differentiated [though see 44].

In this paper, we start with exploring the perceived similarity of NZE speakers to create a general structure of listener perception and ask whether the leader-lagger continuum or the back-vowel configuration differentiate between these speakers within this structure. If they do, we then have a sound foundation for future explorations of the potential connections between the vowel clusters and associated social meanings.

2.2. Sociolinguistic perception of variables in context

Our experiment uses spontaneously produced speech. As noted by Campbell-Kibler [45], sociolinguistic perceptual work tends to employ controlled audio stimuli [though see 21,24–26,30,45,46]. But, as Villarreal and Grama’s [47] reanalysis of the perceptual data in Villarreal [25] highlights, techniques such as the Matched Guise Technique require researchers to make assumptions about which variables are most perceptually important to listeners, and these assumptions can result in the variables that most influence listener evaluations in spontaneous speech being overlooked.

It is also well-established that paralinguistic variables inherent to speech, and particularly variable in spontaneous speech, such as voice quality, speech timing and pitch, are perceived by listeners and subject to social evaluations [e.g., 48]. For example, there is longstanding evidence for features of speech timing influencing perceptions of dialectal differences, including the folk linguistic concept of ‘drawling’ in the Southern United States [49] and Wells’ [50] impressionistic claim that urban speakers speak faster than rural speakers [see discussion in 51]. Perceptual research has also shown listeners can not only distinguish between fast and slow speech and lower and higher pitched voices, but can associate different speech timings and voice pitches with different ages [e.g., 52], genders [e.g., 53–55], ethnicities [e.g., 56], and personality attributes [e.g., 57,58]. These associations can, however, be mediated by language familiarity, listener native language and the language used by the speaker [e.g., 59,60]. Sociolinguistic perceptual work has also produced evidence that pitch can interact with other variables in affecting the perceived masculinity and/or sexuality of cisgender men [e.g., 5].

It is possible, therefore, that the perceptual relevance of a given sociolinguistic variable depends on what other features are also present, or that its social significance will be mediated or dwarfed by other features. However, outside of investigations of perceptions of male masculinity and sexuality [e.g., 4,5 both consider pitch as a variable], the sociolinguistic perceptual studies that have used spontaneous speech tend not to examine whether paralinguistic features also contribute to listener evaluations or interact with their variables of interest [though see 2,29]. We investigate not only whether the covarying vowel patterns are relevant to listener perception of speakers, but also whether speaker articulation rate and pitch differentiate between speakers within the same structure of listener perception.

2.3. Research questions

We conduct a pairwise similarity rating task to address two linked research questions. Our primary research question, building on the literature above, is:

Research question 1: What structural relationships emerge among speakers when listeners evaluate them based on their spontaneous speech? That is, which speakers do listeners perceive to sound similar and different to one another within a multi-dimensional perceptual space, and are speakers that listeners perceive to sound different to one another differentiated by their covarying vowel patterns, their speed, and/or their pitch? Our preregistration predicts that one or both covarying vowel patterns will link to perceived speaker similarity, reading: “We want to identify the groups of speakers that listeners think sound similar to each other, and then compare these groups to the patterns of speakers identified in Hurring et al. (In prep). We hypothesise that the speakers that listeners perceive as sounding similar in the experiments of this study will align with one or more of the vowel patterns associated with different Principal Components in Hurring et al. (In prep). If they do not, this will imply that factors other than vowel formants are being used to perceptually group speakers, and we will conduct exploratory analysis to explore what these factors may be.” We note that Hurring et al. [17] is the paper we refer to in our preregistration.

We also explore a secondary research question to understand the patterns in our participants’ trial-by-trial responses, and what these patterns can tell us about the results of our preregistered task. Perceptual studies have demonstrated that listeners are more likely to notice and attend to (changes in) certain linguistic features than others (i.e., some features are more perceptually salient than others) [61–63]. Work on the sociolinguistic assessment of linguistic variants also points to the variable salience of different features [e.g., 64] and suggests that non-standard or innovative variants may affect listener perception more than standard or conservative variants (i.e., some variants are more socio-linguistically salient than others) [65–67]. Thus, it may be that pitch, articulation rate or vowels are not only inherently salient to different degrees, but their sociolinguistic or perceptual salience might depend on the extent to which a speaker favours certain variants of one variable or another.

There is also evidence that sociolinguistic judgements can be made early and be relatively ‘bullet-proof’ once made; listeners can be relatively immune to variables that run counter to their initial judgements [e.g., 31,66,68]. If one variable is particularly salient in a speaker’s recording, and occurs early, it may be the dominant influence in listeners' responses. Moreover, work on perceived similarity shows that this is not always symmetric. As Tversky (1977) outlines, similarity can depend on what is the ‘subject’ and what is the ‘referent’, with the features of the subject being more heavily weighted than the features of the referent. Hodgetts and Hahn [69] also demonstrate asymmetry in a non-verbal implicit measure of similarity, arguing that similarity is influenced by how complex it is to transform one object into another. Regardless of the mechanism, asymmetry may be even more acute when the stimuli are presented auditorily, as the listener must assess the first speaker along various dimensions before encountering the second speaker.

An advantage of the Multi-dimensional Scaling approach we take to answer research question 1 is that – in our counterbalanced design – we will be able to observe the structure that emerges despite any trial-by-trial order effects. However, we were also interested in explicitly exploring these effects, to see what they could teach us about how listeners were completing our task. Our secondary research question, then, is:

Research question 2: What factors influence individual pairwise similarity ratings? Does looking at the trial-by-trial response patterns reveal different information than the MDS in terms of how the speakers are perceived? In particular, does the order in which voices are presented affect the ratings given? This question is exploratory and was not preregistered.

3. Methodology: online pairwise similarity task

3.1. Stimuli

The experiment uses data from the same QB1 corpus analysed in Hurring et al. [17]. The specific audio stimuli used in the experiments come from the QuakeBox recordings of 38 women aged 46–55 who met two conditions. First, that they were included in the PCA analysis in Hurring et al. [17]. Second, that they had consented to have the audio of their recording shared publicly [18]. We selected stimuli from a specific gender and age group in the QuakeBox corpus to reduce the risk of participants assessing speaker similarity along these social factors. All 38 women were Pākehā (New Zealanders of European background); 33 had grown up in the South Island, predominantly in the North Canterbury region (27). As there was only one Māori woman in this age group who met the two conditions, we chose to not include her in the stimuli. Stimuli were selected based on length (maximum 10 seconds) and the presence of the monophthongs analysed in Brand et al. [13] and Hurring et al. [17] (at least 5 of the 10 had to be present).

We also considered content when selecting the stimuli. Consistent with the design of the QB corpus, many QB participants’ stories have upsetting aspects, and we did not want to continually expose our participants to such content, especially as some of them may have experienced the earthquakes themselves. We consequently focused on more positive parts of the recordings, like the sense of community or amusing aspects of their earthquake experiences. Where this was not possible, we ensured the stimuli topics were not explicitly negative (i.e., while some stimuli discuss damage to property, none talk about death or traumatic personal events). We also ensured that the clips did not contain information that could give an indication as to the speakers’ social backgrounds (e.g., occupation, specific Christchurch suburbs, schools attended etc.).

3.2. Experiment

We recruited participants online between December 2023 and February 2024 using targeted social media ads. Participants were offered a $10 e-voucher. Participants had to be over the age of 18, be a speaker of NZE, and have lived in NZ since at least the age of 7. All participants provided written consent by completing an online form prior to being directed to the experiment, and they could withdraw at any time before submitting their responses. Upon completing the experiment, participants completed a background questionnaire. This experiment was reviewed and approved by the Human Research Ethics Committee at the University of Canterbury (2023/60/LR-PS).

Following the approved protocol, we ran the experiment online using a JavaScript application developed by Chan [70] and adapted for the current study. Each participant listened to a subset of the possible 703 combinations of the 38 stimuli, because listening to all possible combinations would take multiple hours. Longer experiment times can reduce participant engagement [71] and lead to more unreliable responses [e.g., 72,73]. As such, each participant listened to two blocks of 19 stimulus pairs. In each block, each speaker was heard once. We used a semi-random sampling procedure that distributed the possible combinations of stimuli pairs as evenly as possible across the stimuli subsets participants heard (see supplementary materials Section 4 for details). In effect, for approximately every 37 participants all 703 stimuli pairs are listened to once. We note that this is how the stimuli were distributed in building the experiment, but participant drop out meant that this did not translate exactly in practice (see Section 5.5. for further discussion). The order the audio stimuli were presented to individual listeners was randomised, as was the order of stimuli within each pair.
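The coverage property described above (blocks of 19 pairs in which each of the 38 speakers is heard exactly once, with all 703 pairs covered across 37 blocks) can be achieved deterministically with a round-robin schedule. The sketch below is illustrative Python, not the semi-random procedure described in the supplementary materials:

```python
def round_robin_blocks(n):
    """Circle-method one-factorisation of n speakers (n even): returns
    n - 1 blocks of n / 2 unordered pairs, such that every speaker appears
    exactly once per block and every pair occurs exactly once overall."""
    assert n % 2 == 0, "needs an even number of speakers"
    ids = list(range(n))
    blocks = []
    for _ in range(n - 1):
        blocks.append([tuple(sorted((ids[i], ids[n - 1 - i])))
                       for i in range(n // 2)])
        # Hold ids[0] fixed and rotate the remaining speakers one step
        ids = [ids[0], ids[-1]] + ids[1:-1]
    return blocks

blocks = round_robin_blocks(38)   # 37 blocks of 19 pairs = all 703 pairs
```

Randomising which blocks a participant receives, the order of trials within blocks, and the order of speakers within a pair then yields a design with the same even distribution of pairs across participants.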

We tried to steer participants toward social judgements, rather than superficial judgements based on acoustic properties, or aspects of what was said. Participants were thus given the following instructions:

“Sometimes people sound similar because they are friends, have similar occupations or personalities, or grew up in the same area.

In this task, we want to know which speakers of New Zealand English you think sound similar to each other, and which speakers you think sound different to one another.

In this task you will listen to 40 pairs of audio clips from different New Zealand women talking about their experiences of the Christchurch earthquakes as part of the QuakeBox project. We would like you to rate how similar the women in each pair sound to you. We are interested in who you think sound similar based on the way they talk, rather than the things they say.”

Participants listened to each pair of speakers and rated how similar they thought each pair sounded [cf. 74]. Participants were asked to rate each pair on a scale from ‘not similar’ (0) to ‘similar’ (1) using a slider interface; they did not see any numeric values on the scale. Participants could not progress to the next pair until both stimuli had been played in their entirety, and they had clicked on the slider. 140 listeners completed the task, which we note is fewer than our preregistered goal of 180–200. We stopped data collection and started analysis once it was clear participant recruitment had slowed down, and a significant delay would be needed to recruit the final 40 participants. We removed outliers (n = 7) using a method diverging from our preregistration as described in the supplementary materials (Section 7) and report an analysis of 133 participants. The supplementary materials also include an analysis which follows the preregistered filtering method (Section 8).

4. Analysis and results

4.1. Creating a two-dimensional perceptual space of speaker similarity by means of multidimensional scaling analysis

In this section we conduct our preregistered analysis, applying multidimensional scaling (MDS) to the similarity ratings from the pairwise ratings task. MDS is a data-reduction technique that represents measurements of (dis)similarity among pairs of objects “as distances between points of a low-dimensional multi-dimensional space” [75]. The multi-dimensional space is a (pseudo-)perceptual space which, theoretically, corresponds to the cue(s) driving the perceived similarity of the objects. It is, therefore, ideal for analysing pairwise similarity ratings and is commonly used to quantify perceptual similarity of a range of objects, including speech. While MDS has received some application in sociolinguistic perceptual work [e.g., 40,44,76–78], it has primarily been applied in phonetic, psychological and forensic analyses of perceived voice and speaker similarity [e.g., 41,79,80]. A recurring finding is that fundamental frequency (F0) correlates with perceived similarity of speakers [e.g., 43,81–83]. Some studies have also found evidence for an influence of F1 and F2 measurements [e.g., 43,82,83]. However, no MDS analyses have explicitly considered the role of covarying vowel patterns, and few have considered audio stimuli with variable speech rates.

Following the procedure detailed in the supplementary materials (see Section 5.1), we apply a non-metric MDS method (M-splines) to the results of the experiment using the smacof R package [84]. We scaled the pairwise similarity ratings per participant and then took the mean rating for each stimuli pair across all participants to create a 38 x 38 similarity matrix. As MDS requires all numbers in the input matrix to be above 0, the individual scaled ratings were all brought above 0 by adding the minimum score to all ratings before calculating the mean. We then applied MDS to a dissimilarity matrix derived from the 38 x 38 similarity matrix using the ‘reverse’ option from the sim2diss function [84].
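The preprocessing just described can be sketched as follows. This is illustrative Python using only the standard library; the actual analysis uses the smacof R package, and the exact shifting of the scaled scores may differ from this sketch:

```python
def zscore_shifted(ratings):
    """Z-score one participant's slider ratings, then shift so the lowest
    value sits at zero (MDS input must be non-negative)."""
    n = len(ratings)
    mean = sum(ratings) / n
    sd = (sum((r - mean) ** 2 for r in ratings) / (n - 1)) ** 0.5
    scaled = [(r - mean) / sd for r in ratings]
    lo = min(scaled)
    return [s - lo for s in scaled]

def sim_to_diss(sim):
    """'Reverse' conversion in the spirit of smacof's sim2diss:
    d_ij = min(s) + max(s) - s_ij, so high similarity maps to low distance."""
    flat = [s for row in sim for s in row]
    lo, hi = min(flat), max(flat)
    return [[lo + hi - s for s in row] for row in sim]
```

Averaging the per-participant scaled ratings for each of the 703 pairs fills the off-diagonal cells of the 38 x 38 similarity matrix that sim_to_diss then inverts.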

MDS requires the researcher to specify the number of dimensions of the multi-dimensional space a priori. A conventional approach to determining the number of dimensions to specify is to rely on a rule-of-thumb cut off for ‘stress’, a measure of the fit of the MDS where a lower stress value corresponds to a better fitting analysis (i.e., how well the analysis explains the underlying structure). However, as stress will always decrease as the number of dimensions increases, we do not consider relying on a single stress value to be best practice [see 85]. We additionally apply two permutation-based tests to inform the choice of dimensions. The first test is a novel method available via the mds_test function in the nzilbb.vowels R package [16]. The function implements a permutation and bootstrapping procedure to compare the distribution of stress reduction as the number of dimensions increases to the distribution of stress reduction we would expect if there were no structure in the data. The results of this procedure indicated that adding in a third dimension would not reduce stress more than we would expect by chance. The second test is an informal significance test [see 85] in which we use the permtest function from the smacof package to calculate stress for two-dimensional MDS analysis applied to 500 permuted iterations of the dissimilarity matrix [84]. The stress value for the unaltered matrix is below 95% of the permuted stress values, which we treat as informally equivalent to a p-value <0.05. Together, the results from the two tests support a two-dimensional MDS.
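The logic of the informal significance test can be sketched generically: shuffle the dissimilarities, recompute the fit statistic, and ask how often chance does as well as the observed data. In the sketch below, stress_fn is a hypothetical placeholder for whatever stress computation is used (in the paper, smacof's permtest handles this internally):

```python
import random

def permutation_p(diss, stress_fn, n_perm=500, seed=2023):
    """Shuffle the upper-triangle entries of a symmetric dissimilarity
    matrix n_perm times and report how often a permuted matrix fits at
    least as well (stress at least as low) as the observed one."""
    rng = random.Random(seed)
    n = len(diss)
    observed = stress_fn(diss)
    upper = [diss[i][j] for i in range(n) for j in range(i + 1, n)]
    at_least_as_good = 0
    for _ in range(n_perm):
        vals = upper[:]
        rng.shuffle(vals)
        # Rebuild a symmetric matrix with a zero diagonal
        perm = [[0.0] * n for _ in range(n)]
        k = 0
        for i in range(n):
            for j in range(i + 1, n):
                perm[i][j] = perm[j][i] = vals[k]
                k += 1
        if stress_fn(perm) <= observed:
            at_least_as_good += 1
    return (1 + at_least_as_good) / (1 + n_perm)
```

An observed stress below 95% of the permuted values corresponds to a returned value under 0.05, the informal threshold used in the text.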

Following Mair et al. [85] we report the two-dimensional MDS with the lowest stress value (Stress-1 = 0.31) from 100 random starts. Fig 1 maps the coordinates from the two specified dimensions in our final MDS analysis for the 38 stimuli: Dimension 1 (D1) scores are on the horizontal axis, and Dimension 2 (D2) scores are on the vertical axis (scree, Shepard and bubble plots are also available in Section 5.1 of the Supplementary Materials). The closer two individuals are to each other in this space, the more similar they are perceived to sound. The further two individuals are from each other, the less similar they are perceived to sound.

4.1.1. Predicting perceptual dimensions from MDS analysis with regression trees and random forests.

MDS can reduce similarity data to a smaller number of dimensions, but the technique cannot provide insight into what each dimension represents or the cues with which they correlate. To investigate whether our variables of interest correspond to either D1 or D2, we fit two regression trees with D1 and D2 as the dependent variables. We also implemented random forests to evaluate the importance of each predictor in predicting each dimension. As outlined in detail in the supplementary materials (Section 5.2) we fit the regression trees and random forests using the parsnip package [86] in R, with the engine set to rpart for the former [87] and ranger [88] for the latter.

The independent variables in both regression trees were:

  • Stimulus articulation rate
  • Stimulus mean pitch
  • The speaker’s position on the leader-lagger continuum (i.e., How much they are “leading” or “lagging” in the changes for fleece, dress, kit, trap, nurse, strut, goose, based on their full QB1 monologue – see Hurring et al. [17])
  • The speaker’s position in the back-vowel configuration (i.e., the relationship between their start, thought and lot vowels, based on their full QB1 monologue – see Hurring et al. [17]).

We preregistered a series of pairwise correlations between the perceptual dimensions and these factors, but moving to a regression tree is preferable as it allows us to examine how the factors might work together, rather than considering them independently. We report the preregistered correlations in the supplementary materials (Section 6).

The mean pitch measurements were extracted manually from the stimuli in Praat [89] using the default pitch range settings and the cross-correlation method. We quantified speaker ‘articulation rate’ as the total number of canonical syllables produced in the stimulus, divided by total phonation time. Total phonation time was calculated from the forced alignment of participants’ QB monologues in LaBB-CAT [90] and includes corrections, incomplete productions, the filled pauses um and uh/ah, and inter-word pauses less than 50 milliseconds in length. The number of canonical syllables is based on the CELEX dictionary used by LaBB-CAT in forced alignment. Speakers’ positions in the leader-lagger continuum and back-vowel configuration are represented by their QB1 Principal Component loadings from Hurring et al. [17], which quantify speaker position for these variables based on the monophthongs produced across the entire monologue. All variables are scaled across the 38 speakers.
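The rate calculation and scaling step amount to the following arithmetic (illustrative values only, not the paper's data; function names are hypothetical):

```python
import numpy as np

def articulation_rate(n_syllables, phonation_time_s):
    """Canonical syllables per second of phonation time."""
    return n_syllables / phonation_time_s

def z_scale(values):
    """Scale a variable to mean 0, SD 1 across the speaker sample."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)
```

For example, a speaker producing 200 canonical syllables over 50 seconds of phonation has an articulation rate of 4 syllables per second; z-scaling then expresses each speaker's rate relative to the sample mean in standard-deviation units.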

Fig 2A displays the regression tree predicting Dimension 1, while Fig 2B shows the regression tree predicting Dimension 2. Regression trees recursively split the data into smaller groups called “nodes”: at each node, the tree generates an if-else rule based on the most important predictor, which divides the data into subsequent nodes. This process continues until certain stopping criteria are fulfilled, with each resulting group summarised by the mean of the dependent variable. Each node displays the estimated value of the dependent variable, along with the proportion and number of observations contained within that node.
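The splitting procedure just described can be sketched as a minimal regression tree. The analysis fit its trees with rpart via parsnip in R; this toy Python implementation, with hypothetical helper names, chooses each split to minimise squared error and predicts the node mean.

```python
import numpy as np

def best_split(X, y):
    """Return the (feature, threshold) pair minimising total squared error,
    or None if no split is possible."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return None if best is None else best[1:]

def grow(X, y, depth=0, max_depth=2, min_node=5):
    """Recursively partition the data, stopping at max_depth or small nodes."""
    split = None if depth == max_depth or len(y) < 2 * min_node else best_split(X, y)
    if split is None:
        return {"value": y.mean(), "n": len(y)}  # terminal node: predict the mean
    j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow(X[mask], y[mask], depth + 1, max_depth, min_node),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_node)}

def predict(tree, x):
    """Follow the if-else rules down to a terminal node."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```

Each internal dictionary corresponds to one if-else rule of the kind displayed in Fig 2A and 2B, and each terminal dictionary to a node showing an estimated value and the number of observations it contains.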

Fig 2. (A) The results of the regression tree for Dimension 1, (B) The results of the regression tree for Dimension 2, and (C) the interpretation of the perceptual space based on (A) and (B).

https://doi.org/10.1371/journal.pone.0338199.g002

The most important predictor of Dimension 1 is a speaker’s mean pitch, with higher pitched speakers estimated to have a lower D1 score. Within the lower pitched speakers, there is a role for a speaker’s position in the leader-lagger continuum, with laggers in this cluster of vowel changes (those with lower leader-lagger scores) estimated to have a lower D1 score, and leaders estimated to have a higher D1 score. As such, speakers with a higher D1 score are more likely to be lower pitched speakers who are leaders in the leader-lagger continuum. The most important predictor of D2 is articulation rate, with slow speakers estimated to have a lower D2 score. Mean pitch also plays a role for the faster speakers, where fast, higher pitched speakers are estimated to have a higher D2 score.

The results of the random forest procedures uphold the relative importance of pitch, speed and the leader-lagger continuum in predicting Dimensions 1 and 2. Specifically, mean pitch, followed by the leader-lagger continuum, emerges as the most important predictor of D1. Articulation rate, followed by mean pitch, emerges as the most important predictor of D2. The random forest procedure also points to the back-vowel configuration as having a positive predictive effect on D2, albeit to a much lesser extent than speed and mean pitch (see Section 5.2.2 of the supplementary materials for discussion of this result). The results of the regression trees and random forests therefore point to these variables contributing to the perceived similarity of these speakers.

Fig 2A and Fig 2B also map the cutoff values in the regression trees onto the D1 and D2 coordinates in Fig 1. Fig 2A highlights the high-pitch speakers in red and the lower-pitch laggers and leaders (with a leader-lagger score below/above −0.26) in purple and yellow, respectively. Fig 2B highlights slow speakers (this time with an articulation rate below −0.27) in blue. Two groups of slower and lower pitch speakers are concentrated in the bottom right and left of the space. Fig 2B also reflects the perceptual importance of pitch, with the fast high pitch speakers concentrated in the top left of the space (with a mean pitch above 0.31) in red; we also see most (4/5) slow and higher pitched speakers concentrated in the left of the space. The fast, low-pitched speakers are concentrated in the middle in orange.

If we combine the cutoffs for speed, mean pitch, and the leader-lagger continuum, we can identify five main groups of speakers, as shown in Fig 2C. The first two groups are the slower and/or lower pitched leaders (yellow upside-down triangles, bottom right) and laggers (purple squares, bottom left). Third, we have leaders who are both faster and lower pitched (orange triangles, middle). Fourth, we have speakers who are higher pitched, regardless of whether they are a leader or lagger (red diamonds, top right). Finally, we have laggers specifically who are fast and/or higher pitched (dark orange circles, top). While there is some overlap between the different groups, there is nonetheless evidence that listeners make subtle perceptual distinctions between speakers based on all three variables. The MDS therefore points to speed, pitch, and one of the covarying NZE vowel patterns as underlying the perceptual relationships between speakers.

4.1.2. Testing differences between the main groups in the MDS space.

We have proposed that five main speaker groups are differentiated within the MDS space. In this section, we test whether these groups are statistically distinct from each other by means of permutational MANOVA (PERMANOVA) [91,92]. PERMANOVA is a flexible statistical technique that compares the variation between groups to the variation within groups, based on a distance or dissimilarity matrix. We use the adonis2 function from the vegan package [93] to test the null hypothesis that the centroids (multivariate means) of our five speaker groups are equivalent. While the function is sensitive to both group centroids and dispersion, the five speaker groups have comparable levels of dispersion (the average distance of a speaker from the centroid of their group is between 2.27 and 2.48 for all groups; see Section 2.4 of the supplementary materials).
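The pseudo-F statistic and permutation procedure at the heart of PERMANOVA can be sketched as follows. This is a simplified one-way Python analogue of what vegan's adonis2 computes, not its implementation, and the function name is illustrative.

```python
import numpy as np

def permanova(D, groups, n_perm=999, seed=0):
    """One-way PERMANOVA: pseudo-F comparing between- to within-group
    variation in a distance matrix, with a label-permutation p-value."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    g = len(labels)

    def within_ss(gr):
        # Sum over groups of squared within-group distances, each group
        # divided by its size (Anderson's formulation)
        ss = 0.0
        for lab in labels:
            idx = np.where(gr == lab)[0]
            sub = D[np.ix_(idx, idx)]
            ss += (sub ** 2).sum() / (2 * len(idx))
        return ss

    ss_total = (D ** 2).sum() / (2 * n)
    ss_within = within_ss(groups)
    ss_between = ss_total - ss_within
    f_obs = (ss_between / (g - 1)) / (ss_within / (n - g))

    exceed = 1  # count the observed statistic itself
    for _ in range(n_perm):
        sw = within_ss(rng.permutation(groups))
        f_perm = ((ss_total - sw) / (g - 1)) / (sw / (n - g))
        exceed += f_perm >= f_obs
    return f_obs, exceed / (n_perm + 1)
```

Shuffling the group labels breaks any genuine group structure, so the p-value is the proportion of permutations (plus the observed statistic) whose pseudo-F matches or exceeds the observed value.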

Table 1 summarises the result of PERMANOVA applied to the same dissimilarity matrix input to the MDS analysis. The results indicate that we can reject the null hypothesis that the centroids of the speaker groups are equivalent, and that the groups are, indeed, perceptually distinct. This raises the question, however, of whether all the proposed groups are distinct from each other. To investigate further, we apply the pairwiseAdonis2 function [94], which builds on adonis2 to conduct pairwise comparisons of the groups. Table 2 summarises the pairwise contrasts, including original p-values and those adjusted using the Bonferroni correction. We can see that not all the groups are distinct from each other. Specifically, it is the group of fast leaders in the middle of the perceptual space that does not differ significantly from the other groups.

In other words, the position of the fast leaders in the middle of the perceptual space may arise precisely because they share production features with each of the other four groups and are, consequently, not categorically distinct from them. Fast, low-pitched leaders are the intermediary group between the extremes of both leaders (i.e., they separate slow/low pitch leaders from high pitch leaders) and laggers (i.e., they separate slow/low pitch from fast/high pitch laggers). The production groups surrounding the fast leaders do, however, contrast with each other. Fast/high pitch laggers are distinct from slow/low pitch laggers, and slow/low pitch leaders are distinct from both high pitch speakers and fast/high pitch laggers. Finally, and importantly for us, slow/low leaders and laggers are significantly different from each other.

The results of the PERMANOVA support, overall, the proposed interpretation of the MDS space. They also provide additional nuance to our understanding of the relative distinctiveness of the different groups, indicating that four of the five proposed groups are statistically distinct from at least one of the other groups within the MDS space. The fifth group, fast and lower-pitch leaders, appears to be the “bridging” group at the centre of the MDS space which shares production features (speed, pitch or leader-lagger status) with speakers in the surrounding groups.

4.2. Predicting pairwise similarity ratings

Using MDS, we have explored the overall structure of perceptual similarity amongst our speakers. The application of MDS, which abstracts away from trial effects such as the order of presentation of the speakers, raises the question of whether the same patterns would have emerged in a more direct analysis of the trial-by-trial pairwise similarity ratings from the online task. Are ratings of similarity symmetric, or do the properties of the first voice substantially influence perceived similarity? To address this question, we fit a generalised additive mixed model (GAMM) using the bam function from the mgcv package [95] in R [9], which allowed us to explore the predictors of individual pairwise ratings, and their relationship to the perceptual space.

The dependent variable was the listener ratings for each pair of stimuli (i.e., speakers’ perceived pairwise similarity), scaled for each participant. The predictor variables were the same four measures (leader-lagger score, back-vowel configuration, stimulus pitch and stimulus articulation rate), with the values of each measure for the first and second stimulus in a pair fit jointly as a tensor product smooth. Each tensor product smooth used four knots. Fitting tensor product smooths allows us to examine how the relationship between the first and second stimulus affects perceived similarity ratings. Moreover, opting for a GAMM over a linear mixed model allows us to account for non-linear relationships between the independent variables in the tensor product smooths and the dependent variable. To control for potential participant-specific effects on perceived similarity, we included a random smooth by participant ID. We also included a random smooth for each ordered pair (i.e., for stimulus a and stimulus b, there is a separate intercept for pair ab and for pair ba).
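A tensor product smooth combines a spline basis over each marginal predictor into a joint two-dimensional surface. The model itself was fit with mgcv's te() terms and penalised B-splines under REML; the following rough numpy sketch, with hypothetical helper names, instead uses truncated-power cubic bases and a simple ridge penalty to illustrate the construction.

```python
import numpy as np

def marginal_basis(x, knots):
    """Cubic truncated-power spline basis for one predictor."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def tensor_product(B1, B2):
    """Row-wise Kronecker product: every pairwise product of the two
    marginal basis functions, giving a basis for a 2-D surface."""
    return np.einsum("ij,ik->ijk", B1, B2).reshape(B1.shape[0], -1)

def fit_ridge(X, y, lam=1.0):
    """Penalised least squares, standing in for mgcv's smoothness penalty."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With, say, the first stimulus's articulation rate feeding B1 and the second's feeding B2, the fitted surface X @ beta estimates how every combination of first- and second-stimulus values maps to a similarity rating, which is exactly what the heat-map panels in Fig 3 and Fig 4 visualise.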

4.2.1. Model results.

Table 3 summarises the model output, with statistically significant relationships between first and second stimulus articulation rate, pitch and the leader-lagger scores. The relationship between the first and second stimuli back-vowel configuration scores does not significantly predict similarity ratings. As GAMM predictions are more easily understood from their visualisation than their model summaries, our focus will be on the information presented in Fig 3 and Fig 4. Each figure depicts a significant tensor smooth alongside the individual rated pairs, with the relevant measurement for the first stimulus on the horizontal axis and for the second stimulus on the vertical axis. The colour corresponds to the estimated similarity rating, with darker colours corresponding to a lower rating (lower perceived similarity), and lighter colours to a higher rating. Only model predictions that are statistically significantly different from the mean rating (alpha = 0.05) are plotted.

Table 3. Model summary predicting trial-by-trial pairwise similarity ratings.

https://doi.org/10.1371/journal.pone.0338199.t003

Fig 3. Estimated similarity ratings based on first and second stimuli articulation rate (A) and pitch (B).

Darker colours correspond to lower estimated similarity ratings, lighter colours to higher estimated similarity ratings.

https://doi.org/10.1371/journal.pone.0338199.g003

Fig 4. Estimated similarity ratings based on leader-lagger scores.

Darker colours correspond to lower estimated similarity ratings, lighter colours to higher estimated similarity ratings.

https://doi.org/10.1371/journal.pone.0338199.g004

Fig 3A shows that pairs where the first and second stimulus are faster (top right) have higher similarity ratings (in yellow). Conversely, if the articulation rates are different (top left and bottom right), this leads to low similarity ratings (in blue), regardless of whether the slow or the fast speaker is presented first. This supports the interpretation that articulation rate affects perceived similarity. Fig 3B shows a similar pattern for mean pitch, supporting the idea that pitch also affects perceived similarity, regardless of whether a high or low pitch speaker is presented first.

There is, however, a suggestion in both graphs that the effect of perceived dissimilarity may be stronger if the first speaker is faster or higher pitch (bottom right), than if the first speaker is slower or lower pitch (top left, where the blue covers a smaller area). We also note that stimuli pairs where both stimuli are slow or low pitch do not have higher predicted similarity ratings (bottom left of the graphs, where we might have expected to see yellow). This suggests that higher pitch and faster speakers are more perceptually salient, leading to high similarity ratings, whereas lower pitch and slower speakers are less salient, and thus less likely to make speakers sound strikingly similar. This may lead raters to rely on other, more salient, cues to make their ratings.

Fig 4 similarly shows that pairs where both stimuli are leaders in change (top right) have higher similarity ratings. Pairs where both stimuli are laggers (bottom left) do not, suggesting that leaders are more perceptually salient than laggers. If the first speaker is a leader, then this may orient listeners to vowels, leading to high similarity when followed by a leader (top right) and low similarity if the second speaker is a lagger (bottom right). But if the first speaker is a lagger (left), this is not salient, and listeners do not use vowels to make their judgement. Another effect we see here is a general low similarity score for cases where the first speaker is not extreme in their vowels (middle of the graph), and the second speaker is also not extreme or toward the (potentially non-salient) ‘lagger’ end. This may suggest that speakers who are neutral on their vowels may be judged on another dimension, such as speed or pitch, which may distinguish these speakers more.

In summary, the relationship between speakers’ articulation rate, pitch and leader-lagger scores all appear to have an impact on the perceived similarity of stimuli pairs, but these effects are not symmetrical. Moreover, the impact of speed and pitch looks to be both stronger (differences lead to comparatively low similarity scores, as indicated by the range of estimated similarity ratings represented in the Fig 3 and Fig 4 legends) and more consistent or symmetrical (hearing a lagger followed by a leader may not result in a lower similarity score, but hearing a faster and/or higher pitched speaker followed by a slower and/or lower pitched speaker will). The GAMM therefore provides evidence for the same variables contributing to listener evaluations as the MDS and contributes additional nuance to our understanding of how different variables predict similarity ratings.

5. Discussion

5.1. Addressing the research questions

Our primary question (RQ1) asked what structural relationships emerge among speakers when listeners evaluate their similarity based on spontaneous speech. Our second, more exploratory question asked what factors influence pairwise similarity patterns on a trial-by-trial basis. What, then, have we learned about these questions from our two analyses?

The GAMM indicates that articulation rate, pitch and the leader-lagger vowels affect participant trial-by-trial behaviour, while the back-vowel configuration does not. The model also provides evidence that articulation rate, pitch and the leader-lagger vowels not only influence perceived similarity, but that (a) their order of presentation plays a role in listener responses and (b) one end of each continuum is more perceptually salient than the other. If the first speaker is fast, high pitched, or a leader, this leads to high similarity ratings if the second speaker also shares this characteristic. However, this effect does not emerge for pairs of slow or low pitch speakers, or laggers. For the leader-lagger continuum in particular, we only see a clear effect of increased perceived similarity if the first speaker is a leader. This, together with a general effect of dissimilarity for pairs with initial ‘average’ speakers and following lagger/average speakers, suggests that, in the absence of hearing a leader initially, listeners may be more likely to use other characteristics to rate the speakers. Finally, the GAMM suggests that, relative to the leader-lagger vowels, the impact of speed and pitch may be both stronger and more symmetrical.

The results of the GAMM are consistent with what was learned in the MDS. Namely, leader-lagger vowels, articulation rate and pitch are relevant to how listeners perceptually differentiate between speakers. In D1, we see that high pitch speakers are perceptually distinct. Given the GAMM indicates that high pitch is more salient than low pitch, perhaps, in cases of extreme high pitch, this is the main dimension used by listeners, and the vowels are thus irrelevant. Outside of markedly high pitch speakers, however, the speaker’s status as a leader or lagger becomes more important. We can also see in Fig 2C that lower pitch speakers also tend to be slower, which may facilitate the ability of listeners to orient towards vowel realisations. Turning to D2, we see a primary effect of articulation rate, mediated by pitch. Slower speakers are lower in D2, and faster speakers are higher. Combined with the evidence from the GAMM that faster articulation rate may be more salient, here we see a mediating effect at the potentially more salient end of the articulation rate continuum. It is the perception of faster speakers which is mediated by pitch, and it is fast and high pitch speakers who are particularly high on D2. That is, we see the more salient ends of continua jointly affecting perceptual similarity. This may indicate that these dimensions are working together to influence perceived similarity.

It is through combining the insights of our two analyses, then, that we get a comprehensive overall picture of the perceptual patterns. From the GAMM, we gain additional information about the speeds, pitches, and covarying vowel patterns that are likely more perceptually salient for listeners, enabling us to further interpret the patterns in the MDS. We also learn that the order stimuli are presented to listeners is relevant to individual pairwise ratings. From the MDS, we learn that pitch affects perceived similarity, mediated by the leader-lagger vowels, and articulation rate appears as a second dimension of perceived similarity, mediated by pitch. In other words, all three of these characteristics work together to determine the perceptual similarity structure, and they do not work in isolation of each other.

5.2. The role of covarying vowels

In this analysis we explored the degree to which two sets of covarying vowel patterns influenced listener perception. Both the GAMM and MDS results presented above support the interpretation that one set – the ‘leader-lagger’ vowels – plays a role in the perception of NZE voices. This supports the suggestions in Hurring et al. [17] and Brand et al. [13] that these vowels may be socially meaningful.

The second set was the configuration of a set of back vowels, which were found to covary in both Brand et al. [13] and Hurring et al. [17]. The GAMM and MDS did not provide evidence that speakers’ realization of these vowels systematically affected listener perception in our task. One potential interpretation of this lack of effect is that patterns of variation in the back vowels are not perceptually or socio-linguistically salient and are irrelevant for listener evaluations of speaker similarity. That is, perhaps the covariation is entirely structural, and not perceptually relevant at all. Alternatively, the back-vowel configuration may be perceptually or socially salient to listeners, but to a lesser extent than articulation rate, pitch and the leader-lagger vowels. Our bottom-up approach, using relatively uncontrolled stimuli, is good for picking up the dominant dimensions that listeners tune into. But it does not rule out that, with more controlled stimuli, where perhaps the only thing that varied was the back vowels, listeners may have tuned into their realization to differentiate between speakers.

5.3. The role of articulation rate and pitch

The combined results from the MDS and GAMM provide evidence for the leader-lagger continuum being one of multiple cues that may be used to assess speaker similarity, alongside articulation rate and mean pitch. Indeed, the perceptual salience of speed and pitch looks to be greater than that of the leader-lagger continuum.

A question that remains is the degree to which speed and pitch are socially evaluated and truly working together with the vowels. Our results could arise from different listeners doing the task differently – some reacting to surface acoustic characteristics of the voices, others conducting the intended social evaluation. Or they could arise from different pairs of stimuli being rated on different dimensions, which the GAMM results suggest may, at least sometimes, be the case. But the results could also arise from true assessments of social meaning, where ‘fast’ or ‘high pitch’ carry social evaluations that interact in complex ways with each other, and with whether a speaker is a lagger or a leader. An important topic for future work will be to conduct tasks which reveal what listeners are doing when they rate pairs of speakers. When we ask why listeners rate two high pitch speakers as similar, for example, would they tell us it is because they are both high pitched (i.e., perceptually salient)? Or because they both sound young, feminine, or share some other social characteristic (i.e., socio-linguistically salient)?

Understanding the role of such prosodic factors and – particularly – the degree to which they may work with segmental factors in the creation of social meaning in New Zealand English [see 96] is another important topic for future work. In general, the salience of articulation rate and pitch is consistent with previous research documenting these as salient acoustic features to listeners [e.g., 97,98] and our results support O’Rourke and Baltazani’s [99] call for a greater focus on the nascent field of ‘socioprosodics’.

5.4. Methodological considerations

We have used two different approaches to analyse our data. As the GAMM is modelling the trial-by-trial responses, it is effective at indicating the types of properties that are being used throughout the experiment, and the potentially complex order effects. The MDS is instead effective at abstracting away from trial-by-trial effects and leveraging the collective results to provide a robust high-level picture of the perceptual organization of these speakers. MDS can also compensate, to an extent, for the uneven number of ratings across stimuli pairs by using the information that is available to situate an individual relative to all other speakers, not just those they were paired with. For example, GAMMs can only estimate how close Speaker A is to Speaker B, or A to C. They cannot utilise other ratings, such as those between B and C, to build a map of where A sits relative to both speakers.

The model results do indicate that applying MDS both reduced noise (i.e., presenting an aggregate picture of listener perception) and removed some information in the original data (i.e., not showing potential stimulus order effects on listener perception). However, reducing the perceptual space down to two dimensions and abstracting away from order effects meant we were also able to more easily explore interactions in this space. This is because an interaction between two of our predictors (e.g., articulation rate and pitch) is now a simple two-way interaction, rather than the four-way interaction it would have been in the GAMM (the articulation rate and pitch of each of the two stimuli). There is, therefore, clear value in approaching perceptual similarity data from multiple angles in explorations of listener perception, and the utility and applicability of MDS will, ultimately, depend on the research question and variables of interest.

The order effects in our pairwise similarity rating task indicated by the GAMM are, nonetheless, important. Our results are consistent with work suggesting that similarity is not a symmetric concept [69,100], and point to the importance of future work considering how variable/variant salience might affect participant behaviour in a rating task [64, see 65,66]. There are clear methodological implications that need to be considered, and interesting opportunities for revealing potential differences in salience. Moreover, there is work on pairwise ratings of voice similarity in the forensic literature, where the question of perceived similarity is relevant due to the use of ‘voice line-ups’ [see, e.g., 74,101]. While the use of MDS is recommended [e.g., 41], order effects may be important to understand when applying such work in a practical context.

5.5. Limitations

In this section we discuss the main limitations of the analysis. First, the main limitation of the experiment design is that not all listeners heard all pairs. Moreover, participant dropout meant that not all pairs were heard the same number of times. On average, each pair was rated seven times (the median number of ratings is also seven), and the number of ratings per pair has a standard deviation of 1.6 and variance of 2.5. It is, therefore, likely that some of the mean ratings used to create the similarity matrix are more informative than others. The order effects observed in the GAMM results may also have been affected by the uneven distribution of the pairwise ratings. Future pre-registered modelling that tests higher-order interactions with a set of stimuli that is more balanced and more evenly distributed across participants is required to fully disentangle the potential order effects at play in the perception of speaker similarity.

Second, we note that the GAMM only explained 18.4% of the variance in the data (R2 measure in Table 3). It is, therefore, highly likely that factors other than those we fit to the model are relevant to the perceptual differentiation of speakers. It is also possible that (some of) the variables we did include are functioning, to varying extents, as proxies for other features that more strongly underlie perceived similarity. For example, articulation rate is related to, but does not capture, the use of pauses, and mean pitch is related to, but distinct from, other voice quality measures such as spectral tilt and shimmer. The leader-lagger and back-vowel configuration scores also capture information about two sets of individual variables which may be variably relevant to listener perception. Modelling the relative importance of different speech features to perceived speaker similarity is a clear avenue for future research.

Relatedly, the use of uncontrolled audio stimuli introduces multiple limitations. While we checked for certain forms of content (Section 3.1), we did not control for all potential lexical or topical influences on listener evaluations. More prominently, the distribution of vowels from the two clusters is not even across stimuli. There are fewer back vowels than leader-lagger vowels, providing less evidence of their realization in the stimuli that we played. As such, a third possible explanation for the lack of effect of the back-vowel configuration is that listeners were simply not exposed to sufficient tokens to make judgements based on their realisations. The back-vowel configuration may, therefore, be perceptually or socially salient, but we would have needed more targeted recordings with evenly distributed vowel tokens to reveal it.

Finally, we would like to discuss statistical power. We did not conduct a power analysis to determine the desired participant sample size because dimension-reduction techniques such as our preregistered MDS analysis do not test a null hypothesis. MDS does not, by extension, produce the Type I or Type II statistical errors power analyses are intended to mitigate [102]. The extent to which larger listener sample sizes improve the fit (i.e., stress) of an MDS analysis is, nonetheless, a relevant and under-explored question. Rodgers [103] found that samples as small as one to six participants can provide good metric recovery of original distances and comparable stress values to larger samples, and our sample size exceeds both those numbers and common participant numbers in applications of MDS in psychology [see 104] and linguistics [e.g., 78,80,105]. Discussions of stress and “sample size” in applications of MDS otherwise focus on the number of input items [e.g., 104,106], which in our case was constrained by QuakeBox participant demographics. The exact relationship between our participant sample size and MDS fit remains, however, an open question.

As the question of pairwise ratings emerged as our analysis progressed, we did not preregister the reported GAMM analysis or conduct a post-hoc power analysis of the participant sample size. The general risks of post-hoc power analyses aside [see 107,108], there is limited precedent for determining participant sample sizes in the application of GAMMs. Linguistics papers that discuss GAMMs and statistical power consider power across model types or in the model-fitting process [e.g., 109,110], rather than in relation to sample sizes and Type I/II error probability. Furthermore, while simulation methods have been developed for assessing statistical power for generalised linear mixed models [e.g., 111,112], they do not currently apply to the outputs of GAMM models in R. In other words, there is not yet an accessible, conventionalised approach to informing a desired sample size for fitting GAMMs in linguistics. As such, we highlight simulation-based approaches to participant sampling in both MDS and GAMMs as potential directions for future research and methodological innovation.

6. Conclusion

Past work has identified two sets of vowels in New Zealand English which work together as vowel subsystems in production. We were interested in the degree to which variation within these subsystems played a role in listener perception. To test this, we ran a pairwise-similarity task with New Zealand voices and investigated whether differences in covarying vowel patterns affected perceived speaker similarity (i.e., are perceptually salient). We also investigated the role of mean pitch and articulation rate.

Across two analyses, we found evidence that one of the vowel subsystems is perceptually salient – a set of vowels which are undergoing change in New Zealand English, and for which speakers tend to be ‘laggers’ or ‘leaders’ in the change. This subsystem did not affect similarity in isolation, though. Both mean pitch and articulation rate also played a role and – indeed – these were somewhat more predictive of similarity ratings than the vowel realizations. Our analysis also suggests that perceived similarity is not symmetric, and that the degree of salience of a feature differs at different ends of its continuum. If two speakers are leaders in the sound changes, for example, this makes them sound more similar to each other than two speakers who are laggers.

Using a bottom-up approach, we were able to identify relationships between patterns of production and listeners’ perceptions of voices, without artificially or prematurely imposing the social meanings that might be carried by these characteristics. The methodology of starting with pairwise similarity ratings is an effective strategy for revealing patterns in the way speakers are perceived. The resultant evidence for the perceptual salience of the leader-lagger vowels, mean pitch, and articulation rate now provides a solid foundation for future work investigating precisely how these variables work together in the creation of social meaning in New Zealand English.

Acknowledgments

We would like to acknowledge the Speech Communication Research Group at Northwestern and Chun-Liang Chan for the original development of the software underpinning our experiment. We thank Robert Fromont and Wakayo Mattingley for their technical support implementing the online experiment, and Gia Hurring and our colleagues at the New Zealand Institute of Language, Brain and Behaviour who have followed this project since its inception. Finally, we would like to thank all the listeners who participated in our experiment and the speakers who contributed their voices to the QuakeBox.

References

  1. Hay J, Warren P, Drager K. Factors influencing speech perception in the context of a merger-in-progress. J Phon. 2006;34(4):458–84.
  2. Holliday NR. Perception in Black and White: effects of intonational variables and filtering conditions on sociolinguistic judgments with implications for ASR. Front Artif Intell. 2021;4:642783. pmid:34337391
  3. MacFarlane AE, Stuart-Smith J. ‘One of them sounds sort of Glasgow Uni-ish’: social judgements and fine phonetic variation in Glasgow. Lingua. 2012;122(7):764–78.
  4. Campbell-Kibler K. Intersecting variables and perceived sexual orientation in men. Am Speech. 2011;86(1):52–68.
  5. Levon E. Sexuality in context: variation and the sociolinguistic perception of identity. Lang Soc. 2007;36(4):533–54.
  6. Niedzielski N. The effect of social information on the perception of sociolinguistic variables. J Lang Soc Psychol. 1999;18(1):62–85.
  7. Hay J, Nolan A, Drager K. From fush to feesh: exemplar priming in speech perception. Linguist Rev. 2006;23(3):351–79.
  8. D’Onofrio A. Persona‐based information shapes linguistic perception: Valley Girls and California vowels. J Socioling. 2015;19(2):241–56.
  9. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. 2024.
  10. Beaman KV, Guy GR. The coherence of linguistic communities. Routledge. 2022.
  11. Guy GR. The cognitive coherence of sociolects: how do speakers handle multiple sociolinguistic variables? J Pragmat. 2013;52:63–71.
  12. Becker K. Linking community coherence, individual coherence, and bricolage: the co-occurrence of (r), raised bought and raised bad in New York City English. Lingua. 2016;172–173:87–99.
  13. Brand J, Hay J, Clark L, Watson K, Sóskuthy M. Systematic co-variation of monophthongs across speakers of New Zealand English. J Phon. 2021;88:101096.
  14. Gordon E, Maclagan M, Hay J. The ONZE corpus. In: Beal JC, Corrigan KP, Moisl HL, editors. Creating and digitizing language corpora. Hampshire: Palgrave Macmillan UK. 2007:82–104.
  15. Drager K, Hay J. Exploiting random intercepts: two case studies in sociophonetics. Lang Var Change. 2012;24(1):59–78.
  16. Wilson Black J, Brand J, Hay J, Clark L. Using principal component analysis to explore co‐variation of vowels. Lang Linguist Compass. 2022;17(1).
  17. Hurring G, Wilson Black J, Hay J, Clark L. How stable are patterns of covariation across time? Lang Var Change. 2025;37(1):111–35.
  18. Walsh L, Hay J, Bent D, King J, Millar P, Papp V. The UC QuakeBox Project: creation of a community-focused research archive. N Z Eng J. 2013;27:20–32. http://dx.doi.org/10.26021/2.
  19. Clark L, MacGougan H, Hay J, Walsh L. “Kia ora. This is my earthquake story”. Multiple applications of a sociolinguistic corpus. Ampersand. 2016;3:13–20.
  20. Lambert WE, Hodgson RC, Gardner RC, Fillenbaum S. Evaluational reactions to spoken languages. J Abnorm Soc Psychol. 1960;60:44–51. pmid:14413611
  21. Szakay A. Voice quality as a marker of ethnicity in New Zealand: from acoustics to perception. J Socioling. 2012;16(3):382–97.
  22. Clopper CG, Pisoni DB. Effects of region of origin and geographic mobility on perceptual dialect categorization. Lang Var Change. 2006;18(2):193–221. pmid:21423820
  23. Davydova J, Tytus AE, Schleef E. Acquisition of sociolinguistic awareness by German learners of English: a study in perceptions of quotative be like. Linguistics. 2017;55(4):783–812.
  24. Dailey‐O’Cain J. The sociolinguistic distribution of and attitudes toward focuser like and quotative like. J Socioling. 2000;4(1):60–80.
  25. Villarreal D. The construction of social meaning: a matched-guise investigation of the California Vowel Shift. J Engl Linguist. 2018;46(1):52–78.
  26. Regan B. Intra-regional differences in the social perception of allophonic variation: the evaluation of [tʃ] and [ʃ] in Huelva and Lepe (Western Andalucía). J Ling Geogr. 2020;8(2):82–101.
  27. Walker A, García C, Cortés Y, Campbell-Kibler K. Comparing social meanings across listener and speaker groups: the indexical field of Spanish /s/. Lang Var Change. 2014;26(2):169–89.
  28. Díaz-Campos M, Killam J. Assessing language attitudes through a matched-guise experiment: the case of consonantal deletion in Venezuelan Spanish. Hispania. 2012;95(1):83–102.
  29. Holliday N, Jaggers ZS. Influence of suprasegmental features on perceived ethnicity of American politicians. In: Wolters M, Livingstone J, Beattie B, Smith R, MacMahon M, Stuart-Smith J, Scobbie JM, editors. Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK: the University of Glasgow, 2015.
  30. Podesva RJ, Reynolds J, Callier P, Baptiste J. Constraints on the social meaning of released /t/: a production and perception study of U.S. politicians. Lang Var Change. 2015;27(1):59–87.
  31. Campbell-Kibler K. The nature of sociolinguistic perception. Lang Var Change. 2009;21(1):135–56.
  32. Pharao N, Maegaard M, Møller JS, Kristiansen T. Indexical meanings of [s+] among Copenhagen youth: social perception of a phonetic variant in different prosodic contexts. Lang Soc. 2014;43(1):1–31.
  33. Bayard D. Antipodean accents and the “cultural cringe”: New Zealand and American attitudes toward NZE and other English accents. Te Reo. 1991;34:15–52.
  34. Bayard D, Weatherall A, Gallois C, Pittam J. Pax Americana? Accent attitudinal evaluations in New Zealand, Australia and America. J Socioling. 2001;5(1):22–49.
  35. Bayard D, Bartlett C. You must be from Gorrre: attitudinal effects of Southland rhotic accents and speaker gender on NZE listeners and the question of NZE regional variation. Te Reo. 1996;39:25–45.
  36. Gordon E. Sex, speech, and stereotypes: why women use prestige speech forms more than men. Lang Soc. 1997;26(1):47–63.
  37. Duhamel M-F, Meyerhoff M. An end of egalitarianism? Social evaluations of language difference in New Zealand. Linguist Vanguard. 2014;1(1):235–48.
  38. Drager K. Speaker age and vowel perception. Lang Speech. 2011;54(Pt 1):99–121. pmid:21524014
  39. Clopper CG, Levi SV, Pisoni DB. Perceptual similarity of regional dialects of American English. J Acoust Soc Am. 2006;119(1):566–74.
  40. Clopper CG, Pisoni DB. Free classification of regional dialects of American English. J Phon. 2007;35(3):421–38. pmid:21423862
  41. McDougall K. Assessing perceived voice similarity using Multidimensional Scaling for the construction of voice parades. Int J Speech Lang Law. 2013;20(2):163–72.
  42. Walden BE, Montgomery AA, Gibeily GJ, Prosek RA, Schwartz DM. Correlates of psychological dimensions in talker similarity. J Speech Hear Res. 1978;21(2):265–75. pmid:703276
  43. Baumann O, Belin P. Perceptual scaling of voice identity: common dimensions for different vowels and speakers. Psychol Res. 2010;74(1):110–20. pmid:19034504
  44. Casserly ED. Perceptual similarity across multiple sociolinguistic variables. IULC Working Papers. 2010;10(1):1–15.
  45. Campbell-Kibler K. Accent, (ING), and the social logic of listener perceptions. Am Speech. 2007;82(1):32–64.
  46. Huygens I, Vaughan GM. Language attitudes, ethnicity and social class in New Zealand. J Multiling Multicult Dev. 1983;4(2–3):207–23.
  47. Villarreal D, Grama J. Modeling social meanings of phonetic variation amid variable co-occurrence: a machine learning approach. In: Skarnitzl R, Volín J, editors. Proceedings of the 20th International Congress of Phonetic Sciences, Prague, Czech Republic: Prague Congress Centre, 2023:3745–9.
  48. Drager K, Hardman-Guthrie K, Schutz R, Chik I. Perceptions of style: a focus on fundamental frequency and perceived social characteristics. In: Hall-Lew L, Moore E, Podesva RJ, editors. Social meaning and linguistic variation: theorizing the third wave. Cambridge: Cambridge University Press. 2021:176–202.
  49. Niedzielski N, Preston DR. Folk linguistics. Berlin: Mouton de Gruyter. 2000.
  50. Wells JC. Accents of English. Cambridge: Cambridge University Press. 1982.
  51. Kendall T. Speech rate, pause, and sociolinguistic variation. New York: Palgrave Macmillan. 2013.
  52. Gordon JK, Andersen K, Perez G, Finnegan E. How old do you think I am? Speech-language predictors of perceived age and communicative competence. J Speech Lang Hear Res. 2019;62(7):2455–72. pmid:31265362
  53. Latinus M, Taylor MJ. Discriminating male and female voices: differentiating pitch and gender. Brain Topogr. 2012;25(2):194–204. pmid:22080221
  54. Calhoun S, Warren P, Mills J, Agnew J. Socialising the frequency code: effects of gender and age on iconic associations of pitch. J Acoust Soc Am. 2024;156(5):3183–203. pmid:39535238
  55. Pernet CR, Belin P. The role of pitch and timbre in voice gender categorization. Front Psychol. 2012;3:23. pmid:22347205
  56. Leung G, Deuber D. Indo-Trinidadian speech: an investigation into a popular stereotype surrounding pitch. In: Hundt M, Sharma D, editors. English in the Indian diaspora. Amsterdam/Philadelphia: John Benjamins. 2014:9–27.
  57. Guyer JJ, Fabrigar LR, Vaughan-Johnston TI. Speech rate, intonation, and pitch: investigating the bias and cue effects of vocal confidence on persuasion. Pers Soc Psychol Bull. 2019;45(3):389–405. pmid:30084307
  58. Belin P, Boehme B, McAleer P. Correction: The sound of trustworthiness: acoustic-based modulation of perceived voice personality. PLoS One. 2019;14(1):e0211282. pmid:30653619
  59. Gnevsheva K, Bürkle D. Age estimation in foreign-accented speech by native and non-native speakers. Lang Speech. 2020;63(1):166–83. pmid:30760127
  60. Jiao D, Watson V, Wong SG-J, Gnevsheva K, Nixon JS. Age estimation in foreign-accented speech by non-native speakers of English. Speech Commun. 2019;106:118–26.
  61. Huggins AW. Just noticeable differences for segment duration in natural speech. J Acoust Soc Am. 1972;51(4):1270–8. pmid:5032943
  62. Pisanski K, Rendall D. The prioritization of voice fundamental frequency or formants in listeners’ assessments of speaker size, masculinity, and attractiveness. J Acoust Soc Am. 2011;129(4):2201–12. pmid:21476675
  63. Cutler A, Weber A, Smits R, Cooper N. Patterns of English phoneme confusions by native and non-native listeners. J Acoust Soc Am. 2004;116(6):3668–78. pmid:15658717
  64. Rácz P. Salience in sociolinguistics: a quantitative approach. Berlin: De Gruyter Mouton. 2013.
  65. Labov W, Ash S, Ravindranath M, Weldon T, Baranowski M, Nagy N. Properties of the sociolinguistic monitor. J Socioling. 2011;15(4):431–63.
  66. Watson K, Clark L. How salient is the NURSE~SQUARE merger? Engl Lang Linguist. 2013;17(2):297–323.
  67. Pflaeging J, Mackay B, Schleef E. Sociolinguistic monitoring and L2 speakers of English. Linguistics. 2024;63(3):607–38. pmid:40322244
  68. Levon E, Sharma D, Ye Y. Dynamic sociolinguistic processing: real-time changes in judgments of speaker competence. Language. 2022.
  69. Hodgetts CJ, Hahn U. Similarity-based asymmetries in perceptual matching. Acta Psychol (Amst). 2012;139(2):291–9. pmid:22305350
  70. Chan CL. Speech in Noise 2. Northwestern University. 2018.
  71. Galesic M, Bosnjak M. Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opin Q. 2009;73(2):349–60.
  72. Berry DTR, Wetter MW, Baer RA, Larsen L, Clark C, Monroe K. MMPI-2 random responding indices: validation using a self-report methodology. Psychol Assess. 1992;4(3):340–5.
  73. Baer RA, Ballenger J, Berry DT, Wetter MW. Detection of random responding on the MMPI-A. J Pers Assess. 1997;68(1):139–51. pmid:16370774
  74. Perrachione TK, Furbeck KT, Thurston EJ. Acoustic and linguistic factors affecting perceptual dissimilarity judgments of voices. J Acoust Soc Am. 2019;146(5):3384–99. pmid:31795676
  75. Borg I, Groenen PJF. Modern multidimensional scaling: theory and applications. 2nd ed. New York: Springer. 2005.
  76. Clopper CG, Bradlow AR. Free classification of American English dialects by native and non-native listeners. J Phon. 2009;37(4):436–51. pmid:20161400
  77. Alcorn S, Meemann K, Clopper CG, Smiljanic R. Acoustic cues and linguistic experience as factors in regional dialect classification. J Acoust Soc Am. 2020;147(1):657. pmid:32006987
  78. Shin W, Lee H, Shin J, Holliday JJ. The potential role of talker age in the perception of regional accent. Lang Speech. 2020;63(3):479–505. pmid:31288603
  79. Nolan F, McDougall K, Hudson T. Some acoustic correlates of perceived (dis)similarity between same-accent voices. In: Lee W-S, Zee E, editors. Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong: City University of Hong Kong, 2011:1506–9.
  80. Bradlow A, Clopper C, Smiljanic R, Walter MA. A perceptual phonetic similarity space for languages: evidence from five native language listener groups. Speech Commun. 2010;52(11–12):930–42. pmid:21179563
  81. Matsumoto H, Hiki S, Sone T, Nimura T. Multidimensional representation of personal quality of vowels and its acoustical correlates. IEEE Trans Audio Electroacoust. 1973;21(5):428–36.
  82. Murry T, Singh S. Multidimensional analysis of male and female voices. J Acoust Soc Am. 1980;68(5):1294–300. pmid:7440851
  83. Kreiman J, Gerratt BR, Precoda K, Berke GS. Individual differences in voice quality perception. J Speech Hear Res. 1992;35(3):512–20. pmid:1608242
  84. Mair P, Groenen PJF, de Leeuw J. More on multidimensional scaling and unfolding in R: smacof version 2. J Stat Soft. 2022;102(10).
  85. Mair P, Borg I, Rusch T. Goodness-of-fit assessment in multidimensional scaling and unfolding. Multivariate Behav Res. 2016;51(6):772–89. pmid:27802073
  86. Kuhn M, Vaughn D. parsnip: a common API to modeling and analysis functions. 1.2.1 ed. 2024.
  87. Therneau T, Atkinson B, Ripley B. rpart: recursive partitioning and regression trees. 4.1.23 ed. 2023.
  88. Wright MN, Ziegler A. ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Soft. 2017;77(1).
  89. Boersma P. Praat, a system for doing phonetics by computer. Glot International. 2001;5(9/10):341–5.
  90. Fromont R, Hay J. LaBB-CAT: an annotation store. In: Cook P, Nowson S, editors. Proceedings of the Australasian Language Technology Association Workshop, 2012:113–7.
  91. Anderson MJ. A new method for non‐parametric multivariate analysis of variance. Austral Ecology. 2001;26(1):32–46.
  92. Anderson MJ. Permutational Multivariate Analysis of Variance (PERMANOVA). In: Balakrishnan N, Colton T, Everitt B, Piegorsch W, Ruggeri F, Teugels JL, editors. Wiley StatsRef: Statistics Reference Online. Wiley. 2017:1–15.
  93. Oksanen J, Simpson GL, Blanchet FG, Kindt R, Legendre P, Minchin PR. vegan: community ecology package. 2.6-4 ed. 2022.
  94. Martinez Arbizu P. pairwiseAdonis: pairwise multilevel comparison using adonis. 2020.
  95. Wood SN. Generalized additive models: an introduction with R. 2nd ed. New York: Chapman and Hall/CRC. 2017.
  96. Warren P. The interpretation of prosodic variability in the context of accompanying sociophonetic cues. Lab Phonol. 2017;8(1):11.
  97. Harnsberger JD, Shrivastav R, Brown WS Jr, Rothman H, Hollien H. Speaking rate and fundamental frequency as speech cues to perceived age. J Voice. 2008;22(1):58–69. pmid:16968663
  98. Skoog Waller S, Eriksson M, Sörqvist P. Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age. Front Psychol. 2015;6:978. pmid:26236259
  99. O’Rourke E, Baltazani M. Sociophonetics and intonation: a proposal for socioprosodics. In: Strelluf C, editor. The Routledge handbook of sociophonetics. London: Routledge. 2023:23–54.
  100. Tversky A. Features of similarity. Psychol Rev. 1977;84(4):327–52.
  101. Gerlach L, McDougall K, Kelly F, Alexander A, Nolan F. Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features. Speech Commun. 2020;124:85–95.
  102. Ding CS. Multidimensional scaling. In: Little TD, editor. The Oxford handbook of quantitative methods. Volume 2: statistical analysis. Oxford: Oxford University Press; 2013:235–56.
  103. Rodgers JL. Matrix and stimulus sample sizes in the weighted MDS model: empirical metric recovery functions. Appl Psychol Meas. 1991;15(1):71–7.
  104. Hout MC, Cunningham CA, Robbins A, MacDonald J. Simulating the fidelity of data for large stimulus set sizes and variable dimension estimation in multidimensional scaling. Sage Open. 2018;8(2).
  105. Clopper CG, Bradlow AR. Perception of dialect variation in noise: intelligibility and classification. Lang Speech. 2008;51(Pt 3):175–98. pmid:19626923
  106. Dexter E, Rollwagen‐Bollens G, Bollens SM. The trouble with stress: a flexible method for the evaluation of nonmetric multidimensional scaling. Limnol Oceanogr Methods. 2018;16(7):434–43.
  107. Hoenig JM, Heisey DM. The abuse of power. Am Stat. 2001;55(1):19–24.
  108. Perugini M, Gallucci M, Costantini G. A practical primer to power analysis for simple experimental designs. Int Rev Soc Psychol. 2018;31(1).
  109. Sóskuthy M. Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. J Phon. 2021;84:101017.
  110. Baayen RH, Fasiolo M, Wood S, Chuang Y-Y. A note on the modeling of the effects of experimental time in psycholinguistic experiments. Ment Lex. 2022;17(2):178–212.
  111. Johnson PCD, Barry SJE, Ferguson HM, Müller P. Power analysis for generalized linear mixed models in ecology and evolution. Methods Ecol Evol. 2015;6(2):133–42. pmid:25893088
  112. Watson S. glmmrBase: generalised linear mixed models in R. 2025. https://doi.org/10.32614/CRAN.package.glmmrBase