Human roars communicate upper-body strength more effectively than do screams or aggressive and distressed speech

Despite widespread evidence that nonverbal components of human speech (e.g., voice pitch) communicate information about physical attributes of vocalizers and that listeners can judge traits such as strength and body size from speech, few studies have examined the communicative functions of human nonverbal vocalizations (such as roars, screams, grunts and laughs). Critically, no previous study has examined the acoustic correlates of strength in nonverbal vocalizations, including roars, or identified reliable vocal cues to strength in human speech. In addition to being less acoustically constrained than articulated speech, agonistic nonverbal vocalizations function primarily to express motivation and emotion, such as threat, and may therefore communicate strength and body size more effectively than speech. Here, we investigated acoustic cues to strength and size in roars compared to screams and speech sentences produced in both aggressive and distress contexts. Using playback experiments, we then tested whether listeners can reliably infer a vocalizer’s actual strength and height from roars, screams, and valenced speech equivalents, and which acoustic features predicted listeners’ judgments. While there were no consistent acoustic cues to strength in any vocal stimuli, listeners accurately judged inter-individual differences in strength, and did so most effectively from aggressive voice stimuli (roars and aggressive speech). In addition, listeners more accurately judged strength from roars than from aggressive speech. In contrast, listeners’ judgments of height were most accurate for speech stimuli. These results support the prediction that vocalizers maximize impressions of physical strength in aggressive compared to distress contexts, and that inter-individual variation in strength may only be honestly communicated in vocalizations that function to communicate threat, particularly roars.
Thus, in continuity with nonhuman mammals, the acoustic structure of human aggressive roars may have been selected to communicate, and to some extent exaggerate, functional cues to physical formidability.


Introduction
In competitive contests, evolutionary selection processes favour vocal communication of resource holding potential to settle disputes without engaging in potentially costly combat [1]. For example, many terrestrial mammalian species, including giant pandas [2], sea lions [3], fallow and red deer [4,5], and domestic dogs [6] use acoustic cues to body size or dominance rank in aggressive vocalizations to mediate agonistic interactions, particularly during male-male competition.
Among humans, the nonverbal components of speech also allow listeners to assess body size from the voice, including height and weight [7][8][9][10]. Yet, few studies provide evidence that human listeners can assess physical strength from the human voice. Sell et al. [11] found that actual strength explained 18% and 7% of the variance in listeners' voice-based strength attributions of male and female vocalizers, respectively, when listeners were presented with short speech utterances. A more recent study showed that listeners were also able to judge the strength and height of unseen vocalizers relative to their own strength and height, from both aggressive speech utterances and aggressive roars [12]; however, that study did not examine the acoustic correlates of strength or body size nor whether these predicted listeners' judgments. Indeed, despite the apparent capacity for listeners to gauge strength from the voice, the acoustic correlates of strength remain largely unknown following null or inconsistent results of past work [11,13-17].
Due to an evolutionary continuity in both structure and function between the vocalizations of other mammals and human nonverbal vocalizations, such as laughter [18][19][20][21] and infant distress screams [22][23][24], human nonverbal vocalizations may communicate evolutionarily and socially relevant information more effectively than speech, which is also relatively more constrained by linguistic content. Indeed, recent work has shown that human laughter (e.g., [25,21,26] but see [27]), tennis grunts [28], and simulated pain cries [29] all convey ecologically relevant cues to vocalizer traits that listeners utilize in their biosocial judgments. At the same time, while past studies show that listeners can estimate absolute strength from modal speech [11] and relative strength from both speech and roars [12], roars appear to exaggerate the expression of threat, as listeners judge male vocalizers as relatively stronger and larger than themselves when those vocalizers are producing roars compared to aggressive speech [12]. The information carried by nonverbal vocalizations may also be context-specific. For example, aggressive roars may communicate, or exaggerate, physical strength more effectively than fear screams.
To test these hypotheses, we compared the ability of listeners to estimate physical strength from human speech and from nonverbal vocalizations produced in two hypothetical contexts: aggression and distress. In these two distinct agonistic contexts, nonhuman mammals typically produce acoustically and perceptually distinct vocalizations that follow Morton's motivational-structural rules [30]; hence, capitalising on perceptual associations between low frequency sounds and large size or dominance [31], aggressive vocalizations (roars, barks or growls) are typically structurally noisy and low-pitched [30][31][32]. In contrast, distress vocalizations are higher-pitched and usually (but not always) tonal, exploiting perceptual associations between high frequencies and small size or submission [30,31,33]. While aggressive vocalizations are thought to function to display threat and physical formidability, distress vocalizations typically function to solicit aid [34][35][36].
Like other mammals, humans produce roar-like vocalizations in aggressive contexts such as battle [37][38][39], and scream-like vocalizations in distress contexts [40]. Furthermore, women, who are on average physically weaker than men [41][42][43], are more likely to scream in response to threat scenarios than are men, whose responses are typically biased towards aggression [40].
Following the hypothesis that human roars and screams are homologous to mammalian vocalizations produced in aggressive and distress contexts, respectively, and are likewise affected by anatomical and physiological constraints, we may expect that the acoustic structure of these nonverbal vocalizations encodes honest information about the physical characteristics of the vocalizer [44][45][46][47][48][49][50]. However, we may also expect vocalizations produced in an aggressive context (hereafter roars) to function to maximize the expression of threat relative to those produced in a distress or submissive context (screams), which may in turn minimize perceived threat.

The present study
In a recent paper we showed that listeners can judge the strength and height of others (relative to their own) from aggressive speech and roars, and that roars, while communicating honest information about strength and body size, also exaggerate these physical traits compared to aggressive speech among men [12]. While those results support the prediction that roars function to maximize the expression of formidability and threat, the study lacked acoustic data to examine the vocal correlates of strength and body size in nonverbal vocalisations and speech, or to link these acoustic parameters to listeners' judgments of strength and size, and contained no data on screams or distressed speech.
Here, we thus build on previous research by comparing the acoustic structure of roars, screams, and their speech equivalents, and examining the functional relevance of these vocal stimuli in communicating absolute strength and height to novel samples of listeners. To do this, we measured the upper-body strength and height of men and women and audio recorded them producing aggressive roars and distress screams as well as aggressive and distressed speech sentences. We then examined differences in the acoustic structure of these four types of voice recordings, and the effects of vocalizer height and strength on a range of acoustic parameters. Finally, to contrast the functional relevance of roars, screams, and their speech equivalents in communicating strength and size, we asked separate samples of listeners to estimate the strength or height of vocalizers from each type of vocal stimulus. Our key hypothesis was that the acoustic structure of vocal stimuli will reflect their function in accordance with motivational-structural rules, and thus, that the encoding and communication of strength and size will be maximized in aggressive and nonverbal speech variants.
F0, minimum F0, maximum F0, start-end F0 (a measure of the F0 contour), and F0CV (coefficient of variation over the duration of the signal, representing pitch variability). During visual inspection of each spectrogram, we also measured the proportion of the signal for which amplitude modulation was present, and created a measure representing this proportion as a percentage (%AM). We then applied two distinct smoothing algorithms to suppress either minor or major F0 fluctuations, and counted inflection points after each smoothing procedure, divided by the total duration of voiced segments, to derive two distinct indices of F0 modulation (inflex25 for minor inflections and inflex2 for major inflections).
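The F0 descriptors listed above can be illustrated with a minimal sketch. It assumes an F0 contour already extracted as a list of values in Hz over voiced frames. The function names, the moving-average smoother, the start-minus-end sign convention, and the use of the population standard deviation are all assumptions made for illustration; they are not the smoothing algorithms or settings used in the study.

```python
from statistics import mean, pstdev

def moving_average(values, window):
    """Smooth a contour with a centred moving average (shorter window at the edges)."""
    half = window // 2
    return [mean(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def count_inflections(contour):
    """Count sign changes in the slope of a contour, ignoring flat steps."""
    slopes = [b - a for a, b in zip(contour, contour[1:]) if b != a]
    return sum(1 for s1, s2 in zip(slopes, slopes[1:]) if s1 * s2 < 0)

def f0_descriptors(f0_hz, voiced_duration_s, window):
    """Summarize an F0 contour (Hz); `window` is a hypothetical smoothing width in frames."""
    smoothed = moving_average(f0_hz, window)
    return {
        "mean_f0": mean(f0_hz),
        "min_f0": min(f0_hz),
        "max_f0": max(f0_hz),
        # Assumed convention: first voiced frame minus last voiced frame.
        "start_end_f0": f0_hz[0] - f0_hz[-1],
        # Coefficient of variation: standard deviation relative to the mean.
        "f0_cv": pstdev(f0_hz) / mean(f0_hz),
        # Inflections per second of voiced signal, counted after smoothing.
        "inflex_per_s": count_inflections(smoothed) / voiced_duration_s,
    }
```

Calling `f0_descriptors` twice on the same contour, once with a narrow and once with a wide `window`, would give two modulation indices in the spirit of the minor and major inflection measures described above.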
A second procedure measured mean amplitude and intensity contour (time of max intensity expressed as a percentage of the signal's duration, and amplitude variability, intCV, representing the coefficient of variation of the intensity contour). A third procedure characterized noise and perturbation parameters, including harmonics-to-noise ratio (HNR, a measure of the ratio of harmonic spectral energy to chaotic spectral energy), jitter (small fluctuations in periodicity measured as the average of 'local', 'rap' and 'ppq5' measures in Praat) and shimmer (small changes in amplitude between consecutive periods, measured as the average of 'local', 'apq5' and 'apq11' parameters in Praat). While some researchers have argued that jitter and shimmer are inconsequential in the perception of non-pathological modal speech [53], these perturbation parameters appear to play a significant role in characterizing emotional nonverbal vocalizations. Indeed, acoustic analysis procedures similar to these have been applied successfully in previous studies of human babies' cries [54,55].
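As a rough illustration of the perturbation and variability measures described above, the sketch below computes only the 'local' variants of jitter and shimmer, together with a coefficient of variation of the kind used for intCV. It assumes that glottal period durations and per-period peak amplitudes have already been extracted. The 'rap', 'ppq5', 'apq5' and 'apq11' measures that were averaged in the study are not reproduced here, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)

def jitter_local(periods_s):
    """Local jitter: cycle-to-cycle variation in period duration."""
    return local_perturbation(periods_s)

def shimmer_local(peak_amplitudes):
    """Local shimmer: cycle-to-cycle variation in peak amplitude."""
    return local_perturbation(peak_amplitudes)

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD is an assumption)."""
    return pstdev(values) / mean(values)
```

A perfectly periodic signal gives a jitter of zero, and larger values indicate less regular vocal fold vibration.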
A fourth and final procedure characterized the spectral centre of gravity for each vocal stimulus (spectral COG), calculated as the amplitude-weighted mean of signal frequencies. Given the acoustic structure of nonverbal vocalizations, particularly their high pitch, formant frequencies were poorly defined and difficult to measure via cepstrum or linear predictive coding analyses. However, the spectral centre of gravity carries some information about vocal tract resonances [56]. In addition, we measured the dominant frequency within sex-specific expected frequency ranges for the fourth formant, F4: 3108-4250 Hz for males, and 3524-4887 Hz for females [57]. These data have been used to establish formant thresholds in a previous study of vocal cues to upper-body strength [14]. This dominant formant frequency measure (hereafter 'DFF4') may be used as a proxy for vocal tract length, as articulatory manipulations of vocal tract shape minimally affect F4 [57], and as the measurement of dominant frequency within an expected F4 range is less likely to capture strong harmonics than for expected ranges of lower formants, as their amplitude declines exponentially with increasing frequency [48]. Importantly, F4 is among the strongest formant-based predictors of height in both men and women, explaining a similar amount of variance in height within-sexes as composite formant measures (e.g., formant spacing) and significantly more variance than F1, F2 or F3 [58].
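The two spectral measures described above can be sketched as follows, assuming a short mono signal given as a list of samples. The naive discrete Fourier transform is used only to keep the example self-contained and is too slow for full recordings. The weighting follows the amplitude-weighted description in the text, although some analysis tools weight by power instead. The function names are hypothetical, and the band limits in the usage note are the male F4 range quoted above.

```python
import cmath
import math

def magnitude_spectrum(samples, sample_rate):
    """Return (frequencies, magnitudes) for bins 0..N/2 using a naive DFT."""
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        freqs.append(k * sample_rate / n)
        mags.append(abs(coeff))
    return freqs, mags

def spectral_cog(freqs, mags):
    """Amplitude-weighted mean frequency of the spectrum."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

def dominant_frequency(freqs, mags, low_hz, high_hz):
    """Frequency of the strongest spectral component within [low_hz, high_hz]."""
    in_band = [(m, f) for f, m in zip(freqs, mags) if low_hz <= f <= high_hz]
    return max(in_band)[1]
```

For a male vocalizer, `dominant_frequency(freqs, mags, 3108, 4250)` would return a value analogous to DFF4, under the assumptions stated above.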
Fig 1 presents spectrograms illustrating exemplary roars and screams. For additional details regarding acoustic analysis, please refer to S1 Text.
Statistical analysis. To examine acoustic differences among vocal stimuli, we conducted a conventional leave-one-out discriminant function analysis (DFA) with forced entry, as this is less vulnerable to collinear variables, random effects, and type I errors than is stepwise entry [59]. We entered all acoustic variables except duration, using within-sex z-scores in place of raw measures for sexually dimorphic acoustic characteristics (mean F0, max F0, min F0, start-end F0, spectral COG, DFF4). We conducted a further DFA, split by sex, to investigate whether there were differences in the discriminability of vocal stimuli between sexes.
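The within-sex standardization mentioned above can be sketched as follows. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def within_group_z(values, groups):
    """Standardize each value against the mean and SD of its own group (e.g., sex)."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    # Sample SD (n - 1 denominator) is an assumption made for illustration.
    stats = {g: (mean(vs), stdev(vs)) for g, vs in by_group.items()}
    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, groups)]
```

Standardizing within sex removes the average male-female difference in a measure such as mean F0, so that the discriminant analysis separates stimulus categories rather than sexes.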
To investigate whether strength and height were encoded in the acoustic structure of vocal stimuli, we computed stepwise linear regressions with acoustic variables as predictors, and either actual strength or actual height as outcome variables, split by sex, stimulus type (speech/vocalization), and stimulus context (aggression/distress). Stepwise regressions were designed to test whether linear combinations of a wide set of acoustic characteristics could reliably predict physical formidability, and whether the structure of these models was consistent across stimulus types. To assess the individual contribution of each acoustic characteristic we computed zero-order correlations between each voice parameter and strength or height (reported in Supporting Information, S2 Text). The dataset for these analyses is also provided as Supporting Information (see S1 File).

Results
Do roars, screams, and valenced speech sentences differ in acoustic structure? Discriminant function analyses indicated that all four voice conditions (roars, screams, aggressive speech, distress speech) were acoustically distinct (Fig 2). The DFA's classification success rate significantly exceeded chance (correct classification = 79.9%, chance = 25%, p < .0005). Supplementary tables report the factor loadings of acoustic parameters on the first three discriminant functions, collapsing across sexes (Table A in S1 Tables) and for male (Table B in S1 Tables) and female vocalizers (Table C in S1 Tables) separately (see S1 Tables, for all supplementary tables).
The first discriminant function (eigenvalue = 6.43, variance explained = 74.1%) differentiated each of the four voice conditions relatively equally while also separating nonverbal vocalizations from speech sentences (see Fig 2). Distressed speech stimuli were characterized as the quietest of the four voice conditions and had the greatest amplitude variability, the least amplitude modulation, and the most major F0 inflections, followed by aggressive speech and then distress screams. In contrast, roars were characterized by the highest amplitude, the least amplitude variability, the most amplitude modulation, and the fewest major F0 inflections.
Fig 2 (caption fragment). The second function (DF2, Table A in S1 Tables) relied mostly on F0 and harmonics-to-noise ratio. The pattern of separation was similar in male (b) and female (c) vocalizers. https://doi.org/10.1371/journal.pone.0213034.g002
The second discriminant function was less important in discriminating stimulus groups (eigenvalue = 1.93, variance explained = 22.2%), showing primarily that screams and, to a lesser degree, distressed speech sentences were more harmonic (high HNR) than were roars and aggressive speech (Figs 1 and 2). F0 variables (mean, max, min) loaded primarily on this function, but also on the first function. Mean values of measured acoustic variables (reported in Tables 1 and 2) showed that distress screams were characterized by the highest F0, followed by aggressive roars, with both speech conditions characterized by the lowest F0.
Finally, aggressive roars displayed higher jitter than did all other stimuli, whereas screams (but not distressed speech) were characterized by higher shimmer and a higher dominant formant frequency (DFF4) than aggressive stimuli. We excluded duration from our discriminant analyses because multiple-word speech sentences were inherently longer than single vocalizations, but we report duration means for each voice condition (see Tables 1 and 2). The acoustic characteristics separating vocal stimuli were similar across sexes (Fig 2, see also Tables B and C in S1 Tables).
Do roars, screams and valenced speech stimuli contain acoustic cues to actual strength and height? Strength did not correlate with height among either male (r = -.04, p = .833) or female (r = .083, p = .655) vocalizers. Therefore, at least in our sample, these two physical measurements appear to represent distinct aspects of physical formidability.
We observed very few significant, systematic relationships between acoustic variables and vocalizer height or strength (see Tables D and E in S1 Tables). The only notable exception was that the dominant formant frequency (DFF4) was negatively associated with strength for female vocalizers in all voice stimulus types except distress screams (Table D in S1 Tables). Zero-order correlations corroborated the absence of systematic acoustic predictors of strength and height (see S2 Text).

Discussion
The high classification accuracy of the discriminant function analysis shows that vocal stimuli were characterized by distinct acoustic structures that varied according to both stimulus type (speech/nonverbal vocalization) and context (aggression/distress). Nonverbal emotional expressions of anger and fear have been confused in earlier DFAs [60], offering a partial explanation for the slight overlap among speech categories in the present DFA.
Nonverbal vocalizations displayed more variability in acoustic characteristics, were louder, higher-pitched, and exhibited more amplitude modulation than did their speech equivalents, consistent with evidence that laughter exhibits higher F0 mean and range [61] and higher F1 [62] compared to speech. This could be due to a lack of linguistic constraints on nonverbal vocalizations [63] enabling a wider acoustic space compared to speech. Indeed, speech necessitates a relatively low pitch/spectral density for formant perception [64] and places constraints on intonation for semantic encoding [65] and phoneme recognition [66].
The co-occurrence of high F0, high amplitude, and nonlinear phenomena in nonverbal vocalizations suggests that they were produced with high vocal effort [67]. Fundamental frequency and amplitude are both known to increase with subglottal pressure [68,69], and nonlinear phenomena (indicating a transition to unstable regimes of vocal fold vibration) arise more commonly when subglottal pressure is relatively high [69][70][71][72][73]. By operating at or near the upper limits of amplitude production, nonverbal vocalizations may be more readily subject to anatomical constraints that limit vocal exaggeration and thereby increase the honesty of acoustic indexical cues [44,45,47]; they may therefore communicate physical traits of the vocalizer more effectively than speech. This may be particularly true of aggressive roars, which exhibited the most nonlinearities of all stimuli.
In accordance with motivational-structural rules [30,31,33], distress stimuli were more tonal (higher HNR and less amplitude modulation) than aggressive stimuli. In nonhuman mammals, distress vocalizations are indeed typically tonal, but may be noisy if fear and aggression are conflicting or if their function is to solicit support from distant allies [33,74]. Our analyses showed that roars and screams occupied opposite extremes in terms of harmonics-to-noise ratio, again suggesting that vocalizations exploit wider ranges of acoustic space compared to speech utterances, which fell in between these extremes. Screams were characterized by a higher F0 (see Fig 1), lower jitter, and a higher dominant formant frequency (DFF4) than roars, also as predicted by motivational-structural rules. Yet these differences were not observed between aggressive and distressed speech. Our results therefore suggest that the acoustic constraints necessary to intelligibly communicate speech may limit the expression of motivational-structural rules in speech, including emotional or valenced speech.
Reliable cues to height were not consistently encoded in the acoustic structure of our vocal stimuli. While previous work has shown that formant frequencies in modal speech predict vocal tract length and thus height within sexes [58], the prevalence of high pitch/low spectral density and/or amplitude modulation in nonverbal vocalizations resulted in poor representation of vocal tract resonances. This was also observed to some extent in valenced speech sentences that were also produced with high vocal effort, potentially explaining why our formant-based voice parameters (COG, DFF4) did not reliably predict height even in speech. This result may also reflect variation in vocalizers' propensity to exaggerate size in an aggressive context or minimize size in a distress context.
Although formants are a well-established indicator of human height [58], previous research has produced inconsistent findings regarding the acoustic encoding of physical strength in speech [11,13,14]. Formant dispersion has been reported to predict male strength [13,14], but only in cases where correlations between height and strength were strong [13,14], suggesting that any relationship between strength and formants is mediated by the relationship between height and formants. However, the unexpected but consistent association between DFF4 and strength in our sample of females suggests that spectral characteristics reflecting complex contributions of both source and filter may still play a role in encoding strength.
While the present study utilized an amalgamated strength measure based on flexed bicep circumference, handgrip strength, and chest strength (following [11]), some other studies examining vocal correlates of strength have utilized amalgamated scores based on fewer measures (e.g., flexed bicep circumference and handgrip strength only [12,15]), or have examined strength measures individually (e.g., biceps only, handgrip strength only [14,16]). Nevertheless, different measures of upper-body strength covary within and between individuals and, given that these previous studies likewise did not report consistent or robust acoustic correlates of strength, differences in how strength was computed across these few studies are not likely to explain such null results.
To summarize, despite indications that our aggressive roars and distress screams utilised a wider acoustic space than did speech sentences, and despite measuring a much wider set of acoustic variables than previous studies examining cues to strength in speech [11,13,14], our investigations still failed to reveal consistent acoustic cues to strength. Although one study reported an association between F0 and strength in speech [13], our study corroborates the more commonly observed lack of a significant relationship between F0 and strength in the human voice [11,14]. Thus, while our results support the general hypothesis that aggressive roars and distress screams are acoustically distinct and evolved to respectively maximize or minimize the impression of strength and threat, their acoustic structure did not reliably predict vocalizer strength or height within call types.

Experiments 2 and 3: Can listeners estimate strength and height from roars, screams and valenced speech?
Following acoustic analysis, we used playback experiments to assess the functional relevance of aggressive roars, aggressive speech, distress screams, and distressed speech in communicating strength and body size. Separate samples of listeners judged either the physical strength or height of the vocalizers whose voices we analyzed in Experiment 1.
We predicted that ratings of strength and height would be highest for aggressive stimuli, as such vocalizations index quantitative information regarding the severity of potential threat (i.e. the formidability of the aggressor), potentially adaptively influencing decision-making in competitive interactions. In contrast, for distress stimuli, listeners may have been selected to pay attention to the level of distress rather than to the signaller's formidability. Indeed, among nonhuman mammals, vocalizations produced in aggressive contexts function specifically to signal formidability, and in these contexts many species functionally exaggerate acoustic cues to dominance and size [47,75-78].
Male-male competition is thought to have played a key role in shaping men's vocal signals [79,80] and in producing sexually dimorphic acoustic features that function in part to more effectively communicate threat potential in men's than women's voices. Hence, we further predicted that listeners would more accurately estimate strength and height from male than female speech stimuli. However, as size and strength are relevant in both mate competition and mate choice contexts, we did not predict sex differences in listeners' judgments of strength.

Participants
Participants from the USA were recruited from Amazon Mechanical Turk (see [81] for a review of the validity of this research method) to provide voice-based assessments of strength and height. All participants provided informed consent and completed the experiments online using a custom computer interface. They were compensated with $3.50 USD. Ninety adults took part in Experiment 2 (48 females and 42 males, age = 33.82 ± 9.60) and 60 different adults took part in Experiment 3 (30 females and 30 males, age = 33.80 ± 8.98). Data from four participants in Experiment 2 and six participants in Experiment 3 who did not complete the experiment but rated more than half of the stimuli were included in analyses, as the exclusion of their responses did not change the overall pattern of results.
Voice stimuli. Participants rated all 244 voice stimuli acquired in Experiment 1 (61 vocalizers × 4 stimulus types) on one dimension (either strength or height). To reliably assess the effect of amplitude on listeners' attributions, it was necessary for listeners to maintain the same listening volume for the duration of the playback experiment. The difference in mean amplitude between the quietest (40.40 dB) and loudest (81.66 dB) stimulus was large; hence, we partially normalized amplitude to minimize auditory discomfort while ensuring that listeners could clearly hear all stimuli. Speech stimuli (mean amplitude = 58.31 dB) were consistently quieter than vocalizations across sexes (70.27 dB); therefore, we increased the amplitude of speech stimuli and decreased the amplitude of vocalizations by 4 dB each.
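The effect of this partial normalization can be worked through with simple arithmetic. The sketch below assumes that the reported values are amplitude levels in decibels, so that a gain in dB maps to a linear amplitude ratio of 10^(dB/20). The variable and function names are illustrative only.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to a linear amplitude scaling factor."""
    return 10 ** (gain_db / 20)

# Mean levels reported for the two stimulus types, and the shift applied to each.
speech_db, vocalization_db, shift_db = 58.31, 70.27, 4.0

# Gap between vocalizations and speech before and after the opposite 4 dB shifts.
gap_before = vocalization_db - speech_db
gap_after = (vocalization_db - shift_db) - (speech_db + shift_db)

# Linear factors by which speech was scaled up and vocalizations scaled down.
speech_scale = db_to_amplitude_ratio(shift_db)
vocalization_scale = db_to_amplitude_ratio(-shift_db)
```

Under these assumptions, the mean difference between vocalizations and speech shrinks from about 11.96 dB to about 3.96 dB, so the original loudness ordering is preserved while the range that listeners must tolerate at a fixed volume is reduced.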
Procedure. Playback studies were hosted in Syntoolkit, a dedicated online testing platform used to generate and present psychology studies (see e.g., [82]). Participants were directed to the URL testing site and provided informed consent before beginning the study. They were instructed to use headphones and to complete the experiment in a quiet place. Listeners heard a demo sound file before commencing the experiment which contained the loudest stimulus and the fifth quietest stimulus, and were instructed to raise their volume until they could clearly hear the quiet vocalization while the loudest vocalization did not cause discomfort. Following this, listeners were asked not to adjust the volume during the experiment unless it became too uncomfortable. Listeners were also asked at the end of the experiment if they had adjusted their volume at any point. Due to the agonistic nature of the stimuli, they were made aware that if they felt uncomfortable or distressed listening to the sounds, they could stop the experiment.
Voice stimuli were blocked by sex (male/female), stimulus type (speech/vocalization), and stimulus context (aggression/distress). The order of blocks and stimuli within blocks was randomized. Before each block, participants were reminded to listen to each stimulus in full before rating it, and informed that they could take a break at any time. Listeners rated the physical strength (Experiment 2) or height (Experiment 3) of each voice stimulus ("Rate how strong/tall this vocalizer is") on a 101-point scale from 0 (extremely weak/short) to 100 (extremely strong/tall).
Listeners were debriefed upon completion that the roars and screams were acted, and that the vocalizers were not really experiencing aggression or distress. We inspected listeners' ratings and compared their reaction times against stimulus duration to ensure that they completed the experiments properly. Data from two participants who did not do so were removed (and are not reported in the participant statistics given above).
Statistical analysis. In a series of linear mixed models, we first tested whether male vocalizers were stronger/taller than female vocalizers. Next, we tested the effects of vocalizer sex, listener sex, stimulus context, and stimulus type on attributed strength/height ratings. The third set of models added actual strength/height into the previous models to assess accuracy in listeners' strength and height estimates. As the strength and height distributions for males and females displayed little overlap, we split these models by vocalizer sex rather than including sex as a factor. In all models, we included listener identity as a subject variable and vocalizer identity as a random factor, thus allowing the intercepts and slopes of the relationships between predictors and outcomes to vary between both vocalizers and listeners and testing null hypotheses based on the average of these intercepts and slopes.
Effect sizes were estimated using R² coefficients derived from simple linear regressions among relevant variables, and using γ coefficients derived from the linear mixed models. R² values denote the percentage of variance in mean strength ratings explained by variance in actual strength, and can be interpreted as representing the overall reliability of listeners' strength estimations, adjusted to the linear sensitivity of listeners to variation in actual strength within each condition. Differences in slope gradients between conditions, represented by the gamma (γ) statistic denoting the standardised increase in rated strength/height per one unit increase in actual strength/height, indicate linear differences in listeners' sensitivity to variation in vocalizer strength or height.
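The R² effect size described above can be illustrated with a simple regression of mean ratings on actual values. This is a sketch with hypothetical function names. The ordinary least-squares slope shown alongside it is not the γ coefficient, which in the study was estimated from linear mixed models that are not reproduced here.

```python
from statistics import mean

def _sums_of_squares(x, y):
    """Return (Sxy, Sxx, Syy) for paired observations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

def r_squared(actual, mean_rating):
    """Proportion of variance in mean ratings explained by actual values."""
    sxy, sxx, syy = _sums_of_squares(actual, mean_rating)
    return sxy ** 2 / (sxx * syy)

def ols_slope(actual, mean_rating):
    """Change in mean rating per one-unit increase in the actual value."""
    sxy, sxx, _ = _sums_of_squares(actual, mean_rating)
    return sxy / sxx
```

Here `actual` would hold one strength (or height) value per vocalizer and `mean_rating` the corresponding rating averaged across listeners within a given condition.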
Subsequently, we computed stepwise linear multiple regressions to assess relationships between acoustic characteristics and strength/height ratings. The previously measured acoustic variables were used as predictors, and either mean strength or mean height ratings as outcome variables. Participants who indicated having modified their volume during the experiment (Experiment 2: n = 4, Experiment 3: n = 15) were excluded from the calculation of mean ratings, enabling valid analysis of the effect of amplitude on ratings. Regression models were split by sex, stimulus type (speech/vocalization), and stimulus context (aggression/distress).
Height attributions. Vocalizers were rated as taller when producing aggressive than distressed sounds and sentences. This was particularly true for male vocalizers (M difference = 5.44 vs. 2.91 for female vocalizers; Fig 4, Table 4, p < .001; see Table 5 Table 4, p = .046). Yet, males were only rated as stronger than females by male listeners judging aggressive roars (Table 3, p = .032). For all other conditions, females were rated as comparably strong as males (Fig 3), indicating that listeners' strength attributions were generally not consistent with sexual dimorphism in actual strength.
Height ratings were consistent with sexual dimorphism in height. Listeners rated males as taller than females across all stimulus types and contexts (Fig 4, Table 4, p < .0005). This sex difference in height ratings was larger for aggressive (M difference = 7.04) than distress stimuli (M difference = 4.51, Table 6, p < .0005), and for nonverbal vocalizations (M difference = 6.50) than for speech sentences (M difference = 5.06, Table 4, p = .009).
Can listeners accurately estimate strength and height from the voice? Strength estimation. For male vocalizers, actual strength predicted attributed strength only when listeners rated aggressive stimuli (Table 5, p < .001). For female vocalizers, listeners could estimate strength from aggressive roars, aggressive speech, and distressed speech, but not distress screams (Table 5, p < .001; see also γ statistics in Table 7 denoting the standardised increase in rated strength per one unit increase in actual strength). For both male and female vocalizers, the reliability of strength estimation was higher for aggressive roars than for aggressive speech or female distressed speech (Fig 3; refer to R² denoting variance in mean strength ratings explained by actual strength). Thus, listeners consistently estimated strength from aggressive but not distress stimuli, and estimated strength most reliably from aggressive roars. There was little evidence for listener sex or vocalizer sex differences in the capacity to estimate strength. The only exception was for distressed speech, whereby listeners were more sensitive to variation in actual strength when rating female than male vocalizers.
Height estimation. For male vocalizers, actual height predicted rated height when listeners rated distress stimuli but not aggressive stimuli (Fig 4, Table 6, p = .008; see also Table 7 for γ effect sizes). For female vocalizers, actual height predicted attributed height when listeners rated speech stimuli but not nonverbal vocalizations (Fig 4, Table 6, p = .007; see Table 7 for γ). Effect sizes for the relationship between actual and attributed height were much smaller than those for the relationship between actual and attributed strength (Figs 3 and 4).
As with strength, there were few sex differences in height estimation, except that listeners were more sensitive to variation in actual height in male than female vocalizers when rating distress screams.
Are ratings of physical traits related to acoustic characteristics? Mean amplitude consistently predicted ratings of physical strength across stimulus categories and sexes (see Tables F and G in S1 Tables). In addition, vocalizers who were rated as stronger generally produced rougher voice stimuli. Decreases in F0 variability, and increases in amplitude modulation and duration with rated strength were also observed, though inconsistently (Table F in S1 Tables). Zero-order correlations corroborated the influence of these acoustic characteristics on rated strength (see S2 Text).
The influence of acoustic characteristics on height ratings was in general much less consistent than for strength ratings (Table G in S1 Tables). In males, louder and lower-pitched stimuli were consistently judged as produced by taller vocalizers. Male roars and screams characterized by higher jitter were also rated as produced by taller vocalizers. No acoustic characteristic consistently predicted height ratings of female vocalizers, but louder aggressive roars and distressed speech were rated as produced by taller vocalizers. Zero-order correlations corroborated the lack of consistent acoustic predictors of rated height (S2 Text).

Discussion
The results of playback experiments indicated that roars maximized impressions of strength relative to other vocal stimuli. Listeners attributed higher strength and height ratings to aggressive stimuli (aggressive speech and roars) than to distress stimuli (distress speech and screams), consistent with functional exaggeration of acoustic cues to body size by nonhuman mammals in aggressive contexts [47,75–78]. This effect may be due to acoustic differences between stimuli: aggressive roars were characterized by higher roughness and amplitude than distress screams, as well as a lower F0 and DFF4. This suggests that aggressive roars capitalised on perceptual associations between low frequency sounds and large size, exaggerating perceived formidability relative to distress screams, which instead exploited perceptual associations between high frequencies and small size or submission [9,11,31,33].
In the absence of differences in F0 and DFF4 between aggressive and distressed speech, the smaller difference in strength ratings between these speech stimuli (compared to roars and screams) may be attributed to differences in roughness and amplitude, consistent with the observation that both roughness and amplitude consistently predicted listeners' ratings within voice conditions. Differences in the linguistic content of aggressive and distressed speech may have also contributed to differences in listeners' ratings between the two types of speech stimuli. The verbal content of each speech stimulus was selected specifically to convey either aggression ('That's enough, I'm coming for you!') or distress ('Please, show mercy, don't hurt me!'), as previous studies have failed to find acoustic correlates of actual or perceived strength in emotionally neutral speech [11,16]. Nevertheless, a third speech condition, in which participants produce the same linguistic content while imagining themselves in each of the aggressive and distress situations, may reduce the ecological validity of the task but could in turn help to disentangle the influence of linguistic content and emotional valence on listeners' ratings of speech stimuli.
Comparing speech to non-speech, our results revealed that listeners judged strength comparably for distressed speech and screams, but were more sensitive to variation in strength, and estimated strength more reliably, from roars than from aggressive speech (see γ (sensitivity) and R² (reliability) in statistical analyses). Thus, roars communicated strength more reliably than aggressive speech, but also exaggerated strength more effectively. These results accord with evidence that affective information is preferentially decoded from nonverbal vocalizations over emotionally inflected speech [83,84], suggesting that nonverbal vocalizations may, in certain contexts, be more effective carriers of motivational and indexical cues than speech. Interestingly, recent work has further shown that identity-related information is more effectively encoded in volitional than in spontaneous laughter [27].
Our results build on evidence by Sell and colleagues that listeners can accurately assess strength from neutral speech stimuli [11], showing here that listeners can also detect strength from emotional speech and nonverbal vocalizations. However, with the exception of female distressed speech, this ability was limited only to aggressive stimuli. Thus, aggressively motivated vocal behavior, whether in the form of speech or nonverbal vocalizations, appears to be optimised to communicate threat potential. These results are consistent with an extensive body of research demonstrating that listeners attend to formidability cues in aggressive calls across a wide range of mammals (e.g., giant pandas [2], sea lions [3], fallow and red deer [4,5], and dogs [6]). Moreover, the fact that variation in strength was generally not detected in distress stimuli indicates that the availability of formidability cues varies with the putative function of the signal, possibly reflecting differential selection on vocalizers to encode formidability cues in aggressive rather than submissive voice signals.
Listeners were less sensitive to variation in actual height than strength, and estimated height less reliably. Nevertheless, they could detect a small but significant proportion of variation in height from male and female distressed speech, female aggressive speech, and male distress screams. Compared to other stimulus types, these stimuli were on average characterized by relatively lower F0, thus facilitating formant perception through increased spectral density [8,85]. They were also characterized by less amplitude modulation than were other stimulus types, thus minimising the interference of sidebands with formant perception. Listeners may have therefore utilised formant cues to estimate height from these vocal stimuli. Our results are consistent with previous work indicating that listeners are only moderately accurate in voice-based estimates of body size for natural height distributions and on the basis of neutral speech stimuli, such as vowel sounds [8][9][10].
The finding that F0 predicted listeners' height ratings but not actual height suggests that F0 may have confounded accurate height assessment. Many studies report a consistent perceptual bias in listeners to associate low-F0 speech with larger body size at the within-sex level [8][9][10][86][87][88][89][90], despite F0 being a very poor predictor of body size when controlling for sex and age [58]. We show that this bias, potentially driven by overgeneralization of sound-size relationships [9] and long thought to interfere with accurate body size estimation ([91,92,9] but see [8]), extends beyond speech to judgments of nonverbal vocalizations. While it has also been reported that low F0 may elicit higher strength attributions in neutral speech [11], our study did not corroborate this finding.
As strength and height were not correlated in the present study, our results provide strong evidence that the human voice contains independent cues to strength and height and that strength cues may be more perceptually salient. This finding complements the greater relevance of physical strength than body size to perceptions of men's fighting ability [51] and bodily attractiveness [93] from images, where absolute strength may be easier to gauge from individual images of bodies than absolute size.
Contrary to some previous studies, we did not find evidence that strength and height are more reliably estimated from male than female voices [9,11], nor that male listeners are more sensitive than female listeners to acoustic cues to body size (e.g., [7] but see [9]). Thus, accuracy in strength and size estimation was largely unaffected by the sex of the vocalizer or listener. Yet male vocalizers were, in reality, both physically stronger and larger than were female vocalizers due to sexual dimorphism in the human body. Listeners' estimates of height correctly reflected this dimorphism in body size, as males were consistently judged as taller than females (though particularly for aggressive and nonverbal vocalisations). In contrast, listeners did not consistently rate male vocalizers as stronger than females. Rather, males were only rated as stronger than females by male listeners, and only for judgments of aggressive roars.
These sex effects partly corroborate those reported in a recent study on relative voice-based judgments of strength and body size [12]. In that study, where we utilized the same roars and aggressive speech sentences as those used here, listeners were more likely to judge vocalizers as taller and stronger relative to themselves when those vocalizers produced roars compared to aggressive speech. This 'exaggerating effect' of roaring only worked for male vocalizers. Moreover, male listeners generally underestimated the size and strength of female vocalizers relative to their own, whereas female listeners overestimated the size and strength of male vocalizers. While the results of the present study are not immediately comparable due to differences in the nature of the task (i.e., absolute versus relative judgments of strength and size), an interesting pattern emerging in both studies is that roars appear to exaggerate strength and size, particularly for men.
In the playback experiments presented here, listeners' ratings of strength and height were absolute and given on a scale ("Rate how strong/tall this vocalizer is"), similar to the method used by Sell and colleagues [11], thus facilitating cross-study comparisons. Other studies have asked listeners to judge the absolute height of vocalizers in centimetres (e.g., [91]) or the relative height of two same-sex vocalizers [8,9,89]. More recently [12], listeners were tasked for the first time with judging the strength and size of vocalizers relative to their own. While the results of these varied studies indicate that listeners can judge strength and size from the voice using either absolute or relative scales, listeners appear particularly accurate when judging the strength and size of others relative to themselves, perhaps because such a task seems the most ecologically valid and thus easiest [12]. We recommend that researchers now examine the acoustic correlates of listeners' relative strength judgments, as this could reveal more consistent and robust effects.
Finally, in the present study, male and female voices were presented in separate blocks. While it is possible that such a design could encourage listeners to judge the strength or size of vocalizers relative to others of the same rather than opposite sex, listeners consistently judged males as larger than females despite a similar blocking design, suggesting that blocking by sex did not substantially influence listeners' ratings.

General discussion
We compared the acoustic structure of aggressive roars, distress screams, and their valenced speech equivalents, and examined the effectiveness of these various vocal stimuli in communicating physical strength (Experiment 1) and height (Experiment 2) to listeners. Our results provide strong evidence that the acoustic structure of human aggressive and distress vocal signals, particularly nonverbal vocalizations (roars and screams), varies according to Morton's motivational-structural rules [30]. Accordingly, aggressive stimuli exaggerated impressions of strength and body size relative to distress stimuli. Corroborating previous attempts [11,15,16], our acoustic analyses did not identify vocal features that reliably mediated the communication of strength, yet listeners could nevertheless accurately estimate strength from male and female aggressive (but not distress) vocal stimuli, and most reliably from aggressive roars. To a lesser degree, listeners could also estimate the height of vocalizers. Roars therefore conveyed honest inter-individual variation in strength more reliably than did any other type of vocal stimulus, and also exaggerated impressions of physical formidability most effectively.
The acoustic basis by which physical formidability (particularly strength) is communicated therefore remains unclear. Loudness and roughness were consistently associated with higher strength ratings, whereas loudness and lower F0 were often associated with higher height ratings, but these acoustic characteristics did not predict actual strength or height, and thus cannot account for the ability of listeners to reliably estimate strength, and to a lesser degree, height, solely from the acoustic structure of vocal stimuli. Similarly, while listeners detected strength variation in voice conditions for which the dominant formant frequency (DFF4) negatively correlated with actual strength, DFF4 did not predict listeners' strength ratings. Listeners also detected strength variation from male aggressive speech and roars despite the absence of acoustic predictors of actual strength for these stimuli. Thus, despite measuring a wide set of relevant acoustic characteristics, our analyses failed to determine the acoustic pathways that mediate strength communication, confirming previous observations based on fewer vocal parameters, namely F0 and formants [11,15–17].
Despite a lack of robust vocal indices of actual physical formidability, this research provides compelling evidence that volitional voice production in an aggressive or submissive context effectively and respectively maximizes or minimizes listeners' impressions of a vocalizer's strength and body size (see also [29]). Differences in the acoustic structure of aggressive and distressed vocal stimuli support the exploitation of perceptual biases linking low and harsh voice frequencies to large body size and dominance [8,9,30,31,33,90,94]. Further experimental research is now needed to elucidate the relative roles of emotional context (aggression versus distress) and vocal stimulus type (nonverbal vocalisation versus speech) on listeners' strength ratings, as both variables accounted for variance in the accuracy of listeners' judgments.
The vocal stimuli used in this study were collected through acted scenarios and hence our results provide novel insight into both the acoustic structure, and probable social functions, of voice modulation and deception. Indeed, the ability to exaggerate one's size or strength through vocal production is likely to have conferred an evolutionary advantage, as both larger body size and greater strength are associated with various socioeconomic, competitive, and mating benefits [93,95–100]. In line with our findings, other recent evidence indicates that the capacity to volitionally exaggerate or minimize body size via simulated nonverbal emotional expressions is not limited to actors [101,102]. In our study, screams and roars, while volitionally produced, nevertheless had the largest effect on listeners' ratings of strength and height. This, paired with recent work showing that listeners can effectively estimate pain intensity from simulated pain cries [29], is consistent with the emerging hypothesis that deceptive voice modulation may be at the origins of selection for humans' uniquely advanced vocal control abilities [20,65,103]. Indeed, some nonhuman mammals already demonstrate a limited capacity for functional vocal deception [103] and body size exaggeration [75,77,47,20] in agonistic contexts, as well as more voluntary vocal flexibility recently observed in nonhuman primates ([104–106]; see also [20] for review). Survival benefits conferred to those able to modulate the expression of primary indexical cues may have given rise to increasingly greater vocal control, paving the way for the evolution of complex speech capabilities [20,103].
However, while the co-optation of primary relationships between acoustic cues and physical attributes may more effectively serve motivational signalling, variation in individuals' capacity to modulate these cues may result in a decoupling between the cues and attributes. This may partly account for the lack of consistent acoustic correlates of actual height or strength observed here and in previous work. Interestingly, that listeners were able to accurately gauge strength from simulated roars and screams suggests that they could detect vocal deception and adjust their judgments accordingly. Evolutionary accounts of vocal signalling contend that, in agonistic or competitive contexts, vocalizers should evolve strategies to better manipulate receivers (thus obfuscating indexical information in favour of motivational signalling), while receivers should evolve ways to detect and resist such manipulation (thus reliably estimating indexical characteristics in spite of deceptive voice modulation) [103,107,108]. In future work, acoustic analyses could be used to investigate whether cues to deception are encoded in nonverbal vocalizations (e.g., whether roars elicited in natural versus simulated contexts vary structurally), and playback experiments could be employed to assess whether listeners can differentiate between natural and simulated vocalisations, or detect volitional vocal exaggeration or minimisation of traits such as body size and strength. Researchers may also examine whether other nonverbal vocalizations relevant to the signalling of formidability (e.g. martial arts kiaps) communicate indexical cues, and whether these vocalizations more reliably communicate motivational state than does speech (e.g. aggression, submission, distress, experienced pain).
It is possible that cues to strength and body size were communicated by acoustic characteristics that were not captured by our acoustic analyses. For example, information may be contained in the dynamic temporal variation of these vocal parameters; indeed, such information is commonly utilised in the construction of model-based emotion recognition from speech [109][110][111]. Listeners may also rely on complex linear or nonlinear combinations of acoustic parameters. While analysis of the individual contribution of acoustic characteristics has revealed numerous indexical cues in human and nonhuman mammal vocal behavior [112], future research should utilise alternative acoustic analytical approaches (e.g. linear interactions between acoustic characteristics, deep neural networks, hidden Markov models) to elucidate more complex acoustic mechanisms potentially communicating not only interindividual variation in strength, but also other functional cues which linear acoustic analysis has been unable to account for (e.g., sex discrimination from babies' cries [55]).

Conclusion
We show that listeners can detect variation in vocalizer strength and body size from simulated nonverbal and verbal vocal stimuli produced in agonistic contexts (aggression and distress, i.e., contexts in which the communication of physical formidability is most ecologically relevant). Roars were particularly effective in communicating strength; the lack of linguistic constraints on aggressive roars appears to afford a greater acoustic space with which to both honestly communicate variation in strength between individuals, and exaggerate strength relative to other vocal signals within individuals. These results complement studies examining the vocal communication and exaggeration of physical traits and threat in nonhuman mammal species [5,44,45,47,78] and add to a growing body of evidence indicating structural and functional homology between human and nonhuman mammal vocalizations such as laughter [18][19][20][21] and infant distress cries [22][23][24]. Nonverbal vocalizations, and the ability to voluntarily produce and modulate them, may constitute a direct intermediary link between involuntary control of stereotyped calls in nonhuman mammals, and full-blown volitional speech in humans [20,65,103]. As such, further investigation into the structure and function of nonverbal vocalizations may be essential to understanding the origins and evolution of human vocal communication (both verbal and nonverbal), and its relationship to animal vocal signals.