Effect of Formant Frequency Spacing on Perceived Gender in Pre-Pubertal Children's Voices

Background
It is usually possible to identify the sex of a pre-pubertal child from their voice, despite the absence of sex differences in fundamental frequency at these ages. While it has been suggested that the overall spacing between formants (formant frequency spacing, ΔF) is a key component of the expression and perception of sex in children's voices, the effect of its continuous variation on sex and gender attribution has not yet been investigated.

Methodology/Principal findings
In the present study we manipulated voice ΔF of eight year olds (two boys and two girls) along continua covering the observed variation of this parameter in pre-pubertal voices, and assessed the effect of this variation on adult ratings of speakers' sex and gender in two separate experiments. In the first experiment (sex identification) adults were asked to categorise the voice as either male or female. The resulting identification function exhibited a gradual slope from male to female voice categories. In the second experiment (gender rating), adults rated the voices on a continuum from “masculine boy” to “feminine girl”, gradually decreasing their masculinity ratings as ΔF increased.

Conclusions/Significance
These results indicate that the role of ΔF in voice gender perception, which has been reported in adult voices, extends to pre-pubertal children's voices: variation in ΔF not only affects the perceived sex, but also the perceived masculinity or femininity of the speaker. We discuss the implications of these observations for the expression and perception of gender in children's voices given the absence of anatomical dimorphism in overall vocal tract length before puberty.


Introduction
Adults can discriminate the sex of adult [1] and child [2,3] speakers by listening to their voices alone. Sex identification in adult voices is substantially determined by acoustic differences in fundamental frequency (F0) and in the overall pattern of formant frequencies (ΔF, or formant spacing), which in turn reflect anatomical dimorphisms in the vocal apparatus between the two sexes. During male puberty, the testosterone-related growth of the laryngeal cartilages [4][5][6] and the associated lengthening and stiffening of the vocal folds [7,8] cause men's F0 to drop by almost 50% compared to women's (men's F0: 120 Hz; women's: 200 Hz [8]), giving men their characteristically lower-pitched voices. Moreover, the testosterone-induced difference in body height, with men being on average 7% taller than women [9], coupled with the male-specific secondary descent of the larynx [10], results in men having longer vocal tracts and thus narrower ΔF (15-20% [11,12]) than women, conferring a disproportionately more baritone quality on the male voice [10].
The voices of pre-pubertal children are also acoustically and perceptually different, and perceptual studies show that adults are able to correctly identify gender from the voice in children as young as four [3]. Several acoustic investigations have shown that, while children of both genders speak with similar F0s ([13][14][15]; but see also [16]), boys speak with lower formants and consequently narrower ΔF than girls [2,3,13,14,17,18], despite the absence of overall differences in vocal tract length between the two sexes before puberty [10,19,20,21]. This dimorphism has led to the suggestion that pre-pubertal sex differences in ΔF have a behavioural basis (for example, boys may round their lips or lower their larynx when they speak to lengthen their vocal tracts [2,14]).
Taken together, these studies indicate that the between-sex dimorphism in the voice frequency characteristics (ΔF only in children, and both ΔF and F0 in adults) is perceptually relevant to categorizing the sex of speakers. Moreover, at least in adult voices, between-speaker variation in these parameters appears to also influence the perception of gender, a term which encompasses the biological and social attributes which a given society deems typical of either male (masculine attributes) or female (feminine attributes) sex [22]. For example, listeners consistently rate adult voices with naturally or artificially lower F0, lower ΔF, or both, as belonging to more masculine individuals than their raised versions [23,24]. While variation in F0 and ΔF, which are both sexually dimorphic in adult voices, has been shown to influence listeners' attributions of adults' sex and gender characteristics, to our knowledge the effect of naturalistic variation in ΔF on sex and gender attributions has not been investigated in children's voices, despite the fact that this trait is sexually dimorphic.
Here we investigate whether small increments of ΔF in children's voices affect sex (male, female), as well as gender (masculine, feminine) attributions by adult listeners. In the first experiment (sex identification) we resynthesize ΔF along gender continua within the observed natural variation of this parameter and ask listeners to identify the sex of the speakers. We expect the identification function to be characterized by a gradual change from the male to the female category. In the second experiment (gender rating), we ask listeners to rate each voice stimulus on a scale that combines sex and gender information (from "masculine boy" to "feminine girl"). We expect that small, consecutive increments in ΔF will elicit a gradual increase in listeners' ratings from "masculine boy" to "feminine girl".

Ethics statement
Written consent from children's guardians as well as verbal consent from children were obtained prior to the recording of the voice stimuli. All adult subjects taking part in the psychoacoustic experiments gave written informed consent. Both procedures (voice recording and psychoacoustic experiments) were reviewed and approved by the Ethics Committee of the University of Sussex (authorization codes: DRVC0709 and DRVC0711).

Subjects
252 second-year Psychology students (74 males, 178 females) from Sussex University took part in the psychoacoustic experiments (as part of their practical coursework in a Cognitive Psychology level two module). All subjects were fluent English speakers.

Stimuli
Speech utterances were recorded using a Shure SM94 microphone and a Tascam DR07mkII handheld recorder at a primary school in Sussex, as part of a previous study of gender expression in children's speech. During these recordings, two girls and two boys aged eight were asked to read out seven short words ("bed", "boot", "book", "box", "duck", "hat", "pig"). The recorded single-syllable words were individually standardized to 65 dB and concatenated prior to acoustic analysis and resynthesis.

Acoustic analyses
We extracted F0 and formant frequencies using PRAAT v.5.1.19 freeware [25]. F0 was extracted using the command 'to Pitch', with analysis parameters set to: time-step, 0.01 s; pitch floor, 60 Hz; pitch ceiling, 500 Hz. The frequency values of the first three formants (F1, F2, F3) were extracted using linear predictive coding (LPC) via the 'LPC: To Formants (Burg)' command, with analysis parameters set to: maximum number of formants, 5; maximum formant frequencies, 6000-6600 Hz; window of analysis, 0.025 s. Formant spacing,

$$\Delta F = F_{i+1} - F_i \qquad (1)$$

was derived from F1-F3 values by modelling the vocal tract as a uniform tube closed at the glottis and open at the mouth [26,27]. Under this model, the formant frequencies are expressed as:

$$F_i = \frac{(2i - 1)\,c}{4\,\mathrm{VTL}} \qquad (2)$$

where $i$ is the formant number, $c$ is the speed of sound in a mammal vocal tract (35,000 cm/s), VTL is the vocal tract length (in cm) and $F_i$ is the frequency (in Hz) of the $i$th formant. From (1) and (2), it follows that:

$$\Delta F = F_{i+1} - F_i = \frac{c}{2\,\mathrm{VTL}} \qquad (3)$$

By replacing $c/(2\,\mathrm{VTL})$ with $\Delta F$ in equation (2), ΔF can be derived as the slope of a regression model with the observed $F_i$ values (y-axis) plotted against the expected formant positions:

$$F_i = \Delta F \, \frac{2i - 1}{2}$$

The apparent vocal tract length (aVTL), the inverse acoustic correlate of ΔF measured in cm, is then obtained as aVTL = c/(2ΔF). Therefore the longer the vocal tract, the lower the formant frequencies, and the narrower their overall frequency spacing. All extracted and derived acoustic values are reported in Table 1.
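The regression step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis script used in the study: it assumes a regression forced through the origin (which the uniform-tube model implies but the text does not state explicitly), the function names are hypothetical, and the example formants are generated from an ideal 14 cm tube rather than taken from the recordings.

```python
# Illustrative sketch: estimate formant spacing (ΔF) and apparent vocal
# tract length (aVTL) from measured formants, assuming a uniform tube
# closed at the glottis and open at the mouth.

SPEED_OF_SOUND_CM_S = 35_000  # value of c quoted in the text


def formant_spacing(formants_hz):
    """Slope of a zero-intercept regression of F_i on (2i - 1) / 2."""
    xs = [(2 * i - 1) / 2 for i in range(1, len(formants_hz) + 1)]
    numerator = sum(x * f for x, f in zip(xs, formants_hz))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


def apparent_vtl_cm(delta_f_hz, c=SPEED_OF_SOUND_CM_S):
    """aVTL = c / (2 * ΔF), in cm."""
    return c / (2 * delta_f_hz)


# Hypothetical formants from an ideal 14 cm tube (not study data).
example_formants = [
    (2 * i - 1) * SPEED_OF_SOUND_CM_S / (4 * 14) for i in (1, 2, 3)
]
delta_f = formant_spacing(example_formants)
print(round(delta_f, 1), round(apparent_vtl_cm(delta_f), 2))
```

With real speech the observed formants deviate from the ideal tube positions, so the fitted slope acts as an average spacing across F1-F3 rather than an exact difference between neighbouring formants.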

Re-synthesis
Following acoustic analysis, the stimuli were resynthesized using the "change gender" command in PRAAT. This command uses PSOLA, a resynthesis algorithm that allows the independent manipulation of formant frequency spacing (ΔF), mean fundamental frequency (F0), F0 variation and signal duration while keeping the values of all the other acoustic parameters (amplitude, noisiness etc.) unchanged. The mean fundamental frequencies were all standardised to 260 Hz (the average F0 measured in our sample). In order to remove possible intonation cues to gender, F0 variation was flattened by adjusting F0 values to the mean F0 (thus making the voice monotone). Formant values were scaled up or down in increments of 2%, mimicking equivalent variations of ΔF (and thus aVTL) in speakers' voices. An increase of 2% in formant frequencies (achieved in the 102% stimuli) equates to a 2% increase in ΔF (corresponding to an approximately 2% shortening of the vocal tract), and is expected to feminise the voice.
As formant frequencies in our sample were on average 6% lower in the boy exemplars than in the girl exemplars, just below the gender difference reported in the literature for children of similar age (9-10% [3,18]), male voices were rescaled from 88% to 118%, while female voices were rescaled from 82% to 112%. The resulting continua were therefore not identical, but largely overlapping: the boys' continuum ranged from 1526 Hz to 1138 Hz (aVTLs from 11.5 cm to 15.5 cm), while the girls' continuum ranged from 1542 Hz to 1129 Hz (aVTLs from 11.4 cm to 15.5 cm). Supplementary online material includes audio files of example stimuli for one girl (Audio S1) and one boy (Audio S2) exemplar. The resulting continua are within the range of ΔF variation observed in pre-pubertal children, as derived from published F1-F3 values [14], with aVTLs ranging from 11.4 cm to 15.9 cm for 5-12 year old children. They are also consistent with anatomical variation reported in [10], where VTLs for boys and girls, measured during quiet respiration, varied from 9.7 cm at age 5 to 14.0 cm at age 12.
In summary, we generated 64 audio stimuli consisting of 16 resynthesised variants of the single-syllable word lists by the two boys and the two girls. Figure 1 shows spectrograms of the vowel "?" spoken by one of the exemplars, in which the formants (dark bands of energy in the spectrogram) are shifted compared to the original signal, while signal duration, F0 and F0 variation remain unchanged.
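The arithmetic behind the continua can be checked with a short Python sketch. It only reproduces the design described above (16 scaling steps per exemplar, in 2% increments) and the inverse relation between formant scaling and aVTL; it does not perform any resynthesis, and the function names are hypothetical.

```python
# Illustrative sketch of the resynthesis design: scaling factors per sex
# and the relative change in apparent vocal tract length they imply.


def scaling_factors(start_pct, stop_pct, step_pct=2):
    """Formant scaling factors from start_pct to stop_pct inclusive."""
    return [pct / 100 for pct in range(start_pct, stop_pct + 1, step_pct)]


def relative_avtl(scale):
    """aVTL of a rescaled voice relative to the original.

    Since aVTL = c / (2 * ΔF), multiplying ΔF by `scale` divides aVTL
    by the same factor.
    """
    return 1 / scale


boy_factors = scaling_factors(88, 118)   # boy exemplars: 88% to 118%
girl_factors = scaling_factors(82, 112)  # girl exemplars: 82% to 112%

# Two boy exemplars and two girl exemplars, one stimulus per factor.
n_stimuli = 2 * len(boy_factors) + 2 * len(girl_factors)

print(len(boy_factors), len(girl_factors), n_stimuli)
print(f"102% stimulus: aVTL changes by {100 * (relative_avtl(1.02) - 1):.2f}%")
```

Because aVTL varies as the inverse of ΔF, a 2% increase in formant frequencies corresponds to a shortening of slightly less than 2% (a factor of 1/1.02).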

Procedure
Participants completed the identification experiment first. Stimuli were presented using a PRAAT Multiple Forced Choice (MFC) experiment script and for each stimulus participants were asked to decide if the speaker was male or female (the instruction was: "Please identify the sex of the speaker") by clicking the respective button on the screen (labelled "male" or "female"). A total of 64 different stimuli (16 variants from four exemplars) were presented once in a pseudo-random order. Participants were given an opportunity to pause after each series of 32 presentations. This experiment lasted approximately 10 minutes. In the second experiment, participants were asked to rate the same 64 voice stimuli from the sex identification task (also presented in a pseudo-random order using an MFC experiment script). The instruction was: "Rate the voice of the speaker on a scale of 1 to 7" and buttons were labelled as 1 = masculine boy, 2 = boy, 3 = feminine boy, 4 = neutral, 5 = masculine girl, 6 = girl, 7 = feminine girl.
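For readers who wish to reproduce the trial structure outside PRAAT, the presentation list can be sketched as follows. This is a Python stand-in, not the MFC script used in the experiments: the randomisation scheme (a simple seeded shuffle) and all names are assumptions made for illustration.

```python
import random

# Rating labels as given in the procedure above.
RATING_LABELS = {
    1: "masculine boy", 2: "boy", 3: "feminine boy", 4: "neutral",
    5: "masculine girl", 6: "girl", 7: "feminine girl",
}


def build_trial_blocks(n_exemplars=4, n_variants=16, block_size=32, seed=None):
    """Shuffle all exemplar/variant pairs and split them into blocks.

    Each stimulus appears exactly once; a pause can be offered between blocks.
    """
    trials = [
        (exemplar, variant)
        for exemplar in range(1, n_exemplars + 1)
        for variant in range(1, n_variants + 1)
    ]
    rng = random.Random(seed)  # seeded generator: pseudo-random, reproducible
    rng.shuffle(trials)
    return [trials[i:i + block_size] for i in range(0, len(trials), block_size)]


blocks = build_trial_blocks(seed=0)
print(len(blocks), [len(block) for block in blocks])
```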

Statistical analyses
Because different sets of resynthesis variants (different formant scaling factors) were used for male and female exemplars, data are analysed and reported separately by exemplar's sex.
In order to test the effect of stimuli variant and listener sex on sex identification, we ran Generalised Linear Mixed Models (GLMM) with stimuli variant (scale), listener sex (nominal) and their interaction as fixed factors, exemplar id and subject id as random factors, and sex identification score (0 = male, 1 = female) as a binomial target variable. In order to test the effect of stimuli variant and listener sex on gender ratings we ran Linear Mixed Models (LMM) with stimuli variant (scale), listener sex (nominal) and their interactions as fixed factors, exemplar id and subject id as random factors, and gender rating as a scale outcome variable (from 1 = masculine boy to 7 = feminine girl).
Simple logistic regressions (one for boy exemplars and one for girl exemplars) were then used to illustrate the relationship between formant frequency spacing and identified sex, with average score (over all participants) as the dependent variable and stimuli variant as the independent variable. Logistic models provide estimates for the slope of the category (here 'male' to 'female') transition (b1 coefficient, ranging between 0 and 1, with lower values reflecting steeper transitions) [28][29][30] and for the perceived category boundary (where 50% of stimuli are categorised as male, and 50% as female). The category boundary was computed using the formula -Ln(b0)/Ln(b1), where b0 is the constant of the logistic curve and b1 is the coefficient related to the slope [30,31]. Simple linear regressions with stimuli variant as the predictor variable and average gender ratings (over all the participants) as the outcome variable were used to illustrate the relationship between formant frequency spacing variant and perceived gender. All the statistical analyses were performed using SPSS v.20.0.
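The boundary formula can be illustrated with a short Python sketch. It assumes the logistic curve has the form y = 1 / (1/u + b0·b1^x) with an upper bound u = 1, which is consistent with the boundary formula quoted above but is not spelled out in the text; the coefficients are hypothetical and are not the fitted values from this study.

```python
import math


def logistic_curve(x, b0, b1, upper=1.0):
    """Assumed curve form: y = 1 / (1/upper + b0 * b1**x)."""
    return 1.0 / (1.0 / upper + b0 * b1 ** x)


def category_boundary(b0, b1):
    """Value of x at which y = 0.5 when upper = 1: -ln(b0) / ln(b1)."""
    return -math.log(b0) / math.log(b1)


# Hypothetical coefficients, chosen only to illustrate the formula.
b0, b1 = 1000.0, 0.5
boundary = category_boundary(b0, b1)
print(round(boundary, 2), round(logistic_curve(boundary, b0, b1), 3))

# A smaller b1 (closer to 0) gives a steeper transition around the boundary.
steep = logistic_curve(category_boundary(b0, 0.25) + 1, b0, 0.25)
shallow = logistic_curve(boundary + 1, b0, b1)
print(steep > shallow)
```

Under this form the curve crosses 0.5 exactly where b0·b1^x = 1, which is where the -Ln(b0)/Ln(b1) expression comes from.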

Sex identification experiment
The results of the GLMM on sex identification scores of boy exemplars revealed a significant main effect of stimuli variant,

Gender rating experiment
The results of the LMM on gender ratings of boy exemplars revealed a significant main effect of stimuli variant,

Figure 2. [...] The percentage of stimuli identified as female then increases gradually and linearly, and while no upper plateau is reached, average scores for stimuli 14 to 16 (ΔFs of 1474-1526 Hz) varied from 76% to 85%, indicating that boys' voices with the highest ΔF are mostly classified as female. Average gender rating scores are plotted from 1 = masculine boy (or girl) to 7 = feminine boy (or girl) (right y-axis) and fitted with a linear function (straight grey line). Mean gender ratings of male voices ranged from 1.78 (SE = .07) for the lowest ΔF variants to 5.36 (SE = .08) for the highest ΔF variants. doi:10.1371/journal.pone.0081022.g002

Discussion
The results of the sex identification and gender rating experiments show that ΔF is an important cue for the perception of sex and gender in the pre-pubertal human voice, in line with the previously reported acoustic dimorphism of this parameter in pre-pubertal speakers [8,14,17,32]. More specifically, the absence of a sharp boundary between the sex categories in the identification experiment, in which listeners were asked to identify the child speaker as male or female, suggests that small, sex-related acoustic variation in ΔF proportionally affects the probability that voices are perceived as either male or female by raters. Additionally, the gradual slope in voice ratings from "masculine boy" to "feminine girl" in the second experiment shows that small linear increments in ΔF also proportionally affect listeners' attributions of speakers' gender (from masculinity to femininity). Similar results have been reported in studies of gender perception in adult voices. A study using a combination of identification and discrimination paradigms [29] found that variations along a male-female continuum of F0 and ΔF, the main cues to sex in adult voices, were not remapped by listeners into separate psychological (male or female) categories, indicating that the perception of voice sex was not categorical. Moreover, psychoacoustic studies have shown that both men's and women's voices with naturally low, or artificially lowered, F0 and ΔF (or both), are rated as more masculine [23,24,33].
In the present study, while the resynthesis continua used for boy and girl exemplars were largely overlapping (boys: 1138-1526 Hz; girls: 1129-1542 Hz) and both fell within the range of ΔF values achievable by both genders before puberty [10,14], the effect of the rescaling of ΔF differed between boy and girl voice exemplars, suggesting that the resynthesis of this parameter was not sufficient to produce a voice systematically perceived as belonging to the opposite sex, despite the standardisation of F0 and its variation. In the sex identification experiment, the perceived sex boundary between male and female identification estimated by the logistic model is ~100 Hz higher in boy voice exemplars than in girl voice exemplars (Figure 2, vertical lines), revealing that a greater upward shift in ΔF was required for resynthesized stimuli from the voices of the two boy exemplars to be perceived as female. The identification curve (Figure 2, black line) for the male exemplars is also shifted downwards relative to that of the female exemplars (Figure 3, black line), with a wider plateau at the lower (male) end of the continuum, and no plateau at the upper (female) end of the continuum. Further, the boys' rating function (Figure 2, grey straight line) from the gender rating experiment is shifted downwards compared to girls', revealing that stimuli from boy exemplars were perceived as more masculine than those from girl exemplars.
One possible explanation for the observed perceptual differences is that listeners were affected by acoustic factors other than those manipulated (ΔF) or factored out (F0 and its variation) in the present experiments. For example, Klatt & Klatt [34] report that women are perceived to have more breathy voices than men, corresponding to increased F1 bandwidths and decreased F1 amplitude, while breathy voices are judged as more feminine than less-breathy voices [35], suggesting that, at least in adults, breathiness may be a contributing factor to the perception of sex and gender. The potential role of parameters such as F0, F0 variation and breathiness [8,34], which are sexually dimorphic in adults, but not in pre-pubertal children [13][14][15], in the attribution of sex and gender to children's voices, is an important area for future research.

Figure 3. [...] where between 10% and 15% of the stimuli are identified as female, indicating that stimuli variants with the lowest ΔF are mostly identified as male. The percentage of stimuli identified as female then increases gradually and linearly until it reaches an upper plateau from stimuli 12 to 16 (ΔFs of 1432-1542 Hz), with average scores varying from 92% to 95% and indicating that girl voices with the highest ΔF are mostly classified as female. Average gender rating scores are plotted from 1 = masculine boy (or girl) to 7 = feminine boy (or girl) (right y-axis) and fitted with a linear function (straight grey line). Mean gender ratings of female voices ranged from 2.33 (SE = .02) for the lowest ΔF variants to 6.10 (SE = .06) for the highest ΔF variants. doi:10.1371/journal.pone.0081022.g003
Independently of other hypothetical voice cues to sex and gender attributions of pre-pubertal children's voices, this study clearly identifies a substantial effect of ΔF variation on adults' ratings of gender in pre-pubertal speakers, with lower ΔF being consistently rated as belonging to more masculine children. ΔF variation has also been shown to affect judgements of body size and age in adult speakers, with listeners rating lower ΔF as belonging to older and larger individuals [36][37][38][39]. These perceptual differences in turn appear to relate to actual differences in age and size of speakers [39][40][41]. By extending the present paradigm to include age and body size ratings, future studies could investigate the perceptual linking of age-related size and gender dimensions, for example whether children that are perceived to be more masculine are also perceived to be older and bigger than their more feminine counterparts. Moreover, the use of natural (rather than re-synthesised) stimuli from children of different ages, body sizes and masculinities (e.g. as assessed by children's personal attributes questionnaires [42]), and of raters of different ages, would help clarify the extent to which ΔF reliably cues these dimensions throughout the lifespan.
Our observations that baseline ΔF variation within the natural range of children's voices affects listeners' sex and gender attributions (despite the absence of a clear anatomical basis for such variation) lend further support to the hypothesis that sex and gender expression in pre-pubertal children's voices has a strong behavioural, acquired dimension (with children learning to adjust their VTL in order to sound more or less feminine/masculine). Future studies using, for example, cine 3D structural MRI are now needed to further test this hypothesis.
Furthermore, it has been shown that children can also spontaneously modify ΔF (and F0) when asked to sound more or less like a boy or girl (Cartei, Cowles, Banerjee and Reby, unpublished data), suggesting that children can also control the gender-related characteristics of their voices. The extent to which this ability affects the expression of gender in everyday speech, in line with varying gendered roles (e.g. to affiliate with same-sex peers) and contexts (e.g. when speaking to a male or female), and its perceptual relevance in gendered attributions, remain to be investigated.

Supporting Information
Audio S1. This audio file contains three variants derived from one of the two girl exemplar voices (exemplar 2), in which formant spacing was resynthesized from low (longer vocal tract, more masculine-sounding voice) to high (shorter vocal tract, more feminine-sounding voice) values (ΔFs: 88%, 102%, 110%). (WAV)

Audio S2. This audio file contains three variants derived from one of the two boy exemplar voices (exemplar 4), in which formant spacing was resynthesized from low (longer vocal tract, more masculine-sounding voice) to high (shorter vocal tract, more feminine-sounding voice) values (ΔFs: 94%, 104%, 112%).