A New Method to Explore the Spectral Impact of the Piriform Fossae on the Singing Voice: Benchmarking Using MRI-Based 3D-Printed Vocal Tracts

The piriform fossae are the 2 pear-shaped cavities lateral to the laryngeal vestibule at the lower end of the vocal tract. They act acoustically as side-branches to the main tract, resulting in a spectral zero in the output of the human voice. This study investigates their spectral role by comparing numerical and experimental results for MRI-based 3D-printed Vocal Tracts, for which a new experimental method (based on room acoustics) is introduced. The findings support results in the literature: the piriform fossae create a spectral trough in the region 4–5 kHz and act as formant repellents. Moreover, this study extends those results by demonstrating numerically and perceptually the impact of having large piriform fossae on the sung output.


Introduction
The piriform fossae, or piriform sinuses, owe their name to their pear shape. This pair of bilateral cavities is located posteriorly at the bottom of the pharynx, just above the oesophageal entrance. Together with the laryngeal vestibule and ventricles, they form the hypopharyngeal cavities (see Fig 1), whose acoustic properties are thought to contribute to the acoustic uniqueness of a voice by shaping the formants F3, F4 and F5, with large inter-speaker variations and small intra-speaker (i.e., inter-phoneme) variations [1]. In particular, the piriform fossae, as side branches of the Vocal Tract (VT), produce troughs in the region of 4 to 5 kHz [2], and play a significant role in the singer's formant between 2 and 3 kHz [3]. The singer's formant cluster is a well-established feature of the acoustic output from the VT of trained opera singers that is independent of the vowel being sung [3]. It is commonly described as a cluster of F3, F4 and F5. This suggests that the singer's formant cluster is related to a region of the VT that does not change greatly in shape with vowel articulation; anatomically, this relates to the hypopharyngeal cavities [1]. More precisely, the epilarynx (laryngeal vestibule and laryngeal ventricles) does not change greatly in shape across vowels, whereas Painter [4] claims that, although the volume of the piriform fossae cannot be actively enlarged, it can be actively reduced by action of the inferior pharyngeal constrictor muscles, posteroanterior expansion of the epilarynx, or raising the larynx.
Davies et al. [5] found a decrease of around 5% in F1, F2 for the vowel /a/ when the fossae were incorporated in the vocal tract as side branches. Titze and Story [6] found that the formant frequencies are slightly shifted when appending the piriform fossae to the main tract. In particular, they qualify the fossae as a formant repellent, generally pushing F1, F2, F3 and F4 lower and F5 higher.
Dang and Honda [7] carried out a study of the piriform fossae on mechanical models as well as on human subjects, injecting water in the piriform sinuses of humans phonating in a supine position and in mechanical models of the lower half of Vocal Tracts. Comparing the acoustic output with and without piriform fossae they found that the fossae behave as side branches of the main tract and have a significant effect on the transfer function. For both models and humans, they found that the epilarynx tube resonance was enhanced, and that the fossae not only affected the spectral shape in the neighbourhood of its antiresonance but also decreased the lower resonance frequencies.
Physiologically, the piriform fossae play a role in feeding: they contribute to the process of swallowing by temporarily storing a bolus of food or liquid before it is propelled into the oesophagus [8]. In some mammals (such as the wolf and the fox), the larynx projects directly into the nasopharynx, providing continuity of the airway [9], whereas bipedal man developed a two-part pharynx (the nasopharynx and the oropharynx), which allows food to bypass the larynx laterally, through the piriform fossae, before swallowing [10]. Similar arrangements are found in cats, pigs, goats and the tenrec. From the evolutionary standpoint, neonates evolve from an oropharyngeal anatomy comparable to that seen in the macaque (with an intranarial larynx, [10]) to the morphology shown in Fig 1.
In the present study, we investigate the spectral role of the piriform fossae by numerical simulations using the Finite Element Method and direct experimental measurements of 3D-printed full MRI-based Vocal Tracts, in contrast to the half VTs employed by Dang and Honda [7]. We introduce a new approach (inspired by a method used in room acoustics [11]) to measure the transfer function of MRI-based Vocal Tract replicas moulded with a 3D rapid prototyping technique. We compare the experimental results with numerical simulations using the Finite Element Method. We explore the spectral differences in relation to length and volume measurements of the piriform fossae of 3 professional singers, based on MRI data. Finally, we assess perceptually the impact of having large piriform fossae on the sung output via a listening test.

Ethics statement
This study, labelled ''MRI Capture of the Vocal Tract'' (Project ID: P1135), was ethically approved by the Research Ethics Committee of the York Neuroimaging Centre. The participants provided their written consent to take part in this study.

Singers
For this study, 3 professional singers sang in an MRI scanner, in a supine position (see Fig 2 for their MRI-based Vocal Tracts). The corpus is composed of 1 Mezzo-Soprano, 1 Bari-Tenor and 1 Bass-Baritone. In order to retain anonymity, but to remind the reader what voice type the singers belong to, each singer has been assigned a name with mnemonic similarity to their voice type as follows: BarnaBy is a Bass-Baritone, aged 31; BarTholomew is a Bari-Tenor, aged 34; MariStela is a Mezzo-Soprano, aged 29.
These professional singers have extensive experience performing in famous Opera houses including La Scala Milan, Deutsche Oper Berlin, Covent Garden London, English National Opera London, Opera Comique Paris and La Monnaie Brussels. Further details about the singers are given in Table 1.
The scans of the 3 professional singers were acquired according to the protocol described in [12]. The singers were asked to choose a pitch on which they could comfortably sustain a moderately loud phonation on a phoneme (see Table 1) during the acquisition time (approximately 16 s) and, in case of breathlessness, were instructed to attempt to maintain the articulatory setting in an unvoiced condition while breathing for the remainder of the scan. No instruction was given regarding the operatic quality of voice to be produced. Scans were made at the York Neuroimaging Centre (YNiC), using a General Electric 3.0 T HDx Excite MRI Scanner. The scan used was a 3D fast gradient echo sequence [13]: the relaxation time was 4.8 ms and the excitation time was 1.7 ms. Acquisition is isotropic at 2 mm in a 192 × 192 matrix. Output is then interpolated to 512 × 512 using 50% slice overlap, giving an effective anisotropic output of 0.75 × 0.75 × 1 mm. A stack of 80 images is produced in the midsagittal plane in approximately 16 s.
One consideration when using magnetic resonance imaging to scan the human head is how much phonation differs between the supine and standing positions. Gravity is thought to affect the articulators, resulting in a backwards movement of the tongue and a subsequent narrowing of the pharynx [14,15]. Nevertheless, the phonetic effects of supine phonation are thought to be minimal, perhaps with the aid of compensatory articulations [15,16]. The tip of the tongue has been observed to be subject to a significant retraction in the case of a sustained supine phonation, resulting in motion artifacts in the images [17]. To prevent such image alterations, the subjects were asked to consider carefully the tongue position during phonation.
Note that Speed [13] has recorded the subjects in a supine/standing position in a 6-sided anechoic chamber before and after the supine phonation in the MRI scanner and found that there is a spectral consistency between the supine and standing phonation, despite the gravitational pull on the abdomen during phonation.
Maintaining a constant vocal tract configuration during phonation is crucial to prevent motion artifacts in the image [12]: alterations in the stability of phonation can arise from gravity, lung volume, required longevity of sound and fatigue [17]. According to [13], the data of Barnaby were acquired during the most stable phonation, which gives the clearest edges between the structures on the MRI images and therefore the most accurate segmentation of the MRI data. This is the reason why the data of Barnaby were chosen to compare his VT configuration when singing different vowels, as in the words /h d/, /p t/, /st n/, /fu d/ and /ni p/. Out of these 5 MRI-based VTs, /st n/, /fu d/ and /p t/ were also 3D-printed to enable comparisons between numerical simulations and experimental measurements.

MRI-based 3D-printed Vocal Tracts
The VT models (VTMs) were moulded based on volumetric MRI data collected while Barnaby was singing 3 English vowels in a supine position [12], by a 3D fast gradient echo sequence.
The MRI data were then segmented with the open-source software ITK-SNAP to rebuild a 3D Vocal Tract, whose .STL file was then sent for 3D rapid prototyping. The material used was Vero-WhitePlus Opaque. The tracts were printed on an Object24 3D Printer.
The vocal tracts can be opened just above the valleculae to enable plasticine to be placed in the cavities. The thickness of the shell of the VT is 2 mm.

Experimental Set-up
A new experimental method is used to measure the impulse response, and hence the transfer function, of the MRI-based 3D-printed Vocal Tracts. The method is based on Farina's methodology [11] to measure simultaneously the linear response and harmonic distortions of a room with an exponential sine sweep (ESS). Fig 3 gives an overview of the method, which is developed in the following subsections:
1. The driver is given an input signal (ESS) which is recorded via a probe microphone.
2. The output of the microphone is then convolved with the inverse filter of the input signal ($\mathrm{ESS}^{-1}$).
3. As a result, the impulse response is ''linearised'', i.e. the Linear Impulse Response (LIR), and the harmonic distortions are split apart.
4. An FFT is performed on the LIR, giving the transfer function.
5. The transfer function of the driver alone is subtracted from that with the VTM, giving as a final result the transfer function of the VTM, which is independent of the driver's frequency response.
NB: here, processes 2+3 are termed ''Linearisation of the impulse response''. Processes 1 to 4 in Fig 3 are carried out twice: once with the VTM, and once without. The resulting spectra are subtracted from one another (5 in Fig 3) to provide the transfer function of the VTM.
The experiment was carried out in a 6-sided anechoic chamber, at a temperature of 5 °C. A G.R.A.S. probe microphone type 40SA was used at the glottis end. The signal was preamplified by a power Module type 12AA before being written on a USB type device (TASCAM) at a 192 kHz sampling rate and at 24 bits resolution as a WAV file. The driver was situated 3 cm from the lips end.
Exponential Sine Sweep. An exponential sine sweep (ESS) is of the form
$$ s(t) = \sin\left[ \frac{\omega_1 T}{\ln(\omega_2/\omega_1)} \left( e^{\frac{t}{T}\ln(\omega_2/\omega_1)} - 1 \right) \right] \tag{1} $$
with $t$ being the time, $T$ the duration of the sweep and $\omega_1 = 2\pi f_1$ and $\omega_2 = 2\pi f_2$ the lower and higher extremities of the frequency range swept by the sine. This signal exhibits a −3 dB/octave slope [11].
Linearisation of the impulse response. Let $r(t)$ be the room/cavity response to the excitation signal $s(t)$ defined in (1). The room/cavity impulse response $h(t)$ can be extracted by convolving $r(t)$ with the inverse filter of $s(t)$ [11,18,19]. The exponential sweep (which is a causal signal) is temporally reversed and then delayed to obtain a causal system [20]. However, if we time-reverse the excitation signal $s(t)$, it still exhibits a −3 dB/octave slope. Therefore, we need to compensate for this energy drop by modulating the amplitude of the time-reversed signal with a +6 dB/octave envelope so that the inverse filter exhibits a +3 dB/octave slope [11,19]. We create an inverse filter $f(t)$ which, after being convolved with the system response, yields the impulse response.
This is termed post-modulation, as opposed to the pre-modulation suggested by [19], which modulates the input signal directly so that both it and its time-reversed version have a flat spectrum. The post-modulation term which is to be multiplied with the time-reversed input signal is of the form [19]:
$$ m(t) = \frac{A}{\omega(t)} \tag{4} $$
where $A$ is a scalar representing the modulation amplitude and $\omega(t) = \omega_1 e^{\frac{t}{T}\ln(\omega_2/\omega_1)}$ is the instantaneous frequency of the sweep. At time $t = 0$, the instantaneous frequency $\omega$ equals $\omega_1$. In this condition, we can solve for $A$ in (4), assuming arbitrarily that $m(t) = 1$ at $t = 0$:
$$ A = \omega_1 $$
Substituting $A$ in (4) gives:
$$ m(t) = \frac{\omega_1}{\omega(t)} = e^{-\frac{t}{T}\ln(\omega_2/\omega_1)} $$
Modulating the time-reversed signal gives:
$$ f(t) = m(t)\, s(T - t) \tag{5} $$
which exhibits a slope of +3 dB/octave.
Having designed an inverse filter which counter-balances the −3 dB/octave slope, it is convolved with the system response. The convolution results in a series of impulse responses, separated on the time axis. As can be seen in Fig S1 (A), the Linear Impulse Response (LIR) of the system and its harmonic distortions are temporally separated. Hence, access can be gained simultaneously to the LIR itself and the impulse response of any harmonic distortion.
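The sweep, its post-modulated inverse filter and their convolution can be sketched as follows. This is a minimal illustration rather than the processing chain used in the study: it assumes the standard exponential sweep and an envelope decaying as $\omega_1/\omega(t)$ applied to the time-reversed sweep, and the sampling rate, band limits and function names (`ess`, `inverse_filter`, `convolve`) are hypothetical choices made so that a pure-Python convolution stays fast.

```python
import math

def ess(n_samples, fs, f1, f2):
    """Exponential sine sweep from f1 to f2 (Hz) lasting n_samples / fs seconds."""
    T = n_samples / fs
    w1 = 2.0 * math.pi * f1
    rate = math.log(f2 / f1)  # equals ln(w2 / w1)
    return [math.sin(w1 * T / rate * (math.exp(k / fs / T * rate) - 1.0))
            for k in range(n_samples)]

def inverse_filter(sweep, f1, f2):
    """Time-reversed sweep multiplied by m(t) = w1 / w(t) (assumed post-modulation)."""
    n = len(sweep)
    rate = math.log(f2 / f1)
    # m(t) = exp(-(t / T) * ln(w2 / w1)), with t / T sampled as k / n
    return [math.exp(-(k / n) * rate) * sweep[n - 1 - k] for k in range(n)]

def convolve(a, b):
    """Direct linear convolution; slow, but adequate for a short demonstration."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Hypothetical, deliberately small parameters (not those of the experiment).
fs, f1, f2, n = 4000, 50.0, 1000.0, 1000
s = ess(n, fs, f1, f2)
f = inverse_filter(s, f1, f2)

# With an ideal (distortion-free) system the response is the sweep itself,
# so the convolution should concentrate into one impulse near t = T.
h = convolve(s, f)
peak = max(range(len(h)), key=lambda k: abs(h[k]))
print("peak sample:", peak, "sweep length in samples:", n)
```

In a real measurement the list `s` would be replaced by the recorded microphone signal, and a non-linear transducer would add further, weaker impulses ahead of the main peak.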
Note about the harmonic distortions. Electro-mechanical transducers, such as those used in speakers and microphones, are non-linear systems, i.e. they do not react proportionally to the input signal given. In addition to the linear response of the system, such transducers generate energy at multiples of the excitation frequency, the harmonic distortions of the device. The method described herein allows access to the linear response free of the harmonic distortions generated in both the speaker and the microphone. Therefore, this method is essentially independent of the speaker and the microphone (see Fig S1 (C)).
The convolution packs the harmonic distortions before the linear response on the time axis, as can be seen in Fig S1 (B). The linear response is located at the time $t = T$ and the harmonic distortions are parallel to it.
The big improvement of the method developed in [11] resides in the fact that applying a Fast Fourier Transform (FFT) to the Linear Impulse Response removes the inherent harmonic distortions on the transfer function of the system.
Fast Fourier Transform. Each impulse response, starting with the LIR, is manually isolated from the other impulse responses and an FFT is performed on it, leading to the linear transfer function of the system. Fig S1 (D) shows the transfer function of the harmonic distortions and the linear response.
To isolate the LIR, Audacity software was used to zoom onto a window encompassing only the linear response, (the amplitude being switched to a logarithmic scale to assess more accurately the time interval between the start and the end of the impulse response).
To perform the FFT, an algorithm was used on each impulse response of time duration l, as shown in Fig S2 (C). This process is carried out on 3 impulse responses from the VTM, which are complex-averaged in order to reduce the inherent noise.
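Complex averaging of several spectra can be sketched as below. This is an illustrative sketch only: the algorithm of Fig S2 (C) is not reproduced here, a naive DFT stands in for the FFT, and the impulse response and noise values are made-up numbers in which the noise was constructed to cancel exactly across the three takes.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2)), enough for short windows."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def complex_average(spectra):
    """Average several complex spectra bin by bin, keeping phase information."""
    count = len(spectra)
    return [sum(bins) / count for bins in zip(*spectra)]

def magnitude_db(spectrum, floor=1e-12):
    """Magnitude in dB, with a floor to avoid log10(0)."""
    return [20.0 * math.log10(max(abs(v), floor)) for v in spectrum]

# Hypothetical isolated impulse response and three noisy takes of it.
clean = [1.0, 0.5, -0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
noise = [[0.02, -0.01, 0.0, 0.01, 0.0, -0.02, 0.01, 0.0],
         [-0.01, 0.02, 0.01, 0.0, -0.01, 0.01, 0.0, 0.01],
         [-0.01, -0.01, -0.01, -0.01, 0.01, 0.01, -0.01, -0.01]]
takes = [[c + e for c, e in zip(clean, row)] for row in noise]

averaged = complex_average([dft(t) for t in takes])
reference = dft(clean)
print(magnitude_db(averaged)[:3])
```

Averaging the complex values, rather than the magnitudes, lets uncorrelated noise cancel while the repeatable part of the response is preserved.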

Transfer function
We need to perform the processes 1 to 4 (in Fig 3) twice, once to obtain the transfer function of the VTM, and again to obtain the transfer function of the driver alone. We can then subtract both spectra to get the transfer function of the VT model. The method is driver-independent.
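The subtraction of the two spectra can be sketched as follows, assuming the spectra are compared as magnitudes in dB (so that the subtraction corresponds to dividing the measured response by the driver response). The spectra below are hypothetical values, and `vtm_transfer_db` is an illustrative name.

```python
import math

def to_db(h, floor=1e-12):
    """Magnitude of a (complex) spectral value in dB."""
    return 20.0 * math.log10(max(abs(h), floor))

def vtm_transfer_db(with_vtm, driver_only):
    """Subtract the driver-only spectrum (dB) from the driver+VTM spectrum (dB)."""
    return [to_db(a) - to_db(b) for a, b in zip(with_vtm, driver_only)]

# Hypothetical driver response and VTM response for four frequency bins.
driver = [0.5 + 0.1j, 0.8 - 0.2j, 1.2 + 0.0j, 0.9 + 0.4j]
vtm = [2.0, 0.5, 4.0, 1.0]
measured = [d * v for d, v in zip(driver, vtm)]  # what the microphone would see

result = vtm_transfer_db(measured, driver)
print([round(r, 2) for r in result])
```

Because the driver term appears in both measurements, it drops out of the difference, which is the sense in which the result is driver-independent.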
Note about ESS and $\mathrm{ESS}^{-1}$. Using the ESS (1) as input signal and the inverse filter (5) per se, and plotting spectrograms (frequency versus time), it can be seen that there is an instantaneous burst of energy at the start and at the end of the sweep (see the green vertical lines in Fig 4 (A1)). These are due to the fact that the sweep starts and ends non-smoothly, i.e. the slope is not continuous at the time $t = 0$ and the sweep does not necessarily cross the time axis at $t = T$. If we convolve both those signals, the result is an impulse response and its echoes in the frequency-time space, as in Fig 4 (A3). The remedy is to provide the sine sweep with a fade-in and a fade-out.
A smooth start. In Fig S2 (A), we can see that the transition at the start of the sweep is not smooth. This is due to the fact that before the sweep, the signal value is zero, with zero slope, and suddenly, at the start of the sweep, the slope abruptly changes, creating a slope discontinuity, resulting in a burst of energy across the whole spectrum, prior to the sweep.
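The first derivative at the time origin gives the transition slope. The first time derivative of (1) is
$$ \frac{ds}{dt} = \omega_1 e^{\frac{t}{T}\ln(\omega_2/\omega_1)} \cos\left[ \frac{\omega_1 T}{\ln(\omega_2/\omega_1)} \left( e^{\frac{t}{T}\ln(\omega_2/\omega_1)} - 1 \right) \right] $$
which equals $\omega_1$ at $t = 0$, i.e. a non-zero slope.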
To smooth this transition, the start of the signal is multiplied by a sine-squared envelope (the result is displayed in Fig S2 (A)). Being part of the sigmoid family, it ensures a smooth transition between a threshold value and a fixed value. This transition is applied between the start frequency of the sweep, $f_1$, and a frequency fixed by the user, $f_{in}$. The overall algorithm is as follows:
1. Find the time at which the instantaneous frequency is equal to $f_{in}$.
4. Multiply the signal by the envelope from $t = 0$ to $t = t_{in}$.
The envelope needs to satisfy the following conditions: it ramps up from zero at frequency $f_1$ to 1 at frequency $f_{in}$, after a quarter of a period. In other words, we need to find parameters a and b such that these two conditions hold.
Once the pre-envelope has been applied, we see that the left vertical green line (the broad-band burst of energy preceding the sweep in Fig 4 (A3)), the ''pre-ringing'' to quote Farina [11], disappears.
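The fade-in can be sketched as below. The explicit envelope is not given in the text, so this sketch assumes the form $\sin^2(at + b)$ with $b = 0$ and $a = \pi / (2 t_{in})$, which is zero at $t = 0$ and reaches 1 at $t = t_{in}$ after a quarter of a period. The time $t_{in}$ is obtained by inverting the instantaneous frequency of the exponential sweep; the numerical values and function names are hypothetical.

```python
import math

def time_at_frequency(f, f1, f2, T):
    """Time at which the exponential sweep's instantaneous frequency equals f."""
    return T * math.log(f / f1) / math.log(f2 / f1)

def fade_in(t, t_in):
    """Assumed envelope sin^2(a t + b) with b = 0, a = pi / (2 t_in); 1 after t_in."""
    if t >= t_in:
        return 1.0
    return math.sin(math.pi * t / (2.0 * t_in)) ** 2

# Hypothetical sweep parameters: 20 Hz to 20 kHz in 10 s, fade-in up to 40 Hz.
f1, f2, f_in, T = 20.0, 20000.0, 40.0, 10.0
t_in = time_at_frequency(f_in, f1, f2, T)

for t in (0.0, t_in / 2.0, t_in, 2.0 * t_in):
    print(round(t, 3), round(fade_in(t, t_in), 3))
```

Multiplying each sample of the sweep by `fade_in(t, t_in)` makes both the value and the slope of the signal zero at the time origin, which is what removes the broad-band burst.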
A smooth end. The sweep stops abruptly as soon as the upper frequency limit has been reached, and it is very unlikely that at this exact frequency the amplitude of the sine sweep would be zero (see Fig S2 (B)). For this reason, the sine sweep defined in (1) generally creates a broad-band burst of energy as it ends. A post-envelope needs to be applied to bring the end of the sweep smoothly down to zero. For this purpose, we apply a sine-squared function which takes the value 1 at an upper fixed frequency $f_{out}$ and fades out smoothly to reach zero at $f_2$.
The algorithm is as follows:
1. Find the time at which the instantaneous frequency is equal to $f_{out}$.
4. Multiply the signal by the envelope from $t = t_{out}$ to $t = T$.
We need to find parameters a and b such that the sine-squared goes from the value 1 at $t = t_{out}$ to zero at $t = T$ within a quarter of a period; the parameters follow from subtracting one of these two conditions, (9), from the other, (8). Once the pre- and post-envelopes have been applied, we see that both the left and the right vertical green lines (the broad-band bursts of energy preceding and following the sweep respectively), the ''pre-ringing'' and the ''post-ringing'' [11], disappear, as shown in Fig 4 (C3).
Figure 4. The (ESS) in (1) has a burst of energy across the whole spectrum both at its start and at its end (A1). Once convolved with its inverse filter (A2), it leads to an impulse response and its echoes in the frequency-time space (A3). Providing a smooth start to the (ESS) (B1), and convolving it with its inverse filter (B2) removes the pre-ringing (B3). Providing the (ESS) with both a smooth start and a smooth end (C1), and convolving it with its inverse filter (C2) removes both the pre- and the post-ringing (C3). doi:10.1371/journal.pone.0102680.g004
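One way to obtain the fade-out parameters is worked through below. This is a hedged sketch: it assumes the envelope is written as $\sin^2(at + b)$ and that the two end conditions are taken at phases $\pi/2$ and $\pi$; the explicit conditions of the original derivation are not available here, so the phases chosen are an assumption.

```latex
% Assumed envelope form: e(t) = \sin^2(a t + b)
\begin{align}
a\,t_{out} + b &= \tfrac{\pi}{2} && \text{(envelope equals 1 at } t = t_{out}\text{)} \\
a\,T + b &= \pi && \text{(envelope equals 0 at } t = T\text{)} \\
a\,(t_{out} - T) &= -\tfrac{\pi}{2}
  \;\Rightarrow\; a = \frac{\pi}{2\,(T - t_{out})}, \qquad b = \pi - a\,T
\end{align}
```

Under these assumptions the phase advances by exactly $\pi/2$ between $t_{out}$ and $T$, i.e. a quarter of a period of the underlying sine, so the envelope decreases monotonically from 1 to 0.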

Numerical method
The software ACTRAN (www.fft.be) was used to simulate the transfer functions of the MRI-based Vocal Tracts with the Finite Element Method, implementing the wave equation $\frac{\partial^2 \phi}{\partial t^2} = c^2 \nabla^2 \phi$ on the Vocal Tract. A point source is used as excitation at the glottis end and a probe microphone situated 3 cm from the lips end records the pressure versus the frequency, to obtain the transfer function. A frequency-independent absorption factor is set to the value $\alpha = 0.02$ for the walls of the Vocal Tract. The absorption factor $\alpha$ is defined as
$$ \alpha = 1 - \left| \frac{Z_n - \rho c}{Z_n + \rho c} \right|^2 $$
where $Z_n$ is the normal acoustic impedance, $\rho = 1.269\ \mathrm{kg\,m^{-3}}$ the air density and $c = 334.319\ \mathrm{m\,s^{-1}}$ the speed of sound, chosen to meet the experimental conditions of the anechoic chamber.
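To give a feel for what an absorption factor of 0.02 implies, the sketch below converts between wall impedance and absorption. It assumes the usual normal-incidence definition, $\alpha = 1 - |R|^2$ with $R = (Z_n - \rho c)/(Z_n + \rho c)$, and a purely real wall impedance larger than $\rho c$; both are assumptions of this illustration, not settings taken from the simulation software.

```python
import math

RHO = 1.269    # air density in kg/m^3, as given in the text
C = 334.319    # speed of sound in m/s, as given in the text

def absorption(z_n, rho=RHO, c=C):
    """Normal-incidence absorption coefficient for a wall impedance z_n."""
    r = (z_n - rho * c) / (z_n + rho * c)  # pressure reflection coefficient
    return 1.0 - abs(r) ** 2

def real_impedance_for(alpha, rho=RHO, c=C):
    """Real wall impedance (z_n > rho * c) that yields the absorption alpha."""
    r = math.sqrt(1.0 - alpha)
    return rho * c * (1.0 + r) / (1.0 - r)

z_wall = real_impedance_for(0.02)
print("Z_n / (rho c) =", round(z_wall / (RHO * C), 1))
print("check alpha   =", round(absorption(z_wall), 4))
```

Under these assumptions a 2% absorption corresponds to a wall impedance roughly two hundred times the characteristic impedance of air, i.e. an almost rigid wall.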

Listening Test
The sound outputs of 3 singers (Barnaby, Bartholomew and Maristela) were recorded in a 6-sided anechoic chamber, in a supine position matching phonation position in the MRI scanner. The subjects were fitted with a headset mounted AKG CK77 omnidirectional lavalier microphone and a set of Audio-Technica ATH-M30 closed-back headphones [12].
The listening test consisted of 6 pairs of sounds, where each pair comprises a specific sung vowel with and without the piriform fossae. To create the version without piriform fossae, the sound outputs of the singers with piriform fossae were filtered to mimic how the singers would have sounded without piriform fossae. The filter subtracts the spectrum with piriform fossae from that without piriform fossae. For each pair of sounds (i.e. with and without piriform fossae), 10 expert listeners were asked the question ''Which one would you qualify as a resonant voice?'' after being given the definition of a ''resonant voice'' as: a voice production that is both easy to produce and vibrant in the facial tissues [21]. They were given the following choices as answers: first sound, second sound or no preference.
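The filtering step can be sketched as a per-bin gain, assuming the two spectra are expressed in dB so that their difference translates into a multiplicative gain on the recording's spectrum. All numbers below are hypothetical and `removal_gain` is an illustrative name; the actual filter design used for the test is not specified here.

```python
def removal_gain(with_db, without_db):
    """Per-bin linear gain from the 'without' minus 'with' level difference (dB)."""
    return [10.0 ** ((wo - wi) / 20.0) for wi, wo in zip(with_db, without_db)]

# Hypothetical simulated transfer-function levels (dB) in four frequency bins;
# bin 2 holds the trough created by the piriform fossae.
with_db = [0.0, -3.0, -20.0, -6.0]
without_db = [0.0, -2.0, -8.0, -7.0]
gain = removal_gain(with_db, without_db)

# Hypothetical bin amplitudes of a recorded sung vowel (with fossae).
recorded = [1.0, 0.7, 0.1, 0.5]
filtered = [a * g for a, g in zip(recorded, gain)]
print([round(g, 3) for g in gain])
```

Bins where the fossae create a trough receive a gain above 1, filling the trough back in, while bins that the fossae had boosted are attenuated.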

Results
The first subsection benchmarks the new experimental method against theoretical predictions and numerical simulations of the acoustic modes of a cylinder, the second compares the experimental and numerical results of MRI-based Vocal Tracts and the last one assesses the spectral impact of the piriform fossae on the human singing voice.
For the cylinder, the lowest mode has a wavelength of four times the tube length, giving the name of a quarter wavelength resonator; here $r$, $\phi$ and $z$ are the cylindrical coordinates and $L$ is the length of the tube. When its length is large in comparison with the wavelength, the resonant frequencies can be approximated under the 1D assumption of plane wave propagation [22]; the cross-sectional dimension of the tube should be less than a half-wavelength, which means the approximation is valid up to about 5 kHz (here, the diameter of the cylinder is 30 mm, see next paragraph). Under this assumption, the acoustic modes are given [3,22] as the solutions of
$$ \cos(k L^*) = 0, \quad \text{i.e.} \quad f_n = \frac{(2n-1)\,c}{4 L^*}, \quad n = 1, 2, 3, \ldots $$
where $L^*$ is the acoustical length of the tube (see below).
$$ L^* = L + \delta \tag{13} $$
is the effective acoustical length of the tube, i.e. the physical length plus the end correction $\delta$, which accounts for the small volume of air outside the tube vibrating along with the air inside [23]. The end correction factor is known analytically for two extreme cases, i.e. a cylinder with a circular flange of infinite and zero dimensions [24,25]. The length correction for low frequencies in these two cases is $\delta_\infty = 0.8216R$ and $\delta_0 = 0.6133R$, where $R$ is the radius of the cylinder. A fit formula for an infinite flange is given by Dalmont et al. [26] after Norris and Cheng (1989) for $kR < 3.5$, in which $\delta_\infty = 0.8216R$, $R$ is the radius of the inner tube and $k = \omega/c$ is the wavenumber.
OECC. The Open End Correction Coefficient (OECC) is the coefficient by which $\delta_\infty$ has to be multiplied to account for the finiteness of the flange, bearing in mind that the end correction factor is only known analytically for 2 extreme cases: a cylinder with a circular flange of infinite and zero dimensions.
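The effect of the end correction on the predicted modes can be sketched as below, assuming a cylinder closed at one end and open at the other and the quarter-wavelength relation $f_n = (2n-1)c / (4(L + \delta))$. The radius and speed of sound are taken from the text; the tube length is a hypothetical value, since the length of the benchmark cylinder is not stated here.

```python
C = 334.319                   # speed of sound in m/s (value used in the text)
RADIUS = 0.015                # radius in m of the 30 mm diameter cylinder
DELTA_INF = 0.8216 * RADIUS   # low-frequency end correction, infinite flange
DELTA_0 = 0.6133 * RADIUS     # low-frequency end correction, no flange
LENGTH = 0.17                 # hypothetical tube length in m (assumption)

def closed_open_modes(length, delta, count=4, c=C):
    """Resonances (Hz) of a closed-open tube with effective length length + delta."""
    effective = length + delta
    return [(2 * n - 1) * c / (4.0 * effective) for n in range(1, count + 1)]

flanged = closed_open_modes(LENGTH, DELTA_INF)
unflanged = closed_open_modes(LENGTH, DELTA_0)
for n, (a, b) in enumerate(zip(flanged, unflanged), start=1):
    print(n, round(a, 1), round(b, 1))
```

A finite flange would give values between these two columns, which is the role played by the OECC when it scales $\delta_\infty$.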
Based on experimental data, Dang et al. [27], after Hall (1987), give an empirical formula describing the relation between the OECC and the width of the flange for a low-frequency approximation.
… phonating on /p t/, /fu d/ and /st n/ respectively. The numerical simulations are in good agreement with the experimental measurements, although there are discrepancies, mostly due to the fact that the simulation propagates a lossless wave equation ignoring actual Vocal Tract losses due to turbulence, vorticity, viscous layers, heat losses, etc. Moreover, the absorption coefficient used in the simulation is not frequency dependent, which is unrealistic. Table 2 shows the comparison between the simulated and measured formant frequencies, and their relative difference.
Figure 9. Listening test - perceptual effect of the piriform fossae. Listening test to assess perceptually the spectral effect of appending the piriform fossae to the main tract. 10 expert listeners were asked to choose, for each pair of sounds, which one they would qualify as a ''resonant voice'' [21]. The vertical bars represent the number of positive answers (up to 10) for the sound sample with (blue) and without (red) piriform fossae respectively.

Effect of the piriform fossae
Appending the piriform fossae to the main tract adds a trough around 4-5 (6) kHz in the output spectrum, probably enhancing the perception of the singer's formant cluster (SFC): a broad peak, followed by a trough. This confirms the findings of [28]. This can be seen experimentally in Fig 7, and numerically in Fig 8 for the 3 singers. For the experimental part, the piriform fossae were filled with plasticine to simulate a vocal tract without its fossae. Since it is difficult to smooth the plasticine manually so that it completely fills the piriform fossae, there are differences in the observed results. The experimental results show the effect for the left and right piriform fossae individually. Fig 7 (A, B) show the experimental results for Barnaby phonating on /fu d/ and /ni p/ respectively (with and without piriform fossae). It can be seen that the main frequency region affected by the piriform fossae is between 4 and 5 kHz. The formants below and above this region are repelled: formants whose frequency is lower than the resonance frequency of the fossae are decreased, and those whose frequency is greater are increased. This agrees with the results found in [5-7,28]. Fig 8 shows the numerical results for the 3 singers. The green arrow represents the resonance frequency of the piriform fossae derived from their length (see Table 3). Titze et al. suggested the use of the quarter-wave resonator formula (eq (13) from [6])
$$ F_{sn} = \frac{(2n-1)\,c}{4 L_s} $$
where $F_{sn}$ is the $n$th resonance of the piriform sinuses, $c$ the speed of sound and $L_s$ the length of the sinuses. The predicted spectral zeros are in good accordance with the numerical simulations: the longer the sinus, the lower the resonance frequency. Knowing more accurately the acoustical length of the fossae (accounting for the end correction effect) would give a more accurate prediction of the resonance frequency.
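The quarter-wave estimate can be sketched as below. The sinus lengths are hypothetical placeholders (the values of Table 3 are not reproduced here), the speed of sound is the one used for the simulations in this study, and no end correction is applied.

```python
C = 334.319  # speed of sound in m/s, as used for the simulations in the text

def sinus_resonance(length_m, n=1, c=C):
    """n-th quarter-wave resonance (Hz) of a side branch of the given length."""
    return (2 * n - 1) * c / (4.0 * length_m)

def length_for_resonance(freq_hz, n=1, c=C):
    """Side-branch length (m) whose n-th quarter-wave resonance is freq_hz."""
    return (2 * n - 1) * c / (4.0 * freq_hz)

# Hypothetical sinus lengths in metres, for illustration only.
for length in (0.016, 0.019, 0.022):
    print(length, round(sinus_resonance(length)))

# Length implied by the mean antiresonance across singers reported in the text.
print(round(1000.0 * length_for_resonance(4451.0), 1), "mm")
```

With this speed of sound, the reported mean antiresonance of 4451 Hz corresponds to an acoustic length of roughly 19 mm, and each additional millimetre of length lowers the predicted trough by a few hundred hertz.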
The mean antiresonance frequency across the singers is 4451 Hz with a standard deviation of 340 Hz, whereas the mean across the vowels of Barnaby is 4182 Hz with a standard deviation of 179 Hz. Fig 8 visually confirms the experimental results: the piriform fossae act as formant repellents, i.e. formants with a lower frequency than the resonance frequency (green arrow) see their frequency decreased, and those with a greater frequency see it increased. A listening test was performed to assess perceptually the spectral impact of appending the piriform fossae to the main tract. A group of 10 expert listeners were asked to choose, for each of 6 pairs of sounds, which one they would qualify as being a resonant voice. One of the voice samples included the spectral effect of the piriform fossae and the other did not. The results are shown in Fig 9, where it appears that the bigger the volume of the piriform fossae, the more resonant the voice sounds, perceptually. This supports the view that the piriform fossae spectrally enhance the perception of the SFC.
It is interesting to note that the ratio of the volume of the piriform fossae and the Vocal Tract (penultimate column in Table 3) is related to the amplitude of their effect on the spectrum: the bigger the fraction, the bigger the impact on the transfer function. See for instance Maristela, whose piriform fossae constitute 8% of the Vocal Tract volume: her piriform fossae have a relatively larger spectral impact than those of the other singers.
From Fig 8, it can be seen that the female voice tends to show a spectral trough due to the piriform fossae at a higher frequency range (around 4-5 to 6 kHz) than males (around 3.5 to 5 kHz), which is consistent with the fact that the spectral role of the piriform fossae is to emphasise the SFC.
Moreover, the physiological role of the piriform fossae is to serve as side branches to ''capture'' foreign bodies, instead of swallowing them, but also a part of the food (at least temporarily) and the mucous, for instance when one has a cold [29]. We suggest, therefore, that singers with large piriform fossae would be more affected than others in the production of a ''resonant voice'' when they have a cold or when they have just eaten certain foods which would obstruct the fossae.

Discussion
In this study, we investigated the spectral impact of the piriform fossae on the human singing voice, on MRI-based Vocal Tracts, both experimentally (3D-printed VTs) and numerically. We have introduced a new experimental method based on the exponential sine sweep used in room acoustics [11], enabling transducer-independent measurements of the transfer function of 3D-printed Vocal Tracts with and without piriform fossae (by means of filling the cavities in the 3D-printed models with plasticine). The transfer functions of MRI-based Vocal Tracts of 3 professional singers were simulated numerically with and without the piriform fossae.
The results support the findings previously highlighted in the literature [5-7,28]: the piriform fossae create a spectral trough in the region 4-5 kHz and act as a formant repellent, i.e. appending the piriform fossae repels the formant frequencies from the antiresonance they create. Here, we have provided new data and a new measurement method to confirm this effect through numerical modelling and experimental measurements on complete (rather than half, as previously reported [7]) 3D-printed MRI-based Vocal Tracts, and related it to MRI-based measurements of 3 professional singers.
The plots clearly show that the SFC is spectrally emphasised by appending the piriform fossae: they act as side branches and create an antiresonance (determined by the length of the fossae, see the green lines in Fig 8) at a higher frequency than the SFC (about 1-2 kHz above the SFC). The result is that the SFC is acoustically perceived as being enhanced, as confirmed with the listening test of Fig 9. Our data indicates differences with gender: female voices tend to have the spectral trough higher than males (5-6 kHz in comparison with 4-5 kHz).
From the evolutionary standpoint, the human pharynx is divided into 2 airways, the nasopharynx and the oropharynx [9]: the piriform fossae act as the last step in the process of swallowing: they allow temporary storage of bolus of food and/or liquid to use the airway both for breathing and feeding. The dimensions of these cavities in relation to those of the epilarynx play a particular role in spectrally enhancing the SFC. In addition, our results showed that the bigger the ratio of the volume of the piriform fossae to the volume of the Vocal Tract, the bigger spectral effect they have on the transfer function. This suggests that singers with large piriform fossae might experience a larger spectral change in their singing voice when they have a cold or when they have ingested certain food which obstructs the fossae.
In the future, the listening test should include same-sample comparisons to benchmark and support the results. A more extensive study needs to be performed on a larger number of singers to assess how the dimensions (especially the length) of the piriform fossae define the precise location of the spectral trough and the full extent to which singers might or might not be affected by a cold while singing.
Figure S1 Linearisation and speaker-independence. Convolution of the system response with the inverse filter signal. As a result, the Linear Impulse Response of the system is split from its harmonic distortions (A). The Linear Impulse Response is temporally separated from the harmonic distortions (B). Resonances of one cylinder opened at one end, closed at the other end, with two different transducers (C). Transfer functions of the Linear Impulse Response (LIR) in red and the harmonic distortions in blue (D). (EPS)
Figure S2 A smooth start/end and FFT algorithm. The Exponential Sine Sweep (ESS) in (1) is provided with a smooth start (A) and a smooth end (B) to remove the pre- and post-ringing (see Fig 4). Algorithm used to obtain the transfer function out of an impulse response (C). (EPS)