Exploring the distribution of statistical feature parameters for natural sound textures

Sounds like “running water” and “buzzing bees” are classes of sounds which are a collective result of many similar acoustic events and are known as “sound textures”. A recent psychoacoustic study using sound textures has reported that natural sounding textures can be synthesized from white noise by imposing statistical features such as marginals and correlations computed from the outputs of cochlear models responding to the textures. These outputs are the envelopes of bandpass filter responses, the “cochlear envelopes”. This suggests that the perceptual qualities of many natural sounds derive directly from such statistical features, and raises the question of how these statistical features are distributed in the acoustic environment. To address this question, we collected a corpus of 200 sound textures from public online sources and analyzed the distributions of the textures’ marginal statistics (mean, variance, skew, and kurtosis), cross-frequency correlations and modulation power statistics. A principal component analysis of these parameters revealed a great deal of redundancy in the texture parameters. For example, just two marginal principal components, which can be thought of as measuring the sparseness or burstiness of a texture, capture as much as 64% of the variance of the 128-dimensional marginal parameter space, while the first two principal components of cochlear correlations capture as much as 88% of the variance in the 496 correlation parameters. Knowledge of the statistical distributions documented here may help guide the choice of acoustic stimuli with high ecological validity in future research.


Introduction
parameter distributions that characterize a large portion of the perceptual diversity of environmental sounds could be very useful, as it would allow us to ask how sound stimuli used in psychoacoustic or physiological studies of the auditory system relate to the types of sounds the auditory system actually encounters, and may have adapted to. Here we describe our attempt to characterize these distributions by collecting and statistically analysing a large corpus of a class of natural sounds known as sound textures. We understand sound textures in the sense popularized by [1], as sounds that may have a lot of complexity, like for example the sound of waves breaking on a pebble beach, but which are nonetheless fully described by a finite set of stationary statistical parameters. Hence highly realistic exemplars of such sounds can be synthesized from scratch by morphing random noise samples to assume the spectral, modulation power and cross-frequency correlation structure characteristic of that type of texture. While textures defined in this way are fundamentally stochastic, and thereby exclude some important classes of sounds which are highly deterministic (such as highly regular rhythms) or rule based (such as a spoken sentence or a piece of music), they nevertheless cover a large proportion of the sounds encountered in natural and man-made environments, and the fact that they appear to be well characterized by a potentially large but finite number of stationary statistical parameters makes the research questions we are pursuing here tractable.

Previous studies have identified a variety of parameters that are in principle suitable for characterizing sounds, including, for example, features like Mel-frequency cepstral coefficients (MFCC), band energy ratio, spectral flux and the wavelet subspace cepstrum.
These have been nicely reviewed by [2], and are often used in applications such as sound event classification and computational auditory scene analysis (CASA). However, cepstral coefficients tend to look at relatively short time windows, so here we chose to use the auditory texture statistics by [1], which were inspired by the previous characterization of features used in visual texture discrimination research [3], [4], [5]. [1] were able to show that natural sounding textures can be synthesized de novo by "shaping-noise" to impose the statistical features of the desired sound on random noise samples. Sounds synthesized from "white noise" by [1] were often easily identifiable as exemplars of a particular type of natural sound, and in many cases indistinguishable from a natural recording. The statistical parameters they adopted in their study was of these "cochlear envelopes". The mean, variance, skew and kurtosis (also referred to as the first, second, third and fourth moment respectively) of the envelope amplitudes will collectively be referred to as "marginal moments", or "marginals" of the sound texture. In addition, the pairwise correlations between cochlear envelope amplitudes ("cochlear correlations", for short) are computed. Previous studies by [5,6] reported that both marginal moments and correlations are important features of visual textures, and the same is clearly true for auditory textures. Furthermore, the compressed cochlear envelopes are also passed through a second bank of band pass filters to measure the distribution of amplitude modulations (the "modulation power" statistics) and to compute correlations between modulation channels.

To appreciate how the types of statistical parameters extracted by the [1] model can distinguish types of sound textures, consider that some textures are "sparse", exhibiting periods of relative silence with bursts of energy of widely varying amplitude distribution (e.g. grains of hail bouncing on a tin roof), while others have a much more constant stream of sound (e.g. a high-pressure jet of water rushing out of a faucet).

Marginal moments will distinguish sparse from less sparse sounds easily, with sparse sounds having relatively greater variance, skew and kurtosis. Indeed, the usefulness of marginal moments in distinguishing natural sounds and images has been appreciated for a while [7,8]. Similarly, modulation power statistics may be useful to distinguish "buzzing insects" sound textures, which have low modulation power at low frequencies but higher modulation power at higher frequency bands, from "waves on a beach" for which modulation power is relatively uniform across all modulation frequency bands.
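
As a rough illustration of why marginal moments separate sparse from dense textures, the sketch below computes the four moments for two synthetic envelopes. The function name `marginal_moments`, the toy envelopes and the use of plain population moments are assumptions made for illustration only; the toolbox used in this study may normalise these quantities differently.

```python
import random


def marginal_moments(envelope):
    """Return (mean, variance, skew, kurtosis) of a sequence of envelope amplitudes."""
    n = len(envelope)
    mean = sum(envelope) / n
    # Central moments of order 2, 3 and 4 (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in envelope) / n
    m3 = sum((x - mean) ** 3 for x in envelope) / n
    m4 = sum((x - mean) ** 4 for x in envelope) / n
    skew = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis: a Gaussian gives 3
    return mean, m2, skew, kurtosis


rng = random.Random(0)
n_samples = 20000

# Hypothetical "dense" envelope: amplitude fluctuates mildly around a steady level.
dense = [1.0 + 0.1 * rng.random() for _ in range(n_samples)]

# Hypothetical "sparse" envelope: near silence with occasional bursts of varying size.
sparse = [rng.expovariate(1.0) if rng.random() < 0.02 else 0.01
          for _ in range(n_samples)]

for name, env in (("dense", dense), ("sparse", sparse)):
    mean, var, skew, kurt = marginal_moments(env)
    print(f"{name:6s} mean={mean:.4f} var={var:.5f} skew={skew:.2f} kurt={kurt:.2f}")
```

In this toy case the sparse envelope shows the larger variance, skew and kurtosis, which is the pattern described above for sparse textures.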

To examine the statistical parameter space spanned by natural sound textures, we collected a corpus of natural sound recordings, computed their statistical parameters using the [1] framework, and subjected the resulting database of statistical parameters

We collected 450 high quality raw sound samples from freesound, a freely available web resource [9]. After a preliminary inspection, we selected 200 sound samples which were deemed to be "texture like". Sound clips with long durations of silence were excluded. All the sounds have a sample rate of 48 kHz, and each clip is 15 s long.

Using the sound synthesis toolbox, marginal, correlation and modulation power statistics were computed for the entire corpus. Envelope variance and kurtosis, as well as modulation power parameters, are log transformed to make their distributions more symmetric around the mean. Each of these parameter sets is normalized and centered by z-scoring, and the z-scored parameter sets are subjected to PCA and interpreted.
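
The preprocessing chain described above (log transform, z-scoring, then PCA) can be sketched as follows. This is a minimal illustration using NumPy on synthetic stand-in data, not the analysis code used for the corpus; the helper names `zscore` and `pca`, and the single latent factor used to generate the fake kurtosis values, are assumptions.

```python
import numpy as np


def zscore(params):
    """Centre each column (parameter) on zero and scale it to unit variance."""
    return (params - params.mean(axis=0)) / params.std(axis=0)


def pca(z):
    """PCA via SVD of a z-scored (n_sounds x n_params) matrix.

    Returns the components (one PC per row) and the fraction of
    variance explained by each PC.
    """
    _, singular_values, components = np.linalg.svd(z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return components, explained


rng = np.random.default_rng(0)
n_sounds, n_bands = 200, 32

# Hypothetical stand-in data: one latent factor shared across all bands drives
# the envelope kurtosis of each sound, plus a little band-specific noise.
latent = rng.normal(size=(n_sounds, 1))
kurtosis = np.exp(1.5 + 0.5 * latent + 0.1 * rng.normal(size=(n_sounds, n_bands)))

# Log transform to make the distribution more symmetric, then z-score.
z = zscore(np.log(kurtosis))
components, explained = pca(z)
print("variance explained by the first two PCs:", explained[:2])
```

Because the synthetic bands share one latent factor, the first PC dominates here; this only mimics the kind of redundancy across frequency bands that PCA is used to expose, and the numbers it prints say nothing about the real corpus.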

Sound collection
The Sound Synthesis Toolbox V1.7 by [1] segregates each input sound into a

The Sound Synthesis Toolbox V1.7 also computes the pair-wise correlation between the cochlear envelopes, yielding 32×32=1024 correlation parameters (although these are redundant given that the correlation matrix is symmetric around the main diagonal, which leaves 32×31/2=496 unique off-diagonal correlations). To compute the modulation power parameters, each cochlear envelope is passed through another set of 20 "modulation" bandpass filters.
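
Because the correlation matrix is symmetric, only the channel pairs above the main diagonal carry independent information. The sketch below counts and computes these pairs for 32 synthetic envelopes. The helper names `pearson` and `unique_correlations` and the toy envelopes are hypothetical; the toolbox's own correlation routine is not reproduced here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def unique_correlations(envelopes):
    """Correlation for every unordered pair of distinct channels (upper triangle)."""
    n = len(envelopes)
    return {(i, j): pearson(envelopes[i], envelopes[j])
            for i in range(n) for j in range(i + 1, n)}


rng = random.Random(0)
n_channels, n_samples = 32, 500

# Hypothetical envelopes: a modulation shared by all channels plus
# independent noise in each channel.
shared = [rng.random() for _ in range(n_samples)]
envelopes = [[s + 0.5 * rng.random() for s in shared] for _ in range(n_channels)]

corr = unique_correlations(envelopes)
print(len(corr))  # 32 * 31 / 2 = 496 unique channel pairs
```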

The center frequencies of these modulation filters are equally spaced on a log scale from 0.5 to 200 Hz (the same parameters as those used by [1]). Modulation power is then

in their envelope amplitudes must have a relatively lower mean envelope "baseline"). The 1st PC distinguishes textures of relatively low mean and high variance, skew and kurtosis from textures for which the reverse is true. The 2nd PC has mean and skew values that are near zero, and thus mostly distinguishes textures with low variance but high kurtosis, particularly for frequency bands above 800 Hz, from sounds with the opposite feature combination.
To illustrate that the first PC of marginals distinguishes sound textures along a from our corpus, "sea at night" and "clock ticks", in

Comparison between the envelope statistics of "sea at night" from one end of the PC1 dimension and "clock ticks" from the other end. (A) "Sea at night" has a higher envelope mean than "clock ticks", as it lies at the lower end of the PC1 dimension. (B, C, D) The envelope of "clock ticks" has higher variance, skewness and kurtosis than that of "sea at night" for frequencies above 800 Hz.

but skew and kurtosis, as higher order moments, are "more sensitive" to such excursions, growing with the third and fourth powers of the deviation from the mean respectively, rather than just the square. Thus, an envelope distribution with a large kurtosis but a small variance will have a particularly long, thin "tail", meaning that sound amplitudes can shoot up to very large values relatively frequently, but will not spend much time at "middling" amplitude levels, while for a texture with relatively larger variance and smaller kurtosis, the converse is true. We would therefore expect sounds with large marginal PC2 scores to be not just sparse, but "bursty", exhibiting

Thus, PC2 appears to rank sound textures on how "bursty" they are. In Fig 5 we illustrate the marginal statistics for two sounds chosen to vary systematically along PC2, but to have approximately the same values for PC1: "restaurants", and "roosters".

"roosters" has a lower mean cochlear envelope than "restaurant". (B) "roosters" has a higher variance cochlear envelope than "restaurant". PC2 in Fig 3D indicates that as we move along PC2, sounds should have opposing trends in the mean and variance values of their cochlear envelopes. (C, D) As we move in the PC2 direction, "skew" and "kurtosis" should be higher. "roosters" has higher skew and kurtosis than "restaurant".
In summary, the first two PCs of the marginals of our corpus of sound textures between them account for two thirds of the variance (45% and 21% respectively, see in environmental sound textures are highly redundant. We also observed that the two first PCs lend themselves to an intuitive interpretation, capturing features that can be described as the "sparseness" or the "burstiness" respectively of the sounds.

Figure 6B shows the distribution of correlations after pre-processing for the PCA via z-scoring to achieve a more symmetric distribution.

The first PC (Fig 7C) is essentially completely "flat", and it will therefore the high frequency part of its cochleagram, there are also many prominent narrow band features, particularly in the lower frequencies. In contrast, the "applauding crowds" sound in Fig 7F has a high PC1 correlation coordinate (Fig 7B, blue dot), and a lot of prominent vertical striping throughout its cochleagram.

The second PC of the correlation parameters captures whether correlations are more prominent in low or high frequency bands (see Fig 7D). Normalized cochleagrams of sound textures with very different PC2 coordinates are shown in Fig 7G and 7H.

The sound texture "fire outside woods" (Fig 7G, green dot in Fig 7B)

0.1141 respectively. Log transformation and z-scoring for PCA preprocessing made the distribution much more symmetric (Fig 8B).

very similar manner across all cochlear frequency bands (Fig 9C). Meanwhile, PC2 (shown in Fig 9D) accounts for 25% of the variance and is sensitive to the extent to which sound textures exhibit amplitude modulations at "middling" modulation frequencies of around 30-100 Hz.

Distribution of sound textures from our corpus along the first two PC coordinates for modulation power. The first two PCs capture ~73% of the variance between them. (C, D) Shape of the first and second PC respectively. PC1 discriminates "slowly" from "rapidly" modulating sounds, with a boundary near 60 Hz for all cochlear frequencies. PC2 discriminates sounds with prominent modulations in a "mid range" (near 60 Hz) from sounds lacking such modulations. (E) Modulation spectrum of sound texture sample "gunshots", showing prominent modulation at low rates. (F) Modulation spectrum for "bees". High modulation frequencies (> ~80 Hz) dominate. (G, H) The modulation spectrum for "applauding crowd" shows a relative dearth of modulations near 60 Hz, while that for "vacuum cleaner" shows prominent ~60 Hz modulations.
We

The idea that statistical regularities may govern the types of sensory stimuli we encounter in our environment has a long history, as does the idea that the sensory systems may be adapted to some of these statistical features or regularities [11,12].

This idea has arguably been much more influential in vision research than in hearing research. For example, an attempt by [13] to explain the centre-surround structure of primary cortex visual receptive fields as nature's solution to the problem of having to encode the structure of visual scenes in a sparse, and hence energy-efficient, manner, has become enormously influential. (Note, however, that more recent work by [14] proposes an intriguing alternative explanation, namely that cortical receptive field structure not just of visual but also auditory cortical neurons may be optimized to facilitate prediction of future inputs, rather than energy efficiency.)

An early example of work looking for statistical regularities specifically in the auditory modality comes from [15], who already reported over 40 years ago that pitch and amplitude fluctuations over long segments of music and speech streams recorded from the radio exhibited a so-called 1/f distribution. Garcia-Lazaro and colleagues [16] later built on that observation and showed that auditory cortex neurons appear to be tuned to these statistics, in that they respond more strongly and reproducibly to artificial sound streams that follow 1/f distributions than to sounds which fluctuate according to slower (1/f^0.5), or faster (1/f^2) distributions. This was later shown to be an emergent property of the ascending auditory pathway, as inferior colliculus neurons generally prefer more rapidly fluctuating sounds, and neurons in the medial geniculate exhibited no particular preference for fluctuations that were either faster or slower than 1/f [17]. These studies are conceptually similar to earlier work by

Also highly relevant are studies by [7,18]

sounds and their relevance to auditory processing include a well-known study by [19], which presented an efficient coding argument alongside an analysis of natural sounds to explain the cochlear frequency tuning characteristics, or a study by [20] which described the low-pass nature of spectral and temporal modulations in natural sounds, in a manner corroborating and extending the findings by [15].

Lower-dimensional descriptions of natural sound statistics which nevertheless capture much of the richness of the auditory environment should therefore be possible.

Another noteworthy finding of our PCA is that it illustrates the high degree to which many statistical features tend to co-vary across frequency bands. Thus, the first PC across the marginals showed very little variation in Mean, Variance or Kurtosis as a function of cochlear frequency (Fig 3C), the first PC of

and reported that temporal lower order statistics for a given sound sample tend to be highly similar across frequency bands. Nevertheless, in combination with the many additional values which we report here, this confirmatory finding is potentially quite useful. Thus, if someone presented us with a "mystery texture sound", reproduced at a unit RMS amplitude, and asked us to guess what its statistical parameters are likely to be in some particular frequency band, then we would be able to declare with some confidence, firstly, that the particular frequency band probably does not matter, secondly, that its mean envelope amplitude has a 90% chance of falling between 0.0338 and 0.1618 with a maximum likelihood value of ~0.0905 (Fig 2A), the variance of the envelope amplitude has a 90% chance of falling between ~0.1292 and 0.7763 with a maximum likelihood of ~0.315 (Fig 2C), its skewness has a 90% chance of falling between ~-1.8 and +3 with a maximum likelihood of ~0.6 (Fig 2E), and its kurtosis is 90% likely to fall between ~1 and 18 (Fig 2G) with a maximum likelihood of ~5. Similarly, envelopes in any two cochlear frequency channels are a priori more likely than not to be substantially correlated, with an R > 0.55 (Figure 6).
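
One way to read the 90% ranges quoted above is as central intervals of an empirical distribution, bounded by its 5th and 95th percentiles. The text does not state how the ranges were actually derived, so this reading, the helper names `percentile` and `central_interval`, and the synthetic data are all assumptions made for illustration.

```python
import random


def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of pre-sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def central_interval(values, coverage=90.0):
    """Bounds that leave equal tails outside the requested coverage."""
    ordered = sorted(values)
    tail = (100.0 - coverage) / 2.0
    return percentile(ordered, tail), percentile(ordered, 100.0 - tail)


rng = random.Random(0)

# Hypothetical stand-in for one parameter pooled over 200 sounds x 32 bands.
values = [rng.lognormvariate(-2.4, 0.5) for _ in range(200 * 32)]

low, high = central_interval(values)
inside = sum(low <= v <= high for v in values) / len(values)
print(f"90% central interval: {low:.4f} to {high:.4f} (fraction inside: {inside:.3f})")
```

An interval of this kind is what would back an "informed guess" about a parameter of an unseen texture, under the assumption that the new sound is drawn from the same population as the corpus.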

The data presented here can therefore facilitate informed guesses about as yet unknown natural sounds that we may be presented with in the future, and we hope that a better characterization of the statistical features of natural sounds will enable us to start asking better questions about the extent to which expectations derived from these distributions may be "built into" the functional anatomy of our central auditory nervous system.