The Nature and Perception of Fluctuations in Human Musical Rhythms

Although human musical performances represent one of the most valuable achievements of mankind, the best musicians perform imperfectly. Musical rhythms are not entirely accurate and thus inevitably deviate from the ideal beat pattern. Nevertheless, computer generated perfect beat patterns are frequently devalued by listeners due to a perceived lack of human touch. Professional audio editing software therefore offers a humanizing feature which artificially generates rhythmic fluctuations. However, the built-in humanizing units are essentially random number generators producing only simple uncorrelated fluctuations. Here, for the first time, we establish long-range fluctuations as an inevitable natural companion of both simple and complex human rhythmic performances. Moreover, we demonstrate that listeners strongly prefer long-range correlated fluctuations in musical rhythms. Thus, the favorable fluctuation type for humanizing interbeat intervals coincides with the one generically inherent in human musical performances.


Introduction
The preference for a composition is expected to be influenced by many aspects such as cultural background and taste. Nevertheless, universal statistical properties of music have been unveiled. On very long time scales, comparable to the length of compositions, early numerical studies indicated ''flicker noise'' in musical pitch and loudness fluctuations [1,2] characterized by a power spectral density of 1=f type, f denoting the frequency. In reverse, these findings lead physicists to create so-called stochastic musical compositions. Most listeners judged these compositions to be more pleasing than those obtained using uncorrelated noise or short-term correlated noise [1].
Rhythms play a major role in many physiological systems. Prominent examples include wrist motion [3] and coordination in physiological systems [4][5][6][7][8]. An enormous number of examples for 1=f -noise in many scientific disciplines, such as condensed matter [9,10], econophysics [11], and neurophysics [12][13][14] made general concepts explaining the omnipresence of 1=f -noise conceivable [15][16][17][18][19]. However, to mention only a few, studies of heartbeat intervals [20][21][22][23], gait intervals [7,24], and human sensorimotor coordination [4][5][6]8,25] revealed long-range fluctuations significantly deviating from 1=f b noise with b&1. Negative deviations from b~1 involve fluctuations with weaker persistence than flicker noise which is superpersistent. Thus the scaling exponent b is important for characterizing the universality class of a long-range correlated (physiological) system. An ancient and yet evolving example of coordination in physiological systems are human musical performances. The neuronal mechanisms of timing in the millisecond range are still largely unknown and subject of scientific research [18,[26][27][28][29][30][31]. However, the nature of temporal fluctuations in complex human musical rhythms has never been scrutinized as yet. In this article, we study the correlation properties of temporal fluctuations in music on the timescale of rhythms and their influence on the perception of musical performances. We found long range correlations for both simple and complex rhythmic tasks and for both laypersons and professional musicians well outside the 1=f regime. On the other hand, our study unveils a significant preference of listeners for long-range correlated fluctuations in music.
We analyzed the deviations from the exact beats for various combinations of hand, feet, and vocal performances, by both amateur and professional musicians. The data includes complex drum sequences from different musical bands obtained from a recording studio. While intentional deviations from an ideal rhythmic pattern play an important role in the interpretation of musical pieces, we focused here on the study of 'natural' (i.e. unintended intrinsic) deviations from a given rhythmic pattern of given complexity. In order to measure the deviations of human drumming from a rhythmic reference pattern we took a metronome as a reference and determined the temporal displacement of the recorded beat from the metronome click. A simple example of a recording is shown in Fig. 1: Here a test person had to follow the clicks of a metronome (presented over headphones) beating with one hand on a drum. Fig. 1A shows an excerpt of the audio signal and Fig. 1B the time series of deviations {d n }.
The recordings of beat sequences for several rhythmic tasks performed by humans are compiled in Table 1. In all recordings the subjects were given metronome clicks over headphones, a typical procedure in professional drum recordings. We used the grid of metronome clicks to compute the time series of deviations of beats and offbeats of complex drum sequences (tasks I drum ).
Each of the different drummers performed simultaneously with their feet (for bass drum and hi-hat) and hands. Furthermore we analyzed the recordings of vocal performances of four musicians (tasks I voice ), consisting of short rhythmic sounds of the voice according to a metronome at 124 beats per minute (BPM). Short phonemes (such as ['dee']) were chosen to obtain well separated peaks in the envelopes of the audio signal. For comparison with less complex sensorimotor coordination tasks [4,6,8,25,28,32] we have also included recordings of test subjects tapping with a hand on a drumhead (tasks I tap ). We recorded tapping at two different meters for each test subject, 124 BPM and 180 BPM, omitting socalled 'glitches', i.e. deviations larger than 100 ms. These glitches occurred in less than 0:5% and do not substantially affect the long term behavior we seek to quantify.
We focused on the extraction of possible long-range correlations (LRC) from the recorded signals. A signal is called long-range correlated if its power spectral density (PSD) asymptotically scales in a power law, S(f )*1=f b for small frequencies and 0vbv2. The power law exponent b measures the strength of the persistence. The signal is uncorrelated for b~0 (white noise), while for b~2 the time series typically originates from integrated white noise processes, such as Brownian motion. For bv1 the Wiener Khinchin Theorem links the PSD to the autocorrelation function of the time series {d n } which also decays in a power law C(t)*t b{1 . For the superpersistent case b~1 the correlation function does not decrease within the scaling regime whereas bw1 indicates instationarity of the time series.
We applied various methods that measure the strength of LRC in short time series, namely detrended fluctuations analysis (DFA) [21,33], zero padding PSD method [34], and maximum likelihood estimation (MLE) (see Methods for details). DFA involves calculation of the fluctuation function F (s) measuring the average variance of a time series segment of length s. For fractal scaling one finds F (s)*s a , where a is the so-called Hurst exponent, which is a frequently used measure to quantify LRC [33]. The DFA exponent a is related to the power spectral exponent b via b~2a{1 [35].

Results
We found LRC in musical rhythms for all tasks and subjects (summarized in Table 1) that were able to follow the rhythm for a sufficiently long time. Fig. 2C shows typical fluctuation functions for several time series of deviations {d n }. The power law scaling indicates 1=f b -noise. The obtained power spectral densities display exponents b in a broad range, 0:2vbv1:3 ( Fig. 2 A-B). This strongly indicates that a (conceivable) unique universal exponent is very unlikely to exist (cf. the confidence intervals for b). For the sets I voice and I tap we recorded musicians and non-musicians with different musical experience ranging from laypersons to professionals. Our findings suggest that it is not possible to assign a correlation exponent (a or b) to a certain musician, as a single person may perform differently, which is quantified by both power law exponents a and b (see, e.g. tasks no. 20 and 21 in Fig. 2B, which corresponds to beating on a drum at different metronome tempi). A systematic study of a possible dependence of the correlation exponents on the nature of the task would be interesting in itself but is beyond our focus here.
Although a musician does not necessarily intend to optimize synchrony with the metronome, the lowest values of the standard deviation of the time series were found for experienced musicians (s&12 ms, average standard deviation s s&21 ms). We observed the trend that small values of s correlated with small values of b and therefore conjecture that on average, the more a person Figure 1. Demonstration of the presence of temporal deviations and LRC in a simple drum recording. A professional drummer (inset) was recorded tapping with one hand on a drum trying to synchronize with a metronome at 180 beats per minute (A). An excerpt of the recorded audio signal is shown over the beat index n at sampling rate 44:1 kHz. The beats detected at times S n (green lines, see Methods) are compared with the metronome beats (red dashed lines). (B) The deviations d n~Sn {M n fluctuate around a mean of {16:4 ms, i.e. on average the subject slightly anticipates the ensuing metronome clicks. Inset: The probability density function of the time series is well approximated by a Gaussian distribution (standard deviation 15:6 ms). Our main focus is on more complex rhythmic tasks, however (see Table 1). A detrended fluctuation analysis of {d n } is shown in Fig. 2C  synchronizes with the metronome clicks, the lower the exponent b. Hence, the degree of correlations in the interbeat time series would shrink with increasing external influence. In finger tapping experiments [8,25] LRC were found even without the usage of a metronome [4], i.e., for weak external influence. On the other hand, in the extreme case where the test subject is triggered completely externally (as in reaction time experiments), LRC were entirely absent [4].
In order to probe nonlinear properties in the time series, we applied the magnitude and sign decomposition method [36,37]. The analysis showed strong evidence for an absence of nonlinear correlations in the data (see Methods and Fig. S2). Thus the applied methods of analysis designed to characterize linear LRC are appropriate for the time series of deviations that we have studied here.

Humanizing music
Do listeners prefer music the more the rhythm is accurate? In order to modify precise computer-generated music to make it sound more natural, professional audio software applications are equipped with a so-called 'humanizing' feature. It is also frequently used in post-processing conventional recordings. A humanized sequence is obtained by the following steps (see Methods for details): First, a musical sequence with beat-like characteristics is decomposed into beats [38], which are then shifted individually according to a random time series of deviations {d n }. The decomposed beats are finally merged again using advanced overlap algorithms.
We found that the humanizing tools of widely used professional software applications generally apply white noise fluctuations to musical sequences. Based on the above results we have studied the question whether musical perception can be influenced by humanizing using different noise characteristics. Therefore, a song was created and humanized with 1=f b -noise with b~1 ('version 1=f '), and white noise ('version WN'), where the time series of deviations have zero mean and standard deviation s 1=f~sWN~1 5 ms. Vocals remained unchanged, as well as all other properties of the versions such as pitch and loudness fluctuations, which have been shown to affect the auditory impression [1,2]. Listeners were able to discriminate between the two versions, see Fig. 3. We observed a clear preference for the 1=f humanizing over the white noise humanized version, see the audio example in the Supporting Information where we compare the two versions.

Rating of humanized music
Next, we describe our empirical analysis on the preference of listeners for 1=f and WN (white noise) humanized music, cf. Fig. 3. Two segments of the song under investigation, which were about 30 seconds long, were used. The two versions, 1=f and WN of a segment were played three times to participants. Thus they listened to a total of six pairs (three times the pairs for the first segment, followed by pairs of the second segment). The order of the exact and the 1=f version was randomized across the six pairs, with each version being played the first in three out of six times. For each pair of segments participants were asked one of two questions: (1) ''Which sample sounds more precise?'' or (2) ''Which sample do you prefer?'' Half of the participants first made their preference judgments for all six pairs before they reheard the same six pairs and were asked which of the two versions sounded more precise. The second half of the participants received the same tasks in reversed order. No feedback was provided regarding which of the two versions was more precise.
Participants were asked to assess their musical expertise on a scale from 1 (amateur) to 6 (professional). Participants rated their musical expertise on average as 3:8 (s~0:9). For both tasks (preference and precision) the relative frequency of choosing the 1=f version was computed for each participant. The mean relative frequency with respect to preference for the 1=f version was 0:641 (SEM~0:040). A t-test revealed that these choices deviated significantly from chance (t(38)~3:49, pv0:002), which indicates that participants clearly preferred the 1=f version over the white noise version. In addition they considered the 1=f version to be more precise (t(38)~7:10, pv0:001) with frequency 79:1% (SEM~0:041).

Discussion
This study provides strong evidence for LRC in a broad variety of rhythmic tasks such as hand, feet, but also vocal performances. Therefore these fluctuations are unlikely to be evoked merely by a limb movement. Another observation rather points to mechanisms of rhythmic timing ('internal clocks') that involve memory processes: We found that LRC were entirely absent in individuals who frequently lose rhythm and try to reenter following the metronome. The absence of LRC plays an important role as well in other physiological systems, such as heartbeat fluctuations during deep sleep [22]. While LRC in heartbeat intervals are reminiscent to the wake phase, they were found only in the REM phase of sleep, pointing to a different regulatory mechanism of the heartbeat during non-REM sleep.
Here, the loss of LRC may originate from a resetting of memory in the neurophysical mechanisms controlling rhythmic timing (e.g. neuronal 'population clocks', see [29,30,39] for an overview on neurophysical modeling of timing in the millisecond regime). In the other cases, the existence of strong LRC shows that these clocks have a long persistence even in the presence of a metronome. In cat auditory nerve fibers, LRC were found in neural activity over multiple time scales [12]. Also human EEG data [13] as well as interspike-interval sequences of human single neuron firing activity [14] showed LRC exhibiting power-law scaling behavior. We hypothesize that such processes might be neuronal correlates of the LRC observed here in rhythmic tasks.
In conclusion, we analyzed the statistical nature of temporal fluctuations in complex human musical rhythms. We found LRC in interbeat intervals as a generic feature, that is, a small rhythmic fluctuation at some point in time does not only influence fluctuations shortly thereafter, but even after tens of seconds. Listeners as test subjects significantly preferred music with long-range correlated temporal deviations to uncorrelated humanized music. Therefore, these results may not only impact applications such as audio editing and post production, e.g. in form of a novel humanizing technique [40], but also provide new insights for the neurophysical modeling of timing. We established that the favorable fluctuation type for humanizing interbeat intervals coincide with the one generically inherent in human musical performances. Further work must be undertaken to reveal the reasons for this coincidence.

Beat detection
Let the recording be given by the audio signal amplitude A(t). We define the occurrence of a sound (or beat) at time S n in the audio signal by  clicks and n~1,2 . . ., and given the sounds at times S n , then the deviation d n %T is obtained by the difference The time series of interbeat intervals fI n g is given by I n :~S n {S n{1~T zd n {d n{1 . The beat detection according to Eq. 1 is suitable in particular for simple drum recordings due to the compact shape of a drum sound A(t) (see Fig. 1): The envelope of A(t) rises to a maximum value (''attack phase'') and then decays quickly (''decay phase'') [38]. Thus, if the drum sounds are well separated, a unique extremum (S n ,A(S n )) can be found. In contemporary audio editing software, typically the onset of a beat is detected [38], which is particularly useful when beats overlap. We used onset detection to find the temporal occurrences of sounds for humanizing musical sequences, as explained in the section ''preparation of humanized music''. We generalized definition Eq. 2 in order to consider deviations of a sequence from a complex rhythmic pattern (instead of from a metronome) as shown in Fig. S1.

Audio example of humanized music
We created audio examples to investigate experimentally, whether there is a preference of listeners for LRC in music. Two segments of the pop song ''Everyday, everynight'', which are about 30 seconds long, were used. The song was created and humanized for our study in collaboration with Cubeaudio recording studio (Göttingen, Germany). An audio example consists of two different versions of the same segment of the song, played one after the other. You can listen to an audio example in the Supporting Information section (Audio S1). First, sample A, then sample B is played, separated by a 5 sec pause. The two samples differ only in the rhythmic structure, all other properties such as pitch and timbre are identical.
The first part (sample A) of the audio example (Audio S1) in the Supporting Information was humanized by introducing LRC (using Gaussian 1=f -noise), while for sample B the conventional humanizing technique using Gaussian white noise was applied. While the samples used in the experiments on music perception contained also vocals, in this audio example vocal tracks were excluded for clarity.

Preparation of humanized music
Here we provide details on how the music examples were prepared. For our study, a song was recorded and humanized in collaboration with Cubeaudio recording studio using the professional audio editing software 'Pro Tools' (Version HD 7.4). The song of 4 : 05 min. length has a steady beat in the eighth notes at 250 BPM, leaving a number of N&1000 eighth notes in the whole song. First, the individual instruments were recorded separately, a standard procedure in professional recordings in music studios. This leads to a number of audio tracks while some instruments, such as the drum kit, are represented by several tracks. Then the beats which are supposed to be located at the eighth notes (but are displaced from their ideal positions) were detected for each individual track using onset detection [38] implemented in Pro Tools. The tracks are cut at the detected onsets of beats, resulting in N&1000 audio snippets for each track. Next, the resulting snippets were shifted onto their exact positions, which is commonly called '100% quantization' in audio engineering. This procedure leads to an exact version of the song without any temporal deviations. At this stage humanizing comes into play. In order to humanize the song, we shifted the individual beats according to a time series of deviations fd n g, where n~1 . . . N. For example, if d 3~{ 20 ms, then beat snippet no. 3 is shifted from its exact position by 20 ms ahead (for all tracks). We generated two time series fd n g, one is Gaussian white noise used for ''version WN'', while the other consists of Gaussian 1=f noise to generate ''version 1=f ''. The snippets of the shifted beats are then merged in Pro Tools. The whole procedure was done for each audio track (hence, for each instrument). Finally, the set of all audio tracks is written to a single audio file.

Probing correlations
A method tailored to probe correlation properties of short time series is detrended fluctuation analysis (DFA) which operates in the time domain [21,33]. The integrated time series is divided into boxes of equal length s. DFA involves a detrending of the data in the boxes using a polynomial of degree k (we used k~2 throughout this study). Thereafter the variance F (s) of the fluctuations over the trend is calculated. A linear relationship in a double logarithmic plot indicates the presence of power law (fractal) scaling F (s)*s a . We considered scales s in the range 6vsvN=4, where N is the length of the time series. Once it is statistically established by means of PSD and DFA that the spectral density S(f ) is well-approximated by a power law, we use the Whittle Maximum Likelihood Estimation (MLE) to estimate the exponent b and determine confidence intervals. The MLE is applied to S(f ) in the same frequency range as for the zero padding PSD method [34], which is f Ã =Nvf vf Ã =2, where f Ã~1 =2 is the Nyquist frequency.
Moreover, we analyzed the data set with the magnitude and sign decomposition method that reveals possible nonlinear correlations in short time series [36,37]. In a first step the increments of the original time series are calculated. Second, the and WN (white noise) were compared by 39 listeners. Two samples were played in random order to test subjects, all singers from Gö ttingen choirs, who were asked either one of two questions: (1) ''Which sample sounds more precise?'' (red bar) or (2) ''Which sample do you prefer?'' (blue bar) The answers to the first question provide clear evidence that listeners were able to perceive a difference between the two versions (ttest, pv0:001). Furthermore, the 1/f humanized version was significantly preferred to the WN version (t-test, p~0:0012). doi:10.1371/journal.pone.0026457.g003 increments are decomposed into sign and absolute value (referred to as magnitude) and its average is subtracted. Third, both time series are integrated. Finally, a DFA analysis is applied to the integrated sign and magnitude time series, and the scaling of the fluctuation function is measured. The magnitude series accounts for nonlinear correlations in the original error time series. Fig. S2 shows four typical fluctuation functions (DFA) for their corresponding magnitude interbeat time series. The curves' slopes are close to a~0:5 (bold black line). More precisely, 89% (24 out of 27) of the data set's magnitude exponents a m lie in the interval ½0:46,0:58; three values for a m have values slightly above 0:6 (each data set has less than 1200 data points). Hence, the magnitude and sign decomposition method indicates a lack of nonlinear correlations in the original interbeat time series. Fig. S2 also displays the sign decompositions of the fluctuation functions. For small scales the curves have slopes below a~0:5, whereas those for large scales have slopes around a~0:5. This behavior is representative for the whole data set. As an expected consequence, the sign decomposition shows small scale anticorrelations typically found in gait intervals [7,24], together with a rather uncorrelated behavior on larger scales.

Ethics statement
The study was reviewed and approved by the Ethics Committee of the Psychology Department of the University of Göttingen. The Committee did not require that informed consent was given for the surveys: return of the anonymous questionnaire was accepted as implied consent. Figure S1 Generalization of Eq. 2 in order to consider deviations of a complex rhythm from a complex pattern (instead of from a metronome), shown is a simple example. Beats at times S n (red vertical lines) and an ideal beat pattern with beats at times