Correlated microtiming deviations in jazz and rock music

Musical rhythms performed by humans typically show temporal fluctuations. While these fluctuations have been characterized in simple rhythmic tasks, their nature remains an open question when several musicians perform music jointly in all its natural complexity. To study such fluctuations in over 100 original jazz and rock/pop recordings played with and without metronome, we developed a semi-automated workflow allowing the extraction of cymbal beat onsets with millisecond precision. Analyzing the inter-beat interval (IBI) time series revealed evidence for two long-range correlated processes characterized by power laws in the IBI power spectral densities. One process dominates on short timescales (t < 8 beats) and reflects microtiming variability in the generation of single beats. The other dominates on longer timescales and reflects slow tempo variations. Whereas the latter did not show differences between musical genres (jazz vs. rock/pop), the process on short timescales showed higher variability for jazz recordings, indicating that jazz makes stronger use of microtiming fluctuations within a measure than rock/pop. Our results elucidate principles of rhythmic performance and can inspire algorithms for artificial music generation. By studying microtiming fluctuations in original music recordings, we bridge the gap between minimalistic tapping paradigms and expressive rhythmic performances.


Introduction
The art of creating music involves a balance of surprise and predictability. This balance needs to be achieved on many scales, and for many musical components like melody, dynamics, and rhythm. Such a balance is believed to be essential for making music interesting and appealing [1-5]. While musicians achieve this balance intuitively, the principles generating it remain unknown. A core hypothesis conjectures that this balance manifests itself in long-range correlations (LRCs) and self-similar structure of melody, dynamics, and rhythm. In fact, first evidence for this hypothesis was provided by Voss and Clarke [6], who identified LRCs in pitch and loudness fluctuations. More recently, LRCs were found in the rhythmic structure of Western classical music compositions [2], i.e. in written notations, where the rhythm is represented in a metrically organized, precise fashion. Such compositions may be played back in this precise fashion, e.g., by computers, but are often perceived to sound mechanical and unnatural [7]. In performed music, in contrast, musicians introduce subtle deviations from the metrically precise temporal location, which make the performance sound human.
Such microtiming deviations on the one hand are inevitable in human performances, as human abilities to produce precisely timed temporal intervals are limited [8,9]. On the other hand, they can be introduced on purpose and contribute to a musician's individual expression. It is thus worthwhile elucidating the nature of temporal fluctuations and factors contributing to them in various musical contexts. Inferring such microtiming deviations from ready-made musical recordings is a challenge, however, because beat onsets must be determined with millisecond precision. In past studies this precision was achieved using fairly reduced settings, e.g. simple finger-tapping tasks [7,10-20]. For those performed with metronome, LRCs were identified for microtiming deviations ($e_i$) from metronome clicks [7,10-12]. Here, LRCs manifest themselves as power laws $P(f) \propto f^{-\beta}$, with $0.5 \lesssim \beta \lesssim 1.5$, in the power spectral density (PSD) of the $e_i$. In contrast, if the deviations $e_i$ were independent, one would expect $\beta = 0$. For unpaced tapping, i.e. tasks performed without a metronome, LRCs were recovered for tempo fluctuations, i.e. the PSD of the inter-beat interval (IBI) time series showed power laws $P(f)$ with $0.5 \lesssim \beta \lesssim 1.5$ [7,11,15-20]. Hennig and colleagues extended this framework to more complex rhythms (with metronome), but still in a laboratory setting. They provided evidence for LRCs of microtiming deviations, consistent with those of simple finger tapping [7]. More recently, LRCs were identified for drumming in a single pop song [5]. Together, these results may suggest that both microtiming deviations from beats and tempo fluctuations show LRCs. Detailed analyses are required to investigate this hypothesis, in particular with respect to the precise scaling properties (i.e. $\beta$) and their dependence on genres.
Differences in scaling may occur, as the cognitive involvement clearly differs between simple tapping tasks versus the flow experienced when making music together.
In our present study, we carry the analysis of human beat performance from the laboratory to real-world conditions of musical performances with all its complexity. To this end, we compiled beat onset time series from over 100 music recordings. To estimate the beat onset for each recording with millisecond precision, we devised a semi-automated beat extraction workflow. The resulting IBI time series allowed us to investigate both, unpaced and paced recordings, and to compare their scaling properties to those from finger tapping. Making use of our large dataset, we extended our analysis to investigate different genres, jazz and rock/pop, to elucidate how genre-dependence manifests itself in the beat structure.
Based on the millisecond precise beat time series, we could identify signatures of two processes, a clock and a motor process. Both processes influence the beat microtiming and showed similar long-range correlations. However, the motor process revealed stronger timing fluctuations within a measure for jazz compared to rock/pop. On the one hand our results point to general dynamics of microtiming fluctuations across musical genres on long time scales, reflecting the temporal organization of musical pieces. On the other hand the stronger fluctuations on fast time scales in jazz music might be attributed to the higher degree of freedom as compared to rock/pop.

Millisecond-precise beat extraction
Human rhythmic performance can be precise down to the scale of several milliseconds [8,9]. Therefore, our analyses required a millisecond-precise, consistent estimation of beat onsets. As this precision is not reached by any of the currently available methods, we developed a specialized semi-automated workflow.
A conceptual challenge in beat detection of original performances is that the beat is not uniquely defined. We approximated the beat by cymbal onsets, because drummers provide a rhythmic foundation, because cymbal onsets can be well separated from other instrument onsets, and because the short attack times allow for millisecond-precise onset detection. This precise onset detection is crucial for the subsequent systematic and reproducible analyses of large datasets.
In the following, we sketch the semi-automated workflow for beat extraction (see Fig 1). More details are given in the methods section. (1) The percussion-dominated channel is selected. (2) The frequency range in which the cymbal dominates is isolated. (3,4) Using differentiation, putative cymbal events are identified. (5) Of those, the cymbal onsets that form a regular beat sequence are combined into a beat-onset time series. This step excludes cymbal onsets that are not on the regular beat. (6) To improve the temporal precision of the extracted beat onsets, the precise onset time is estimated on the rising slope of the cymbal beat. This workflow allowed us to acquire beat time series from more than 100 recordings, each comprising about 600 beat onsets. All songs we analyzed are listed in S1-S3 Tables.

Human beat performance in music
We analyzed recordings played with or without metronome. For those played with a metronome (paced recordings), we consistently found a power law for the power spectral density (PSD) of the inter-beat intervals (IBIs) (Fig 2C, sketch in Fig 3). The exponents $\beta_M$ of the motor or microtiming deviations varied across recordings, but consistently indicated long-range correlations (LRCs). Their median was $\bar{\beta}_M = -0.87(0.43)$ (where the standard deviation is given in parentheses). $\bar{\beta}_M$ is negative because the IBI time series, compared to the deviations from the metronome, represents a differentiated signal (see below). $\bar{\beta}_M$ differed significantly from the value expected under independence ($\beta_M^{\mathrm{ind}} = -2$, $p < 10^{-30}$, where significance was obtained by analytical calculation of the bootstrap distribution; Fig 4A). Qualitatively, these results are consistent with those for simple finger tapping tasks, indicating that a similar process underlies beat generation in simple tapping tasks as well as in music.
IBI time series from unpaced music recordings showed characteristic V-shapes for the PSD (Fig 2A and 2B). Such V-shapes can be generated by the superposition of two stochastic processes, each of them contributing to the PSD. In analogy to finger-tapping experiments, we interpret the two processes as a "clock process" C governing temporal interval estimation, and a "motor process" M governing the motor execution of a planned interval (Fig 3A) [13,15,16]. In this general framework, an IBI $I_i$ is generated by a clock estimate $C_i$ and motor deviations $M_i$, which represent the microtiming deviations from the intended clock interval $C_i$ [13,15,16,18]:
$$I_i = C_i + M_i - M_{i-1}.$$
The PSD of the intervals I is thus generated by the PSDs of the two stochastic processes, C and M. The clock process contributes with a power law $P(f) \propto f^{-\beta_C}$, with $0.5 \lesssim \beta_C \lesssim 1.5$ for long-range correlations. For an uncorrelated process one expects $\beta_C^{\mathrm{ind}} = 0$ (Fig 3A and 3B). The motor process M enters I as a difference, and hence contributes to the PSD with $-1.5 \lesssim \beta_M \lesssim -0.5$ for long-range correlations, whereas $\beta_M^{\mathrm{ind}} = -2$ would reflect an uncorrelated process (Fig 3A and 3B). As the clock and motor processes contribute to I with exponents of opposite sign, $\beta_C > 0$ and $\beta_M < 0$, respectively, C dominates the PSD at low frequencies, whereas M dominates at high frequencies. This generates the characteristic V-shape and allows both scaling exponents to be estimated from the PSD of the IBIs (Figs 2A, 2B and 3A). When the rhythm is performed with a metronome (paced), the dynamics of the clock process is strongly confined, and the motor process alone dominates the PSD over the entire frequency range, i.e. one observes a single power-law regime in the PSD of the IBI time series, as reported above (Fig 2C) [7,15].
Fig 4. A. Neither the clock nor the motor process is random but clearly long-range persistent, i.e. $\beta_C > 0$, $-\beta_M < 2$. *** denotes $p \ll 10^{-3}$ (significance obtained by bootstrapping). B. Genre-dependence of the scaling exponents. The motor process showed significantly stronger long-range persistence in rock/pop (R) than in jazz (J). The box plot depicts the median in red, boxes at the first and third quartile, whiskers at 1.5 · IQR (interquartile range), and circles represent outliers. ** denotes p = 0.001, and n.s. denotes not significant. https://doi.org/10.1371/journal.pone.0186361.g004
We systematically quantified the scaling exponents $\beta_C$ and $\beta_M$ for unpaced recordings (Figs 4 and 5). The clock process showed LRCs with $\bar{\beta}_C = 0.54(0.38)$. It differed significantly from an independent process, which would be characterized by $\beta_C^{\mathrm{ind}} = 0$ ($p = 10^{-29}$, bootstrap). Our results indicate that tempo fluctuations across the entire recording do not occur independently, but ultimately are related to fluctuations at any other time. The motor process contributed with $\beta_M \approx -1$ ($\bar{\beta}_M = -1.09(0.55)$), and differed significantly from an independent process as well ($\beta_M^{\mathrm{ind}} = -2$, $p = 10^{-29}$, bootstrap). These results can be interpreted as follows: As the local tempo, governed by the clock process, needs to be maintained, any deviation from the local clock or metronome shortens one interval and at the same time lengthens the adjacent one, resulting in anti-correlations in I and negative values of $\beta_M$. Last, the turnover between the clock- and motor-dominated regimes was generally at about $\log_2 f_V \approx -3$ ($\log_2 \bar{f}_V = -2.98(0.98)$). That is, the motor process dominated the PSD only on time scales of up to about $2^3 = 8$ beats, i.e. a few measures, while for time scales spanning more than about 8 beats the clock process dominated. Interestingly, unpaced and paced recordings differed only slightly in their IBI distributions p(IBI) (Fig 2D-2F). Despite the absence of a metronome, unpaced recordings showed only slightly broader p(IBI) (σ = 13.1(3.8) ms and σ = 11.5(4.3) ms, respectively, p = 0.067, d = 0.393). Moreover, for both conditions, the microtiming deviations showed similar scaling properties, i.e.
the $\beta_M$ did not differ between the conditions (p > 0.05). In contrast, the characteristic V-shape was clearly present for the PSD of I when recordings were played without a metronome, whereas those played with metronome showed a single power law, presumably because the metronome suppressed or replaced the clock process (Fig 2). This result supports the hypothesis of two independent processes, one of which is suppressed when beats are performed under pacing by a metronome. As many recordings are fairly short (about 3 minutes, median of 580 beats), the effect of spectral averaging was small and thus the PSDs were noisy. To obtain PSDs from longer time series, we recorded seven unpaced, genuine drum performances from a professional musician in a studio setup, lasting 20 to 30 minutes each and comprising 3189(612) beats. For these long time series, the PSDs were very clear due to better spectral averaging (Fig 2A). The parameters obtained from these PSDs were consistent with those of the short musical recordings analyzed above: $\bar{\beta}_C = 0.77(0.15)$, $\bar{\beta}_M = -1.11(0.19)$, and $\log_2 \bar{f}_V = -3.77(0.26)$. As the short and long recordings did not differ significantly in any of the parameters, we merged both for the analysis of genre-dependence in the following sections. Note that we obtained the same results when we considered the short recordings alone, and the same as a trend for the seven long recordings, which alone, however, would not be numerous enough to reach significance.
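The two-process picture above can be illustrated with a toy simulation. The following is a minimal sketch, not the analysis code of this study: it assumes the additive form I_i = C_i + M_i - M_{i-1}, uses synthetic 1/f noise for the clock and, for simplicity, uncorrelated noise for the motor process. All function names (`power_law_noise`, `welch_psd`, `loglog_slope`) and parameter values are hypothetical. In the sign convention used here, a log-log slope s of the PSD corresponds to an exponent β = -s.

```python
import numpy as np

def power_law_noise(n, beta, rng):
    """Gaussian noise with PSD ~ f^(-beta), made by shaping white noise in the Fourier domain."""
    freqs = np.fft.rfftfreq(n)
    spectrum = rng.normal(size=freqs.size) + 1j * rng.normal(size=freqs.size)
    scale = np.zeros_like(freqs)
    scale[1:] = freqs[1:] ** (-beta / 2.0)      # amplitude ~ f^(-beta/2); f = 0 is dropped
    x = np.fft.irfft(spectrum * scale, n=n)
    return x / x.std()

def welch_psd(x, n_win=256):
    """Averaged periodogram with Hann windows and 50% overlap (unnormalised)."""
    window = np.hanning(n_win)
    starts = range(0, len(x) - n_win + 1, n_win // 2)
    segs = np.array([x[s:s + n_win] for s in starts])
    segs = segs - segs.mean(axis=1, keepdims=True)
    power = np.abs(np.fft.rfft(segs * window, axis=1)) ** 2
    return np.fft.rfftfreq(n_win)[1:], power.mean(axis=0)[1:]   # skip f = 0

def loglog_slope(f, p, mask):
    """Slope of log2(p) versus log2(f) on the selected frequency bins."""
    return np.polyfit(np.log2(f[mask]), np.log2(p[mask]), 1)[0]

rng = np.random.default_rng(0)
n = 2 ** 14
clock = power_law_noise(n, beta=1.0, rng=rng)   # C_i: long-range correlated (assumed beta = 1)
motor = rng.normal(size=n + 1)                  # M_i: uncorrelated in this toy example
ibi_unpaced = clock + np.diff(motor)            # I_i = C_i + M_i - M_{i-1}
ibi_paced = np.diff(motor)                      # clock suppressed, motor term only

f, p_unpaced = welch_psd(ibi_unpaced)
_, p_paced = welch_psd(ibi_paced)
slope_low = loglog_slope(f, p_unpaced, f < 0.03)    # clock-dominated side of the V
slope_high = loglog_slope(f, p_unpaced, f > 0.25)   # motor-dominated side of the V
slope_paced = loglog_slope(f, p_paced, f < 0.1)     # uncorrelated motor: slope near +2
f_min = f[np.argmin(p_unpaced)]                     # rough location of the turnover
```

With these settings the unpaced spectrum falls at low frequencies and rises at high frequencies, while the paced spectrum of an uncorrelated motor process rises with a slope close to +2, i.e. β close to -2. A measured exponent near -1 is therefore clearly flatter than this independence baseline.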

Genre-dependence
Do the scaling properties of the clock and motor process depend on the musical genre or are they a general feature of music? With the highly precise IBI time series from jazz and rock/pop music we were able to test for genre-dependence.
Most interestingly, we found that jazz recordings showed smaller $\beta_M$ for the unpaced songs than rock/pop recordings (Fig 4B, p = 0.001, d = 0.509, restricted permutation test (RPT); for details on the statistical tests see methods). More precisely, for jazz recordings we found $\bar{\beta}_M = -1.23(0.47)$, and for rock/pop $\bar{\beta}_M = -0.96(0.56)$. The same trend was observed for the paced songs. In contrast, jazz and rock did not differ in the clock exponent $\beta_C$ (Fig 4B).
These results indicate that in jazz, musicians make more use of microtiming deviations on very short time scales, i.e. they introduce stronger deviations from the local tempo. In rock/pop, musicians play with a more regular beat on these short time scales. On longer time scales, where the clock process dominates, the tempo variations do not differ between jazz and rock/pop, indicating that the overall musical structure from short motifs to long blocks does not differ between these genres. We hypothesize that other genres might show similar LRCs for the clock process as well.
Basic beat variability. In addition to the very prominent genre-dependence in the motor process, we found that on average the beat in jazz was slightly slower than in rock. In detail, the median IBI in jazz (rock/pop) was 330 ms (288 ms) for the unpaced recordings, and 400 ms (282 ms) for the paced recordings (p = 0.013, d = 0.574, RPT). More interestingly, the variability (i.e. the standard deviation of I) was higher for jazz than for rock; it was 14.3 ms (11.9 ms) for the unpaced performances, and 14.6 ms (9.2 ms) for the paced performances (p = 0.001, d = 0.745, RPT). It is to be expected that faster performances show less variability (smaller SD). To account for this, we compared the tempo-normalized variability of the IBI time series, i.e. the Fano factor $F = \sigma^2 / \bar{I}$, where $\sigma$ denotes the standard deviation and $\bar{I}$ the median of I. Consistent with the differences above, the Fano factor was higher in jazz than in rock across the paced and unpaced recordings (p = 0.016, d = 0.294, RPT, Bonferroni corrected for multiple comparisons), although the effect size was smaller. When analyzing the paced and unpaced recordings separately, both showed the same trend (p = 0.049, d = 1.211 for paced, p = 0.132, d = 0.182 for unpaced, Bonferroni corrected for multiple comparisons); however, the effect size was more pronounced for the paced songs. Together, these results suggest that jazz makes more use of temporal variability, especially when recordings are played with a metronome (paced), i.e. when only the motor, but not the clock process can be used as an expressive component of music.
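The variability measures in this paragraph are simple to compute. The sketch below assumes IBIs given in milliseconds with missing beats marked as NaN; the function name `ibi_variability` and the use of the sample standard deviation (ddof=1) are illustrative assumptions rather than details taken from the original analysis.

```python
import numpy as np

def ibi_variability(ibi_ms):
    """Median IBI, standard deviation and Fano factor F = sigma^2 / median(I)."""
    ibi = np.asarray(ibi_ms, dtype=float)
    ibi = ibi[~np.isnan(ibi)]                  # ignore missing intervals
    sigma = ibi.std(ddof=1)                    # assumption: sample standard deviation
    median = np.median(ibi)
    return {"median_ms": median, "sd_ms": sigma, "fano_ms": sigma ** 2 / median}

# Made-up example series, not data from the study
example = ibi_variability([300.0, 310.0, 290.0, 300.0, np.nan, 305.0, 295.0])
```

Because the variance is divided by the median interval, the Fano factor carries units of milliseconds and compensates for the expected decrease of the standard deviation at faster tempi.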

Discussion
Interestingly, our results for rhythm generation in music revealed evidence for the same two underlying processes as inferred for simple finger tapping tasks: a clock and a motor process, both of them long-range correlated. In both settings, music and finger tapping, the clock process disappeared when a metronome was used, and the remaining motor process showed a single power law with slope around unity. However, although the LRCs in both settings have similar characteristics, their origin may be very different: When a piece of music is performed, it has structure on all scales, from motifs, phrases and themes to verses and movements. This structure is reflected in the tempo and is likely to underlie the observed long-range correlations. Such structure is absent in finger tapping tasks. Those tasks are somewhat dull for the subjects, and hence it may well be that their mind is wandering during tapping. As a consequence, concentration may wax and wane, presumably on many different scales as well, thereby generating LRCs. Studies relating neural activity to motor precision and perception hint in that direction [12,21]. Smit et al., for example, showed a clear correlation between $\beta_C$ and the scaling exponent derived from neural alpha oscillations [12]. Hence, although the signatures of the beat time series are similar for music and finger tapping, their origin may differ vastly: one lying in the multi-scale structure of a musical composition, which aims at keeping us captivated, the other making our mind wander owing to the dullness of tapping a simple beat for minutes in a row. Such waxing and waning of concentration and performance can be tested in future studies by simultaneously measuring markers of attention in brain activity or pupil diameter and relating this to the microtiming deviations in finger-tapping tasks; in songs, the microtiming deviations might be related to the structure of each song.
For beat generation in music as well as in simple tapping tasks, the generative models still remain unknown. In past studies, LRCs in finger tapping were attributed either generically to models for 1/f noise, e.g. long-range correlated (critical) brain dynamics or the superposition of processes on different time scales [12,22-28]. Alternatively, they were explained by more mechanistic models, such as the linear phase correction model [29], the shifting strategy model [30], or the hopping model [31], as summarized by Torre et al. [15].
We found both the clock and the motor process to show LRCs, characterized by $\beta_C \approx 0.6$ and $\beta_M \approx -1$, respectively. Are these results for beat in music consistent with those found for finger tapping? Early studies on finger tapping assumed that both the motor and the clock process showed uncorrelated Gaussian noise ($\beta_C \approx 0$, $\beta_M \approx -2$), but never tested that explicitly by evaluating e.g. the PSD [13,18]. First analyses of the PSD showed $0.9 < \beta_C < 1.2$ for the clock process [16,19]. The exponent $\beta_M$ was not fitted but assumed to reflect an uncorrelated process ($\beta_M \approx -2$), although the spectra were clearly flatter, hinting at LRCs in the motor process as well. For the clock process, $\beta_C \approx 1$ was found consistently in various simple finger tapping tasks [12,16-19]. When two subjects tapped in synchrony, $\beta_C$ was a bit smaller ($\beta_C \approx 0.85$), and in an exemplary pop song, $\beta_C$ was found to be even smaller ($\beta_C \approx 0.56$), which is very similar to our results on the over 100 music recordings. Overall, our study, together with the past ones, suggests that finger-tapping tasks have a larger $\beta_C$ than beat generation in music. This indicates stronger persistence of the tempo drifts in tapping compared to music pieces. The origin of this difference, though, remains unknown. It is conceivable that professional musicians are better trained at keeping a constant tempo, whereas the subjects in the tapping tasks typically did not have any training.
Regarding the motor process, results are very scarce for unpaced finger tapping. For paced finger tapping, the $\beta_M$ were typically estimated in a different manner: instead of the IBI time series, the deviations from the metronome, i.e. the error time series, were used. For those, the exponent $\beta'_M$ is expected to differ by 2, $\beta'_M = \beta_M + 2$. We found $\beta_M \approx -1$, both for paced and unpaced music recordings. For tapping, earlier studies reported $\beta_M \approx -1.5$ or $\beta_M \approx -1.3$ [7,10,11,32]. Hence, fluctuations on short time scales are stronger for tapping than for music beat generation. Whether these differences are attributable to the different cognitive involvement, or whether they reflect differences between lay people's tapping performance and professional musicians' beat generation, remains an open question.

Datasets
All datasets are available in the supplementary material S1 Dataset.
Dataset 1: Real-world musical performances.
We analyzed in total 100 recordings (47 jazz and 53 rock/pop), listed in S1-S3 Tables. Of these recordings, 9 jazz and 13 rock/pop recordings were played with metronome ("paced recordings"). The recordings are denoted by J** and R**, with ** denoting a consecutive (arbitrary) number of the jazz or rock/pop song, respectively. All recordings satisfied the following criteria:
1. The cymbals were clearly audible even when other high-pitched sounds were interfering.
2. The cymbals' main rhythmic function was pace-keeping, i.e. we discarded recordings where the cymbals occurred only occasionally or were used in a mainly expressive way, where the cymbal patterns changed frequently, or where the drum-play was virtuoso in general.
3. The audio quality for MP3-encoded recordings was at least 320 kBit/s.

Dataset 2: Experimental performances.
To obtain a complementary dataset, we asked a professional drummer to play jazz and rock/pop music as genuinely as possible on his own and in absence of a metronome. The drummer gave informed, verbal consent that we use the recording for timing analysis.
The drummer was free in all musical decisions like tempo, rhythms and musical structure but was asked to avoid the crash cymbal and not to interrupt his performance. Additionally he was aware of the fact that we focused on the cymbals in our analysis and thus paid attention to use them consistently.
The drummer was a professional musician with a conservatory degree in drumming, and long standing experience with live and studio jazz performances. We obtained 7 drum performances in total (3 jazz and 4 rock/pop) with a length of 20-30 minutes each.
A setup with six drum microphones (Shure PGA Drumkit 6) was used to record the ride cymbal, hi-hat, the toms, the snare and the bass drum separately. In order to reduce cross-talk, we aligned each of the two overhead microphones (Shure PGA 81) to point towards the hi-hat and ride cymbal surface within a close distance (≈5 cm) and away from the other drums.
For all these recordings and performances, we estimated the beat time series as described below.

Time series extraction
In the following we detail our semi-automated workflow for millisecond-precise, reproducible beat extraction. It consists of six transformation and refinement steps that yield high-precision cymbal beat time series from the initial audio signal.
(1) From the stereo audio signal, recorded at a sampling frequency of 44.1 kHz, the percussion-dominant channel was isolated (Fig 1(1)). It is denoted by $\xi(t')$, where $t'$ is the discrete time, sampled at intervals of about 0.02 ms.
(2) To detect the onsets of the cymbal, we used a time-frequency representation of $\xi(t')$. More specifically, we calculated the short-time Fourier transform (STFT) of $\xi(t')$, using in the time domain a window size of 128 samples (≈3 ms) and a step size of 8 samples (≈0.2 ms), with each window smoothed by a Hann function.
In the frequency domain this results in 64 bands of $f_{\mathrm{Nyquist}}/64 \approx 345$ Hz. Hence for each time step $t$ (corresponding to 8 samples of $t'$) and frequency window $k$, we obtained the spectrogram $S(k, t)$. The cymbal was most prominent in the band from ≈15 kHz to ≈19 kHz. For higher frequencies, MP3 compression artifacts distorted the signal, and for lower frequencies, other instruments interfered. The precise values of the frequency band depended on the specific piece and were adjusted if necessary.
(3) For every frequency band $k$, a rise in power, potentially indicating a cymbal onset, was determined by subtracting the average power of the past 9.3 ms (i.e. 51 time steps) from the current sample:
$$S'(k, t) = S(k, t) - \frac{1}{51} \sum_{j=1}^{51} S(k, t - j).$$
The onset detection function $y(t)$ is the average of $S'(k, t)$ over the cymbal-dominated frequency bands $k$ from ≈15 kHz to ≈19 kHz.
(4) To extract putative cymbal events $t_i^{\mathrm{ev}}$, we applied a simple peak-picking algorithm: we first applied a threshold $y_{\mathrm{thresh}} = 0.07 \cdot \max\{y(t)\}$ and then discarded all but the maximal value within any time window of size $T_{\mathrm{block}} = 70$ ms. This resulted in a minimum interval of $2 \cdot T_{\mathrm{block}}$ between local maxima. Occasionally the threshold had to be manually lowered.
(5) To exclude all cymbal events that are not part of the beat, we first estimated the beat period $T$. To this end, we calculated the intervals $\delta t$ between each putative cymbal event $t_i^{\mathrm{ev}}$ and the $m = 2$ following cymbal events. The beat period $T$ then manifested as a strong peak in a histogram of the $\delta t$ within a range [0 ms; 1000 ms].
The local rhythmic structure of the song was obtained by plotting the $\delta t$ versus the corresponding $t_i^{\mathrm{ev}}$. In this representation, the regions in the song with, e.g., fainter or missing cymbals resulted in sparsely populated regions. For such pieces, the procedure described above was repeated with a lower threshold. If this led to many false detections, we increased the default value of $m$ to 3 or 4.
Having estimated the beat period $T$, we grouped the cymbal events into labeled sequences that were locally in agreement with $T$: Two cymbal events $t_i^{\mathrm{ev}}$ and $t_j^{\mathrm{ev}}$ were assigned the same label if their time difference was within $T \pm \tau$. $\tau$ was set to 35 ms and adjusted if necessary. These labeled sequences were manually assembled to full beat time series $\tilde{t}_i$. Only sequences of length 256 or longer were used for further analysis, apart from 3 slightly shorter time series (see below).
The steps described up to this point needed about one minute of quality checking per recording. They were optimized to quickly validate whether a sufficiently long sequence of beats could be extracted reliably. In the following, we describe how the millisecond-precise onsets were extracted for these 107 recordings.
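Step (5) above can be sketched as follows. This is a hedged illustration on synthetic event times, not the original implementation: the histogram bin width of 10 ms, the automatic choice of the highest bin, and the function names `estimate_beat_period` and `label_sequences` are assumptions. In the workflow described here, the sequences were inspected and assembled manually.

```python
import numpy as np

def estimate_beat_period(event_times_ms, m=2, bin_ms=10.0, max_ms=1000.0):
    """Beat period as the centre of the highest bin in the histogram of
    intervals between each event and its m following events."""
    t = np.sort(np.asarray(event_times_ms, dtype=float))
    deltas = np.concatenate([t[k:] - t[:-k] for k in range(1, m + 1)])
    edges = np.arange(0.0, max_ms + bin_ms, bin_ms)
    counts, edges = np.histogram(deltas, bins=edges)
    peak = np.argmax(counts)
    return 0.5 * (edges[peak] + edges[peak + 1])

def label_sequences(event_times_ms, period_ms, tau_ms=35.0):
    """Chain events whose spacing agrees with the beat period within +/- tau."""
    t = np.sort(np.asarray(event_times_ms, dtype=float))
    labels = np.full(t.size, -1)
    next_label = 0
    for i in range(t.size):
        if labels[i] < 0:                       # start a new sequence
            labels[i] = next_label
            next_label += 1
        gaps = np.abs(t[i + 1:] - t[i] - period_ms)
        if gaps.size and gaps.min() <= tau_ms:  # successor close to t[i] + T
            j = i + 1 + np.argmin(gaps)
            if labels[j] < 0:
                labels[j] = labels[i]
    return labels

# Synthetic events: 200 jittered beats at 325 ms plus three off-beat strokes
rng = np.random.default_rng(2)
beats = 325.0 * np.arange(200) + rng.normal(scale=3.0, size=200)
ghosts = 325.0 * np.array([10, 50, 90]) + 160.0
events = np.concatenate([beats, ghosts])

period = estimate_beat_period(events)
labels = label_sequences(events, period)        # labels refer to the sorted events
```

In this toy example all regular beats end up in one labeled sequence, while each off-beat stroke receives its own label and can be dropped.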
(6) First, all $\tilde{t}_i$ of the putative beat time series were checked for validity and corrected manually if necessary. Then we determined with millisecond precision the physical onset time $t_i$ as the time where the onset detection function $y(t)$ (see step 3) first rose above baseline (the blue line in panel (6) of Fig 1 indicates the estimated physical onset time $t_i$). $t_i$ is expected to lie at most 50 ms before the corresponding $\tilde{t}_i$. Hence, starting at $\tilde{t}_i - 50$ ms, we scanned the entire window to find the last $t$ for which $y(t)$ exceeded its own preceding 5 ms baseline, i.e. for which $y(t) > \max\{y(t - 5\,\mathrm{ms}), \ldots, y(t - 1\ \mathrm{sample})\}$ holds. In a few pieces, e.g. with prominent rim-shots, which result in multiple closely spaced local maxima, the most reliable type of maximum was used by defining a target interval in which $y(t_i)$ was expected to lie. Typically, the correct onset times were unambiguously visible in the spectrograms. These automatically detected onset times $t_i$ were all checked audio-visually and adjusted if necessary.
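Steps (2) to (4) of the workflow can be sketched on a synthetic signal as follows. This is a minimal illustration under stated assumptions, not the original code: the spectrogram is taken as squared STFT magnitude, frame times refer to window centres, and the peak picking is a greedy variant that keeps the strongest candidates and suppresses any candidate closer than T_block, which may differ from the exact windowing used in the study. The function names and the test signal (decaying noise bursts standing in for cymbal strokes) are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def onset_detection_function(x, fs=44100, n_win=128, hop=8,
                             band_hz=(15e3, 19e3), n_past=51):
    """Return frame times (ms) and y(t), the band-averaged rise in spectral power."""
    frames = sliding_window_view(x, n_win)[::hop] * np.hanning(n_win)
    S = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # spectrogram, stored as S[t, k]
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    # c[t] holds the sum of S over all frames before t, so differences give past sums
    c = np.vstack([np.zeros((1, S.shape[1])), np.cumsum(S, axis=0)])
    S_prime = np.zeros_like(S)                          # first n_past frames stay at zero
    S_prime[n_past:] = S[n_past:] - (c[n_past:-1] - c[:-n_past - 1]) / n_past
    y = S_prime[:, in_band].mean(axis=1)
    times_ms = (np.arange(y.size) * hop + n_win / 2) / fs * 1e3   # frame centres
    return times_ms, y

def pick_events(times_ms, y, rel_thresh=0.07, t_block_ms=70.0):
    """Greedy peak picking: strongest candidates first, none closer than t_block_ms."""
    candidates = np.flatnonzero(y > rel_thresh * y.max())
    kept = []
    for i in candidates[np.argsort(y[candidates])[::-1]]:
        if all(abs(times_ms[i] - times_ms[j]) >= t_block_ms for j in kept):
            kept.append(i)
    return np.sort(times_ms[np.array(kept, dtype=int)])

# Synthetic test signal: three decaying noise bursts on a quiet background
rng = np.random.default_rng(1)
fs = 44100
x = 1e-3 * rng.normal(size=fs)                          # 1 s of background noise
true_onsets_ms = np.array([200.0, 500.0, 800.0])
for onset in true_onsets_ms:
    start = int(onset * fs / 1e3)
    n = fs - start
    x[start:] += rng.normal(size=n) * np.exp(-np.arange(n) / (0.020 * fs))

times_ms, y = onset_detection_function(x)
events_ms = pick_events(times_ms, y)
```

The events found this way mark the maximum of the detection function shortly after each stroke. They play the role of the putative events only; the refinement to the physical onset on the rising slope, described in step (6), is not part of this sketch.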

Time series analysis
We calculated the power spectral density (PSD) of the inter-beat interval (IBI) time series, i.e. the temporal difference between two successive beat onsets, $d_i = t_{i+1} - t_i$. Here, any missing $t_i$ was handled as NaN. The IBIs were detrended with a polynomial of degree 3. Afterwards, the NaNs were discarded, because this procedure leads to a better estimate of the PSD. Time series with fewer than 256 data points were centered and zero-padded; this applied to three out of the 107 time series (R16, R32, R48), of which R48 was the shortest with N = 194.
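The preprocessing described above can be sketched as follows. The sketch assumes onset times in milliseconds with missing beats stored as NaN, and it interprets "centered" as mean subtraction before zero-padding at the end; both points, as well as the function name `preprocess_ibi`, are assumptions made for illustration.

```python
import numpy as np

def preprocess_ibi(onsets_ms, degree=3, n_min=256):
    """IBIs from onsets, cubic detrending on the original beat index, NaNs dropped."""
    t = np.asarray(onsets_ms, dtype=float)
    ibi = np.diff(t)                                   # d_i = t_{i+1} - t_i, NaN next to a missing beat
    idx = np.arange(ibi.size, dtype=float)
    x = (idx - idx.mean()) / idx.std()                 # rescaled index for a well-conditioned fit
    ok = ~np.isnan(ibi)
    coeffs = np.polyfit(x[ok], ibi[ok], degree)
    detrended = ibi[ok] - np.polyval(coeffs, x[ok])    # NaNs are discarded after detrending
    if detrended.size < n_min:                         # short series: centre and zero-pad
        detrended = detrended - detrended.mean()
        detrended = np.pad(detrended, (0, n_min - detrended.size))
    return detrended

# Made-up example: 400 onsets with a slow cubic tempo drift and one missing beat
i = np.arange(399)
true_ibi = 300.0 + 1e-6 * (i - 200.0) ** 3
onsets = np.concatenate([[0.0], np.cumsum(true_ibi)])
onsets[50] = np.nan
clean = preprocess_ibi(onsets)          # 397 valid intervals, drift removed
short = preprocess_ibi(onsets[:200])    # fewer than 256 intervals, so zero-padded
```

Keeping the NaNs until after the fit preserves the position of each interval on the beat axis, so the polynomial trend is not distorted by gaps.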
To estimate the exponents, we applied the standard Welch PSD method introduced in [33] with window size $N_{\mathrm{win}} = 256$. To suppress spectral leakage, each segment was multiplied with a Hann window
$$w(n) = \sin^2\left(\frac{\pi n}{N_{\mathrm{win}} - 1}\right),$$
where $n$ denotes the index $n = 0, 1, \ldots, N_{\mathrm{win}} - 1$. The overlap was set to $N_{\mathrm{overlap}} \approx N_{\mathrm{win}}/2$, i.e. first the number of windows fitting in the time series of length $N$ was calculated, and the overlap was then adjusted to cover the whole time series instead of the next-smaller multiple of $N_{\mathrm{win}}/2$.
For unpaced performances, we fitted a V-shaped PSD (see Fig 3) using a superposition of two power laws with opposite-signed scaling parameters $\beta_C$ and $\beta_M$:
$$P(f) = P_C f^{-\beta_C} + P_M f^{-\beta_M},$$
where $\beta_C$ putatively quantifies how the clock process is correlated over time, while $\beta_M$ quantifies how the "microtiming deviation" from the beat or "motor process" is (anti)correlated over time. $P_C$ and $P_M$ describe the power of the clock and motor components, respectively. As the exponents of the clock and the motor components have opposite sign, each of them dominates one side of the spectrum, and the turnover frequency, i.e. the resulting minimum of $P(f)$, is denoted by $f_V$ (Fig 3A). $f_V$ is a function of the exponents and power of the motor and clock components:
$$f_V = \left(\frac{\beta_C P_C}{-\beta_M P_M}\right)^{1/(\beta_C - \beta_M)}.$$
When fitting all the free parameters $\theta = (f_V, \beta_C, \beta_M)$ of the spectrum, we aimed at weighting the motor and clock contributions equally. To this end, we assumed a turnover frequency $f_V^*$ and weighted the fit residuals of both sides of $f_V^*$ equally. This results in a weighting function
$$w(f) = \begin{cases} 1/N_- & f < f_V^* \\ 1/N_+ & f > f_V^* \end{cases}$$
where $N_-$ ($N_+$) denotes the number of frequency bins $f$ that are smaller (larger) than $f_V^*$. The residual between model and data was minimized using the Broyden-Fletcher-Goldfarb-Shanno algorithm [34]. To initialize the minimization, for each trial each side of the log-transformed spectrum was approximated by a linear relationship, using the Theil-Sen method [35]. As $f_V^*$ is not known a priori, we scanned $\log_2 f_V^*$ equidistantly on $\log_2 f_V^* \in [-6, -2]$.
For each time series we thus obtained ten parameter sets $\theta$. Note that $f_V^*$ was only used to define the weighting function $w(f)$, while $f_V$ proper is a free parameter. We still used different $f_V^*$, because it allowed for an estimate of the variability of the parameters $\theta$.
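The ingredients of the V-shaped fit can be written down compactly. The sketch below assumes that the model is the sum of two power laws, P(f) = P_C f^(-β_C) + P_M f^(-β_M), that the turnover frequency follows from setting the derivative of this sum to zero, and that residuals are taken between log-transformed spectra with weights 1/N on each side of the assumed turnover. The parametrisation by (P_C, β_C, P_M, β_M), the log2 residual and all function names are illustrative assumptions; the minimisation itself (BFGS with Theil-Sen initialisation) is not reproduced here.

```python
import numpy as np

def v_shaped_psd(f, p_c, beta_c, p_m, beta_m):
    """Superposition of a clock (beta_c > 0) and a motor (beta_m < 0) power law."""
    return p_c * f ** (-beta_c) + p_m * f ** (-beta_m)

def turnover_frequency(p_c, beta_c, p_m, beta_m):
    """Minimum of v_shaped_psd, obtained from dP/df = 0."""
    return (beta_c * p_c / (-beta_m * p_m)) ** (1.0 / (beta_c - beta_m))

def side_weights(f, f_star):
    """Weights that give both sides of the assumed turnover f_star equal total weight."""
    low = f < f_star
    return np.where(low, 1.0 / low.sum(), 1.0 / (~low).sum())

def weighted_cost(theta, f, psd, f_star):
    """Weighted squared residual between log2 data and log2 model."""
    resid = np.log2(psd) - np.log2(v_shaped_psd(f, *theta))
    return np.sum(side_weights(f, f_star) * resid ** 2)

f = np.arange(1, 129) / 256.0                     # Welch frequencies for N_win = 256
theta_true = (0.05, 0.6, 4.0, -1.0)               # (P_C, beta_C, P_M, beta_M), made-up values
psd = v_shaped_psd(f, *theta_true)
f_v = turnover_frequency(*theta_true)

# Numerical check of the turnover formula on a fine frequency grid
fine = np.logspace(-3, np.log10(0.5), 20001)
f_v_numeric = fine[np.argmin(v_shaped_psd(fine, *theta_true))]

# Ten assumed turnover frequencies, equidistant in log2 on [-6, -2]
f_star_grid = 2.0 ** np.linspace(-6, -2, 10)
theta_off = (0.05, 0.9, 4.0, -1.0)                # deliberately wrong clock exponent
costs_true = [weighted_cost(theta_true, f, psd, fs) for fs in f_star_grid]
costs_off = [weighted_cost(theta_off, f, psd, fs) for fs in f_star_grid]
```

Each entry of `f_star_grid` defines one weighting of the residuals; minimising `weighted_cost` once per entry would give the ten parameter sets whose spread indicates the variability of the fit.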
In the case of the paced (metronome-guided) recordings we expected the clock component to be missing. As a consequence, the PSD approximates a power law with $P(f) \sim f^{-\beta_M}$. Thus, when fitting with the procedure above, the parameters $\beta_M$ and $\beta_C$ are both expected to be negative. This condition was used as a test for paced versus unpaced pieces. The $\beta_M$ of the paced pieces can then be estimated either by fitting a single power law, or by fitting the V-shape as above.

Permutation test and effect size
To test for the presence of different effects for jazz (J) versus rock/pop (R), we applied a median-based two-sided permutation test on the estimated parameters $\beta_C$, $\beta_M$ and $\log_2 f_V$ obtained by the V-shaped fit. To this end, the median values $\bar{x} = \{\bar{\beta}_C, \bar{\beta}_M, \log_2 \bar{f}_V\}$ for jazz and rock/pop were compared by $\Delta x = \bar{x}_J - \bar{x}_R$ for each parameter. In addition to the p-values from the permutation test, we report the effect size for the differences between jazz and rock/pop recordings. We used a modified Cohen's d,
$$d = \frac{\bar{x}_J - \bar{x}_R}{\sigma},$$
where $\bar{x}$ denotes the median of the respective parameter ($\beta_C$, $\beta_M$ or $\log_2 f_V$). The pooled standard deviation $\sigma$ was computed from the population sizes $n_J$, $n_R$ and standard deviations $\sigma_J$, $\sigma_R$ of the respective populations:
$$\sigma = \sqrt{\frac{(n_J - 1)\,\sigma_J^2 + (n_R - 1)\,\sigma_R^2}{n_J + n_R - 2}}.$$
Effect sizes are considered small for 0.2 < |d| < 0.5, medium for 0.5 < |d| < 0.8 and large for |d| > 0.8.
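A median-based two-sided permutation test and the modified Cohen's d can be sketched as follows. This is a plain permutation test that shuffles group labels freely; it does not implement the restrictions of the restricted permutation test (RPT) mentioned in the results, whose details are not given in this section. The pooled standard deviation uses the standard pooled form, which is assumed here, and the function names and example data are hypothetical.

```python
import numpy as np

def median_permutation_test(x_j, x_r, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group medians."""
    rng = np.random.default_rng(seed)
    x_j = np.asarray(x_j, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    observed = np.median(x_j) - np.median(x_r)
    pooled = np.concatenate([x_j, x_r])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)               # reassign group labels at random
        diff = np.median(perm[:x_j.size]) - np.median(perm[x_j.size:])
        count += abs(diff) >= abs(observed)
    return observed, (count + 1) / (n_perm + 1)      # add-one correction keeps p > 0

def modified_cohens_d(x_j, x_r):
    """Difference of medians divided by the pooled standard deviation."""
    n_j, n_r = len(x_j), len(x_r)
    s_j, s_r = np.std(x_j, ddof=1), np.std(x_r, ddof=1)
    pooled_sd = np.sqrt(((n_j - 1) * s_j ** 2 + (n_r - 1) * s_r ** 2) / (n_j + n_r - 2))
    return (np.median(x_j) - np.median(x_r)) / pooled_sd

# Synthetic groups with clearly different medians (not data from the study)
rng = np.random.default_rng(3)
group_j = rng.normal(0.0, 1.0, size=50)
group_r = rng.normal(3.0, 1.0, size=50)
delta, p_value = median_permutation_test(group_j, group_r)
d_example = modified_cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])   # medians 2 and 3, pooled SD 1
```

Using medians instead of means in both the test statistic and the effect size makes the comparison less sensitive to individual recordings with extreme exponents.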