Convergent evolution in a large cross-cultural database of musical scales

Scales, sets of discrete pitches that form the basis of melodies, are thought to be one of the most universal hallmarks of music. But we know relatively little about cross-cultural diversity of scales or how they evolved. To remedy this, we assemble a cross-cultural database (Database of Musical Scales: DaMuSc) of scale data, collected over the past century by various ethnomusicologists. Statistical analyses of the data highlight that certain intervals (e.g., the octave, fifth, second) are used frequently across cultures. Despite some diversity among scales, it is the similarities across societies which are most striking: step intervals are restricted to 100-400 cents; most scales are found close to equidistant 5- and 7-note scales. We discuss potential mechanisms of variation and selection in the evolution of scales, and how the assembled data may be used to examine the root causes of convergent evolution.


INTRODUCTION
Music, like language, can be described as a generative grammar consisting of basic building blocks, and rules on how to combine them. 1,2 In melodies, these basic units are usually specified by two quantities: frequency and duration. We generally refer to this basic pitch unit as a note, and a set of notes as a scale. Thus, as far as pitch is concerned, a scale is to a melody what an alphabet is to writing. Despite their centrality to music and apparent ubiquity, we know surprisingly little about scales. Most studies focus on scales from a limited number of musical traditions, 3,4 and the main statistical findings are that scales are non-equidistant, and have 7 or fewer notes. 5,6 There are many anecdotal reports that certain notes are commonly used, 3,4,[7][8][9][10] but this has only been quantitatively examined in one small cross-cultural sample. 11 Thus, we lack concrete understanding of the fundamental questions of why humans use scales, how diverse they are, or how they came to be that way. We suspect that this is simply due to a lack of suitable resources. Here, we address this problem by presenting and analyzing a data set of musical scales from many societies, extant and extinct, built upon a century of ethnomusicological enterprise.
Let us begin by clarifying a few key terms and ideas. We define a scale simply as a sequence of unique notes (Figure 1A). Notes are pitch categories described by a single pitch, but as pitch is continuous, notes are more realistically described as regions of semi-stable pitch centered around a representative (e.g., mean, median) frequency. 12,13 However, humans process relative frequency much better than absolute frequency, 14 so scales are practically specified by the intervals (frequency ratios) between notes. Thus, a scale is usually defined by the ratios made between all notes and the first note, called the tonic; we hereafter refer to these as scale notes. Intervals are determined by their frequency ratios, but since we perceive pitch logarithmically, 15,16 they are naturally measured in units of cents, obtained from the base-2 logarithm of the ratio of frequencies $f_1$ and $f_2$: $\mathrm{cents} = 1200 \times \log_2(f_1/f_2)$. In what follows, steps will describe the intervals between adjacent notes, and intervals can refer to the interval between any pair of notes.
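The definitions of cents, scale notes, and steps can be illustrated with a minimal sketch. The function names and the tonic frequency of 261.63 Hz (a common value for middle C) are assumptions made for illustration; the ratios are the standard 5-limit just-intonation major scale mentioned in the caption of Fig. 1.

```python
import math

def ratio_to_cents(f1: float, f2: float) -> float:
    """Interval between frequencies f1 and f2 in cents: 1200 * log2(f1 / f2)."""
    return 1200.0 * math.log2(f1 / f2)

def scale_notes_and_steps(freqs):
    """Given ascending note frequencies, return (scale notes, steps) in cents.

    Scale notes are intervals measured from the first note (the tonic);
    steps are intervals between adjacent notes.
    """
    notes = [ratio_to_cents(f, freqs[0]) for f in freqs]
    steps = [b - a for a, b in zip(notes, notes[1:])]
    return notes, steps

# 5-limit just-intonation major scale, as frequency ratios above the tonic.
ratios = [1, 9 / 8, 5 / 4, 4 / 3, 3 / 2, 5 / 3, 15 / 8, 2]
tonic_hz = 261.63  # assumed tonic frequency; any positive value gives the same cents
notes, steps = scale_notes_and_steps([tonic_hz * r for r in ratios])
```

Because cents depend only on frequency ratios, the choice of tonic frequency does not affect the result; the steps always sum to the 1,200 cents of the octave.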
We explicitly define an important sub-class of scales, octave scales, which span an octave (frequency ratio 2:1; 1,200 cents) and have circular (or helical) structure such that notes repeat at every octave 17 (e.g., Fig. 1A shows that the first and last notes (an octave apart) of such a scale have the same name, a property called octave equivalence).][9][10] However, this has been disputed, 13,16,[18][19][20] and statistical study of octave prevalence is still lacking. Octaves are indeed salient in the harmonic spectra. 21,22 Yet, experiments have indicated that octave equivalence is only a weak perceptual phenomenon, [23][24][25][26] that depends heavily on musical training, [27][28][29] and culture. 16 Nonetheless, several measurable phenomena related to harmonic tones may lead to preferential use of the octave: tonal fusion, 26,30,31 hearing in noise, 32 and memorability of complex tones. 33 For example, increased tonal fusion between octave intervals (likewise for fourths and fifths, which also undergo tonal fusion) may give rise to greater perception of synchrony when singing in parallel. 34
To address longstanding questions on scale diversity and evolution, we first document the creation of a database of scales. We then analyse the empirical scales to see which intervals are significantly frequent or infrequent, providing robust statistical evidence that the octave is prevalent among many societies. We examine the cross-cultural diversity of octave scales, finding that within-society variation can in some cases be as great as the total variation in scales; scales are surprisingly not that diverse. In particular, across all geographical regions step intervals are limited to 100-400 cents, and scales are clustered around 5- and 7-note equidistant scales (scales where step sizes are similar in size). As a result of these restrictions, humans tend to use only ∼1 % of possible octave scales, which strongly suggests some degree of convergent evolution due to shared biases (in addition to convergence via cultural diffusion). Finally, we discuss the potential mechanisms of change and selection of scales, discuss the challenges in understanding scale evolution, and propose credible future directions for studying this evolution.

FIG. 1. Basic notions. A: Illustration of relevant terms: Notes in a scale can be represented symbolically (e.g., letters for notes), or quantitatively. As examples, we show: the Western major scale in the key of C on a piano (top); and the corresponding intervals in cents and as frequency ratios (bottom; in 5-limit just intonation tuning). We show three types of intervals: a scale note, a step, and an interval that is neither a note nor a step. B: Venn diagram indicating how scale data can be classified.

SCALES DATABASE
Database curation. A total of 60 books, journals, and other ethnomusicological sources were found to have relevant data on scales (SI Table 1). 13, We note one previous attempt to create a database of scales from the ethnomusicological literature; 94 however, there are no details on its construction, and it does not link scales directly to original sources. Our database improves on these issues through a stringent methodology (described below) and sources for all scales, and is presented as a digitized, open-access resource which will continue to grow as new data are made available.
We can define scales (Fig. 1B) either prescriptively ("these are the notes you can use in a melody") or descriptively ("these are the notes that were used in the melody"). Theory scales consist of intervals with idealized, exact frequency ratios; they lack the random fluctuations that exist in the real world. These are mainly found in a limited set of cultures that exist along the old Silk Road route, and they are not necessarily played according to these theoretical ideals. 93,95][98][99][100] Theory scales are by definition prescriptive scales, although descriptive scales can be found which closely match these. Measured scales are obtained where measurements have been made of the notes on an instrument, or a recording of a song has been analysed with computational tools to extract a scale. Instrument tunings are by default prescriptive, but can be descriptive if all notes are used in a melody. There is some error in these measurements, taken with tools that include tuning forks, the Stroboconn, and modern computational approaches, but it is always within 10 cents. Measured scales taken from song recordings are exclusively descriptive, and they make up the smallest part of the database. This is because it is still quite a challenge to reliably infer scales from a recording of a performance using algorithms, 76 and thus it requires extensive manual labor.[107] Another statistical regularity in musical pitch is tonality, i.e., the distribution of notes, how they are played (duration, ornamentation), and their transition probabilities. 109,110][114][115] Tonality (or modality) is in some cases an integral part of a scale that determines how the notes should be played, 116 to the extent that musicians can reliably tell apart two ragas from the duration of a single note. 117 We emphasise that our definition of scale does not imply any specific tonal hierarchy, while acknowledging that similar definitions (e.g., maqam, raga) are often used with the expectation that they include tonality. 100 In the database, we account for tonality in a limited way by noting the order of notes in the scale, starting from the first note (tonic). However, this information is often unavailable, and we lack the finer details of note distribution or transition probabilities. Future cross-cultural analyses of recordings ought to document these important details. 116,118
To enable a broad range of analyses we collected information pertaining to, where applicable, (i) the society (country, language or ethnic group, musical tradition), (ii) geography (country, geographic region) (Fig. 2), (iii) instrument type, (iv) tonic note, and (v) whether the scale is measured using harmonic or melodic intervals. We additionally linked societies, where appropriate, to identifiers from other ethnographic and linguistic databases, such as D-PLACE and Glottolog. 119,120 D-PLACE contains information on social structure, economy, and environment, while Glottolog can be used to determine the linguistic distance between two groups, a common proxy for cultural similarity. 121 Some musical traditions that span multiple countries (such as Western classical music) are also taken as a unit of society. Identification of the society or geographic origins of a scale was required for inclusion of scale data. Other details were found at varying frequency; e.g., tonic was only identified in 121 out of 434 measured scales.
Inferring octave scales. The database exists in two forms: the raw data (434 theory scales, and 462 measured scales), and a set of octave scales that is generated procedurally from the raw data according to five choices. For a complete workflow from source to database, including examples, and the choices we made to generate octave scales, see SI section "Inferring octave scales", SI Fig. 1-3, and SI Table 1. In total, we infer 896 octave scales (434 theory octave-scales, 384 octave-scales from instrument tunings, and 78 octave-scales from song recordings), from 73 societies. The theory scales span 6 regions, while the measured scales span 8 regions, and 46 countries (Fig. 2). When inferring octave scales from measured scales, a principal assumption is that they add up to an octave. Since the validity of the assumption is not clear, we first study the statistics of the measured scales before studying octave scales.

RESULTS
One of our goals is to estimate whether certain notes and intervals appear more or less frequently than expected by chance. This is hard to quantify, since we lack a universally-correct method of calculating the probability of observing an interval. Instead, we propose three independent statistical models that reflect different ways of constructing scales:
Model: Lognorm. This model assumes that scale notes $S$ are chosen independently from a lognormal distribution, $P(S) = \ln\mathcal{N}(\mu, \sigma^2)$. The rationale behind this choice is that small intervals should be uncommon ($P(S) \to 0$ as $S \to 0$) due to limits in pitch perception, and large intervals should be uncommon ($P(S) \to 0$ as $S \to \infty$) due to physical constraints (human anatomy; instrument size). We expect that this model is inappropriate for within-scale analyses, since notes within a scale are unlikely to be chosen independently of each other, but rather according to specific choices made by a musician. However, if the choices made by many independent musicians are sufficiently diverse, then the lognormal distribution may be an appropriate description of the between-scales note distribution.
Model: Shuffle. This model assumes that specific step sizes are important, but their arrangement is not. Normally, both step size and order determine scale notes. But now consider a musician who cares only about the step sizes, not the order in which they are arranged. This can be approximated by sampling from the original data and reshuffling the order of step sizes.
Model: Resample. This model assumes that step sizes are chosen independently from a distribution, and arranged randomly into scales. This is equivalent to a musician who is indifferent to both the step size, and the order. A reasonable choice of a distribution in this case is the posterior distribution of all step sizes.

FIG. 3. Between-scales distributions of step sizes and scale notes. A: Distribution of notes in measured scales (blue; histograms are shown as lines; bin size = 30 cents). Maximum likelihood lognormal distribution fitted to notes distribution (black). Distribution of notes obtained via alternative sampling: shuffling the step sizes within scales (orange); resampling step sizes from the full distribution of step sizes (green). The x-axis is truncated at 3,000 cents for clarity. B: Distribution of step sizes (blue). Distribution of step sizes in a set of scales with notes generated independently from a lognormal distribution. C-E: Probability that the counts observed in the original data were generated by the lognormal distribution (C), the shuffled scales distribution (D), or the resampled scales distribution (E). Empty circles indicate that the interval is found more than chance, while filled circles indicate the opposite. Dotted line indicates a p value of 0.05 after applying a Bonferroni correction. Dashed lines indicate the values of 200 and 700 cents. Shaded region shows bootstrapped 95% confidence intervals.
In any alternatively-sampled set of scales, the number of scales and the number of notes in these scales match the original. In the following sections, we will use these three models to first compare intervals between scales, and then compare intervals within scales, searching for evidence of statistically significant counts of intervals.
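The three models can be sketched in a few lines. This is an illustrative sketch rather than the code used for the analyses: the function names are hypothetical, scales are represented as lists of step sizes in cents, and the lognormal parameters `mu` and `sigma` are placeholders that would in practice come from a maximum-likelihood fit. Each function preserves the number of scales and the number of notes per scale.

```python
import random

def steps_to_notes(steps):
    """Cumulative sum of step sizes: scale notes in cents above the first note."""
    notes, total = [], 0.0
    for s in steps:
        total += s
        notes.append(total)
    return notes

def shuffle_model(scales_steps, rng):
    """Shuffle: keep each scale's own step sizes, but randomise their order."""
    out = []
    for steps in scales_steps:
        shuffled = list(steps)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out

def resample_model(scales_steps, rng):
    """Resample: draw every step from the pooled distribution of all step sizes."""
    pool = [s for steps in scales_steps for s in steps]
    return [[rng.choice(pool) for _ in steps] for steps in scales_steps]

def lognorm_notes_model(scales_steps, mu, sigma, rng):
    """Lognorm: draw each scale's notes independently from a lognormal, then sort.

    Returns scale notes (not steps); mu and sigma are parameters of log(S).
    """
    return [sorted(rng.lognormvariate(mu, sigma) for _ in steps)
            for steps in scales_steps]

rng = random.Random(0)
example = [[200, 200, 100, 200, 200, 200, 100], [240, 240, 240, 240, 240]]
shuffled_notes = [steps_to_notes(s) for s in shuffle_model(example, rng)]
```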
Statistically significant intervals between scales. To search for intervals that are statistically significant (due to either an abundance or a dearth), we first plot the distribution of scale notes (Fig. 3A, Original) and step sizes (Fig. 3B, Original) for a sub-sample including only measured scales, without imposing octave equivalence, and controlling for society (214 scales sampled from a total of 434 scales, with no more than 5 per society, resampled 1,000 times to achieve convergence; SI Fig. 4). We compare this empirical distribution to the maximum-likelihood lognormal distribution (Fig. 3A, Lognorm), and the corresponding step size distribution (Fig. 3B, Lognorm). As expected, we find that scales drawn from the lognormal distribution have step sizes that do not resemble real scales, since notes within a scale are not independent. Instead, notes within scales are more evenly distributed, such that step sizes are peaked at ∼200 cents, and rarely smaller than 100 cents (Fig. 3B); this is found in every geographical region that we investigated (SI Fig. 5). However, scale notes from different scales are independent of each other, resulting in a distribution that is approximately lognormal when many scales are sampled (Fig. 3A).
The utility of fitting a lognormal distribution to the data is that it can serve as a null-hypothesis baseline for estimating the probability that a scale note appears more or less than chance. To this end, we integrate the lognormal distribution over the range of each histogram bin $i$ in Fig. 3A to get the probability of observing scale notes, $p_i$. We then calculate the binomial probability, $q_i(k) = \binom{n}{k} p_i^{k} (1 - p_i)^{n-k}$, where $k_i$ is the number of observations in bin $i$, and $n$ is the total number of observations. We report either: the probability that $k_i$ or higher is observed, $\sum_{k=k_i}^{n} q_i(k)$, if $k_i/n > p_i$; or else, the probability that $k_i$ or less is observed, $\sum_{k=0}^{k_i} q_i(k)$, if $k_i/n \le p_i$. Low probability implies significant deviation from the null lognormal hypothesis. We see that only a few intervals deviate significantly from the lognormal distribution (Fig. 3C): 1,200, 700, 200 and 2,400 cents are found more frequently, while 600 cents is found less frequently than chance. This is strong evidence that the octave (and the fifth) are important intervals in many societies.
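A minimal sketch of this per-bin test is given below, using only the Python standard library. The function names are hypothetical, and the binomial terms are evaluated in log space (an implementation choice made here to avoid overflow for large counts, not something specified in the text). The lognormal parameters `mu` and `sigma` refer to the mean and standard deviation of the logarithm of the note value in cents.

```python
import math

def lognorm_cdf(x, mu, sigma):
    """CDF of a lognormal distribution with log-mean mu and log-sd sigma."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def bin_probability(lo, hi, mu, sigma):
    """p_i: lognormal probability mass falling inside the histogram bin [lo, hi)."""
    return lognorm_cdf(hi, mu, sigma) - lognorm_cdf(lo, mu, sigma)

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (requires 0 < p < 1)."""
    log_q = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_q)

def tail_probability(k_obs, n, p):
    """One-sided tail: upper tail if the bin is over-represented, else lower tail."""
    if k_obs / n > p:
        return sum(binom_pmf(k, n, p) for k in range(k_obs, n + 1))
    return sum(binom_pmf(k, n, p) for k in range(0, k_obs + 1))

def is_significant(k_obs, n, p, n_bins, alpha=0.05):
    """Compare against a Bonferroni-corrected threshold, alpha / number of bins."""
    return tail_probability(k_obs, n, p) < alpha / n_bins
```

For a 30-cent bin centred on the octave, one would call `bin_probability(1185, 1215, mu, sigma)` with the fitted parameters, and pass the result together with the observed count in that bin to `tail_probability`.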
To corroborate these findings, we repeat the significance test with different assumptions on how scales are generated. We first repeatedly shuffle the step sizes in each scale to generate new scales, to examine whether the statistically significant intervals would arise if step sizes were ordered randomly (Fig. 3D). Similarly, by resampling with new step sizes, we can test whether the significant intervals could plausibly have been produced by arranging randomly selected step sizes from the distribution in Fig. 3B (Fig. 3E). These analyses demonstrate that the peak at 200 cents (Fig. 3C) is due to it being the most common step size. The values of 600 and 2,400 cents are found to be less significant than in Fig. 3C, but are still much more significant than most intervals. The fact that the peak at 700 cents is significant in Fig. 3D but not in Fig. 3E is likely due to the presence of equidistant scales (SI Fig. 6), where the order of the intervals is mostly irrelevant. By this logic, one might expect the peak at 1,200 cents to disappear in Fig. 3D, but it survives because a large fraction of equidistant scales do not extend beyond the octave (SI Fig. 6). Remarkably, the octave is extremely significant according to all three tests.
Statistically significant intervals within scales. The previous method was used to assess whether intervals are found more than expected by chance across a collection of scales. Since grouping together scales from different societies might be problematic, we require a test that can discriminate unusually frequent / rare intervals within individual scales. To this end, we compare the original scales with many alternative scales that are created by shuffling the original scales' step sizes. We use the full range of intervals that can be generated with an instrument, not just the scale notes. For an instrument with $N$ notes, this results in $N \times (N - 1)/2$ intervals. By sampling many more intervals, this test is powerful enough to detect, in some cases, significant signals within individual scales.
To test for statistical significance of an interval $I$, we find all intervals in a scale that fall within $w$ = 100 cents of $I$, and calculate their distance from $I$. We repeat this process for 50 shuffled versions of the scale, collecting all intervals within $w$ of $I$ in a second group. We then use a Mann-Whitney U test to examine whether either set of intervals, original or shuffled, is significantly closer (i.e., p < 0.05) to $I$ than the other (repeating 100 times to get a converged average). We demonstrate this procedure for $I$ = 1,200 cents (Fig. 4A): in most scales the original intervals are closer to an octave than the shuffled scales, although only a fraction of results are significant. Reasons for non-significant results include small sample sizes and the tendency of equidistant scales to produce similar intervals when shuffled.
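The procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis code: function names are hypothetical, a scale is given as a list of step sizes, and the Mann-Whitney p value uses the normal approximation with a tie correction and no continuity correction, which may differ slightly from exact implementations for small samples.

```python
import itertools
import math
import random

def all_intervals(steps):
    """All N(N-1)/2 intervals between pairs of notes (N = len(steps) + 1)."""
    notes = [0.0]
    for s in steps:
        notes.append(notes[-1] + s)
    return [b - a for a, b in itertools.combinations(notes, 2)]

def distances_near(steps, target, w):
    """Distances from target of every interval lying within w cents of it."""
    return [abs(x - target) for x in all_intervals(steps) if abs(x - target) <= w]

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks, tie_term, i = [0.0] * len(pooled), 0.0, 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average rank within a group of ties
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))

def interval_test(steps, target=1200.0, w=100.0, n_shuffles=50, rng=None):
    """p value comparing original vs shuffled distances to the target interval."""
    rng = rng or random.Random(0)
    original = distances_near(steps, target, w)
    shuffled = []
    for _ in range(n_shuffles):
        s = list(steps)
        rng.shuffle(s)
        shuffled.extend(distances_near(s, target, w))
    if not original or not shuffled:
        return None  # no intervals near the target; the test is undefined
    return mann_whitney_p(original, shuffled)
```

A small p value alone does not say which group is closer to the target; in practice one would also compare the medians of the two groups to decide whether the interval is favoured or avoided.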
We extend this analysis to all intervals over the range 200 ≤ $I$ ≤ 2,600 cents, showing the fraction of significant results indicating an interval is found more or less frequently than chance (Fig. 4B). Without a doubt, the most significant interval is the octave (35 % significantly close), and the intervals that are most significantly avoided in scales are the regions flanking the octave. In further agreement with the between-scales analysis, the next most significant regions are those around 500, 600, and 700 cents.
To put these results in perspective, we repeat the test on sets of scales generated by resampling step sizes from Fig. 3B (Fig. 4B, Null). The null distribution converges to 5 % as expected, which clearly demonstrates that the high fractions of significant results in the original scales are not artefacts due to testing multiple hypotheses. Additionally, we show that the results do not depend on our choice of $w$ (SI Fig. 7). The consistency of these test results reinforces the conclusion that these significant intervals are not chosen randomly.
Effect of tuning variability on the search for universal intervals. Observation of significant intervals likely reflects the intentions of the musician to tune to these intervals. However, non-significant results do not necessarily indicate lack of such intentions, as imprecision in tuning (and tuning measurements) may result in false negatives. To understand the impact of imprecision, we generate a test set of scales (by sampling step sizes from Fig. 3B) and fix all scale notes that are ≥1,200 cents to be exactly an octave higher than one of the notes that are ≤1,200 cents. We then add to the test sets normally-distributed noise, $\mathcal{N}(\mu = 0, \sigma^2)$. Even without noise ($\sigma$ = 0), this test can find significant results only 90 % of the time (Figure 4C), which demonstrates the difficulty in inferring intentionality simply due to low sample sizes.
To estimate reasonable bounds on the noise $\sigma$, one may first consider that the measurement error varies from about 1 cent for computational methods (e.g., Stroboconn), 122 to 5 cents for tuning forks.9][130][131] Thus, we can expect a reasonable upper bound to the proportion of significant results that can be detected of about 30 to 70 %. When we find the octave to occur significantly more than chance about 35 % of the time (Fig. 4B), this rate is therefore within the range of the hypothetical maximum.
Ultimately, it is hard to say how exactly scales are chosen. Were some important intervals fixed first (e.g., the octave), and then the rest chosen to fill the gaps? Or were step sizes of certain categories (e.g., big and small) chosen, and then arranged in some preferred order? 132 What we have conclusively shown is that, independently of how scales are created, they show a significant, consistent bias towards including some notes, and avoiding others.
Qualitative evidence for preferential use of octaves. A more direct route to understanding the intentions of musicians is through detailed ethnography. We therefore examined each source for qualitative evidence indicating preferential use (or absence) of the octave (SI Dataset 1). We found qualitative evidence in support of octave use in 26 out of 60 sources: use of the octave to tune instruments (8 sources); performing melodies in parallel octaves (11 sources); using the same name for notes an octave apart (10 sources), which is also strong evidence for octave equivalence. We also find quantitative evidence (statistically significant octaves for at least one scale; Fig. 4A) in 26 sources; in total, 40 sources contain either quantitative or qualitative evidence. For 6 of the remaining 20 sources, there is evidence from secondary sources for preferential use of the octave in those cultures; another 3 sources report on archaeological findings, which are difficult to investigate further. 48,57,63 Two sources (Georgian polyphonic singing) provided qualitative evidence that fifths are perceptually important, instead of octaves; 13,89 however, we found statistically significant octaves in one of these sources, so the evidence here is mixed. 13 A third source (Colombian marimba) provided quantitative evidence that octaves were found significantly less than chance. 85 Overall, we find some evidence (either primary or secondary) in support of octaves being significant in 46 out of 60 sources, and evidence to the contrary in 3 sources. These findings strongly support the view that octaves are widespread, but not absolutely universal. 19,20
Statistics of Octave Scales. After verifying that preferential use of octaves is indeed widespread, we proceed to studying octave scales; we infer octave scales by making some assumptions (e.g., octave equivalence; see section "Inferring octave scales"). While in the previous section we exclusively examined measured scales, here we look at a mix of theory and measured scales. Data is shown for four samples of scales: (i) all theory scales, (ii) all measured scales, (iii) a sub-sample controlling for society ('SocID', no more than 5 scales per society), and (iv) a sub-sample controlling for geographical region ('Region', no more than 10 scales per region).
Most scales are found to have 7 or fewer notes (Fig. 5A), in agreement with previous work. 66note scales are remarkably rare across all sampling schemes. The predominance of step sizes of ∼200 cents (Fig. 3B) is also seen in octave scales (Fig. 5B). Consistent with existing literature, 2 most steps are between 100-400 cents, and this applies in all geographic regions studied (SI Fig. 5). The most common notes (Fig. 5C), in all sub-samples, are exact matches for the significant intervals in Fig. 4B (500, 700 cents). The fact that the statistics of octave scales are consistent with the statistics of measured scales (and with previous work), to an extent, validates our methodology for extracting octave scales (SI section "Inferring octave scales").
Theory scales show sharp peaks at intervals close to 12-TET intervals. In contrast, the distributions of steps and notes in measured scales are much more diffuse. This difference is expected, since theory scales consist of mathematically-exact, ideal intervals, while measured scales include natural sources prone to error and variation. There is, however, correspondence between theory and measured scales in the rarely used notes (270, 450, 550, 650, 930 cents), which is even clearer when controlling for the number of notes (SI Fig. 8). Regardless of the number of notes in a scale, we find salient peaks at 200, 500 and 700 cents (SI Fig. 8). Overall, despite some differences between theory and measured scales, both sets are in agreement over the most distinct features: (1) step sizes are usually ∼200 cents, and between 100-400 cents, and (2) the most salient notes are at 500 and 700 cents, while the regions around these notes are avoided in scales.
Variation across societies is comparable to variation within some societies. To quantify variation in scales across societies, we use t-distributed stochastic neighbour embedding (tSNE) to map the scales onto a reduced two-dimensional representation. 133 This method can compare only scales with the same number of notes, so we show results separately for 7-note scales (Fig. 6A), and 5-note scales (SI Fig. 9). We group scales into clusters using the DBSCAN algorithm (eps = 2, minimum samples = 5), 134 and report note distributions and region distributions for the largest four clusters. The tSNE embedding is useful for visualizing diversity among high-dimensional objects, but note that tSNE dimensions are arbitrary. The main information in this plot is the distances between points: distances in the 2d-embedding are non-linearly related to the real distances, such that 2d distance from one scale to all other scales will have a high rank-correlation with the real distances.
The simplest and most salient feature is that scales tend to be almost equidistant (Fig. 6D). For example, in the largest cluster (Fig. 6A-B; blue), notes are all within 30 cents of the corresponding notes in the equiheptatonic scale. Although most examples (29 / 40) of this cluster (Fig. 6D) come from three countries (Thailand, Guinea, Malawi), in total they are found in 13 countries (e.g., Colombia, Georgia, Zimbabwe, Indonesia). Surrounding this cluster (Fig. 6A), we can find examples of theory scales: Western diatonic modes, and North Indian thaat (e.g., green cluster). In general, we see that most equidistant scales are measured scales, but there is also a lot of overlap between theory and measured scales (Fig. 6E). Furthermore, when clustering societies by scale similarity, we find two main clusters: one dominated by societies with theory scales, and one dominated by societies with equidistant scales (SI Fig. 10-11). Examples of scales that are furthest from equiheptatonic include the Gamelan pelog scale (Fig. 6A-B; yellow cluster), and the Carnatic mela salagam (Fig. 6F), which both have many small step sizes in a row.
Surprisingly, there seems to be little variation overall, such that multiple societies that use theory scales exhibit levels of within-society variability comparable to the total variability (Fig. 6F). In general, we find that geographical regions tend to contain overlapping subsets of scales (SI Fig. 12A), and distances between scales within regions are of similar magnitudes to distances between scales from different regions (SI Fig. 12B). Even Thai scales, which are often cited as exclusively being in equiheptatonic tuning 47,62,69,70 (although this is disputed 68), exhibit substantial within-society variability (Fig. 6F). Overall, there appears to be less variation in 7-note scales than might have been expected, and a similar analysis of 5-note scales reveals similar results (SI Fig. 8), which suggests that they share the same organizing principles.
Statistical analysis shows that scales tend to be equidistant. To put the diversity of scales in a broader context, one may consider a hypothetical universe of possible scales by enumerating them on a grid (grid scales). To this end, we take 20 cents as a basic grid resolution, and enumerate all possible unique scales with step sizes within 60-320 cents; this limitation on step size already reduces the number of possible scales (at 20 cents resolution) by 98 %. To conveniently compare grid and real scales, we examine their 2-dimensional embedding. Strikingly, those grid scales that correspond to real scales (i.e., within ±10 cents of a real scale) are clustered around the equiheptatonic scale (Fig. 7A, cyan). For comparison, Thai scales are found to be as far as 43 cents from equiheptatonic. We find that 70 % of real scales are within this boundary, compared to only 10 % of grid scales (Fig. 7B). Alternative definitions of equidistance based on step sizes rather than scale notes also support the finding that scales tend to be close to equidistant (SI Fig. 11).
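To give a sense of the size of such a grid, the sketch below counts ordered sequences of step sizes on a 20-cent grid that sum to one octave. It is an illustration under assumptions: the function name and defaults are hypothetical, only 7-note octave scales are counted, and every ordering of steps is treated as a distinct scale. The text's notion of "unique" scales and the baseline behind the 98 % figure may be defined differently, so this is not an attempt to reproduce that number.

```python
from functools import lru_cache

def count_grid_scales(n_steps=7, octave=1200, resolution=20,
                      min_step=60, max_step=320):
    """Count ordered step sequences on the grid that sum exactly to one octave."""
    total_units = octave // resolution
    lo, hi = min_step // resolution, max_step // resolution

    @lru_cache(maxsize=None)
    def count(steps_left, units_left):
        # Base case: all steps placed; valid only if the octave is exactly filled.
        if steps_left == 0:
            return 1 if units_left == 0 else 0
        return sum(count(steps_left - 1, units_left - s)
                   for s in range(lo, hi + 1) if s <= units_left)

    return count(n_steps, total_units)

# Restricted grid (60-320 cents per step) vs. any positive multiple of 20 cents.
restricted = count_grid_scales()
unrestricted = count_grid_scales(min_step=20, max_step=1200)
fraction_kept = restricted / unrestricted
```

The memoised recursion avoids enumerating every sequence explicitly, which keeps the count fast even though the unrestricted grid contains tens of millions of sequences.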
To gain further insight into the striking (near-) equidistance of scales, we look at the note distributions for notes 2-7 (Fig. 7D) for real scales and grid scales. The grid scales show the broadest note distributions for the two notes (4 and 5) furthest from the fixed ends (tonic, octave), resulting in higher entropy (Fig. 7E). This is expected, since summing two random distributions should result in a higher variance; e.g., note 4 is the sum of three steps drawn from Fig. 7C. In contrast, distributions of notes 4 and 5 in real scales are most predictable (lowest entropy), once again illustrating the significance of intervals of size 500 and 700 cents.
To examine the possibility that the difference between real scales and grid scales is mainly due to differences in their step size distributions (Fig. 7C), we analyse two other sets of note distributions. We first look at the note distributions that ought to be farthest from equidistant scales: we rearrange the steps in each real scale so that they are ordered low-to-high, and consider scales starting from every position. This results in note distributions with entropy similar to grid scales (Fig. 7E, Sorted). We then look at the note distributions of alternative versions of real scales obtained by shuffling the step sizes in scales. Again, we find that the entropy is highest at notes 4 and 5 (Fig. 7E, Shuffled), in contrast to real scales. The main difference between the real scales and grid scales appears to be that step sizes in real scales are well-mixed: small steps are found adjacent to large steps rather than small steps (and vice versa).
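The entropy comparison can be illustrated with a short sketch. The binning (20 cents), the use of bits as the unit, and the function names are assumptions made here; the text does not specify how the entropy in Fig. 7E was estimated.

```python
import math
from collections import Counter

def note_entropy(values, bin_size=20.0):
    """Shannon entropy (in bits) of a histogram of note values in cents."""
    counts = Counter(math.floor(v / bin_size) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def note_entropies(scales_notes, bin_size=20.0):
    """Entropy at each note position, across scales with the same number of notes."""
    return [note_entropy(position, bin_size) for position in zip(*scales_notes)]

def sorted_variants(steps):
    """Steps ordered low-to-high, then every rotation ('starting from every position')."""
    ordered = sorted(steps)
    return [ordered[i:] + ordered[:i] for i in range(len(ordered))]
```

Applying `note_entropies` to the notes of real scales, and to the notes of their shuffled or sorted variants, gives one entropy value per note position, which is the quantity being compared between the real, Sorted, and Shuffled sets.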

DISCUSSION
How diverse are scales? Despite some differences, scales across cultures are remarkably similar. For example, the set of Carnatic scales alone is almost as varied as the total set of scales (Fig. 6F). In particular, most scales observed in cultures, when compared to the universe of possible scales, are close to 5- and 7-note equidistant scales (Fig. 7, SI Fig. 12). 6][137] Yet, we find that they are more prevalent than expected by chance. This discrepancy may be due to the lack of a robust definition for equidistant scales; for example, the only previous statistical study of prevalence of equidistance does not explicitly account for natural variation in intonation. 6 It is not clear how much deviation from perfect equidistance renders a scale perceptually non-equidistant, or how variability in pitch affects perception of equidistance. As an illustration of the difficulties, consider the following example from Georgian singing: the step sizes in this scale are close to equidistant (163-202 cents) when viewing the melodic pitch histogram. But the intervals between notes in the melody are much less exact (90-240 cents); 13 despite being statistically equidistant, it is not clear whether this scale would be perceived as equidistant. Perception of equidistance may also vary with culture and training, in which case it may be better to avoid binary measures of equidistance, and instead to construct a perceptually-relevant, continuous measure.
The most salient difference between societies is whether their data consisted of theory scales or measured scales (SI Fig. 9-10), considering that theory scales appear to be less equidistant than measured scales. One hypothetical explanation for the predominance of non-equidistant scales in societies using theory scales is that the process of creating new scales by combining simple-integer-ratio intervals will inevitably result in more non-equidistant scales, because only one combination of step sizes can result in an equidistant scale. We wonder whether this societal difference would persist if we were to only compare measured scales from societies, since theory scales will certainly exhibit intonation variability when performed. In the literature, Western scholars often discuss variation in intonation in societies that lack mathematical musical theory as being an intentional form of expression, 64,68,[138][139][140][141] whereas studies of classical musics typically investigate to which theoretical tuning system the musicians conform. 13,123,126,127 It is difficult to intuit the differences between societies, since there are only a few cross-cultural studies of interval discrimination. 142,143 Here, we performed a preliminary study on intonation variability (SI Fig. 13-14), finding comparable levels of variation (with the possible exception of the pelog scale) in Gamelan orchestras, Thai xylophones, Turkish ney, 91 Georgian singing, 89 and a Belgian carillon. 144 But ultimately, to understand the differences between societies that use theory scales and those that do not, we will need cross-cultural perceptual experiments, and direct measurements of scales from recordings.
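To make the point about simple-integer-ratio intervals concrete, a frequency ratio r corresponds to 1200·log2(r) cents. The sketch below converts a textbook just-intonation major scale to cents and lists its step sizes; the scale is chosen purely for illustration and is not taken from the database, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def ratio_to_cents(ratio):
    """Convert a frequency ratio to an interval size in cents."""
    return 1200 * math.log2(float(ratio))


# Illustrative example: ratios of a just-intonation major scale above the tonic.
just_major = [Fraction(9, 8), Fraction(5, 4), Fraction(4, 3), Fraction(3, 2),
              Fraction(5, 3), Fraction(15, 8), Fraction(2, 1)]

notes = [ratio_to_cents(r) for r in just_major]
steps = [b - a for a, b in zip([0.0] + notes[:-1], notes)]
```

The resulting steps take three distinct sizes (ratios 9/8, 10/9 and 16/15), whereas an equiheptatonic scale would require seven equal steps of about 171 cents, which no combination of these ratios produces.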
How did far-away societies come to use such similar scales? Using a cultural-evolutionary framework, 145 we can describe this process as a combination of cultural diffusion, and convergent evolution due to common factors biasing the use of scales across cultures. Diffusion can certainly account for some similarities, as some societies have documented shared history: e.g., societies with theory scales in the Middle East, or East Asia; court music in Thailand, Laos and Cambodia. 100 Some have suggested that cultural diffusion can be inferred based on two cultures using similar equiheptatonic scales, 80,85 but evidence seems circumstantial given how widespread equiheptatonic scales are. 7][148][149] While there is undoubtedly some transmission of information across cultures, we reiterate that: within-region scale diversity is comparable to between-region diversity (SI Fig. 12B), step intervals consistently show approximate limits of 100-400 cents across regions (SI Fig. 5), and scales are (against chance odds) overwhelmingly close to equidistance in all regions (SI Fig. 12A). Taken together, these facts point towards some non-negligible degree of convergent evolution due to shared biases.
How do scales evolve over time? Scales can change, or persist, in a variety of ways. On a short time-scale, vocal (or other non-fixed-pitch instrument) scales are inherently stochastic, due to a lack of precision in motor control. 150 On a longer time-scale, vocal scales reside in memory, which introduces another mechanism of change. Unlike vocal scales, instrument tunings physically persist through time, but are still affected by multiple factors: environment (temperature, humidity), material (wood, metal, animal organs), and physical force. 49 To illustrate this point, we checked examples of repeated tunings of the same instruments across a specific time frame: Gamelan orchestra (metal idiophone), with standard deviation of notes, σ = 8 cents (slendro) and σ = 13 cents (pelog) over about 25 years; 46 Angolan likembe (plucked metal idiophone), where σ = 18 cents over a few weeks; 49 Gambian kora (chordophone), with σ = 27 cents over one week. 52 Technology offers more robust ways of keeping scales stable over time: monochords (ancient Greece) and pitch pipes (ancient China) enable reliable tuning by fourths and fifths. 36 Likewise, the perceptual phenomenon of tonal fusion may have enabled stable scales by providing knowledge of perceptual anchors (octaves and fifths) that can be reliably passed through generations. Recently, it seems to us that stability in scales has been reinforced through global tuning standards and inventions such as fixed pitch instruments and electric tuners. Thus, we can see that scales change in many ways, but it is possible for the rate of change to decrease through the use of technology and musical theory.
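A per-note standard deviation across repeated tunings of one instrument can be computed as in the sketch below. The function name and the example measurements are hypothetical, and the use of the sample standard deviation for each note position is an assumption; the text does not specify how σ was aggregated across notes.

```python
import statistics


def note_std(tunings):
    """Per-note sample standard deviation (cents) across repeated tunings.

    tunings: list of measurements of the same instrument, each a list of
    note positions in cents, aligned so that index i is the same note.
    """
    return [statistics.stdev(values) for values in zip(*tunings)]


# Hypothetical repeated measurements of a five-note instrument.
repeated = [
    [240, 480, 720, 960, 1200],
    [250, 470, 725, 955, 1210],
    [235, 485, 715, 965, 1195],
]
per_note_sd = note_std(repeated)
mean_sd = statistics.mean(per_note_sd)
```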
We can imagine some possible mechanisms of scale evolution, drawing analogies with biological evolution: we can think of a single scale as a gene, and a set of scales used by a population of people as a genome. Changing a scale note is akin to mutating a single nucleotide. Adding or removing notes from a scale is like insertions/deletions of nucleotides. Scales can be copied through interactions between populations; genes can be shared through horizontal transmission. The Western diatonic modes can be constructed using the same step intervals, but changing the tonic position; the same process occurs in protein sequences (circular permutants), likely through gene duplication and homologous recombination.
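The circular-permutation relationship between the diatonic modes can be shown directly by rotating a list of step sizes. The sketch below assumes 12-TET step sizes (200 and 100 cents) for simplicity; the names are illustrative.

```python
# Step sizes (cents) of the major scale in 12-TET, assumed for illustration.
DIATONIC = [200, 200, 100, 200, 200, 200, 100]
MODE_NAMES = ["Ionian", "Dorian", "Phrygian", "Lydian",
              "Mixolydian", "Aeolian", "Locrian"]


def circular_permutants(steps):
    """Return every rotation of a step sequence (one per tonic position)."""
    return [steps[i:] + steps[:i] for i in range(len(steps))]


# Each mode keeps the same steps but starts from a different position.
modes = dict(zip(MODE_NAMES, circular_permutants(DIATONIC)))
```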
The analogy starts to break down when one considers that scales can be invented, which is more akin to designing genomes. There are recent examples of invention of microtonal tunings in Western music, 36,64,151 and many theory scales bear the hallmarks of design: Greek modes are all circular permutants, based on simple integer ratios; 152 the Carnatic melakarta result from combinatorial enumeration of a set of intervals, constrained by a set of rules. 153 6][157][158][159] Typically, though, the theory scales that still survive are similar to the measured scales, so it is likely that designed scales are constrained by similar selection criteria to those that apply to non-designed scales.
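The combinatorial character of the melakarta system can be illustrated with the standard textbook construction: the tonic (Sa) and fifth (Pa) are fixed, the fourth (Ma) takes one of two positions, and the pairs Ri/Ga and Dha/Ni are each chosen as two ascending positions out of four. The sketch below approximates the positions as 12-TET semitones, which is a simplifying assumption and not necessarily how the scales are encoded in the database; the function name is hypothetical.

```python
from itertools import combinations


def melakarta_scales():
    """Enumerate melakarta scales as note positions in cents (12-TET approximation)."""
    lower = list(combinations([1, 2, 3, 4], 2))     # (Ri, Ga) in semitones
    upper = list(combinations([8, 9, 10, 11], 2))   # (Dha, Ni) in semitones
    scales = []
    for ma in (5, 6):                               # the two positions of Ma
        for ri, ga in lower:
            for dha, ni in upper:
                notes = [0, ri, ga, ma, 7, dha, ni, 12]   # Sa = 0, Pa = 7
                scales.append([100 * n for n in notes])
    return scales


melakarta = melakarta_scales()
```

The enumeration gives 2 × 6 × 6 = 72 scales, and with this loop order the 29th entry has the same note positions as the Western major scale.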
We propose that a detailed mechanistic model of scale evolution is possible: the mechanisms of spontaneous change need to be studied at the resolution of a single mutation step (through, e.g., transmission chain experiments). 160 The relative importance of the different mechanisms could be investigated with sufficient long-term data (e.g., same individual/group/culture, over a period of months/years). Such a model could then be used for agent-based modelling to study the role of horizontal transmission, and cultural-evolutionary biases. … depend on how pitch is perceived by humans; and (iii) production biases depend on what is easy or difficult to sing, and physical constraints on instruments.
We can think of three examples of cultural-evolutionary biases that may apply to music: conformity, novelty and prestige biases. A bias towards conformity results in the most common scale type being increasingly successful. In general, humans tend to synchronize 34,162 when playing music (same meter, tonality), and over a longer time period, there is evidence of shifts towards the use of 12-TET. 126,163 A conformity bias could arise due to other reasons, but we can probably rule out the difficulty of learning scales: studies on novel scale systems have shown that people rapidly learn their statistics. 164,165 Novelty biases could describe choosing novel microtonal tunings as a means of expression. 140,151 An example of a prestige bias is where tunings are copied from players/instruments that are acknowledged as good. 46,87][176][177][178][179] Production constraints can apply to both singing and instruments. 180 When singing, large intervals cost more energy to produce than smaller ones, while small intervals are difficult to produce reliably due to limits on motor control. 181 Instruments are less constrained in this way, but still have physical limits to the number of notes, and interval range. On the other hand, instruments are constrained in how reliably they can be re-tuned to the same scale. There is a long history 152,182 behind the practice of tuning using harmonic intervals, 50,52,59,67,88 and reports exist of tuning according to the step sizes, 38,49,51 tuning instruments visually, 87,88,96,183 and copying a reference instrument. 46,87
How can we study the evolution of scales? Tracing the evolutionary history of scales is challenging. Scales change at rates that depend on the instruments and technology, and new scales can be invented from scratch. The evolution is potentially driven by numerous selection pressures that vary in strength across societies. Nonetheless, we propose some approaches that seem feasible.
We can try to track evolutionary trajectories of scales using historical data. By restricting the scope to a single society, one can make simplifying assumptions by using ethnographic accounts to inform models. For example, about Gamelan orchestras we know that they are typically tuned in reference to another orchestra, 140 and instrument intonation changes at a steady rate (due to similar materials and environment). This can be described as an evolving network of Gamelan orchestras with edges between orchestras that influence each other. Gamelan tunings are also extensively documented in the literature, and additional tunings can be inferred from recordings.
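A toy version of such an evolving network is sketched below, purely to illustrate the kind of model this description suggests. Everything in it is an assumption made for illustration: the function name, the Gaussian copying error, the modelling of "steady" change as a random walk with constant variance, the uniform choice of reference orchestra, and the equipentatonic seed tuning. None of these choices is taken from the ethnographic record.

```python
import random


def simulate_tuning_network(n_orchestras, seed_tuning, copy_sd=5.0,
                            drift_sd=1.0, rng=None):
    """Grow a network in which each new orchestra copies an existing one.

    Returns (tunings, edges), where edges are (reference, copy) index pairs.
    copy_sd and drift_sd are hypothetical noise levels in cents.
    """
    rng = rng or random.Random()
    tunings = [list(seed_tuning)]
    edges = []
    for child in range(1, n_orchestras):
        # All existing instruments drift a little before the next copy is made.
        tunings = [[n + rng.gauss(0, drift_sd) for n in t] for t in tunings]
        # The new orchestra is tuned in reference to a randomly chosen one.
        parent = rng.randrange(len(tunings))
        tunings.append([n + rng.gauss(0, copy_sd) for n in tunings[parent]])
        edges.append((parent, child))
    return tunings, edges


# Hypothetical seed: an equidistant 5-note tuning.
seed = [240, 480, 720, 960, 1200]
tunings, edges = simulate_tuning_network(5, seed, rng=random.Random(1))
```

Fitting such a model to documented tunings would require estimating the copying and drift parameters from repeated measurements, which is beyond this sketch.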
Another approach to infer the evolutionary trajectories or selection pressures is by analyzing a population of scales. This approach requires appropriate mathematical models where multiple selection pressures can be considered in tandem. Ultimately, many models may have convergent predictions, which means that additional experiments are needed to distinguish the relative importance of different selection pressures. One limitation of this approach is the dependence on the sample of scales studied, and it is hard to construct a priori a representative sample of scales. In this study we control for society, but other criteria are possible. Does it matter if societies have different population sizes? How does one deal with differences in within-society variation in scales? Should we take into account frequency of scale use within a society? 100,184 These questions will surely unravel as more data becomes available for hypothesis testing.
Possible bias in the scale database. Relying on data collected by a limited number of ethnomusicologists, the database at this stage has sparse geographic coverage (Fig. 2). Some have suggested that ethnomusicologists have a bias towards reporting findings that are considered 'interesting', thus exaggerating diversity. 20 Certain musical traditions, such as Gamelan and Thai, were very popular research topics, so they are overrepresented. In contrast, there are very few quantitative measurements of scales or instruments with fewer than 5 notes, despite these being reported in many sources. Additionally, vocal scales (taken from recordings of singing) are scarce in the database, yet the voice is probably the most important instrument from an evolutionary point of view. 12 Unfortunately, in some rare instances there seem to be statistical irregularities in the reporting of tunings: Surjodiningrat et al. (46) note that Jaap Kunst 185 reported gamelan tunings (not included in this database) where all the higher notes were exactly an octave above the lower ones, and that this is extremely unlikely; in a study of prehistoric bone flutes where the tunings are given to an accuracy of 1 cent, one flute is recorded as having a series of equal tempered intervals: [200, 200, 200, 300]. 63 Ultimately, these biases do not void the conclusions, but the results shown here will certainly need to be updated as more data becomes available.
Limitations to studying scale evolution. We may be witnessing a decline in diversity of scales due to the converging forces of globalization and technological change. 186 There is already evidence of homogenization, via the adoption of 12-TET. 126,163 Thus, to understand how scales evolve, we must look to the past, but therein lies a different problem: the older the instrument, the less certain we are about how it was played. For prehistoric artefacts, we cannot be sure whether they (or their reconstructions) faithfully resemble the instrument in its original condition. 8][189] Thus, we believe that the best source of scales is in ethnographic recordings spanning the past century. 103 It is therefore imperative that methods be developed that can faithfully infer scales from large samples of songs. Algorithms must be developed to handle low-quality recordings, 190 background noise, instrument / singing segmentation, 191 polyphonic stream segmentation, 192 note segmentation, 193 and tonal drift. 194

CONCLUSION
Scales are a cornerstone of music across the world, upon which endless combinations of melodies can be generated. Surprisingly, despite a wealth of ethnomusicological research on the subject, we lacked a comprehensive, diverse synthesis of scales of the world. Here we remedy this issue, with a focus on quantitative data that will enable detailed statistical analyses about how scales evolve. Our own preliminary analyses have lent quantitative and qualitative support for the widespread (but not necessarily universal) use of the octave in some special capacity. Despite the rich diversity of scales, when put in context of how many scales are possible, what stands out in our analysis is how remarkably similar they are across the globe. Altogether, this work presents a treatise on the evolution of scales, and proposes promising avenues for future research.

Figure 1: Top: Schematic for construction of scale database. Appropriate sources are first identified, and out of these we identify scales that have unambiguous quantitative representations. This results in a set of 'theory scales', and a set of 'measured scales'. From these two sets, we create a set of 'octave scales', according to a set of five (i-v) choices. Bottom: Examples of how choices (i) and (ii) are implemented. (i) Most theory scales are represented symbolically, and we require some kind of code to convert the symbolic representation into a quantitative representation. (ii) For purposes of analysis, if one wants to ignore the concept of tonality (here referring to the idea that positions in a scale are not equivalent, and that scales start at the tonic), one can include all possible variants that start on different positions.
… Distribution of notes obtained via alternative sampling: shuffling the step sizes within scales (orange), for those scales that are far from equidistant (difference between minimum and maximum step sizes is greater than 100 cents). The x-axis is truncated at 3,000 cents for clarity. B: Probability that the counts observed in the original data were generated by the (far-from-equidistant) shuffled scales distribution. The peak at 700 cents is lower than in the main text Fig. 3D. This indicates that for scales that are far from equidistant, the step sizes are arranged so that fifths appear more frequently than chance. C: Distribution of the largest interval size for all scales. D: Distribution of the second largest interval size for all scales. When we create new samples by shuffling without replacement, we do not count the final note since this will always be the same. We considered that if there were many equidistant scales, they would result in octave intervals appearing by chance. However, this does not happen as the second largest interval size falls short of an octave in many of these equidistant scales.

Figure 16: Analysis of interval consistency for tuning information from three sources: Georgian polyphonic singing (A), from recordings of Zar (a type of funeral dirge) taken in different villages; [2] a set of Turkish 'ney' flutes (B); [3] a single Belgian carillon (C). [4] A: We align (in some villages certain notes are omitted) and compare all scale notes that were measured either from melodic pitch class histograms (left) or harmonic pitch class histograms (right). B: Since the flute tunings are already aligned (same number of notes; similar scale notes), we can calculate the deviation from mean for each note position, for scale notes (left) and step sizes (right). C: Unfortunately there is only one set of notes that spans 2,700 cents, so we performed two analyses of intervallic consistency within the instrument. We grouped all possible intervals by their distance to the nearest equal tempered (12-TET) interval, and calculated the deviation from the mean (left). It is not known whether the intended tuning is 12-TET (the authors speculate that mean-tempered tuning is used), so we also used a restricted set of intervals, where only intervals between the same pairs of notes were grouped together (right; e.g., C3-D3 is only grouped with C4-D4). This results in most groups only having two sets of measurements, which artefactually results in a lower standard deviation. It can be shown that using a sample size of 2 to estimate the standard deviation will typically result in a value that is approximately half of the actual standard deviation; thus, we believe that the first measurement (C, left) is a good approximation of the interval consistency of this instrument.

FIG. 5. Statistics of octave scales. A: Distributions of the number of notes in a scale for different samples of the scale database: sample balanced by region, sample balanced by culture, theory scales, and measured scales. B-C: Histograms of step sizes (B) and notes in scales (C) for different samples of the scale database: samples balanced by region, samples balanced by society. Whiskers (A) and shading (B,C) indicate bootstrapped 95% confidence intervals. Histogram bins are 20 (B) and 30 (C) cents.

FIG. 6. Cross-cultural diversity of scales. A: 2-dimensional embedding of 556 heptatonic scales, labelled with 7 Western diatonic modes (black), 10 North Indian thaat (red), and the four largest clusters (shaded areas). B: Note distributions for each cluster. Black lines are shown for means of each note, and grey shading indicates equidistant scale notes ±30 cents. C: Geographic distribution of each cluster. D-G: Embeddings are labelled with: equidistant scales (D; where notes are on average within 30 cents of the equidistant values), theory vs measured scales (E), Carnatic scales (F), Thai scales (G).

FIG. 7. Comparison of real scales with all possible 7-note scales enumerated on a grid (20 cents resolution; with step sizes limited to 60-320 cents). A: Two-dimensional embedding of grid scales. Grid scales that correspond to real scales (notes are on average within 10 cents of a real scale) are highlighted cyan; a black circle shows the equidistant scale. B: Mean note distance from 7-note scales and the equiheptatonic scale, for Thai scales, real scales, and all grid scales. C: Step size histograms (bin size = 20 cents) in real scales and grid scales. D: Scale note histograms (bin size = 20 cents) for notes 2-7 (no tonic and octave) for real scales and grid scales. E: Entropy of note distributions for: real scales; real scales, but rearranged with their steps arranged in order of size (Sorted); real scales, but rearranged with their steps in random order (Shuffled); grid scales.

Figure 2: continuing from Fig. 1… (iii) Sources that contain 'measured scales' rarely explicitly include detailed information about tonality, let alone which note may be considered a starting note ('tonic'). Thus, one can choose to: (a) Only include scales if there is evidence for which note(s) the performer considers a tonic. (b) Include all possible variants where the step sizes sum to 1200 ± O cents. (c) Identify one potential 'tonic', and only include scales that start on that note, or on notes that are related to the 'tonic' by an octave relation. (iv) The value of O needs to be specified; we choose O = 50 cents. (v) Some sources indicate that a full octave is used, and for reasons of concision they do not report the final interval leading to the octave. In this case, one can choose to include the final interval (shown in red) or not.
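Choice (b) in the caption above can be sketched as a search over all starting notes of a measured tuning. This is a minimal illustration under assumptions, not the database's own code: the function name and input format (an ascending list of note positions in cents) are hypothetical, and the tolerance default follows the value O = 50 cents stated in the caption.

```python
def octave_scales(notes, octave=1200.0, tol=50.0):
    """Extract every run of consecutive notes that spans an octave within tol.

    notes: ascending pitches of a measured tuning, in cents.
    Returns a list of scales, each re-expressed relative to its first note.
    """
    found = []
    for i, start in enumerate(notes):
        for j in range(i + 1, len(notes)):
            span = notes[j] - start
            if abs(span - octave) <= tol:
                found.append([n - start for n in notes[i:j + 1]])
            elif span > octave + tol:
                break   # later notes are even further from the octave
    return found


# Hypothetical tuning spanning slightly more than one octave.
tuning = [0, 240, 480, 720, 960, 1200, 1440]
variants = octave_scales(tuning)
```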

Figure 3: Distribution of deviations of final notes from the octave for all octave scales (O = 50).

Figure 5: Step interval distributions for Measured scales (A) and Theory Scales (B), from different geographical regions. In each panel, the distribution for one region is highlighted, and the distributions for the other regions are shown in grey. There are no Theory scales in the database from Oceania or Latin America (B).

Figure 6: A: Distribution of notes in measured scales (blue; histogram bin size = 30 cents). Distribution of notes obtained via alternative sampling: shuffling the step sizes within scales (orange), for those scales that are far from equidistant (difference between minimum and maximum step sizes is greater than 100 cents). The x-axis is truncated at 3,000 cents for clarity. B: Probability that the counts observed in the original data were generated by the (far-from-equidistant) shuffled scales distribution. The peak at 700 cents is lower than in the main text Fig. 3D. This indicates that for scales that are far from equidistant, the step sizes are arranged so that fifths appear more frequently than chance.

Figure 7: Effect of changing  on the fraction of significant results found for each interval size (see main text for details of statistical test). Blue line indicates that the interval is found significantly more than chance, while the orange line indicates the opposite. Shaded region shows bootstrapped 95% confidence intervals.

Figure 9: Cross-cultural diversity of scales. A: 2-dimensional embedding of 232 pentatonic scales, with the six largest clusters indicated (shaded areas). B: Note distributions for each cluster. Black lines are shown for means of each note, and grey shading indicates equidistant scale notes ±30 cents. C: Geographic distribution of each cluster. D-G: Embeddings are labelled with: equidistant scales (D; where notes are on average within 30 cents of the equidistant values), theory vs measured scales (E), Japanese scales (F), Burmese scales (G).

Figure 12: A: tSNE embedding of 7-note scales, with different regions highlighted in each plot. Oceania was not included since there are too few examples in this case. B: Distributions of distances between all possible pairs of 7-note scales between one region and another (between-region distance) are shown in grey; distributions of distances between all pairs of scales within a region (within-region distance) are highlighted in each plot.
Breakdown of scales in the database according to: geographical area; theoretical scales or measured scales. The map shows the geographic origin of the measured scales; theory scales are not included since they are not always associated with a single country; marker size indicates sample size. Made with Natural Earth. 108

Table 1: The number of scales of each type from each source, in order: raw theory; raw measured from instrument; raw measured from recording; octave theory; octave instrument; octave recording; total octave scales.