Skip to main content
Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Genetic analysis of the endangered Cleveland Bay horse: A century of breeding characterised by pedigree and microsatellite data

  • Andrew Dell ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft (AD); (PBW)

    Affiliations Department of Biological Sciences, University of Lincoln, Brayford Way, Brayford Pool, Lincoln, United Kingdom, Rare Breeds Survival Trust, Stoneleigh Park, Stoneleigh, Warwickshire, United Kingdom

  • Mark Curry,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration

    Affiliation Department of Biological Sciences, University of Lincoln, Brayford Way, Brayford Pool, Lincoln, United Kingdom

  • Kelly Yarnell,

    Roles Formal analysis, Supervision, Visualization, Writing – review & editing

    Affiliation School of Animal, Rural and Environmental Sciences, Nottingham Trent University, Brackenhurst Campus, Southwell, Nottinghamshire, United Kingdom

  • Gareth Starbuck,

    Roles Project administration, Resources, Software, Visualization, Writing – review & editing

    Affiliation School of Animal, Rural and Environmental Sciences, Nottingham Trent University, Brackenhurst Campus, Southwell, Nottinghamshire, United Kingdom

  • Philippe B. Wilson

    Roles Data curation, Formal analysis, Project administration, Resources, Validation, Writing – review & editing (AD); (PBW)

    Affiliations Rare Breeds Survival Trust, Stoneleigh Park, Stoneleigh, Warwickshire, United Kingdom, School of Animal, Rural and Environmental Sciences, Nottingham Trent University, Brackenhurst Campus, Southwell, Nottinghamshire, United Kingdom


The Cleveland Bay horse is one of the oldest equines in the United Kingdom, with pedigree data going back almost 300 years. The studbook is essentially closed and because of this, there are concerns about loss of genetic variation across generations. The breed is one of five equine breeds listed as “critical” (<300 registered adult breeding females) by the UK Rare Breeds Survival Trust in their annual Watchlist. Due to their critically endangered status, the current breadth of their genetic diversity is of concern, and assessment of this can lead to improved breed management strategies. Herein, both genealogical and molecular methods are combined in order to assess founder representation, lineage, and allelic diversity. Data from 15 microsatellite loci from a reference population of 402 individuals determined a loss of 91% and 48% of stallion and dam lines, respectively. Only 3 ancestors determine 50% of the genome in the living population, with 70% of maternal lineage being derived from 3 founder females, and all paternal lineages traced back to a single founder stallion. Methods and theory are described in detail in order to demonstrate the scope of this analysis for wider conservation strategies. We quantitatively demonstrate the critical nature of the genetic resources within the breed and offer a perspective on implementing this data in considered breed management strategies.


In recent years there has been substantial interest in quantifying the genetic diversity of equine breeds using pedigree [1], molecular data [2] or a combination of both sources [3] in order to implement effective breed management strategies. The effectiveness of the use of both data types in the understanding and management of rare and native equine breeds have been investigated using both theoretical modelling, and studies of closed studbooks.

The Cleveland Bay horse is a heritage British breed which has its origins in the Cleveland Hills of Northern England [4]. The first studbook was published in 1885, and this contains retrospective pedigrees of animals dating back to 1732 providing a closed non-Thoroughbred studbook dating back almost 300 years and for more than 38 generations. In addition, the breed Society now has a mandatory policy of microsatellite-based parentage testing at the time of registration. Unrestricted access to the microsatellite test data, as well as the stud book records provides a rare opportunity to evaluate both methods of assessing genetic diversity within the breed and, in turn, provides comprehensive guidance to breeders in terms of conservation practice for this endangered breed [5], whilst providing an important and potentially wide-ranging tool for wider conservation practices both in situ and ex situ in vivo.

The Cleveland Bay is a warm-blooded equine; a product of a cross of hot-blooded Oriental / Barb /Turkish or Mediterranean stock on the cold-blooded Northern European heavy draught horse [6]. It is reputed to have evolved in the matriline from the now extinct Chapman horse, which early records show were being bred on the monastic estates of the region well before the dissolution of the monasteries in the mid 16th century [4].

Although stated as being “free of blood” in the first three volumes of the studbook [7], early research into the founders of the breed recognised the contribution on the male side by some notable Thoroughbred stallions that were standing at stud or travelling in the region in the late 18th and early 19th Centuries [8].

Over the years the breed has been used extensively as both a work horse and a riding horse, and has been crossed with other breeds to produce carriage horses [9]. Indeed, at one time there was a separate breed society with its own studbook–The Yorkshire Coach Horse Society–for such animals [10]. Such has been the desirability of the pure Cleveland Bay for contributing weight carrying capacity when crossed with other equine breeds, that they have been exported globally [9]. In addition to North America, the breed has been exported to Australasia, Pakistan and Japan; a Cleveland Bay stallion stands at the Imperial stud [8].

The fashion for such effective cross-bred horses is one factor that brought the pure-bred Cleveland Bay horse to the edge of extinction. The substantial decrease in population size of the breed following the First World War when large numbers of Cleveland Bay horses were used to haul artillery on the battlefields of Northern Europe led to sustainability concerns regarding the remaining genetic resources of the breed [9]. The popularity of the breed continued to decline in the 1920s and 30s as the increasing use of motorised transport reduced the need for carriage horses. Moreover, following the technological developments of the Second World War, further mechanisation was implemented in farming practice and the purpose of the Cleveland Bay was further diminished [11].

In an attempt to improve the diversity of the home-based breeding population, the stallion Farnley Exchange was brought back from the United States of America (USA) in 1945 to stand at stud [9]. By the early 1960s there were only four stallions of breeding age left in existence and the breed is known to have gone through a genetic bottleneck at this time [8].

In the 1960s HM the Queen purchased the stallion Mulgrave Supreme, thus preventing his export, and stood him at public stud, both to promote, and help conserve the genetic diversity of the breed in the United Kingdom. Since that time the breed has seen a moderate recovery in numbers, partly because of patronage of the breed society by HM the Queen and the use of Cleveland Bay horses at the Royal Mews.

By the late 1990s, between 35 and 50 pure bred animals were being registered annually by the Cleveland Bay Horse Society (CBHS), whose studbook now includes animals being bred both in the United Kingdom, Europe, North America and Australasia [12].

The breed is one of only five equines listed as “Critical” by the UK Rare Breeds Survival Trust, indicating that the population has less than 300 breeding females. Earlier investigation of the CBHS Studbook records [7] indicated there were eight female ancestry lines existing within the breed.

A more recent study [13] restricted to animals entered in the CBHS studbook between 1934 and 1995, highlighted the limited genetic diversity in the breed and the increasing levels of inbreeding. It was recognised that further in-depth analysis of the status of the breed would be needed in order to aid in the development of breed management plans.

The aim of this study was to develop a comparative analysis of the genetic diversity in the Cleveland Bay Horse population using both genealogical and molecular methods and provide recommendations in order to support a global breed conservation strategy for the Cleveland Bay Horse, whilst sequentially detailing the theory and practice inherent in our approach leading to its applicability in the conservation of endangered breeds and species in vivo.

Materials and methods

Pedigree data

Summary data from the CBHS stud books volumes one to thirty eight was published in the Society’s Centenary studbook [7]. Names and studbook numbers of all registered horses together with date of birth, sire and dam were listed and this information was digitised in Filemaker™ (Filemaker Inc.), to construct an electronic pedigree database for the breed, stored in Filemaker format. Registrations post-1985 have been added to the database on an annual basis up to and including for this study, Volume 38 of the studbook.

The Cleveland Bay Horse Society provided access to a total of 535 microsatellite parentage testing reports. These had been obtained by commercial analysis of hair follicle samples taken from individual animals for registration verification. Samples were tested for a panel of 16 microsatellite markers approved by the International Society for Animal Genetics (ISAG) equine genetics group, by the Animal Health Trust (Newmarket, UK.). Close examination of stud book records, recent Breed Society census records and the microsatellite dataset enabled the identification of a reference population of 402 animals, registered in the 10-year period 1997 to 2006 for which both microsatellite and pedigree data was available.

Pedigree completeness

Data correction routines within the programmes Genes [14] and Eva [15] were used to identify pedigree errors and correct infinite loops. Calculation of Pedigree Completeness was made using PopRep [16]. Using Eqs 1 and 2 to compute pedigree completeness index [17] (Id): Eq 1 Eq 2

Where k represents the paternal (pat) or maternal (mat) line of an individual, and ai is the proportion of known ancestors in generation i; d is the number of generations measured when calculating the pedigree completeness. Values for pedigree completeness will range from 0 to 1. Where all of the ancestors of an individual are known to some specified generation (d) then Id = 1. However, where one of the parent animals is unknown, Id = 0 [16].

Generation interval

Generation Interval is defined as the average age of the parent animals at the birth of selected offspring with offspring subsequently producing at least one progeny [18]. The generation interval was calculated for each of the four possible lines of descent: sire to son; sire to daughter; dam to son and dam to daughter. The results were averaged for each year group using PopRep [16].

Founder and ancestor representation

Stallion and dam lines, defined respectively as: unbroken descent through male or female animals only from an ancestor to a descendant [3] were identified and detailed founder and ancestor analysis was performed using Endog 4.6 [19] to initially determine Number of Founders.

We make the assumption that all animals with two unknown parents are regarded as founders in this analysis [20]. In addition, if an animal has one known and one unknown parent, the unknown parent is regarded as a founder. The total number of founders contains limited information on the genetic basis for the population. Firstly, founders are assumed to be unrelated, as their parentage is unknown. However, this is most likely not the case in practice. Secondly, some founders have been used more intensely and therefore contribute more, in terms of genetic resource, to the current population than other founders.

The effective number of founders, ƒe, has been designed to correct for this second shortcoming [21] and is defined as the number of equally contributing founders that would be expected to produce the same genetic diversity as in the population under study. This is computed as: Eq 3

Where qk is the probability of gene origin of the kth founder and Nf the real number of founders. In a scenario where every founder makes an equal contribution, the effective number of founders will equal the actual number of founders.

It is more common for founders to contribute unequally, leading to fe < Nf. The genetic contributions will converge following 5 to 7 generations [22]. Once this convergence occurs, employing fe as a measure of genetic contribution, will have limited usefulness as will remain constant irrespective of later changes in the population. Pedigrees of more than 7 generations can be characterized with a high effective number of founders even after a severe, recent bottleneck [23]. Whilst the effective number of founders is not an absolute measure of genetic diversity, it forms a basis for comparison of the effective population size (Ne) and the effective number of ancestors (fa). In a population with minimum inbreeding, fe would be expected to be approximately equal to ½Ne [22]. Where fe diverges from this, there is compelling evidence that the breeding structure has been changed since the founder generation [24].

The Effective Number of Founder Genomes (ƒg) was proposed by Lacy (1989) to account for unequal founder contributions, random loss of alleles caused by genetic drift and for bottleneck events. It is computed by the equation: Eq 4

Where pi is the expected proportional genetic contribution of a founder i; ri is the expected proportion of alleles from founder i which remain in the current population, and c is the total number of contributing founders [21]. This gives an indication of the number of equally contributing founders with no loss of founder alleles, that would produce the same degree of diversity as found in a reference population [25]. The fg will be smaller than both fe and the effective number of ancestors (fa), even under minimum inbreeding pressure, and approximately equal to ½Ne. The scale of these differences is indicative of the degree of random loss of alleles. Alleles will be lost with every generation of a pedigree and thus fg will decrease as the depth of pedigree increases [24].

The Effective Number of Ancestors (ƒa) supplements fe and is calculated from the genetic contributions of ancestors with the largest marginal genetic contributions themselves [20]. Whilst genetic contributions of founders are independent and sum to unity, this is not the case for genetic contributions of ancestors. Indeed, the dam of a highly used sire has >50% contribution of her son, as the same genes are represented in both generations. Boichard et al. (1997) therefore introduced the marginal contribution to the pedigree genetic resource. The ancestors contributing most to the reference population are considered individually in a recursive process. For each round of the recursion, the ancestor with the highest contribution is chosen, and the contributions of all others are calculated conditionally on the contribution of the chosen ancestor. The marginal contribution is the genetic contribution from an individual after correcting for contributions of other ancestors already considered in the recursive process. The sum of marginal contributions of all ancestors will be equal to unity. Ancestors with a large marginal contribution to the reference population will correlate with individuals having genes passed through many descendants [24].

Assessment of the fa helps to account for the losses of genetic variability produced by the unbalanced use of individuals in terms of reproduction within breeding programmes. This is conventional in domestic equines, whilst also accounting for bottlenecks in the pedigree.

The parameter fa is computed as Eq 5 where qj is the marginal contribution of an ancestor j.

Inbreeding analysis

Inbreeding coefficients for each individual animal were calculated using ENDOG [19].

The Increase in Inbreeding (ΔF), is calculated for each generation using ENDOG 4.6 [19], by means of Eq 6. Eq 6 where Ft and Ft-1 are the average inbreeding of offspring and their parents, respectively [18].

The Average Relatedness Coefficient (AR) [26] describes the probability that a randomly chosen allele from the whole population in the pedigree belongs to the animal under study. This parameter was calculated using ENDOG 4.6 [19]. The Additive Relationship Coefficient (Ryz), is estimated for two animals through calculating the hypothetical coefficient of inbreeding of an animal produced by mating the two individuals, irrespective of the sex of these assumed parents. The additive relationship between the two animals is then calculated as twice the coefficient of inbreeding of the hypothetical offspring. Ryz = 2 Fx, where Fx is the coefficient of inbreeding of the hypothetical offspring of individual Y and individual Z. This additive relationship has a minimum value of zero and a maximum value of two. The Additive Relationship is twice the value of the coefficient of kinship. The kinship of any two individuals is identical to the inbreeding coefficient of their progeny if they were mated. It is the probability that alleles drawn randomly from gametes of each of the two individuals are identical by descent.

Effective population size

The Effective Population Size from the rate of inbreeding is computed using the classic equation Eq 7

Where the rate of inbreeding per generation is calculated using Eq 6.

The Effective Population Size from the number of parents is computed as Eq 8

Where Nm and Nf are the number of male and female parents, respectively [18]. This method assumes that the ratio of breeding males to breeding females is 1:1, and that all individuals have an equal opportunity to contribute their genetic material to the next generation. This is seldom the case in managed livestock populations and there is a tendency for this method to overestimate Ne [16].


Total DNA was isolated at the Animal Health Trust’s laboratories, from hair follicle samples following standard commercial procedures and as previously described [27]. A set of 16 microsatellites (ASB17 VHL20 HTG10 HTG4 AHT5 AHT4 HMS3 HMS6 HMS7 ASB23 LEX3 LEX33 ASB2 HTG6 HTG7 HMS2) were analysed in all the sampled individuals. The GENETIX program was used to carry out factorial correspondence analyses and associated calculations on 15 of these markers [28]. Although microsatellite LEX3 appears in the panel of markers recommended for equine parentage verification by the International Society for Animal Genetics it was excluded from the analysis in this study because it is located on the X chromosome and as such is not appropriate for this type of analysis.

The Average Number of Alleles per Locus (A), corrected in order to account for sample size using Hurlbert's rarefaction method (1971) can be shown as: Eq 9 where g is the specified sampled size for a collection containing N individuals, numbering Ni in the ith species.

Nei's minimum distance (Dm) and Nei's standard distance (Ds [29]) are computed according to Eqs 10 and 11, respectively. Eq 10 Eq 11 where fkk and fmm are the average coancestry between individuals belonging to population k or m, and fkm is the average coancestry between individuals belonging to populations k and m.

Population structure

F (fixation) statistics extend the study of inbreeding coefficients in the case of sub-divided populations [30]. The FIT refers to the inbreeding of individuals in the total population. Conversely, FIS describes the inbreeding of individuals within sub-populations. FST is not strictly a fixation index as it represents the correlation between two gametes taken at random in two sub-populations from the total population. It measures the degree of genetic differentiation of the sub-populations. The three indices are computed as in Eqs 12, 13 and 14, respectively Eq 12 Eq 13 and Eq 14 where f and F are, respectively, the mean coancestry and the inbreeding coefficient for the entire metapopulation, and, the average coancestry for the subpopulation, so that (1 –FIT) = (1 –FIS) (1 –FST) [31].

ENDOG [19] was used to calculate F statistics and Nei’s minimum distance [29]), D, the genetic distance between subpopulations i and j which is given by Eq 15 Eq 15

The programme TREX [32, 33] was used to construct phylogenetic trees to illustrate the structure from the distance matrix data.

Bayesian model-based clustering was conducted using the programme STRUCTURE v2.1 [34], to assign individuals to homogeneous clusters or populations K, from a user defined range. An admixture model was adopted, with a burn in of 104 and 104 iterations of each value of K from 2 to 25.


Pedigree completeness

The pedigree file included a total of 5422 animals, of which 2661 were male and 2761 were female. The reference population of 402 individual animals consisted of 193 male and 209 females for which microsatellite data as well as pedigree data was available.

The pedigree file was analysed to assess the number of fully traced generations for each individual, the maximum number of generations traced and the equivalent complete generations for each animal. The maximum number of traced generations was 36. Percentage average population completeness for each year of birth considering 1 through 6 generations are shown in Fig 1 with percentage population completeness for the reference population up to 6 generations being high (Table 1).

Fig 1. Percentage pedigree completeness over 6 generations.

Average percentage completeness (%) is shown as a factor of individual birth year.

Table 1. Pedigree completeness over 6 generations estimated from breed society records and pedigree recording data.

Average generation interval

Generation intervals for each of the four pathways (Table 2) ranged from 9.2 years to 10.0 years (sire-son and sire-daughter, respectively). The average generation interval for each breeding year (Fig 2) was found to range between 5.5 and 13 years, being at a minimum in the immediate post WW2 period 1946 to 1950, which coincides with the genetic bottleneck previously identified by Walling (1994).

Fig 2. Average generation interval for whole population calculated as the average age of parents at the birth of offspring which in turn produce the next generation of breeding individuals.

Founder and ancestor representation

A total of 11 stallion lines were identified in the pedigree. A single paternal ancestry line is present in the reference (living) population.

Analysis of the female members of the studbook identified a total of 17 dam lines. Nine of these maternal ancestry lines are present in direct descent in the living population. Three of these lines (2,4 & 9) are only represented, in direct female descent, by either a single individual or two individual animals (Table 3). The three most common maternal lines constitute 70% of the present female population. However, analysis of the relative contributions of the most influential maternal ancestry lines to the genome of the reference population reveals that some of the lines least well represented in direct descent in fact continue to make a substantial genetic contribution as shown in Table 3.

Table 3. Relative contributions of maternal ancestry lines to the evolution of the whole and reference (1997–2006) populations.

Analysis identified 194 founders in total of which 28 were represented in the reference population. The mean retention was 0.035. The number of founder genomes surviving was 6.285. Calculations on the same population show the founder genome equivalent to be 2.366 with the effective number of non- founders only 2.379. The proportion of ancestry known was 0.330 reflecting the fact that in early volumes of the studbook only a record of the sire of an individual animal was made. The Number of Ancestors contributing to the population was 424 and the number of ancestors describing 50% of the genome was 7 animals.

The number of Ancestors contributing to the Reference Population was calculated as 31 animals. The Effective Number of Founders/Ancestors [20] for the Reference Population were 40 and 9, respectively. The number of ancestors describing 50% of the genome of the living population was 3. Ancestors were selected following Boichard et al. (1997), while founders were selected by their individual Average Relatedness coefficient (AR).

Inbreeding analysis and effective population size

Across the whole analysed dataset, F = 7.8% with an associated mean average relatedness of 8.3%. Fig 3 shows Inbreeding and additive relationship coefficients by birth year between 1900 to 2006.

Fig 3. Inbreeding coefficient and additive genetic relationship 1900 to 2006 as a function of birth year of individuals.

The average rate of change of the additive genetic relationships between 1901 and 2009 for the Cleveland Bay Horse breed was 0.00202 per year based on the slope regression. This results in a Δf per generation of 0.02629 (Table 4). The rate of change of the average inbreeding coefficients based on slope regression between 1901 and 2009 was 0.00214, which represents a ΔF per generation of 0.02709. The effective population sizes for the Cleveland Bay Horse breed, based on Δf and ΔF were 19 and 18, respectively (Fig 4). The pattern of inbreeding during which the reference population was foaled and Effective population size, calculated based on both the rate of inbreeding and the number of parents are tabulated in Table 5 for the period 1997 to 2006 with data calculated using POPREP [16].

Fig 4. Effective population size from rate of change of inbreeding (grey series), and number of parents (black series) calculated with POPREP [16].

Table 4. Change in inbreeding coefficient and average relatedness for 7 fully traced generations.

Table 5. Inbreeding coefficients, F, and effective population size (Ne) of animals by birth year 1997–2006.

Microsatellite variation

The total number of alleles found for 15 microsatellite loci within the reference population was 93. The mean number of alleles per locus was 2 ranging from 4 to 10. The mean Observed Heterozygosity (Ho) ranged between 0.052(HTG7) and 0.716 (VHL20) the mean being 0.4486 whilst the mean Expected Heterozygosity (He) was 0.5341. The highest values for He were found for microsatellite LEX33 whilst the lowest were found for microsatellite HTG6 (Table 6).

Table 6. Summary statistics for the 15 microsatellite loci analysed.

Na represents the number of alleles; N, the sample size; Ho the Observed Heterozygosity; He the Expected Heterozygosity; HW the departure from Hardy-Weinberg equilibrium; F the Fixation Index; and Nm* the Gene flow estimated from FST = 0.25(1—FST)/FST.

Across the reference population there is complete heterozygosity. However, at subpopulation level 3 (Table 7), groups show homozygosity at multiple loci. Female Line 2 is 62.5% polymorphic with fixation at HMS3 and LEX3. Female Line 4 is 62.5% polymorphic with fixation of alleles at HMS3, ASB23, HTG4, HTG10 and LEX3. Female Line 8 is 93.75% polymorphic with fixation at LEX3.

Table 7. Summary of the microsatellite analysis results on a subpopulation by matriline basis and for the full dataset, where MNA represents the mean number of alleles per locus.

Allele frequencies are more restricted in populations 2, 4 and 9 (Fig 5), as is the expected heterozygosity. This will be influenced by the smaller membership and corresponding sample size for these subpopulations.

Fig 5. Summary statistics grouped by subpopulations, where Na represents the number of different alleles; Na (Freq > = 5%) the number of different alleles with a frequency ≥ 5%, Ne the number of effective alleles, which is equal to 1 / (Σpi2); I, the shannon and weaver information index, calculated as Σ(pi ln (pi)); No. private alleles, the number of alleles unique to a single population; No. LComm alleles (< = 25%), the number of locally common alleles (Freq. > = 5%) found in 25% or fewer populations; No. LComm alleles (< = 50%), the number of locally common alleles (Freq. > = 5%) found in 50% or fewer populations; He the Expected Heterozygosity; and UHe the Unbiased Expected Heterozygosity, estimated as He(2N / (2N-1)).

The analysis of allele frequencies identifies a significant number of gaps in the distribution of allele length or number of repeats. It has been reported that populations that have experienced genetic bottlenecks tend to exhibit such less cohesive distributions than stable populations [35].

Bottleneck analysis

The microsatellite allele frequency data was tested for departure from mutation-drift equilibrium with the software BOTTLENECK 1.2 [23]. The results of the three tests of heterozygosity excess (Infinite Allele Model, IAM; Stepwise mutation Model, SMM; and Two-Phase Mutation Model, TPM) are shown in Table 8 and the results of the test for null hypothesis under Sign Test, Standard Difference Test and Wilcoxon Test in Table 9.

Table 8. Bottleneck heterozygosity excess test results based on 16 identified loci, where n represents the sample size; and ko, the observed number of alleles under the assumption of mutation-drift equilibrium.

Table 9. Tests for null hypothesis under three microsatellite evolution models.

Under the Sign Test, the expected number of loci with heterozygosity excess were 8.93 (p = 0.00120) under IAM, 9.40 (p = .0.29262) under TPM, and 9.43 (p = 0.06923) under SMM. This suggests that the null hypothesis is rejected under IAM, but with p> 0.05 would appear to be met under the other two tests. Therefore, only under the IAM is there clear evidence of a recent bottleneck event.

The standard difference test gives T2 probability statistics of 3.186 (p = 0.00072) under IAM; 0.902 (p = 0.18357) under TPM and -4.294 (p = 0.00001) under SMM. Probability values of less than 0.05 for both IAM and SMM under these two models suggest a recent bottleneck event.

Under the Wilcoxon rank test the probability values were 0.00042 (IAM); 0.11560 (TPM) and 0.97116 (SMM), thus rejecting the null hypothesis under IAM.

Mode shift indicator

The Bottleneck software [23] provides an alternative method for detecting potential genetic bottleneck events in the Mode Shift Indicator. Populations that have not experienced a bottleneck will be at or near mutation drift equilibrium and will be expected to have a large proportion of alleles with low frequency [36]. This pattern will show as a normal, L shaped distribution when displayed graphically. Fig 6 shows that the Cleveland Bay data displays a normal L-shaped distribution at low allele size class, but deviates from it in the latter quartiles. This would suggest a population not completely at mutation drift equilibrium, and showing evidence of having experienced a genetic bottleneck in the recent past.

Fig 6. Allele distribution by size class.

Trendline describes a natural logarithmic relationship according to y = -0.188 ln(x) + 0.3836.

As both the data plot and the trend show that at the higher size classes there is some departure from the normal L-shaped distribution; the absolute assumption of accepting the null hypothesis should be treated with caution. Indeed, on initial examination, the results of the analysis with Bottleneck [23] appear far from conclusive. Initial assessment suggests that under the IAM all of the tests provide evidence of a recent bottleneck event. However, under TPM and SMM, the evidence is somewhat contradictory indicates some reservation to assessment of the suggested recent bottleneck. The mutation drift model deviation from normal L-shaped distribution supports the above assumption, however, this conflicting evidence suggests the reduction in population size in the 1950s was perhaps not as significant a bottleneck event as previously reported [9]. When the theory behind the various models is re-examined [36] it becomes evident that gene diversity excess has only been demonstrated for loci evolving under the Infinite Allele Model. Given that there is very strong evidence to support a recent bottleneck event under this model, which is supported by testing of microsatellite allele frequency data herein, it is likely that the Cleveland Bay horse has indeed experienced a recent genetic bottleneck.

Population structure

Wright F Parameters [37] reflecting departure from Hardy–Weinberg equilibrium were calculated from the pedigree analysis for the reference population in terms of FIS (-0.006677), FST (0.040230) and FIT (0.033821). Multilocus estimations of Wright’s F statistics [38] from the microsatellite data showed an across population distribution of the following: FIS (0.011362), FIT (0.029308), and FST (0.018153).

Distance matrices [39] were constructed from both pedigree and molecular analysis, and phlogenetic trees were constructed using TRex [33] showing the relative positions of each female ancestry line (Figs 7 and 8).

Fig 7. Neighbour joining tree showing relative genetic distance between subgroups from analysis of pedigree data assigned by female ancestry line.

Fig 8. Neighbour joining tree from microsatellite analysis showing distances between subpopulations by maternal ancestry line.

Both the pedigree distance analysis (Fig 7) and the molecular analysis (Fig 8) are suggestive of a population structure rooted on three sub-divisions, or clades. However, neither analysis provides conclusive evidence of the causes or nature of this division. In addition to the pairwise distance matrices constructed assuming 9 subgroups within the population, GENALEX 6.4 [40] was also used to construct the much larger matrix of Nei distance between individuals [39]. This matrix in Phylip format was imported into the cluster drawing programme SplitsTree4 [41] to construct a Neighbour-Net diagram. The Neighbour Net Diagram indicating a 2 clade subdivision of the population is shown at Fig 9 whilst a 3 clade subdivision is shown at Fig 10.

Fig 9. Neighbour-Net diagram of Nei genetic distance between individuals showing two clade model of structure.

Fig 10. Neighbour-Net diagram of Nei genetic distance between individuals showing three clade model of structure.

Examination of this net immediately suggests that the structure of the reference population could be explained by two broad groups or clades as shown in Fig 9. However, an alternative model with three clades, shown in Fig 10, is also possible.

Principal co-ordinate analysis via covariance matrix was conducted using Genalex 6.5 [40], with sub-populations assigned by both modern female and modern male ancestry lines, in order to examine alternative possible structuring of the reference population. Fig 11 presents the PCoA with subpopulations assigned by female ancestry and Fig 12 by male ancestry.

Fig 11. Principal coordinate analysis (PCoA) with subpopulations assigned by female ancestry across the two principal components (PCoA1, PCoA2).

Fig 12. PCoA with subpopulations assigned by male ancestry.

The PCoA analysis shows both male and female sub-populations distributed widely across principal axes, with little suggestion of structuring by sex group being the driving process of population sub-division in the microsatellite data. Variational Bayesian analysis of the microsatellite dataset, using the programme STRUCTURE [34] was carried out, in order to further investigate breed structure. 104 runs of the analysis were carried out for potential populations, K, numbering 2 to 25. The best fit of K appears at K = 3. Fig 13 provides a visual representation of this analysis for K = 2 to K = 4. There is a substantial increase in background noise in the display at K = 4, indicative that the number of clusters or sub-populations is below this level.

Fig 13. STRUCTURE analysis of population numbers K = 2 to K = 4.

Each colour is a representation of a population, with individuals shown as vertical lines, which are split into coloured segments; the lengths of these describe the admixture proportions from K populations.

Further analysis of the population structure was conducted using the programme BAPS [42]. 17 clusters within the microsatellite dataset were identified, with a highly significant probability of 0.99998.


The results presented herein highlight the significant losses of founder representation that have occurred in the Cleveland Bay Horse population across the past century. Approximately 91% of the stallion and 48% of the dam lines are lost in the reference population. The unbalanced representation of the founders is illustrated by the effective number of founder animals (fe) and the effective number of ancestors (fa). The parameter fe constitutes over a third of the equivalent number of founder animals for the reference population, whilst the ratio fa/fe is 22.5%. This ratio is substantially lower than that reported in other horse breeds such as 41.7% in the Andalusian [43] or 54.4% in the Lipizzan [44]. Additionally, this is lower than the figure of 38.2% reported for the endangered Catalonian donkey [26].

The values of the generation interval presented herein (Table 2) are common in equines and identical to those observed in the literature [22, 43]. Suggesting some lack of regularised control measures or quantitative breeding strategies on the part of breeders and a decrease in genetic gain, which is directly linked to the generation interval. Breeders should start programmes at a younger age and decrease breeding extent over time.

The average inbreeding computed for the Cleveland Bay Horse at 20.64% in the reference population is substantially higher than most of the values reported in the literature [43], with typical values ranging from 6.5% to 12.5%. Although most of these inbreeding values have been computed in breeds with deep pedigrees such as Andalusian, Lipizzan or Thoroughbred there are significant differences in population sizes, and the accumulation of inbreeding in populations of restricted size will occur at a greater rate.

The smaller the number of individuals in a randomly mating breed the greater will be the accumulation of inbreeding due to the restricted choice of mates. Furthermore, we see a smaller Ne with increasing ΔF. The Cleveland Bay horse is therefore predisposed to inbreeding and associated loss of genetic variation. In the reference population of 402 individuals the Effective Population Size (Ne) computed via individual increase in inbreeding was 27.84. Ne computed via regression on equivalent generations was 26.29. Inbreeding and genetic loss under random mating will occur at ½ Ne per generation. In the reference population, where Mean Ne is 32.32 under random mating, inbreeding can be expected to accumulate at 1.5% per generation.

This is reflected by the genealogical FIS values. This parameter characterises the mating policy derived from the departure from random mating as a deviation from Hardy–Weinberg equilibrium. Positive FIS values indicate that the average F value within a population exceeds the between-individuals coancestry, thus suggesting that matings between relatives have taken place [26]. Moreover, the average AR values computed for nine complete generations, (Table 3) are roughly equivalent the value of F. In an ideal scenario with random matings and no population subdivision, AR would be approximately twice the F value of the next generation [26].

Molecular information obtained in this study using microsatellite analysis suggests that genetic diversity within the breed is more restricted than has been reported in many other horse breeds and is based on an assessment of the tendency of genetic characteristics to vary accordingly (Table 10).

Table 10. Genetic variability from microsatellite DNA loci for Cleveland Bay and other domestic horse breeds.

Populations that have experienced a recent reduction in their Ne exhibit a correlative reduction of the allele numbers (k) and gene diversity (He) at polymorphic loci. However, the allele numbers reduce faster than the genetic diversity. Thus, in a recently bottlenecked population, the observed gene diversity is higher than the expected equilibrium gene diversity (He) which is computed from the observed number of alleles, k, under the assumption of a constant-size or equilibrium population [36]. The existence of a population bottleneck in the mid twentieth century, when the number of breeding age Cleveland Bay stallions was reduced to four, has previously been reported [12]. There is clear genetic evidence of this event shown in the excess of observed heterozygosity across subpopulations, with the exception of ancestry line nine. The latter is of more recent origin having evolved from a grading up scheme in the latter half of the twentieth century. In all other subgroups, the excess is positive ranging from 2.12% in Line 5 to 19.6% in Line 4. However, this investigation has revealed that lines two, four, and eight are in fact not polymorphic. The observed heterozygosity excess amongst the five polymorphic lines peaks in line one at 6.1%.

Microsatellite multilocus estimations of Wright’s F statistics [38] showed an across population FIS; FIT and FST of 0.01758, 0.02490, and 0.00745, respectively. This departure from random mating will have been influenced by a number of factors common to restricted populations of domesticated equines. These include: selection by breeders for particular lines of descent; natural differences in fertility between individuals; a restricted number of male animals leaving significantly more offspring than females (disproportionate male founding) and geographic distribution of animals and breeders leading to logistical difficulties in some matings. The reduced number of alleles and fixation at certain loci in female ancestry lines is evidence of loss of founder representation from these lines. This lower heterozygosity is also indicative of the typical practice of the larger studs, where breeding tends to be carried out in pasture by free live cover, with the use of only one stallion per year, per herd and where the same stallion may be retained for several breeding years. This strategy is compounded by breeders with only a small number of breeding females sending their animals to these groups or to be covered in hand by the same stallion.

This strategy has different implications for the genetic diversity of the Cleveland Bay Horse compared that of mares travelling to stud to be covered in hand by a greater range of stallions that do not have their own herds of mares [55]. as well as through trade or exchange, which will change geographic location albeit on an irregular basis. Although this latter practice has clear benefits in conservation programmes, there is the danger of inappropriate matings supplanting the more common and less frequent alleles. Whilst such matings increase the frequency of the rarer alleles, they simultaneously increase the frequency of those more common [56], highlighting the need for in-depth understanding of the genetic diversity of any rare breed, and for an effective management plan for conservation maintenance.

There has been considerable debate about the most effective methods of conserving and managing endangered populations [55]. Before the advent of mitochondrial and microsatellite DNA analysis, the accepted strategy involved minimizing inbreeding, whilst managing mean Kinship/average relatedness [57]. Moreover, the use of molecular methods has been proposed [58, 59]. Where pedigree data is robust and complete over a significant number of generations, it appears that genealogical data remains the preferred method by which to manage founder contributions, inbreeding and kinship/relatedness. Indeed Lacy has highlighted the problems caused in conservation programmes based on private or rare alleles [56].

Variational Bayesian analysis of within-population structure using microsatellite data shows significant evidence for three main clades. Although this study has been based on the use of pedigree and microsatellite marker data for the Cleveland Bay horse there is now firm evidence of the value of mitochondrial DNA for such investigations and an increasing number of investigations consider the origins and relatedness of modern equines (Table 10).

The Cleveland Bay horse has been reported to belong to haplotype C [48] which is common amongst older northern European breeds such as the Exmoor, Icelandic, Fjord, Connemara and Scottish Highland. This correlates with the assertion that in the matriline the Cleveland Bay has evolved from the Chapman; an ancient Northern European breed. The comparative studies have been based on five Cleveland Bay mtDNA sequences deposited in GeneBank by Cothran and Frankham within which there are three haplotypes. There is scope for further sampling of all of the existing matrilines to determine the number of haplotypes present in the reference population the level of correlation with the three Clades identified herein.


We have reported an in-depth genetic analysis of the Cleveland Bay Horse, using both pedigree and microsatellite data. It reveals substantial loss of genetic diversity and high levels or relatedness and inbreeding. The results of this study highlight the importance of the Cleveland Bay Horse community implementing an effective and sustainable breed management plan, such as management of Mean Kinship and Inbreeding Coefficients.


The authors thank the Breed Committee of the Cleveland Bay Horse Society for access to its microsatellite parentage testing records.


  1. 1. Petersen JL, Mickelson JR, Cleary KD, McCue ME. The american quarter horse: Population structure and relationship to the thoroughbred. Journal of Heredity. 2014. pmid:24293614
  2. 2. Cunningham EP. Molecular methods and equine genetic diversity. Conservation Genetics of Endangered Horse Breeds. 2005.
  3. 3. Cunningham EP, Dooley JJ, Splan RK, Bradley DG. Microsatellite diversity, pedigree relatedness and the contributions of founder lineages to thoroughbred horses. Anim Genet. 2001. pmid:11736806
  4. 4. Boyd MM. A plea for a more extended use of the system of live-stock registration. J Hered. 1907.
  5. 5. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000. pmid:10835412
  6. 6. Khanshour AM, Hempsey EK, Juras R, Cothran EG. Genetic characterization of Cleveland Bay Horse Breed. Diversity. 2019.
  7. 7. EMMERSON S. Cleveland Bay Horse Society Centenary Studbook. Clevel Bay Horse Soc. 1984.
  8. 8. Dent AA. Cleveland Bay Horses. JA Allen & Company, Limited; 1978.
  9. 9. Fairfax-Blakeborough J. Cleveland bay horse, its history, evolution and importance today. 1950.
  10. 10. Reese HH. Breeds of light horses. US Department of Agriculture; 1918.
  11. 11. Johnson D. Horse breeds. Voyageur Press; 2008.
  12. 12. WALLING G. Cleveland Bay Horse Society Studbook Vol XXXIII. Clevel Bay Horse Soc. 1994.
  13. 13. Khanshour AM, Juras R, Cothran EG. Microsatellite analysis of genetic variability in Waler horses from Australia. Aust J Zool. 2014;61: 357–365.
  14. 14. Lacy RC. Management of limited animal populations. Bottlenose dolphin reproduction workshop. 2000.
  15. 15. Ansari-Mahyari S, Berg P. Power of QTL mapping using both phenotype and genotype information in selective genotyping.
  16. 16. Groeneveld E, Westhuizen BD, Maiwashe A, Voordewind F, Ferraz JB. POPREP: a generic report for population management. Genet Mol Res. 2009. pmid:19866435
  17. 17. Maccluer JW, Boyce AJ, Dyke B, Weitkamp LR, Pfenning DW, Parsons CJ. Inbreeding and pedigree structure in standardbred horses. J Hered. 1983.
  18. 18. J-M G., Falconer DS. Introduction to Quantitative Genetics. Popul (French Ed. 1962.
  19. 19. Gutiérrez JP, Goyache F. A note on ENDOG: A computer program for analysing pedigree information. J Anim Breed Genet. 2005. pmid:16130468
  20. 20. Boichard D, Maignel L, Verrier É. The value of using probabilities of gene origin to measure genetic variability in a population. Genet Sel Evol. 1997.
  21. 21. Lacy RC. Analysis of founder representation in pedigrees: Founder equivalents and founder genome equivalents. Zoo Biol. 1989.
  22. 22. Bijma P, Woolliams JA. Prediction of genetic contributions and generation intervals in populations with overlapping generations under selection. Genetics. 1999. pmid:10049935
  23. 23. Cornuet JM, Luikart G. Description and power analysis of two tests for detecting recent population bottlenecks from allele frequency data. Genetics. 1996. pmid:8978083
  24. 24. Sørensen AC, Sørensen MK, Berg P. Inbreeding in danish dairy cattle breeds. J Dairy Sci. 2005. (05)72861–7 pmid:15829680
  25. 25. Lacy RC. Clarification of genetic terms and their use in the management of captive populations. Zoo Biol. 1995.
  26. 26. Gutiérrez JP, Marmi J, Goyache F, Jordana J. Pedigree information reveals moderate to high levels of inbreeding and a weak population structure in the endangered Catalonian donkey breed. J Anim Breed Genet. 2005. pmid:16274421
  27. 27. Khanshour A, Conant E, Juras R, Cothran EG. Microsatellite analysis of genetic diversity and population structure of Arabian horse populations. J Hered. 2013;104: 386–398. pmid:23450090
  28. 28. Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme F. GENETIX 4.05, Windows TM software for population genetics. Lab génome, Popul Interact CNRS Umr. 2004;5000.
  29. 29. Hill WG. Molecular Evolutionary Genetics. By Masatoshi Nei. New York: Columbia University Press. 1987. 512 pages. U.S. $50.00. ISBN 0 231 06320 2. Genet Res. 1988.
  30. 30. Wright S. Variability within and among natural populations. Evolution and the genetics of populations. 1978.
  31. 31. Caballero A, Toro MA. Analysis of genetic diversity for the management of conserved subdivided populations. Conserv Genet. 2002.
  32. 32. Makarenkov V. T-REX: Reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics. 2001. pmid:11448889
  33. 33. Alix B, Boubacar DA, Vladimir M. T-REX: A web server for inferring, validating and visualizing phylogenetic trees and networks. Nucleic Acids Res. 2012. pmid:22675075
  34. 34. Pritchard JK. Documentation for structure software: Version 2. 2. Statistics (Ber). 2007.
  35. 35. Chikhi L, Bruford M. Mammalian population genetics and genomics. Mammalian Genomics. 2004.
  36. 36. Luikart G, Cornuet JM. Empirical evaluation of a test for identifying recently bottlenecked populations from allele frequency data. Conserv Biol. 1998.
  37. 37. Weir BS, Cockerham CC. Estimating F-Statistics for the Analysis of Population Structure. Evolution (N Y). 1984. pmid:28563791
  38. 38. Weir BS, Hill WG. Estimating F-Statistics. Annu Rev Genet. 2002. pmid:12359738
  39. 39. Nei M. Molecular evolutionary genetics. Columbia university press; 1987.
  40. 40. Peakall R, Smouse PE. GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research-an update. Bioinformatics. 2012;28: 2537–2539. pmid:22820204
  41. 41. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution. 2006. pmid:16221896
  42. 42. Corander J, Waldmann P, Marttinen P, Sillanpää MJ. BAPS 2: Enhanced possibilities for the analysis of genetic population structure. Bioinformatics. 2004. pmid:15073024
  43. 43. Valera M, Molina A, Gutiérrez JP, Gómez J, Goyache F. Pedigree analysis in the Andalusian horse: Population structure, genetic variability and influence of the Carthusian strain. Livest Prod Sci. 2005.
  44. 44. Zechner P, Sölkner J, Bodo I, Druml T, Baumung R, Achmann R, et al. Analysis of diversity and population structure in the Lipizzan horse breed based on pedigree information. Livest Prod Sci. 2002.
  45. 45. Aberle K, Wrede J, Distl O. Analyse der Populationsstruktur des Süddeutschen Kaltbluts in Bayern. Berl Munch Tierarztl Wochenschr. 2004.
  46. 46. Fox-Clipsham LY, Brown EE, Carter SD, Swinburne JE. Population screening of endangered horse breeds for the foal immunodeficiency syndrome mutation. Vet Rec. 2011. pmid:22016514
  47. 47. Prystupa JM, Hind P, Cothran EG, Plante Y. Maternal lineages in native canadian equine populations and their relationship to the nordic and mountain and moorland pony breeds. J Hered. 2012. pmid:22504109
  48. 48. McGahern AM, Edwards CJ, Bower MA, Heffernan A, Park SDE, Brophy PO, et al. Mitochondrial DNA sequence diversity in extant Irish horse populations and in ancient horses. Anim Genet. 2006. pmid:16978181
  49. 49. Brinkmann L, Gerken M, Riek A. Adaptation strategies to seasonal changes in environmental conditions of a domesticated horse breed, the Shetland pony (Equus ferus caballus). J Exp Biol. 2012. pmid:22399650
  50. 50. Gu J, Orr N, Park SD, Katz LM, Sulimova G, MacHugh DE, et al. A genome scan for positive selection in thoroughbred horses. PLoS One. 2009. pmid:19503617
  51. 51. Achmann R, Curik I, Dovc P, Kavar T, Bodo I, Habe F, et al. Microsatellite diversity, population subdivision and gene flow in the Lipizzan horse. Anim Genet. 2004. pmid:15265067
  52. 52. Luís C, Juras R, Oom MM, Cothran EG. Genetic diversity and relationships of Portuguese and other horse breeds based on protein and microsatellite loci variation. Anim Genet. 2007. pmid:17257184
  53. 53. Cañon J, Checa ML, Carleos C, Vega-Pla JL, Vallejo M, Dunner S. The genetic structure of Spanish Celtic horse breeds inferred from microsatellite data. Anim Genet. 2000. pmid:10690360
  54. 54. Juras R, Cothran EG, Klimas R. Genetic Analysis of Three Lithuanian Native Horse Breeds. Acta Agric Scand—Sect A Anim Sci. 2003.
  55. 55. Luís C, Cothran EG, Oom MDM. Inbreeding and genetic structure in the endangered Sorraia horse breed: Implications for its conservation and management. J Hered. 2007. pmid:17404326
  56. 56. Lacy RC. Should we select genetic alleles in our conservation breeding programs? Zoo Biology. 2000.
  57. 57. Mills LS, Ballou JD, Gilpin M, Foose TJ. Population Management for Survival and Recovery: Analytical Methods and Strategies in Small Population Conservation. J Wildl Manage. 1997.
  58. 58. Pearl MC. Research Techniques in Animal Ecology Methods and Cases in Conservation Science. J Wildl Manage. 2000.
  59. 59. Fraser DJ, Bernatchez L. Adaptive evolutionary conservation: Towards a unified concept for defining conservation units. Molecular Ecology. 2001. pmid:11903888