Tracing Cattle Breeds with Principal Components Analysis Ancestry Informative SNPs

The recent release of the Bovine HapMap dataset represents the most detailed survey of bovine genetic diversity to date, providing an important resource for the design and development of livestock production. We studied this dataset, comprising more than 30,000 Single Nucleotide Polymorphisms (SNPs) for 19 breeds (13 taurine, three zebu, and three hybrid breeds), seeking to identify small panels of genetic markers that can be used to trace the breed of unknown cattle samples. Taking advantage of the power of Principal Components Analysis and algorithms that we have recently described for the selection of Ancestry Informative Markers from genome-wide datasets, we present a decision tree which can be used to accurately infer the origin of individual cattle. In doing so, we present a thorough examination of population genetic structure in modern bovine breeds. Performing extensive cross-validation experiments, we demonstrate that 250-500 carefully selected SNPs suffice to achieve close to 100% prediction accuracy of individual ancestry when this particular set of 19 breeds is considered. Our methods, coupled with the dense genotypic data that is becoming increasingly available, have the potential to become a valuable tool with considerable impact on worldwide livestock production. They can inform the design of studies of the genetic basis of economically important traits in cattle, as well as breeding programs and efforts to conserve biodiversity. Furthermore, the SNPs that we have identified can provide a reliable solution for the traceability of breed-specific branded products.


The Singular Value Decomposition and Principal Components Analysis
We briefly describe the Singular Value Decomposition (SVD) of matrices and the related Principal Components Analysis (PCA). Given m subjects and n SNPs (m ≤ n), let the m × n matrix A denote the subject-SNP matrix, encoded as described above. After mean-centering the columns (SNP genotypes) of A, the (thin) SVD of the matrix returns m pairwise orthonormal vectors u_i (of length m), m pairwise orthonormal vectors v_i (of length n), and m non-negative singular values σ_i such that σ_1 ≥ σ_2 ≥ ... ≥ σ_m ≥ 0. The matrix A may be written as a sum of outer products as

A = Σ_{i=1}^{m} σ_i u_i v_i^T.

Each triplet (σ_i, u_i, v_i) may be used to form a principal component of A. Formally, the i-th most significant principal component of A is the rank-one matrix σ_i u_i v_i^T. In our setting, the left singular vectors (the u_i's) are linear combinations of the columns (SNPs) of A and will be called eigenSNPs [2]. Notice that a principal component is a matrix, whereas an eigenSNP is just a column vector. PCA is a well-known dimensionality reduction technique that, in this case, represents all subjects with respect to a small number of eigenSNPs, corresponding to the top few principal components.
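As an illustrative sketch (not the authors' code), the eigenSNPs and the subjects' coordinates along the top principal components can be obtained from a thin SVD of the mean-centered genotype matrix. The 0/1/2 allele-count encoding and all names below are assumptions for the example:

```python
import numpy as np

def top_eigensnps(A, k):
    """Return each subject's coordinates along the top-k principal components.

    A is an m x n subjects-by-SNPs matrix with numerically encoded genotypes
    (assumed here to be 0/1/2 allele counts).
    """
    # Mean-center each column (SNP genotype), as described in the text.
    Ac = A - A.mean(axis=0)
    # Thin SVD: Ac = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    # The i-th eigenSNP is the left singular vector u_i; scaling u_i by
    # sigma_i gives the subjects' coordinates along the i-th component.
    return U[:, :k] * s[:k]

# Toy example: 6 subjects genotyped at 4 SNPs.
rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(6, 4)).astype(float)
print(top_eigensnps(A, 2).shape)  # (6, 2)
```

Note that summing all m rank-one terms σ_i u_i v_i^T recovers the mean-centered matrix exactly; keeping only the top k terms gives its best rank-k approximation.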

Determining the number of significant principal components
The PCAIM selection procedure that was used in our work requires as input the number of significant principal components. Statistical tests of significance such as the Tracy-Widom test [1] or other tests [5,4] do not seem suitable for our goal since, for example, when analyzing the data corresponding to the top node in Figure 1 (main text) we are only interested in assigning a sample to one of three broad groups. Towards that end, the top two principal components contain sufficient information, and lower-order principal components, while statistically significant, tend to differentiate cattle breeds at a finer granularity than necessary at this level. Thus, we chose to determine the significance of a principal component by measuring its contribution to separating the cattle groups that correspond to the relevant node of the decision tree. In order to achieve this, we performed the following experiment: given the data at each node of our decision tree (say m samples genotyped on n SNPs), we ran a complete leave-one-out validation experiment using our 5-NN classification algorithm and k principal components, with k varying from one up to ten. That is, we systematically leave out one of the m individuals (test set) and seek to predict its "ground truth" breed using the remaining m − 1 individuals (training set). To do so, we project all m individuals on the top k eigenSNPs and identify the five nearest neighbors of the individual in the test set. By repeating this experiment m times (once for each of the m individuals) we compute two statistics: the classification accuracy and the average number of correctly predicted nearest neighbors (see online material for the numerical results for each value of k at each node of the decision tree). We chose to set k (the number of significant principal components) to the value that minimizes the number of misclassifications and maximizes the average number of correct nearest neighbors.
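The leave-one-out 5-NN experiment above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `scores` stands for the m subjects projected on the top k eigenSNPs, and all names are illustrative:

```python
import numpy as np

def loo_5nn(scores, labels):
    """Leave-one-out 5-NN validation on PCA coordinates.

    Returns the two statistics described in the text: the classification
    accuracy (strict-majority rule) and the average number of nearest
    neighbors belonging to the correct breed.
    """
    scores = np.asarray(scores, dtype=float)
    m = len(labels)
    correct_cls = 0
    correct_nbrs = 0
    for i in range(m):
        d = np.linalg.norm(scores - scores[i], axis=1)
        d[i] = np.inf                      # exclude the test individual itself
        nn = np.argsort(d)[:5]             # its five nearest neighbors
        hits = sum(labels[j] == labels[i] for j in nn)
        correct_nbrs += hits
        if hits >= 3:                      # at least three of five agree
            correct_cls += 1
    return correct_cls / m, correct_nbrs / m

# Tiny demo: two well-separated synthetic groups of six samples each.
pts = np.vstack([np.zeros((6, 2)), 100 + np.zeros((6, 2))])
labels = ["A"] * 6 + ["B"] * 6
print(loo_5nn(pts, labels))  # perfect separation: (1.0, 5.0)
```

In the procedure described above, this experiment would be repeated for each k from one to ten (recomputing `scores` for each k) and the value achieving the min-max objective retained.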
In the online supporting material we provide detailed tables with the results of the above min-max objective for each value of k from one to ten at each node of the decision tree of Figure 1 (main text). Interestingly, for most of the nodes studied here, a unique value of k achieved this min-max objective. Table 1 in the main text (column 1) indicates the number of components that were deemed significant using the above min-max objective. An immediate observation is that the number of principal components that are significant at each node never exceeds four. In the online supporting material we also provide plots of all eigenSNPs (principal components) that were deemed significant via our analysis. In these plots, it is easy to observe that in most cases the significant eigenSNPs clearly separate the targeted subsets of breeds under study at each level.

Accuracy metrics
We report two statistics. First, in a cross-validation experiment using the 5-NN classifier, we consider a sample's breed of origin to be accurately predicted if and only if at least three out of the five nearest neighbors belong to the correct breed. The following definition is now immediate.

Definition 1 [Classification Accuracy (C_ACC)]
The classification accuracy is defined as the percentage of the number of individuals in the test set whose breed of origin was accurately predicted.
We should note that if an individual with breed of origin X had two nearest neighbors in X and the remaining three in, say, three distinct breeds Y_1, Y_2, and Y_3, we would consider the individual incorrectly predicted. Thus, our accuracy metric is quite stringent, since it necessitates a strict majority. Our second metric rectifies the situation by measuring the average (over all individuals in the test set) number of nearest neighbors that belong to the correct breed of origin.

Definition 2 [Average number of correct neighbors (NN_AVG)]
The average number of correct neighbors is defined as the number of nearest neighbors that belong to the correct breed of origin, averaged over all individuals in the test set.
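Given the neighbor labels produced by the 5-NN classifier, both metrics can be computed directly. The following is a minimal sketch with illustrative names (not the authors' code), reproducing the stringent edge case noted above:

```python
def accuracy_metrics(neighbor_labels, true_labels):
    """Compute C_ACC and NN_AVG from 5-NN results.

    neighbor_labels[i] lists the breeds of the five nearest neighbors of
    test individual i; true_labels[i] is its actual breed of origin.
    """
    # Number of the five neighbors that match the true breed, per individual.
    hits = [sum(nb == t for nb in nbs)
            for nbs, t in zip(neighbor_labels, true_labels)]
    c_acc = sum(h >= 3 for h in hits) / len(hits)   # strict-majority rule
    nn_avg = sum(hits) / len(hits)                  # average correct neighbors
    return c_acc, nn_avg

# Edge case from the text: two neighbors in X, three in distinct breeds.
print(accuracy_metrics([["X", "X", "Y1", "Y2", "Y3"]], ["X"]))  # (0.0, 2.0)
```

The printed pair shows why the two metrics complement each other: the individual counts as misclassified under C_ACC, yet NN_AVG still credits the two correct neighbors.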