The authors have declared that no competing interests exist.
Conceived and designed the experiments: PD RD. Performed the experiments: PD. Analyzed the data: PD SM. Wrote the paper: PD RD SM.
¶ A full list of participants and their affiliations is available in
Genomic screening for chromosomal abnormalities is an important part of quality control when establishing and maintaining stem cell lines. We present a new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data. In contrast to other methods designed for identifying copy number variations in a single sample or in a sample composed of a mixture of normal and tumor cells, this new method is tailored for determining differences between cell lines and the starting material from which they were derived, which allows us to distinguish between normal and novel copy number variation. We implemented the method in the freely available BCFtools package and present results based on induced pluripotent stem cell lines obtained in the HipSci project.
Induced pluripotent stem (IPS) cells can be generated from adult tissues and differentiated into specific cell types. The reprogramming process involves clonal selection and cell line passaging, and is known to accumulate genomic aberrations, such as point mutations (an estimated six protein-coding mutations per cell line on average), de novo copy number variations (CNVs) and aneuploidy [
Genotype array data provides measurements at each of hundreds of thousands of genetically variable sites. The raw measurements consist of two intensity signals, one for each allele, which are subsequently transformed into the log-scaled ratio of the observed and the expected intensity (LRR), and the B Allele Frequency (BAF) which captures the relative contribution from one allele (B) to the fluorescent signal. An example is given in
Each dot in the graphs represents a single marker; the gap in the middle corresponds to a centromere, which is not targeted by the chip. The top graph shows the LRR values, which are centered around 0 in diploid regions (CN2), elevated in duplicated regions (CN3), and lowered in deletions (CN1). The bottom graph shows the corresponding BAF values, which cluster into three bands in diploid regions, into four in CN3 regions, and into two in CN1 regions, as explained further in the text.
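The band counts quoted in the caption follow from simple allele counting: with n copies of a locus, a marker can carry between 0 and n copies of the B allele. The following is a minimal sketch of that idealised picture; the function name is hypothetical and the sketch ignores noise, dye bias and mixtures of cell populations.

```python
from fractions import Fraction


def expected_baf_bands(copy_number):
    """Idealised BAF cluster centres for a region with the given copy number.

    With `copy_number` copies of the locus, a marker carries 0..copy_number
    copies of the B allele, so the bands sit at b / copy_number.
    """
    if copy_number < 1:
        raise ValueError("BAF is undefined when no copies of the locus are present")
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]


for cn in (1, 2, 3):
    bands = expected_baf_bands(cn)
    print(f"CN{cn}: {len(bands)} bands at", ", ".join(str(b) for b in bands))
```

In real data the bands are broadened by noise, and a mixture of normal and aberrant cells moves the heterozygous bands to intermediate positions.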
A number of free and commercial programs have been developed for detecting CNVs which employ t-tests and standard deviations of the LRR values [
Although determining the absolute copy number state can be challenging, the problem of cell line screening is easier. Here, we are interested in detecting differences between a cell line and an independent control sample from the same individual from which the cell line was made, for example to screen out lines with major genomic aberrations. This problem is similar to the one encountered in cancer studies, where data from a tumour sample are compared to a control from non-cancerous tissue. A number of sophisticated algorithms have been developed for this type of problem [
In this paper, we present a method for screening cultured cell lines for genomic abnormalities. The method consists of two programs implemented in BCFtools [
Large aberrations which affect whole chromosomes, such as aneuploidy or contamination, can be discerned directly from the overall distribution of BAF values (
Unscaled (A) and scaled (B-D) distributions of BAF values typical for the copy number states 2-4. In (C) we infer that 33% of the cells are aneuploid copy number 3, and in (D) we infer a sample with 20% contamination. The black line is the complete BAF distribution over 0 to 1 of which only part is modelled (shown in red); the green line is the best fit to the red part of the distribution. The model does not include the RR peak and including the AA peak is optional.
In order to determine the copy number state, the method employs the Levenberg-Marquardt algorithm [
The fitting is performed individually for each copy number state and then the most likely state is selected according to the goodness of fit. Because multiple peaks are always an equal or a better fit to the experimental data than a single peak, an aberrant copy number state is accepted only if the absolute deviation of the fit to the data is smaller than a given percentage of the single peak deviation (the default is 30%). When the data deviate strongly from the expected distributions or none of the alternatives is markedly better than the others, the program reports a failure to assign the copy number state.
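The acceptance rule described above can be sketched as follows. This is an illustration only, not the BCFtools implementation: the function name and the dictionary layout are hypothetical, the deviations are assumed to have been computed by a separate fitting step, and the "report a failure" branch for ambiguous or poorly fitting data is not modelled.

```python
def select_copy_number(fit_deviation, max_ratio=0.3):
    """Pick a copy number state from per-state fit deviations.

    fit_deviation maps a copy number (2, 3, 4) to the absolute deviation of
    the best fit for that state; key 2 is the single-peak (diploid) model.
    An aberrant state is accepted only if its deviation is smaller than
    max_ratio times the single-peak deviation (0.3 mirrors the 30% default
    quoted in the text).
    """
    baseline = fit_deviation[2]
    accepted = {
        cn: dev
        for cn, dev in fit_deviation.items()
        if cn != 2 and dev < max_ratio * baseline
    }
    if not accepted:
        return 2  # no alternative is convincingly better than a single peak
    return min(accepted, key=accepted.get)


# Made-up deviations for illustration.
print(select_copy_number({2: 1.0, 3: 0.5, 4: 0.6}))    # multi-peak fits are better, but not by enough
print(select_copy_number({2: 1.0, 3: 0.2, 4: 0.25}))   # CN3 passes the 30% threshold
```

Requiring a large relative improvement, rather than simply taking the smallest deviation, compensates for the extra freedom that multi-peak models have.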
Finally, an additional constraint is put on the minimum fraction of aberrant cells to assure clean separation of the side peaks, which helps to reduce the number of false calls with noisy data. As shown in Results, a practical threshold is 20% but this can be set higher when higher specificity is required.
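To illustrate why a small fraction of aberrant cells gives poorly separated side peaks, the sketch below relates the position of the heterozygous BAF peaks to the fraction of trisomic cells. It assumes a simple mixture of diploid (AB) and trisomic (ABB or AAB) cells in which every allele copy contributes equally to the signal; this is an assumption made here for illustration and is not necessarily the parameterisation used in BCFtools. The function names are hypothetical.

```python
def upper_peak_from_fraction(f):
    """Upper heterozygous BAF peak when a fraction f of cells is trisomic.

    B-allele copies: (1 - f) * 1 + f * 2 = 1 + f
    Total copies:    (1 - f) * 2 + f * 3 = 2 + f
    The lower peak is the mirror image, 1 / (2 + f).
    """
    return (1.0 + f) / (2.0 + f)


def trisomy_fraction_from_peak(upper_peak):
    """Invert the relation above: f = (2b - 1) / (1 - b)."""
    if not 0.5 <= upper_peak <= 2.0 / 3.0 + 1e-12:
        raise ValueError("upper peak must lie between 1/2 and 2/3 in this model")
    return (2.0 * upper_peak - 1.0) / (1.0 - upper_peak)


def passes_min_fraction(f, min_fraction=0.2):
    """Apply the minimum-fraction constraint (0.2 mirrors the 20% in the text)."""
    return f >= min_fraction


for f in (0.1, 0.2, 0.33, 1.0):
    b = upper_peak_from_fraction(f)
    print(f"f={f:.2f}  upper peak={b:.3f}  shift from 0.5={b - 0.5:.3f}  "
          f"accepted={passes_min_fraction(f)}")
```

Under this model a 10% trisomic fraction moves the peaks by less than 0.03 from 0.5, which is comparable to the BAF noise levels reported in the simulations, whereas 20% gives a shift of about 0.045.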
For smaller localised copy number variations we employ hidden Markov models (HMMs) similarly to the work of others [
The Viterbi algorithm is used to find the most likely copy number state in a region, and the average posterior probability calculated by the forward-backward algorithm is used to assign a quality score to each call. We also build a two-sample HMM whose state space is the product of the basic HMM state spaces of the two samples (16 states) and which favors joint transitions, so that the states of the two samples stay the same except where there are mutations. This has the desired effect of ignoring population variation shared by both samples.
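For readers unfamiliar with the decoding step, the following is a minimal, generic log-space Viterbi sketch applied to a toy two-state example (CN2 versus CN3). Everything specific in it is an assumption for illustration: the emission model uses only the distance of heterozygous BAF values from 0.5, and the standard deviation, transition and initial probabilities are made-up values. The actual method also uses LRR, more states and distance-dependent transitions; the two-sample version would run the same recursion over pairs of states.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_init[s]     : log-probability of starting in state s
    log_trans[p][s] : log-probability of moving from state p to state s
    log_emit[t][s]  : log-probability of observation t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def log_gauss(x, mean, sd):
    """Log-density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))


# Toy data: BAF at heterozygous markers; state 0 = CN2, state 1 = CN3.
bafs = [0.50, 0.48, 0.52, 0.66, 0.35, 0.68, 0.33, 0.51, 0.49]
split = [0.0, 1.0 / 6.0]   # expected |BAF - 0.5|: 0 for CN2, 1/6 for CN3
sd = 0.05                  # assumed BAF noise
log_emit = [[log_gauss(abs(b - 0.5), m, sd) for m in split] for b in bafs]
log_trans = [[math.log(0.99), math.log(0.01)],
             [math.log(0.01), math.log(0.99)]]
log_init = [math.log(0.9), math.log(0.1)]

path = viterbi(log_init, log_trans, log_emit)
print(path)  # [0, 0, 0, 1, 1, 1, 1, 0, 0]
```

The four central markers are assigned to CN3 because their combined emission evidence outweighs the cost of switching state twice; a single deviating marker would not.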
Following the assumption that the BAF values are normally distributed, we set the probability of observing the BAF value
The central plot shows LRR correlation across the whole genome. The right hand plot shows LRR correlation across chromosome 17, which contains a large 40 Mb duplication. The plots are 2-D histograms with hexagonal bins; a logarithmic greyscale indicates the number of markers in a bin.
Sites at which no call was made by the genotyping software may derive from the locus being missing entirely in the sample (
In the case of two-sample calling, the emission probability is
The HMM transition matrix
In order to account for the fact that the markers are unevenly distributed on chromosomes, the transition matrix
The HMM states are initially set to prefer the normal copy number state CN2:
The calculation of emission probabilities relies on several parameters. Some are general statistical descriptors that can be readily calculated from the data, such as the standard deviation
The performance of the methods was evaluated on simulated data and compared to the popular PennCNV caller.
The programs were then applied to genotyping data from cell lines generated in the HipSci project [
Unless explicitly stated otherwise, both programs were run with default parameters. PennCNV was tested with both its
As shown in
In cases where the main source of the LRR variation is random statistical noise, the data can be cleaned by applying a moving average [
Based on these observations we can expect the CNV calling method to be less robust than the polysomy method, because it takes as input both BAF and LRR values. Also rather than working with whole distributions of values, it takes into consideration signals from each marker individually, making it even more sensitive to input data noise. Therefore, less significance was given to the relative contribution of LRR when calling from real data. The default values (
The simulations focused on the CNV calling method with the aim to determine the specificity and the sensitivity of the caller under different conditions.
The size limit for detecting CNVs is dictated by the spacing of the markers on the genotyping array (the median of which is 2 kb in the case of the 0.5M CoreExome array) and the percentage of segregating heterozygous genotypes (which was 15% in our data). The simulated genotypes were assigned to the markers randomly according to RR, RA and AA frequencies observed in real data. Signal intensities were then sampled from a normal distribution using variation observed in real data, which ranged from 0.03 to 0.08 for BAF and from 0.13 to 0.26 for LRR. Ten thousand simulated experiments were performed, producing 23,033 simulated aberrations which ranged from 18 kb to 4 Mb (1st and 99th percentile) with a median of 1.7 Mb. The false positive and false negative counts were calculated based on overlaps of arbitrary length. In addition, the missed length and the length of falsely called regions were analysed.
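The sampling step of such a simulation can be sketched as below. This is not the simulation code used in the study: the function name is hypothetical, only a diploid background is generated (no aberrations are inserted), clipping BAF to the interval 0 to 1 is an assumption, and the split between RR and AA frequencies is a made-up value; only the 15% heterozygous fraction and the noise ranges come from the text.

```python
import random


def simulate_markers(n_markers, genotype_freqs, baf_sd, lrr_sd, seed=0):
    """Draw (genotype, BAF, LRR) triples for a diploid region.

    Genotypes are sampled according to genotype_freqs; BAF is drawn from a
    normal distribution centred on the value expected for the genotype and
    LRR from a normal distribution centred on 0.
    """
    rng = random.Random(seed)
    genotypes = list(genotype_freqs)
    weights = [genotype_freqs[g] for g in genotypes]
    expected_baf = {"RR": 0.0, "RA": 0.5, "AA": 1.0}
    markers = []
    for _ in range(n_markers):
        gt = rng.choices(genotypes, weights=weights)[0]
        baf = min(1.0, max(0.0, rng.gauss(expected_baf[gt], baf_sd)))  # clipped (assumption)
        lrr = rng.gauss(0.0, lrr_sd)
        markers.append((gt, baf, lrr))
    return markers


# 15% heterozygous as in the text; the RR/AA split is hypothetical.
freqs = {"RR": 0.60, "RA": 0.15, "AA": 0.25}
markers = simulate_markers(1000, freqs, baf_sd=0.05, lrr_sd=0.2, seed=1)
het = sum(1 for gt, _, _ in markers if gt == "RA")
print(f"{het} of {len(markers)} simulated markers are heterozygous")
```

A deletion or duplication could then be simulated by shifting the expected BAF and LRR values for a contiguous run of markers before adding the noise.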
Of the 23,033 simulated regions, BCFtools and PennCNV failed to recognise 1.7% and 1.8%, respectively. After the raw PennCNV calls were cleaned using its recommended post-processing script, the number of missed calls decreased to 1.6%. The false discovery rate was 0.3% for BCFtools, 1.2% for PennCNV and 1.1% for the cleaned PennCNV calls. In general, duplications were more difficult to call and formed a larger fraction of both false negatives (BCFtools 55%, PennCNV 73%) and false positives (BCFtools 100%, PennCNV 95%). BCFtools required at least four markers to call a deletion, capturing 99.9% of deletions which spanned ten markers or more and 98.8% of duplications which spanned four heterozygous markers or more (
BCFtools was more robust to input data noise and the error rate remained relatively constant throughout the whole range of LRR and BAF values (
The set of 48 samples included 12 trisomies, all of which were also called from the 2.5M Omni array data. The estimates of trisomy frequencies ranged from 18% to 91% of affected cells and were highly consistent (
Fraction of contaminating cells in the sample (red circles) and the fraction of cells with trisomies (green triangles) as estimated from the default (0.5M sites) and higher density chip (2.5M sites). All outliers in this figure are unconfirmed contaminations and chromosomes failing goodness of fit criteria as explained in the text.
Of the 8 samples which were tested for signs of contamination, 4 were confirmed and 4 were not, possibly because the contamination occurred at a different stage of sample processing (see
The estimates of contamination frequencies in the confirmed samples ranged from 10% to 43% of foreign cells, and were also highly consistent (
Although we cannot estimate true false positive and false negative rates from our data, we can report the fraction of calls confirmed by the higher density chip. The median distance of the markers on the 2.5M array is 600bp which allows us to identify false calls caused by short runs of homozygous markers and check the consistency of the calls across two platforms. In the comparisons that follow, we exclude calls which overlap centromeres as they are not targeted by the arrays.
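The cross-platform comparison amounts to counting calls that intersect a call from the other array, after discarding calls that overlap excluded regions. A minimal sketch, assuming calls are given as (chromosome, start, end) tuples with half-open coordinates; the function names and all coordinates below are made up for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share any base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def confirmed_fraction(calls, reference_calls, excluded=()):
    """Fraction of calls intersecting at least one reference call.

    Calls overlapping any excluded region (e.g. a centromere) are dropped
    before counting. Returns None when no calls remain.
    """
    kept = [c for c in calls if not any(overlaps(c, x) for x in excluded)]
    if not kept:
        return None
    confirmed = sum(1 for c in kept if any(overlaps(c, r) for r in reference_calls))
    return confirmed / len(kept)


# Hypothetical calls from the lower-density and the higher-density array.
calls_low = [("1", 1_000_000, 1_200_000),
             ("2", 500_000, 650_000),
             ("3", 90_000_000, 91_000_000)]
calls_high = [("1", 1_050_000, 1_300_000)]
centromeres = [("3", 90_500_000, 93_000_000)]

print(confirmed_fraction(calls_low, calls_high, centromeres))  # 0.5
```

Any overlap counts as confirmation here, which matches the "overlaps of arbitrary length" criterion used for the simulations; a stricter reciprocal-overlap threshold would lower the reported fraction.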
As expected, the CNV calling method was much less robust than the polysomy method when applied to real data. BCFtools made 315 calls of which 78.4% intersected with a call made from the 2.5M Omni array data. From the same data PennCNV made 7041 calls of which 61.9% were reproduced. We then performed the same test with QuantiSNP [
Many of the false calls can be effectively prevented by running BCFtools in pairwise mode, where a prior is set on the control and the query samples sharing the same copy number state. The effect of the prior is shown in
BCFtools/polysomy and BCFtools/cnv have been used as part of the quality control pipeline for the HipSci project, which is making hundreds of human induced pluripotent cell lines (
We have presented a method for checking genomic integrity in cultured cell lines using whole-genome SNP genotyping data. The method consists of two algorithms implemented in BCFtools/polysomy and BCFtools/cnv commands, the first for detection of whole chromosome aberrations and contamination, and the second for detecting sub-chromosomal aberrations.
The program was evaluated extensively and compared to the popular PennCNV caller. Both programs performed similarly well when run in single-sample mode on simulated data, with BCFtools having slightly lower error rates and being more robust against noise. However, while the performance was excellent on simulated data, where 98.5% of CNVs larger than 10kb and 99.9% of CNVs larger than 200kb were correctly recognised, both programs performed less consistently when applied to real data. The reproducibility of calls made from the two microarray platforms was 78.4% for BCFtools and 61.9% for PennCNV. Moreover, running BCFtools in pairwise mode increased reproducibility from 80.1% to 90.0%.
We found that the input data to these methods, the BAF and LRR quantities, are affected by systematic and random noise to different extents, with LRR being much less reliable across different runs. Consequently, results from LRR-based methods tend to be less reproducible unless precautions are taken. Our approach to this problem is to make the LRR contribution to the HMM emission probabilities less significant than the contribution from BAF. Also, in contrast to most other methods, the BCFtools/cnv command is tailored for determining differences between two cell lines, which increases robustness and, in addition, allows one to distinguish between normal and novel copy number variation.
We acknowledge that, as stated in the introduction, SNP genotyping arrays cannot detect all copy number variants, but they are cost effective and convenient for detection of large aberrations compared to custom CNV arrays or sequencing. We also note that, because of our focus on screening for genomic integrity, we did not attempt to resolve higher order copy number states. To our knowledge BCFtools/cnv is the first SNP-array based CNV caller for paired samples that explicitly models background copy number variation in the control sample. The resulting program has proved valuable for screening hundreds of cell lines in the HipSci project.
A duplication (CN3) present in the majority of cells manifests in a wider split of heterozygous genotypes (
The correlation between LRR values obtained from the default chip (0.5M sites) and the high density chip (2.5M sites) plotted as a function of the moving average window.
The top row shows the number of false positives and negatives as a function of increasing LRR and BAF noise. The bottom row shows the error in prediction of region boundaries: the FP curves show incorrectly added length to correctly detected regions and the FN curves show missed length from correctly detected regions.
The plots in the left column show BAF values from each marker individually and the plots on the right show the overall distribution of BAF values across whole chromosome. The top row (in red) of each panel shows the 0.5M array data, the bottom row shows the 2.5M Omni array data (black). The top sample (
The number and total length of novel CNVs observed across 905 cell lines from the HipSci project. No differences are allowed when
The authors would like to thank Anja Kolb-Kokocinski for HipSci project coordination and Yasin Memari for data processing. This work was supported by the Wellcome Trust (WT098051) and a grant co-funded by the Wellcome Trust and Medical Research Council (WT098503).