
Conceived and designed the experiments: JQ RR. Performed the experiments: JQ. Analyzed the data: SD. Wrote the paper: SD JQ RR. Worked out the mathematical analysis and algorithms and conducted simulation experiments: SD.

The authors are employees of Fluidigm Corporation, and for this research project they used a technology platform which is being commercialized by Fluidigm.

Copy Number Variations (CNVs) of regions of the human genome have been associated with multiple diseases. We present a mathematically sound and computationally efficient algorithm for accurately analyzing CNV in a DNA sample using a nanofluidic device known as the digital array. This numerical algorithm computes the copy number variation and the associated statistical confidence interval, and is based on results from probability theory and statistics. We also provide formulas which can be used as close approximations.

Digital PCR conventionally utilizes sequential limiting dilutions of target DNA, followed by amplification using the polymerase chain reaction (PCR)

PCR mixes are loaded into each panel and single DNA molecules are randomly partitioned into the chambers. The digital array can be thermocycled, imaged on a BioMark instrument, and the data analyzed using the Digital PCR Analysis software.

The two bottom panels are NTC (no template control). Digital PCR Analysis software can count the number of positive chambers in each panel. When two assays with two fluorescent dyes are used in a multiplex digital PCR reaction, two genes can be independently quantitated. This is the basis of the CNV study using the digital array.

Copy number variations (CNVs) are gains or losses of genomic regions that range from 500 bases upwards in size. Whole genome studies have revealed the presence of large numbers of CNV regions in the human genome and a broad range of genetic diversity among the general population.

Current whole-genome scanning technologies use array-based platforms (array-CGH and high-density SNP microarrays) to study CNVs. They are high throughput but lack resolution and sensitivity. Real-time PCR is a sequence-specific technique which is easy to perform, but is limited in its discriminating power beyond a 2-fold difference

CNV determination on the digital array is based upon its ability to partition DNA sequences. Given the number of molecules per panel and the dilution factor, the concentration of the target sequence in a DNA sample can be accurately calculated. In a multiplex PCR reaction with 2 or more assays, multiple genes can be quantitated simultaneously and independently, which eliminates the pipetting errors that would arise if separate reactions had to be set up for different genes. When a single copy reference gene (RNase P in this study,

In this paper we will show that the digital array provides a robust and easy-to-use platform to study CNVs. We have derived a mathematical framework to calculate the true concentration of molecules from the observed positive reactions in a panel. We also show how to perform statistical analysis to find the 95% confidence intervals of the true concentrations and the ratio of two concentrations in a CNV experiment using the digital array with multiplex PCR.

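The copy number variation problem can be stated as follows. Given the counts of positive chambers for the reference gene and the target gene in a panel, estimate the ratio $\lambda_1/\lambda_2$ of their true concentrations in the DNA sample, together with a confidence interval $[r_{Low}, r_{High}]$ on that ratio.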

Our approach is built on well-known tools and techniques from statistics. It decomposes the problem into two parts.

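Given a count of positive chambers in a panel, how can one estimate the true concentration $\lambda$ of the molecules in the DNA sample and a confidence interval $[\lambda_{Low}, \lambda_{High}]$ on that estimate?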

Given estimated true concentrations $\lambda_1$ and $\lambda_2$ of the reference gene and the target gene, respectively, in the DNA sample and their respective confidence intervals, how can one estimate the ratio $\lambda_1/\lambda_2$ and a confidence interval $[r_{Low}, r_{High}]$ on that ratio?

It turns out that the first question can be answered by applying sampling and estimation theory from statistics and probability, and the second question can be answered by a numerical algorithm based on a generalization of a mathematical theorem.

For related work on answering the first question using a Bayesian approach, see the unpublished preprint by Warren et al. titled “The Digital Array Response Curve”, dated March 2007, at

This paper differs from the prior work by Warren et al. in two ways. First, we consider the parameter

We will prove mathematical correctness of our results in this paper and present simulation results to help the reader build useful insight. Finally, we present actual CNV experiments on the digital array with known ratios and show the results using the techniques developed in this paper.

DNA quantitation in the digital array is based on the partitioning of a PCR reaction into an array of several hundred or even a few thousand chambers or wells. One panel of the digital array consists of 765 chambers, and one can use up to 12 panels at a time. If the concentration of the target molecules in the DNA sample is low, most of the chambers capture either one molecule or none, and the number of positive chambers at the end point of the PCR yields a close approximation to the true concentration of the target. However, if the number of molecules is large, there is a greater probability of several molecules landing in the same chamber, and the number of positive chambers is then significantly lower than the number of molecules in the chambers.

We are interested in estimating the true concentration of the molecules in the DNA sample from which we extracted 6 nl×765 = 4.59 µl of sample for each panel.

Consider a universe of an infinite number of digital array chambers filled with an infinite amount of the DNA sample, where the true concentration of the target molecules is

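One can model the number of target molecules captured by a chamber as a Poisson process with mean $\lambda$, so that the probability of a chamber capturing no molecule is $e^{-\lambda}$. Therefore, the probability of a chamber capturing at least one molecule (a hit) is $1 - e^{-\lambda}$.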

A chamber getting a hit or no hit is a binomial process, the same as the toss of a coin, with success probability $p = 1 - e^{-\lambda}$.
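The relation between the hit probability and the concentration can be inverted to estimate the concentration from an observed count. The following is a minimal sketch, assuming a Poisson number of molecules per chamber so that $\lambda = -\ln(1 - p)$. The function name `estimate_lambda` is hypothetical, and the confidence interval shown (a normal approximation to the binomial proportion, mapped through the same transform) is one common construction, not necessarily the exact derivation used in this paper.

```python
import math


def estimate_lambda(hits, chambers, z=1.96):
    """Estimate molecules per chamber from the number of positive chambers.

    Returns (lam, lam_low, lam_high). The interval is an assumption of this
    sketch: a normal-approximation interval on the hit probability, mapped
    through lam = -ln(1 - p), which makes it asymmetric around lam.
    """
    if not 0 < hits < chambers:
        raise ValueError("need 0 < hits < chambers for a finite estimate")
    p_hat = hits / chambers
    # Standard error of a binomial proportion.
    se = math.sqrt(p_hat * (1.0 - p_hat) / chambers)
    p_low = max(p_hat - z * se, 0.0)
    p_high = min(p_hat + z * se, 1.0 - 1e-12)  # keep the logarithm finite

    def to_lambda(p):
        return -math.log(1.0 - p)

    return to_lambda(p_hat), to_lambda(p_low), to_lambda(p_high)


# Example: 311 positive chambers in one 765-chamber panel.
lam, lam_low, lam_high = estimate_lambda(hits=311, chambers=765)
molecules = lam * 765  # about 399 molecules in the panel
print(f"lambda = {lam:.4f} [{lam_low:.4f}, {lam_high:.4f}], molecules = {molecules:.0f}")
```

The estimated number of molecules exceeds the 311 positive chambers because some chambers capture more than one molecule.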

A digital array panel is a finite sampling of this universe. The goal is to determine

See

If _{c}

Define the estimator of

Let a random variable _{X}

Now we derive an approximation for

It is informative and useful to run a simulation experiment on a computer to see how well the real world matches the theory developed above. For this purpose, one can use a random number generator and a computer program to simulate the universe of digital array chambers.

If a panel has

Extract ^{−λ} and standard deviation should be

For our simulation experiments we chose

The green curve is the sampling distribution predicted by the theory.

Quantity | Theoretical Predictions | Simulation Results |
Mean | 311.5 | 311.48 |
Standard Deviation | 13.59 | 13.58 |
Percent of times | 95% | 94.44% |
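A simulation of this kind can be sketched in a few lines. The sketch below assumes 765 chambers per panel and a true concentration of 400/765 molecules per chamber; that value is an assumption chosen because it reproduces the tabulated theoretical mean and standard deviation, and it is not a parameter stated in the text. Each chamber is treated as an independent draw with hit probability $1 - e^{-\lambda}$, and far fewer panels are simulated here than in a full study.

```python
import math
import random


def simulate_panels(lam, chambers=765, panels=2000, seed=1):
    """Return the number of positive chambers in each simulated panel."""
    rng = random.Random(seed)
    p_hit = 1.0 - math.exp(-lam)  # probability that a chamber gets >= 1 molecule
    return [
        sum(1 for _ in range(chambers) if rng.random() < p_hit)
        for _ in range(panels)
    ]


lam_true = 400 / 765  # assumed concentration (molecules per chamber)
counts = simulate_panels(lam_true)

sim_mean = sum(counts) / len(counts)
sim_sd = math.sqrt(sum((c - sim_mean) ** 2 for c in counts) / (len(counts) - 1))

# Binomial theory for the count of positive chambers.
p = 1.0 - math.exp(-lam_true)
theory_mean = 765 * p
theory_sd = math.sqrt(765 * p * (1.0 - p))

print(f"theory:     mean = {theory_mean:.2f}, sd = {theory_sd:.2f}")
print(f"simulation: mean = {sim_mean:.2f}, sd = {sim_sd:.2f}")
```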

In the previous section, we established a method for estimating the true concentration of the target molecules in the DNA sample from the count of positive chambers, as well as the 95% confidence interval for this estimate. We also showed how the sampling distribution

In CNV, the goal is to determine the ratio of the true concentrations of two genes, one being the reference gene and the other the test gene, together with the associated confidence interval. We accomplish this in the following subsections.

Let the sampling distributions of the test gene and the reference gene be

However, as mentioned in the previous section, one cannot make this assumption in general. It is useful to go through the geometric interpretation of Fieller's theorem so that one can solve the problem for arbitrary sampling distributions. See

Assume

In this paper we have presented data from a controlled experimental system, in which a synthetic DNA construct was spiked into human cell line DNA at different concentrations. In this case, the synthetic construct, which was identical to a fragment of the RPP30 gene, was used as the target, and the RNase P gene, which was endogenous to the human cell line, was used as the reference gene. The two genes were identified using two separate PCR reactions, with separate PCR primers and probes. Since there is no reason to assume that the amplification and detection of the target and reference genes are linked,

It is easy to see from the proof of Fieller's theorem and its geometric interpretation that one can compute sampling distribution

Build histograms of sampling distributions

Build a histogram of the sampling distribution of the ratio by dividing the range of ratios into bins $[r_1, r_2]$ and by adding all the joint probabilities of the different values of the concentrations which give a ratio in $[r_1, r_2]$.

Compute the mean and the 95% confidence interval from the ratio histogram.
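The steps above can be sketched as follows, under several assumptions that are not taken from the paper: the sampling distribution of each hit count is approximated by a binomial distribution centred on the observed count, the two genes are independent so that joint probabilities are products, and the histogram is replaced by a sorted list of weighted ratios, which is equivalent to using very fine bins. The names `lambda_distribution` and `ratio_interval` are hypothetical, and the ratio is taken as target over reference.

```python
import math


def lambda_distribution(hits, chambers, tail=1e-9):
    """Map each plausible hit count h to (lambda(h), probability of h)."""
    p = hits / chambers
    dist = []
    for h in range(1, chambers):  # h = 0 and h = chambers give unusable lambdas
        log_prob = (
            math.lgamma(chambers + 1) - math.lgamma(h + 1)
            - math.lgamma(chambers - h + 1)
            + h * math.log(p) + (chambers - h) * math.log(1.0 - p)
        )
        prob = math.exp(log_prob)
        if prob > tail:  # drop negligible tails to keep the joint table small
            dist.append((-math.log(1.0 - h / chambers), prob))
    return dist


def ratio_interval(hits_target, hits_ref, chambers=765, level=0.95):
    """Return (mean, low, high) of the target/reference concentration ratio."""
    target = lambda_distribution(hits_target, chambers)
    ref = lambda_distribution(hits_ref, chambers)
    # Independence assumption: joint probability is the product of marginals.
    pairs = sorted((lt / lr, pt * pr) for lt, pt in target for lr, pr in ref)
    total = sum(w for _, w in pairs)
    mean = sum(r * w for r, w in pairs) / total
    lo_q = (1.0 - level) / 2.0
    hi_q = 1.0 - lo_q
    low = high = None
    cumulative = 0.0
    for r, w in pairs:
        cumulative += w / total
        if low is None and cumulative >= lo_q:
            low = r
        if cumulative >= hi_q:
            high = r
            break
    return mean, low, high


mean, low, high = ratio_interval(hits_target=480, hits_ref=311)
print(f"ratio = {mean:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Because the full joint table is enumerated, this sketch makes no normality assumption about either sampling distribution.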

See

Most of the contribution would come from the confidence ellipse region.

One can still use direct formulas, as an approximation, to compute confidence interval as follows.

The means of _{1} and _{2} respectively. Let the standard deviations be _{x}_{y}_{c}

Let the asymmetric confidence intervals for specified _{c}

If _{R}_{L}_{c}σ_{x}_{T}_{B}_{c}σ_{y}

One detail has to be mentioned. Special care has to be taken if the confidence region gets too close to _{High}

See

$\lambda_{1,Low} = \lambda_1 - 1.96\,\sigma_1, \quad \lambda_{1,High} = \lambda_1 + 1.96\,\sigma_1$

$\lambda_{2,Low} = \lambda_2 - 1.96\,\sigma_2, \quad \lambda_{2,High} = \lambda_2 + 1.96\,\sigma_2$

See the details in the paper for assumptions made so that these equations are close approximations to actual values.
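For comparison, the textbook form of Fieller's interval for the ratio of two independent, approximately normal estimates can be sketched as below. This is the classical theorem with zero covariance, offered as an illustration; it is not claimed to be the approximation formula of this paper. The function name `fieller_interval` and the example inputs are hypothetical.

```python
import math


def fieller_interval(x, sd_x, y, sd_y, z=1.96):
    """Confidence interval for x / y, with x and y independent and normal.

    The interval is the set of ratios r with (x - r*y)^2 <= z^2 (sd_x^2 + r^2 sd_y^2),
    which is bounded only if y differs significantly from zero.
    """
    a = y * y - (z * sd_y) ** 2
    if a <= 0:
        raise ValueError("denominator too close to zero: interval is unbounded")
    c = x * x - (z * sd_x) ** 2
    # The discriminant is non-negative whenever a > 0.
    root = math.sqrt((x * y) ** 2 - a * c)
    return (x * y - root) / a, (x * y + root) / a


# Illustrative inputs: target and reference concentrations with their standard deviations.
r_low, r_high = fieller_interval(x=0.987, sd_x=0.047, y=0.522, sd_y=0.030)
print(f"ratio = {0.987 / 0.522:.3f}, interval = [{r_low:.3f}, {r_high:.3f}]")
```

The explicit check on the denominator reflects the special care needed when the confidence region of the denominator approaches zero.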

We conducted simulation studies, using a random number generator and a computer program as in the previous section, by choosing a ratio of 2 between the concentrations of two genes that are independent of each other, and building a distribution of estimated ratios over 50,000 panels. In 94.9% of the panels, the true chosen ratio lay within the computed confidence interval, which supports the correctness of our mathematical analysis.

The copy number variation results for known ratios of 1, 1.5, 2, 2.5, 3 and 3.5 are shown in

In total, 6 different known ratios were estimated by running the experiments for varying number of panels. The graphs for different numbers of copies are slightly staggered to allow visual comparison of overlap of the 95% confidence intervals.

In summary, Fluidigm's digital array is capable of accurately quantitating DNA samples and is a valuable platform for studying copy number variation. It is a robust technology that is sequence-specific, easy-to-use, and extremely flexible. We have presented mathematical and algorithmic solutions to analyze CNV on a digital array. The solution is an elegant application of statistical sampling and estimation theories to such an important real-world problem. We have shown how one can compute the true concentration of a target sequence in a DNA sample and the associated confidence interval on this estimation, and how one can compute the ratio of true concentrations of multiple sequences and the associated confidence interval on the estimation of this ratio.

A 10-µl reaction mix is normally prepared for each panel. It contains 1× TaqMan Universal master mix (Applied Biosystems, Foster City, CA), 1× RNase P-VIC TaqMan assay, 1× TaqMan assay for the target gene (900 nM primers and 200 nM FAM-labeled probe), 1× sample loading reagent (Fluidigm, South San Francisco, CA) and DNA with about 1,100–1,300 copies of the RNase P gene. 4.59 µl of the 10-µl reaction mix was uniformly partitioned into the 765 reaction chambers of each panel and the digital array was thermocycled on the BioMark system. Thermocycling conditions included a 95°C, 10 minute hot start followed by 40 cycles of two-step PCR: 15 seconds at 95°C for denaturing and 1 minute at 60°C for annealing and extension. Molecules of the two genes were independently amplified. FAM and VIC signals of all chambers were recorded at the end of each PCR cycle. After the reaction was completed, Digital PCR Analysis software (Fluidigm, South San Francisco, CA) was used to process the data and count the numbers of both FAM-positive chambers (target gene) and VIC-positive chambers (RNase P) in each panel.

A spike-in experiment was performed using a synthetic construct to explore the digital array's feasibility as a robust platform for the CNV study. A 65-base oligonucleotide was ordered from Integrated DNA Technologies (Coralville, IA) that is identical to a fragment of the human RPP30 gene. The sequences of the primers and FAM-BHQ probe used to amplify this construct are from Emery et al

Both the RPP30 synthetic construct and human genomic DNA NA10860 (Coriell Cell Repositories, Camden, NJ) were quantitated using the RPP30 assay on a digital array. Different amounts of the RPP30 synthetic construct were then added to the genomic DNA so that mixtures with ratios of RPP30 to RNase P of 1∶1 (no spike-in), 1∶1.5, 1∶2, 1∶2.5, 1∶3, and 1∶3.5 were made, simulating DNA samples containing 2 to 7 copies of the RPP30 gene per diploid cell.

These DNA mixtures were analyzed on the digital arrays as described above. Five panels were used for each mixture and 400–500 RNase P molecules were present in each panel. The ratios of RPP30/RNase P of all samples were calculated using the techniques developed in this paper. For each ratio, we did pooled analysis by adding the numbers of positive chambers in the first

The authors would like to thank the members of the Research and Development division of Fluidigm Corporation, in particular the software and molecular biology teams, for facilitating this work. The first author would like to thank Gang Sun for bringing the problem to his attention, and Gang Sun and Hou-Pu Chou for discussions leading to clarification of the problem statement and for initial feedback on the solution. The authors would also like to thank an anonymous referee for constructive feedback on the paper, which led to its significant improvement.