
XL, SGS, PCG, TGP, and BJR conceived, designed, and performed the experiments. XL analyzed the data. XL contributed reagents/materials/analysis tools. XL, SGS, PCG, TGP, and BJR wrote the paper.

The authors have declared that no competing interests exist.

Single nucleotide polymorphisms (SNPs) have been increasingly utilized to investigate somatic genetic abnormalities in premalignancy and cancer. Loss of heterozygosity (LOH) is a common alteration observed during cancer development, and SNP assays have been used to identify LOH at specific chromosomal regions. The design of such studies requires consideration of the resolution for detecting LOH throughout the genome and identification of the number and location of SNPs required to detect genetic alterations in specific genomic regions. Our study evaluated SNP distribution patterns and used probability models, Monte Carlo simulation, and real human subject genotype data to investigate the relationships between the number of SNPs, SNP heterozygosity (HET) rates, and the sensitivity (resolution) for detecting LOH. We report that variances of SNP heterozygosity rate in dbSNP are high for a large proportion of SNPs. Two statistical methods proposed for directly inferring SNP heterozygosity rates require much smaller (intermediate) sample sizes and are feasible for practical use in SNP selection or verification. Using HapMap data, we showed that a region of LOH greater than 200 kb can be reliably detected, whereas losses smaller than 50 kb have a substantially lower detection probability, when using all SNPs currently in the HapMap database. Higher densities of SNPs may exist in certain local chromosomal regions, providing some opportunities for reliably detecting LOH of segment sizes smaller than 50 kb. These results suggest that the interpretation of genome-wide scans for LOH using commercial arrays needs to consider the relationships among inter-SNP distance, detection probability, and sample size for a specific study. New experimental designs for LOH studies would also benefit from considering the power of detection and sample sizes required to accomplish the proposed aims.

Single nucleotide polymorphisms (SNPs) are common DNA sequence variations and have been widely investigated for their roles in disease causation [

Detection of LOH requires SNPs to be heterozygous (i.e., informative). In the largest public SNP database, dbSNP (

The frequency distribution of the average SNP HET rates for each SNP reported in the dbSNP database is shown in

Blue bars are the distribution of SNP HET rates in dbSNP; red line is fitted line. Chi-square goodness of fit test (with 20 bins) for fitting a beta distribution was not rejected at

The

These results indicate that a significant number of the SNPs in dbSNP have large estimated variances and therefore do not provide sufficiently precise information for designing studies that require accurate estimation of SNP HET rates (e.g., those using SNPs for LOH detection for molecular diagnoses). Traditionally, for diallelic SNPs with allele frequencies $p_1$ and $p_2$, the HET rate could be estimated as $h_r = 2p_1p_2$, although this formula is appropriate only for alleles in HWE. Another approach, which is robust to HWE assumptions, is to estimate the HET rate (and its variance) directly by population allele frequencies [

Using hypothetical parameters for true HET rates and sample sizes, we first show the relationships among true HET rates, estimated HET rates, their estimated variances, and sample sizes using the score method with continuity correction (exact binomial method may result in larger CI) (_{r}_{r}_{r}
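How sample size affects the precision of a directly estimated HET rate can be illustrated with a short sketch. It assumes the "score method with continuity correction" refers to the Wilson score interval with continuity correction in its usual textbook form. The function name `score_ci_cc` and its arguments are hypothetical, not taken from the study.

```python
from math import sqrt
from statistics import NormalDist


def score_ci_cc(het_count, n, conf_level=0.95):
    """Direct HET rate estimate with a continuity-corrected Wilson score interval.

    het_count: number of heterozygous subjects observed (hypothetical input)
    n: number of subjects genotyped
    Returns (estimate, lower, upper).
    """
    if n <= 0 or not 0 <= het_count <= n:
        raise ValueError("need n > 0 and 0 <= het_count <= n")
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    p = het_count / n
    q = 1 - p
    denom = 2 * (n + z * z)
    if het_count == 0:
        lower = 0.0
    else:
        root = sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))
        lower = (2 * n * p + z * z - 1 - z * root) / denom
    if het_count == n:
        upper = 1.0
    else:
        root = sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))
        upper = (2 * n * p + z * z + 1 + z * root) / denom
    return p, max(0.0, lower), min(1.0, upper)


if __name__ == "__main__":
    # Same observed rate (0.3) at increasing sample sizes: the interval narrows.
    for n in (20, 100, 1000):
        est, lo, hi = score_ci_cc(round(0.3 * n), n)
        print(f"n={n:5d}  estimate={est:.2f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The estimate is the observed proportion of heterozygous subjects, so it does not rely on HWE. Only the interval width depends on the sample size.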

Relationship between Sample Size and Estimation of SNP Heterozygous Rates

We introduce two different approaches to deal with the unrealistically large sample size requirement. In using SNPs to evaluate LOH in a specific chromosomal region, it is desirable that the HET rates of selected SNPs used in the region be higher than a specific value to increase the probability that at least one SNP will be informative for each patient. Therefore, the question is to test the statistical hypothesis for the HET rate of a specific SNP _{rs}_{r}_{0} (i.e., _{0}: _{r}_{r}_{0} versus _{1}_{r}_{r}_{0}). With a given power and sample size

To get sample size, we have: _{β}_{α}_{β}

Using _{0} is reasonably small in most cases; e.g., when _{r}_{0} = 0.2, and _{rs}_{rs}_{r}_{0}, the required sample size becomes much larger.
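The dependence of sample size on the gap between the threshold and the true HET rate can be sketched with the standard normal-approximation formula for a one-sided, one-sample test of a proportion. This is an illustrative assumption and not necessarily the exact expression used in the study. The names `sample_size_one_sided`, `h0`, and `h_true` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_one_sided(h0, h_true, alpha=0.05, power=0.8):
    """Approximate n for a one-sided test of a HET rate against threshold h0.

    Uses the usual normal approximation:
        n = ((z_alpha * sqrt(h0 (1 - h0)) + z_beta * sqrt(h (1 - h))) / (h - h0))^2
    where h is the assumed true HET rate (h_true).
    """
    if not (0 < h0 < 1 and 0 < h_true < 1) or h0 == h_true:
        raise ValueError("rates must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(h0 * (1 - h0)) + z_beta * sqrt(h_true * (1 - h_true))
    return ceil((numerator / (h_true - h0)) ** 2)


if __name__ == "__main__":
    # The closer the true rate is to the threshold, the larger the sample needed.
    for h_true in (0.40, 0.30, 0.25, 0.22):
        print(h_true, sample_size_one_sided(0.2, h_true))
```

Under these assumptions the required sample size grows roughly with the inverse square of the difference between the true rate and the threshold.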

Sample Sizes* for Testing SNP Heterozygous Rate at Different Thresholds

$H_0: h = h_0$ versus $H_1: h = h_1$ ($h_1 < h_0$). The likelihood ratio is

For type I error (false positive) level _{1}, …, x_{n}, h_{0}_{1})) < _{0} will be accepted when _{0} will be rejected and _{1} accepted when _{0,} _{1}, _{0} and _{1}. _{0} = 0.3, and _{1} = 0.2 against various true (sample) SNP HET rates _{0}, and, under the most optimistic situations, only four subjects are necessary to determine the HET rate regarding hypothesis _{0}. Depending on the goals of a study, the SPRT method could be used to significantly reduce the testing sample size required for SNP HET rate inference (e.g., compare to values in
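A minimal sketch of a sequential test of this kind is given here, assuming Wald's classical SPRT for Bernoulli observations (1 = heterozygous, 0 = homozygous) with Wald's approximate stopping boundaries. The function names and the default error levels are hypothetical, and the average sample numbers are Wald's approximations evaluated only at the two hypothesized rates.

```python
from math import log


def sprt_het(observations, h0, h1, alpha=0.05, beta=0.05):
    """Wald SPRT of H0: h = h0 against H1: h = h1 on 0/1 genotype calls.

    Returns (decision, number of subjects used). The decision is "continue"
    if the data run out before a boundary is crossed.
    """
    upper = log((1 - beta) / alpha)   # crossing above accepts H1
    lower = log(beta / (1 - alpha))   # crossing below accepts H0
    llr = 0.0                         # cumulative log likelihood ratio (H1 over H0)
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += log(h1 / h0) if x else log((1 - h1) / (1 - h0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n


def approx_asn(h0, h1, alpha=0.05, beta=0.05):
    """Wald's approximate average sample number when h0 or h1 is the true rate."""
    log_a = log((1 - beta) / alpha)
    log_b = log(beta / (1 - alpha))

    def mean_step(h):
        # Expected log likelihood ratio increment per subject at true rate h.
        return h * log(h1 / h0) + (1 - h) * log((1 - h1) / (1 - h0))

    asn_h0 = ((1 - alpha) * log_b + alpha * log_a) / mean_step(h0)
    asn_h1 = (beta * log_b + (1 - beta) * log_a) / mean_step(h1)
    return asn_h0, asn_h1


if __name__ == "__main__":
    print(sprt_het([1] * 50, h0=0.3, h1=0.2))
    print(approx_asn(0.3, 0.2))
```

Because the test stops as soon as a boundary is crossed, the number of subjects actually genotyped varies from run to run. The average sample number summarizes that variability.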

Average Sample Number of Sequential Probability Ratio Test Method for SNP HET Rate Test

We also examined the number of SNPs needed for reliable detection of LOH for random chromosomal regions of a specific length assuming the SNP HET rate distribution shown in _{t}_{t}_{t}^{k} ≥ threshold (i.e., threshold = 0.95 or 0.99) to guarantee at least one or more heterozygous SNP will be in the lost segment. However, _{t}
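The threshold condition can be made concrete under a simplifying assumption: if each of $k$ SNPs in the lost segment is heterozygous independently with the same rate $h$, the probability that at least one is informative is $1 - (1 - h)^k$. The sketch below solves for the smallest such $k$. The common rate and the independence assumption are illustrative simplifications, and `min_snps_for_detection` is a hypothetical name.

```python
from math import ceil, log


def min_snps_for_detection(het_rate, threshold=0.95):
    """Smallest k with 1 - (1 - het_rate)**k >= threshold.

    Assumes every SNP has the same HET rate and that SNPs are
    heterozygous independently of one another.
    """
    if not (0 < het_rate < 1 and 0 < threshold < 1):
        raise ValueError("het_rate and threshold must lie in (0, 1)")
    return ceil(log(1 - threshold) / log(1 - het_rate))


if __name__ == "__main__":
    for h in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(h, min_snps_for_detection(h, 0.95), min_snps_for_detection(h, 0.99))
```

With a HET rate of 0.3, for instance, nine such SNPs are needed to reach a 0.95 threshold and thirteen to reach 0.99.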

At _{i}_{i}_{i}_{i}

Given the non-random distribution pattern of SNP HET rates in the genome, the next obvious question is how long (in base pairs) must a random chromosomal segment be to contain one or more heterozygous SNPs so that LOH is detected with a high probability (e.g., 0.95 or 0.99). Based on HapMap data, we used three approaches to ascertain this relationship, including simulation using the fitted dbSNP HET rate distribution pattern in

Blue and black lines are the simulated results using HET rate distribution pattern in dbSNP (

Many publications [_{het}_{d} = ph_{het}_{⌊ ⌋} representing the largest integer equal to or smaller than _{d}_{het}^{k}. The relationships are shown in

(Red lines: inter-SNP distance = 12 kb; green lines: inter-SNP distance = 120 kb; blue lines: inter-SNP distance = 200 kb). For each color, the three lines from bottom to top correspond to SNP HET rates of 0.2 (bottom), 0.3 (middle), and 0.4 (top).

(A) Shows the results when the chromosomal region being lost is smaller than the inter-SNP distance. For example, with a 100 kb region being lost and a 200 kb inter-SNP distance, the LOH detection probabilities are 8%, 15%, and 20% for SNP HET rates of 0.2, 0.3, and 0.4, respectively (blue lines). The maximum detection probability is about 40% or less, depending on SNP HET rate.

(B) Shows the results when the lost region is larger than the inter-SNP distance. For a 300 kb lost region and a 120 kb inter-SNP distance, the calculated detection probability is about 40%, 60%, and 70% for SNP HET rates of 0.2, 0.3, and 0.4, respectively (green lines). As the size of the lost region increases toward 900 kb, the LOH detection probability approaches 0.9 or higher when the SNPs have a HET rate of 0.3 or higher. Similarly, with a 200 kb inter-SNP distance and a 300 kb lost region, the probabilities of detecting LOH are about 28%, 40%, and 52% for SNP HET rates of 0.2, 0.3, and 0.4, respectively (blue lines). If the inter-SNP distance is 12 kb, the detection probability of LOH is fairly high (more than 85%) when the loss is about 100 kb or longer (red lines). These results assume that the SNPs selected and arrayed on the chips are evenly distributed on the chromosome, which gives the most optimistic detection probability for genome-wide screening. If the selected SNPs on a chip are not evenly distributed, the detection probability will be reduced. If all the currently available SNPs are used (arrayed on a chip), the detection probabilities follow the pattern shown in
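The evenly spaced case can be sketched with a simple model. It assumes that SNPs sit exactly one inter-SNP distance apart, that the lost segment is placed uniformly at random, and that each SNP is heterozygous independently with the same rate. Under these assumptions, a segment of length $L$ with spacing $d$ covers either $\lfloor L/d \rfloor$ SNPs or one more, the latter with probability equal to the fractional part of $L/d$. The function name and parameters are hypothetical, and the model is an illustration rather than the study's exact calculation.

```python
def loh_detection_probability(loss_kb, spacing_kb, het_rate):
    """P(at least one heterozygous SNP falls inside a randomly placed lost segment).

    Model assumptions (illustrative): evenly spaced SNPs, uniform random
    placement of the segment, independent SNPs sharing one HET rate.
    """
    if loss_kb < 0 or spacing_kb <= 0 or not 0 <= het_rate <= 1:
        raise ValueError("invalid arguments")
    k, remainder = divmod(loss_kb, spacing_kb)
    k = int(k)                        # SNPs always covered by the segment
    p_extra = remainder / spacing_kb  # chance of covering one additional SNP
    miss = (1 - p_extra) * (1 - het_rate) ** k + p_extra * (1 - het_rate) ** (k + 1)
    return 1 - miss


if __name__ == "__main__":
    for spacing in (12, 120, 200):
        for h in (0.2, 0.3, 0.4):
            p = loh_detection_probability(300, spacing, h)
            print(f"loss=300 kb spacing={spacing:3d} kb HET={h}: {p:.3f}")
```

For a 300 kb loss with 200 kb spacing this model gives 0.28, 0.405, and 0.52 for HET rates of 0.2, 0.3, and 0.4, in line with the approximate values quoted for the blue lines. When the loss is shorter than the spacing, the expression reduces to $(L/d)\,h$.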

Finally, we used a bootstrap method to randomly sample the heterozygous SNPs within a 500 kb window from the genotype data for Chromosomes 1, 3, 9, and 17 of two human subjects in the HapMap database.
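A resampling estimate of this kind can be sketched as follows, assuming the heterozygous SNP positions of one subject within a window are available as a list of coordinates. The sketch places lost segments at random within the window and counts how often at least one heterozygous SNP is covered. It is a simplified Monte Carlo illustration and not the study's bootstrap procedure. The positions in the example are synthetic, and all names are hypothetical.

```python
import random
from bisect import bisect_left


def empirical_detection_probability(het_positions, window_start, window_end,
                                    loss_len, n_draws=10000, seed=0):
    """Fraction of randomly placed segments covering >= 1 heterozygous SNP.

    het_positions: coordinates of one subject's heterozygous SNPs (hypothetical)
    loss_len: length of the simulated lost segment, in the same units
    """
    if loss_len <= 0 or window_end - window_start < loss_len:
        raise ValueError("segment must fit inside the window")
    positions = sorted(het_positions)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        start = rng.uniform(window_start, window_end - loss_len)
        i = bisect_left(positions, start)          # first SNP at or after start
        if i < len(positions) and positions[i] < start + loss_len:
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    # Synthetic example: 40 heterozygous SNPs scattered over a 500 kb window.
    rng = random.Random(1)
    snps = [rng.uniform(0, 500_000) for _ in range(40)]
    for loss in (10_000, 50_000, 200_000):
        print(loss, empirical_detection_probability(snps, 0, 500_000, loss))
```

Unlike a model based on an average HET rate, this estimate reflects the actual clustering of informative SNPs in a given subject and region.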

The HapMap genotype data are from two randomly selected individuals from the CEU group.

Using SNPs for LOH detection is of great value for chromosomal instability studies and cancer risk prediction, but a better understanding of the resolution of the technique and how to select an informative panel of SNPs for a given application is needed. The variances of SNP HET rates are large for a large number of SNPs. In most cases, this is likely due to the small sample sizes used to estimate allele frequencies. Differences among ethnic groups might also contribute to the variance of averaged HET rates. Relatively large sample sizes are needed to accurately estimate SNP HET rates using traditional methods. To reduce sample size for practical use, we presented two statistical methods that could be used to determine the number of individuals in the population that would need to be examined to determine whether a SNP HET rate is above or below a specified threshold. The Monte Carlo simulation was performed on SNPs in dbSNP with HET rate estimation values ≤0.5 as well as all SNPs, with essentially no change in the conclusion of the study (

Based on specific study goals or technologies, more study specific methods such as truncated SPRT schemes [^{2}), the estimation of ^{2} and variance of ^{2} themselves are subject to the effects of sample size and evolutionary history of specific SNPs [^{2} should be considered when ^{2} are used for inferring SNP HET rates if a study has stringent requirements (i.e., development of clinical diagnostic markers).

We did not distinguish coding and non-coding regions of the genome in this simulation since Cargill et al. [

Our study showed that a region of LOH greater than 200 kb could be detected with high probability (>90%), with losses smaller than 50 kb having a substantially lower detection probability when using all SNPs currently in the HapMap database (

Using dbSNP and HapMap data, this study evaluated the distribution of SNP HET rates and the resolution of LOH detection genome-wide. The results of this study have two important implications that might improve the design and interpretation of future genome-wide LOH screens of cancers and premalignant tissues. First, retrospective review of previous genome-wide LOH screens indicates that technology limitations (i.e., SNP density of arrays) used in the experiments could have missed significant numbers of LOH events that were below the resolution of the SNP array [

LOH has been frequently proposed as a candidate biomarker for cancer risk prediction. The ability to detect an LOH event will depend on informativity, SNP density, and the size of the LOH event. Our results could improve sample size calculations for the design of future LOH studies. If one would like to detect the effect of an LOH event on the risk of progression to cancer, then the sample size depends on the LOH detection probability. For example, in a study with a 1:5 ratio of cases to controls, a minimum detectable relative risk of 5 for the LOH, a statistical power of 0.9, and an LOH prevalence rate of 30% among informative subjects, at least 23 cases and 117 controls will be needed if the LOH detection probability is 100% (large region loss or high density of informative SNPs). However, if the LOH detection probability is 0.7 or 0.3 (e.g., a smaller loss event or fewer informative SNPs), then at least 44 cases and 190 controls or 116 cases and 468 controls will be needed, respectively.

All the results obtained in this analysis are based on the assumption that heterozygous SNPs are required for detection of LOH. New technologies are emerging that could be used to detect chromosome copy number changes (including deletion) using homozygous SNPs with a reasonably high accuracy [

The data for SNPs HET rates were downloaded from dbSNP (build 126) (

Data from dbSNP were used to summarize the HET rate distribution pattern of SNPs (

To examine the spatial pattern of LOH detection probability along a chromosome (

ASN, average sample number

HET, heterozygosity

LOH, loss of heterozygosity

SNP, single nucleotide polymorphism

SPRT, sequential probability ratio test