Conceived and designed the experiments: XZ. Performed the experiments: WTH. Analyzed the data: XY LW. Contributed reagents/materials/analysis tools: FAM CCC. Wrote the paper: XY. Guideline of the work: XZ. Guideline and important suggestions for the manuscript: STCW.

The authors have declared that no competing interests exist.

Studies of Copy Number Aberration (CNA) in myelodysplastic syndromes (MDS) using single nucleotide polymorphism (SNP) arrays have received increasing attention in recent years. In the current study, a new Constraint Moving Average (CMA) algorithm is first adopted to determine the regions of CNA. In addition to large regions of CNA, the proposed CMA algorithm can also detect small regions. Real-time Polymerase Chain Reaction (qPCR) results support that the CMA algorithm identifies both large and subtle regions. Based on the results of CMA, two independent applications are studied. The first is power analysis for sample size estimation. An accurate estimate of the sample size needed for the desired purpose of an experiment is important for efficiency and cost-effectiveness. The power analysis is performed to determine the minimum sample size required for ensuring at least

Myelodysplastic syndromes (MDS) are a heterogeneous group of clonal hematopoietic disorders characterized by peripheral cytopenia, morphologic dysplasia and susceptibility to leukemic transformation

This article is concerned with our latest MDS study using 250 K Affymetrix SNP arrays. In contrast to other research groups, who used unsorted bone marrow samples

One goal of SNP array studies is to detect the regions of Copy Number Aberration (CNA) in the whole genome. Traditional methods to infer the copy number from a SNP array can be grouped into segmentation, modeling and regression approaches. Olshen

Two independent applications of our CMA algorithm are studied. The first is power analysis. An important aspect of experimental design is determining the number of samples required for the results to be statistically interpretable, which is usually addressed by power analysis. To perform power analysis, we first establish a hypothesis and then apply a statistical test to decide whether the null hypothesis is rejected. The power of a test is the probability of obtaining a statistically significant result given that the null hypothesis is false (the flowchart is given in the Results part). Power increases with the sample size, the significance level and the effect size, and decreases with the variance in the population. Statistical and biological significance can be linked through power analysis: once the significance level, the effect size and the desired power are given, the required sample size can be estimated directly.
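The relation between significance level, effect size, power and sample size can be illustrated with a minimal sketch. It assumes a one-sided, one-sample test with a normal approximation and a standardized effect size; these choices, and the function name `required_sample_size`, are assumptions made for illustration and are not necessarily the test used in this study.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a one-sided, one-sample test.

    Assumption: the test statistic is approximately normal and the effect
    size is standardized (mean difference divided by the standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # critical value for the significance level
    z_beta = z.inv_cdf(power)       # quantile matching the target power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Smaller effect sizes or higher target power require more samples.
for d in (0.2, 0.5, 0.8):
    print(d, required_sample_size(d, alpha=0.05, power=0.8))
```

Because it uses the normal rather than the t distribution, this approximation tends to give slightly smaller sample sizes than an exact t-test based calculation.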

To estimate the number of samples required for genotype array studies, some standard methods of power analysis already exist. As in gene microarray studies, differentially expressed genes across disease subtypes are usually identified with algorithms such as Principal Component Analysis (PCA), Significance Analysis of Microarrays (SAM)

However, because of the heterogeneity of the disease, the methods mentioned above are not valid in MDS studies. In our experiments, copy number variants in the same regions can hardly be found in SNP arrays from different patients, or even in different hematopoietic fractions (erythroid, myeloid or blastic fraction sorted by flow cytometry) of the same patient. To the best of our knowledge, no existing work attempts to quantify the statistical power for MDS studies. The major obstacle is that the heterogeneity makes it difficult to design statistical tests and to estimate the sample size accurately. This motivated us to consider other approaches to this issue.

Based on the CNA regions selected by the CMA algorithm, power analysis can be performed to determine what sample size ensures that the detected regions are statistically different from the normal references (details are shown in the

The second application of our pattern-selection based CMA algorithm is to identify the MDS grade of patients. A clear separation of MDS stages (high and low grade MDS patients) can guide prognosis and survival analysis. The existing methods for discriminating the grade of MDS patients rely on cell morphology and the International Prognostic Scoring System (IPSS) score, both of which are methods at the phenotype level. As already mentioned, because of the heterogeneity reflected in the SNP arrays for this complex group of diseases, and the consequent lack of common CNA regions, the traditional classification approaches generally used for analysis at the genotype level are no longer applicable. Therefore, we need a new approach to overcome this obstacle. Based on the CMA algorithm, the Risk Likelihood Function and General Variant Level (GVL) score are proposed for each array. The GVL score integrates the CNA information, such as the number of abnormal chromosomes and the total number of altered SNPs, and returns a unified measurement that makes the different arrays comparable. Afterwards, analyses based on GVL are used to discriminate between high and low grade MDS patients. (It is worth mentioning that we consider individual patients rather than single arrays here. If there is more than one array for a patient, we calculate the GVL for all arrays of this patient, and the average GVL score is the final GVL score for this patient.) Two group

Our novel contributions are: (i) we develop a new pattern-selection based method to detect the regions of CNA for a heterogeneous disease such as MDS by using sorted bone marrow SNP arrays. Real-time PCR results support that, besides large CNA regions, the CMA algorithm also identifies subtle regions; (ii) based on the results of the CMA algorithm, two independent applications are studied. (a) Sample size estimation of the experiment based on selected patterns can be easily done by using statistical. (b) According to the results of the CMA algorithm, high and low grade MDS patients can be well separated by using the proposed GVL score, which gives a unified measurement that makes the different arrays comparable. (iii) For comparative analysis, we demonstrate that the number of abnormal chromosomes detected by CMA is significantly different between patients suffering from high grade MDS and those affected by low grade MDS. Such a difference cannot be observed using

Altogether, 35 SNP arrays are generated from 12 MDS patients, of which 21 and 14 are treated as test samples and references, respectively. Genomic DNA from each fraction was extracted with Qiagen Allprep RNA/DNA Mini Kit (Qiagen Valencia, CA) and stored at −80°C. Constitutional/control DNA consisted of buccal mucosa and lymphoid fractions of the patients and one marrow sample without evidence of MDS sorted into blastic, erythroid, and myeloid fractions (see

Due to the high variability of the mean intensities across different SNP arrays, normalization is necessary to make different SNP arrays comparable

Consider a pair of tumor and normal samples from the same patient (but from different tissues). A SNP has conflicting genotype calls in this pair if its genotype is homozygous in the normal sample but heterozygous in the tumor sample, or if both genotypes are homozygous but carry different alleles. To reduce the risk of false positives or false negatives in the final results, SNPs with conflicting genotype calls between samples and references are filtered out right at the beginning.

171057 | 38094 | 26262 | 10919 | 7069 | 3770 | |

0.362959 | 0.142516 | 0.036324 | 0.006756 | 0.000977 | ||

2384 | 1242 | 734 | 386 | 216 | 181 | |

0.000114 | 1.12E-05 | 9.22E-07 | 6.56E-08 | 4.07E-09 | 2.33E-10 |

For example, there are 3770 SNPs with conflicting genotype calls, each of which appears in 5 arrays; if such conflicts were due only to random error (i.e. could not be regarded as wrong genotyping calls), the probability would be 0.000977.

Because the probability that a SNP shows conflicts in at least 3 of the arrays by random error is small, such SNPs are removed, and only the remaining 235413 (171057+38094+26262) SNPs are analyzed in the next stage.
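The conflict rule and the filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the two-letter genotype encoding (`"AA"`, `"AB"`, `"BB"`) and the data layout are assumptions, and `max_conflicts=2` encodes the rule that SNPs conflicting in at least 3 arrays are discarded.

```python
def is_conflicting(normal_call, tumor_call):
    """True if a normal/tumor genotype pair conflicts under the stated rule."""
    normal_hom = normal_call[0] == normal_call[1]
    tumor_hom = tumor_call[0] == tumor_call[1]
    if normal_hom and not tumor_hom:
        return True  # homozygous in normal, heterozygous in tumor
    if normal_hom and tumor_hom and normal_call != tumor_call:
        return True  # both homozygous, but different alleles
    return False


def filter_snps(calls_by_snp, max_conflicts=2):
    """Keep SNPs whose calls conflict in at most `max_conflicts` sample pairs.

    calls_by_snp maps a (hypothetical) SNP id to a list of
    (normal_call, tumor_call) pairs, one pair per test array.
    """
    kept = []
    for snp_id, pairs in calls_by_snp.items():
        n_conflicts = sum(is_conflicting(n, t) for n, t in pairs)
        if n_conflicts <= max_conflicts:
            kept.append(snp_id)
    return kept


example = {
    "snp_a": [("AA", "AA"), ("AB", "AB"), ("BB", "BB")],
    "snp_b": [("AA", "AB"), ("AA", "BB"), ("BB", "AB")],
}
print(filter_snps(example))  # snp_b conflicts in 3 pairs and is dropped
```

Note that a heterozygous normal call paired with a homozygous tumor call is not flagged by this rule, since it is compatible with a real loss of heterozygosity.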

In the next step, we use

Because CNA is present in part of the arrays, we need to detect those regions first. Among the methods for copy number estimation,

(2) The standard deviation (

The threshold of log-2 ratio here is ±0.35, the same as in

MDS-2 Lymphoid is the reference, and MDS-2 Erythroid is the test sample. In the test sample, the average log-2 intensities of every five consecutive SNPs in the circled region are higher than 0.35, with small SDs (≤ 0.15). Selected overlapping regions are merged into a large region.
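The selection rule stated above (five consecutive SNPs, mean log-2 ratio beyond ±0.35, SD ≤ 0.15, overlapping windows merged) can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' code: the function name `cma_regions` is hypothetical, the sample standard deviation is assumed, and overlapping windows are merged only when they have the same direction (gain or loss).

```python
from statistics import mean, stdev


def cma_regions(log2_ratios, window=5, threshold=0.35, max_sd=0.15):
    """Return merged candidate regions as (start, end, kind) tuples.

    `start` is inclusive and `end` exclusive, both SNP indices.
    Assumption: the sample standard deviation is used for the SD check.
    """
    regions = []
    for start in range(len(log2_ratios) - window + 1):
        values = log2_ratios[start:start + window]
        m = mean(values)
        if abs(m) <= threshold or stdev(values) > max_sd:
            continue  # window fails the mean or the SD constraint
        kind = "gain" if m > 0 else "loss"
        end = start + window
        if regions and regions[-1][2] == kind and start < regions[-1][1]:
            # Overlaps the previous selected window: extend that region.
            regions[-1] = (regions[-1][0], end, kind)
        else:
            regions.append((start, end, kind))
    return regions


signal = [0.0] * 10 + [0.5, 0.45, 0.55, 0.5, 0.48, 0.52] + [0.0] * 10
print(cma_regions(signal))  # [(10, 16, 'gain')]
```

In this toy signal, windows that straddle the boundary of the elevated segment are rejected either by the mean threshold or by the SD constraint, so only the two windows lying fully inside it are selected and merged.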

The CMA algorithm reduces the complexity of the model and employs fewer parameters, which makes it robust, easy to perform and computable for high resolution SNP arrays. In addition, the mean of the log-2 ratio gives an intuitive indication of the real copy number variations and, together with the restriction on the standard deviation, avoids false positives caused by strong noise. Compared with the results of ^{th} SNP and ending at 85978^{th} SNP. Blast and Erythroid fractions are used as test samples, and Lymphoid is the corresponding reference. These three fractions are from the same patient. The region covers the gene

Sample | Fraction | Mean | SD | PCR |

MDS-3 | Lymphoid (control) | 0.0951 | 0.1220 | 1 |

Blast (test sample) | 0.1869 | 0.1724 | 0.91 | |

Erythroid (test sample) | 0.3563 | 0.1229 | 0.65 |

Three fractions from the same patient are displayed. Lymphoid is the normal reference; Blast and Erythroid serve as test samples. Their log-2 ratios behave differently. As the normal reference, Lymphoid has a log-2 ratio close to 0. There is a significant loss in Erythroid, but for Blast the log-2 ratio is not low enough. The real-time PCR value of Lymphoid is normalized to 1. Compared with the reference, Erythroid is concluded to carry a copy number aberration, whereas no such abnormality is observed in Blast.

Our CMA algorithm has the capability to detect both large and subtle regions. The comparison of the results with other algorithms and the choice of parameters will be discussed in the

Using the pattern-selection based CMA algorithm, the CNA regions can be detected for each array.

Compared with the references, the circled regions indicate the CNA regions. For different samples, the CNA regions may occur in different locations. Some appear repeatedly (those with green circles), and others rarely occur (those with red circles).

To estimate the sample size, one usually refers to power analysis. Four quantities are involved in power analysis: sample size, effect size, significance level

Another principal challenge posed in the field of power analysis is how to define the effect size. The effect size is a measure of biological significance. It gives the difference between the results predicted by the null hypothesis and the actual state of the population being tested. In a clinical study, when the interest is to target the power for different effect sizes, sample size can be estimated to ensure that the endpoint with the smallest effect size is sufficiently powered with a fixed significance level

Due to the limited number of SNP arrays used in our MDS study, it is difficult to define an appropriate effect size empirically. Therefore, we prefer to use the standardized effect size from

When

Since the repeatability number of each detected CNA region is different, the effect size varies, which leads to different sample sizes for detecting different specific regions. For some regions, the effect sizes are so small that we can hardly see any CNA region in the test samples. Sometimes only a fraction of the CNA regions are of interest, depending on the purpose of the experiment. For frequently appearing regions, only a few samples are required for statistical interpretation, whereas for rarely emerging regions a huge sample size is needed to ensure the statistical significance of the tests. Therefore, the sample size depends on the aim of the study, and it is not necessary to identify all abnormal regions at the same time. Accurate sample size estimation is important for an efficient and economical study design. To implement it, we first collect the abnormal regions derived by the CMA algorithm. For each detected region

According to our CMA algorithm, in total 1117 of the detected regions are non-overlapping. Power analysis is executed according to the flowchart in

The significance level is set as 0.05.

P = 0.8 | P = 0.9 | P = 0.8 | P = 0.9 | P = 0.8 | P = 0.9 | ||

0.4 | 0.344 | 54 | 74 | 69 | 91 | 102 | 129 |

0.5 | 0.283 | 79 | 109 | 101 | 134 | 150 | 190 |

0.6 | 0.230 | 118 | 163 | 150 | 200 | 224 | 284 |

0.7 | 0.171 | 215 | 296 | 272 | 364 | 406 | 516 |

0.8 | 0.108 | 528 | 731 | 671 | 897 | 998 | 1271 |

(P: Power).

The discrimination between high and low grade MDS patients is an important issue in the prognosis and survival analysis of MDS studies. Biologists use cell morphology and the IPSS score to assess the severity of a patient's MDS. These kinds of classification are important for future clinical survival analysis. However, at the genotype level, to the best of our knowledge, no related research focuses on this issue. As an application of the proposed CMA algorithm, we first define the Risk Likelihood Function and the General Variant Level (GVL) score as in (4) and (5); the GVL score is then used to discriminate between high and low grade MDS. Statistical tests show that high and low grade MDS can be well separated by the GVL. The difference between the two groups is significant, which implies that the GVL can provide a quantitative criterion for classification when using SNP arrays.

The Risk Likelihood Function, defined as follows, takes two aspects into account: the chromosome abnormalities and the number of altered SNPs.

The General Variant Level presenting the log-2 ratio variant with the Risk Likelihood Function can be defined as

Though there are only a few common regions among the sample arrays, using the proposed CMA algorithm and the GVL score, we make the arrays comparable. Thereby, high and low grade MDS patients can be well separated.

Sample | Fraction | GVL | Average | High/Low by morphology | IPSS | ||

MDS-1 | Blast | 8 | 365 | 0.3871 | 0.3835 | H | Int-1 |

Erythroid | 6 | 53 | 0.3799 | ||||

MDS-2 | Myeloid | 12 | 74 | 0.3581 | 0.4011 | H | Int-2 |

Erythroid | 15 | 174 | 0.4441 | ||||

MDS-6 | Blast | 4 | 16 | 0.3336 | 0.2725 | H | Int-2 |

Erythroid | 3 | 15 | 0.3119 | ||||

Myeloid | 1 | 6 | 0.1719 | ||||

MDS-8 | Blast | 5 | 1610 | 0.3635 | 0.3561 | H | Int-1 |

Erythroid | 3 | 875 | 0.3286 | ||||

MDS-10 | Myeloid | 6 | 4586 | 0.3742 | 0.3742 | H | H |

MDS-3 | Blast | 0 | 0 | 0 | 0.1123 | L | Int-1 |

Myeloid | 0 | 0 | 0 | ||||

Erythroid | 4 | 30 | 0.3369 | ||||

MDS-4 | Myeloid | 0 | 0 | 0 | 0.0895 | L | L |

Erythroid | 1 | 5 | 0.1790 | ||||

MDS-5 | Erythroid | 1 | 5 | 0.1808 | 0.1808 | L | L |

MDS-9 | Erythroid | 1 | 5 | 0.1939 | 0.1939 | L | L |

MDS-11 | Myeloid | 0 | 0 | 0 | 0 | L | Int-1 |

MDS-12 | Myeloid | 0 | 0 | 0 | 0 | L | Int-1 |

MDS-7 | Blast | 15 | 198 | 0.4264 | 0.3842 | Int-1 | |

Myeloid | 3 | 38 | 0.3421 |

A GVL of zero implies that there is no selected abnormal region in the corresponding arrays.
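The per-patient averaging of array-level GVL scores can be reproduced with a short sketch. The array-level values below are copied from the table above for four patients; the function name `patient_gvl` and the dictionary layout are illustrative assumptions.

```python
from statistics import mean


def patient_gvl(array_scores):
    """Average per-array GVL scores into one score per patient."""
    return {patient: mean(scores) for patient, scores in array_scores.items()}


# Array-level GVL scores taken from the table (one entry per fraction).
array_scores = {
    "MDS-1": [0.3871, 0.3799],
    "MDS-2": [0.3581, 0.4441],
    "MDS-3": [0.0, 0.0, 0.3369],
    "MDS-6": [0.3336, 0.3119, 0.1719],
}

for patient, score in patient_gvl(array_scores).items():
    print(patient, round(score, 4))
```

Arrays with no selected abnormal region contribute a GVL of zero to the average, which is why MDS-3 ends up with a low patient-level score despite one erythroid array with a GVL of 0.3369.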

From

Based on the classification results by cell morphology and IPSS score, we perform the two-group

Morphology | IPSS | |||||

Without MDS-7 | df | df | ||||

9.3989 | 9 | 0.0001 | 4.2432 | 9 | 0.0022 | |

With MDS-7 | df | df | ||||

5.6028 | 10 | 0.0002 | 3.6182 | 10 | 0.0047 |

The cutoff value of copy number one and three in

With the proposed CMA algorithm, we detect the CNA regions using the mean and SD of every five consecutive SNPs as criteria. Computing the mean in a region can be regarded as a constant regression that predicts the real log-2 ratios. We have also tried methods other than constant regression, such as local linear regression and quadratic regression, to select the CNA regions by the thresholds on mean and SD. Most of the CNA regions selected by the CMA algorithm are also selected by local linear regression, because local linear regression does not change the mean of a region. However, because the fit also corrects the SD, it almost abolishes the SD restriction, which leads to overestimation, especially for heavily noisy data. Quadratic regression essentially changes both the mean and the SD of a specific region and does not give an intuitive view of the log-2 ratio; hence it is not robust enough.
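The effect of replacing the constant fit with a local linear fit can be seen on a toy window. The sketch below is only one possible reading of the comparison: it assumes the SD is computed on the residuals of the fitted model, and the helper `residual_sd` is a hypothetical name introduced for illustration.

```python
from statistics import mean, stdev


def residual_sd(values, degree):
    """SD of residuals after fitting a constant (degree 0) or a line (degree 1).

    Assumption: SNPs are equally spaced, so their index is used as x.
    """
    n = len(values)
    xs = list(range(n))
    y_bar = mean(values)
    if degree == 0:
        fitted = [y_bar] * n
    else:
        x_bar = mean(xs)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
        # The least-squares line passes through (x_bar, y_bar),
        # so the mean of the fitted values equals the window mean.
        fitted = [y_bar + slope * (x - x_bar) for x in xs]
    residuals = [y - f for y, f in zip(values, fitted)]
    return stdev(residuals)


window = [0.0, 0.2, 0.4, 0.6, 0.8]  # steadily drifting log-2 ratios
print(residual_sd(window, 0))  # large spread around the mean
print(residual_sd(window, 1))  # almost no spread around the fitted line
```

The window has a mean of 0.4, above the 0.35 threshold, but its SD around the mean is well above 0.15, so the constant fit rejects it. Around the fitted line the residual SD is essentially zero, so an SD restriction applied after a linear fit would no longer exclude this drifting window.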

Statistical approaches for analyzing copy number data aim at detecting the regions of genomic alteration. One alternative method is to model the data explicitly as a series of segments with unknown boundaries and heights, and then to set up performance criteria or optimize an objective function, as proposed in

Arrays | Regions | CMA | CBS | |

MDS-8 B | Chr 7 monosomy | Y | Y | Y |

MDS-8 E | Chr 7 monosomy | Y | Y | Y |

MDS-1 B | 7q34–7q36.1 | Y | Y | Y |

MDS-3 E | 7q34 | Y | N | N |

MDS-1 B | 7p21.3 | Y | N | N |

MDS-2 E | 7p14.2 | Y | N | N |

MDS-1 E | 7q14.1 (mean = 0.56; SD = 0.47) | N | Y | N |

MDS-1 E | 7q34 (mean = 0.25; SD = 0.20) | N | Y | N |

MDS-2 E | 7p31.1 (mean = 0.24; SD = 0.34) | N | Y | N |

MDS-2 M | 7p31.3 (single SNP) | N | Y | N |

MDS-2 M | 7q34 (mean = 0.29; SD = 0.28) | N | Y | N |

B: Blast; E: Erythroid; M: Myeloid.

Next we want to compare the CNA regions discovered with the CMA algorithm and

CMA | H | L | df | ||||

mean | SD | mean | SD | ||||

Morphology | 5.93 | 2.83 | 0.64 | 0.56 | 7.07 | 9 | 0.0001 |

IPSS | 6.22 | 3.67 | 1.85 | 2.44 | 4.72 | 0.0011 |

H | L | df | |||||

mean | SD | mean | SD | ||||

Morphology | 14.30 | 2.96 | 9.17 | 5.03 | 2.01 | 9 | 0.0753 |

IPSS | 12.38 | 1.67 | 11.18 | 5.68 | 0.35 | 0.7344 |

Two-group

Finally, we discuss the window length of the CMA algorithm, as it is a critical parameter for the success of our study. Note that the overlapping regions selected by the CMA algorithm are merged into larger, non-overlapping regions; therefore, the length of the final copy number aberration regions is not fixed at five. In this study, the length five can be regarded as an initial length. Our choice is based on the real-time PCR results: the copy number aberration region would not be selected if the length were changed to six, because the log-2 ratios of six consecutive SNPs satisfy neither the mean nor the SD criterion. The mean and the SD for a length of five are −0.356269 and 0.1229, respectively, whereas they change to −0.296754 and 0.1826 for a length of six. Furthermore, we expect that an initial length that is too short would yield many more false positive regions. By trying different lengths, we conclude that the proposed length of five is the most suitable and can serve as the minimum length for the selection of CNA regions. However, the user may change the initial length to suit the data. If prior knowledge about the data is available, we recommend choosing the initial length according to that prior information.

Details of the used SNP arrays. The references are marked in shade.

(0.05 MB DOC)

Comparison of copy number aberrations detected by the CMA algorithm and CNAG (MDS-7 is excluded). The cutoff values of copy number one and three in CNAG are −0.35 and 0.35, and the window size of the moving average is 5 (chromosomes with only a single altered SNP are excluded). Two-group t-tests are performed under the null hypothesis that the means of the two groups are not significantly different.

(0.03 MB DOC)

Comparison of copy number aberrations detected by the CMA algorithm and CNAG (MDS-7 is excluded). The cutoff values of copy number one and three in CNAG are −0.49 and 0.30 (default setting), and the window size of the moving average is 5. Two-group t-tests are performed under the null hypothesis that the means of the two groups are not significantly different.

(0.03 MB DOC)

Comparison of copy number aberrations detected by the CMA algorithm and CNAG (MDS-7 is excluded). The cutoff values of copy number one and three in CNAG are −0.49 and 0.30 (default setting), and the window size of the moving average is 5 (chromosomes with only a single altered SNP are excluded). Two-group t-tests are performed under the null hypothesis that the means of the two groups are not significantly different.

(0.03 MB DOC)

The authors would like to thank the referees for their good suggestions on the previous version of the manuscript, as well as for many valuable comments that have improved this work. The authors would also like to thank their colleagues at the Bioinformatics Core, The Methodist Hospital Research Institute, for the discussions and all the valuable suggestions regarding the research.