Non-Parametric Change-Point Method for Differential Gene Expression Detection

Background We proposed a non-parametric method, named Non-Parametric Change Point Statistic (NPCPS for short), by using a single equation for detecting differential gene expression (DGE) in microarray data. NPCPS is based on the change point theory to provide effective DGE detecting ability. Methodology NPCPS used the data distribution of the normal samples as input, and detects DGE in the cancer samples by locating the change point of gene expression profile. An estimate of the change point position generated by NPCPS enables the identification of the samples containing DGE. Monte Carlo simulation and ROC study were applied to examine the detecting accuracy of NPCPS, and the experiment on real microarray data of breast cancer was carried out to compare NPCPS with other methods. Conclusions Simulation study indicated that NPCPS was more effective for detecting DGE in cancer subset compared with five parametric methods and one non-parametric method. When there were more than 8 cancer samples containing DGE, the type I error of NPCPS was below 0.01. Experiment results showed both good accuracy and reliability of NPCPS. Out of the 30 top genes ranked by using NPCPS, 16 genes were reported as relevant to cancer. Correlations between the detecting result of NPCPS and the compared methods were less than 0.05, while between the other methods the values were from 0.20 to 0.84. This indicates that NPCPS is working on different features and thus provides DGE identification from a distinct perspective comparing with the other mean or median based methods.


Introduction
When normal gene expression is exposed to radiation, virus infection, etc., it would cause gene mutation or gene abnormal activation, which probably leads to cancer arising [1]. There are observable differences between cancer and normal tissues in their expression values on single-gene level, which enables recognition of cancer-related gene from a statistical perspective.
Based on microarray gene expression profiling [2], many methods were reported aiming to detect such difference in gene expression, or normally called differential gene expression (DGE) [3,4]. Among these methods, T-statistics is a classical and widelyused DGE detecting methods, which works on the hypothesis that all the cancer samples are over-expressed compared with the normal samples [5]. Other work has also presented meaningful results, such as empirical Bayes approach [6] (Efron 2001), mixture model approach [7] (Pan, 2003), and SAM [8] (Storey 2003). However, considering the heterogeneity of gene activation, it is reasonable to assume that DGE could only take place in a subset of cancer samples. Many methods were proposed to solve DGE detection under this assumption, such as PPST (permutation percentile separability test) [9] (Lyons-Weiler, 2004), COPA (cancer outlier profile analysis) [10,11], OS (outlier sum) [12]  , ORT (outlier robust t-statistics) [13] (Wu, 2007), and MOST (maximum ordered subset t-statistics) [14] (Lian, 2008).
Most of the aforementioned methods attempt to identify the abnormal data points based on the overall percentile of the gene expression profile. However, it is reasonable to assume that the DGE detection could be achieved by searching for the change point of the gene expression profile If we consider the single-gene expression profile as a data sequence, for non-DGE sequence, there is no significant change between the data distributions of normal and cancer samples; for DGE sequence, since the gene expression is over regulated in cancer group, the data distribution of cancer and normal samples become distinctly different, which would result in a significant change point in the sequence of gene expression profiles.
Change point problem [15] was widely studied in many fields, such as atmospheric and financial analysis. There are also applications of change-point theory to the microarray analysis, for example, a change point detection model for genomic sequences of continuous measurements [16], ARTIVA formalism for topology inference of regulatory network [17], a Bayesian model for DGE patterns of the DosR regulon of Mycobacterium tuberculosis in the timing of gene induction [18]. With respect to DGE analysis, there are BRIDGE (Bayesian robust inference for differential gene expression) for DGE detection in microarrays with small sample sizes [19], and DGE detecting method LRS (likelihood ratio test) [20] (Hu, 2008).
Since a few of the currently available change-point methods deal explicitly with estimation of the number and location of change points, and moreover these methods may be somewhat vulnerable to deviations of model assumptions usually employed [16], we propose a non-parametric statistical method for DGE detection, named as NPCPS (Non-Parametric Change Point Statistics). NPCPS is based on modified Kolmogorov statistic to detect the single-change point in a data sequence [21]. This method compares the data distribution of normal and cancer group to detect the existence of possible change-point in the cancer group, and to estimate the position of change-points. Besides, as a non-parametric inferential method, NPCPS does not make assumptions about the probability distributions of the variables being assessed, and accordingly, it is not necessary to normalize the microarray data before calculating the test statistic like other parametric methods usually do. As comparison, we tested several percentile-based methods and LRS. BRIDGE was not included as it is originally designed for two-sample problem and application to larger sample size is computationally heavy. NPCPS works comfortably with large-scale dataset, and both simulation and experiment results show that NPCPS is effective for DGE detection.

Methods
Suppose X~x 1 ,:::,x r ,:::,x n are independent random variables with cumulative distribution function F , and r is the change point of X . Then, for distribution function F r of X r~x1 ,:::,x r and F n{r of X n{r~xrz1 ,:::,x n , there exists a value x 0 that satisfies F r (x 0 )=F n{r (x 0 ),x 0 [ ½x 1 ,x n . Otherwise, if r is not the change point, we have F (x 0 )~F r (x 0 )~F n{r (x 0 ),x 0 [ ½x 1 ,x n . The change point is also noted as Gene expression profile of a single gene could also be considered as a sequence of independent variables as below: X 1~x1 ,:::,x n 1 , X 2~x1 ,:::,x n 2 , X~X 1 zX 2~x1 ,:::,x n 1 ,:::,x n ,n~n 1 zn 2 : Here, X 1 contains expression values of normal samples in known distribution function F 1 (x), and X 2 contains expression values of disease samples. Over or under expression values in X 2 would result in a change point in X .
To detect the change point, in the hypothesis test we used a modified Kolmogorov statistic (K-statistic), which evaluates the distance between two distribution functions: ,k~0,1,:::,n{1, ð3Þ is the empirical distribution function of X n{k ,:::,X n , defined as: where I is an indicator function. F {1 1 (y) is the inverse function of F 1 (x) defined as where y is a variable increasing with a fixed step that is subject to user's selection (we selected 100 in the simulation study and experiment). Therefore, the testing procedure is defined as where ½ : means round toward negative infinity. Null hypothesis H 0 is true when sup To give an estimate of change point, we define and Lett t 0 (n) be the estimate of t 0 , which is defined as: Since the test statistic measures the difference between two distribution functions, larger D n (t,y) j j indicates more significant DGE, while the positive D n (t,y) corresponds to under-expression and the negative D n (t,y) corresponds to over-expression, respectively.

Methods compared with NPCPS
Gene expression profile obtained from microarray data is often considered as a g6n matrix, which contains g rows of genes with their expression levels in n samples, in which normal group has n 1 samples, and disease group has n 2 samples. Let x ij be the expression intensity of the ith gene in the jth sample of the normal group, while i = 1, 2, …, g, j = 1, 2, …, n 1 ; let y ij be the expression intensity of the ith gene in the jth sample of the disease group, while i = 1, 2, …, g, j = 1, 2, …, n 2 . The median of the ith gene is defined as Define med ix , the normal-group median of the ith gene, and med iy the cancer-group median as med iy~m edian 1ƒjƒn 2 (y ij ): Parametric methods for DGE in cancer subset. There have been many parametric methods proposed based on the mean, median and median absolute deviation (MAD) of the gene expression profile, and following are some typical methods.
COPA [10,11]: COPA first normalizes the expression data using the group mean and MAD to prevent impact to the data distribution hypothesis by outliers, then sorts the expression value and detects cancer genes though the rth percentile of the cancer group. If the MAD of the ith gene is approximated as The COPA statistic is defined as where q r (i) is the rth percentile of the ith gene expression values, which is subject to the user's selection.
OS [12]: OS introduced heuristic rule as an additional function, and also applied the percentile knowledge to detect DGE. OS normalizes every gene to ensure the same data scale, which is convenient for gene comparison. In OS, gene expression values greater than Q 3 (i)+IQR(i) and smaller than Q 1 (i)2IQR(i) are statistically considered as DGE. The OS statistic is defined as below.
The group with over expression is defined as Similarly, the group with under expression DGE is defined as ORT [13]: Compared with OS, ORT is similar but uses the median of the normal group instead of the median of all data, and estimates the absolute error using the median of several groups instead of the square error as in COPA [11]. The purpose of these changes is to acquire more robust and consistent estimation. Accordingly, the estimate MAD in ORT is The ORT statistic is where C i is the cancer group of the ith gene. For over expression,  Similarly, for under expression, MOST [14]: Gene with expression value greater than MOST ik is considered as a differential gene expression. The testing statistic MOST ik is defined as: When k is unknown, the data are normalized firstly by m k and s 2 k , and the MOST ik is defined as Non-parametric methods for DGE in cancer subset. PPST [9]: As a non-parametric method, PPST compares the expression levels of thousands of genes in two sample groups, i.e. the control group (A) and the case group (B). The detection focuses on genes in group A of which the expression levels are higher than a certain percentile of group B's expression values (A.B), which is a type I error in statistics, and vice versa (B.A). There are two marks for each gene, s1 and s2. S1 is the number of samples in group A that are higher than the 95% of group B added by the number of samples in group B that are lower than the 95% of group A. S2 is defined as the opposite to s1. Considering a given gene, if the expression value in group A is higher than the 95th percentile of group B, it is considered as over expression; if the expression value is lower than that in group B, it is considered as under expression. Define the PPST statistic of each gene with over expression as: ppst i~X m i~1 y ij IfZ i ,y ij wq 95 (x i1 ,:::,x in )g ð 24Þ PPST statistic of gene with under expression can be obtained similarly.
LRS [15]: in LRS, cancer outlier samples are viewed as coming from a distribution with higher mean expression intensity than all the normal and other cancer samples. The purpose of LRS is to test such unequal mean. For up-regulation, LRS first organizes all the samples so the non-cancer samples are arranged before the cancer samples, and the cancer samples are sorted by their expression intensities in the ascending order. S n is the summation of the expression intensities of all the samples and the LRS statistic is as follows, Traditional methods for DGE in entire cancer group. Tstatistic [5]: this traditional method assumes that the cancer sample group is generally over or under-expressed compared with the normal samples. The t-statistic is defined as: where x i is the sample mean of normal group expression values and y i is the sample mean of cancer group expression values, S i is the estimate of combined standard deviation. Differentially expressed genes are recognized when the testing statistic exceeds a certain threshold.

Monte Carlo simulation
Monte Carlo simulation can be used to evaluate the performance of a hypothesis test in terms of the ratio of Type I error, i.e. false positive rate (FPR). For each Monte Carlo simulation, NPCPS was applied to an artificial 7000-gene dataset (normal random numbers with mean = 0, standard deviation sd = 1) composed of n 1 normal samples and n 2 cancer samples, of which k (0,k,n 2 ) cancer samples contained DGE simulated by adding a      (Table 2) were computed and the results of simulation with a = 0.01 were illustrated in Fig. 1. For data set n 1 = n 2 = 25 (Fig. 1A), the FPR was larger when k was smaller; FPR decreased with k increasing; when k was equal to or larger than 9, the detecting accuracy of NPCPS was sufficient to satisfy the significance level. For data set n 1 = n 2 = 50 (Fig. 1B), k should be not less than 9 to satisfy the significance level. The estimate of change point enhanced greatly when k increased; the estimated position became very close to the actual position at the same time as the FPR dropped below the significance level. This indicates that NPCPS is highly sensitive to left boundary and less sensitive to the right boundary, when the F 2 information is not sufficient [21].

ROC analysis on simulated data
First, we test NPCPS (a = 0.01) and seven other methods, namely COPA, ORT, OS, MOST, T, LRS, and PPST, on normally distributed datasets (mean = 0, sd = 1) with different m, n and k. When k was getting greater, all methods produced better ROC ( Fig. 2 and Fig. 3). For m = 2, when n = 50 ( Fig. 2A-2C), NPCPS was slightly weaker than LRS, and better than the other methods; when n = 100 (Fig. 2D-2E), NPCPS was very similar to LRS, and better than the other methods. For m = 1, NPCPS gave the best performance for both n = 50 and n = 100 datasets and different values of k (Fig. 3A-3F). This indicated that NPCPS had better sensitivity for less significant DGE compared with the other seven methods. Among the non-parametric method, PPST was not significantly better than the parametric methods, while LRS and NPCPS were continuously better than the other methods. This indicated that methods based on change-point were more effective and robust than methods based on percentile and MAD.
Second, we tested NPCPS (a = 0.01) and other seven methods on datasets generated from skew-normal (SN) distribution (Fig. 4). For different n and k, NPCPS had significantly larger area under the ROC curves compared with the other methods. By comparing Fig. 2 and Fig. 4 we can see that NPCPS was both effective for normal and skew-normal data distribution, and when k = 9 gave similarly good ROC ( Fig. 2C compared with Fig. 4C, Fig. 2F compared with Fig. 4E). The other seven methods, including non-parametric methods, LRS and PPST, had inferior results when working with skew-normal data. The AUC of ROC is summarized in Table 3 and 4.

DGE detection in breast-cancer microarray data
The microarray data used in the experiment are provided by West [22]. In their experiment, primary breast tumors (between 1.5 and 5 cm in maximal dimension) from the Duke Breast Cancer SPORE frozen tissue bank were selected and diagnosed as invasive ductal carcinoma. In each case, a diagnostic axillary lymph node dissection was performed. The final dataset includes 49 samples, 25 samples of which have negative lymph nodes and 24 samples with positive lymph nodes, used here as normal sample and cancer sample, respectively. Gene expression profile of 7129 genes was obtained through annotation package hu6800 [23]. The original gene expression values ranged from 34 to 43053 and were initialized to the range from 3.5 to 10.7. Seven detection method (t-statistic, COPA, OS, ORT, MOST, PPST and LRS) were applied to the initialized gene expression profile, while NPCPS was applied to the original data. The calculated test statistics of these 7129 genes by these methods were sorted in descending order.
For NPCPS, C(0.05) = 1.628 was selected, which yields a detecting result of 1978 DGE genes. Fig. 5 shows the distribution of the estimated position of change-points in the expression value of these genes. We selected the first 30 genes ranked by NPCPS, and searched PubMed and other databases to confirm that whether these genes were relevant to breast cancer or other known cancers. Out of the first 30 genes identified by NPCPS, 17 have been reported as relevant to breast cancer or other cancers (as shown in Table 5 and Table 6 separately according to D n value). The gene expression values and the change-point (CP) positions of the cancer-relevant genes are illustrated in Fig. 6 and Fig. 7. From Fig. 6 and 7, it could be seen that the estimated change-point positions could successfully locate the change in the trend of the gene expression value.   Moreover, by comparison among the ranking results of all eight methods based on the test statistic, it was noticed that genes topranked by NPCPS were ranked considerably lower by other methods, most of which were mean and median based parametric methods. When inspecting all the 7129 genes, the overall trend in ranking difference between NPCPS and other methods became more obvious. Table 7 shows the pair-wise linear correlation of gene ranking among the six methods. For NPCPS, the positive correlation is below 0.007 with OS, and the negative correlation below 0.047 with other methods. This indicated that NPCPS had much less correlation with the other five methods, among which the correlations were all positive and valued around 0.5 with each other.  This will be further discussed in the following section. If NPCPS is combined with other methods, it would help to identify genes which are considered as less DGE significant by the other seven methods.

Discussions on the biological significance of NPCPS
Non-parametric statistics. As a non-parametric statistics, NPCPS does not rely on assumptions that the data are drawn from a given probability distribution. It is applicable to input data derived from various types distributions and doesn't require data pre-processing. As such it is the opposite of parametric statistics, which would have inferior performance when the input data are not in the assumed distribution, as in the ROC simulation on normal and skew-normal datasets (Fig. 4).
No restriction on both over expression and under expression. The gene expression profile generated from microarray data usually contains samples of thousands of genes. Genes in the cancer samples might be over or under expressed. Majority of the DGE detecting methods have different formulas for under expressed and over expressed genes, respectively. For example, OS and ORT use different percentile values for over-expression and under-expression, respectively, and apply both formulas to the same microarray data. If over expression formula is applied to under expressed data, the DGE can not be correctly recognized. However, the detected results might contains false alarms, since both over-expression and under-expression formulas are applied to the same gene, and might be detected as DGE significant for twice. Unlike the other methods, NPCPS works for both types of DGE by using the same calculating formula, which would reduce the FDR, and do not require further analysis and computation aiming to clean the false alarms. When over expression formula was applied to under-expressed gene data (Fig. 8A), and vice versa (Fig. 8B), NPCPS presented stable performance in both situations, while other compared methods gave inferior ROC curves. According to the characteristic of ROC, T and MOST could have good ROC if the prediction result was inversed. The ROC curves of LRS were in the zone of random guess, which was close to the line-of-no-discrimination. Using LRS for under-expresson, user could turn under-expression into overexpression by inversing the dataset. This indicated that when over-expression formula of LRS was applied to under-expression, the random detecting result would be given.

Estimated change point position:t t 0
The biological meaning oft t 0 lies in that once the position of change point is estimated or located, we can identify which sample contains DGE. Then, rather than identifying DGE existence in n = n 1 +n 2 samples on the single gene level, we can learn that, for one sample containing thousands of genes, how many genes were over expressed or under expressed. This statistical information can be used to analyze features of each sample, and the results of which could be applied to the estimation of the differentiation degree of cancer in different development stages. Distance between two distribution function: D n NPCPS results showed that, among the 7219 genes, 3608 had negative D n , while the rest 3521 had positive D n . NPCPS use D n to evaluate the change in distribution between normal and cancer samples, and directly measure the DGE type as either over expressed or under expressed. This feature is valid based on the expression value in Fig. 6 and 7, where Fig. 6 (positive D n ) shows typical under expression and Fig. 7 (negative D n ) shows typical over expression. Fig. 9 and Fig. 10 can illustrate the relationship between D n and DGE in a more intuitive manner where cumulative data distributions of several typically ranked genes are given.
Genes in Fig. 9 were ranked on the top by NPCPS, where Fig. 9A and 9B are corresponding to positive D n (under-expression), 9C and 9D are corresponding to negative D n (over-expression), respectively. By comparing the empirical distribution of cancer and normal samples, in Fig. 9A and 9B, cancer group was significantly left to the normal group, which demonstrated under expression; in Fig. 9C and 9D, the cancer group was significantly right to the normal group, which demonstrated over-expression. The distribution graph was consistent with the D n value.
Genes in Fig. 10 were ranked lower by NPCPS. We can find that the cumulative distance between the data distributions of normal and cancer group is generally smaller compared with those genes top-ranked by NPCPS. From the empirical data distribution, difference between cancer and normal groups were very small.
As comparison, Fig. 11A-11J shows the data distributions of those top-ranked genes by the parametric methods, and Fig. 12A-12D by LRS and PPST. The data distributions were more similar to genes that were bottom-ranked by NPCPS in that small percent of the samples bring significant increase to data range. These few samples would greatly impact the cancer-group mean or median, which consequently result in a high test statistic of parametric methods. For example, in Fig. 11B and 12B, 96% of the two curves were close to each other while 4% data points in the normal group valued much greater, which equals to one outlier sample out of the 25 normal samples. Considering that the outliers were in the normal group, it was reasonable to assume that these outliers might be caused by microarray noise. For the rest of Fig. 11 and 12, except for T-statistic, the cancer group had one outlier. Fig. 11 and 12 indicate that the comparing methods are sensitive to significant change in mean and median, even when the change is introduced by a single sample which might be outliers. NPCPS is less prone to report a DGE as such few outliers are not sufficient to produce a large D n .
In summary, NPCPS is less sensitive to right boundaries and tends to find genes that have greater cumulative distance between the data distribution of normal and cancer groups. For such genes, the samples in normal and cancer group may have the same data range but should have very different distributions. Therefore, the detecting result of NPCPS would be different from other compared methods, which are more sensitive to outliers that influence the data range, rather than the cumulative distance between distributions. In other words, NPCPS values continuous change in data distribution over the whole data range, while the other methods look for a significant change of mean or median. This would explain the low correlation between NPCPS and other methods.

Conclusion
A non-parametric statistical method, NPCPS, was proposed for DGE detection based on change-point theory. NPCPS uses the data distribution of normal and cancer samples as the only input to detect a change point that indicates DGE, in order to identify potential cancer genes. Distribution-based NPCPS does not require data pre-initialization and is computationally efficient compared with other median-based parametric methods. Contrast to the compared methods, NPCPS deals with both over-expression and under-expression by the same equation. Another unique feature is that the proposed NPCPS could estimate both the number and the location of cancer samples with DGE could be estimated. Simulation study and experiments showed that, the proposed NPCPS method had better reliability and accuracy; NPCPS was more effective than the compared parametric methods; similar ROC curves was given compared with LRS when sample size was larger; when the simulated DGE value was smaller, i.e. DGE was less significant, NPCPS had better sensitivity compared with the other seven methods. Simulations also indicated that, for cancer subgroup with size greater than 8, the NPCPS had FPR less than 0.01. Besides, the detection results of NPCPS had very low correlation with the compared methods, both parametric and non-parametric, which indicates that NPCPS provides meaningful detection results different from other methods. Since cancer samples could be categorized according to different stages in the cancer development, DGE detection can also be considered also a multi-class problem. Further effort could be focused on the multi-change-point in the distribution of microarray gene expression profile.