Weighted Change-Point Method for Detecting Differential Gene Expression in Breast Cancer Microarray Data

In previous work, we proposed a method for detecting differential gene expression based on change-point of expression profile. This non-parametric change-point method gave promising result in both simulation study and public dataset experiment. However, the performance is still limited by the less sensitiveness to the right bound and the statistical significance of the statistics has not been fully explored. To overcome the insensitiveness to the right bound we modified the original method by adding a weight function to the Dn statistic. Simulation study showed that the weighted change-point statistics method is significantly better than the original NPCPS in terms of ROC, false positive rate, as well as change-point estimate. The mean absolute error of the estimated change-point by weighted change-point method was 0.03, reduced by more than 50% comparing with the original 0.06, and the mean FPR was reduced by more than 55%. Experiment on microarray Dataset I resulted in 3974 differentially expressed genes out of total 5293 genes; experiment on microarray Dataset II resulted in 9983 differentially expressed genes among total 12576 genes. In summary, the method proposed here is an effective modification to the previous method especially when only a small subset of cancer samples has DGE.


Introduction
Selecting differentially expressed genes [1,2] is one of the most important tasks in microarray applications. Many methods were proposed to compare patterns of gene expression between cells or tissues of different kinds and under different conditions, for example, between normal and cancer cells. The goal of these methods has been to enable faster, simpler, more sensitive and systematic analyses [3]. Among these methods, t-statistics is a classical and widely-used DGE detecting methods, which works on the hypothesis that all the cancer samples are over-expressed compared with the normal samples [4]. Several other methods are also based on this hypothesis, such as empirical Bayes approach [5], mixture model approach [6], and SAM [7]. However, considering the heterogeneity of gene activation, many genes show increased expressions in disease samples, but only for a small number of those samples [8]. The study of Tomlins et al. [9,10] shows that t-statistics has low power in this case, and they introduced cancer outlier profile analysis (COPA) method which performs better than the traditional t-statistics for cancer microarray data sets. More recently, several progresses have been made in this direction with the aim to design better statistics to account for the heterogeneous activation pattern of the cancer genes, such as non-parametric method PPST (permutation percentile separability test) [11] (Lyons-Weiler, 2004) and LRS (likelihood ratio test) [12] (Hu, 2008); percentile based methods OS (outlier sum) [13] (Tibshirani, 2007), ORT (outlier robust tstatistics) [14] (Wu, 2007) and TriORT [15] as an improvement to ORT; MOST (maximum ordered subset t-statistics) [16] (Lian, 2008) and TriMOST [17], which is an improvement to MOST.
Previously, we proposed a non-parametric change point statistics (NPCPS) method [18] based on modified Kolmogorov statistic to detect the single change-point (CP) in a data sequence [19]. This method compares directly the data distribution of normal and cancer group to detect conveniently the existence of possible change-point in the cancer group, giving an estimate of the change-point as well. Besides, as a non-parametric inferential method, NPCPS does not make assumptions about the probability distributions of the variables being assessed, and accordingly, it is not necessary to normalize the microarray data before calculating the test statistic like other parametric methods usually do. By simulation and experiment, NPCPS is effective for DGE detection and outperforms the compared methods with better ROC results in many circumstances [18]. However, the performance of this change-point based method is still limited by the less sensitiveness to the right bound and the statistical significance of the static has not been fully explored. Therefore, here we present an improved method, Weighted Change-Point Statistics (WCPS) aiming to break the limitations.

Monte Carlo simulation and ROC analysis
Monte Carlo simulation was applied to evaluate the hypothesis test used in the proposed method. For each Monte Carlo simulation, the proposed method was applied to an artificial 7000-gene dataset in normal distribution (mean = 0, standard deviation = 1) and multiple simulations were carried out with positive m = 2, and different sample size n (normal group size n 1 and cancer group size n 2 equal to n/2) and DGE sample size k (0,k,n 2 ). The false positive rate (FPR, i.e. genes with DGE were recognized as no DGE existence) and average estimate of change point (Table 1 and Table 2) were computed. Generally, for both methods, the estimate of change point and FPR enhanced together when k increased; after FPR dropped below the significance level (0.01 in this case), the estimated position converges to the actual position. However, for a given k, the proposed method outperforms the original NPCPS with closer CP estimate and lower FPR; with k increasing, the proposed method converged faster to the true change point and reached zero FPR before the original NPCPS method. For normally distributed data, between WCPS and NPCPS, the FPR is 0.09 versus 0.17; for skewnormally distributed data, the FPR was 0.08 versus 0.12. Besides, the mean absolute error (MAE) of estimated CP by WCPS was 0.03, while MAE by NPCPS was more than 0.06.
Results of more simulations with different m and k are in Table 3.
The proposed method and other seven methods as comparison were then applied to two types of dataset, one in normal distribution and the other in skew-normal distribution, and each type contained several datasets with different m, n and k. The other seven methods are NPCPS, LRS, TriMOST, TriORT, COPA, OS and T-statistics. The AUC of ROC analysis on both types of dataset is summarized in Table 4 and Table 5, and the ROC in Fig. 1 and Fig. 2, respectively.
Results show that the proposed method had larger AUC than the other methods, more significantly when k was smaller. Generally, change-point based methods, namely WCPS, NPCPS and LRS were better than the percentile-based methods in terms of ROC in the simulation study, while WCPS had the best performance; among the percentile based methods, T-statistic was    very effective, while TriORT and TriMOST were better than the other two methods in terms of ROC.
The simulation result proved that by adding a weight to the original function, the proposed method becomes more sensitive to smaller k.

DGE detection in microarray data of breast-cancer
Result on Dataset I. Dataset I contains microarray data of 49 samples from breast cancer tissues as described in the Material and Methods section. Based on the previous experiment result, among the 5293 valid and unique genes of the dataset, NPCPS (C(0.05) = 1.628) yielded a detecting result of 1598 DGE genes and 17 out of 36 top ranked genes were reported as relevant to breast cancer or other known cancers. By applying the proposed method to the same dataset, for C(0.05) = 1.628, there were 2279 DGE genes being detected (1258 over expressed genes and 1021 under expressed genes, respectively); for C(0.05) = 1.358, there were 3974 DGE genes being detected (2230 over expressed genes and 1744 under expressed genes, respectively). All the top 50 ranked genes were reported as cancer-relevant.
Among the recognized differentially expressed genes, most of them have been reported as involved directly with cancer in published papers, such as AGER, MAPK14, etc. Some genes themselves have not yet been reported, but their related genes, proteins, or behaviors have been reported as cancer-relevant, such as DGKD (EGFR and DAG related, ranked 481) [20]. Some of the genes with higher D n statistic are suspected as participants of cancer cell lines. For example, gene CCDC130 (ranked 384) is potentially cancer relevant and currently under research in order to reveal the characterization of CCDC130 in cancer cell signaling [21]. Gene ranked in the first 500, such as AHDC1 (ranked 159), LIG3 (ranked 409), DMD (ranked 75), have not yet been reported formally as cancer-relevant. However, given the significant difference between cancer and normal group, it is reasonable to assume there is high possibility that these genes might participate in cancer development.
Some of the top 50 genes are listed in Table 6 with the cancerrelevant description . The data distributions of two typically ranked genes are in Fig. 3 and 4. It is clear that the estimated change point could locate the actual changing point in the gene expression data. Particularly, the cancer samples that are 'more overly expressed' than the sample on the change point could be recognized as located in the area specified by the red dashed lines of CP.
The number of DGE samples of each gene is calculated and the corresponding histogram of detected DGE genes is displayed in Fig. 5    Result on Dataset II. As described in the Section of Material and Methods, Dataset II contains microarray data of 42 samples of 12576 genes, 18 samples of histologically normal (HN) epithelium from breast cancer patients, 6 samples of highrisk prophylactic mastectomy (PM) patients, and 18 samples of reduction mammoplasty patients. After applying WCPS to the dataset, when threshold is 1.358, there are 9793 over-high expressed gene and 190 over-low expressed genes, respectively; when the threshold is 1.628, the over expressed genes reduced to 867 over-high and 10 over-low, respectively. Apparently, this dataset contains majorly over-high expressed genes. Among the 50 top-ranked genes, 43 genes have been clearly reported as relevant to human cancer. Among the rest 6 genes, third-ranked gene AP000944.1 is a lincRNA and long non-coding RNA has drawn the research attention of its functional role in human cancer [54]; CENPM gene itself are not yet reported as cancerrelevant, but inappropriate expression of the centromere proteins CENP-A and CENP-H could be a major cause of chromosomal instability that has been recognized as a hallmark of human cancer [55]; 50-ranked gene HPN cooperates with MYC in the  progression of adenocarcinoma in a prostate cancer mouse model [56].
NPCPS was also applied to this dataset and yielded 2564 and 337 differentially expressed genes with threshold 1.358 and 1.628, respectively.
WCPS detected much more differentially expressed genes compared with NPCPS. Moreover, the rankings between these two methods are only about 50 percent relevant. WCPS successfully recognized genes that are lower ranked or ignored by NPCPS. Fig. 7 and Fig. 8 show expression data of several such genes. Fig. 9 illustrates the total number of DGE genes in each HN sample. HN sample 11 and 18, two ER+ breast cancer patient samples, have more than 6000 differentially expressed genes. HN sample 1, 2, 9, three ER2 breast cancer patient sample and 13, an ER+ patient sample have more than 2000 differentially expressed genes. Fig. 10 is the top ranked gene by WCPS.
The 6 PM samples are from high-risk women and, as in the work by Graham et al., gene expression in histologically normal epithelium from breast cancer patients and from cancer-free PM patients shares a similar profile [57]. Therefore, we also tested the dataset consisted of 6 PM samples as the case group and 18 RM Table 6. Cancer-related description of top-ranked genes.

Rank
Gene Description

AGER Strong expression is seen in cells at the invasive edge of tumors and correlates with invasion and lymph node metastasis [22]
2 GP1BB Different histological types of lung cancer may be distinguished from normal tissue based on differential DNA methylation of GP1bbeta [23] 3 PDE4B The phosphodiesterase PDE4B limits cAMP-associated PI3K/AKT-dependent apoptosis in diffuse large B-cell lymphoma [24] 4 MAPK14 The expression of p-p38 and uPA was negatively correlated to prognosis of breast cancer [25] 5 SMARCA2 Encodes BRM in the SWI/SNF chromatin-remodeling complex. SWI/SNF related and loss of SWI/SNF-mediated transcriptional activation increases DNA methylation in cancer cells [26] 6 TCF3 Protein TCF3 no longer binds DNA when modified by a phosphate, making Phosphorylated TCF3 a new diagnostic marker for cancer [27] 7 NCSTN NCSTN coded protein is a subunit of c-Secretase compound, which is related to Notch signaling, a pathway found dysregulated in many cancers [28] 8 C9 Upregulation of plasma C9 protein in gastric cancer patients [29] 9 SCARB2 SCARB2 and CSNK1 double negative mRNA expression seems to be predictive of the presence of non-compromised lymph nodes in oral squamous cell carcinoma [30] 10 BMP1 BMP molecules have further been shown to have an impact on the biological behaviour of breast cancer cells [31] 11

MEF2A
Mediates synergistic transcriptional responses to the CaMK and MAPK signaling pathways by signal-dependent dissociation from histone deacetylases [32], which regulate the expression and activity of numerous proteins involved in both cancer initiation and cancer progression [33] 12 MYOG Terminal myogenesis switches off cell proliferation and migration, hence, the promotion of rhabdomyosarcoma differentiation should antagonize tumor growth and metastasis [34] 13 RPL36A Over-expression of RPL36A is associated with cellular proliferation in hepatocellular carcinoma [35] 14 SLC5A5 NIS expression is prevalent in breast cancer brain metastases and could have a therapeutic role via the delivery of radioactive iodide and selective ablation of tumor cells [36] 15 JAG1 Associated with a basal phenotype and recurrence in lymph node-negative breast cancer [37] 16 MMP11 Expression reflects the stages of tumor differentiation and LNM of breast cancer [38] 17 NEFL Neurofilament proteins are markers for neuroendocrine tumors [38] 18 SLC4A2 (AE2) AE2 might be associated with gastric carcinogenesis and the achlorhydria experienced by gastric cancer patients [40] 27 MYL1 Myosin VI is critical in maintaining the malignant properties of the majority of human prostate cancers diagnosed today [41] 28 IGHD Immunoglobulin D enhances the release of tumor necrosis factor-alpha [42] 29 ZNF131 Repressor of ERalpha signaling [43] 30 RBBP6 Involvement of RbBP6 gene and apoptosis in the pathogenesis of lung cancer [44] 31 IQGAP1 IQGAP1 plays a critical role in colon cancer cell invasion, and therefore diffuse and high expression of IQGAP1 predicts poor prognosis in patients with colorectal carcinoma [45] 35 UNC119 UNC119 is required for G protein trafficking in sensory neurons [46], while G protein signaling is involved in tumor growth and angiogenesis [47] 38 PTPRR The protein tyrosine phosphatase receptor type R gene is an early and frequent target of silencing in human colorectal tumorigenesis [48] 39 UBB Essential mediator of trichostatin A-induced tumor-selective killing in human cancer cells [49] 40 MGST2 Microsomal glutathione Stransferase II. Glutathione plays a critical role in cellular mechanisms that result in cell death [50] 44 ACAP1 ACAP1 is a GTPase activating protein specific for Arf6 [51], which is required in breast cancer invasive activities [52] 47 NAT6 (FUS2) Function of NAT6 plays an important role in cancer as the gene maps to the chromosomal region 3p21.3, which includes at least one tumor suppressor gene [53] doi:10.1371/journal.pone.0029860.t006 samples as the control group. As a result, when threshold C(0.05) = 1.358, there are 7344 over-expressed genes and 79 under-expressed genes, respectively. Fig. 10 shows one of the topranked genes, in which the gene expression of PM samples are not only over-expressed compared with the RM samples in case group, but also generally higher than the 18 HN samples from breast cancer patients. Fig. 11 summarizes number of DGE genes in each PM samples. PM sample 1, 2, 4, and 6 have significantly more DGE genes compared with PM sample 3 and 5. This result corresponds to the average expression of the total 12576 genes from the 6 samples.

Change-point in gene expression
The method we proposed here inherited the definition of change-point as described in NPCPS [18]. Consider gene expression value as a sequence of independent variables as below: X 1~x1 ,:::,x n 1 , X 2~x1 ,:::,x n 2 , X~X 1 zX 2~x1 ,:::,x n 1 ,:::,x n ,n~n 1 zn 2 : Here, X 1 contains expression values of normal samples in known distribution function F 1 (x), and X 2 contains expression values of cancer samples. Over or under expression values in X 2 would result in a change point in X. The existence of change point is evaluated by a modified Kolmogorov statistic (K-statistic), which indicates the distance between two distribution functions. Suppose F 1 21 ( y) is the inverse function of F 1 (x), which is defined as where y is a variable increasing with a fixed step that is subject to user's selection. Then, the testing procedure is defined as

Weighted Change-point Statistic
The aim of NPCPS is to find the largest D n and check if the value exceeds the threshold, while the position of the largest D n value indicates the most significant changes in the expression profile of a single gene. According to the ROC curves obtained from simulation study [18], NPCPS was more than 99% correct when for a single gene there are more than 9 samples that contain DGE. However, NPCPS is not very sensitive to the right bound as shown in Fig. 12. When there is only a small subset of cancer group, especially when k,5, NPCPS would have inadequate D n values and consequently would not always report the existence of change point. Fig. 13 illustrates the descending trend of D n value.
When there is no simulated DGE added to the normally distributed data, D n function shows a descending curve.
Therefore, in order to enhance the right-bound sensitiveness, it is reasonable to assume that by adding a proper weight function to the original function, the D n statistic could be adequately compensated even if the change occurs in the last few data points. Apparently, the goal of the weight function is to moderately compensate the right end of the D n statistic to avoid a rigid positive result, while keeps the D n value on the left end as well as in the middle as much as possible, which would resemble a function similar to 1/x. Besides, as D n is a step function, the weight function should also have the same step as D n statistic.    The weight function as in Fig. 14 is as follows: and the weighted D n is defined as The weighted D n function demonstrated better response to small subset that has DGE as shown in Fig. 12. Both estimated change point and type II error of WCPS show better results compared with NPCPS. Besides, from Fig. 13 we can see that adding a weight function does not give an unreasonable rise to the right bound when there is no DGE in any samples of the simulated data.

Experiment on Breast cancer microarray dataset
Two datasets were tested in the experiment. One microarray dataset (referred to as dataset I) of breast cancer [58], the same dataset used in [18] includes 49 samples all from cancer tissues, with different status of lymph node (LN) and estrogen receptor (ER), i.e. LN+ER2/LN+ER+/LN2ER+/LN2ER2. As the negative-lymph-node breast cancer is categorized as early stage breast cancer, these 49 samples could be categorized into two types: 25 samples with negative lymph node as the normal samples and 24 samples with positive lymph node as the cancer samples, respectively. Besides, gene expression profile of 7129 genes in the samples was obtained through annotation package hu6800 [59]. Probes of genes obsolete in NCBI gene bank were deleted; for multiple probes mapping to the same gene, only the probe that corresponded to the largest D n was kept. These two steps resulted in a total 5293 genes. This dataset was tested by all methods mentioned in simulation study. Before applied to LRS, COPA, TriMOST, TriORT, OS, and T-statistics, the gene expression values were first normalized. Before applied to WCPS, the expression values in cancer group were sorted in ascending order for each gene.    The other one (referred to as dataset II) is a 42-sample dataset obtained on platform Affymetrix Human Genome U133A Array. The samples contains 3 subsets: 18 samples of normal breast epithelia from reduction mammoplasty patients (RM sample); 18 samples of histological normal breast epithelia from 9 ER+ and 9 ER2 breast cancer patients (HN samples); and 6 samples of histologically normal breast epithelium from prophylactic mastectomy patients (PM samples) [57]. 18 RM samples and 6 PM samples were considered as the control group, while the 18 HN samples were the case group in the original article. This dataset was tested by WCPS.
For method NPCPS, LRS, TriMOST, TriORT, COPA, OS and T-statistic, the genes were ranked according to the different statistic in descending order. Genes ranked in the top indicated higher degree of DGE.
For WCPS, change-point was determined by weighted D n statistic. Genes with weighted D n larger than C(a) were recognized as having DGE. Specially, for detecting result under C(a) = 1.358 and based on the type of DGE (over high or over low), sample values that exceed the expression value at the change-point could be identified on single gene level. This would result in an array containing binary values of 0 or 1, where 0 indicates non-DGE sample and 1 indicates significant DGE sample. Therefore, for all genes in a dataset, these arrays could be combined to construct a matrix. Based on the matrix, the DGE genes contained in each cancer sample, or the size of DGE cancer sample subset could be calculated.