Figures
Abstract
With an adaptive partition procedure, we can partition a “time
course” into consecutive non-overlapped intervals such that the population
means/proportions of the observations in two adjacent intervals are
significantly different at a given level α. However, the
widely used recursive combination or partition procedures do not guarantee a
global optimization. We propose a modified dynamic programming algorithm to
achieve a global optimization. Our method can provide consistent estimation
results. In a comprehensive simulation study, our method shows an improved
performance when compared to the recursive combination/partition
procedures. In practice, α can be determined
based on a cross-validation procedure. As an application, we consider the
well-known Pima Indian Diabetes data. We explore the relationship between the
diabetes risk and several important variables, including the plasma glucose
concentration, body mass index and age.
Citation: Lai Y (2011) On the Adaptive Partition Approach to the Detection of Multiple Change-Points. PLoS ONE 6(5): e19754. https://doi.org/10.1371/journal.pone.0019754
Editor: Mike B. Gravenor, University of Swansea, United Kingdom
Received: December 10, 2010; Accepted: April 15, 2011; Published: May 24, 2011
Copyright: © 2011 Yinglei Lai. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the National Institutes of Health grants R21DK075004, R01GM092963 and Samuel W. Greenhouse Biostatistics Research Enhancement Award (GW Biostatistics Center). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Time-course data analysis is common in biomedical studies. A “time course” is not necessarily a certain period of time in the study. More generally, it can be patients' age records or biomarkers' chromosomal locations. A general “time” variable can be a predictor with continuous or ordinal values. When we analyze a response variable (binary, continuous, etc.), it is usually necessary to incorporate the information from this predictor. Many well-developed regression models can be used for the analysis of this type of data. In this study, we focus on a nonparametric type of analysis of time-course data: the whole time course is partitioned into consecutive non-overlapped intervals such that the response observations are similar within the same block but different between adjacent blocks. The partition of the time course is then actually the detection of change-points. The detection of a single change-point has been well studied in the statistical literature [1]. However, the detection of multiple change-points, with its many unknown parameters (the number of change-points, the locations of the change-points and the population means/proportions in each block), remains a difficult problem [2].
A motivating example for this study is described as follows. The body mass index
(BMI) is calculated by dividing the mass (in kilograms) by the square of the height
(in meters). The recent WHO classification of BMI gives six categories: underweight
(<18.5), normal weight (18.5–24.9), overweight
(25.0–29.9), class I obesity (30.0–34.9), class II obesity
(35.0–39.9) and class III obesity (≥40.0). For the
continuous variable BMI, its values are classified into six categories based on five
cut-off points. Normal weight is considered low risk, while the risk of the
underweight category is elevated, and the risks of overweight and class I, II and III
obesity increase progressively. Therefore, as BMI increases, the risk trend is
neither simply increasing nor decreasing, but “U”-shaped. A question
motivated by this classification is: given data on BMI and health status,
can we partition the variable BMI into consecutive non-overlapped intervals
(categories) such that the risks are similar within an interval but significantly
different between two adjacent intervals?
In this study, our purpose is to partition a “time course” into
consecutive non-overlapped intervals such that the population means/proportions of
the observations in two adjacent intervals are significantly different at a given
level α. This type of analysis can provide informative results in
practice. For example, medical experts may provide an appropriate consultation based
on a patient's blood pressure level.
The isotonic/monotonic regression (or the order restricted hypothesis testing) is a traditional nonparametric trend analysis of time-course data [3]. Since the maximum likelihood estimation results are increasing/decreasing piecewise constants over the time course, this analysis can also be considered as a special case of change-point problem. Based on the traditional isotonic/monotonic regression, the reduced isotonic/monotonic regression has been proposed so that the estimation results can be further simplified [4]. Its additional requirement is that the estimated population means in two adjacent blocks must be significantly different at a given level. However, the existing method is based on a backward elimination procedure and does not guarantee the maximum likelihood estimation results.
Without the constraint of trend shape, the detection of multiple change-points for our study purpose can be achieved through a recursive-algorithm-based method like recursive combination or recursive partition. The circular binary segmentation algorithm is a typical example of recursive partition [5]. In the middle of a large block, the method recursively tries to detect a sub-block with a significantly different population mean. The analysis results are piecewise constants. This method has been frequently used for analyzing array-CGH data [6]. The reduced isotonic regression (see above) is an example of recursive combination. These recursive algorithms provide approximate solutions as alternatives to the globally optimized solutions since an exhaustive search is usually not feasible. Therefore, global optimizations are not always guaranteed.
The dynamic programming algorithm is a frequently used method for optimizing an objective function with ordered observations [7]. Therefore, it is intuitive to consider this algorithm in the analysis of time-course data. This algorithm has actually been frequently used to implement many statistical and computational methods [8],[9],[10],[11],[12],[13]. For a feasible implementation of this algorithm, an optimal sub-structure is necessary for the objective function. This is usually the case for the likelihood based estimation in an unrestricted parameter space. However, when there are restrictions for the parameters, certain modifications are necessary for the implementation of the dynamic programming algorithm.
In the following sections, we first present a modified dynamic programming algorithm
so that a global optimization can be achieved for our analysis. The algorithm was
originally developed for normal response variables, but the extension of
our method to binary response variables is straightforward and is also discussed
later. We prove that this method can provide consistent estimation results. Then, we
suggest a permutation procedure for the p-value calculation and
a bootstrap procedure for the construction of time-point-wise confidence intervals.
We use simulated data to compare our method to the recursive combination/partition
procedures. The well-known Pima Indian Diabetes data set is considered as an
application of our method. We explore the relationship between the diabetes risk and
several important variables including the plasma glucose concentration (in an oral
glucose tolerance test, or OGTT), body mass index (BMI) and age.
Methods
A modified dynamic programming
At the beginning, we introduce some necessary mathematical notations. Consider a
simple data set with two variables: t_1 < t_2 < … < t_m represents m
distinct ordered indices (referred to as “time
points” hereafter), and {y_ij: i = 1, …, m; j = 1, …, n_i}
represents the observations, with y_ij being the
j-th observation at the i-th time point. Let
μ_i be the corresponding population mean of y_ij at the
i-th time point. We assume that y_ij = μ_i + ε_ij, where
the ε_ij are independently and identically distributed (i.i.d.) with the
normal distribution N(0, σ²). Furthermore, we assume that the set
{μ_1, …, μ_m} has a structure
μ_1 = … = μ_{k_1} ≠ μ_{k_1+1} = … = μ_{k_2} ≠ … ≠ μ_{k_{c−1}+1} = … = μ_m
with 1 ≤ k_1 < k_2 < … < k_{c−1} < m.
σ², the means {μ_1, …, μ_m} and the set of change-points
{k_1, …, k_{c−1}} are all unknown
(including c, the number of blocks) and are to be estimated.
The traditional change-point problem assumes that c = 2. When
c > 2, it is a multiple change-point problem. If there is no
strong evidence of change-points, we may consider the null hypothesis of no
change-point (c = 1),
H_0:
μ_1, μ_2, …, μ_m are all the same. For the traditional analysis of variance
(ANOVA), we consider an alternative hypothesis
H_A:
μ_1, μ_2, …, μ_m can be different. Then, even when the null hypothesis
can be significantly rejected, there may be many
adjacent μ_i
estimated with similar values. Therefore, we intend to
group similar and adjacent μ_i
into a block. If
this is achievable, then we can have a detection of multiple change-points.
Therefore, we specify the following restricted parameter space for the
alternative hypothesis,
H_α:
μ_1, μ_2, …, μ_m can be different;
if we claim any μ_k ≠ μ_{k+1}, then they are
significantly different at level α
by a two-tailed
(or upper-tailed/lower-tailed) test with the two-sample data partitioned to
include t_k and t_{k+1} in each
sample, respectively.
Remark 1.
The comparison based on adjacent time points has no effect and
H_α is reduced to H_A
when α = 1. Clearly,
H_α is reduced to H_0
when α = 0. Furthermore,
when an upper-tailed/lower-tailed test is specified, the analysis is the
reduced isotonic/monotonic regression [4]. Particularly, the
analysis is the traditional isotonic/monotonic regression [3] when
α = 1 (when a one-sided t-test is used
for comparing two sample groups).
The goal of this study is to partition {t_1, …, t_m} into
consecutive non-overlapped intervals such that the population means of the
observations in two adjacent intervals are significantly different at a
given level α. This type of
analysis cannot be achieved by the computational methods for the order
restricted hypothesis testing (or the isotonic regression) due to the
existence of the significance parameter. One may consider the well-known dynamic
programming (DP) algorithm [7] since the observations are collected at
consecutive time points. This is again not feasible: an optimized partition
for a subset of time points may be excluded for a larger set of time points
due to the significance requirement in
H_α. However, we
realize that the traditional DP algorithm can be modified to achieve our
goal by adding an additional screening step at each time point.
Algorithm.
Due to its satisfactory statistical properties, the likelihood ratio based
test (LRT) has been widely used. To conduct an LRT, we need to estimate the
parameters under different hypotheses. When normal population distributions
(with a common variance) are assumed for the y_ij, the maximum
likelihood estimation is equivalent to the estimation by minimizing the sum
of squared errors (SSE). When a block of time points {t_a, …, t_b}
is given, the associated SSE is simply
Σ_{i=a}^{b} Σ_{j=1}^{n_i} (y_ij − ȳ_{a:b})², where
ȳ_{a:b} is the sample
mean of the observations in the time block
{t_a, …, t_b}. Notice that
the SSE of several blocks is simply the sum of the SSEs of the individual blocks.
Then, under the alternative hypothesis, we propose the following algorithm
that is modified based on the well-known DP algorithm. For simplicity, we
refer to the term “triplet” as a vector containing (link, index,
score) that are described in the algorithm. (For each triplet,
“link” is defined as the time point right before the block under
current consideration; “index” is defined as the index in the
triplet set linked from by the time point under current consideration;
“score” is defined as the objective function value under current
consideration.)
link ← 0; index ← 0; score ← SSE of the block {t_1}
Create S_1 as a vector set with only one element with the
above triplet
for i ← 2 to m do
    link ← 0; index ← 0; score ← SSE of the block {t_1, …, t_i}
    Create S_i as a vector set with only one element with the above
    triplet
    for k ← 1 to i − 1 do
        Go through S_k as ordered until a feasible index can be found
        link ← k
        index ← current position in S_k
        score ← (current score in S_k) + SSE of the block {t_{k+1}, …, t_i}
        Include the above triplet as a new element in S_i
    Sort S_i according to the increasing order of SSE
    scores
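The modified DP algorithm above can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the paper's implementation: `modified_dp`, `two_sided_p` and `norm_sf` are hypothetical names, and the feasibility screen uses a pooled-variance two-sample statistic with a normal approximation to the t reference distribution for brevity.

```python
import math

def norm_sf(z):
    # upper tail probability of the standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def two_sided_p(x, y):
    # two-sided two-sample test for equal means (pooled variance; normal
    # approximation to the t reference distribution, for brevity)
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        return 1.0  # too few observations: report p = 1 (cf. Remark 4)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    if sp2 == 0.0:
        return 1.0 if mx == my else 0.0
    z = (mx - my) / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
    return 2.0 * norm_sf(abs(z))

def modified_dp(y, alpha):
    # y[i]: list of observations at time point i (0-indexed).
    # Returns (change-point positions, optimized SSE); a change point at
    # position k means a new block starts at time point k.
    m = len(y)

    def sse(a, b):  # SSE of the block of time points a..b inclusive
        obs = [v for i in range(a, b + 1) for v in y[i]]
        mean = sum(obs) / len(obs)
        return sum((v - mean) ** 2 for v in obs)

    # S[i]: triplets (score, link, index) sorted by increasing SSE score;
    # link = -1 encodes "no change point before this block"
    S = [None] * m
    S[0] = [(sse(0, 0), -1, -1)]
    for i in range(1, m):
        cand = [(sse(0, i), -1, -1)]  # one block covering t_1..t_i
        for k in range(i):
            right = [v for t in range(k + 1, i + 1) for v in y[t]]
            # screen S[k] as ordered until a feasible link is found
            for idx, (score, link, _) in enumerate(S[k]):
                left = [v for t in range(link + 1, k + 1) for v in y[t]]
                if two_sided_p(left, right) <= alpha:  # significant change
                    cand.append((score + sse(k + 1, i), k, idx))
                    break
        S[i] = sorted(cand)
    # backtrack from the best triplet at the last time point
    cps = []
    score, k, idx = S[m - 1][0]
    while k >= 0:
        cps.append(k + 1)
        score, k, idx = S[k][idx]
    return sorted(cps), S[m - 1][0][0]
```

On well-separated data, the sketch recovers the true block boundary, e.g. three time points near 0 followed by three near 5 yield a single change point at position 3.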
Remark 2.
It is important that the maximum likelihood estimation is equivalent to the
estimation by minimizing the sum of squared errors (SSE) when normal
population distributions (with a common variance) are assumed for
the y_ij. Notice that a normal distribution is assumed for
each time point. The population means can be different at different time
points, but the population variance is common for all the time points. Due
to the optimal substructure requirement for the dynamic programming
algorithm, we can only estimate the parameters specific to the existing
partitioned blocks. As shown in the above algorithm, the estimation of
variance can be achieved after the estimation of population means. (The
algorithm would not work if the common population variance had to be
estimated within the algorithm.)
Remark 3.
The definition of a feasible index in S_k is a time
point (link) such that the two population means in the blocks
{t_{link+1}, …, t_k} and {t_{k+1}, …, t_i} are
significantly different at level α (as specified
in the restricted parameter space H_α). A flow chart
is given in Figure 1 to
illustrate this algorithm. The set S_k contains an
optimized link index among the feasible ones for each of its previous time
points 0, 1, …, k − 1 (if there is
no feasible link index for a previous time point, then that time point will be
excluded from S_k). Since the
required conditions for the restricted population means are imposed on
adjacent blocks, the set S_k also contains
all the necessary link points for the future time points when the time point
k is screened as a link point. (Then, it is not
necessary to check other time points not included
in S_k.) This can be confirmed as follows: if any future
time point uses the time point k as a link
point, then any sub-partitions stopped at the time point k
must meet the required conditions for the restricted
population means; furthermore, an optimized one will be chosen from the
feasible ones for each time point before the time point
k; therefore, these sub-partitions belong to
S_k.
Theorem 1.
With the mathematical assumptions described at the beginning, the proposed modified dynamic programming algorithm solves the maximum likelihood estimation of restricted population means.
Proof: With the above discussion, it is not difficult to give a
proof. The claim obviously holds at the time points 1 and 2 since
the algorithm just enumerates all possible partitions and selects the
optimized one. Assume that the claim holds up to the time point
i. For the time point i + 1, the algorithm
screens all the previous time points and selects the optimized and feasible
solution (when it exists) for each of them. Finally, all these locally
optimized solutions are sorted so that a globally optimized solution can be
found. Therefore, the claim also holds at the time point
i + 1. This concludes the proof that the claim holds for
all the time points.
Remark 4.
To test whether μ_k ≠ μ_{k+1} (or
μ_k < μ_{k+1} or μ_k > μ_{k+1}) is
significant at level α, we consider
the well-known two-sample Student's t-test. The
statistical significance (p-value) of a
test value can be evaluated based on the theoretical
t-distribution or through the permutation procedure
[14].
Considering the likely small sample size at each time point and the
computing burden involved in this analysis, the theoretical
p-value may be a more preferred choice when the
observations are approximately normally distributed. (The required sample
size for calculating a t-test p-value is no
less than two for each sample; otherwise, one will be considered as the
reported p-value.)
Estimators.
When we finish screening the last time point, the overall optimized
partition can be obtained in a backward manner:

Create the set C with the single element m
(link, index) ← (link, index) of the first triplet in S_m
while link > 0 do
    Include link as a new element in the set C
    (link, index) ← (link, index) of the index-th triplet in S_link

The estimated means μ̂_1, …, μ̂_m in the
restricted parameter space H_α are simply the
sample means for the partitioned blocks. With μ̂_1, …, μ̂_m
calculated, we can estimate the variance by
σ̂² = Σ_{i=1}^{m} Σ_{j=1}^{n_i} (y_ij − μ̂_i)² / Σ_{i=1}^{m} n_i.
Compared to the traditional dynamic programming algorithm, which requires
O(m²) computing time, our modified algorithm requires at
least O(m²) but at most O(m³) computing
time. The additional computing time is necessary so that the optimization
can be achieved in the restricted parameter space
H_α.
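Given a partition, the estimators above can be sketched directly; this is a minimal illustration, with `estimate_parameters` a hypothetical name and the variance estimate taken in the maximum-likelihood form (total SSE divided by the total number of observations, as stated above).

```python
def estimate_parameters(y, blocks):
    # y[i]: observations at time point i; blocks: list of (a, b) index
    # ranges of the partitioned blocks (0-indexed, inclusive).
    # Each time point's estimated mean is its block's sample mean; the
    # variance estimate is the total SSE divided by the total number of
    # observations.
    m = len(y)
    mu_hat = [0.0] * m
    sse = 0.0
    n_total = sum(len(obs) for obs in y)
    for a, b in blocks:
        obs = [v for i in range(a, b + 1) for v in y[i]]
        mean = sum(obs) / len(obs)
        for i in range(a, b + 1):
            mu_hat[i] = mean
        sse += sum((v - mean) ** 2 for v in obs)
    return mu_hat, sse / n_total
```

For example, with two blocks of two time points each, every time point in a block receives that block's sample mean, and the pooled within-block SSE drives the variance estimate.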
Consistency of Estimation
The following theorem shows that our proposed algorithm can provide consistent
estimates for μ_1, …, μ_m and
σ². The mathematical
proof is given in File S1.
Theorem 2.
Let n = min(n_1, …, n_m) and assume that
n → ∞. Then, for any time point
i, we have μ̂_i → μ_i in probability.
Furthermore,
we also have σ̂² → σ² in probability.
Here, we briefly provide an outline of the proof for readers who wish to skip
the mathematical derivation. When the sample size at each time point becomes
larger and larger, eventually the true block structure defined by the
change-points k_1, k_2, …, k_{c−1}
will be a
feasible partition of time points (since the power of the two-sample tests
for these adjacent partitions will go to 100%). Its corresponding
estimates are actually the sample means and they are consistent estimators.
Furthermore, the estimated variance, which is closely related to the SSE,
will eventually be optimal when the sample size becomes larger and larger.
Moreover, our algorithm guarantees a minimized SSE. Then, the estimated
population means provided by our algorithm will be closer and closer to the
underlying sample means. Therefore, our algorithm can provide consistent
estimates for the underlying population means. Then, it is straightforward
to prove the convergence of the variance estimator.
Remark 5.
The estimation bias and variance for isotonic regression are difficult
problems [15],[16],[17]. These two
issues are even more difficult for our adaptive partition approach since a
two-sample test is involved in the detection of multiple change-points.
(However, the built-in two-sample test can be an appealing feature for
practitioners.) Therefore, we use the well-known permutation and bootstrap
procedures to obtain the p-value of the test
and the confidence limits of the estimates. They are briefly described
below.
F-type test and its p-value
We use SSE_α to denote the score in the first element of
S_m. This is the optimized SSE associated with the
restricted parameter space H_α. The SSE
associated with the null hypothesis is simply
SSE_0 = Σ_{i=1}^{m} Σ_{j=1}^{n_i} (y_ij − ȳ)², where
ȳ is the overall sample mean. Then, we can
define an F-type test:
F = (SSE_0 − SSE_α)/SSE_α.
It is straightforward
to show that F is actually a likelihood ratio test. However, it is
difficult to derive the null distribution of F due to the
complexity of our algorithm. Therefore, we propose the following permutation
procedure for generating an empirical null distribution.
- Generate {y*_ij} as a random sample (without replacement) from
{y_ij};
- Run the modified DP algorithm with {t_i} and {y*_ij}
as input and calculate the associated F-type test score
F*;
- Repeat steps 1&2 B times to obtain a collection of permuted
F-scores {F*_1, …, F*_B}, which can be considered as an empirical null distribution.

The procedure essentially breaks the association between
{t_i} and {y_ij}. It is also
equivalent to permuting the expanded time point set
(see below). In this way, the null hypothesis can be
simulated with the observed data and the null distribution of
F can be approximated after many permutations [18]. Then, the
p-value of an observed F-score can be
computed as (number of F*_b ≥ F)/B. For a
conservative strategy, we can include the observed F
into the set of permuted F-scores (since the
original order is also a permutation). We can use this strategy to avoid zero
p-values.
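The permutation machinery can be sketched generically; this is a minimal illustration in which `permutation_pvalue` is a hypothetical name and `stat` is a caller-supplied stand-in for running the modified DP algorithm and computing the F-type score.

```python
import random

def permutation_pvalue(y_flat, stat, B=200, seed=0):
    # y_flat: all observations listed in time order; stat: a function
    # returning the F-type score for observations in that order (here a
    # stand-in for the full DP-based computation).
    # The observed order counts as one permutation, which avoids zero
    # p-values (the conservative strategy described above).
    rng = random.Random(seed)
    observed = stat(y_flat)
    count = 1  # the identity permutation
    for _ in range(B):
        perm = list(y_flat)
        rng.shuffle(perm)  # break the time/observation association
        if stat(perm) >= observed:
            count += 1
    return count / (B + 1)
```

With a statistic that separates the first and second halves of the time course, data with a strong shift produce a small p-value, while exchangeable data produce a large one.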
Time-point-wise confidence intervals
Our algorithm provides an estimate of the population mean at each time point. It is
difficult to derive a theoretical formula for constructing a confidence
interval. Instead, we can consider a bootstrap procedure. At the beginning, we
need to expand the time variable so that each observation carries its own time
index: the i-th time point t_i is repeated
n_i times, giving the expanded set {τ_1, …, τ_n}
with n = Σ_i n_i. Then, we denote the paired data
{(τ_l, y_l): l = 1, …, n}.

- Generate {(τ*_l, y*_l)}, where each
(τ*_l, y*_l) is re-sampled from
{(τ_l, y_l)} with replacement;
- Run the modified DP algorithm with {τ*_l}
and {y*_l}
and estimate the time-point-wise population means; if any time point in
{t_1, …, t_m} is not re-sampled in
{τ*_l}, then assign missing values as the estimates for those time points;
- Repeat steps 1&2 B
times to obtain a collection of resample estimates of time-point-wise population means.

The procedure applies the “plug-in principle” so that a resample distribution can be generated for each time point [19]. A pair of empirical percentiles (e.g. 2.5% and 97.5%) can be used to construct a confidence interval for each time point after excluding the missing values.
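The bootstrap steps above can be sketched as follows; this is a minimal illustration in which `bootstrap_pointwise_ci` is a hypothetical name and `estimate` is a caller-supplied stand-in for the DP-based estimation of time-point-wise means.

```python
import random

def bootstrap_pointwise_ci(pairs, estimate, B=200, level=0.95, seed=0):
    # pairs: the "expanded" data, one (time_index, observation) pair per
    # observation; estimate: a function mapping a list of pairs to
    # {time_index: estimated mean} for the time points present (time
    # points absent from a resample simply stay missing, as in step 2).
    rng = random.Random(seed)
    samples = {}
    for _ in range(B):
        boot = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        for i, mu in estimate(boot).items():
            samples.setdefault(i, []).append(mu)
    q_lo, q_hi = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    ci = {}
    for i, vals in samples.items():  # empirical percentile interval
        vals.sort()
        ci[i] = (vals[int(q_lo * (len(vals) - 1))],
                 vals[int(q_hi * (len(vals) - 1))])
    return ci
```

Because each resample is a multiset of (time, observation) pairs, a time point can vanish from a replicate; the dictionary layout lets such replicates contribute nothing for that time point, matching the missing-value convention in step 2.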
Extension to binary response variables
It is straightforward to extend our method to binary response variables.
This can be achieved simply by changing the objective function (SSE in our
current algorithm) to the corresponding (negative) log-likelihood function for
the binary response variable, and also changing the two-sample
t-test in our current algorithm to the corresponding
two-sample comparison test for the binary response variable. Due to the
computing burden, we suggest using the two-sample
Z-test for proportions when there is a satisfactory sample
size, or Fisher's exact test when the sample size is small (e.g. less
than six in each cell of the 2×2 contingency table).
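A pooled two-sample Z-test for proportions, as one possible choice of comparison test here, can be sketched as follows; `two_proportion_p` is a hypothetical name and the p-value uses the standard normal reference distribution.

```python
import math

def two_proportion_p(x1, n1, x2, n2):
    # two-sided two-sample Z-test for proportions (pooled):
    # x = number of successes, n = number of trials in each sample
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # all successes or all failures: no evidence of a difference
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # 2 * upper normal tail
```

Equal sample proportions give a p-value of 1, while strongly separated proportions give an essentially zero p-value.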
Choice of α
An appropriate choice of α is important. A
small number of partitioned blocks will be obtained if a small value is set for
α, and vice versa. For example, no partition will be
obtained if α = 0
and each time point will be a separate block if
α = 1. Therefore, like the smoother span parameter for the
local regression [20],
α can also be
considered as a smoothing parameter. In practice, we suggest using the
cross-validation procedure [21] to select
α. Among a given
finite set of candidate values, we can choose the
one that minimizes the prediction error.
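The selection rule can be sketched generically; `select_alpha` and `cv_error` are hypothetical names, with `cv_error` standing in for a full cross-validation run of the partition procedure at a given level. Ties are broken toward the smaller α, since a smaller α yields statistically more significant adjacent changes.

```python
def select_alpha(alphas, cv_error):
    # alphas: candidate significance levels; cv_error(alpha) -> the
    # cross-validated prediction error at that level (caller-supplied).
    best = None
    for a in sorted(alphas):  # ascending order breaks ties toward small alpha
        e = cv_error(a)
        if best is None or e < best[1]:
            best = (a, e)
    return best[0]
```

For instance, if the CV error is minimized near 0.1 the rule returns 0.1; if all candidates tie, it returns the smallest candidate.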
Approximation algorithms: recursive combination and recursive partition
To illustrate the advantage of global optimization in the likelihood estimation
of restricted population means, we also consider and implement two widely used
approximation algorithms: recursive combination and recursive partition. These
algorithms provide approximate (sometimes exact) solutions to the optimal
solution in the restricted parameter space defined based on the given
α. Based on the following descriptions of these two
algorithms, it is clear that their required computing time is at most
O(m²).
The recursive combination algorithm begins with no partition, where each
time point is a block. In each loop, it conducts a two-sample test for each pair
of adjacent blocks and finds the pairs with p-values higher than
α; among the combinations based on these selected pairs of
adjacent blocks, the one that results in the largest overall likelihood (based on
all the data) is chosen and the next loop is started when it is still possible
to combine the existing blocks; otherwise, the algorithm stops and returns the
partitioned blocks. [Notice that this algorithm is slightly different from
the one proposed by [4], in which the pair of blocks is chosen completely
based on the two-sample test. In our simulation studies, we have observed that
the likelihood based criterion can result in a better performance (results not
shown).]
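The recursive combination loop can be sketched as follows; this is a minimal illustration in which `recursive_combination` is a hypothetical name, and `p_value` and `loglik` are caller-supplied helpers (the two-sample test and the overall log-likelihood of a block structure).

```python
def recursive_combination(y, alpha, p_value, loglik):
    # y[i]: observations at time point i. Start with each time point as
    # its own block; in each loop, among the adjacent pairs whose
    # two-sample p-value exceeds alpha, merge the pair that yields the
    # largest overall log-likelihood; stop when no adjacent pair of
    # blocks can be combined.
    blocks = [[i] for i in range(len(y))]
    while True:
        best, best_ll = None, None
        for k in range(len(blocks) - 1):
            left = [v for i in blocks[k] for v in y[i]]
            right = [v for i in blocks[k + 1] for v in y[i]]
            if p_value(left, right) > alpha:  # not significantly different
                merged = blocks[:k] + [blocks[k] + blocks[k + 1]] + blocks[k + 2:]
                ll = loglik(y, merged)
                if best_ll is None or ll > best_ll:
                    best, best_ll = merged, ll
        if best is None:
            return blocks
        blocks = best
```

With a crude test that flags only large mean differences as significant, four time points with means near 0, 0, 5, 5 collapse into the two expected blocks.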
The recursive partition algorithm begins with one block containing all the time
points. In each loop, within each existing partitioned block, it conducts a
two-sample test for each possible partition (which generates two smaller blocks)
and finds the triplets (when the partitions are from the blocks in the middle
of the time course) or pairs (when the partitions are from the blocks on the
boundaries of the time course) with test p-values lower than
α; among the partitions based on these selected
triplets/pairs, the one that results in the largest overall likelihood (based on
all the data) is chosen and the next loop is started when it is still possible
to create new partitions; otherwise, the algorithm stops and returns the
partitioned blocks.
Performance evaluation
In a cross-validation (CV) procedure (e.g. leave-one-out or 10-fold CV), the
estimated population mean μ̂_{(−ij)} for each
observation y_ij
can be obtained based on the training data without
y_ij. (If the time point t_i is not included in
the training set, then μ̂_{(−ij)} can still be
obtained based on the linear interpolation between the two nearest time points to
t_i.) Then, the CV (prediction) error is calculated as
Σ_{i=1}^{m} Σ_{j=1}^{n_i} (y_ij − μ̂_{(−ij)})².
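The CV error with the interpolation fallback can be sketched as follows; this is a minimal illustration in which `cv_error` is a hypothetical name and `fit` is a caller-supplied stand-in for fitting the partition model to the training data.

```python
def cv_error(y, fit, folds):
    # y[i][j]: j-th observation at time point i; folds: list of held-out
    # (i, j) index sets; fit: a function mapping training data (same
    # layout as y) to a list of estimated means per time point, with
    # None for time points absent from the training set. Missing
    # estimates are filled by linear interpolation between the nearest
    # estimated time points, as described above.
    err, n_held = 0.0, 0
    for held in folds:
        held_set = set(held)
        train = [[v for j, v in enumerate(obs) if (i, j) not in held_set]
                 for i, obs in enumerate(y)]
        mu = fit(train)
        for i in range(len(mu)):
            if mu[i] is None:
                left = max((k for k in range(i) if mu[k] is not None),
                           default=None)
                right = next((k for k in range(i + 1, len(mu))
                              if mu[k] is not None), None)
                if left is None:
                    mu[i] = mu[right]     # extrapolate from the right
                elif right is None:
                    mu[i] = mu[left]      # extrapolate from the left
                else:                      # interpolate linearly
                    mu[i] = mu[left] + (mu[right] - mu[left]) \
                        * (i - left) / (right - left)
        for i, j in held:
            err += (y[i][j] - mu[i]) ** 2
            n_held += 1
    return err / n_held
```

In the leave-one-out example below, `fit` is simply the per-time-point sample mean, so a held-out time point with no remaining training data is recovered by interpolation between its neighbours.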
Remark 6.
In a simulation study, instead of a CV error, we can use the overall mean
squared error since we know the parameter values. This is a strategy to save
a significant amount of computing time. For each round of simulation, the
overall mean squared error is calculated as
Σ_{i=1}^{m} (μ̂_i − μ_i)²/m. After
N rounds of simulations and estimations (including the
selection of α), it is also
statistically interesting to understand the estimation mean squared error
(MSE), bias and variance at each time point. The time-point-wise mean
squared error, bias and variance (for the
i-th time point,
i = 1, …, m) are calculated as:
MSE_i = Σ_{r=1}^{N} (μ̂_i^(r) − μ_i)²/N;
bias_i = Σ_{r=1}^{N} (μ̂_i^(r) − μ_i)/N;
var_i = Σ_{r=1}^{N} (μ̂_i^(r) − μ̄_i)²/N, where
μ̄_i = Σ_{r=1}^{N} μ̂_i^(r)/N. Notice that
the denominator for var_i
is N
instead of N − 1
such that
MSE_i = bias_i² + var_i.
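The decomposition above can be checked numerically; `pointwise_mse_bias_var` is a hypothetical name for computing the three time-point-wise quantities from replicated estimates.

```python
def pointwise_mse_bias_var(estimates, mu):
    # estimates: replicated estimates of one time point's mean over N
    # simulation rounds; mu: the true mean. The variance uses
    # denominator N (not N - 1) so that MSE = bias**2 + variance holds
    # exactly, as noted above.
    N = len(estimates)
    mse = sum((e - mu) ** 2 for e in estimates) / N
    bias = sum(e - mu for e in estimates) / N
    mean = sum(estimates) / N
    var = sum((e - mean) ** 2 for e in estimates) / N
    return mse, bias, var
```

For any set of replicated estimates, the returned values satisfy the identity MSE = bias² + variance up to floating-point rounding.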
Results
Simulation studies
We consider four simple scenarios to simulate time-course data: (1) …;
(2) …; (3) …; (4) …. For each
scenario, the number of observations is n = 1, 10, or 100 at
every time point. For each simulated data set, we consider twelve different
values of α. Two-tailed tests are used so that no monotonic changes
are assumed. Since we know the true parameters for the simulations, we choose the
α value that minimizes the overall mean squared error
(MSE). This makes our simulation study computationally feasible since it is
difficult to run the cross-validation procedure for many simulation repetitions.
Then, for each round of simulation, we obtain the “optimized”
overall MSE and the corresponding α
for each of the
three algorithms: the global optimization algorithm (our dynamic programming
algorithm) and two approximation algorithms (the recursive combination algorithm
and the recursive partition algorithm). After 1000 repetitions, we compare the
boxplots of these two results. A lower overall MSE is obviously preferred. But a
lower α value can also be preferred since the detected changes
will be statistically more significant. (α can be considered
as a smoothing parameter if we do not have a pre-specified value for it.
However, it also defines the significance level for the test between any two
adjacent blocks. Then, a smaller α indicates more
significant testing results and is more preferred.) Therefore, a lower
boxplot means a better performance for both results.
For all of the above four scenarios, Figures 2 and 4
show similar patterns. When n is as small as one
for each time point, the approximation algorithms give a better performance in
terms of overall MSE, but the global optimization algorithm still gives a quite
comparable performance (Figure
2); on the choice of α, the global
optimization algorithm gives a better performance and the approximation
algorithms give a comparable performance (Figure 4). When
n increases to 10 and then to 100, we observe a clear
performance improvement from the global optimization algorithm: we can achieve a
clearly smaller overall MSE and also a much more significant (smaller) α
(Figures
2 and 4).
All y-axes represent the overall mean squared error. DP represents our
dynamic programming algorithm; RC and RP represent the recursive
combination and recursive partition algorithms, respectively. The
boxplots in each row (1–4) are generated from the analysis results
based on the corresponding simulation scenario (1–4). The boxplots
in each column (1–3) are generated from the analysis results based
on different n (1, 10 and
100).
To further compare the performance of the three different methods, we calculate the
relative ratio between the two overall MSEs (or the selected
α's) given by each of the two approximation methods
(RC or RP) vs. our proposed method (DP). Based on 1000
simulation repetitions, we can understand the empirical
distributions of these ratios. If any ratio distribution is always no less than
one, then DP is absolutely a better choice. Furthermore, for a ratio
distribution, if the proportion of (ratio > 1) is clearly
larger than the proportion of (ratio < 1), then DP is
still a preferred choice in practice. Corresponding to Figure 2, Figure 3 further demonstrates the advantage
of DP when the sample size is not relatively small. Even when the sample size is
as small as one at each time point, DP still shows a quite comparable
performance. For the selected
α, Figure 5 corresponds to Figure 4 and it also further
confirms the advantage of DP. [In each plot, the proportion of (ratio
> 1) is accumulated from the right end while the
proportion of (ratio < 1) is accumulated
from the left end.]
All y-axes represent the quantile of relative ratio of overall MSEs given
by each of the two approximation methods vs. our proposed method. All
x-axes represent the values used to calculate the empirical quantiles.
The black curves represent RC vs. DP and the gray curves represent RP
vs. DP. (DP represents our dynamic programming algorithm; RC and RP
represent the recursive combination and recursive partition algorithms,
respectively.) The plots in each row (1–4) are generated from the
analysis results based on the corresponding simulation scenario
(1–4). The plots in each column (1–3) are generated from the
analysis results based on different n (1, 10 and
100).
All y-axes represent the selected α. DP
represents our dynamic programming algorithm; RC and RP represent the
recursive combination and recursive partition algorithms, respectively.
The boxplots in each row (1–4) are generated from the analysis
results based on the corresponding simulation scenario (1–4). The
boxplots in each column (1–3) are generated from the analysis
results based on different n
(1, 10 and
100).
All y-axes represent the quantile of the relative ratio of
α's
from each of the two approximation methods vs. our proposed method. All
x-axes represent the values used to calculate the empirical quantiles.
The black curves represent RC vs. DP and the gray curves represent RP
vs. DP. (DP represents our dynamic programming algorithm; RC and RP
represent the recursive combination and recursive partition algorithms,
respectively.) The plots in each row (1–4) are generated from the
analysis results based on the corresponding simulation scenario
(1–4). The plots in each column (1–3) are generated from the
analysis results based on different n
(1, 10 and
100).
In addition to the overall performance based on the overall MSE and the selected
significance level, it is also statistically interesting to understand the
estimation mean squared error, bias and variance at each time point. The
time-point-wise mean squared error (MSE), bias and variance are shown in
Figures 6, 7, 8, 9 for the three different methods. For the
time-point-wise MSE, even when the sample size is relatively small (one
observation at each time point), our proposed method (DP) still shows an overall
performance comparable to the two approximation methods (RC
and RP). As the sample size increases, its time-point-wise MSEs become
comparatively lower and lower. For the time-point-wise bias, when the sample size is
relatively small (one observation at each time point), DP shows an overall worse performance
in simulation scenarios 1 and 3, but it still shows a comparable overall
performance in simulation scenarios 2 and 4. As the sample size
increases, its biases become comparatively lower and lower (i.e., closer to
zero). For the time-point-wise variance, DP always shows an
overall comparable performance. (When the sample size is as small as one observation at each
time point, the estimated time-point-wise means are almost all constants from
all three methods in simulation scenario 2; the
corresponding time-point-wise variance patterns are therefore relatively flat. For the
same sample size, the estimated time-point-wise means are actually all constants
from all three methods in simulation scenario 4; the
corresponding time-point-wise variances are therefore constant across the whole
time period. This also explains the relatively regular patterns of the
corresponding time-point-wise MSE and bias.)
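The time-point-wise quantities above are linked by the standard decomposition MSE = bias² + variance at each time point. As a minimal sketch of how they can be computed across simulation replicates (the function name and array layout are illustrative assumptions, not the code used in the paper):

```python
import numpy as np

def pointwise_mse_bias_var(estimates, truth):
    """Time-point-wise MSE, bias and variance across simulation replicates.

    estimates: (n_replicates, n_timepoints) array, one row of estimated
               means per simulation replicate.
    truth:     (n_timepoints,) array of true means.
    Returns (mse, bias, var), each of shape (n_timepoints,), satisfying
    mse = bias**2 + var at every time point.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    bias = estimates.mean(axis=0) - truth          # average error at each t
    var = estimates.var(axis=0)                    # spread across replicates
    mse = ((estimates - truth) ** 2).mean(axis=0)  # mean squared error
    return mse, bias, var
```

Plotting these three vectors against the time index reproduces the style of the figures described above.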
The y-axes represent the time-point-wise MSE (upper row), bias (middle
row) or variance (lower row). The x-axes represent the time point. The
plots are generated from the analysis results based on the simulation
scenario 1. The plots in each column (1–3) are generated from the
analysis results based on different sample sizes (1, 10 and
100).
The y-axes represent the time-point-wise MSE (upper row), bias (middle
row) or variance (lower row). The x-axes represent the time point. The
plots are generated from the analysis results based on the simulation
scenario 2. The plots in each column (1–3) are generated from the
analysis results based on different sample sizes (1, 10 and
100).
The y-axes represent the time-point-wise MSE (upper row), bias (middle
row) or variance (lower row). The x-axes represent the time point. The
plots are generated from the analysis results based on the simulation
scenario 3. The plots in each column (1–3) are generated from the
analysis results based on different sample sizes (1, 10 and
100).
The y-axes represent the time-point-wise MSE (upper row), bias (middle
row) or variance (lower row). The x-axes represent the time point. The
plots are generated from the analysis results based on the simulation
scenario 4. The plots in each column (1–3) are generated from the
analysis results based on different sample sizes (1, 10 and
100).
Applications
For applications, we consider different univariate analysis scenarios for the
well-known Pima Indian diabetes data [22]. The data set contains a
binary variable for the indication of diabetes and three continuous variables
for the plasma glucose concentration at 2 hours in an oral glucose tolerance
test (OGTT), BMI and age (other variables in this data set are not considered in
our study). Our proposed method allows us to detect the multiple change-points
for diabetes indication vs. OGTT, BMI or age (analysis for a binary response),
and also OGTT vs. BMI or age (analysis for a continuous response). To reduce the
computation burden, the observed OGTT values are rounded to the nearest 5 units
and the observed BMI values are rounded to the nearest 1 unit when any of these
two variables is considered as a “time” variable.
The significance level is chosen from a finite set of candidate values
to minimize the leave-one-out cross-validation (LOO-CV)
prediction error. Two-tailed tests are used so that no monotonic changes are
assumed. Again, we compare three algorithms: the global optimization algorithm
(our dynamic programming algorithm) and two approximation algorithms (the
recursive combination algorithm and the recursive partition algorithm).
For each analysis scenario and each algorithm, Table 1 gives the “optimized”
LOO-CV error and the corresponding significance level. The global
optimization algorithm always chooses a highly significant (small)
level, while the approximation algorithms sometimes choose a
relatively large level. In terms of
prediction error, the global optimization algorithm achieves the best
performance in four out of five scenarios. Although the approximation algorithms
give the best prediction error for the analysis of OGTT vs. BMI, the global
optimization algorithm gives only a slightly worse prediction error.
Figure 10 shows the
identified change-points for all five analysis scenarios and all three
algorithms. The global optimization algorithm always gives stable change
patterns, while the approximation algorithms sometimes give abrupt drops or
jumps. The change patterns fitted by the global optimization algorithm are all
increasing, which is practically meaningful. For example, the analysis result
for diabetes indication vs. OGTT suggests significant increases in the risk of
diabetes when OGTT values are increased to 100, 130 and 160; the analysis
result for diabetes indication vs. BMI suggests significant increases in the risk of
diabetes when BMI values are increased to 23 and 32; and the analysis result
for diabetes indication vs. age suggests significant increases in the risk of
diabetes when age values are 25 and 32. Since OGTT is
an important predictor for diabetes, it is also interesting to understand its
changes over different BMI or age intervals. Figure 10 shows increasing patterns for OGTT
vs. BMI and OGTT vs. age. The change-points identified by the global
optimization algorithm are 25 and 40 for OGTT vs. BMI, and 27 and 48 for
OGTT vs. age.
The plots in each row (1–5) are generated from the results based on different analysis scenarios (as shown in the axis labels). The plots in each column (1–3) are generated from the results based on different algorithms (DP, RC and RP). In each plot, the black solid curve represents the estimated proportions/means and the black dotted curves represent the estimated 95% confidence intervals. The gray solid curve represents the estimates based only on the observations at each time point. The gray dots represent the observed data.
Discussion
The advantage of our proposed method is that the maximum likelihood estimation can be
achieved during the partition of a time course. Furthermore, the method is simple
and the interpretation of estimation results is clear. To the best of our knowledge, the
modified dynamic programming algorithm proposed in this study is novel. Although the
algorithm requires more computing time than does the traditional dynamic programming
algorithm, it is still practically feasible with the computing power currently available to
general scientists. We have demonstrated the use of our algorithm for normal and
binary response variables. It is also feasible to modify the algorithm for other
types of response variables. Furthermore, it is interesting to explore whether there
are better approaches to the choice of the significance level. These research topics
will be pursued in the near future.
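For context, the traditional dynamic programming algorithm mentioned above (classical least-squares segmentation with a fixed number of blocks, in the spirit of [7]; this is not the modified algorithm proposed in this paper) can be sketched as follows, with illustrative names:

```python
import numpy as np

def segment_dp(y, k):
    """Classical DP for optimal least-squares segmentation into k blocks.

    Partitions y[0..n-1] into k consecutive blocks minimizing the total
    within-block sum of squared deviations from the block means.
    Returns (cost, boundaries), where boundaries lists the start indices
    of blocks 2..k.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give each block cost in O(1)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):  # SSE of y[i:j] around its own mean
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    INF = float("inf")
    # dp[b][j] = best cost of splitting the first j points into b blocks
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = dp[b - 1][i] + cost(i, j)
                if c < dp[b][j]:
                    dp[b][j] = c
                    back[b][j] = i
    # Recover the block start indices by backtracking
    bounds, j = [], n
    for b in range(k, 0, -1):
        j = back[b][j]
        bounds.append(j)
    bounds.reverse()
    return dp[k][n], bounds[1:]  # drop the leading start index 0
```

The modification in this paper adds significance constraints between adjacent blocks, which is what increases the computing time relative to this classical scheme.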
One disadvantage of our method, which is actually a common disadvantage for general nonparametric methods, is that a relatively large sample size is required in order to achieve a satisfactory detection power. (This is consistent with the results in Figure 2.) For our method, we would require a relatively long time course, or a relatively large number of observations at each time point. In our simulation and application studies, we choose to analyze the data sets with relatively long time courses since this well illustrates the advantage of our method.
Our method may also be useful for the current rich collection of genomics data. For example, we can apply the method to array-based comparative genomic hybridization (aCGH) data and identify chromosomal aberration/alteration regions [6]; we can also apply the method to certain time-course microarray gene expression data and cluster different genes based on their changing patterns across the time course. However, a powerful computer workstation/cluster may be necessary due to the relatively high computing burden of our method.
Acknowledgments
I am grateful to Drs. Joseph Gastwirth, John Lachin, Paul Albert, Hosam Mahmoud, Tapan Nayak, and Qing Pan for their helpful comments and suggestions. The R-functions implementing our method and the illustrative examples are available at: http://home.gwu.edu/~ylai/research/FlexStepReg.
Author Contributions
Conceived and designed the experiments: YL. Performed the experiments: YL. Analyzed the data: YL. Contributed reagents/materials/analysis tools: YL. Wrote the paper: YL.
References
- 1. Siegmund D, Venkatraman ES (1995) Using the generalized likelihood ratio statistic for sequential detection of a change-point. The Annals of Statistics 23: 255–271.
- 2. Braun JV, Muller HG (1998) Statistical methods for DNA sequence segmentation. Statistical Science 13: 142–162.
- 3. Silvapulle MJ, Sen PK (2004) Constrained Statistical Inference. Hoboken, New Jersey, USA: John Wiley & Sons, Inc.
- 4. Schell MJ, Singh B (1997) The reduced monotonic regression model. Journal of the American Statistical Association 92: 128–135.
- 5. Olshen AB, Venkatraman ES, Lucito R, Wigler M (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557–572.
- 6. Venkatraman ES, Olshen AB (2007) A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657–663.
- 7. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, 2nd edition. MIT Press & McGraw-Hill.
- 8. Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, et al. (2003) CGH-plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 19: 1714–1715.
- 9. Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ (2005) A statistical approach for array CGH data analysis. BMC Bioinformatics 6: 27.
- 10. Price TS, Regan R, Mott R, Hedman A, Honey B, et al. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Research 33: 3455–3464.
- 11. Liu J, Ranka S, Kahveci T (2007) Markers improve clustering of CGH data. Bioinformatics 23: 450–457.
- 12. Picard F, Robin S, Lebarbier E, Daudin JJ (2007) A segmentation/clustering model for the analysis of array CGH data. Biometrics 63: 758–766.
- 13. Autio R, Saarela M, Jarvinen AK, Hautaniemi S, Astola J (2009) Advanced analysis and visualization of gene copy number and expression data. BMC Bioinformatics 10(Suppl 1): S70.
- 14. Dudoit S, Shaffer JP, Boldrick JC (2003) Multiple hypothesis testing in microarray experiments. Statistical Science 18: 71–103.
- 15. Robertson T, Wright FT, Dykstra RL (1988) Order Restricted Statistical Inference. New York: John Wiley & Sons, Inc.
- 16. Groeneboom P, Jongbloed G (1995) Isotonic estimation and rates of convergence in Wicksell's problem. The Annals of Statistics 23: 1518–1542.
- 17. Wu WB, Woodroofe M, Mentz G (2001) Isotonic regression: Another look at the changepoint problem. Biometrika 88: 793–804.
- 18. Good P (2005) Permutation, Parametric and Bootstrap Tests of Hypotheses. Springer.
- 19. Efron B, Tibshirani RJ (1993) An Introduction to the Bootstrap. Chapman & Hall/CRC.
- 20. Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74: 829–836.
- 21. Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- 22. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care. IEEE Computer Society Press. pp. 261–265.