Quantile estimation of semiparametric model with time-varying coefficients for panel count data

Panel count data frequently occurs in follow-up studies, such as medical research, social sciences, reliability studies, and tumorigenicity experiences. This type data has been extensively studied by various statistical models with time-invariant regression coefficients. However, the assumption of invariant coefficients may be violated in some reality, and the temporal covariate effects would be of great interest in research studies. This motivates us to consider a more flexible time-varying coefficient model. For statistical inference of the unknown functions, the quantile regression approach based on the B-spline approximation is developed. Asymptotic results on the convergence of the estimators are provided. Some simulation studies are presented to assess the finite-sample performance of the estimators. Finally, two applications of bladder cancer data and US flight delay data are analyzed by the proposed method.


Introduction
In longitudinal follow-up studies, panel count data is frequently encountered in many fields such as medical research, social sciences, reliability studies, and tumorigenicity experiences, which has been widely analyzed by many authors. This type data is usually collected from the discrete observations in recurrent event process, as the continuous observations might be too expensive to be carried out. Thus, we can only obtain the cumulative occurrence numbers of the events of interest at these discrete observation times.
For the analysis of panel count data, [1,2] developed the regression analysis approaches to the panel count data model. [3] studied the clustered mixed nonhomogeneous Poisson models of panel count data. [4] considered the spline-based likelihood estimation of the proportional mean model. To describe the potential correlations of the recurrent event process, [5][6][7] developed some joint models of panel count data by employing some frailty parameters to discuss these correlations. Recently, semiparametric transformation models with informative observation times were studied by many authors, such as [8][9][10]. More comprehensive introductions about this type data can be referred to the book of [11]. In general, the existing approaches in modeling panel count data are based on the timeinvariant coefficients assumption, but which may be violated in practice. In some applications, coefficients may be time-varying, and sometimes it is more vital to detect the temporal impacts on the recurrent event process. For example, in medical studies, we are interested in detecting the temporal impacts of one new drug. Recently, [12,13] proposed the varying coefficient models for recurrent events. However, the analysis of panel count data with varying coefficients is very limited. Most recently, [14] proposed a partially varying coefficient model of panel count data to account for the nonlinear interactions between covariates. [15] proposed a nonparametric proportional mean model of the panel count data with time-varying coefficients.
Quantile regression is widely used in the analysis of longitudinal data. It can provide more information about the distribution shape of the response and can be used to measure the effect of variables under different percentiles of the distribution. However, quantile regression methodologies for the panel count data are lagging. As the discreteness of the panel count data, quantile regression cannot be directly used. For the first, a smoothing technique ("jitter") is used for this type data, then the quantile regression can be applied to the smooth data.
In this paper, a semiparametric time-varying coefficient model is formulated. For the inference of the unknown functions, quantile regression method is used for the panel count data, with the unknown functions approximated by the B-spline basis functions. Furthermore, the asymptotic results on the convergence of the estimators are established as well. The main contribution of the paper is that we propose a new spline-based quantile estimation procedure for the time-varying coefficient panel count data model, which has not been discussed in the literature.

Model specification
Suppose that n independent subjects are observed over time. N i (t) denotes the cumulative total number of recurrent event occurring at or before time t for subject i.H i ðtÞ is a counting process with jumps at the discrete observation times, t i,1 < t i,2 < � � �. We assume that t is in a fix interval < of finite length. Besides, two follow-up times are existed: the potential censoring time C � i and the observation endpoint T i . Thus, only i is assumed to be independent with N i (t) andH i ðtÞ. Let H i ðtÞ ¼H i fminðt; C i Þg denote the real observation process of subject i, and m i ¼H i ðC i Þ, i = 1, � � �, n. Then, N i (t) can be only acquired at the time points where H i (t) jumps. The total number of the observations is defined as m ¼ P n i¼1 m i . Let Z i be a p × 1 vector of covariates. So we can have the independent and identically distributed dataset To describe the possible time-varying effects of covariates on N i (t), the time-varying coefficient model is proposed as follows.
• (1) Given Z i , the conditional mean function of N i (t) is where λ 0 (u) is an unspecified smooth baseline intensity function, and β(u) is an unknown p × 1 vector of time-varying regression coefficients.
• (2) Conditional on Z i , fC i ; N i ðtÞ;H i ðtÞg are mutually independent.
For the model defined above, [15] developed the likelihood and pseudo-likelihood methods to get the estimation of the baseline intensity function λ 0 (u) and the varying coefficient functions β(u) based on the Poisson distribution assumption on N i (t). However, no distribution assumption is specified in this paper and the existed methods cannot be used. In the next section, the spline-based quantile regression is proposed to acquire the estimation of the unknown functions. In the first step, the unknown baseline intensity function and the coefficients are approximated by B-splines. And then, the discrete panel count data become continuous by a smoothing technique. Quantile regression is developed for the inference in the last step.

Estimation procedure
For the inference of Eq (1), the model can be rewritten as,

Approximations of baseline and varying coefficients
Similar as [16], we use the basis expansion method to get the estimation of the unknown functions in this paper. Suppose η k (u), k = 1, 2, � � �, p + 1, can be approximated by a basis expansion, that is where B k ðuÞ ¼ fB k1 ðuÞ; . . . ; B kL k ðuÞg > are basis functions, g k ¼ ðg k1 ; . . . ; g kL k Þ > and L k is the number of basis functions. Various basis functions can be used in the expansion such as Fourier basis functions, polynomial basis functions and B-spline functions. In this paper, the Bspline basis is selected in the estimation procedure for calculation simplicity. The tuning parameter L k is selected by L k = n k + q k + 1, where n k is the number of interior knots and q k is the degree of the B-spline functions. The interior knots of the splines are equally spaced or placed on the sample quantiles of the data in all simulations and applications. The tuning parameter L k may be different for different k. In this paper, we assume that L k = L and q k = q for all η k (u). Thus, we define B k (u) = B(u) for simplicity presentation.

Quantile regression
As quantile regression is a good alternative to the conditional mean models, the quantile regression is considered for the panel count data model. However, quantile regression cannot be directly used as the discreteness of the data N i (t). According to the method developed in [17], the "jitter" method is applied to construct continuous random variables. By adding U ij , which is generated from a [0, 1) uniform distribution, we can have where the noise U ij is independent of N i (t ij ) and Z i . The uniform distribution is used because it allows computational simplifications. The uniform noise, however, is by no means a necessity to jitter the data. The noise may be generated by any continuous distribution with support on [0, 1). Thus, we can get the continuous data N � i ðt ij Þ and there exists a one-to-one link between the quantiles of N i (t ij ) and N � i ðt ij Þ. The regression model of N � i can be written as where � ij are assumed to be independent of t ij with unknown cumulative distribution function (cdf) G(�) and density function g(�). Besides, the τ-th conditional quantile of � ij is b τ . The quantile regression loss function is defined as ρ τ (u) = u[τ − I(u < 0)], τ 2 (0, 1). Then the quantile regression is applied on the smooth data N � i ðt ij Þ to obtain the estimation of the unknown parameters. Thus, the unknown parameters ϕ = (γ > , b τ ) > can be estimated by minimizing the following objective function C(ϕ), that is For the ease of calculation, Gauss-Legendre formula is used to approximate the integral. Thus, we have where ω s is the Gauss coefficient, S is the number of the Gauss points and Δ s is the Gauss point. The Gauss-Legendre approximation of the objective function C(ϕ) can be defined as and the baseline intensity function of λ 0 (u) can be obtained bŷ Next, we discuss how to select the tuning parameter L and the Gauss point number S. As proposed by [16], we use the leave-one-subject-out cross-validation (CV) to choose L and S.
Letĝ ðÀ iÞ andb ðÀ iÞ t denote the estimators from the data with the i-th subject deleted. So the leave-one-subject-out CV can be written as : Thus, the tuning parameter L and S can be selected as CVðL; SÞ: The number L k of the basis expansion of β k may be different from each other. However, we assume L k = L for all k, for simplicity.

Asymptotic results
The asymptotic results are concluded in this section. Before presenting the results, some regularity conditions are introduced for the first.
where λ min (Γ) and λ max (Γ) denote the smallest and the largest eigenvalues of Γ.
Under these above regularity conditions, the asymptotic results on the convergence of the estimators are displayed in the following theory. For the need of the proofs, a lemma of spline function of [18] is presented. First, define g kl B l ðuÞ; ðg k1 ; � � � ; g kL Þ 2 R L g: Let S kn be the space of splines of degree q consisting of functions η kn satisfying: (i) the function η kn to each subinterval is a polynomial spline of degree q; (ii) for q � 1 and 0 � q 0 � q, η kn is q 0 times continuously differentiable on the support. Besides, η k is assumed to satisfy the following regularity condition. Let l 1 2 [0, q] be a nonnegative integer. The l 1 -th derivative, denoted as Z ðl 1 Þ k , exists and satisfies the Lipschitz condition of order v 2 (0, 1] such that ρ = l 1 + v > 0.5 and jZ ðl 1 Þ k ðsÞ À Z ðl 1 Þ k ðtÞj � djs À tj v , for s, t 2 [0, C], where δ is a positive constant. k logl 0 ðuÞ À log l 0 ðuÞ k 2 ¼ O p fðL=mÞ 1=2 g: Ignoring the approximation error in the B-spline basis approximation of β k (u), k = 1, � � �, p, we can have the 100(1 − α)% pointwise confidence interval of β k (u) under quantile τ, b k ðuÞ � z 2=a ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi covfb k ðuÞg q ; where z 2/α is the 100(1 − α)% percentile of the standard normal distribution and covfb k ðuÞg ¼ BðuÞ > covðĝ k ÞBðuÞ. Similar as the baseline function λ 0 (u).

Simulation studies
Three simulation studies are carried out to evaluate the performance of the method developed in this paper. We generated 200 datasets from the time-varying coefficient model, each of size n = 100 or 200 independent subjects. For each subject i, the endpoint of observation T i is assumed to be 6 and the censoring time C � i follows the uniform distribution of [T i /2, 3T i /2]. The number of observation times m i is generated from a discrete uniform distribution {1, 2, 3, 4, 5}. And the observed event times, ft i1 ; . . . ; t im i g, are the order statistics of a random sample size m i from the uniform distribution over (0, C i ). Given m i and ft i1 ; . . . ; t im i g, the panel count data N i (t ij ) can be obtained by the following formula for j = 1, � � �m i and i = 1, � � �n. N � i ½l N ðt ij Þ� is the random number generated from the Poisson distribution with mean Z t ij 0 l 0 ðuÞexpfbðuÞ > Z i gdu: The following three cases are considered: • Case I: p = 1 and the covariate Z i is generated independently from the [0, 1] uniform distribution. The baseline function is taken as λ 0 (u) = 2u + 1 and the varying coefficient β(u) = sin (−πu/6).

Bladder cancer data
Bladder cancer data was collected by the Veterans Administration Cooperative Urological Research Group. In this study, 85 patients were randomly assigned to two treatment groups: placebo group (47) and thiotepa group (38). For each patient, the observation times and the cumulative numbers of the bladder tumors that occurring at or before the observation times are recorded. The observation endpoint is 53 month. What's more, the initial number of the bladder tumors and the largest initial tumor size for each patient are also recorded. In the literature, the dataset has been discussed by many authors such as [5,7,19]. However, time-varying coefficient panel count data model is not considered for this dataset. In order to describe the temporal impacts of the covariates on the bladder cancer data, the time-varying coefficient model proposed in this paper is applied to these data. For each patient i, N i (t) is denoted as the cumulative bladder tumors number occurring up to time t, and H i (t) is denoted as the cumulative observation number up to time t, i = 1, � � �, 85. Furthermore, let Z i1 = 1 if the patient i is belonged to the thiotepa group and Z i1 = 0 otherwise. Z i2 is denoted as the initial tumor number and Z i3 is the natural logarithm of the largest initial tumour size plus 1 for each patient i. Therefore, we have the model Then quantile regression estimation is applied to this data. 100 samples are drawn from the data every time and 200 times are repeated in the estimation. Similar to the numerical studies, the unknown functions λ 0 (t) and β k (t), k = 1, 2, 3 are approximated by Cubic B-spline functions. The estimation is implemented under quantiles τ 2 {0.25, 0.5, 0.75}.
The estimation curves of log λ 0 (t) and β k (t), k = 1, 2, 3 are displayed in Fig 4. In general, the thiotepa treatment and the tumor recurrence rate are negatively correlated at different quantiles. Patients in the thiotepa group tend to have less tumor recurrence rate than those in the placebo group. The initial tumor number is positively correlated with the recurrence rate and the largest initial tumor size is negatively correlated with the recurrence rate. These above conclusions are consistent with [19]. Furthermore, we can find the covariates impacts are varying during the observation time and the impacts are different at different quantiles. Thus, more information can be obtained from the quantile regression of the time-varying coefficient panel count data model than the other analysis in the existing literature.

US flight delay data
In In order to describe the temporal covariates impacts on the flight delays, the time-varying coefficient model proposed in this paper is used to these data. For each flight i, N i (t) is denoted as the cumulative flight delay number that had occurred up to time t, H i (t) is denoted as the

PLOS ONE
cumulative observation number up to time t, i = 1, � � �, 9794. Furthermore, we define Z i1 as the average time of the departure delay and Z i2 as the average distance of the flight i. Therefore, we have the model Then spline-based quantile estimation is applied to this data. Similarly, the unknown functions λ 0 (t) and β k (t), k = 1, 2 are also approximated by Cubic B-spline functions. The estimation is implemented under quantiles τ 2 {0.25, 0.5, 0.75}.
As the sample size of the dataset is large, it is time-consuming or even not possible to read the entire dataset in practice due to the limited memory. Besides, the direct analysis can be infeasible, mainly due to the computing memory or computing time. In order to deal with the massive data, parallel computing method is developed by [20,21]. In parallel computing method, we split the original dataset into a family of disjoint sub-sample blocks with equal size for the first. More precisely, the data structure can be defined as the following form: where the original dataset S is of size n = K × m which is partitioned to K subsample blocks S k each consist m samples which are randomly picked up from the dataset S.
Thus, the estimation procedure proposed can be implemented for every block S k , k = 1, � � �, K and the estimated values of unknown parameters for each block S k is denoted as fẐðtÞg k . Similar to the method introduced in [21], the final full-sample estimators can be generated bŷ

PLOS ONE
The estimation curves of log λ 0 (t) and β k (t), k = 1, 2, under different quantiles τ 2 {0.25, 0.5, 0.75} are displayed in Fig 5. From Fig 5, we can find that the departure delay time is positively correlated with the cumulative flight delay numbers. Besides, the impact of the departure delay time is varying over the time under different quantiles and the impact is different at different quantiles. However, the effect of the flight distance is not significant on the flight delay numbers.

Concluding remarks
In this paper, we proposed a spline-based quantile regression estimation method in the timevarying coefficient panel count data model. This model discussed in our paper is more general than [15], with no Poisson restriction on the recurrent event process. To get the estimations, B-splines are used to approximate the unknown functions log λ 0 (t) and β(t) for the first, and then a smoothing technique is applied to obtain the continuation of the discrete panel count data. Finally, the spline-based quantile regression approach is developed at different quantiles. Some simulations are presented to evaluate the performance of the proposed approach and two applications are analyzed to demonstrate its effectiveness in this paper.
Recently, the Enron e-mail corpus which was a massive set of the e-mail messages, have been discussed by many authors, such as [22]. If we are interested in the number of interactions of all pairs of individuals in this longitudinal observations, as usual in network analysis, the snapshots are applied to model this longitudinal networks, then, this is a standard panel count dataset with massive observations. Furthermore, in this paper, we only considered the situation with low dimensional covariates, which may be not unpracticable in the applications. As the high-dimensional covariates may be existed, variable selection methods can be considered for the time-varying coefficient model. This will be an important topic for our further studies. Besides, reliability data and traffic data have been studied by many authors, such as [23][24][25][26]. This will be interesting to study the quantile regression estimation of such data.