Forecasting intermittent and sparse time series: A unified probabilistic framework via deep renewal processes

Intermittency are a common and challenging problem in demand forecasting. We introduce a new, unified framework for building probabilistic forecasting models for intermittent demand time series, which incorporates and allows to generalize existing methods in several directions. Our framework is based on extensions of well-established model-based methods to discrete-time renewal processes, which can parsimoniously account for patterns such as aging, clustering and quasi-periodicity in demand arrivals. The connection to discrete-time renewal processes allows not only for a principled extension of Croston-type models, but additionally for a natural inclusion of neural network based models—by replacing exponential smoothing with a recurrent neural network. We also demonstrate that modeling continuous-time demand arrivals, i.e., with a temporal point process, is possible via a trivial extension of our framework. This leads to more flexible modeling in scenarios where data of individual purchase orders are directly available with granular timestamps. Complementing this theoretical advancement, we demonstrate the efficacy of our framework for forecasting practice via an extensive empirical study on standard intermittent demand data sets, in which we report predictive accuracy in a variety of scenarios.


Introduction
Intermittent demand forecasting (IDF) is concerned with demand data where demand appears sporadically in time [1][2][3][4], i.e., long runs of zero demand are observed before periods with nonzero demand. Not only does this sparsity render most standard forecasting techniques impractical; it leads to challenges on measuring forecast accuracy [5], model selection [6], and forecast aggregation [7].
Demand for large shares of inventory catalogues in manufacturing are well known to exhibit intermittency [8,9]. Intermittent demand is most likely to appear with slow-moving, (sometimes) high-value items that are critical to production processes. For example, spare parts in aerospace and defense are well known to exhibit intermittent patterns [10,11]. Therefore, precise estimates of forecast uncertainty, e.g., with access to forecast distributions, are vital for IDF. IDF was recognized as a unique and challenging problem in the early 70s [1,[12][13][14][15]. Upon recognizing that traditional approaches such as simple exponential smoothing (SES) led to poor forecasts in slow-moving inventory, Croston [1] proposed to independently apply exponential smoothing to consecutive positive demand sizes, and to the number of periods between each (i.e., interarrival or interdemand times). Towards developing forecast distributions and uncertainty estimates, statistical models underlying Croston's method were proposed by [16][17][18][19].
In this paper, we make a sequence of observations on IDF methods proposed so far. We start working from Croston's original insight, towards a consistent set of flexible intermittent demand models. We explore extensions of these models in several directions, which are natural consequences of our new framework to systematically tackle IDF. Our paper builds on early work in [20], expanding on theoretical concepts and adding a thorough empirical analysis. We summarize our observations and contributions below.
• We draw previously unrecognized connections between existing IDF models and renewal processes. We note that these two subjects of applied statistics both deal with temporal sparsity and have been developed for planning spare parts inventories.
• We introduce a flexible set of discrete-time renewal process models for stationary intermittent demand. We illustrate that these models are able to capture patterns such as temporal clustering, aging, and quasi-periodicity of demand.
• We cast Croston-like models as instances of discrete-time conditional (self-modulating) renewal processes. Through this observation, we can introduce recurrent neural networks to recover more flexible ways in which renewal processes are modulated.
• We observe that our construction applies in contexts in which individual demand events are observed in continuous time. We make connections to temporal point processes and their neural variants and apply these models in IDF.
• We evaluate our approach on both synthetic and real-world data sets. The experiments on synthetic data illustrate qualitatively and quantitatively how our proposed models can capture complications such as alternating periods better than prior methods. For the experiments on real-world data set, we report results on Car Parts, Auto, RAF, and M5 competition data sets, as well as a newly introduced dataset from the UCI repository. While our results are mixed, we can conclude that using probabilistic neural networks is a promising direction for further exploration, in particular for the data set from the M5 competition.
Our paper consists of four sections. In Section 2, we start with a survey of literature on intermittent demand forecasting, with a focus on machine learning-driven methodologies. In the subsequent Section 3, we introduce preliminary statistical and algorithmic concepts such as renewal processes, temporal point processes, and recurrent neural networks. In Section 4, we introduce our unified framework-which we test on a wide array of synthetic and real data experiments in the Section 5.
Croston [1] observed that more accurate point forecasts of intermittent demand resulted from performing exponential smoothing separately on sequences of demand sizes and inter demand intervals. Apart from providing a forecast heuristic, he introduced a set of models which, he posited, would lead to the forecast equations he provided. Later developed and corrected [21, 22], Croston's method has been the most common IDF method for stockists and forecasters [23], and the de facto standard in forecasting software and libraries. Among early extensions to Croston's method, Schultz [13] suggested applying different smoothing constants to interarrival times and demand. Willemain et al. [15] verified results by [1], who also illustrated that interdemand times exhibit autocorrelation by statistical analysis on IDF data.
Dunsmuir and Snyder [14] provided an explicit model for inventory control in an intermittent demand scenario. Notably, they took stochastic delivery lead times of replenishment to be gamma random variables, modeling replenishment in continuous time. Johnston and Boylan [16] pointed out that Croston's method did not offer an estimate of the variability in demand. They proposed separate continuous-time models, i.e., temporal point processes, for demand sizes and order arrivals. Notably, they assumed the order arrivals to follow a Poisson process, building on Cox's results from renewal theory and developing estimates of demand mean and variance. They also explored different order size distributions under Poissonian order arrivals. Syntetos [24] gave a survey of IDF in his thesis, along with error approximations in IDF models and some extensions for inventory control. Importantly, he pointed out the "inversion bias" in Croston's original forecast estimates and provided a revised estimator [22]. Syntetos later also gave a simpler approximant apart from reviewing questions around forecast accuracy measures in IDF [25]. A similarly modified IDF estimate was studied by Shale et al. [26], where the authors assumed underlying Poissonian order arrivals. A review of these studies was given in [10]. For the question of how to define intermittency, i.e., which time series constitute intermittent demand we refer to [2,3,[27][28][29].
Another method for generating point forecasts was given by Teunter et al. [30]. The authors applied exponential smoothing directly on the probability of demand, bypassing the inversion bias in Croston's method. A review, as well as a comparative study of point forecast methods on real IDF data was presented by Babai et al. [31], with no strong evidence in favor of any of the forecast functions considered.
The problem of associating a stochastic model to Croston's method was explored by Snyder [17]. He proposed several modifications to the method and considered intermittent demand forecasts via the parametric bootstrap on a set of state-space models (SSM). Notably, this work draws connections between the renewal-type model of Croston [1, Appx. B] and single source of error SSMs.
Shenstone and Hyndman [18] investigated the validity of Croston's methods on several fronts, pointing out that the i.i.d. assumption on order sizes and interarrival times were inconsistent with the exponentially weighted moving average (EWMA) forecast estimates. The authors explored a set of "modified" models that would yield Croston's estimates, assuming both sizes and intervals follow an ARIMA(0,1,1) process. To ensure nonnegativity, required of both sizes and intervals, they proposed log-Croston models where an ARIMA(0,1,1) process determined the required quantities in the log domain. Importantly, the authors argue that there exist no practicable models that yield Croston's estimates as optimal forecasts. This is since any local-level model that yields the EWMA as an unbiased one-step-ahead forecast, and is defined on the positive real half-line, suffers from a convergence problem [32, 33], i.e., draws from the model converge to 0 over long terms. Let us note that the question of a stochastic model for IDF was also raised in [34,35]. An extensive discussion on model-based IDF is given in [36, Ch. 16].
Snyder et al. [19] proposed an extended set of models, contrasting several ways in which stochastic Croston-like models were considered in the literature. They compared Poisson, negative binomial and zero-inflated Poisson distributions for demand sizes. For all distributions, they tested "static" (constant) parameters as well as undamped (IMA) and damped (AR) dynamic parameterizations. The proposed class of "dynamic" models are simple and are shown to outperform Croston's method and static parameter models. In this paper, we connect previous model-based approaches in [18,19] to the rich theory behind renewal processes. As such, for the first time, we can derive Croston's methods and other proposed models from a principled probabilistic framework.
SSM approaches were considered and compared by Yelland [37], who also later proposed a hierarchical Bayesian treatment of the problem [38]. Seeger et al. [8] considered Bayesian SSM with multiple sources of error, focusing on applicability on very large inventory catalogues (see also [39] for more details). Their method incorporates exogenous features, and presents an efficient approximate inference algorithm enabling large-scale and distributed learning. Altay et al. [40] explored an adaptation of Croston's method with Holt-like trends. Seasonality in IDF, with a comparison of Holt-Winters and SARIMA, was considered in [41]. Recently, a single source of error SSM with multiplicative errors was explored by Svetunkov and Boylan [42]. A novel method for forecast aggregation via temporal hierarchies, THieF, was introduced by Kourentzes and Athanasopoulos [43].
Using machine learning methods-mainly, neural networks-in forecasting is both a recently growing area of research and the subject of ongoing debate [44,45]. In their extensive comparison of machine learning methods to traditional "statistical" methods, Makridakis et al. [46] find scarce evidence in favor of using neural networks or other ML-based methods in forecasting. The authors argue that most empirical evidence in forecasting favors "modeldriven" methods over "data-driven," in the terms of Januschowski et al. [47]. Moreover, datadriven methods are often harder to train and replicate; and require substantially greater effort for implementation and computation. The findings from this study are controversial to say the least as practical evidence from industrial researcher has consistently pointed to contrary conclusions (see e.g., [48][49][50][51]). Supporting this point of view are the results from highly visible M4 competition [52] which was won by a hybrid method that combined exponential smoothing, recurrent neural networks, and ensembling [53]. Other such hybrid approaches have appeared recently, mainly in the machine learning literature [48,[54][55][56][57], along with software libraries for neural network time series forecasting [58]. A thorough review of the use of neural networks in forecasting was recently given by Benidis et al. [59].
Several studies have considered neural networks in the context of IDF. Gutierrez et al. [60] experimented with a single hidden layer feedforward network, with only three hidden units, trained to predict the demand on the next time interval, given time since last demand and the last observed demand size. They reported favorable results compared to Croston and Syntetos-Boylan methods. These results were debated by Kourentzes [61], who compared several neural network architectures to a suite of naive and Croston-based forecasters and demonstrated low forecast accuracy in terms of mean absolute error (MAE). However, the author also found favorable performance by neural networks when inventory control metrics were considered directly. Mukhopadhyay et al. [62] experimented with several training strategies to find that neural networks outperform traditional methods. Recurrent neural networks were considered in [63], and several extensions were explored in [64]. A deep LSTM, in the context of IDF, appeared recently in [65]. Our paper allows combining neural networks with a model-based approach, a first in the literature for intermittent demand forecasting. As a consequence, the neural forecasting methods we present here are the first dedicated neural network-based models for IDF which yield probabilistic forecasts.
Finally, we remark that IDF requires special attention when evaluating forecasting accuracy as standard forecast accuracy metrics are well known to fail in this context. Hyndman [5] proposed MASE, mean absolute error scaled against a naive-one forecast, for the IDF task. Kim and Kim [66] advanced this notion to a reportedly more robust metric, employing trigonometric functions for comparing forecasts and ground truth. Although we used both of these metrics in our evaluations, we found both approaches to be potentially misleading, as a "zero forecast" often yielded better results in terms of MASE or MAAPE than any other method. As our methods are primarily geared towards probabilistic forecasting, we use P50 and P90 loss metrics as in [48]. Further, we also report root mean squared error (RMSE) and root mean squared scaled error (RMSSE), the main error metric used in the recent M5 competition [67].

Problem setup
We consider univariate, nonnegative, integer-valued time series, corresponding to random variables denoted fY n g n2f1;2;���g ; Y n 2 N. Here, n indexes uniformly spaced intervals in time, each corresponding to a demand review period. Realizations {y n } of {Y n } will typically contain long runs of zeros, i.e., only rarely will y n be greater than zero.
For a clearer exposition of our models, we will mostly use the size-interval notation for intermittent demand. As many Y n are zero, it will be useful to define an auxiliary index i 2 N, such that i indexes the issue points, i.e., periods where nonzero demand is observed. More formally, we define the one to one map sðiÞ ¼ minfn j P n m¼1 ⟦Y m > 0⟧ � ig, where ⟦.⟧ is the indicator function (Iverson bracket). We denote interdemand times-number of periods between issue points-Q i . That is, Q i = σ(i) − σ(i − 1), taking σ(0) = 0. We denote demand sizes for issue points M i = Y σ(i) . Random variables M i , Q i , both defined on positive integers, fully determine Y n and vice versa.
We use corresponding lowercase Latin letters to denote instantiations of random variables denoted in uppercase Latin. E½�� and V½�� denote mathematical expectation and variance respectively. Unless otherwise specified, we keep indices i, n analogous to their function in this introduction, indexing issue periods and review periods respectively. Range indexes will denote consecutive subsequences of a random process or realization, e.g., Y 1:k = {Y 1 , Y 2 , � � �, Y k }. We reserve N ; G; PO; N B; E to denote Gaussian, geometric, Poisson, negative binomial, and exponential distributions respectively. Unless otherwise noted, these distributions are parameterized in terms of their mean. The support of Poisson, geometric, and negative binomial distributions are shifted, and defined only on positive integers. This less common form of the negative binomial distribution is detailed in the S1 Appendix.
Our main aim here is to characterize, or approximate, forecast distributions. Typically, these are conditional distributions of the form PfY nþ1 jY 1:n ¼ y 1:n g; or PfY nþ1:nþk jY 1:n ¼ y 1:n g: We will letŶ n refer to an estimator available at time n. Often, it will be an estimator of the onestep-ahead conditional mean E½Y nþ1 jY 1:n ¼ y 1:n �.

Croston's method
Here, we briefly elaborate on Croston's method and some of its variants, setting up notation for discussions to follow.
In his paper [1], Croston highlighted an important drawback of SES in IDF. SES forecasts, by placing the highest weight on the most recent observation, would lead to the highest forecasts just after a demand observation, and the lowest just before. Heuristically, Croston proposed to separately run SES on interdemand times and positive demand sizes. Concretely, he That is, he proposed to use the EWMA of each series as the forecast estimate. We will denote this recursive computationQ iþ1 ¼ EWMA a ðQ 1:iþ1 Þ. Moreover, Croston also discussed possible models for intermittent demands, setting The paper also considers {M i } � ARIMA(0, 1, 1) with Gaussian innovations.
The discrepancy between the model of (2) and forecasts (1) has been a common issue in the IDF literature that followed. Assuming the model in (2), unbiased forecasts of M i+1 , Q i+1 are possible simply by setting them to the averages of previous values. Indeed, this is "implied" by the i.i.d. assumption. Instead, however, the forecast heuristic (1) has been praised for its ability to capture serial correlation and nonstationary behavior in both sizes and intervals. This is clearly at odds with the model assumption.
More formally, the EWMA only functions as an asymptotically unbiased estimator of the parameter μ. Moreover, as noted by Syntetos and Boylan in a series of papers [22, 24, 25], the EWMA of previous interdemand times only results in a biased estimate of π, due to an oversight of the inversion bias E½1=Q i � 6 ¼ 1=E½Q i �. In [25], the authors corrected for this bias via a Taylor series approximation about the mean of E½1=Q� and gave an approximation to an asymptotically unbiased forecast,Ŷ Several other variants of forecasts (1) have been explored, e.g., using a simple moving average instead of the EWMA [26, 29]. Both models suggested by Croston allow trajectories with negative and non-integer values. To alleviate this misspecification issue, models with non-Gaussian likelihood have been proposed, e.g., by parameterizing a distribution of positive support with a mean process obeying an IMA model. Nevertheless, such models-namely, nonnegative EWMA models-are "illfated" [32] since their trajectories converge to 0 over longer terms. Shenstone and Hyndman [18] use this result to point out that no models exist that yield Croston's forecast estimates as consistent and unbiased estimators while being immune to the convergence problem outlined here.
Exploring possible ways to account for the model-forecast discrepancy in Croston's method, [17-19, 36] studied numerous alternative probabilistic models. In this paper, we will consider four variations of these models as baseline methods, which we summarize in Table 1. Let us note that, to the best of our knowledge, only the first three models were previously proposed [19, 36].

Renewal processes in discrete time
Renewal processes constitute a central theme in the theory of stochastic processes and their study plays a more general role in probability theory [68,69]. Broadly, renewal processes are concerned with recurring events in repeated trials, where after each such event, trials start identically from scratch. That is, interarrival times between events are independent and identically distributed. For example, in independent spins of a roulette wheel, drawing a certain number is a recurrent event as is observing three consecutive reds.
The example of a roulette wheel may appear unusual, as spins constitute a discrete sequence of outcomes. Renewal processes are mostly introduced as continuous-time counting processes (temporal point processes), where inter-arrival times are i.i.d. [68]. Yet, some of the earlier treatments of renewal theory consider the case of recurrent events in discrete time [70,Ch. 13]. We introduce such processes, focusing our attention on the special case where these recurrent events are simple binary outcomes. Definition 1. (Discrete-time renewal process with Bernoulli trials) Let {Z n } n define a sequence of (not necessarily independent) binary random variables. Let i index time steps where Z n = 1, as above, defined through the map sðiÞ ¼ minfnj Our reuse of notation is not coincidental. It should be clear that, Remark 1. The demand arrival process {Y n > 0} of Static G-Po, and Static G-NB models, as well as Croston's original models, are DTRPs.
In fact, these models rely on the most basic renewal process-the Bernoulli process-as a demand arrival regime.
Renewal processes, as the name implies, were developed in the context of modeling failures in large systems-towards determining when parts of such systems would have to be renewed. The underlying intuition is that every time an identical part is renewed, the random process with which it fails starts anew.
Thinking in terms of renewal processes paves the way to introducing a class of interpretable, parsimonious and flexible models. It also gives access to tools from renewal theory, such as characterizing multi-step ahead forecast distributions with convolutions of interdemand times, or model fitting via moment-matching methods. As we will see, they will also enable an intuitive connection to temporal point processes, introduced below.

Recurrent neural networks
Recurrent neural networks (RNN) rely on a recurring hidden state to model sequential data, and have been widely adopted in time series applications. In contrast to feedforward neural networks, which can be seen as general function approximators, RNNs approximate sequences that (partly) depend on a recurrence relation. See [71,Ch. 10] for an introduction to RNNs. Long short term memory (LSTM) [72] networks, which we use in this paper, define a specific RNN architecture designed to capture long-term interactions.  Concretely, given inputs {x i }, LSTMs-as in all RNNs-obey the recurrence where h i 2 R d comprises both the memory (state) and the output of an LSTM cell. Note that in time series applications, inputs often include past time series values. For example, if y i denotes the time series of interest, The exact functional form of LSTM(.) is composed of several nonlinear projections and compositions of h i−1 and x i , that provide mechanisms for representing "long-term memory." The LSTM network is implemented on virtually all deep learning frameworks such as [73]. For further details on the exact computation and optimization of LSTMs, we refer the reader to the tutorial by Olah [74].
Most previous works in forecasting rely on RNNs to approximate a time series in squared or absolute error. More precisely, the training objective is where Θ denotes the set of all parameters of function LSTM, and g(.) is a suitable projection to the domain of y i , such as an affine transformation from R d to R. This is the case with previous work on IDF [63,64]. A more recent approach in neural forecasting considers RNNs not as forecast functions, but to parameterize forecast distributions [48], or approximate transition and emission dynamics in an SSM [56]. In this case of probabilistic RNNs [75], one assumes where y i is drawn from a suitable probability distribution denoted by D with parameters β. The parameters are then computed from the output of the LSTM, via an appropriate projection to their domains, denoted here by g β . Then, instead of the mean squared error objective, one minimizes the negative log likelihood of the model. In addition to naturally reflecting domain assumptions about y i , such as its support or tailedness, this approach naturally leads to forecast distributions. In the sequel, when we use neural networks in the IDF context, we will rely on this approach. Namely, we will let neural networks parameterize conditional interarrival time distributions in self-modulating renewal processes.

Temporal point processes
Temporal point processes (TPP) are probabilistic models for sets of points on the real line. Often, these points represent discrete (instantaneous) events in continuous time. Examples of discrete event sets are plenty, such as order arrivals in a financial market, anomalous log entries in a computer system, or earthquakes. Demand arrivals in a stock control system can be seen as discrete events, and modeled via a TPP. The rationale for this approach is more intuitive for intermittent demands where instances are rare and the main goal is to determine when they will occur. Croston [1] observed in his original article that intermittent demand data arose since "updating [occurred] at fixed unit time intervals, which are much shorter than the times between successive demands for the product." The same approach was later used as a basis in deriving forecast approximations, for example, [16] based their analysis on an underlying Poisson process.
As noted above, DTRPs have better known, continuous time counterparts which constitute a large subclass of TPPs [76,Ch. 4]. As such, discrete-time IDF models we introduced above have natural continuous time counterparts, including those that benefit from recurrent neural networks for modulating interdemand time distributions [77][78][79][80].
Then again, discrete-time processes with binary trajectories, such as DTRPs as given in Definition 1, have also been referred to as "point" processes. Determinantal point processes constitute an important example, which define random processes for finite-dimensional binary vectors [81,82]. The "points" in this case are the vector elements which equal to 1. Although point processes customarily refer to random point collections on a Borel set, this flexible use of nomenclature further justifies the connection between DTRPs and continuous-time models.
Modeling demand directly with a point process is mostly impractical as exact timestamps of demand events are not observed. Moreover, estimation and inference algorithms are computationally intensive: computation times scale asymptotically in the number of events and not the number of demand review periods. However, in increasingly many scenarios such as in e-commerce, the exact demand time is known and can be used directly. In IDF, the number of events is often in the same order as the number of intervals, making computational costs comparable. In this light, we will explore using TPPs on intermittent demand data, and study how directly addressing demand timestamps affects forecast accuracy.

Discrete-time renewal processes
In previous sections, we discussed how existing IDF models could be cast as renewal processes. We will first follow the obvious extension implied by this generalization, and we will introduce a set of models with more flexible interdemand time distributions. Particularly, we will first consider extending the demand arrival process of Static models given above. In this section, we focus on the rationale for doing so.
Previous IDF models have considered the geometric distribution as a model for the times between successive issue points. Doing so is well-justified, it assumes a homogeneous Bernoulli process for demand arrivals, with the probability of observing a demand constant across time and independent of all other intervals. Yet, this is an obvious oversimplification of real-world arrival processes.
A natural generalization of Static models is to alter the demand interarrival time distribution. This relaxes the strict assumption imposed by homogeneous Bernoulli arrivals. To frame our analysis, we define the hazard rate (cf. the hazard function in continuous-time renewal processes), h(k) for a discrete random variable Q with c.d.f. F as Letting Q denote interarrival times, the interpretation of the hazard rate is intuitive. It stands for the probability of a new demand occurring at time k, given that it hasn't occurred for k − 1 intervals since the previous issue point. It is not hard to show that when Q admits a geometric distribution, the hazard rate is constant and consequently independent of the past. This property of the geometric distribution has been referred to as the memoryless property.
If h(k) is monotonically increasing, the function determines a profile for aging. This is often the case in machine parts, where the probability of a part being renewed increases with the time in service. h(k) can also be decreasing, or imply negative aging. In such demand data, issue points are typically clustered in time. A flexible distribution for Q can also be used to capture quasi-periodic demand, e.g., with a distribution for Q that concentrates around a set period μ.
We replace the interarrival time distribution in Static models with the negative binomial distribution. The negative binomial can flexibly capture aging, clustering, and periodic effects discussed above, as illustrated in Fig 1. Indeed, the geometric distribution is a special case of the negative binomial, confirming our intuition that it determines, in a sense, the most basic renewal process possible. We give further details on the negative binomial distribution in the S1 Appendix.
Building on this insight, we introduce the Static NB-Po and Static NB-NB models, merely by replacing the geometric interarrival time assumption with the negative binomial distribution. As in the rest of this work, we will rely on maximum likelihood parameter estimation and the parametric bootstrap (forward sampling) for forecast estimates of these models. We should note, however, that Static NB-Po and Static NB-NB models yield closed form forecast estimates, which we explore in the S1 Appendix.

Self-modulating discrete-time renewal processes
The renewal process construction highlights a potentially unrealistic assumption, that interarrival times are independent, while they have been shown to exhibit autocorrelation [15]. Many prior models have either attempted to craft a well-defined probabilistic model for nonstationary and autocorrelated demand arrival regimes, or to encode these relationships via SES-like heuristics [18,36]. Two examples, EWMA G-Po and EWMA G-NB were given above.
In order to make a similar extension to these models, we first frame them as instances of selfmodulating DTRPs. Such models, where the interarrival time distribution is determined by a conditional rate function have been widely explored in the TPP literature [76]. Their key characteristic is that interarrival times now obey a conditional distribution, the parameters of which are easily computed. More formally, letting Q i denote interarrival times as in Definition 1, Definition 2. A sequence of Q i defines a self-modulating DTRP if the sequence obeys an identically defined conditional distribution PfQ i ¼ kjH i g for all i, k, where H i denotes the history, of events up to interval i.
EWMA models are a good example. For all i, the conditional distribution of Q i is defined as a random variable whose mean is the EWMA of previous realizations. In other words, the history H i completely determines the distribution of Q i . A natural next step is to follow the arguments of the previous section, and alter the conditional interarrival time distributions with the negative binomial distribution. We will explore EWMA NB-Po and EWMA NB-NB models as extensions of their counterparts with geometric interarrival times.
Note that the convergence problem that plagues such models is not mitigated by the change in interarrival time distributions. For our model specification, the problem takes a slightly different form. That is, over longer forecast horizons, EWMA model forecasts converge to a forecast of "all ones," introducing a slightly different bias than the one shown in [32]. We give further details of this issue in S1 Appendix. Finally, note that self-modulating DTRP models can be seen as instances of Markov renewal processes [83].

RNN-based discrete-time renewal processes
DTRPs yield an intuitive and simple way to extend IDF models. Yet, they suffer from two main limitations that hinder a realistic understanding of intermittent demand series. First, the conditional distribution of demand sizes and intervals rely only on the exponentially weighted moving average. However, the recurrence relation that determines the conditional mean of sizes and intervals may take more complicated functional forms. The second limitation arises since size and interval processes are assumed independent. One could reasonably expect that longer interdemand intervals result from higher demands-e.g., when a customer frontloads inventory-an effect not captured by this independence assumption. Finally, other features or covariates that are consequential for accurate forecasts may be available with each issue point, such as dummy variables for discounts or holidays.
To alleviate the first issue, we observe that under mild conditions any recursive form for parameterizingQ i would yield a well-defined self-modulating DTRP. An RNN is no exception.
Recall from the EWMA model definition that An LSTM, admits a similar recursion, where instead ofQ i , we introduce the hidden state h i 2 R H . We define RNN-based models via the following alternative form, where g w (h) = 1 + log(1 + exp(w > h + w 0 )) is a projection from the hidden state to the domain ofQ i , parameterized by w 2 R H ; w 0 2 R. By LSTM Θ (.), as given above, we refer to the long short-term memory recurrent neural network function parameterized by weights collected in Θ. Secondly, LSTM inputs can be extended to include both the previous interdemand time, and the previous demand size, at no significant added cost for estimation or inference. Indeed, this is the approach taken by previous IDF models with neural networks [60,61]. Finally, the LSTM inputs may include any additional covariates that are available, in contrast to other IDF methods that propose no clear way to accommodate them.
Though conceptually similar, EWMA and LSTM recurrences have a fundamental difference. The LSTM network has a potentially high-dimensional parameter vector Θ that has to be fitted. Moreover, neural networks are known to be "data-hungry," requiring a large data set to fit as opposed to the parsimonious representation of temporal continuity embedded in exponential smoothing models. In order to alleviate this issue, when multiple intermittent demand series are available, we share the parameters Θ across different samples. In machine learning terminology, the LSTM representation learned is a global model, shared across different items (SKUs). See, e.g., [47,59,84].
We give a summary of the discrete-time renewal process models introduced so far in Table 2.

Continuous time renewal processes
For our last set of models, we explore a connection between the models of the previous section and temporal point processes.
When granular timestamps of individual demand events are available, temporal point processes can be used directly as a model for intermittent demand. A flexible class of continuoustime renewal processes result from the use of RNNs similar to their functions above. To make the connection, note that continuous-time renewal processes arise as a limit case of their discrete time counterparts. For instance, it is well known that the Poisson process-the simplest continuous-time renewal process-arises as a limit case of the Bernoulli process. See, e.g., [85,Ch. 1]. Similarly, the geometric distribution of interarrival times leads to an exponential distribution in the continuous case.
Let us introduce processes fQ 0 j g; fM 0 j g as continuous-time interarrival time and demand size processes. The index j now runs over individual demand events, e.g., purchase orders, and not only positive demand intervals. We keep the (conditional) distributions of M 0 j identical to the discrete-time model as the support of the random variable and its semantics are identical. To address continuous time, we simply change Q 0 j to a random variable with continuous support. We list these models in Table 3. Note that the Static E-Po and Static E-NB are just homogeneous Poisson processes with positive integer marks.
This approach, combining RNNs with TPP, has been taken in the machine learning literature to construct flexible models [77][78][79][80], with connections to full Markov renewal processes explored recently by [86].

Synthetic data
Our work proposes two main ideas. First, we cast existing IDF models as instances of renewal processes, naturally extending previous point forecast models to probabilistic forecasting. In  contrast to point forecast methods such as Croston's estimate [1] and the Teunter-Syntetos-Babai (TSB) [30] method, renewal processes are able to produce consistent probabilistic forecasts. Second, by modeling interdemand times with more flexible distributions, we are able to better capture common intermittent demand patterns such as periodicity and aging. In order to highlight the advantage brought by using flexible renewal processes, we simulate perfectly periodic hourly intermittent demand data of 10 weeks, fixing the interdemand period at 20 hours, and drawing demand sizes i.i.d. from a Poisson distribution with a mean of 5. We illustrate forecasts from previous point forecast methods, Croston

PLOS ONE
Forecasting intermittent and sparse time series: A unified probabilistic framework via deep renewal processes are able to represent periodicity both in the forecast-i.e., the conditional expectation of the time series-and the forecast quantiles.
Our second contribution is noting that the exponential moving average in Croston-type methods can be replaced with more flexible function families to capture a wider family of patterns. Above, we posit that recurrent neural networks, used as a building block to replace the exponential moving average, can represent many interesting patterns. In order to illustrate this effect, we sample a single time series with "alternating" periods, and a constant demand size of 10. Interdemand times, in turn, alternate between 4 and 16. In Fig 3, we compare point forecasts, Static G-Po, and two new models meant to represent nonstationarities in interdemand times and demand sizes: EWMA NB-NB, and RNN-based NB-NB.
Even in this simple scenario, forecasts of the RNN-based method visibly outperform those of moving average-based methods, as the former is able to easily capture the alternating pattern.
Finally, we validate our findings via experiments on three synthetic data sets. The Periodic and Alternating data sets are obtained by repeating the respective simulations above 100 times. The Random data set is drawn from the Static G-Po model. In a sense, this data set represents purely random demand arrivals, with a Bernoulli process governing the arrival process with Poisson demand sizes.
We expect that our models do not yield any significant benefits in the Random data set. Instead, we expect that renewal processes outperform baseline methods in the Periodic data set, and the RNN-based model outperforms others in alternating periods. We report results of our experiments, where we compare three point forecast baseline methods (Croston, SBA, TSB), with the probabilistic baseline (Static G-Po) and newly introduced models in Table 4. Definitions of accuracy metrics used can be found in the S1 Appendix.
Among our results, P50 and P90 Losses can be regarded as accuracy measures of probabilistic forecasts while others measure the efficacy of point forecasts. We find that our expectation is reflected in the results, and that the RNN model is able to better represent forecast distributions across different data sets.

Real data
We test our framework on five intermittent demand data sets, comparing extended models to well-known baselines. These include three well-known standard data sets for IDF and the recently held M5 competition data set which is mostly composed of intermittent series. We also introduce a new data set from the UCI Machine Learning repository [88]. We give further details of these data sets below.
• Car Parts consists of monthly parts demand data for a US-based automobile company, previously studied in [19,36]. It includes 2674 time series. We conduct experiments on 2503 time series which have complete data for 51 months, with at least one demand point in the first 45 months.
• The Auto data set consists of monthly demand data for 3000 items, for a period of 24 months. It was previously used by [6], and originates from [25]. We report results on 1227 time series that are "intermittent" or "lumpy" based on the Syntetos-Boylan classification (SBC) [25], using the implementation in tsintermittent [87].
• The RAF data set consists of aerospace parts demand data taken from the Royal Air Force, used previously by [3]. The data set includes 84 months of data for 5000 parts.
• The UCI data set includes demand records of retail items of 4070 individual items between 01/12/2010 and 09/12/2011 of a UK-based online retailer [89]. Most notably, the data set is a transaction record of individual purchases, with exact date and timestamps. Among these, we aggregate purchase records of the last three months to daily time series, each corresponding to a single product. We retain time series with demand in the first 30 days, and with less than 90 purchase records in the entire timespan. All 1381 such time series are classified as lumpy demand per SBC.
• The M5 data set is taken from the recently launched M5 competition [67]. The data set includes daily unit sales per product and store in Walmart, over the course of over 5 years. It contains 30490 time series, with a maximum of 1913 days of data in the longest example. 23401 of these series are classified as intermittent while 5953 are lumpy according to SBC. Summary statistics of data sets are given in Table 5, where CV 2 denotes the squared coefficient of variation across all demand instances in the data set.
We take the last 6 demand periods of each time series as a held out sample, and train models on the preceding periods. Results are reported on the forecast accuracy of the last 6 periods, i. e., with a 6-step-ahead forecast. For the M5 competition, both the held-out sample and the forecast horizon are set to 28 days, in accordance with competition rules [67].
Notably, due to the lack of sufficiently large data for cross-validation, we perform no hyperparameter optimization (HPO) for the neural network architecture, regularization, or optimization parameters on the first four data sets. For RNN-based models, we use a single hidden layer of 20 units in the LSTM. For regularization, we rely on weight decay, setting the decay rate to 0.01.
On the M5 data set, where sufficiently many examples for cross-validation are available, we perform hyperparameter optimization with Amazon SageMaker's hyperparameter tuner, which implements Bayesian optimization for searching the hyperparameter space for the minimum value of the loss metric [93]. Here, we heuristically set the loss to be the simple average of the four forecast accuracy metrics reported. We reserve 20% of the in-sample training data as the validation set, and simultaneously tune the learning rate, number of layers, number of hidden units, and number of training epochs. The winning network configuration has 3 hidden layers with 30 hidden units each, which is trained in 750 epochs with a learning rate of 0.01. In our results, we report outcomes for the default hyperparameters given above, and separately for the optimized versions denoting these as "+ HPO." We evaluate performance of both forecast distributions and point forecasts. As previously noted, forecast accuracy metrics are especially elusive in the case of IDF. We report results on a variety of metrics which have been used in both IDF and in forecasting in general. The precise definitions of these metrics are given in the S1 Appendix. We repeat each experiment three times, and report the mean and standard deviation of metrics. For probabilistic models, we evaluate forecasts based on 250 trajectories sampled from the fitted model. For point forecast comparisons, we compare true values with means of sampled trajectories for simplicity of experimental design. Choosing the mean as a statistic of the probabilistic forecasts constitutes a lower bound on the accuracy as choosing different statistics in accordance with the accuracy metric could yield improved results still (see e.g., metric [94] for a discussion). Note that we are interested in evaluating different, distinct methods for addressing the IDF for which we designed our experiment. We remark however that naturally, any of the methods presented in our results can be enhanced with standard techniques such as model selection heuristics (e.g., depending on data set statistics such as intermittency patterns, choose a specific model instantiation) or ensembling techniques which are highly successful in forecasting practice (e.g., [9,52,95]). Discrete-time model results are reported in Tables 6 and 7. All zeros refers to accuracy obtained by a naive baseline-a point forecast of 0s across the forecast horizon. Across all metrics, lower numbers indicate better predictive accuracy.
Our results vary across model families and data sets, however some key themes emerge. We observe that in four of five data sets, RNN-based models lead to forecast distributions-as measured with P50 and P90 losses-that are on par with or more accurate than their non-neural counterparts. The exception is the Auto data set, which has a total time series length of 18 steps in sample, barely enough to include any interesting temporal patterns. While these findings support our hypothesis that RNNs can recover more complex temporal patterns, we believe their true potential is offset by the inability to perform hyperparameter optimization. In a real-world scenario, model architecture, training and regularization hyperparameters would be tuned for each specific data set and target forecast horizon. Here, by keeping hyperparameters constant, we report the "bare minimum" of what RNN-based models can achieve in IDF. Indeed, on the M5 data set where hyperparameters were optimized, we find that RNNs significantly outperform all other model-based methods in both probabilistic and point forecast performance, and are slightly better than point forecast methods as measured by RMSE and RMSSE.
Without the use of RNNs, flexible interdemand times alone improve both probabilistic and point forecasts in the first three data sets. This confirms our intuition that improvements by more flexible renewal-type models are data set dependent. The Car Parts data set, which has the lowest variation in demand sizes but a high variation in interdemand intervals yields some evidence in favor of using DTRPs, matching our intuition for where such models would be useful. However, in the larger UCI and M5 data sets, modeling with flexible interdemand sizes alone leads to a deterioration in forecast performance. Finally, we find scant evidence that flexible negative binomial demand sizes improve forecast accuracy.
Our results so far suggest that carefully tuned RNN-based forecasters, built on the DTRP formalism, yield slightly more accurate forecasts. However, we find no strong evidence that DTRPs alone generally outperform baseline methods. Tables 6 and 7 do not paint a clear picture of which modeling direction is more promising. Note, however, that this is in line with other empirical studies on forecasting methods, e.g., [58], where no overall dominant model is found for forecasting-unlike other areas of machine learning such as natural language processing, where dominant models have emerged (e.g., [96]). In order to quantify the improvement brought by our two main modeling ideas, we calculate the significance of improvements brought by model families across different data sets. For this, we compute the ratio of losses for each model to the loss of the Croston (Static G-Po) model, and report averages of these ratios across data sets. In Table 8, we report these average ratios for model families, and the significance level of a one-sample one-sided t-test under the null hypothesis that the ratio is equal to or greater than one-i.e., the model brings no improvement.
Here, we find that flexible demand size distributions, proposed in [19] do not generally result in improved forecasts. Moreover, renewal processes-or flexible interdemand sizes alone-do not appear to yield improvements across data sets. We also do not find that exponential moving average-based models result in any significant improvement. Comparing Table 8 to dataset results supports our claim that while these three model components may improve individual data set performance, they do not do so in general. However, we can confirm that these approaches improve on the P50 loss by a slight margin of 2 to 5 per cent. On the other hand, RNN-based models result in significantly more accurate point and probabilistic forecasts compared to simple probabilistic models. RNNs yield an improvement of 10 per cent on P50 loss, and over 5 per cent in P90 loss in general, yielding more accurate probabilistic forecasts. Their performance in squared losses also outperforms the probabilistic baseline. We can confirm from Tables 6 and 7, however, that RNN models do not generally yield better point forecasts than classical point forecasting methods. We therefore conclude that while RNNs are promising as probabilistic forecasting methods, further work has to be invested in their use before their performance is on par with carefully tuned and debiased point forecasters such as SBA and TSB. Nevertheless, results from the M5 data set are promising, as the margins between tuned RNN models and point forecasters are small, with the former slightly outperforming the latter. Finally, we test using TPPs, or continuous-time renewal processes, directly for IDF. The UCI data set is the only one where purchase records with exact timestamps are available. In Table 9, we compare Static and RNN-based models with their continuous time counterparts. Specifically, we train continuous-time models on interdemand times of individual purchase orders instead of the aggregated intermittent demand series. We take forward samples from these models, and aggregate the predicted purchase events in the last 6 time periods. We compare these with the forecast accuracy of samples taken from discrete-time models. Our results yield no evidence in favor of continuous-time models, which appear to be exhibit a similar performance with their discrete-time analogues.

Discussion and conclusion
IDF is a uniquely challenging problem; the definition of good forecasts is as elusive as the techniques used to produce them. Most previous works in IDF, starting from Croston's method [1], have focused on point forecast methods with strict assumptions. While probabilistic versions of Croston's method were introduced, these did not propose a general theoretical framework for similarly constructed forecasting methods. In this work, we proposed a general probabilistic framework for building IDF models, by making connections to well-established literatures in renewal theory and probabilistic neural forecasting. We argued that previously proposed model-based and point forecast methods can all be cast as special cases of our framework, and that it allows for significant generalizations of these methods. For example, by extending simple, widely used models with flexible interdemand time distributions, we are able to recover various common demand arrival patterns such as aging, periodicity, and clustering. RNNs, used as a subcomponent of our framework, are able to capture non-trivial temporal patterns, e.g., alternating periods which we illustrated on synthetic data.
We reported mixed results from a large set of numerical experiments on a variety of wellknown intermittent demand data sets. We found that while renewal processes alone can only lead to improved probabilistic forecasts for individual data sets, they do not generally lead to substantial improvements in performance. However, we found some evidence that RNNs, used as a subcomponent in our framework, lead to improvements in probabilistic forecast accuracy. These improvements are promising for the larger data sets, such as the M5 competition which lend themselves particularly well to neural network based techniques and are examples for an important class of practical forecasting problems [97].
Our study also highlights the intimate connection between temporal point processes and intermittent demand models. Neural TPPs (see e.g., [98]), a recent research direction in machine learning, can be seen as continuous-time instances of our framework. These models are especially well-suited to scenarios in which purchase order data are directly available, removing the need to temporally aggregate demand instances, and directly taking advantage of full temporal granularity. While our empirical results on limited data available do not yield substantive evidence for taking this approach, we believe this connection can be pursued further for demand forecasting research.
In the context of probabilistic forecasting, our theoretical framework does not only yield useful forecasting tools but also opens up exciting avenues for further research for which especially in the recent data set from the M5 competition will provide inspiration. To provide two concrete examples: (i) we believe that a better understanding of the intermittency patterns in the M5 data set will allow for an improved, more fine-tuned model selection of the different instantiations of our framework (or additionally, more flexible extensions thereof); and (ii) combination of the probabilistic models provided here with hierarchical forecasting techniques for probabilistic methods (such as [99,100]) will offer ample opportunity for further work. In more general terms, although we do not find that the individual models categorically outperform classical point forecasting baselines, we believe experimenting with other instances of our framework, e.g., with other statistical distributions, neural network architectures and training strategies, can lead to closing this gap. Finally, we hope that casting intermittent demand problems in the language of renewal theory can lead to new results and simple tools, e.g., for multi-step ahead forecast estimates or generalized point forecast methods.