Modified Kumaraswamy seasonal autoregressive moving average models with exogenous regressors for double-bounded hydro-environmental data

  • Aline Armanini Stefanan ,

    Contributed equally to this work with: Aline Armanini Stefanan, Murilo Sagrillo, Bruna G. Palm, Fábio M. Bayer

    Roles Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    aline.armanini@acad.ufsm.br

    Affiliation Postgraduate Program in Industrial Engineering, Universidade Federal de Santa Maria, Santa Maria, Rio Grande do Sul, Brazil

  • Murilo Sagrillo ,

    Contributed equally to this work with: Aline Armanini Stefanan, Murilo Sagrillo, Bruna G. Palm, Fábio M. Bayer

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – review & editing

    Affiliation Postgraduate Program in Industrial Engineering, Universidade Federal de Santa Maria, Santa Maria, Rio Grande do Sul, Brazil

  • Bruna G. Palm ,

    Contributed equally to this work with: Aline Armanini Stefanan, Murilo Sagrillo, Bruna G. Palm, Fábio M. Bayer

    Roles Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Mathematics and Natural Sciences, Blekinge Institute of Technology, Karlskrona, Blekinge, Sweden

  • Fábio M. Bayer

    Contributed equally to this work with: Aline Armanini Stefanan, Murilo Sagrillo, Bruna G. Palm, Fábio M. Bayer

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    Senior author.

    Affiliations Department of Mathematics and Natural Sciences, Blekinge Institute of Technology, Karlskrona, Blekinge, Sweden, Department of Statistics, Universidade Federal de Santa Maria, Santa Maria, Rio Grande do Sul, Brazil

Abstract

This paper proposes the MKSARMAX model for modeling and forecasting time series that can only take on values within a specified range, such as the interval (0,1). The model is particularly well suited to double-bounded hydro-environmental time series since it accommodates bounded support and asymmetric distributions, which makes it advantageous compared to traditional Gaussian-based time series models. The MKSARMAX models the conditional median of a modified Kumaraswamy distributed variable observed over time through a dynamic structure that considers stochastic seasonality and includes autoregressive and moving average terms, exogenous regressors, and a link function. The conditional maximum likelihood method is employed to estimate the model parameters. Hypothesis tests and confidence intervals for the parameters of the proposed model are derived using the asymptotic theory of the conditional maximum likelihood estimators. Quantile residuals are defined for diagnostic analysis, and goodness-of-fit tests are subsequently implemented. Synthetic hydro-environmental time series are generated in a Monte Carlo simulation study to assess the finite sample performance of the inferences. Moreover, MKSARMAX outperforms βSARMA, SARMAX, Holt-Winters, and KARMA models in most accuracy measures analyzed when applied to useful water volume datasets, presenting lower first-step forecast MAE, RMSE, and MAPE values than its competitors in both the Caconde and Guarapiranga UV datasets. These findings suggest that the MKSARMAX model holds strong potential for water resource management. Its flexibility and accuracy in the early forecasting steps make it particularly valuable for predicting flood and drought periods.

1 Introduction

Improving modeling capability and forecast accuracy for hydro-environmental variables remains an ongoing objective in scientific research. Hydro-environmental time series, such as relative humidity, water level, rainfall depth or volume, wave height, streamflow, and groundwater level values recorded at regular time intervals, are generally modeled from a stochastic view by the traditional Gaussian-based autoregressive moving average (ARMA) model [1], which assumes that the time series is real-valued and follows a mesokurtic and symmetric distribution. However, several hydro-environmental variables, such as useful water volume, are asymmetrically distributed and present bounded support [2]. In such situations, the assumption of Gaussianity can lead to inaccurate inferences, wrong predictions, and erroneous interpretations, such as predicting values out of bounds [3,4]. One approach to extending the Gaussian ARMA to non-Gaussian time series data is the generalized autoregressive moving average (GARMA) model, proposed by [5], which extends the work of [6], on autoregressive and Markov chain models for time series, and of [7], which considered the moving average component. The GARMA model has a dynamic generalized linear model (GLM) framework [8] and models data whose conditional distribution belongs to the canonical exponential family, such as the Poisson, binomial, and gamma distributions.

From a stochastic perspective, probability distributions commonly applied to hydrological processes include the Kumaraswamy [9], inflated Kumaraswamy [10], beta [11–13], inflated beta [14], beta prime [15], kappa [16], Rayleigh [17], inflated Rayleigh [18], Weibull [19], and Gumbel [20] distributions. In particular, when the variable of interest exhibits double-boundedness and an asymmetric distribution, researchers have considered the beta and Kumaraswamy laws [9,13,14,21,22]. Dynamic models based on these distributions, the beta autoregressive moving average (βARMA) and the Kumaraswamy autoregressive moving average (KARMA) models, were proposed in [3] and [9], respectively. Additionally, the beta seasonal autoregressive moving average (βSARMA) model, proposed by [21], extends the class of βARMA models by incorporating seasonal dynamics. However, as discussed in [2], the beta and Kumaraswamy-based models may have limitations in their flexibility for modeling certain types of hydrological data. Thus, [2] proposed a more flexible two-parameter probability model called the modified Kumaraswamy (MK) distribution. This probability model accommodates density shapes that the beta and Kumaraswamy distributions cannot reproduce, such as increasing-decreasing-increasing shapes, as can be seen in Fig 1. It is also useful for modeling left- or right-skewed data with heavy or non-heavy tails.

Considering an empirical application in the percentage of useful water volume of several Brazilian water reservoirs, the authors showed that the MK distribution excels over competing models (beta and Kumaraswamy), being a suitable and prominent alternative for modeling double-bounded hydro-environmental data.

To fit the MK model, [2] assumes constant parameters and independence among the data points, which are strong assumptions in time series data. To the best of our knowledge, a dynamic model based on the MK distribution has not been addressed in the literature, and this paper aims to provide the first treatment on this topic.

1.1 Research motivation and objective

It is common knowledge that natural resources are finite on our planet, and since our life depends on them, it is imperative that we monitor and utilize these resources efficiently. Modeling and forecasting tools for hydro-environmental variables play a crucial role in water resource management by facilitating the development of strategies to address future conditions, such as floods and droughts, ensuring that the population is adequately supplied with water and energy. However, existing stochastic time series models in the literature fail to incorporate asymmetric distributions with double-bounded support and also do not adequately address stochastic seasonality or exogenous regressors within the systematic component. Including these characteristics in a model would significantly expand the potential for modeling and forecasting a wide range of hydro-environmental variables.

Our goal is to propose a new dynamic time series model based on the MK distribution. This model, named the modified Kumaraswamy seasonal autoregressive moving average model with exogenous regressors (MKSARMAX), is designed to model the conditional median of seasonal double-bounded time series. The model consists of autoregressive and moving average terms, a set of regressors, a seasonal component, and a link function. Link functions are used in non-Gaussian distributed models to relate the model scale to the original data scale, allowing the use of linear models to make predictions without producing inappropriate results [8].

Being a new approach to modeling time series, the MKSARMAX model considers the time dependence in the data, specifically, that the current value is related to past values, past prediction errors, or both. The time dependence can be measured by autocorrelation, also known as serial correlation. When time dependence is not considered in the model, which is the case of linear regression models, the regression assumption of no autocorrelation of the errors is violated. It follows that significance tests and confidence intervals would be incorrect, with an increased risk of type I errors, that is, of indicating that parameters are significant when they are not [23]. Therefore, it is essential to model time series using an approach that accounts for the inherent time dependence present in this type of data.

In addition to considering a more flexible distribution in the random component compared to competitors, such as the KARMA and βSARMA models, the MKSARMAX model is the only one that accommodates both stochastic seasonality and exogenous regressors in the systematic component. For the proposed model, parameter estimation, conditional observed information matrix, validation tests, prediction, and forecast are introduced. Synthetic and observed double-bounded hydro-environmental time series were considered to numerically assess the performance of the proposed model.

The paper is organized as follows. In Sect 2, we introduce the mathematical formalism of the proposed model. The conditional maximum likelihood estimation and hypothesis testing inference are shown in Sect 3. Model validation, including diagnostic analysis, model selection, prediction, and forecasting, are discussed in Sect 4. Monte Carlo simulations and two applications to observed datasets can be found in Sect 5 to evaluate the performance of the proposed model. The conclusion is presented in Sect 6. The derivation of the observed information matrix is presented in S1 Appendix, and supplementary simulation results are included in S2 Appendix.

2 The proposed model

This section reviews the MK distribution, introduces the median-based reparameterization of this distribution, and derives a new dynamic model suitable for modeling the conditional median of MK-distributed time series.

2.1 The modified Kumaraswamy (MK) distribution

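The MK distribution proposed in [2] is based on a transformation, $Y = [1 - \log(Z)]^{-1}$, of a Kumaraswamy-distributed variable Z. The random variable Y assumes values y in the unit interval and has the probability density function and the cumulative distribution function given, respectively, by

$$f(y; \alpha, \beta) = \frac{\alpha \beta \, e^{\alpha - \alpha/y} \left(1 - e^{\alpha - \alpha/y}\right)^{\beta - 1}}{y^{2}} \quad \text{and} \quad F(y; \alpha, \beta) = 1 - \left(1 - e^{\alpha - \alpha/y}\right)^{\beta}, \quad 0 < y < 1, \qquad (1)$$

with $\alpha > 0$ and $\beta > 0$ being the shape parameters. The quantile function, useful to generate pseudo-random variables, computes the quantile of order $u \in (0,1)$ and is given by

$$Q(u; \alpha, \beta) = \frac{\alpha}{\alpha - \log\left[1 - (1 - u)^{1/\beta}\right]}. \qquad (2)$$

Note that when $u = 0.5$, we have the median ($\mu$) of Y, i.e., $\mu = Q(0.5; \alpha, \beta)$.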
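
As a numerical illustration of the functions reviewed in this subsection, the following Python sketch evaluates the density, distribution, and quantile functions of the MK law. It assumes the two-shape-parameter form attributed to [2], with the parameters named alpha and beta here; the function names and the parameter values are hypothetical and are not taken from the authors' R implementation [43].

```python
import math


def mk_pdf(y, alpha, beta):
    """MK density at y in (0, 1) for shape parameters alpha, beta > 0 (assumed form)."""
    w = math.exp(alpha - alpha / y)  # w = z**alpha, with z = exp(1 - 1/y)
    return alpha * beta * w * (1.0 - w) ** (beta - 1.0) / y ** 2


def mk_cdf(y, alpha, beta):
    """MK cumulative distribution function at y in (0, 1)."""
    return 1.0 - (1.0 - math.exp(alpha - alpha / y)) ** beta


def mk_quantile(u, alpha, beta):
    """MK quantile of order u in (0, 1), obtained by inverting mk_cdf."""
    return alpha / (alpha - math.log(1.0 - (1.0 - u) ** (1.0 / beta)))


# Illustrative (hypothetical) parameter values.
alpha, beta = 1.5, 2.0
median = mk_quantile(0.5, alpha, beta)

# Midpoint-rule check that the density integrates to about one on (0, 1).
steps = 20000
area = sum(mk_pdf((i + 0.5) / steps, alpha, beta) for i in range(steps)) / steps
```

Evaluating the distribution function at the computed median should return a value very close to 0.5, which is a quick consistency check between the two formulas.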

2.2 The dynamic MKSARMAX model

Parametric statistical models typically aim to model a central tendency measure, such as mean or median. However, the original parameters of the MK distribution lack direct physical or statistical interpretations. Therefore, a median-based reparameterization for the MK distribution is considered when proposing the MK-based dynamic model. From Eq 2, the following relations can be derived:

Substituting the quantity described above into Eqs 1 and 2, we obtain the median-based MK distribution, as detailed below.

Let $\{Y_t\}_{t \in \mathbb{Z}}$ be a stochastic process in which each Yt assumes values yt in the interval (0,1) with probability 1, and let $\mathcal{F}_{t-1}$ denote the σ-field generated by the previous observations of yt. Assume that each Yt, conditionally on $\mathcal{F}_{t-1}$, follows the median-based MK distribution. With the median-based parametrization of Yt, the conditional density, cumulative distribution, and quantile functions are given, respectively, by

where $\mu_t$ is the conditional median and the remaining parameter is a shape parameter.
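
The median-based parametrization can be illustrated with a short Python sketch. It rests on two assumptions that should be read as such: the MK form of [2] with shape parameters called alpha and beta, and the choice of eliminating beta in favor of the median mu by setting the quantile of order 0.5 equal to mu. The function names are hypothetical.

```python
import math


def beta_from_median(mu, alpha):
    """Shape parameter beta implied by median mu in (0, 1) and shape alpha > 0.

    Assumption: obtained by solving Q(0.5; alpha, beta) = mu for beta.
    """
    return math.log(0.5) / math.log(1.0 - math.exp(alpha - alpha / mu))


def mk_cdf_median(y, mu, alpha):
    """MK distribution function written in terms of the median mu."""
    beta = beta_from_median(mu, alpha)
    return 1.0 - (1.0 - math.exp(alpha - alpha / y)) ** beta


def mk_quantile_median(u, mu, alpha):
    """MK quantile function written in terms of the median mu."""
    beta = beta_from_median(mu, alpha)
    return alpha / (alpha - math.log(1.0 - (1.0 - u) ** (1.0 / beta)))


# Hypothetical values: the quantile of order 0.5 should recover mu.
mu, alpha = 0.3, 1.5
recovered_median = mk_quantile_median(0.5, mu, alpha)
```

Under this assumed parametrization, the median enters the model directly, which is what allows a regression-type structure to be placed on it.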

The dynamic structure of the proposed MKSARMAX is defined according to

(3)

where B is the backshift operator such that $B^d y_t = y_{t-d}$ for a nonnegative integer d, $g(\cdot)$ is a strictly monotone and twice differentiable link function such that $g: (0,1) \to \mathbb{R}$, $x_t$ is the k–dimensional vector containing the exogenous regressors at time t, with an associated k–dimensional vector of unknown parameters, and the model also includes an intercept. The autoregressive and moving average terms are defined as in [1]: (i) $\Phi(B^S)$ is the seasonal autoregressive operator, considered as a polynomial in $B^S$ of degree P, where P is the seasonal autoregressive order of the model and S is the seasonal frequency; (ii) $\phi(B)$ is the autoregressive operator, where p is the autoregressive order; (iii) $\Theta(B^S)$ is the seasonal moving average operator, where Q is the seasonal moving average order; and (iv) $\theta(B)$ is the moving average operator, where q is the non-seasonal moving average order. The error term is considered in the predictor scale, $r_t = g(y_t) - g(\mu_t)$, as considered in [9]. The model name is adapted according to the terms included, e.g., MKSARMA when it does not present exogenous regressors, and MKARMAX(p,q) when it does not present stochastic seasonality.

The general MKSARMAX model can also be written from (3) in the following GLM-like structure

(4)

The dynamic model structure presented above is similar to that of the βSARMA model [21]. However, our approach diverges by employing the MK distribution (shown in [2] to outperform the beta and Kumaraswamy distributions for hydro-environmental data) and exogenous regressors. Furthermore, the exogenous regressors are modeled separately from the intercept, which differs from the dynamic structure considered in [24]. The inclusion of exogenous regressors in the model can be valuable since external variables can impact the time series behavior and improve predictions, as demonstrated in [25] and [26]. By considering exogenous regressors, the model can accommodate level changes, deterministic trends, or any deterministic action over the time series, thereby enabling, for example, the handling of human interventions in the hydrological cycle and the adequate modeling of level changes in double-bounded time series data.

Conditions for stationarity, causality, and invertibility for dynamic time series models that assume conditional probability structures under non-Gaussian distributions remain an open and challenging topic in the literature. This limitation primarily arises because, for link functions other than the identity, the moving average error terms do not form a martingale difference sequence. As a result, the first two moments of the marginal distribution become analytically intractable, as discussed in [5], where the authors briefly addressed this point through a simulation-based approach. Nevertheless, this theoretical characteristic does not compromise the practical applicability of such models, which are widely employed in various fields (see, e.g., [27–30]). An important point we can assert regarding stationarity in double-bounded time series is that, due to the bounded support, both mean and variance are finite by construction.

Remark 1. The analysis in this paper focuses on the median of a time series with values in the interval (0,1). Therefore, appropriate link functions satisfying $g: (0,1) \to \mathbb{R}$ include the logit, probit, and cloglog functions.
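
The three link functions named in Remark 1 and their inverses can be sketched in Python as follows. This is only an illustration using the standard library; the dictionary name and the test value are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Each entry maps a link name to (link, inverse link); all map (0, 1) onto the real line.
LINKS = {
    "logit": (
        lambda m: math.log(m / (1.0 - m)),
        lambda e: 1.0 / (1.0 + math.exp(-e)),
    ),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (
        lambda m: math.log(-math.log(1.0 - m)),
        lambda e: 1.0 - math.exp(-math.exp(e)),
    ),
}

# Applying a link and then its inverse should return the original median.
mu = 0.8
round_trip = {name: inv(g(mu)) for name, (g, inv) in LINKS.items()}
```

Because each inverse link returns a value strictly inside (0,1), predictions built on the linear predictor scale cannot leave the support of the data.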

Remark 2. A quantile approach can also be considered for the MKSARMAX by choosing a quantile order other than 0.5 in (2). While this option is available for the model, the analysis in this paper focuses on the median due to its interpretability as a measure of central tendency.

3 Conditional likelihood inference

In this section, we present the theory of conditional likelihood inference for the parameters, covering point estimation, hypothesis testing inference, and confidence intervals.

3.1 Parameters estimation

Let be a sample of a MKSARMAX (P,Q)S stochastic process and be the -dimensional parameter vector, where . The conditional maximum likelihood estimators (CMLE) for model parameters are obtained by maximizing the logarithm of the conditional likelihood function. To achieve this, the score vector, which consists of the derivatives of the conditional log-likelihood function with respect to the parameters, is expressed as

(5)

where is the logarithm of the conditional likelihood function of the parameter vector conditional on the initial observations, and it is given by

where

being and .

Considering the chain rule to compute the derivatives of Eq 5, for and , we have

where

and

being the derivative of the inverse of the link function . The derivatives of with respect to the parameters , and , considering the general form of the model, are computed as

Finally, the derivative of with respect to the parameter is given by

where

The CMLE of , , is obtained by solving

(6)

where is the -dimensional vector of zeros, whose solution has no closed-form. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [31] with analytic first derivatives was chosen as the nonlinear optimization algorithm to solve the system in Eq 6. The initial values for the iterative method were set as follows: (i) and were set as zero as in [21]; (ii) was obtained using the generalized simulated annealing function [32] as in [2]; (iii) , , , and were derived from the ordinary least squares estimate associated with the linear regression where the response vector is given by , and the matrix of the independent variables is given by

similar to [21] and [33].

3.2 Hypothesis testing inference and confidence intervals

Based on the Central Limit Theorem, the CMLE asymptotically follows a normal distribution, with the true parameter values as the mean and the inverse of the observed information matrix (K) as the covariance matrix. The asymptotic variances of the parameter estimators, which are useful for hypothesis testing inference, are the diagonal elements of the variance-covariance matrix of the CMLE, which is obtained as the inverse of K, $K^{-1}$. The closed-form expression for K, whose elements are the negative second derivatives of the conditional log-likelihood function, is derived in detail in S1 Appendix.

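The test for

$$H_0: \gamma_m = \gamma_m^0 \quad \text{versus} \quad H_1: \gamma_m \neq \gamma_m^0,$$

where $\gamma_m$ represents the mth component of the parameter vector, can be performed based on the signed square root of Wald’s statistic [34,35], which is given by

$$Z_m = \frac{\hat{\gamma}_m - \gamma_m^0}{\sqrt{\widehat{\mathrm{var}}(\hat{\gamma}_m)}},$$

where $\widehat{\mathrm{var}}(\hat{\gamma}_m)$ is the estimated asymptotic variance of $\hat{\gamma}_m$, given by the mth diagonal element of $K^{-1}$ evaluated at the CMLE vector. When |Zm| is greater than the corresponding two-sided critical value of the standard normal distribution, the null hypothesis is rejected at the chosen significance level.

Confidence intervals for $\gamma_m$ can be derived based on the asymptotic normal distribution of the CMLE, as the estimate plus or minus that critical value times $\sqrt{\widehat{\mathrm{var}}(\hat{\gamma}_m)}$.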
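
The Wald-type test and the asymptotic confidence interval described in this subsection can be sketched in Python for a single coefficient. The function name, the default significance level of 0.05, and the numerical inputs below are hypothetical; in practice the standard error would come from the inverse of the observed information matrix.

```python
from statistics import NormalDist


def wald_z_test(estimate, std_error, null_value=0.0, level=0.05):
    """Signed-root Wald statistic, two-sided p-value, and normal-theory interval."""
    std_normal = NormalDist()
    z = (estimate - null_value) / std_error
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    crit = std_normal.inv_cdf(1.0 - level / 2.0)  # two-sided critical value
    interval = (estimate - crit * std_error, estimate + crit * std_error)
    return z, p_value, interval


# Hypothetical coefficient estimate and standard error.
z, p_value, ci = wald_z_test(estimate=0.5, std_error=0.1)
```

The null hypothesis is rejected when the p-value falls below the chosen level, which is equivalent to the null value lying outside the interval.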

4 Validation, model selection, prediction and forecasting

This section presents the residual diagnostic analysis, model selection, prediction, and forecasting for the MKSARMAX model. For that, it is necessary to compute the fitted values $\hat{\mu}_t$. These are obtained by replacing the model parameters with their estimators in Eq 4 and applying the inverse link function to the resulting linear predictor, as defined in Sect 4.3.

4.1 Residuals and goodness-of-fit tests

After the parameters have been estimated, diagnostic checks and goodness-of-fit tests of the fitted model should be considered. For that, the quantile residual can be used. This residual is widely employed in the literature since it is approximately Gaussian distributed with zero mean and unit variance when the model is correctly specified [36], and is given by

$$e_t = \Phi^{-1}\big(F(y_t \mid \mathcal{F}_{t-1})\big),$$

with the conditional cumulative distribution function F evaluated at the estimated parameters, where $\Phi^{-1}$ denotes the standard normal quantile function.
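
A minimal Python sketch of the quantile residual computation follows. It assumes the median-based MK distribution function obtained by eliminating the second shape parameter of the form in [2], which is an assumption about the parametrization; the function names, the clipping constant, and the data are hypothetical.

```python
import math
from statistics import NormalDist


def mk_cdf_median(y, mu, alpha):
    """Assumed median-based MK distribution function (shape beta eliminated)."""
    beta = math.log(0.5) / math.log(1.0 - math.exp(alpha - alpha / mu))
    return 1.0 - (1.0 - math.exp(alpha - alpha / y)) ** beta


def quantile_residuals(y, mu_hat, alpha_hat, eps=1e-12):
    """Standard normal quantile of the fitted conditional CDF at each observation."""
    std_normal = NormalDist()
    residuals = []
    for y_t, mu_t in zip(y, mu_hat):
        p = mk_cdf_median(y_t, mu_t, alpha_hat)
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        residuals.append(std_normal.inv_cdf(p))
    return residuals


# Hypothetical observations and fitted medians.
y_obs = [0.40, 0.55, 0.70]
mu_fit = [0.40, 0.50, 0.75]
res = quantile_residuals(y_obs, mu_fit, alpha_hat=1.5)
```

An observation equal to its fitted median yields a residual of about zero, while observations above or below the fitted median yield positive or negative residuals, respectively.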

After fitting the MKSARMAX model, the residuals are expected to be approximately normally distributed, uncorrelated, exhibit a mean close to zero, and display constant variance. The presence of these properties suggests that the model has adequately captured the underlying structure of the time series and that no significant patterns remain unexplained.

To verify the goodness-of-fit of the fitted model, the Ljung-Box test [37], Jarque–Bera test [38], and autoregressive conditional heteroskedasticity (ARCH) test [39] can be applied to the residual series to assess the absence of autocorrelation, Gaussianity, and the absence of heteroscedasticity, respectively.
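
For illustration, two of these test statistics can be computed with the Python standard library as sketched below. The sketch uses the textbook forms of the Ljung-Box Q statistic and the Jarque–Bera statistic; the Jarque–Bera p-value uses the fact that the survival function of a chi-square variable with two degrees of freedom is exp(-x/2). Function names and data are hypothetical, and the Ljung-Box p-value is omitted because its degrees of freedom depend on the fitted model.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n)) / denom


def ljung_box_q(x, lags):
    """Ljung-Box Q statistic over lags 1..lags."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, lags + 1)
    )


def jarque_bera(x):
    """Jarque-Bera statistic and its asymptotic chi-square(2) p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    skew = (sum((v - mean) ** 3 for v in x) / n) / m2 ** 1.5
    kurt = (sum((v - mean) ** 4 for v in x) / n) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)


# Hypothetical residual series.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
q_stat = ljung_box_q(residuals, lags=1)
jb_stat, jb_p = jarque_bera(residuals)
```

Large values of either statistic count as evidence against the corresponding null hypothesis of uncorrelated or Gaussian residuals.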

4.2 Model selection criteria

To select the order of the MKSARMAX model for a given time series, the modified Bayesian information criterion (MBIC) proposed by [21] can be applied. The MBIC is given by

where $\hat{\ell}$ is the maximized conditional log-likelihood function of the fitted model. The MBIC is advantageous over the Bayesian information criterion (BIC) [40,41] in conditional likelihood inference, when the initial observations are not fitted, because it avoids erroneously penalizing models with numerous parameters. The BIC is chosen over the AIC because it penalizes overfitting more heavily and thus favors more parsimonious models. The selected model has the lowest MBIC value among a set of competing fitted models.

4.3 Prediction and forecasting

In-sample predictions are obtained from the fitted values as seen previously. Out-of-sample predictions, i.e., forecasts H steps ahead for $t = n+1, \ldots, n+H$, are obtained by setting rt = 0 for $t > n$, replacing the model parameters with their estimates in Eq 4, and applying the inverse of the link function to the resulting linear predictor for the rolling window forecast, and by additionally replacing g(yt) with its predicted counterpart for the traditional forecast. So the general form of prediction and forecasting is given by

where

5 Numerical evaluation and discussion

In this section, Monte Carlo simulations are presented and discussed to assess the introduced model using synthetic data. Different sample sizes and model structures are employed for this purpose. Additionally, applications to useful water volume datasets are performed by comparing the proposed and some competing models. The considered link function is the logit, i.e., $g(\mu_t) = \log[\mu_t/(1-\mu_t)]$, which is the most common choice in other models for double-bounded data in the interval (0,1), such as the KARMA and βARMA models [3,9,11,13,27]. A common significance level is adopted for the Wald, Ljung-Box, Jarque-Bera, and ARCH tests, and a separate level is adopted for the Diebold-Mariano test. Implementations in the R language [42] for MKSARMAX model fitting are available [43].

5.1 Monte Carlo simulations

Monte Carlo simulations were performed to evaluate the finite sample performance of the CMLE on synthetic hydro-environmental data. The parameters used for generating synthetic hydro-environmental time series were obtained by fitting real datasets with the MKSARMAX model, enabling their evaluation in the simulation study. We simulated 5000 replications of synthetic hydro-environmental time series considering four different sample sizes. The mean, average bias, relative bias, standard error (SE), and mean square error (MSE) were adopted as figures of merit to numerically evaluate the point estimators of the model parameters. The coverage rates (CR) of the confidence intervals were computed to evaluate interval estimation. In each Monte Carlo replication, the inversion method was employed to simulate MK distributed hydro-environmental time series with $\mu_t$ based on Eq 4.
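
The inversion method mentioned above can be sketched in Python: a uniform draw is passed through the quantile function of the MK distribution. The sketch assumes the median-based quantile function obtained from the form in [2] by eliminating the second shape parameter, and it takes the sequence of medians as given rather than generating it from the dynamic structure; names, seed, and values are hypothetical.

```python
import math
import random


def mk_quantile_median(u, mu, alpha):
    """Assumed median-based MK quantile function (shape beta eliminated)."""
    beta = math.log(0.5) / math.log(1.0 - math.exp(alpha - alpha / mu))
    return alpha / (alpha - math.log(1.0 - (1.0 - u) ** (1.0 / beta)))


def simulate_mk(medians, alpha, rng):
    """Draw one MK variate per supplied median using the inversion method."""
    draws = []
    for mu_t in medians:
        # Keep the uniform draw strictly inside (0, 1) to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        draws.append(mk_quantile_median(u, mu_t, alpha))
    return draws


# Hypothetical setting: constant median 0.6 and shape 1.5.
rng = random.Random(2024)
sample = simulate_mk([0.6] * 20001, alpha=1.5, rng=rng)
sample_median = sorted(sample)[len(sample) // 2]
```

With a constant median, the sample median of a long simulated series should lie close to the value supplied, which gives a simple sanity check of the generator.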

The simulation results of MKSARMA (1,1)12 scenario are presented in Table 1; the parameter values are presented inside the parentheses. For this scenario, one model was discarded in the simulations for , and two models were discarded in the simulations for , due to failure to converge using the BFGS optimization method or a non–positive-definite observed information matrix. As expected, bias and MSE figures reduce as n grows. This behavior evidences the consistency of the CMLE. The coverage rate is close to the nominal value, , especially for the sample sizes equal to 500 and 700. Note that: (i) , , and excel in terms of relative bias; (ii) relative bias converges faster to zero than the other estimators; (iii) the seasonal parameter estimators, and , display the highest relative bias values—such fact was also discussed in [24] regarding the CMP-ARMA model, where it was found, by simulation studies, that inferences about seasonal parameters estimators are more biased; and (iv) shows the highest MSE values.

Table 1. Mean, bias, relative bias, standard error, and MSE of MKSARMA parameter estimators, and coverage rate for the confidence interval. Parameter values are presented inside the parentheses.

https://doi.org/10.1371/journal.pone.0324721.t001

Simulation results of MKARMAX (1,1) scenario are presented in Table 2. For this scenario, one exogenous regressor x is changing the level of the time series in the middle of the sample size, being and . As expected, bias and MSE reduce as n increases, and the coverage rate is close to the nominal value, , for the sample sizes . Note that: (i) the moving average parameter estimators, , display the highest average relative bias values; (ii) shows the highest MSE values; and (iii) and excel in terms of relative bias. Complementary simulation results for MKSARMA (1,0)12 and MKSARMA (0,1)12 are presented in S2 Appendix. All the models successfully converged and the observed information matrices were positive definite in the MKARMAX (1,1), MKSARMA (1,0)12, and MKSARMA (0,1)12 scenarios, ensuring the reliability of the estimated parameters.

Table 2. Mean, bias, relative bias, standard error, and MSE of MKARMAX parameter estimators, and coverage rate for the confidence interval. Parameter values are presented inside the parentheses.

https://doi.org/10.1371/journal.pone.0324721.t002

In general, the Monte Carlo simulation results show that the performance of the conditional likelihood inference in the MKSARMAX is good for finite samples. The CMLE presents minor errors and biases as the sample size of synthetic hydro-environmental time series increases, and the coverage rate reaches close to the confidence level in all analyzed scenarios. To complement the numerical evaluation of the model, the MKSARMAX fit and forecast for two real datasets of monthly useful water volume (UV) [44] are presented in the following.

5.2 Application to Caconde useful water volume dataset

This section evaluates the effectiveness of the proposed model on the monthly useful water volume of the Caconde Reservoir, the reservoir of a hydroelectric power plant situated in Caconde, SP, Brazil. This volume is defined as the percentage of a reservoir’s volume between its maximum and minimum operational levels [45]. In the case of reservoirs of hydroelectric power plants, such as the Caconde Reservoir, the useful volume represents the volume of water in the reservoir that can be effectively used for power generation, and it is obtained from the ratio between the current level of the reservoir and the difference between the maximum and minimum operational levels [13]. The Caconde UV dataset employed in this section runs from January 2015 to February 2024. The last 8 observations were used to assess the traditional and rolling window forecasting (Sect 4.3) performance of the proposed model; thus, n = 102 and H = 8. The time series is presented in Fig 2 through its time plot, seasonal plot, autocorrelation function (ACF) plot, and partial autocorrelation function (PACF) plot. The fitting dataset ranges from 0.1500 to 0.9930, with an unconditional median of 0.5285, a sample average of 0.5804, and a standard deviation of 0.2666. The nonparametric Friedman test [46] indicates significant seasonality in the time series, which can also be seen in Fig 2. Typically, at the end of the rainfall period, from December to April, the reservoirs reach higher useful water volume levels than in the other periods of the year.

Fig 2. Time series, seasonal plot, ACF, and PACF for the Caconde UV dataset from t = 1 to 102.

https://doi.org/10.1371/journal.pone.0324721.g002

In order to select the best MKSARMAX fitted model, only models in which (i) the BFGS method converges and (ii) the observed information matrix is positive semidefinite were considered. Some other critical points were verified, such as (iii) significant parameter coefficients (Wald test, Sect 3.2); (iv) autoregressive and moving-average coefficient roots lying outside the unit circle to approximately ensure the stability of g(y); and (v) quantile residuals behaving as a white noise process (Sect 4.1). If the Wald test fails to reject the null hypothesis for any coefficient of the fitted model, the fitted model is deemed inadequate and is discarded [47]. Thus, different lags up to the defined order are considered to find the model that best fits the dataset.

It is important to check the historical facts that affect the time series so that they can be incorporated into the model as exogenous regressors. If cycles outside of the seasonal period are identified, they can also be incorporated in this way. In addition, droughts and floods are critical periods for reservoir UV, and the time series behaves differently in those periods, presenting very low UV values in droughts and very high values in floods. Between January 2015 and February 2024, Brazil experienced a major water crisis in 2021 [48]. This crisis was caused mainly by a drought period that led to low useful water volume in the reservoirs and compromised the power generation of the Caconde Reservoir, since the hydroelectric power plant had to lower the minimum flow level [49]. Consequently, energy production was reduced to extend the retention of water within the reservoirs for a prolonged period.

Restricting the maximum model orders p, q, P, and Q to 4 for computational simplification and considering the model selection criteria described in Sect 4.2, we successfully modeled the Caconde UV dataset using an MKSARMAX model, with X21 being an exogenous regressor that changes the level of the time series during the Brazilian water crisis [48], from January 2021 to December 2021. The fitted MKSARMAX and its diagnostic test results are presented in Table 3. According to the employed validation tests, the model residuals are independent, homoscedastic, and normally distributed. The Ljung-Box test analyzed 24 lags, and the ARCH test analyzed 10 lags. Fig 3 presents the ACF and PACF of the residuals. These agree with the Ljung-Box test evaluation of non-autocorrelation, with most autocorrelations and partial autocorrelations lying within the confidence interval.
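
A level-shift regressor such as X21 can be built as an indicator variable over the monthly time index. The Python sketch below assumes a 0/1 coding, which is an assumption about how the regressor is encoded, and uses the sample period and the split n = 102, H = 8 stated for the Caconde UV dataset; the helper name is hypothetical.

```python
def monthly_index(start_year, start_month, n_months):
    """List of (year, month) pairs for n_months consecutive months."""
    months = []
    year, month = start_year, start_month
    for _ in range(n_months):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


# January 2015 to February 2024 gives 110 monthly observations.
months = monthly_index(2015, 1, 110)

# Indicator equal to 1 from January 2021 to December 2021 and 0 otherwise.
x21 = [1 if year == 2021 else 0 for year, _ in months]

# First 102 observations for fitting, last 8 for forecasting.
x21_fit, x21_forecast = x21[:102], x21[102:]
```

Because the crisis period lies entirely inside the fitting sample, the regressor is zero over the whole forecast horizon in this setting.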

Fig 3. ACF and PACF of MKSARMAX quantile residuals for the Caconde UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g003

Table 3. Fitted MKSARMAX for the Caconde UV dataset. CMLE coefficients, confidence intervals, statistic and p-value of Wald test, and MBIC are presented, as well as the statistics and p-values of Ljung-Box, Jarque-Bera, and ARCH tests.

https://doi.org/10.1371/journal.pone.0324721.t003

The autoregressive parameters in the model suggest that the current value is a function of past values. Thus, the significant $\phi_2$ and $\phi_3$ indicate that the UV in the second and third previous months significantly helps explain the UV in the current month, and the significant $\Phi_3$ and $\Phi_4$ indicate that the UV in the same month three and four years earlier helps explain the UV in the current month. The moving average terms in the model suggest that the current value is a function of past prediction errors. Thus, the significant $\theta_2$ indicates that the UV prediction error in the second previous month significantly helps explain the UV in the current month, and the significant $\Theta_2$ indicates that the UV prediction error of the same month two years earlier helps explain the UV in the current month. Also, the significant coefficient of X21 and the negative value of its estimate corroborate the evidence of lower useful water volume levels in the Caconde Reservoir in the year 2021. Fig 4 presents the observed and predicted values obtained considering the MKSARMAX model, displaying close predicted and observed values, corroborating the goodness-of-fit test results.

Fig 4. Fitted MKSARMAX and 8 steps traditional and rolling window forecasts for the Caconde UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g004

The prediction and forecast performance of the proposed MKSARMAX model is compared with the additive Holt-Winters method, from the HoltWinters function of the stats base package [42], and with the SARMA, SARMAX, and KARMA models. We fitted the SARMAX and the SARMA models with the same autoregressive and moving-average terms as MKSARMAX. For the KARMA model, exogenous regressors were added to model the seasonality deterministically, since the KARMA model does not include stochastic seasonal terms. The Diebold-Mariano test, from the dm.test function of the forecast package [50], considering the absolute value loss function, indicates that the MKSARMAX model produces better traditional forecasts than the Holt-Winters (p-value = 0.0068) and KARMA (p-value = 0.0009) models. The in-sample and out-of-sample accuracy of the proposed model and its competitors are displayed in Table 4 and Table 5, respectively. The models were evaluated with the following accuracy measures: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean directional accuracy (MDA). Lower values of MAE, RMSE, and MAPE indicate better performance. Additionally, the higher the MDA value, the better the prediction follows the directional movement (upward or downward) of the time series.
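The four accuracy measures named above can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' implementation. Two points are assumptions: MAPE is expressed in percent, and MDA compares the sign of the observed change with the sign of the predicted change relative to the previous observation (other MDA conventions exist).

```python
import math


def mae(y, f):
    """Mean absolute error between observations y and predictions f."""
    return sum(abs(a - b) for a, b in zip(y, f)) / len(y)


def rmse(y, f):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, f)) / len(y))


def mape(y, f):
    """Mean absolute percentage error, in percent (assumes no zero in y)."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, f)) / len(y)


def _sign(d):
    return (d > 0) - (d < 0)


def mda(y, f):
    """Share of steps where the predicted direction matches the observed one.

    Assumed convention: both directions are measured from y[t-1].
    """
    hits = sum(
        _sign(y[t] - y[t - 1]) == _sign(f[t] - y[t - 1]) for t in range(1, len(y))
    )
    return hits / (len(y) - 1)
```

Because UV takes values strictly inside the unit interval, the division in `mape` is well defined for these datasets.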

Table 4. In-sample prediction accuracy measures MAE, RMSE, MAPE, and MDA for MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models fit of observations from to 102 of the Caconde UV dataset. The best value for each measure is shaded light gray.

https://doi.org/10.1371/journal.pone.0324721.t004

Table 5. Out-of-sample traditional and rolling window forecasts accuracy measures MAE, RMSE, MAPE, and MDA for MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models forecast of the Caconde UV dataset. The best value for each measure is shaded light gray.

https://doi.org/10.1371/journal.pone.0324721.t005

The MKSARMAX model outperforms the competing models in the MAE and MAPE accuracy measures, except in the 8-step forecasts, and it follows the directional movement of the time series in-sample and in all out-of-sample steps, with the corresponding MDA values reported in Tables 4 and 5. Specifically, in-sample, the MAE and MAPE values of the fitted MKSARMAX model are lower than those of the KARMA model (Table 4). Although all models predicted the correct direction for the first-step forecast, MKSARMAX presented lower MAE, RMSE, and MAPE than the competing models (Table 5). The MAE, RMSE, and MAPE values of the fitted MKSARMAX model are also lower than those of SARMA for both the cumulative 8-step traditional forecast and the rolling window forecast.

For a qualitative analysis, Fig 5 displays the observed and forecasted values for the 8 out-of-sample steps. The traditional and rolling window forecasts of MKSARMAX, SARMA, and SARMAX are very close for this dataset. Although the Holt-Winters and KARMA models improved in the rolling window forecast compared to the traditional forecast, the MKSARMAX model still presented the best values in most accuracy measures (Table 5). Notably, although SARMAX and additive Holt-Winters presented the second- and third-best values, respectively, in most of the forecast accuracy measures, these models predicted values above 1, exceeding the range of the variable, which emphasizes the importance of judicious model selection for reliable modeling of seasonal double-bounded time series. Like the SARMA and KARMA models, the MKSARMAX model accounts for the bounded nature of the response variable. Therefore, its superior performance relative to these models can be attributed to the greater flexibility of the MK distribution; to the inclusion of stochastic seasonality components, an advantage over the KARMA model; to the incorporation of exogenous regressors, which are not present in the SARMA model; or to a combination of these factors.
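The difference between the two forecasting schemes compared above can be sketched with a toy one-step predictor. The sketch assumes one common reading of the terms: the traditional forecast iterates the model on its own predictions over the whole horizon, while the rolling window forecast produces one-step-ahead predictions and takes in each new observation as it arrives, without re-estimating the parameters. The AR(1) predictor and its coefficients are hypothetical placeholders, not the MKSARMAX model.

```python
def ar1_step(prev, c=0.1, phi=0.8):
    """Toy one-step predictor (hypothetical coefficients)."""
    return c + phi * prev


def traditional_forecast(last_obs, horizon, step=ar1_step):
    """Multi-step forecast: each prediction feeds the next one."""
    out, prev = [], last_obs
    for _ in range(horizon):
        prev = step(prev)
        out.append(prev)
    return out


def rolling_forecast(last_obs, actuals, step=ar1_step):
    """One-step-ahead forecasts, updating with each observed value."""
    out, prev = [], last_obs
    for obs in actuals:
        out.append(step(prev))
        prev = obs  # the observed value replaces the prediction
    return out
```

Under this reading, the two schemes agree at the first step and can diverge afterwards, which is consistent with comparing them step by step over the 8-step horizon.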

Fig 5. Eight steps out-of-sample traditional and rolling window forecasts comparison of MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models for the Caconde UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g005

5.3 Application to Guarapiranga useful water volume dataset

The second dataset used to evaluate the proposed model is the monthly useful water volume of the Guarapiranga Reservoir, situated on the border between Itapecerica da Serra and Embu-Guaçu, SP, Brazil. The Guarapiranga Reservoir is used as a source of public water supply. The dataset comprises Guarapiranga UV data collected monthly from January 2012 to June 2024, with the last 8 observations reserved to assess the forecasting performance of the model, so that n = 138 and H = 8. The dataset is shown in Fig 6. The Guarapiranga UV data range from 0.3980 to 0.9400, with an unconditional median of 0.7760, an average of 0.7484, and a standard deviation of 0.1199. Fig 6 suggests the presence of seasonality with an annual frequency, a conclusion supported by the Friedman test, which indicates evidence of a seasonal time series.
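A minimal sketch of how the Friedman test [46] can be applied to monthly seasonality is given below. It assumes the usual arrangement for this purpose: each complete year is a block, the 12 months are the treatments, and observations are ranked within each year. The function names are hypothetical, incomplete trailing years are dropped, and no tie correction is applied. Under the null hypothesis of no seasonality, the statistic is approximately chi-squared with 11 degrees of freedom.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_statistic(series, period=12):
    """Friedman statistic with years as blocks and months as treatments."""
    n_blocks = len(series) // period  # incomplete trailing year is dropped
    rank_sums = [0.0] * period
    for b in range(n_blocks):
        block = series[b * period:(b + 1) * period]
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    k = period
    scale = 12.0 / (n_blocks * k * (k + 1))
    return scale * sum(r * r for r in rank_sums) - 3.0 * n_blocks * (k + 1)
```

A series whose within-year ordering of months is identical every year attains the maximum value of the statistic, while a series with no within-year pattern gives values near the chi-squared expectation.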

Fig 6. Time series, seasonal plot, ACF, and PACF for the Guarapiranga UV dataset from to 138.

https://doi.org/10.1371/journal.pone.0324721.g006

The best model up to order 4, as determined by the methodology outlined in the previous application, was an MKSARMAX model with the first and second lags of the seasonal moving-average term not considered and with X14 and X21 as exogenous regressors that shift the level of the time series during the Brazilian water crises [48,51], from January 2014 to December 2014 and from January 2021 to December 2021, respectively, both caused by drought periods. These droughts led to low useful water volume in the reservoirs and compromised the population’s water supply. The fitted MKSARMAX model and the results of the diagnostic tests are presented in Table 6, and the tests indicate an appropriate fit of the MKSARMAX model to the Guarapiranga UV data. The ACF and PACF of the residuals, presented in Fig 7, show most autocorrelations and partial autocorrelations within the confidence interval, consistent with not rejecting the null hypothesis of no autocorrelation. The significant first-order autoregressive coefficient indicates that UV in the previous month helps explain UV in the current month, and the significant first-order seasonal autoregressive coefficient indicates that UV in the same month of the previous year helps explain UV in the current month. The significant first- and second-order moving-average coefficients indicate that the UV prediction errors in the two previous months help explain UV in the current month, and the significant third-order seasonal moving-average coefficient indicates that the UV prediction error of the same month three years earlier helps explain UV in the current month. Also, the useful water volume levels in the Guarapiranga Reservoir in 2014 and 2021 are significantly lower than in other periods, as the estimated coefficients of X14 and X21 are negative and significantly different from zero. Fig 8 illustrates the observed, fitted, and forecast values of the MKSARMAX model, showing that the model captures the variability present in the observed data.
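The level-shift regressors X14 and X21 described above can be encoded as indicator variables over the monthly time index. The sketch below assumes they are 0/1 dummies equal to 1 in every month of the given year, which matches the January-to-December periods stated in the text; the helper names are hypothetical and are not taken from the authors' code.

```python
from datetime import date


def month_range(start, end):
    """First day of every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def year_dummy(months, year):
    """Indicator regressor: 1 for months in the given year, 0 otherwise."""
    return [1 if d.year == year else 0 for d in months]


# Monthly index covering the period reported for the Guarapiranga data.
months = month_range(date(2012, 1, 1), date(2024, 6, 1))
x14 = year_dummy(months, 2014)  # drought period of 2014
x21 = year_dummy(months, 2021)  # drought period of 2021
```

A negative estimated coefficient on such a regressor lowers the modeled level of the series only during the flagged months, which is how the text interprets the estimates for 2014 and 2021.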

Fig 7. ACF and PACF of MKSARMAX quantile residuals for the Guarapiranga UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g007

Fig 8. Fitted MKSARMAX and 8 steps forecast for the Guarapiranga UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g008

Table 6. Fitted MKSARMAX for the Guarapiranga UV data. CMLE coefficients, confidence intervals, statistic and p-value of Wald test, and MBIC are presented as well as the statistics and p-value of Ljung-Box, Jarque-Bera, and ARCH tests.

https://doi.org/10.1371/journal.pone.0324721.t006

The comparison between the MKSARMAX model and its competitors is presented in Tables 7 and 8. We fitted the SARMAX and the SARMA models with the same autoregressive and moving-average terms as MKSARMAX, and for the KARMA model, an exogenous regressor was added to model the seasonality. Considering the absolute value loss function, the Diebold-Mariano test indicates that the MKSARMAX model produces better traditional forecasts than the Holt-Winters model (p-value = 0.0721, significant at the 10% level). The MKSARMAX model outperforms the SARMA, SARMAX, KARMA, and Holt-Winters models in most of the accuracy measures, the exceptions being the MAE, RMSE, and MAPE values in the in-sample fit, from 6 to 8 steps in the traditional forecast, and from 5 to 8 steps in the rolling window forecast. Specifically, for the in-sample prediction, the MKSARMAX model has a higher MDA and lower MAE, RMSE, and MAPE than the additive Holt-Winters method (Table 7). For the first-step forecast, MKSARMAX presented lower MAE, RMSE, and MAPE values than the SARMA, SARMAX, additive Holt-Winters, and KARMA models (Table 8). For the cumulative 8-step forecast, the MAE, RMSE, and MAPE values of the MKSARMAX model are lower than those of the additive Holt-Winters method in both the traditional and the rolling window forecasts. Fig 9 illustrates the out-of-sample 8-step forecasts alongside the observed values, corroborating the evidence from Table 8 that the competing models obtain better rolling window forecasts than traditional forecasts, but not enough to surpass the MKSARMAX performance in the initial steps. We can attribute the better performance of the MKSARMAX model in the initial steps, especially relative to the SARMAX model, to the flexibility of the MK distribution.
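To make the Diebold-Mariano comparison above concrete, the sketch below computes a simplified version of the statistic under the absolute value loss. It is not the dm.test implementation: it assumes one-step-ahead forecast errors (so no autocovariance terms enter the variance), applies no small-sample correction, and uses a normal reference distribution. The function names are hypothetical.

```python
import math


def dm_statistic(errors_a, errors_b):
    """Simplified Diebold-Mariano statistic with absolute value loss.

    Negative values favor model A (smaller average absolute error).
    """
    d = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n
    gamma0 = sum((v - d_bar) ** 2 for v in d) / n  # variance of the loss differential
    return d_bar / math.sqrt(gamma0 / n)


def normal_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def dm_pvalue_less(errors_a, errors_b):
    """One-sided p-value for the alternative that model A is more accurate."""
    return normal_cdf(dm_statistic(errors_a, errors_b))
```

Here a small p-value from `dm_pvalue_less` would support the claim that the first model forecasts better than the second under absolute loss.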

Fig 9. Eight steps out-of-sample traditional and rolling window forecasts comparison of MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models for the Guarapiranga UV dataset.

https://doi.org/10.1371/journal.pone.0324721.g009

Table 7. In-sample prediction accuracy measures MAE, RMSE, MAPE, and MDA for MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models fit of observations from to 138 of the Guarapiranga UV dataset. The best value for each measure is shaded light gray.

https://doi.org/10.1371/journal.pone.0324721.t007

Table 8. Out-of-sample traditional and rolling window forecasts accuracy measures MAE, RMSE, MAPE, and MDA for MKSARMAX, SARMA, SARMAX, additive Holt-Winters, and KARMA models forecast of the Guarapiranga UV dataset. The best value for each measure is shaded light gray.

https://doi.org/10.1371/journal.pone.0324721.t008

In summary, the MKSARMAX model was a reliable tool for predicting and forecasting both UV datasets, in line with the favorable outcomes of the Monte Carlo simulations. The numerical assessments indicate that the MKSARMAX model is competitive for modeling and forecasting double-bounded hydro-environmental time series: it accommodates stochastic seasonality and exogenous regressors, and it produces predictions within the double-bounded support of the variable of interest. The improvements that the MKSARMAX model brings to modeling and forecasting hydro-environmental time series are crucial in water resource management. The applications showed that the model forecasted the useful water volume in the reservoirs in the following months more accurately, which can lead to more effective strategies to prevent risks to the population's water supply.

6 Conclusion

To fill a gap in the literature on stochastic time series models, in which the traditional Gaussian-based ARMA model is the most popular, we proposed a model based on the MK distribution. Unlike the symmetric Gaussian distribution, which has unbounded support over the real line, the MK distribution accounts for asymmetry and has double-bounded support. The MK distribution was chosen due to its superior performance compared to the beta and Kumaraswamy distributions in fitting the useful water volume of 37 reservoirs in [2]. Specifically, the MK and its reflected distribution provided the best fit for a larger share of the reservoirs than the Kumaraswamy and beta distributions, each of which performed best for only a small share. The MK distribution is derived from a transformation of the Kumaraswamy distribution, and the KARMA model, which accommodates serial autocorrelation in the conditional median of Kumaraswamy-distributed time series, outperformed the ARMA model in fitting a relative humidity dataset [9]. Together, these results give a strong indication that an ARMA model based on the MK distribution could be competitive with existing models. Furthermore, the proposed model incorporates both stochastic seasonality and exogenous regressors, distinguishing it from previously introduced models such as SARMA and KARMA, and offering a broader set of features for time series modeling.

In this paper, we introduced the MKSARMAX model, designed for fitting and forecasting double-bounded hydro-environmental time series characterized by stochastic seasonal dynamics. Several hydro-environmental variables are asymmetric and have bounded support, and are therefore more adequately handled by models that account for these characteristics. Any improvement in forecasting these variables is valuable because it translates into better water resource management. An inference approach, out-of-sample forecasting, diagnostic checks, and the observed information matrix were developed for the proposed model. We conducted extensive Monte Carlo simulations, comprising 5000 replications of synthetic hydro-environmental time series, to assess the performance of the conditional likelihood inferences; the results indicate the consistency of the conditional maximum likelihood estimators even in moderate sample sizes.

Additionally, we conducted two experiments with measured hydro-environmental time series to further validate our model. Monthly UV series from two Brazilian reservoirs, different from those analyzed in [2], were fitted and forecasted by MKSARMAX and competitor models. Overall, in terms of prediction and forecast accuracy, the proposed model outperformed competing models such as SARMAX, Holt-Winters, and even other models that consider double-bounded data, such as the KARMA and SARMA models, similar to what was observed in [2]. For instance, in the Caconde UV application, the 1-step forecast of the MKSARMAX model improved on the MAE, RMSE, and MAPE values of the SARMA and KARMA models, while in the Guarapiranga UV application, it exhibited better MAE, RMSE, and MAPE values than the additive Holt-Winters method. Accurate UV forecasts for the following months can assist in monitoring the risk of floods and droughts, so that managers can take actions to attenuate the impact on water supply and power generation. These findings corroborate the evidence in [2] that the MK is a good distribution for fitting hydro-environmental variables.

Despite its advantages, the MKSARMAX model presents certain limitations. First, the process of identifying an optimal specification among the various possible model configurations is computationally intensive. Second, the model assumes that issues such as missing data and outliers are addressed prior to estimation, requiring user intervention. In light of these limitations, future research could focus on developing a robust version of the model capable of handling outliers more effectively. Additionally, the integration of machine learning-based hybrid approaches may offer further improvements in model performance. Another promising direction is the incorporation of prediction intervals, which would complement the point forecasts currently provided by the model and offer a more comprehensive assessment of forecast uncertainty.

In conclusion, our study highlights the effectiveness of the proposed MKSARMAX model as a valuable tool for fitting and forecasting double-bounded hydro-environmental data. Its application to the useful water volume datasets indicates that the MKSARMAX model can offer substantial contributions to the field of hydro-environmental stochastic modeling.

Supporting information

References

  1. Box G, Jenkins G, Reinsel G, Ljung G. Time series analysis: forecasting and control. 5th edn. Hoboken, New Jersey: Wiley. 2015.
  2. Sagrillo M, Guerra RR, Bayer FM. Modified Kumaraswamy distributions for double bounded hydro-environmental data. J Hydrol. 2021;603:127021.
  3. Rocha AV, Cribari-Neto F. Beta autoregressive moving average models. TEST. 2008;18(3):529–45.
  4. Ferrari S, Cribari-Neto F. Beta regression for modelling rates and proportions. J Appl Statist. 2004;31(7):799–815.
  5. Benjamin MA, Rigby RA, Stasinopoulos DM. Generalized autoregressive moving average models. J Am Statist Assoc. 2003;98(461):214–23.
  6. Zeger SL, Qaqish B. Markov regression models for time series: a quasi-likelihood approach. Biometrics. 1988;44(4):1019–31. pmid:3148334
  7. Li WK. Time series models based on generalized linear models: some further results. Biometrics. 1994;50(2):506–11. pmid:8068850
  8. McCullagh P, Nelder J. Generalized linear models, 2nd edn. Chapman and Hall. 1989.
  9. Bayer FM, Bayer DM, Pumi G. Kumaraswamy autoregressive moving average models for double bounded environmental data. J Hydrol. 2017;555:385–96.
  10. Bayer FM, Rosa CM, Cribari-Neto F. A novel data-driven dynamic model for inflated doubly-bounded hydro-environmental time series. Appl Math Model. 2025;137:115680.
  11. Cribari-Neto F, Scher VT, Bayer FM. Beta autoregressive moving average model selection with application to modeling and forecasting stored hydroelectric energy. Int J Forecast. 2023;39(1):98–109.
  12. Kumar B, Yadav N. A novel hybrid model combining $\beta$SARMA and LSTM for time series forecasting. Appl Soft Comput. 2023;134:110019.
  13. Scher VT, Cribari-Neto F, Bayer FM. Generalized $\beta$ARMA model for double bounded time series forecasting. Int J Forecast. 2024;40(2):721–34.
  14. Bayer FM, Pumi G, Pereira TL, Souza TC. Inflated beta autoregressive moving average models. Comp Appl Math. 2023;42(4).
  15. Santos KH, Cribari-Neto F. A varying precision beta prime autoregressive moving average model with application to water flow data. Environmetrics. 2024;35(8).
  16. Park J-S, Seo S-C, Kim TY. A kappa distribution with a hydrological application. Stoch Environ Res Risk Assess. 2008;23(5):579–86.
  17. Choi YM, Yang YJ, Kwon SH. Validity of ocean wave spectrum using Rayleigh probability density function. Int J Ocean Syst Eng. 2012;2(4):250–8.
  18. Stefanan AA, Palm BG, Bayer FM. Zero-inflated Rayleigh dynamic model for non-negative signals. IEEE Access. 2024;12:187099–111.
  19. Amponsah W, Dallan E, Nikolopoulos EI, Marra F. Climatic and altitudinal controls on rainfall extremes and their temporal changes in data-sparse tropical regions. J Hydrol. 2022;612:128090.
  20. Matti B, Dahlke HE, Lyon SW. On the variability of cold region flooding. J Hydrol. 2016;534:669–79.
  21. Bayer FM, Cintra RJ, Cribari-Neto F. Beta seasonal autoregressive moving average models. J Statist Comput Simulat. 2018;88(15):2961–81.
  22. Pumi G, Valk M, Bisognin C, Bayer FM, Prass TS. Beta autoregressive fractionally integrated moving average models. J Statist Plan Inference. 2019;200:196–212.
  23. Lewis-Beck M. Applied regression: an introduction. 40th edn. Newbury Park, CA: Sage Publications. 2008.
  24. Melo M da S, Alencar AP. Conway–Maxwell–Poisson seasonal autoregressive moving average model. J Statist Comput Simulat. 2021;92(2):283–99.
  25. Banaś J, Utnik-Banaś K. Evaluating a seasonal autoregressive moving average model with an exogenous variable for short-term timber price forecasting. Forest Policy Econ. 2021;131:102564.
  26. Manigandan P, Alam MS, Alharthi M, Khan U, Alagirisamy K, Pachiyappan D, et al. Forecasting natural gas production and consumption in united states-evidence from SARIMA and SARIMAX models. Energies. 2021;14(19):6021.
  27. Melchior C, Zanini RR, Guerra RR, Rockenbach DA. Forecasting Brazilian mortality rates due to occupational accidents using autoregressive moving average approaches. Int J Forecast. 2021;37(2):825–37.
  28. Shad M, Sharma YD, Narula P. Forecasting Southwest Indian monsoon rainfall using the beta seasonal autoregressive moving average ($\beta$SARMA) model. Pure Appl Geophys. 2023;180(1):405–19.
  29. Shad M, Sharma YD, Narula P. Wind speed prediction using non-gaussian model based on Kumaraswamy distribution. Energy Sources Part A: Recov Utiliz Environ Effects. 2023;46(1):719–35.
  30. Serrano ALM, Rodrigues GAP, Martins PH dos S, Saiki GM, Filho GPR, Gonçalves VP, et al. Statistical comparison of time series models for forecasting brazilian monthly energy demand using economic, industrial, and climatic exogenous variables. Appl Sci. 2024;14(13):5846.
  31. Nash J. Compact numerical methods for computers: linear algebra and function minimisation. 2nd edn. Bristol, New York: Adam Hilger; 1990.
  32. Xiang Y, Gubian S, Suomela B, Hoeng J. Generalized simulated annealing for global optimization: the GenSA package. R J. 2013;5(1):13–28.
  33. Palm BG, Bayer FM, Cintra RJ. Signal detection and inference based on the beta binomial autoregressive moving average model. Digit Signal Process. 2021;109:102911.
  34. Wald A. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans Amer Math Soc. 1943;54(3):426–82.
  35. Pawitan Y. In all likelihood: statistical modelling and inference using likelihood. Oxford University Press. 2001.
  36. Dunn PK, Smyth GK. Randomized quantile residuals. J Comput Graph Statist. 1996;5(3):236–44.
  37. Ljung GM, Box GEP. On a measure of lack of fit in time series models. Biometrika. 1978;65(2):297–303.
  38. Jarque CM, Bera AK. A test for normality of observations and regression residuals. Int Statist Rev/Revue Internationale de Statistique. 1987;55(2):163.
  39. Engle RF. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica. 1982;50(4):987.
  40. Akaike H. On entropy maximization principle. In: Krishnaiah P, editor. Applications of Statistics: Proceedings of the Symposium Held at Wright State University, Dayton, Ohio, 14-18 June 1976. North Holland Publishing Company. p. 27–41.
  41. Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6(2).
  42. R Core Team. R: language and environment for statistical computing; 2023. https://www.R-project.org/
  43. Stefanan AA, Palm BG, Bayer FM. MKSARMAX model to fit unit time series and selection model algorithm; 2024. https://github.com/alinestefanan/MKSARMAX.git
  44. ONS. Dados hidrológicos/volumes do Operador Nacional do Sistema Elétrico; 2024. https://www.ons.org.br/Paginas/resultados-da-operacao/historico-da-operacao/dados hidrologicos volumes.aspx
  45. ONS. Conhecimento/Glossário do Operador Nacional do Sistema Elétrico. 2023. https://www.ons.org.br/paginas/conhecimento/glossario
  46. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Statist Assoc. 1937;32(200):675–701.
  47. Tsay R. Analysis of financial time series. 2nd edn. Hoboken, New Jersey: Wiley. 2005.
  48. NASA. Earth observatory: Brazil battered by drought. 2021. https://earthobservatory.nasa.gov/images/148468/brazil-battered-by-drought
  49. ANA. Hidrelétrica Caconde (SP) reduzirá liberação mínima de água até o fim de 2021; 2021. https://www.gov.br/ana/pt-br/assuntos/noticias-e-eventos/noticias/hidreletrica-caconde-sp-reduzira-liberacao-minima-de-agua-ate-o-fim-de-2021
  50. Hyndman RJ, Khandakar Y. Automatic time series forecasting: the forecast package for R. J Stat Soft. 2008;27(3):1–22.
  51. Millington N. Producing water scarcity in São Paulo, Brazil: The 2014-2015 water crisis and the binding politics of infrastructure. Politic Geograph. 2018;65:26–34.