A new accuracy measure based on bounded relative error for time series forecasting

Many accuracy measures have been proposed in the past for time series forecasting comparisons. However, many of these measures suffer from one or more issues such as poor resistance to outliers and scale dependence. In this paper, while summarising commonly used accuracy measures, a special review is made on the symmetric mean absolute percentage error. Moreover, a new accuracy measure called the Unscaled Mean Bounded Relative Absolute Error (UMBRAE), which combines the best features of various alternative measures, is proposed to address the common issues of existing measures. A comparative evaluation on the proposed and related measures has been made with both synthetic and real-world data. The results indicate that the proposed measure, with user selectable benchmark, performs as well as or better than other measures on selected criteria. Though it has been commonly accepted that there is no single best accuracy measure, we suggest that UMBRAE could be a good choice to evaluate forecasting methods, especially for cases where measures based on geometric mean of relative errors, such as the geometric mean relative absolute error, are preferred.


Introduction
Forecasting has always been an attractive research area since it plays an important role in daily life. As one of the most popular research domains, time series forecasting has received particular attention from researchers [1][2][3][4][5]. Many comparative studies have been conducted with the aim of identifying the most accurate methods for time series forecasting [6]. However, research findings indicate that the performance of forecasting methods varies according to the accuracy measure being used [7]. Over the past decades, various accuracy measures have been proposed as the best to use. However, many of these measures are not generally applicable because they can be infinite or undefined under certain circumstances, which may produce misleading results. The criteria required for accuracy measures have been explicitly addressed by Armstrong and Collopy [6] and further discussed by Fildes [8] and Clements and Hendry [9]. As discussed, a good accuracy measure should provide an informative and clear summary of the error distribution. The criteria should also include reliability, construct validity, computational [...]
In this paper, a new accuracy measure is proposed to address the issues mentioned above. Specifically, by introducing a newly defined bounded relative absolute error, the new measure addresses the asymmetry of sMAPE while maintaining its other properties, such as scale-independence and outlier resistance. Further, we believe that the new measure is more interpretable than sMAPE, because it is based on relative errors with a selectable benchmark, whereas sMAPE uses percentage errors based on the observed values. Given that [6] claimed that measures based on relative errors are the most reliable, we believe our measure is reliable in this sense.

Review of accuracy measures
Many accuracy measures have been proposed to evaluate the performance of forecasting methods during the past couple of decades. A table of the most commonly used measures was given in the review of 25 years of time series forecasting [1]. There was also a thorough review of accuracy measures by Hyndman and Koehler [18]. In this section, we mainly focus on new insights or new measures that have been introduced since 2006.
For a time series with $n$ observations, let $Y_t$ denote the observation at time $t$ and $F_t$ denote the forecast of $Y_t$. Then the forecasting error $e_t$ can be defined as $e_t = Y_t - F_t$. Let $e_t^*$ denote the forecasting error at time $t$ obtained by some benchmark method. That is, $e_t^* = Y_t - F_t^*$, where $F_t^*$ is the forecast at time $t$ by the benchmark method.

Scale-dependent measures
The measures based on absolute or squared errors are also known as scale-dependent measures since their scale depends on the scale of the data. They are useful in comparing forecasting methods on the same set of data. However, they should not be used across data sets that are on different scales. The most commonly used scale-dependent measures are the Mean Absolute Error (MAE), the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |e_t| \tag{1}$$
$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} e_t^2 \tag{2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} e_t^2} \tag{3}$$
MAE was cited in the very early forecasting literature as a primary measure of performance for forecasting models [26]. As shown in Eq 1, MAE directly calculates the arithmetic mean of absolute errors. Hence, it is very easy to compute and to understand. However, it may produce biased results when extremely large outliers exist in data sets. Specifically, even a single large error can sometimes dominate the result of MAE.
MSE, which calculates the arithmetic mean of squared errors, was used in the first M-Competition [12]. However, its use was later widely criticized as inappropriate [6,27]. MSE is more vulnerable to outliers since it gives extra weight to large errors. Also, the squared errors are on a different scale from the original data. Thus, RMSE, which is the square root of MSE, is often preferred to MSE as it is on the same scale as the data. However, RMSE is also sensitive to forecasting outliers [28].
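The sensitivity described above can be illustrated with a minimal sketch. The data below are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
import math

def mae(actual, forecast):
    """Arithmetic mean of the absolute errors |Y_t - F_t|."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Arithmetic mean of the squared errors (Y_t - F_t)^2."""
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Square root of MSE, which is back on the scale of the data."""
    return math.sqrt(mse(actual, forecast))

# Illustrative values only (an assumption of this sketch).
actual = [100.0, 200.0, 300.0, 400.0, 500.0]
forecast = [110.0, 190.0, 310.0, 390.0, 510.0]  # every error has size 10
with_outlier = forecast[:-1] + [1500.0]         # one error of size 1000

print("clean:  ", mae(actual, forecast), rmse(actual, forecast))
print("outlier:", mae(actual, with_outlier), rmse(actual, with_outlier))
```

A single bad forecast moves both measures by more than an order of magnitude, and RMSE moves further than MAE because the error is squared before averaging.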

Percentage-based measures
To be scale-independent, a common approach is to use percentage errors based on observation values. Two example measures based on percentage errors are the mean absolute percentage error (MAPE) and the symmetric mean absolute percentage error (sMAPE), defined as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{e_t}{Y_t}\right| \times 100\% \tag{4}$$
$$\mathrm{sMAPE} = \frac{1}{n}\sum_{t=1}^{n} \frac{2\,|e_t|}{|Y_t| + |F_t|} \times 100\% \tag{5}$$
It should be noted that absolute values are used in the denominator of sMAPE as defined in this paper. This definition differs from, but is equivalent to, the definition in Makridakis [10] and Makridakis and Hibon [7] when forecasts and actual values are all non-negative. The absolute values in the denominator avoid a negative sMAPE, as pointed out by Hyndman and Koehler [18].
MAPE was used as one of the major accuracy measures in the original M-Competition [12]. However, the percentage errors could be excessively large or undefined when the target time series has values close to or equal to zero [19]. Moreover, Armstrong [20] pointed out that MAPE has a bias favouring estimates that are below the actual values. This was illustrated by extremes: "a forecast of 0 can never be off by more than 100%, but there is no limit to errors on the high side". Makridakis [10] discussed the asymmetric issue of MAPE with another example which involves two forecasts on different actual values. However, we believe that the example by Makridakis goes beyond the idea put forward by Armstrong in 1985. To our understanding, the assumptions behind the asymmetric issue of MAPE described by Armstrong [20] are: (i) the estimates are non-negative while the actual value is positive; (ii) the forecasting range is asymmetric, in that 0 is the lower bound for lower estimates while there is no upper bound for upper estimates; (iii) errors for lower estimates and upper estimates should be symmetric (an extreme case: 0 as the worst lower estimate should have the same absolute error as the worst upper estimate, which is infinite). sMAPE can produce symmetric errors in the asymmetric forecasting range stated in the above assumptions. However, it is more natural to consider the symmetric property in a symmetric forecasting range for lower and upper estimates. Thus, sMAPE was widely criticized as an asymmetric measure [21,22]. Regardless of the asymmetric issue, an advantage of sMAPE is that it does not share MAPE's issue of being excessively large or infinite. Also, because its errors are bounded, sMAPE is more resistant to outliers, since it gives less significance to outliers than measures whose errors have no bounds.
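The bounds and the asymmetry discussed above can be checked numerically. The following is a minimal sketch, assuming the sMAPE form with absolute values in the denominator; the data and function names are invented for illustration.

```python
def mape(actual, forecast):
    """Mean absolute percentage error; raises ZeroDivisionError if any Y_t is 0."""
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric MAPE with absolute values in the denominator (bounded by 200%)."""
    terms = [2.0 * abs(y - f) / (abs(y) + abs(f)) for y, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)

# Illustrative values only (an assumption of this sketch).
actual = [100.0, 200.0, 400.0]
over = [1.1 * y for y in actual]    # 10% over-estimates
under = [0.9 * y for y in actual]   # 10% under-estimates
zeros = [0.0, 0.0, 0.0]             # worst possible lower estimate
huge = [1e9, 1e9, 1e9]              # very large upper estimate

print("MAPE  over/under:", mape(actual, over), mape(actual, under))
print("sMAPE over/under:", smape(actual, over), smape(actual, under))
print("MAPE  zero/huge: ", mape(actual, zeros), mape(actual, huge))
print("sMAPE zero/huge: ", smape(actual, zeros), smape(actual, huge))
```

MAPE treats the two 10% errors identically but is capped at 100% only on the low side, whereas sMAPE stays within 200% on both sides at the cost of penalising the under-estimate more heavily.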

Relative-based measures
Another approach for accuracy measures to be scale-independent is to use relative errors based on the errors produced by a benchmark method (e.g. the naïve method). The most commonly used such measures are the mean relative absolute error (MRAE) and the geometric mean relative absolute error (GMRAE):
$$\mathrm{MRAE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{e_t}{e_t^*}\right| \tag{6}$$
$$\mathrm{GMRAE} = \left(\prod_{t=1}^{n} \left|\frac{e_t}{e_t^*}\right|\right)^{1/n} = \exp\!\left(\frac{1}{n}\sum_{t=1}^{n} \ln\left|\frac{e_t}{e_t^*}\right|\right) \tag{7}$$
MRAE can provide a clearer intuition of the performance improvement compared to the benchmark method. However, MRAE has a similar limitation to MAPE, in that it can also be excessively large or undefined when $e_t^*$ is close to or equal to zero. GMRAE is favoured since it is generally acknowledged that the geometric mean is more appropriate for averaging relative quantities than the arithmetic mean [6,8]. According to the alternative representation of GMRAE shown above in Eq 7, a key step in calculating GMRAE is to take the arithmetic mean of log-scaled error ratios. This makes GMRAE more resistant to outliers than MRAE, which uses the arithmetic mean of the original error ratios. However, GMRAE is still sensitive to outliers. More specifically, GMRAE can be dominated not only by a single large outlier, but also by an extremely small error close to zero. This is because there is neither an upper bound nor a lower bound for the log-scaled error ratios used by GMRAE. It should also be noted that zero errors, both in $e_t$ and $e_t^*$, have to be excluded from the analysis. Thus, GMRAE may not be sufficiently informative.
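To make the two failure modes concrete, here is a minimal sketch of MRAE and GMRAE computed from error series. The error values are invented for illustration, and dropping zero errors inside `gmrae` is this sketch's way of implementing the exclusion mentioned above.

```python
import math

def mrae(errors, bench_errors):
    """Arithmetic mean of |e_t / e*_t|; undefined if any benchmark error is 0."""
    return sum(abs(e / b) for e, b in zip(errors, bench_errors)) / len(errors)

def gmrae(errors, bench_errors):
    """Geometric mean of |e_t / e*_t|, computed via the mean of log ratios.

    Pairs containing a zero error are dropped, since log(0) and division
    by zero are undefined.
    """
    logs = [math.log(abs(e / b))
            for e, b in zip(errors, bench_errors) if e != 0 and b != 0]
    return math.exp(sum(logs) / len(logs))

# Illustrative values only (an assumption of this sketch).
# Case 1: the benchmark is almost exact once, giving one huge ratio.
errors_1, bench_1 = [2.0, 1.0, 4.0, 3.0], [1.0, 2.0, 2.0, 0.001]
# Case 2: the method is almost exact once, giving one tiny ratio.
errors_2, bench_2 = [2.0, 1.0, 4.0, 1e-6], [1.0, 2.0, 2.0, 3.0]

print("case 1:", mrae(errors_1, bench_1), gmrae(errors_1, bench_1))
print("case 2:", mrae(errors_2, bench_2), gmrae(errors_2, bench_2))
```

In the first case MRAE is swamped by one ratio while GMRAE is affected far less; in the second case it is GMRAE that is pulled towards zero by a single near-perfect forecast.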
Rather than use the average of relative errors, one can also use the ratio of average errors obtained by a base measure. For example, when the base measure is RMSE, the relative RMSE (RelRMSE) is defined as:
$$\mathrm{RelRMSE} = \frac{\mathrm{RMSE}}{\mathrm{RMSE}^*}$$
RelRMSE is a commonly used measure proposed by Armstrong and Collopy [6], where $\mathrm{RMSE}^*$ denotes the RMSE produced by a benchmark method. Similar measures, such as RelMAE and RelMAPE, can be easily defined. They are also called relative measures. An advantage of relative measures is their interpretability [18]. However, the performance of relative measures is restricted by the component measure. For example, RelMAPE is undefined whenever MAPE is undefined. Further, RelMAPE can also be easily dominated by extremely large outliers since MAPE is not resistant to outliers. Thus, it makes no sense to compute RelMAPE if MAPE, as the component, is skewed.
Another disadvantage of relative measures is that they are only available when there are several forecasts on the same series [18]. MASE, which is related in spirit to relative measures, does not have this issue. It is defined as:
$$\mathrm{MASE} = \frac{1}{n}\sum_{t=1}^{n} \frac{|e_t|}{\mathrm{MAE}^{**}}$$
In MASE, the absolute error $|e_t|$ for each observation is scaled by the average in-sample error $\mathrm{MAE}^{**}$ produced by a benchmark method (e.g. the one-step naïve method, or the seasonal naïve method for seasonal data). Thus, MASE will not produce infinite or undefined values except in the irrelevant case where all historical data are equal. However, MASE is still vulnerable to outliers [24]. Moreover, it has to be assumed that the period-to-period difference of the time series is stationary, so that the scaling factor is a consistent estimator of the scale of the series.
For comparisons of forecasting methods on multiple time series, MASE is equivalent to the weighted arithmetic mean of relative MAEs [24]:
$$\mathrm{MASE} = \frac{1}{N}\sum_{i=1}^{m} n_i r_i, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^{**}}$$
where $m$ denotes the number of time series, $n_i$ denotes the number of observations for the $i$th time series and $N = \sum_{i=1}^{m} n_i$. As pointed out by Davydenko and Fildes [24], using the arithmetic mean of MAE ratios introduces a bias towards overrating the accuracy of a benchmark method. They proposed the measure AvgRelMAE as an alternative to MASE, based on the geometric mean to average the scaled quantities:
$$\mathrm{AvgRelMAE} = \left(\prod_{i=1}^{m} r_i^{\,n_i}\right)^{1/N}, \qquad r_i = \frac{\mathrm{MAE}_i}{\mathrm{MAE}_i^{*}}$$
It should be noted that AvgRelMAE uses the out-of-sample $\mathrm{MAE}_i^{*}$ as the scaling factor while MASE uses the in-sample $\mathrm{MAE}_i^{**}$. Though AvgRelMAE was shown to have many advantages such as interpretability and robustness [24], it still has the same issue as MASE since both are based on RelMAE. As mentioned above, the accuracy of RelMAE is constrained by the accuracy of MAE. Since MAE can be dominated by extreme outliers, the MAE ratio $r_i$ does not necessarily represent an advisable comparison of forecasting methods based on the errors of the majority of forecasts for the $i$th time series.
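The difference between the two ways of averaging MAE ratios can be sketched as follows. This is a hedged illustration: the weighted geometric mean with weights $n_i$ is an assumption based on the description of AvgRelMAE in [24], the one-step naïve method is assumed as the in-sample benchmark for MASE, and all numbers are invented.

```python
import math

def mase(actual, forecast, in_sample):
    """Out-of-sample MAE divided by the in-sample MAE of the one-step naive method."""
    scale = sum(abs(b - a) for a, b in zip(in_sample, in_sample[1:])) / (len(in_sample) - 1)
    out_mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return out_mae / scale

def weighted_mean_rel_mae(mae_method, mae_bench, n_obs):
    """Weighted arithmetic mean of the ratios r_i = MAE_i / MAE_i(benchmark)."""
    total = sum(n_obs)
    return sum(n * m / b for m, b, n in zip(mae_method, mae_bench, n_obs)) / total

def avg_rel_mae(mae_method, mae_bench, n_obs):
    """Weighted geometric mean of the same ratios, via the mean of their logs."""
    total = sum(n_obs)
    log_sum = sum(n * math.log(m / b) for m, b, n in zip(mae_method, mae_bench, n_obs))
    return math.exp(log_sum / total)

# Illustrative values only (an assumption of this sketch).
in_sample = [10.0, 12.0, 11.0, 13.0]
print("MASE:", mase([14.0, 12.0], [13.0, 14.0], in_sample))

# Two series: the method halves the benchmark MAE on one and doubles it on the other.
method, bench, n_obs = [1.0, 4.0], [2.0, 2.0], [3, 3]
print("arithmetic:", weighted_mean_rel_mae(method, bench, n_obs))
print("geometric: ", avg_rel_mae(method, bench, n_obs))
```

With one ratio of 0.5 and one of 2, the arithmetic mean reports the method as worse than the benchmark, while the geometric mean reports a tie, which is the bias that motivates AvgRelMAE.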

A new accuracy measure
The criteria for a useful accuracy measure have been explicitly addressed in the literature [6,8,9,11]. As reviewed in the previous section, many measures have been proposed with various advantages and disadvantages. However, most of these measures suffer from one or more issues. In this section, we propose a new accuracy measure which adopts the advantages of other measures such as sMAPE and MRAE without having their common issues. Specifically, the proposed measure is expected to have the following properties: (i) Informative: it can provide an informative result without the need to trim errors; (ii) Resistant to outliers: it can hardly be dominated by a single forecasting outlier; (iii) Symmetric: over-estimates and under-estimates are treated fairly; (iv) Scale-independent: it can be applied to data sets on different scales; (v) Interpretable: it is easy to understand and can provide intuitive results.
It has been mentioned above in the review that sMAPE is resistant to outliers because its errors are bounded. We propose a new measure in a similar fashion to sMAPE but without its issues. Since relative errors are more general than percentage errors in providing intuitive results, we use the Relative Absolute Error (RAE), $|e_t| / |e_t^*|$, as the base from which to derive our new measure.
Since RAE has no upper bound, it can be excessively large or undefined when $|e_t^*|$ is small or equal to zero. This issue can be easily addressed by adding $|e_t|$ to the denominator of RAE, which introduces a bounded RAE (BRAE):
$$\mathrm{BRAE}_t = \frac{|e_t|}{|e_t| + |e_t^*|}$$
In BRAE, the added $|e_t|$ ensures that the denominator is never less than the numerator. This means BRAE has a maximum of 1, while its minimum of 0 is attained when $|e_t|$ is equal to zero. Due to the upper bound of BRAE, an accuracy measure based on BRAE will be more resistant to forecasting outliers. Note that the asymmetric issue of sMAPE has also been addressed in BRAE by adding $|e_t|$ rather than $|F_t|$ to the denominator. Also, a measure based on BRAE is more appropriate than sMAPE for intermittent demand data, which have many zero-valued observations. To avoid the issue of being undefined, BRAE is defined to be 0.5 for the special case when $|e_t|$ and $|e_t^*|$ are both equal to zero. In practice, the one-step naïve method is a commonly used benchmark, where $F_t^* = Y_{t-1}$ and hence $e_t^* = Y_t - Y_{t-1}$. However, it should be noted that the naïve method is not necessarily an effective benchmark. For example, when most forecasting methods generally produce much smaller errors than the naïve method, BRAE will have the same issue as the percentage-error-based measures discussed above. Thus, it is preferable to use a properly competitive method as a benchmark, such that a value of around 0.5 is obtained by BRAE.
Based on BRAE, a measure called the Mean Bounded Relative Absolute Error (MBRAE) can be defined as:
$$\mathrm{MBRAE} = \frac{1}{n}\sum_{t=1}^{n} \mathrm{BRAE}_t$$
Though MBRAE is adequate to compare forecasting methods, it is a scaled error that cannot be directly interpreted as a normal error ratio reflecting the error size. In fact, the process of calculating GMRAE also contains a mean of log-scaled error ratios, which is not easily interpretable. In GMRAE this issue is addressed by converting the log-scaled error back to a normal ratio with the exponential function. Similarly, a transformation can be applied to MBRAE to obtain a more interpretable measure, which is termed the unscaled MBRAE (UMBRAE):
$$\mathrm{UMBRAE} = \frac{\mathrm{MBRAE}}{1 - \mathrm{MBRAE}}$$
With UMBRAE, the performance of a proposed forecasting method can be easily interpreted, in terms of the average relative absolute error based on BRAE, as follows: when UMBRAE is equal to 1, the proposed method performs roughly the same as the benchmark method; when UMBRAE < 1, the proposed method performs roughly (1 − UMBRAE) × 100% better than the benchmark method; when UMBRAE > 1, the proposed method is roughly (UMBRAE − 1) × 100% worse than the benchmark method.
In general, UMBRAE is informative without the need to trim extreme errors. At the same time, being based on bounded errors, UMBRAE is resistant to outliers. It is also symmetric and scale-independent, since it is built from ratios of errors. The benchmark used by UMBRAE is selectable, and the naïve method can be easily applied. A competitive benchmark is preferable to obtain more intuitive results. To the best of our knowledge, UMBRAE has not been proposed before. We suggest it as a generally applicable accuracy measure for time series forecasting. UMBRAE would be particularly useful for cases where the measured performance of forecasting methods is not expected to be dominated by forecasting outliers.
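The unscaling step can be motivated by a short derivation. This is a sketch rather than a reproduction of the source; it assumes $|e_t^*| > 0$, writes $\mathrm{RAE}_t = |e_t| / |e_t^*|$, and takes the transformation to be the inverse of the bounding map applied to the mean.

```latex
\mathrm{BRAE}_t
  = \frac{|e_t|}{|e_t| + |e_t^*|}
  = \frac{\mathrm{RAE}_t}{1 + \mathrm{RAE}_t}
\quad\Longrightarrow\quad
\mathrm{RAE}_t = \frac{\mathrm{BRAE}_t}{1 - \mathrm{BRAE}_t},
\qquad
\mathrm{UMBRAE} = \frac{\mathrm{MBRAE}}{1 - \mathrm{MBRAE}} .
```

Under this reading, a BRAE of 0.5 corresponds to a relative absolute error of exactly 1, which is why an MBRAE of 0.5 maps to a UMBRAE of 1.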
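The definitions above translate into a few lines of code. The following is a minimal sketch, assuming the forms BRAE = |e| / (|e| + |e*|) and UMBRAE = MBRAE / (1 − MBRAE); the function names, the example errors and the handling of MBRAE = 1 are assumptions of this sketch.

```python
import math

def brae(e, e_star):
    """Bounded relative absolute error for one observation, in [0, 1]."""
    if e == 0 and e_star == 0:
        return 0.5  # special case: both errors are zero
    return abs(e) / (abs(e) + abs(e_star))

def mbrae(errors, bench_errors):
    """Arithmetic mean of the bounded relative absolute errors."""
    values = [brae(e, b) for e, b in zip(errors, bench_errors)]
    return sum(values) / len(values)

def umbrae(errors, bench_errors):
    """Unscaled MBRAE, read as an average relative absolute error."""
    m = mbrae(errors, bench_errors)
    # Assumption of this sketch: MBRAE == 1 (the benchmark is always exact
    # and the method never is) is mapped to infinity instead of dividing by 0.
    return math.inf if m == 1.0 else m / (1.0 - m)

# Illustrative errors only (an assumption of this sketch).
errors = [5.0, -3.0, 0.0, 2.0]
bench_errors = [10.0, -4.0, 0.0, 1.0]
score = umbrae(errors, bench_errors)
print("UMBRAE:", score)
print("roughly %.0f%% better than the benchmark" % ((1.0 - score) * 100.0))
```

Because each term is bounded by 1, replacing any single error with an arbitrarily large value can change MBRAE by at most 1/n.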

Evaluation and results
In this section, the performance of UMBRAE is evaluated. The naïve method is used as the benchmark for UMBRAE. Properties such as reliability and sensitivity have been well investigated in the study by Armstrong and Collopy [6]. In their study, MAPE and MRAE have been assessed to be acceptable in terms of reliability and good in terms of sensitivity. In fact, these properties, especially reliability, cannot be easily examined. For example, in the reliability tests, if forecasting methods are expected to have the same rankings when they are evaluated by a reliable accuracy measure, these forecasting methods themselves have to perform stably on different time series. It is difficult to find such forecasting methods in the real world. Thus, these properties are not examined in our study. Instead, it is assumed that UMBRAE, based on relative errors, will also be reliable and sensitive to error changes. Consequently, our evaluation will be mainly focused on the expected properties mentioned in the previous Section. To make comparisons, other common measures mentioned in the review Section are also examined in our evaluation. Comparisons are firstly made with synthetic time series to specifically examine the required properties. Then the M3-Competition data with 3003 time series [7] are used to demonstrate how these measures perform with real-world data.

Evaluation with synthetic data
Three groups of synthetic time series data are used in the comparative study. These synthetic data are not designed to be representative of real-world data. Rather, they are selected to clearly show the drawbacks of accuracy measures in terms of the required properties. In the synthetic evaluations, the average one-step naïve error is used to scale errors for MASE.
One of the most desired properties of an accuracy measure is the ability to resist outliers. Thus, the first group of synthetic data is made to examine whether the accuracy measure is resistant to a single forecasting outlier. As shown in Fig 1, $Y_t$ is the target time series with 10 observations, which are randomly generated from a normal distribution (mean = 300, sd = 100). $F_t^n$ denotes the $n$th forecasting series of $Y_t$. Specifically, $F_t^1$ does not have an obvious forecasting outlier and its forecasting errors measured by MAPE are approximately 10%. The other three forecasts are the same as $F_t^1$ except that they all have a forecasting outlier for the eighth observation. Though occasional large errors should also be considered in evaluating the performance of a forecasting method, it is assumed that a single large outlier should not affect the overall performance significantly. However, the results in Fig 1 show that the errors reported by some accuracy measures are dominated by the single forecasting outlier. The worst is RMSE, whose error for $F_t^4$ is approximately 36 times larger than its error for $F_t^1$. Though MASE has been scaled from MAE, it in fact performs the same as MAE in dealing with the forecasting outlier. The errors given by MAE and MASE for $F_t^4$ have both been distorted to be about 15 times larger than for $F_t^1$. In contrast, sMAPE, GMRAE and UMBRAE are less sensitive to this single forecasting outlier. UMBRAE reports the smallest differences across the four forecasts.
The second group of time series data is created to evaluate whether over-estimates and under-estimates are treated 'fairly' by the accuracy measures. As presented in Fig 2, $Y_t$ is the same time series as that used in the single forecasting outlier resistance evaluation. In this scenario, $F_t^1$ makes a 10% over-estimate error on all observations in $Y_t$ while $F_t^2$ makes a 10% under-estimate. The results in Fig 2 show that all the accuracy measures except sMAPE give the same error for $F_t^1$ and $F_t^2$. sMAPE produces a larger error for $F_t^2$, which indicates that it puts a heavier penalty on under-estimates than on over-estimates.
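A test in the spirit of the first synthetic group can be sketched as follows. The series below is hand-picked rather than the randomly generated series of Fig 1, so the ratios it prints are not those reported in the text; the one-step naïve benchmark and the extra leading observation used to start it are assumptions of this sketch.

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def umbrae(errors, bench_errors):
    """UMBRAE = MBRAE / (1 - MBRAE), with BRAE = |e| / (|e| + |e*|)."""
    braes = [0.5 if e == 0 and b == 0 else abs(e) / (abs(e) + abs(b))
             for e, b in zip(errors, bench_errors)]
    m = sum(braes) / len(braes)
    return m / (1.0 - m)

# Hypothetical series: one historical value followed by 10 observations.
series = [310.0, 280.0, 350.0, 300.0, 260.0, 330.0, 290.0, 340.0, 270.0, 320.0, 300.0]
actual = series[1:]
naive = series[:-1]                                   # one-step naive forecasts
clean = [308.0, 315.0, 330.0, 234.0, 363.0, 261.0, 374.0, 243.0, 352.0, 270.0]
outlier = clean[:7] + [2700.0] + clean[8:]            # outlier at the 8th point

e_star = [y - f for y, f in zip(actual, naive)]
e_clean = [y - f for y, f in zip(actual, clean)]
e_out = [y - f for y, f in zip(actual, outlier)]

rmse_ratio = rmse(e_out) / rmse(e_clean)
mae_ratio = mae(e_out) / mae(e_clean)
umbrae_ratio = umbrae(e_out, e_star) / umbrae(e_clean, e_star)
print("RMSE grows by a factor of  ", rmse_ratio)
print("MAE grows by a factor of   ", mae_ratio)
print("UMBRAE grows by a factor of", umbrae_ratio)
```

The single outlier inflates RMSE and MAE many times over, while UMBRAE changes only moderately because the affected term cannot exceed 1.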
Davydenko and Fildes [24] suggested another scenario to examine the property of symmetry for measures. In this scenario, the reward given for improving the benchmark is expected to balance the penalty given for reducing the benchmark by the same quantity. We also use this to examine our measure UMBRAE. Suppose that a time series has only two observations (y) and there is one forecasting method to be compared with another benchmark method. For the benchmark method, it makes the forecasts f with errors (y−f) of 1 and 2 respectively. In contrast, the forecasting method produces errors of 2 and 1 respectively. As an expected result, the forecasting method has an error of 1 measured by UMBRAE based on the benchmark method. Thus, UMBRAE is also symmetric for this case.
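The two-observation scenario can be reproduced directly. This sketch assumes BRAE = |e| / (|e| + |e*|) and UMBRAE = MBRAE / (1 − MBRAE), and adds MRAE for contrast; the function names are hypothetical.

```python
def umbrae(errors, bench_errors):
    """UMBRAE = MBRAE / (1 - MBRAE), with BRAE = |e| / (|e| + |e*|)."""
    braes = [0.5 if e == 0 and b == 0 else abs(e) / (abs(e) + abs(b))
             for e, b in zip(errors, bench_errors)]
    m = sum(braes) / len(braes)
    return m / (1.0 - m)

def mrae(errors, bench_errors):
    """Arithmetic mean of the raw error ratios |e / e*|."""
    return sum(abs(e / b) for e, b in zip(errors, bench_errors)) / len(errors)

bench = [1.0, 2.0]    # benchmark errors from the scenario
method = [2.0, 1.0]   # errors of the method being assessed

print("UMBRAE, method vs benchmark:", umbrae(method, bench))
print("UMBRAE, benchmark vs method:", umbrae(bench, method))
print("MRAE,   method vs benchmark:", mrae(method, bench))
print("MRAE,   benchmark vs method:", mrae(bench, method))
```

UMBRAE reports a tie in both directions, whereas the arithmetic mean of raw ratios makes each method look 25% worse than the other.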
Normally, the scale-dependent issue of accuracy measures is related to their capability of evaluating forecasting performance across data series on different scales. Accuracy measures based on percentages or relative ratios are clearly suited to such evaluations, and no synthetic data are made for this. However, the scale-dependent issue also exists within a data series. Thus, the third group of synthetic data, shown in Fig 3, is made to examine how accuracy measures deal with data on different scales within a single time series. In this data set, $Y_t$ is a time series generated by the Fibonacci sequence from 2 to 144. As the forecasts of $Y_t$, all values of $F_t^1$ are set to have a 20% over-estimate error relative to the corresponding observation of $Y_t$. In contrast, $F_t^2$ has the same mean absolute error as $F_t^1$ but its errors are on different percentage scales from 1440% to 0.2%. Specifically, $F_n^2$ has the same absolute error as $F_{11-n}^1$. For instance, $F_1^2$ has the same absolute error as $F_{10}^1$, which is 28.8. As presented in Fig 3, MAE, RMSE, MASE and even GMRAE do not show any difference between the two forecasts. MRAE and MAPE, however, produce substantially different results for the two cases. The errors measured by them for $F_t^2$ are approximately ten times larger than for $F_t^1$. In contrast, UMBRAE and sMAPE give a moderate difference between the two forecasts.
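The construction of the third group can be sketched as follows. The sign of the errors in the second forecast is not stated above, so treating them as over-estimates is an assumption of this sketch; only measures that need no benchmark are computed.

```python
def mae(actual, forecast):
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / len(actual)

# Fibonacci values from 2 to 144 (10 observations).
y = [2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0, 89.0, 144.0]

# F1: a 20% over-estimate of every observation.
f1 = [1.2 * v for v in y]
abs_err_f1 = [abs(v - f) for v, f in zip(y, f1)]

# F2: the same absolute errors in reverse order, so that the n-th error of F2
# equals the (11 - n)-th error of F1. Over-estimation is assumed here.
abs_err_f2 = list(reversed(abs_err_f1))
f2 = [v + e for v, e in zip(y, abs_err_f2)]

first_pct_f2 = 100.0 * abs(y[0] - f2[0]) / y[0]
print("MAE :", mae(y, f1), mae(y, f2))
print("MAPE:", mape(y, f1), mape(y, f2))
print("largest percentage error in F2:", first_pct_f2)
```

MAE cannot tell the two forecasts apart, while MAPE differs by roughly an order of magnitude because the largest absolute error now falls on the smallest observation.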

Evaluation with the M3-Competition data
The M-Competitions are well-known empirical studies which employ various real-world time series data in comparing the performance of forecasting methods. In this study, we use the M3-Competition [7] data, which contain 3003 time series, to evaluate our proposed measure. The forecasting data are available in the R package 'Mcomp', maintained by Hyndman and available from his website: http://robjhyndman.com/software/mcomp/. Among the 24 forecasting methods in the M3-Competition, 22 are used in our evaluation since their forecasts are available for all 3003 time series. Since the one-step naïve method is used by many accuracy measures as the benchmark, it is also listed in the results as a forecasting method. AvgRelMAE, an alternative to MASE which uses the geometric mean to average errors across time series, is also included in this evaluation. To simplify the results, errors are only measured at the first six forecasting horizons across the 3003 time series, which are available from all of the 22 forecasting methods.
The results are listed in Table 1. It can be seen that the errors given by MAE and RMSE are relatively large numbers which are meaningless without comparisons. UMBRAE is able to give interpretable results, where a forecasting method with an error < 1 can be considered to be better than the benchmark method in terms of the average relative absolute error based on BRAE. As shown in the results, the naïve method, which is the benchmark used by UMBRAE, has an error of 1. Errors of the other forecasting methods measured by UMBRAE are all less than 1. This indicates that these forecasting methods are better than the naïve method. However, MRAE gives the opposite result, in which the naïve method is ranked as the best. It should also be noted that all the errors measured by AvgRelMAE, excluding that for the naïve method, are smaller than 1, whereas all the errors measured by MASE are much larger than 1. The rank correlation coefficients between the different measures are shown in Table 2. The correlation between RMSE, or MRAE, and the other measures is extremely low. In contrast, UMBRAE shows substantially high agreement with most of the other measures, with an average Spearman rank correlation of 0.516. In particular, UMBRAE has remarkably high correlations with GMRAE and AvgRelMAE, which are 0.995 and 0.990 respectively.
To eliminate the influence of outliers and extreme errors, we also use trimmed means to evaluate the accuracy measures. A 3% trimming level is used in our study. As shown in Table 3, most errors measured by MAE, RMSE, MASE, MRAE and MAPE differ significantly from those without trimming shown in Table 1. The rankings of forecasting methods made by these measures also change significantly. In contrast, the errors and rankings given by the other measures change less. In particular, the value of UMBRAE is quite invariant to trimming, with differences appearing only after the third decimal place for most of the forecasting methods. It can also be noticed that the rankings made by UMBRAE in Table 3 remain the same as those in Table 1. The rank correlations between UMBRAE and the other measures after trimming are also much higher on average, as shown in Table 4.
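For readers unfamiliar with the statistic behind Table 2, the Spearman rank correlation between two rankings without ties can be computed as sketched below. The rankings are hypothetical and do not come from Table 1; whether the original study used this closed form or a tie-corrected variant is not stated.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between the two ranks of each item.
    """
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical rankings of five forecasting methods under three measures.
ranks_measure_x = [1, 2, 3, 4, 5]
ranks_measure_y = [1, 3, 2, 4, 5]   # two neighbouring methods swapped
ranks_measure_z = [5, 4, 3, 2, 1]   # completely reversed ordering

print(spearman_from_ranks(ranks_measure_x, ranks_measure_y))
print(spearman_from_ranks(ranks_measure_x, ranks_measure_z))
```

A value near 1 means two measures rank the methods almost identically, and a value near -1 means they rank them in opposite order.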
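A trimmed mean can be sketched as follows. The text does not say whether the 3% level is removed from each tail or in total, so trimming that proportion from each tail is an assumption of this sketch, as are the data.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping a proportion of the smallest and of the largest values.

    Assumption of this sketch: `proportion` is applied to each tail separately
    and the number of dropped values per tail is rounded down.
    """
    data = sorted(values)
    k = int(len(data) * proportion)
    kept = data[k:len(data) - k] if k > 0 else data
    return sum(kept) / len(kept)

# Illustrative errors only: 99 ordinary errors and one extreme error.
errors = [1.0] * 99 + [1000.0]

plain = sum(errors) / len(errors)
trimmed = trimmed_mean(errors, 0.03)
print("plain mean:  ", plain)
print("trimmed mean:", trimmed)
```

A measure whose value barely changes between the plain and the trimmed mean, as reported above for UMBRAE, does not depend on trimming to control the effect of extreme errors.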

To show the error distributions in a similar manner to that in [24], we use the errors produced by the forecasting method ForecastPro as an example. Figs 4 to 11 show the distributions of the eight underlying error measurements used in the nine accuracy measures mentioned in this paper. In each figure, the top plot shows the kernel density estimate of the errors, illustrating their distribution, while the bottom plot is a box-and-whisker plot which highlights the outliers more clearly. From these figures, it can be seen that the error measurements used in UMBRAE are more evenly distributed, with fewer outliers than in the other measures.
Fig 1 shows that MRAE and MAPE can be easily dominated by a single forecasting outlier. This is because they are based on the arithmetic mean and there is no upper bound defined for a single error. In practice, poor resistance to forecasting outliers may produce misleading results. This can be illustrated by our evaluation on the M3-Competition data. As shown in Table 1, MRAE gives significantly different rankings from the other measures. It suggests that the naïve method performs the best, while almost all the other accuracy measures indicate that the naïve method is the worst. By examining the forecasting data, we find that the results measured by MRAE are seriously distorted by extremely large relative absolute errors where the naïve errors are small. With the geometric mean, GMRAE shows remarkable resistance to forecasting outliers. However, one disadvantage of measures based on the geometric mean is that zero-error forecasts have to be excluded. Thus, these measures may not be sufficiently informative. In contrast, because its errors are bounded, UMBRAE performs as well as GMRAE in resisting forecasting outliers. In fact, the errors and rankings given by UMBRAE are highly correlated with those given by GMRAE, especially in Tables 3 and 4 where extreme errors are trimmed. Thus, for the cases where measures such as GMRAE are preferred, UMBRAE could be an alternative measure since it is much easier to use without the need to trim errors.

Discussion
It can also be noticed in Figs 4 to 11 that all the accuracy measures except AvgRelMAE (see Fig 7), GMRAE (see Fig 8) and UMBRAE (see Fig 11) have highly skewed distributions with long tails including extremely large forecasting outliers. Although undefined and zero errors (0.5%) have been trimmed, GMRAE still contains about 10.2% forecasting outliers, including some large log-transformed errors such as -10.76 and 8.08. Although the bounded errors used by sMAPE (see Fig 10) and UMBRAE also contain some outliers, there are no extremely large errors. Specifically, UMBRAE follows a symmetric distribution and produces only about 3% outliers, which will not affect the result significantly. It has to be noted that UMBRAE does not necessarily always provide the same information as GMRAE. For example, given a time series with a million observations, if the forecasting method and the benchmark method produce errors ($y - f$) $e$ and $e^*$ which both follow the standard normal distribution, UMBRAE and GMRAE will both be approximately 1. However, if the forecasting method produces errors of $2e$, the value of GMRAE will be approximately 2, as one may expect. But UMBRAE will give an error of approximately 1.67, which is less than 2. This is because the bounded error $\frac{|e|}{|e| + |e^*|}$ used by UMBRAE does not increase much when the error $e$ is doubled in cases where $|e|$ is much larger than $|e^*|$. In other words, a forecast that is twice as bad is not given twice the error by UMBRAE when the forecast is already much worse than most other forecasts. In fact, this is the key strategy of UMBRAE for resisting outliers. Also, the above expectation of an error of 2 is based on estimation by the 'relative average error'. However, it is arguable that the 'average relative error' is not necessarily the same as the 'relative average error'. This is reflected to some extent by the synthetic test shown in Fig 3. More discussion of this is given later in this section in terms of scale-independence. We believe that the above issue does not invalidate the use of UMBRAE in practice.
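The thought experiment above can be approximated by simulation. This sketch uses 100,000 draws rather than a million to keep it quick, a fixed seed for repeatability, and the forms BRAE = |e| / (|e| + |e*|) and UMBRAE = MBRAE / (1 − MBRAE); all of these are assumptions of the sketch.

```python
import math
import random

def umbrae(errors, bench_errors):
    """UMBRAE = MBRAE / (1 - MBRAE), with BRAE = |e| / (|e| + |e*|)."""
    braes = [0.5 if e == 0 and b == 0 else abs(e) / (abs(e) + abs(b))
             for e, b in zip(errors, bench_errors)]
    m = sum(braes) / len(braes)
    return m / (1.0 - m)

def gmrae(errors, bench_errors):
    """Geometric mean of |e / e*|, skipping pairs that contain a zero error."""
    logs = [math.log(abs(e / b))
            for e, b in zip(errors, bench_errors) if e != 0 and b != 0]
    return math.exp(sum(logs) / len(logs))

random.seed(0)
n = 100_000
e = [random.gauss(0.0, 1.0) for _ in range(n)]        # method errors
e_star = [random.gauss(0.0, 1.0) for _ in range(n)]   # benchmark errors
doubled = [2.0 * v for v in e]                        # method errors doubled

u1, g1 = umbrae(e, e_star), gmrae(e, e_star)
u2, g2 = umbrae(doubled, e_star), gmrae(doubled, e_star)
print("same error scale: UMBRAE", u1, "GMRAE", g1)
print("doubled errors:   UMBRAE", u2, "GMRAE", g2)
```

Doubling every error doubles GMRAE exactly, since the factor of 2 passes through the logarithm, while UMBRAE rises to a value clearly below 2, in line with the figure of approximately 1.67 quoted above.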
One of the common concerns about an accuracy measure is whether it is symmetric. Two different cases were used to evaluate the property of symmetry for accuracy measures. In our point of view, the first case is about the symmetry in the absolute quantity which concerns whether the same over-estimates and under-estimates can be treated fairly by a measure. As shown in Fig 2, only sMAPE is not symmetric in the absolute quantity (due to the asymmetric bounded errors used). This issue has been addressed by UMBRAE with symmetric bounded N . Normally, a measure which uses the arithmetic mean should not be symmetric in such relative quantity. However, UMBRAE, which uses the arithmetic mean for part of its calculations, has shown a symmetric result. This is because UMBRAE does not work directly on the original error ratios. The original relative errors have been converted to bounded relative errors for UMBRAE before calculating the arithmetic mean. In fact, this is quite similar to the process of calculating GMRAE which is based on the geometric mean. As a result, it is not an issue for UMBRAE to use the arithmetic mean. Figs 8 and 11 show that both errors used by GMRAE and UMBRAE follow a symmetric distribution.  It is necessary (or, at least, highly desirable) for an accuracy measure to be scale-independent when assessing forecasting methods across data on different scales. Normally, measures based on percentages or ratios in the same range are considered to be scale-independent. However, we argue that it is not enough for these percentages or ratios to be in the same range. To be truly scale-independent, these error percentages or ratios should also be closely related to the scale of data for specific observations. Otherwise, they may lead to misleading results. For example, in Table 1, the error of MASE for the naïve method is 2.134. 
This is a somewhat confusing result which may be intuitively interpreted as indicating that the naïve method performs worse than the naïve method itself! In fact, it means that the naïve method gives larger errors on average for the forecasting data than for the in-sample data, since MASE scales the out-of-sample error by the in-sample error of the naïve method. In contrast, AvgRelMAE does not have this issue since it uses the average out-of-sample error as the scaling factor.
3 shows that MASE fails to distinguish between the two forecasts, which are clearly different considering the error percentages at different observations. This is because every single error used by MASE at different observations is scaled by the same scaling factor. GMRAE also fails in this evaluation. We notice that this is because GMRAE, in fact, has the same issue as MASE. Every single error of GMRAE can also be considered to be a scaled error based on a consistent scaling factor $\mathrm{GMAE}^*$, which is the geometric mean of the benchmark errors $e^*$. According to the above, we conclude that MASE, AvgRelMAE and GMRAE are relatively scale-independent because they assume that the scaling factor is a consistent estimator. In contrast, UMBRAE is scale-independent and is closely related to the error ratios at individual observations. Thus, it can reasonably show the difference between the two forecasts with respect to error percentages.
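The contrast between a single scaling factor and per-observation ratios can be sketched in a few lines of Python. This is an illustration only: the numbers are made up, the helper names are hypothetical, and UMBRAE is assumed to follow the definition used in this paper (bounded error |e| / (|e| + |e*|), arithmetic mean, then unscaling as MBRAE / (1 - MBRAE)).

```python
def mean(values):
    return sum(values) / len(values)


def mase(abs_errors, in_sample):
    """MASE: out-of-sample MAE scaled by the in-sample MAE of the one-step naive forecast."""
    scale = mean([abs(curr - prev) for prev, curr in zip(in_sample, in_sample[1:])])
    return mean(abs_errors) / scale


def umbrae(abs_errors, abs_benchmark_errors):
    """UMBRAE from absolute errors (assumes no observation has both errors equal to zero)."""
    bounded = [e / (e + b) for e, b in zip(abs_errors, abs_benchmark_errors)]
    mbrae = mean(bounded)
    return mbrae / (1.0 - mbrae)


# Hypothetical data: two observations with very different benchmark errors.
in_sample = [10.0, 12.0, 11.0, 13.0]
benchmark_errors = [1.0, 10.0]

# Two forecasts with the same absolute errors placed at different observations.
errors_a = [1.0, 6.0]
errors_b = [6.0, 1.0]

mase_a, mase_b = mase(errors_a, in_sample), mase(errors_b, in_sample)
umbrae_a = umbrae(errors_a, benchmark_errors)
umbrae_b = umbrae(errors_b, benchmark_errors)

print(mase_a, mase_b)      # identical: one scaling factor for every observation
print(umbrae_a, umbrae_b)  # different: each error is compared with e* at the same point
```

In this sketch both forecasts share the same MAE and therefore the same MASE, whereas UMBRAE differs (7/9 against 73/81) because each error is set against the benchmark error at the same observation.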
Another important property of an accuracy measure is its interpretability. As Table 1 shows, the numerical errors measured by MAE and RMSE have little intuitive meaning without comparisons, and have therefore been scored as 'fair'. Comparatively, measures which produce errors as percentages or ratios based on a benchmark are more interpretable. The benchmark used by an accuracy measure is also important for its interpretability. In Table 1, the errors measured by MAPE are all small, around 10%. However, these small errors are less meaningful without comparisons, because the percentages are based on the original values of the observations. Thus, they do not necessarily indicate good performance. In contrast, errors measured by UMBRAE are more interpretable. An error of 0.77 indicates that the forecasting method performs approximately 23% better than the benchmark method.
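The interpretation above can be made concrete with a minimal Python sketch. It assumes the UMBRAE definition used in this paper (bounded relative absolute error |e| / (|e| + |e*|), averaged and then unscaled as MBRAE / (1 - MBRAE)). The function names, the sample data and the value 0.5 used when both errors are zero are assumptions made for illustration.

```python
def umbrae(actual, forecast, benchmark):
    """Sketch of UMBRAE for one forecast method against one benchmark method."""
    if not actual or not (len(actual) == len(forecast) == len(benchmark)):
        raise ValueError("inputs must be non-empty and of equal length")
    bounded = []
    for y, f, f_star in zip(actual, forecast, benchmark):
        e, e_star = abs(y - f), abs(y - f_star)
        if e + e_star == 0:
            bounded.append(0.5)  # assumption: both methods exact, treated as a tie
        else:
            bounded.append(e / (e + e_star))
    mbrae = sum(bounded) / len(bounded)
    # MBRAE equals 1 only if the benchmark is exact everywhere and the forecast is not.
    return float("inf") if mbrae == 1.0 else mbrae / (1.0 - mbrae)


def improvement_over_benchmark(score):
    """Percentage by which the method beats (positive) or trails (negative) the benchmark."""
    return (1.0 - score) * 100.0


# Hypothetical out-of-sample data with one-step naive forecasts as the benchmark.
actual = [102.0, 104.0, 103.0, 107.0]
naive = [100.0, 102.0, 104.0, 103.0]
forecast = [101.0, 104.5, 103.5, 106.0]

score = umbrae(actual, forecast, naive)
print(f"UMBRAE = {score:.3f}, improvement = {improvement_over_benchmark(score):.1f}%")
```

A score of 1 means the method and the benchmark perform equally, values below 1 favour the method, and values above 1 favour the benchmark.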
As shown in Table 5, the accuracy measures are rated against the key criteria considered in this paper. Measures are considered to be less informative if undefined or zero errors have to be excluded. The property of symmetry is rated in both absolute quantity and relative quantity, as discussed above. Measures are rated as relatively scale-independent when they assume that the scaling factor is a consistent estimator. Relative-based accuracy measures are considered to be more interpretable than other measures since they can provide more intuitive results in terms of performance without extra comparisons. sMAPE is rated as poor in interpretability since its error, which has a range of (0,200), is not as easy to understand as that of MAPE.
In summary, we show that UMBRAE (i) is informative and uses all available errors; (ii) can perform as well as GMRAE in resisting forecasting outliers without the need to trim zero-error forecasts; (iii) is symmetric in both absolute quantity and relative quantity; (iv) is scale-independent; (v) is interpretable and can provide intuitive results. As such, UMBRAE combines the best features of various alternative measures into a single new measure. Thus, we believe UMBRAE is an interesting new measure because it is simple, flexible, easy to use and understand, and resistant to outliers. Also, the forecasting benchmark for calculating UMBRAE is selectable, and the ideal choice is a forecasting method that is expected to be outperformed. As a well-known benchmark, the naïve method can easily be applied as a default to show whether a forecasting method is generally good or not.
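One way to see why the arithmetic mean of bounded errors can still be symmetric in the relative quantity, as claimed in point (iii), is to compare a relative error r = |e| / |e*| with its reciprocal. The sketch below is illustrative only: the pair of ratios is hypothetical, and it relies on the identity |e| / (|e| + |e*|) = r / (1 + r), which holds whenever the benchmark error is non-zero.

```python
import math


def bounded(ratio):
    """Map a relative absolute error r = |e| / |e*| to r / (1 + r), which lies in [0, 1)."""
    return ratio / (1.0 + ratio)


# Hypothetical pair: the forecast is 4 times worse than the benchmark at one
# observation and 4 times better at another.
ratios = [4.0, 0.25]

arithmetic_mean = sum(ratios) / len(ratios)
geometric_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

mbrae = sum(bounded(r) for r in ratios) / len(ratios)
unscaled = mbrae / (1.0 - mbrae)

print(arithmetic_mean)  # above 1: the raw arithmetic mean favours the benchmark
print(geometric_mean)   # 1 up to rounding: the geometric mean treats the pair as a tie
print(unscaled)         # 1 up to rounding: bounded errors r/(1+r) and 1/(1+r) sum to 1
```

Because bounded(r) + bounded(1/r) = 1 for any positive r, reciprocal ratios cancel around 0.5 in the same way that log r and log(1/r) cancel around 0 for the geometric mean.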

Conclusion
We have proposed a new accuracy measure, UMBRAE, based on bounded relative errors. As discussed in the review of sMAPE, one advantage of the bounded error is that it gives less weight to outliers, since it cannot become excessively large or infinite. The proposed measure has been evaluated alongside related measures on both synthetic and real-world data. We have shown that UMBRAE combines the best features of various alternative measures without having their common drawbacks. UMBRAE, with a selectable benchmark, can provide an informative and interpretable result based on bounded relative error. It is less sensitive to forecasting outliers than other measures. It is also symmetric and scale-independent. Though it has been commonly accepted that there cannot be any single best accuracy measure, we suggest that UMBRAE is a good choice for general use when evaluating the performance of forecasting methods. Since UMBRAE, in our study, performs similarly to GMRAE without the need to trim zero-error forecasts, we particularly recommend UMBRAE as an alternative measure for cases where GMRAE is preferred.
Although we have shown that UMBRAE has many advantages as described above, its statistical properties have not been well studied. For example, it is unclear how UMBRAE reflects the properties of the error distributions. Moreover, one possible drawback of UMBRAE is that the bounded error it uses reaches its maximum value of 1.0 whenever the benchmark error ($Y_t - F_t^*$) is equal to zero, even if the forecast is good. This may produce a biased estimate, especially when the benchmark method produces a large number of zero errors. Although this drawback may not be relevant for the majority of real-world data, we would like to address this issue in the future.
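The effect of a zero benchmark error can be illustrated with a small Python sketch. The error values are hypothetical, the helper name is an assumption, and the value 0.5 used when both errors are zero is likewise assumed rather than taken from this section.

```python
def umbrae_from_errors(abs_errors, abs_benchmark_errors):
    """UMBRAE computed directly from absolute forecast and benchmark errors."""
    bounded = []
    for e, e_star in zip(abs_errors, abs_benchmark_errors):
        # Assumption: 0.5 when both errors are zero; otherwise |e| / (|e| + |e*|).
        bounded.append(0.5 if e + e_star == 0 else e / (e + e_star))
    mbrae = sum(bounded) / len(bounded)
    return float("inf") if mbrae == 1.0 else mbrae / (1.0 - mbrae)


# A consistently good forecast: small absolute error at every observation.
forecast_errors = [0.1, 0.1, 0.1, 0.1]

# The benchmark is wrong by 2.0 everywhere ...
without_zero = umbrae_from_errors(forecast_errors, [2.0, 2.0, 2.0, 2.0])
# ... or happens to be exactly right at the first observation.
with_zero = umbrae_from_errors(forecast_errors, [0.0, 2.0, 2.0, 2.0])

print(without_zero)  # about 0.05
print(with_zero)     # about 0.40: one bounded error of 1.0 dominates the mean
```

In this example a single zero benchmark error raises UMBRAE from about 0.05 to about 0.40, although the forecast error itself is unchanged.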