Equity premium forecasting with reliability-screened forward-looking signals

  • Jeonggyu Huh,

    Roles Conceptualization, Funding acquisition, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Mathematics, Sungkyunkwan University, Suwon, Republic of Korea

  • Jaegi Jeon,

    Roles Conceptualization, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Graduate School of Data Science, Chonnam National University, Gwangju, Republic of Korea

  • Seungwon Jeong

    Roles Data curation, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

    junior1492@jnu.ac.kr

    Affiliation Global-Learning & Academic research institution for Master’s· PhD students, Chonnam National University, Gwangju, Republic of Korea

Abstract

Forecasting the equity risk premium is challenging because predictive relationships are unstable out of sample. We propose a two-stage framework that generates forward-looking signals from standard macro-financial predictors and admits them only when they satisfy predictor-level reliability criteria. In Stage 1, each predictor is forecast one step ahead to obtain an expected movement and an uncertainty proxy. In Stage 2, lagged predictors are augmented with the admitted signals and mapped into next-period excess returns using random forests, optionally combined with SHAP-guided screening and dimension reduction. Using monthly U.S. data from 1952 to 2024, we evaluate both benchmark-relative out-of-sample accuracy and the economic value of forecasts in a constrained mean-variance allocation. Selective admission improves out-of-sample accuracy and, in selected specifications, yields economically meaningful gains in risk-adjusted performance and drawdown control. Tail-conditional diagnostics show that these gains are concentrated disproportionately in downside market states, helping explain when statistical improvements translate into economic value. The main qualitative patterns remain robust across alternative learners, across the S&P 500 and the CRSP value-weighted market index, and under net of transaction cost evaluation and alternative portfolio volatility checks. Taken together, the findings suggest that forward-looking predictor information is most useful when admitted selectively on the basis of predictor-level reliability, with its practical value lying less in universal forecast improvements than in more robust and downside-sensitive equity premium forecasting.

1 Introduction

Whether the equity risk premium can be forecast in real time remains a central question in empirical finance and a practical concern for portfolio allocation. A long tradition documents in-sample links between macro-financial predictors and subsequent market returns, yet robust out-of-sample improvements over simple benchmarks are notoriously difficult to obtain and often unstable across samples and evaluation designs [1,2]. At the same time, even modest forecasting gains can be economically meaningful for mean-variance investors because portfolio weights scale with the predicted premium relative to risk [3]. This tension—statistical fragility versus potential economic value—motivates renewed attention not only to forecasting accuracy itself, but also to how predictive information should be engineered and disciplined under realistic real-time data constraints. More fundamentally, equity premium forecasting is a demanding prediction problem: Realized returns are noisy, the conditioning set can be high dimensional, predictive relationships may shift across regimes, and the relevant interactions among predictors may be nonlinear. Recent work increasingly approaches such environments using advanced machine learning and time-series econometric tools designed to handle instability, nonlinearities, and high-dimensional inputs.

A key inspiration for our approach comes from the factor timing literature, which emphasizes that expected returns vary over time and that exploiting this variation can matter for investors [4–8]. In practice, however, strategies that attempt to forecast the equity premium directly face a difficult signal-extraction problem. Rather than applying a highly flexible end-to-end learner directly to noisy realized returns, we take a modular detour: we first forecast the predictor panel itself (variables observed at the forecast origin that often evolve more smoothly than realized returns) and then use those forecasts to construct forward-looking inputs for equity premium prediction. The key intuition is that the current level of a predictor need not fully summarize the state relevant for next-period premia. Two episodes with similar observed predictor levels can imply different expected returns if the predictor is expected to deteriorate further, stabilize, or reverse, and if that transition is more or less uncertain. In this sense, robust equity premium forecasting may require not only past predictor levels, but also disciplined forward-looking information about where those predictors are likely to move and how reliable those implied movements are.

This predict-the-predictors perspective connects to a complementary strand that constructs forward-looking state variables by embedding expectations about future cash flows or risks into signals observed today. Prominent examples include implied expected return measures inferred from market prices and cash-flow forecasts [9,10], as well as option implied measures that extract forward-looking information about tail risk and the risk neutral distribution [11,12]. Our notion of forward-looking differs in both object and construction. We do not rely on derivatives or survey data, nor do we propose a new implied premium series. Instead, we generate forward-looking signals internally from the standard macro-financial predictor panel and ask whether they can improve real-time equity premium forecasting when they are admitted only after clearing a predictor-level out-of-sample reliability screen.

In this study, we propose a two-stage framework that converts the standard macro-financial predictor panel into forward-looking features for equity premium forecasting. In Stage 1, each predictor is forecast one step ahead to form a forward-looking pair capturing its expected movement and a proxy for forecast uncertainty. These generated signals are admitted into the combined feature set only when they clear a predictor-level out-of-sample reliability threshold. In Stage 2, we augment lagged predictor levels with the admitted forward-looking pairs, optionally apply Shapley-value (SHAP) importance-based pre-screening for parsimony, reduce dimension via PCA or PLS, and learn a nonlinear forecasting rule for the next-period equity premium using a random forest [13] as the baseline learner. We implement the pipeline under real-time availability constraints and apply it to monthly U.S. data spanning 1952–2024 using a standard predictor set from the seminal works of [1] and [2], with the S&P 500 as the baseline market and, as a cross-index robustness check, the CRSP value-weighted index evaluated under the same training, validation, and test windows as the baseline S&P 500 analysis. To check that the conclusions are not tied to a single Stage 2 learner, we also report supplementary robustness exercises using XGBoost and LightGBM.

We evaluate statistical predictability using benchmark-relative out-of-sample R2 and tail-conditional counterparts that distinguish downside from upside accuracy, and we complement these with formal forecast-comparison tests against the historical mean benchmark and against matched past-only designs. Economic value is assessed by mapping forecasts into a simple constrained mean-variance allocation following [3] and by examining the resulting portfolio performance net of transaction costs. Empirically, the gains from forward-looking augmentation are selective rather than uniform: Support is more evident in benchmark-relative comparisons than in stricter head-to-head tests between otherwise matched specifications, where one model uses only lagged predictor-levels and the other augments those same predictors with the generated forward-looking signals. The most economically relevant improvements arise when the forward-looking signals are admitted selectively rather than mechanically. These gains are especially visible in PLS-family representations, which more often concentrate predictive improvements in downside states and help explain why investment performance can improve even when average-fit gains remain modest. A sigma-ablation exercise further shows that the Stage 1 uncertainty proxy is not redundant and is most informative in downside-sensitive settings, while the XGBoost and LightGBM results broadly preserve the same qualitative pattern.

This paper makes three contributions. First, we propose a disciplined feature-construction pipeline that transforms standard macro-financial predictors into forward-looking signals—expected movements and uncertainty proxies—admits them through predictor-level real-time reliability screening, and uses SHAP-based pre-screening to build parsimonious forecasting sets while retaining an interpretable view of dominant predictor channels. Second, we evaluate the framework in a way that is explicitly state-dependent and economically implementable, combining tail-conditional diagnostics, formal forecast-comparison evidence, and portfolio results measured net of transaction costs. Third, we show that the clearest economically meaningful gains arise in downside-sensitive settings, especially in PLS-family representations, rather than as a blanket improvement in average forecast accuracy across all specifications.

The remainder of the paper is organized as follows. Section 2 reviews the related literature on equity premium predictability, forward-looking predictors, and machine learning based nonlinear forecasting. Section 3 presents the two-stage methodology and evaluation metrics. Section 4 describes the data, real-time implementation, and experimental design. Section 5 reports the main empirical results, covering Stage 1 forecastability, Stage 2 predictive accuracy, formal forecast-comparison evidence, portfolio performance net of transaction costs, and interpretability. Section 6 concludes with practical implications, limitations, and directions for future research.

2 Related literature

A long-standing literature examines whether aggregate excess stock returns are predictable using information available at the forecast origin. Classic studies consider linear predictive regressions with valuation ratios and term structure variables, including dividend-based measures and interest rate information [14–17]. This evidence is often interpreted through time variation in discount rates, and macro-finance state variables have been proposed to summarize such movements [18]. At the same time, predictability is well known to be fragile in finite samples and unstable across subsamples, motivating careful real-time evaluation and robustification strategies [1,19]. A prominent response emphasizes economically motivated restrictions and simple devices that can improve out-of-sample performance and avoid implausible forecasts [3], while another response aggregates information across many predictors via forecast combinations [20]. The predictor set has also been expanded beyond standard macro-finance variables to include technical indicators that proxy for market trends and investor behavior [21]. Relatedly, [22] model market-trend dynamics by transforming momentum signals into probabilistic trend-transition features and using them as inputs to an LSTM-based trading model. Related time-series studies also examine whether auxiliary market information improves forecasts: Using rolling cointegration analysis and out-of-sample Granger-causality tests on Chinese stock-index and stock-index-futures prices, [23] finds that the two markets are generally cointegrated but that neither series consistently improves the other’s forecastability. In an agricultural price forecasting setting, [24] estimates models built from own-market history, nearby cash-market prices, and futures histories for U.S. corn buying locations, finding that nearby-market information improves forecasts with larger gains at longer horizons.

More recent work reframes conditional expected return estimation as a high dimensional prediction problem and applies statistical learning methods to capture nonlinearities and interactions among predictors. A central benchmark is [25], which systematically compares modern machine learning methods to traditional linear models in predicting and economically exploiting return variation. Related research moves beyond purely predictive performance toward economically structured representation learning: [26] proposes conditional autoencoder models that learn latent factors and nonlinear conditional exposures under asset pricing restrictions, thereby linking modern neural architectures to factor model interpretations. In a similar spirit, [27] develops deep learning architectures that explicitly separate time-series and cross-sectional components of risk premia and provides tools to interpret the resulting forecasts. Complementary work further advances deep learning tools for modeling nonlinear return predictability and conditional asset pricing relations, with an emphasis on disciplined out-of-sample validation in high dimensional settings [28,29].

Beyond the core equity premium literature, recent neural network studies apply relatively simple nonlinear architectures to Chinese rent indices, the Chinese energy security index, the China commodity price index, and U.S. corn cash prices, with the rent index, CCPI, and corn applications reporting stable and accurate forecasts and the corn study further showing that futures prices can improve one-step-ahead accuracy [30–33]. Gaussian process regression has likewise been used successfully in price forecasting, with Bayesian-optimized and cross-validated GPR models applied to daily coffee prices and ten Chinese steel product price indices [34,35]. Graphical techniques provide another route for modeling complex dependence patterns: VECM-DAG frameworks using the PC algorithm and LiNGAM have been employed to recover contemporaneous causal structures in Jiangsu housing markets and regional Chinese scrap-steel prices, revealing heterogeneous transmission channels and dominant nodes in the adjustment process [36,37]. Ensemble and composite methods offer a further complementary perspective, as combined-forecast frameworks improve U.S. corn cash price forecasting and shrinkage-based composite forecasts outperform individual time series models for the Chinese stock index across multiple horizons, underscoring the broader value of flexible forecasting architectures for capturing complex and potentially nonlinear dynamics [38,39].

A complementary literature constructs forward-looking predictors—objects observed at time t that embed beliefs about future cash flows, risks, or return distributions—and relates these signals to expected returns and realized excess returns. One prominent route backs out implied expected returns from market prices together with cash-flow forecasts, producing an implied equity premium series that can be used as a forward-looking state variable in empirical analysis [9,10]. Options markets provide another source of forward-looking information about risk compensation and return distributions; for example, option implied risk measures have been linked to subsequent market returns, and related work combines derivative implied information with time-series modeling to construct forward-looking market risk premium measures [11,12,40]. Relatedly, [41] show that economic policy uncertainty indices from ten developed countries contain out-of-sample predictive information for U.S. excess stock returns, with several non-U.S. EPU measures outperforming the domestic U.S. EPU index itself. From the perspective of predictor construction, these strands share a common message: rather than treating expected returns as an unobservable latent object, one can build practically usable forward-looking predictors by leveraging data sources that already embed market beliefs about the future.

Closest in spirit to our setting is a smaller branch that uses explicitly expectation driven or forecast generated regressors—direct measures of expected business conditions or professional macroeconomic forecasts—as inputs for studying expected returns and discount-rate variation. [42] connect directly measured expectations about business conditions to subsequent stock returns, highlighting the informational role of expectation variables beyond contemporaneous macro realizations. [43] examine what survey expectations imply for predictability in financial markets and how expectation data interact with standard predictability evidence. [44] use survey based beliefs about risk and return to link perceived economic conditions and expected premia, emphasizing that expectations themselves can act as economically meaningful predictors. [45] similarly exploit professional macroeconomic forecasts to construct expectation-based indices and relate them to expected returns, underscoring that explicitly forecast based inputs can play a distinct role relative to realized macro variables. Taken together, this literature motivates viewing forward-looking predictors not only as market implied objects from prices and derivatives, but also as generated expectation measures—an idea that naturally aligns with our focus on constructing forward-looking signals at the predictor-level and using them as inputs to risk premium forecasting models.

In empirical implementations, these forward-looking constructions naturally raise an additional practical question: how to screen and organize a large set of candidate regressors in a way that preserves out-of-sample performance while remaining interpretable. A recent explainable AI strand addresses this issue by leveraging SHAP not only for post hoc interpretation but also as a practical screening device for variable selection: covariates are ranked by aggregated Shapley attributions, a compact subset is retained, and the predictive model is refit on the reduced set to improve parsimony and robustness [46–48]. In financial forecasting, this SHAP-guided selection has begun to appear in end-to-end pipelines where the screened variables are explicitly reused for downstream prediction and decision-making. [49] selects a small set of top SHAP-ranked predictors from a broad feature pool, retrains a LightGBM return forecaster, and translates the resulting forecasts into a risk-controlled, periodically rebalanced equity portfolio. [50] similarly use SHAP to identify the most influential inputs in a deep learning trend classifier and show that restricting the model to the top-ranked features can preserve (and in some metrics improve) predictive performance while enhancing transparency. [51] goes further by integrating SHAP-based recursive feature elimination with hyperparameter optimization in an LSTM forecasting framework, aiming to obtain parsimonious yet accurate stock predictions. These pipelines primarily target cross-sectional stock return prediction and stock-selection strategies. By contrast, we use SHAP as an auxiliary screening and interpretation layer within a market-level equity premium forecasting framework built on predictor-generated forward-looking signals and tail-state diagnostics. Taken together, these studies motivate our use of SHAP importance-based screening as a disciplined dimension-reduction step that aligns interpretability with predictive performance and with the economic evaluation of return and risk premium forecasts.

3 Methodology

We propose a two-stage framework for forecasting the equity risk premium. Stage 1 forecasts each predictor to produce forward-looking signals; Stage 2 incorporates these signals into a machine learning forecasting model for the equity premium.

Our objective is to map the engineered features $X_t$ into a risk premium forecast. Following [25], we write

$$ r_{t+1} = g(X_t) + \varepsilon_{t+1}, $$

where $g(\cdot)$ denotes the forecasting model used in this study; the specific model is described in Section 4.2.2. We now describe in detail how $X_t$ is constructed.

3.1 Forecasting the predictors

We adopt a unified procedure that extracts forward-looking information from each predictor by producing a one-step-ahead conditional mean forecast and its one-step-ahead conditional residual variance as a measure of forecast uncertainty. We compute the variance to quantify the state-dependent precision of the mean forecast and carry these quantities forward to the next stage; how they are used is described later.

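Let $\{z_{k,t}\}_{k=1}^{K}$ denote the K predictor series. For each predictor, we estimate a flexible ARIMAX($p_k$, $d_k$, $q_k$) model, proposed by [52]:

$$ \phi_k(L)\,\Delta^{d_k} z_{k,t} = c_k + \boldsymbol{\beta}_k^{\top}\mathbf{x}_{k,t} + \theta_k(L)\,\varepsilon_{k,t}, $$

where $\Delta^{d_k}$ is the differencing operator of order $d_k$; $\phi_k(L)$ and $\theta_k(L)$ are the lag polynomials for the autoregressive (AR) and moving average (MA) components, respectively; $c_k$ is the intercept; $\mathbf{x}_{k,t}$ is a vector of exogenous variables with coefficient vector $\boldsymbol{\beta}_k$; and $\varepsilon_{k,t}$ is the residual term. This provides the one-step-ahead forecast, $\hat{z}_{k,t+1|t}$.

To quantify the time-varying forecast uncertainty, we model the conditional variance of the ARIMAX residuals using a GARCH($P_k$, $Q_k$) process:

$$ \sigma_{k,t}^{2} = \omega_k + \sum_{i=1}^{Q_k} a_{k,i}\,\varepsilon_{k,t-i}^{2} + \sum_{j=1}^{P_k} b_{k,j}\,\sigma_{k,t-j}^{2}, $$

where $\sigma_{k,t}^{2}$ is the conditional variance and $\hat{\sigma}_{k,t+1|t}^{2}$ is its one-step-ahead forecast implied by the recursion. The pair $(\hat{z}_{k,t+1|t}, \hat{\sigma}_{k,t+1|t})$ constitutes the forward-looking signals passed to the next stage.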
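The variance recursion can be illustrated with a minimal sketch, assuming a GARCH(1,1) specification whose parameters have already been estimated. The function name `garch11_forecast` and the parameter values below are illustrative assumptions, not taken from the paper, and the ARIMAX residuals are simulated for the example.

```python
import numpy as np

def garch11_forecast(resid, omega, alpha, beta):
    """One-step-ahead conditional variance from a GARCH(1,1) recursion.

    resid              : in-sample mean-equation residuals (1-D array)
    omega, alpha, beta : hypothetical, already-estimated parameters
    """
    resid = np.asarray(resid, dtype=float)
    # Initialise the first conditional variance at the sample variance
    # of the residuals (one common convention among several).
    sigma2 = resid.var()
    for e in resid:
        # The right-hand sigma2 is the current-period variance;
        # the left-hand sigma2 becomes the next-period variance.
        sigma2 = omega + alpha * e**2 + beta * sigma2
    return sigma2

# Simulated residuals stand in for the ARIMAX residuals of one predictor.
rng = np.random.default_rng(0)
resid = rng.normal(scale=0.02, size=240)
sigma2_next = garch11_forecast(resid, omega=1e-5, alpha=0.08, beta=0.90)
sigma_next = np.sqrt(sigma2_next)  # volatility-scale uncertainty proxy
```

After the loop has consumed all T residuals, the returned value is the variance forecast for period T + 1, which is the quantity paired with the mean forecast.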

Predictors differ in persistence, seasonality, and noise, so bespoke models might lift fit for a few series. Our aim, however, is not to maximize first-stage fit but to produce comparable reliability scores in a consistent and fair way. Using a unified ARIMAX–GARCH modeling framework across all series ensures that cross-predictor differences in out-of-sample R2 reflect signal quality rather than model flexibility or researcher discretion.

3.2 Feature construction and dimension reduction

Stage 2 maps the predictor information available at time t into a feature vector Xt and uses it to forecast the next period equity premium rt+1. The key input to this stage is the collection of predictors described in Section 4.1, together with the forward-looking quantities generated in Stage 1. In what follows, we define the candidate feature pools and the optional processing steps—screening and dimension reduction—that we use to obtain parsimonious inputs for learning.

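We begin with two base feature pools. The Past pool uses lagged predictor information only and is given by

$$ X_t^{\mathrm{Past}} = (z_{1,t}, z_{2,t}, \ldots, z_{K,t})^{\top}. $$

The Combined pool augments each lagged predictor with a forward-looking pair produced in Stage 1. Specifically, for each predictor k, Stage 1 provides a one-step-ahead conditional mean forecast $\hat{z}_{k,t+1|t}$ and a one-step-ahead conditional residual volatility proxy $\hat{\sigma}_{k,t+1|t}$. We collect these into a predictor-level feature triplet

$$ \xi_{k,t} = (z_{k,t},\ \hat{z}_{k,t+1|t},\ \hat{\sigma}_{k,t+1|t}) $$

and stack them across predictors to form

$$ X_t^{\mathrm{Comb}} = (\xi_{1,t}, \xi_{2,t}, \ldots, \xi_{K,t})^{\top}. $$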
The Combined pool is designed to encode not only the current level of each predictor but also its anticipated movement and the uncertainty around that movement, allowing the model to exploit forward-looking information without introducing additional ad hoc state variables.

Economically, this augmentation is motivated by the idea that the current level of a predictor need not fully summarize the state relevant for next-period equity premia. Two periods with similar $z_{k,t}$ can imply different required returns if the same predictor is expected to deteriorate further in one case but to stabilize or reverse in the other. The conditional mean forecast $\hat{z}_{k,t+1|t}$ therefore provides a parsimonious summary of the near-term direction in which the macro-financial state is likely to move, while $\hat{\sigma}_{k,t+1|t}$ captures how precisely that transition can be read at the forecast origin. In this sense, the uncertainty term is useful not only as a reliability measure for the generated mean forecast, but also as an indicator of local fragility or regime instability, under which the mapping from predictors to expected returns may differ. The Combined pool is thus intended to capture both where the economy currently is and where it is likely headed. This economic interpretation also motivates the admission rule introduced below: when the first-stage forward-looking signals are weakly forecastable, adding them is more likely to inject noise than to sharpen the return forecast.

Because the forward-looking components are generated forecasts, their quality can vary substantially across predictors. To prevent noisy first-stage forecasts from diluting the signal, we impose a predictor-level admission threshold when forming the Combined pool. Let $R^{2}_{\mathrm{OS},k}$ denote the out-of-sample predictive fit of the Stage 1 model for predictor k, computed using the benchmark definitions in Section 3.3. If $R^{2}_{\mathrm{OS},k}$ does not clear the admission threshold, we drop only the forward-looking pair $(\hat{z}_{k,t+1|t}, \hat{\sigma}_{k,t+1|t})$ from predictor k while always retaining the lagged level $z_{k,t}$. This rule keeps the baseline information set intact and uses Stage 1 only when it delivers sufficiently reliable forward-looking content. The resulting post-threshold working pool therefore consists of all lagged predictor levels together with only the admitted forward-looking columns.
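The admission rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_combined_pool`, the column suffixes, the strict inequality, and the default cut-off of zero are all hypothetical choices made for the example, not values reported in the text.

```python
import numpy as np

def build_combined_pool(levels, mean_fc, vol_fc, r2_os, threshold=0.0):
    """Assemble the post-threshold working pool for a single date.

    levels, mean_fc, vol_fc : dicts mapping predictor name -> value at t
    r2_os                   : dict mapping predictor name -> Stage 1 OOS R^2
    threshold               : hypothetical admission cut-off (assumption)
    """
    names, values = [], []
    for k in levels:
        # The lagged level is always retained.
        names.append(f"{k}_lag")
        values.append(levels[k])
        # The forward-looking pair enters only if Stage 1 is reliable enough.
        if r2_os[k] > threshold:
            names.extend([f"{k}_mean_fc", f"{k}_vol_fc"])
            values.extend([mean_fc[k], vol_fc[k]])
    return names, np.array(values, dtype=float)

names, x_t = build_combined_pool(
    levels={"dp": -3.9, "tbl": 0.04},
    mean_fc={"dp": -3.8, "tbl": 0.041},
    vol_fc={"dp": 0.05, "tbl": 0.002},
    r2_os={"dp": 0.05, "tbl": -0.10},
)
```

In this toy call the pair for `dp` is admitted while `tbl` contributes only its lagged level, so the baseline information set is unchanged for rejected predictors.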

To manage redundancy and improve parsimony, we optionally apply screening and/or dimension reduction to the base pools. First, we consider SHAP-based screening: we fit a preliminary model on the training data, compute feature attributions via TreeSHAP, and rank predictors by mean absolute SHAP importance. Implementation details are provided in Section 4.2.2. We then retain only the top N predictors and carry forward the corresponding features from the chosen base pool. Let $X_t^{\mathrm{SHAP}}$ denote the resulting SHAP-screened feature vector.
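The ranking step can be sketched as below, assuming the attribution matrix has already been computed on the training window (for example by TreeSHAP). The helper name `top_n_features` is hypothetical, and ranking individual columns rather than grouping columns by predictor is a simplification made for the example.

```python
import numpy as np

def top_n_features(shap_values, feature_names, n):
    """Return the n feature names with the largest mean absolute attribution.

    shap_values   : array of shape (n_samples, n_features), assumed precomputed
    feature_names : list of column names in the same order
    """
    shap_values = np.asarray(shap_values, dtype=float)
    importance = np.abs(shap_values).mean(axis=0)       # mean |SHAP| per column
    order = np.argsort(-importance, kind="stable")[:n]  # descending importance
    return [feature_names[i] for i in order]

# Toy attribution matrix: two observations, three features.
example = [[0.1, -2.0, 0.5],
           [-0.1, 1.0, -0.7]]
kept = top_n_features(example, ["a", "b", "c"], n=2)
```

The forecasting model would then be refit on the retained columns only.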

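Second, we apply linear dimension reduction to the working pool using either PCA or PLS. Let F denote the feature matrix over the training window formed from the chosen working pool (either directly from $X_t^{\mathrm{Past}}$ or $X_t^{\mathrm{Comb}}$, or from the SHAP-screened $X_t^{\mathrm{SHAP}}$), and let r denote the corresponding vector of target equity premia. PCA constructs orthogonal components that maximize the variance of the projected features by selecting loading vectors $w_j$ that solve

$$ w_j = \arg\max_{w}\ \operatorname{Var}(F w) \quad \text{s.t.} \quad w^{\top} w = 1,\ \ \operatorname{Cov}(F w, F w_l) = 0,\ \ l = 1, \ldots, j-1. $$

PLS instead targets predictive directions by maximizing squared covariance with the equity premium, choosing $w_j$ according to

$$ w_j = \arg\max_{w}\ \operatorname{Cov}^{2}(r, F w) \quad \text{s.t.} \quad w^{\top} w = 1,\ \ \operatorname{Cov}(F w, F w_l) = 0,\ \ l = 1, \ldots, j-1. $$

In both cases, the constraints normalize the loadings and ensure that the extracted score vectors are mutually uncorrelated across components. Let $\tilde{X}_t$ denote the resulting m-dimensional representation at time t, constructed by projecting the feature vector from the working pool onto the selected directions.

Overall, depending on the specification under study, the Stage 2 input is taken from

$$ X_t \in \{ X_t^{\mathrm{Past}},\ X_t^{\mathrm{Comb}},\ X_t^{\mathrm{SHAP}},\ \tilde{X}_t \}. $$

Given the input representation $X_t$, Stage 2 then fits a forecasting function $\hat{g}$ and produces the one-step-ahead equity premium forecast

$$ \hat{r}_{t+1} = \hat{g}(X_t). $$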
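The difference between the PCA and PLS criteria is easiest to see for the first component. The sketch below, which uses only NumPy and hypothetical helper names, computes the leading PCA loading as the top singular direction of the centered feature matrix and the leading PLS loading as the normalized cross-product between centered features and the centered target; later components, which require deflation, are omitted.

```python
import numpy as np

def first_pca_direction(F):
    """Unit-norm loading that maximizes the variance of F @ w."""
    Fc = F - F.mean(axis=0)
    # The leading right singular vector of the centered matrix is the
    # leading eigenvector of the sample covariance matrix.
    _, _, vt = np.linalg.svd(Fc, full_matrices=False)
    return vt[0]

def first_pls_direction(F, r):
    """Unit-norm loading that maximizes the squared covariance of r and F @ w."""
    Fc = F - F.mean(axis=0)
    rc = r - r.mean()
    w = Fc.T @ rc  # proportional to the covariance of each column with r
    return w / np.linalg.norm(w)
```

PCA ignores the target and follows the highest-variance direction, whereas PLS tilts the loading toward columns that co-move with the equity premium, which is why the two representations can behave differently when the predictive columns have low variance.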

3.3 Performance evaluation

Our primary measure of forecast accuracy is the out-of-sample $R^2$ ($R^{2}_{\mathrm{OS}}$) in the sense of [3]. Let $r_{t+1}$ denote the target at time t + 1 and $\hat{r}_{t+1}$ the corresponding one-step-ahead forecast formed using information available at time t. We define

$$ R^{2}_{\mathrm{OS}} = 1 - \frac{\sum_{t \in \mathcal{T}} (r_{t+1} - \hat{r}_{t+1})^{2}}{\sum_{t \in \mathcal{T}} (r_{t+1} - \bar{r}_{t})^{2}}, $$

where $\bar{r}_{t}$ is the historical average return estimated through period t, and $\mathcal{T}$ denotes the out-of-sample test set disjoint from training and validation. Positive values indicate that the predictive model reduces mean squared error relative to the historical mean benchmark.
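A minimal sketch of this statistic is given below, assuming the historical-mean benchmark is formed recursively from all observations before each test date. The function names `r2_os` and `expanding_mean_benchmark` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def r2_os(actual, forecast, benchmark):
    """Out-of-sample R^2: one minus the ratio of model SSE to benchmark SSE."""
    actual, forecast, benchmark = (
        np.asarray(x, dtype=float) for x in (actual, forecast, benchmark)
    )
    sse_model = np.sum((actual - forecast) ** 2)
    sse_bench = np.sum((actual - benchmark) ** 2)
    return 1.0 - sse_model / sse_bench

def expanding_mean_benchmark(returns, first_test_idx):
    """Historical-mean forecast for each test observation.

    The forecast for observation i uses only returns[0:i], so no
    information from the forecast target leaks into the benchmark.
    """
    returns = np.asarray(returns, dtype=float)
    return np.array([returns[:i].mean()
                     for i in range(first_test_idx, len(returns))])

# Toy example: the last two observations form the test set.
returns = [1.0, 3.0, 2.0, 4.0]
bench = expanding_mean_benchmark(returns, first_test_idx=2)
score = r2_os(actual=returns[2:], forecast=[2.0, 3.0], benchmark=bench)
```

Passing a different benchmark array (for instance a no-change forecast) to `r2_os` gives the corresponding benchmark-relative statistic without changing the function.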

In addition, when evaluating one-step-ahead forecasts for highly persistent series, as in our Stage 1 predictor forecasting, it is often more conservative to benchmark against a random-walk forecast. In that case, the out-of-sample statistic is computed as

$$ R^{2,\mathrm{RW}}_{\mathrm{OS}} = 1 - \frac{\sum_{t \in \mathcal{T}} (r_{t+1} - \hat{r}_{t+1})^{2}}{\sum_{t \in \mathcal{T}} (r_{t+1} - r_{t})^{2}}, $$

so that performance is assessed relative to the no-change prediction $\hat{r}_{t+1} = r_t$. Throughout this paper, we use the historical mean benchmark for equity premium forecasting and for return or growth-type targets, and the random-walk benchmark for most persistent predictor-level targets in Stage 1. An exception is monthly inflation, for which predictor-level out-of-sample reliability is evaluated relative to a trailing 12-month average formed recursively from the most recent available observations and lagged by one period to preserve real-time timing. We adopt this benchmark because monthly inflation is persistent but noisy at short horizons, so a recent-year average provides a more meaningful and less noise-sensitive real-time baseline than a one-month no-change forecast.

We also report the relative root mean squared error (RRMSE) as a supplementary scale-based measure of forecast accuracy. Let $r^{\mathrm{bm}}_{t+1}$ denote the relevant benchmark forecast. We define

$$ \mathrm{RRMSE} = \sqrt{\frac{\sum_{t \in \mathcal{T}} (r_{t+1} - \hat{r}_{t+1})^{2}}{\sum_{t \in \mathcal{T}} (r_{t+1} - r^{\mathrm{bm}}_{t+1})^{2}}}. $$

Values below one indicate that the model outperforms the benchmark in root mean squared forecast error, while values above one indicate worse forecast accuracy. The benchmark used in RRMSE follows the same rule as for $R^{2}_{\mathrm{OS}}$: the historical mean for equity premium forecasting and return or growth-type targets, and the random-walk forecast for highly persistent predictor-level targets.
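When RRMSE is taken to be the ratio of model RMSE to benchmark RMSE (an assumption consistent with the interpretation of values below one) and both statistics use the same test set and benchmark, the two measures are linked by a simple identity. Writing $\mathrm{SSE}_{\mathrm{model}}$ and $\mathrm{SSE}_{\mathrm{bm}}$ for the sums of squared forecast errors of the model and the benchmark over the test set:

$$ \mathrm{RRMSE} = \sqrt{\frac{\mathrm{SSE}_{\mathrm{model}}}{\mathrm{SSE}_{\mathrm{bm}}}} = \sqrt{1 - R^{2}_{\mathrm{OS}}}. $$

For instance, $R^{2}_{\mathrm{OS}} = 0.02$ corresponds to $\mathrm{RRMSE} = \sqrt{0.98} \approx 0.990$, so the small positive $R^{2}_{\mathrm{OS}}$ values typical of equity premium forecasting map into RRMSE values only slightly below one.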

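To connect accuracy to market states, we also compute tail-conditional out-of-sample $R^2$. For a quantile level $\alpha$, let

$$ q_{\alpha}^{-} = Q_{\alpha}\big(\{r_{t+1}\}_{t \in \mathcal{T}}\big), \qquad q_{\alpha}^{+} = Q_{1-\alpha}\big(\{r_{t+1}\}_{t \in \mathcal{T}}\big), $$

where $Q_{\alpha}(\cdot)$ is the empirical $\alpha$-quantile, and define the downside and upside index sets

$$ \mathcal{T}_{\alpha}^{-} = \{t \in \mathcal{T} : r_{t+1} \le q_{\alpha}^{-}\}, \qquad \mathcal{T}_{\alpha}^{+} = \{t \in \mathcal{T} : r_{t+1} \ge q_{\alpha}^{+}\}. $$

The corresponding conditional $R^2$ statistics are

$$ R^{2,-}_{\mathrm{OS}} = 1 - \frac{\sum_{t \in \mathcal{T}_{\alpha}^{-}} (r_{t+1} - \hat{r}_{t+1})^{2}}{\sum_{t \in \mathcal{T}_{\alpha}^{-}} (r_{t+1} - r^{\mathrm{bm}}_{t+1})^{2}}, \qquad R^{2,+}_{\mathrm{OS}} = 1 - \frac{\sum_{t \in \mathcal{T}_{\alpha}^{+}} (r_{t+1} - \hat{r}_{t+1})^{2}}{\sum_{t \in \mathcal{T}_{\alpha}^{+}} (r_{t+1} - r^{\mathrm{bm}}_{t+1})^{2}}. $$

Both metrics evaluate the reduction in squared forecast errors within either left-tail (downside) or right-tail (upside) months of the test period, relative to the same benchmark forecast $r^{\mathrm{bm}}_{t+1}$ as above. Economically, a positive $R^{2,-}_{\mathrm{OS}}$ indicates improved accuracy precisely in adverse market states (relevant for downside risk management), whereas a positive $R^{2,+}_{\mathrm{OS}}$ reflects better accuracy in favorable states and stronger upside participation.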
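The tail-conditional statistics can be sketched as follows, assuming the tails are defined by empirical quantiles of realized test-period returns. The function name `tail_r2_os` and the default quantile level of 0.1 are hypothetical choices for illustration.

```python
import numpy as np

def tail_r2_os(actual, forecast, benchmark, alpha=0.1):
    """Downside and upside out-of-sample R^2 on the test period.

    alpha is a hypothetical quantile level; the tails are the months whose
    realized return lies at or below the alpha quantile (downside) or at
    or above the 1 - alpha quantile (upside).
    """
    actual, forecast, benchmark = (
        np.asarray(x, dtype=float) for x in (actual, forecast, benchmark)
    )
    lo, hi = np.quantile(actual, [alpha, 1.0 - alpha])

    def _r2(mask):
        sse_model = np.sum((actual[mask] - forecast[mask]) ** 2)
        sse_bench = np.sum((actual[mask] - benchmark[mask]) ** 2)
        return 1.0 - sse_model / sse_bench

    return _r2(actual <= lo), _r2(actual >= hi)

# Toy example: the model anticipates part of the crash but none of the rally.
down, up = tail_r2_os(
    actual=[-4.0, -1.0, 0.0, 1.0, 4.0],
    forecast=[-2.0, 0.0, 0.0, 0.0, 0.0],
    benchmark=[0.0, 0.0, 0.0, 0.0, 0.0],
    alpha=0.2,
)
```

In the toy example the forecast halves the error in the worst month and matches the benchmark in the best month, so the downside statistic is positive while the upside statistic is zero.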

4 Data and experimental setup

We study monthly U.S. equity premium predictability using the two-stage design introduced earlier. Here we detail the practical setup: data, preprocessing, feature construction options, and the training and evaluation framework.

4.1 Data

Our sample spans January 1952 to December 2024. The equity premium is measured as the monthly market return minus the risk-free rate. Primary results are reported for the S&P 500, with robustness checks on the CRSP value-weighted index. Our baseline follows the 17 predictors of [1], augmented with five variables highlighted as promising in more recent work [2]: a composite index of technical indicators (tchi), tail risk (tail), average stock correlation (avgcor), the output gap (ogap), and growth in personal consumption expenditures (gpce). A key inclusion criterion is the availability of a continuous series back to at least 1952, ensuring a sufficiently long sample for robust out-of-sample evaluation.

To support economic interpretation and to stabilize later attribution exercises, we organize the predictors into six families following [1,2]—Valuation; Rates, Term Structure & Credit; Macroeconomic; Equity Issuance & Financing; Market Risk & Comovement; and Technical. This classification is also the one used later when we aggregate importance measures at the group-level.

Because our two-stage design constructs forward-looking signals at the predictor level, all subsequent steps rely on a real-time predictor panel that respects publication lags and appropriately aligns mixed frequencies. We describe these preprocessing and alignment rules, together with the Stage 1 forecasting procedure built on top of them, in Section 4.2.1.

4.2 Implementation details

This section describes the empirical implementation of our framework in a real-time setting. Building on the data described in Section 4.1, we summarize the preprocessing required to avoid look-ahead, the construction of the feature sets used in the forecasting experiments, and the training and evaluation protocol adopted throughout.

4.2.1 Stage 1: Preprocessing and forecasting each predictor.

Data preprocessing and mixed-frequency alignment

Stage 1 is implemented on a real-time predictor panel that respects publication lags and aligns mixed-frequency series without look-ahead. We first shift each predictor by a fixed release lag based on its native frequency (one month for monthly series, three months for quarterly series, and one year for annual series) before any further transformation or standardization. Because each predictor is forecast in turn using the remaining variables as candidate exogenous regressors, frequency alignment is performed relative to the target variable in each forecasting model. Regressors observed at the same frequency enter under the same lagged timing. When the target is higher frequency, lower-frequency regressors are carried forward using their most recently released observation until the next release. When the target is lower frequency (e.g., quarterly or annual), higher-frequency regressors are aggregated to the target periodicity as described below.
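As an illustration of the release-lag and carry-forward rules, the following minimal sketch maps a lower-frequency series onto a monthly grid. The function name, the integer month index, and the toy values are hypothetical; only the lag lengths (1, 3, and 12 months) come from the text.

```python
import numpy as np

# Release lags in months by native frequency, as stated in the text.
RELEASE_LAG_MONTHS = {"monthly": 1, "quarterly": 3, "annual": 12}

def align_to_monthly(values, obs_months, freq, n_months):
    """Monthly array holding the most recently *released* observation.

    obs_months are integer month indices (ascending) at which each value is
    observed; a value becomes usable only after its release lag has elapsed.
    """
    lag = RELEASE_LAG_MONTHS[freq]
    out = np.full(n_months, np.nan)
    for value, month in zip(values, obs_months):
        release = month + lag            # first month in which the value is usable
        if release < n_months:
            out[release:] = value        # carried forward until the next release overwrites it
    return out

# Toy quarterly series observed at quarter ends (months 2, 5, 8).
aligned = align_to_monthly([1.0, 2.0, 3.0], [2, 5, 8], "quarterly", 12)
```

Months before the first release remain missing rather than being back-filled, which is what keeps the panel free of look-ahead.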

Concretely, the aggregation step follows standard stock-flow conventions so that the mapped regressor preserves the variable’s economic interpretation. Variables naturally interpreted as end-of-period state quantities, such as interest rate levels and spreads (e.g., lty, tbl, tms, dfy), valuation ratios (e.g., d/p, d/y, e/p, d/e, b/m), and indices such as tchi and ogap, are mapped to the period-end value. Variables representing average within-period conditions or risk, such as inflation and market risk/comovement measures (e.g., infl, svar, avgcor, tail), are mapped to the period average. Cumulative flow series are mapped to the period sum (e.g., ntis), and return-type series are mapped to period returns via geometric compounding (e.g., dfr, ltr).
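The four aggregation conventions can be sketched as follows. The rule names and the helper `aggregate` are illustrative assumptions, and the sketch assumes the monthly input is already aligned to complete target periods.

```python
import numpy as np

# One aggregator per stock-flow convention described in the text.
AGGREGATORS = {
    "end": lambda x: float(x[-1]),                        # state variables: period-end value
    "mean": lambda x: float(np.mean(x)),                  # within-period conditions: average
    "sum": lambda x: float(np.sum(x)),                    # cumulative flows: period sum
    "compound": lambda x: float(np.prod(1.0 + x) - 1.0),  # returns: geometric compounding
}

def aggregate(monthly, months_per_period, rule):
    """Aggregate a monthly series to a lower frequency using the named rule."""
    monthly = np.asarray(monthly, dtype=float)
    n_periods = len(monthly) // months_per_period         # drop any incomplete final period
    blocks = monthly[: n_periods * months_per_period].reshape(n_periods, months_per_period)
    return np.array([AGGREGATORS[rule](block) for block in blocks])

# Toy monthly values aggregated to two quarters.
toy = [0.01, 0.02, 0.03, 0.00, 0.01, -0.01]
quarterly_end = aggregate(toy, 3, "end")
quarterly_compound = aggregate(toy, 3, "compound")
```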

In Stage 1, we do not impose a blanket exclusion rule at the level of the six broad predictor families in Table 1. Instead, to limit obvious redundancy, the candidate exogenous set for each target predictor is restricted only in a limited, pair-specific way, excluding variables that are mechanically overlapping or near-equivalent representations of the same underlying quantity. This mainly applies to closely related measures such as alternative valuation ratios or tightly linked rate/spread variables. Accordingly, the family classification in Table 1 should be interpreted as descriptive and used for economic organization, rather than as the formal criterion determining admissible exogenous regressors in Stage 1. The empirical implementation uses a single admissible exogenous regressor for each target predictor; Table 2 reports the selected regressor for each series.

Table 1. Predictor families, variable names, and native frequencies.

https://doi.org/10.1371/journal.pone.0341578.t001

Table 2. ARIMAX specifications and in-sample and out-of-sample R2 for each target variable. For each series we report the selected exogenous predictor, the optimal ARIMAX order (pm, d, qm), where pm denotes the number of autoregressive lags, d the order of differencing, and qm the number of moving-average lags, and the resulting in-sample and out-of-sample R2 values.

https://doi.org/10.1371/journal.pone.0341578.t002

Forecasting model and expanding window refits

The objective of Stage 1 is to construct, for each predictor zk, a forward-looking pair consisting of the one-step-ahead conditional mean and the conditional residual volatility that proxies its forecast uncertainty. We implement this by estimating an ARIMAX model for the conditional mean and a GARCH model for the residual variance on the real-time dataset described above, maintaining the exclusion map throughout. The procedure is run on an expanding window with annual refits. We reserve January 1952 through December 1971 as an initial training period and then produce the first one-step-ahead forecasts for 1972, re-estimating the models each year using the expanding sample available at that point.
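The expanding-window refit calendar can be sketched as below. The sketch fixes only the re-estimation dates; whether the conditioning data are also updated month by month between refits is not stated in the text, so that detail is left open. The function name and the (year, month) tuples are illustrative.

```python
def refit_schedule(first_year=1952, first_forecast_year=1972, last_year=2024):
    """Yield (train_start, train_end, forecast_months) for each annual refit."""
    for year in range(first_forecast_year, last_year + 1):
        train_start = (first_year, 1)                  # window always starts in January 1952
        train_end = (year - 1, 12)                     # and expands to the previous December
        forecast_months = [(year, m) for m in range(1, 13)]
        yield train_start, train_end, forecast_months

schedule = list(refit_schedule())
```

The first entry trains on January 1952 through December 1971 and forecasts the twelve months of 1972, matching the initial training period described above.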

To stabilize the mean component, the ARIMAX orders (pm, d, qm) are selected once at the initial calibration using the Akaike information criterion (AIC) from a compact grid—pm and qm ranging from 0 to 2, and d equal to 0 or 1—and then held fixed. In contrast, the GARCH orders (pv, qv) are reselected via AIC at each annual refit, allowing the volatility dynamics to adapt over time; for this, the GARCH order pv is set to 1 or 2, and the ARCH order qv ranges from 0 to 2. This unified template is intentionally parsimonious: our aim is not to maximize first stage fit for any single predictor, but to generate comparable one-step-ahead forecasts across series so that cross-predictor differences primarily reflect signal quality rather than model flexibility.
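The two order grids and the AIC selection step can be written compactly. In this sketch `fit_aic` is a hypothetical callable that fits a model of the given order and returns its AIC; it stands in for whatever ARIMAX or GARCH estimation routine is used and is not part of any specific library.

```python
from itertools import product

# ARIMAX grid: pm and qm in {0, 1, 2}, d in {0, 1}; selected once and then held fixed.
ARIMAX_GRID = [(p, d, q) for p, d, q in product(range(3), (0, 1), range(3))]

# GARCH grid: GARCH order pv in {1, 2}, ARCH order qv in {0, 1, 2}; reselected at each refit.
GARCH_GRID = [(pv, qv) for pv, qv in product((1, 2), range(3))]

def select_by_aic(grid, fit_aic):
    """Return the order in `grid` with the smallest AIC."""
    return min(grid, key=fit_aic)

# Toy AIC surface (purely illustrative) whose minimum sits at (1, 0, 2).
toy_aic = lambda order: (order[0] - 1) ** 2 + order[1] + (order[2] - 2) ** 2
best_order = select_by_aic(ARIMAX_GRID, toy_aic)
```

The mean grid contains 18 candidates and the volatility grid 6, which keeps the annual reselection of the GARCH orders inexpensive.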

From the resulting expanding-window forecasts, we compute a predictor-level out-of-sample R2 as a reliability score, using the benchmark definitions in Section 3.3. These scores are subsequently used to screen the forward-looking components when constructing the Combined feature set.
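A reliability score of this kind can be sketched with the generic out-of-sample R2 form, one minus the ratio of model to benchmark squared forecast errors. The benchmark forecast is passed in explicitly because its exact definition is given in Section 3.3 and is not reproduced here; the toy numbers are illustrative only.

```python
import numpy as np

def oos_r2(actual, forecast, benchmark):
    """Out-of-sample R2 of `forecast` relative to a benchmark forecast."""
    actual, forecast, benchmark = (
        np.asarray(a, dtype=float) for a in (actual, forecast, benchmark)
    )
    sse_model = np.sum((actual - forecast) ** 2)
    sse_bench = np.sum((actual - benchmark) ** 2)
    return 1.0 - sse_model / sse_bench

# Toy example: forecasts close to the realized values, constant benchmark.
score = oos_r2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], [2.5, 2.5, 2.5, 2.5])
```

A positive value means the Stage 1 model beats the benchmark out of sample, while a value at or below zero means it does not.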

4.2.2 Stage 2: Feature construction, reduction, and learning.

Using the Stage 1 outputs, we construct the feature sets used for equity premium forecasting. We consider two base pools: the Past set, which uses lagged predictors only, and the Combined set, which augments each lag with the Stage 1 forward-looking pair, as defined in Section 3.2. When forming the Combined pool, we apply a predictor-level admission threshold to each predictor’s Stage 1 reliability score. The lower bound of zero should be interpreted as the weakest admissible screen, excluding only predictors with negative Stage 1 out-of-sample R2, whereas the positive cutoffs form a fixed coarse sensitivity grid that imposes progressively stricter admission rather than an optimized tuning rule. If predictor k’s reliability falls below the threshold, we drop only its forward-looking pair while always retaining the lagged predictor itself in the feature pool.
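The admission rule can be sketched as a simple column-selection step. The column suffixes, the function name, and the reliability scores below are hypothetical and are not taken from Table 2.

```python
def build_combined_columns(reliability, threshold):
    """List feature columns for the Combined pool under a reliability threshold."""
    columns = []
    for name, score in reliability.items():
        columns.append(f"{name}_lag")                  # the lagged predictor is always retained
        if score >= threshold:                         # forward-looking pair only if admitted
            columns += [f"{name}_mean_fc", f"{name}_vol_fc"]
    return columns

# Hypothetical Stage 1 reliability scores, for illustration only.
toy_reliability = {"svar": 0.30, "b/m": -0.02, "tbl": 0.08}
combined_cols = build_combined_columns(toy_reliability, threshold=0.10)
```

With a threshold of zero, only the predictor with a negative score loses its forward-looking pair, which corresponds to the weakest admissible screen.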

After constructing the working pool, we standardize the resulting features, with one exception: we keep the conditional volatility forecasts on their raw scale to preserve the informational content of their magnitude. We then consider additional steps to address multicollinearity and to obtain parsimonious representations. First, we optionally apply SHAP-based screening. We fit a preliminary tree-based model on the training window and compute TreeSHAP attributions for observations in the corresponding validation block. We then form predictor-level importance scores by aggregating the mean absolute SHAP values over the validation block, rank predictors by these scores, and retain the top N predictors, carrying forward the corresponding features to the working pool. Second, we perform PCA or PLS either on the SHAP-screened pool or, for completeness, directly on the full unscreened feature set.
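The ranking step can be sketched independently of any attribution library by starting from a matrix of attributions (validation observations by features). Summing feature-level scores within a predictor is an assumption about how the aggregation is done, and the names and toy values are illustrative.

```python
import numpy as np

def top_n_predictors(shap_values, feature_to_predictor, n_keep):
    """Rank predictors by aggregated mean absolute attribution and keep the top N."""
    mean_abs = np.mean(np.abs(shap_values), axis=0)    # one score per feature column
    scores = {}
    for score, predictor in zip(mean_abs, feature_to_predictor):
        # Assumption: a predictor's score is the sum over its lag and forward-looking features.
        scores[predictor] = scores.get(predictor, 0.0) + float(score)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep], scores

# Toy attributions: two validation observations, four features from three predictors.
toy_shap = np.array([[0.1, -0.3, 0.0, 0.2],
                     [-0.1, 0.1, 0.0, -0.2]])
kept, predictor_scores = top_n_predictors(toy_shap, ["a", "a", "b", "c"], n_keep=2)
```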

Models are retrained annually on an expanding training window with a fixed-length rolling validation block for hyperparameter selection. In our baseline implementation, we use random forests to implement the forecasting rule due to their ability to capture nonlinearities and interactions while mitigating overfitting in high-dimensional settings. For the S&P 500, the initial training window is 1972–1991 and the moving validation block is 1992–1999; out-of-sample evaluation runs 2000–2024. To assess robustness across equity universes, we replicate the same evaluation protocol on the CRSP value-weighted index using the identical training, validation, and test windows; detailed results are reported in S2 Appendix. Unless otherwise noted, the validation criterion is mean squared error. Random forests use 300 trees with maximum depth fixed at 1. We tune the fraction of features considered at each split and the minimum fraction of samples per leaf on the validation block. For PCA/PLS, the number of components is selected in the same way. Tail-conditional accuracy measures are computed in the left- and right-tail months of the test window; main results are reported at q = 10%, and a sensitivity analysis over additional quantile levels is provided in S3 Appendix. As a further robustness check on the choice of learner, we repeat the Stage 2 exercise using gradient-boosted tree ensembles (specifically XGBoost and LightGBM) in place of random forests; these results are also reported in S2 Appendix.
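One way to read the training protocol is sketched below: for each test year, the validation block is the eight years immediately preceding it and the training window is everything from 1972 up to the start of that block. The eight-year length is inferred from the stated 1992–1999 block, and the assumption that the block rolls forward one year per refit is ours.

```python
def stage2_windows(train_start=1972, first_test_year=2000, last_test_year=2024, val_years=8):
    """Expanding training window with a fixed-length rolling validation block."""
    windows = []
    for test_year in range(first_test_year, last_test_year + 1):
        val_start = test_year - val_years
        windows.append({
            "train": (train_start, val_start - 1),     # expands by one year at each refit
            "validation": (val_start, test_year - 1),  # fixed length, rolls forward
            "test": test_year,
        })
    return windows

windows = stage2_windows()
```

The first window reproduces the stated split (training 1972–1991, validation 1992–1999, first test year 2000).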

5 Results and discussion

This section reports the empirical results and discusses their implications for real-time equity premium forecasting, with an emphasis on how selective admission of forward-looking signals affects both statistical accuracy and portfolio performance.

5.1 Stage 1 forecastability and reliability screening

Before turning to equity premium forecasting results, we first summarize how forecastable the underlying macro-financial predictors are in real time, since this stage determines which forward-looking signals are admitted into the combined feature pool. In Stage 1, we fit a parsimonious ARIMAX specification with a single admissible exogenous regressor for each predictor zk to generate one-step-ahead conditional mean forecasts, and pair them with a one-step-ahead residual uncertainty proxy. For each target series, we identify the single best-performing exogenous regressor through the cross-prediction procedure described in Section 4.2.1.

Table 2 reports, for each predictor, the selected exogenous variable, the ARIMAX order (pm, d, qm) determined by the AIC, and the resulting in-sample and out-of-sample fit statistics, where out-of-sample performance is evaluated using the benchmark rules in Section 3.3. Supplementary assessment of the baseline Stage 1 ARIMAX–GARCH specification is reported in S1 Appendix.

A clear dispersion emerges across predictors. Several series, such as investment i/k, payout d/e, and market-wide risk/comovement measures including svar and avgcor, exhibit materially positive out-of-sample performance in this first-stage exercise. By contrast, selected valuation, issuance, and macroeconomic variables such as b/m, ntis, and gpce display little to no incremental predictive content, with out-of-sample R2 near zero or negative. Rate, term-structure, and credit variables show mixed results.

We carry these predictor-level out-of-sample R2 values forward as reliability scores that discipline the construction of the combined feature set in Stage 2. Concretely, when forming the combined pool, we apply the admission threshold to each predictor’s Stage 1 reliability and drop only its forward-looking pair when the reliability falls below the threshold, while always retaining the contemporaneous lag zk,t. This design makes the role of forward-looking information transparent: it is included only when the underlying predictor demonstrates adequate real-time forecastability, and it allows the subsequent empirical analysis to isolate the incremental value of admitting these generated signals relative to using lagged predictors alone.

5.2 Main results and portfolio performance

This section reports our main Stage 2 equity premium forecasting results and evaluates their economic value. We summarize statistical accuracy using out-of-sample R2 measures and translate each one-step-ahead forecast into a constrained mean-variance portfolio to assess realized performance.

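Following [3], at each time t, the portfolio weight allocated to the risky asset is determined by:

$$ w_t = \min\left\{ \max\left\{ \frac{1}{\gamma}\,\frac{\hat{r}_{t+1}}{\hat{\sigma}_{t+1}^{2}},\; w_{\min} \right\},\; w_{\max} \right\}, $$

where $\hat{r}_{t+1}$ is the one-step-ahead equity premium forecast, $\hat{\sigma}_{t+1}^{2}$ is a backward-looking variance estimate computed from a rolling 5-year window, $\gamma$ is the risk aversion coefficient, and the bounds $w_{\min} = 0$ and $w_{\max}$ impose no short selling and leverage constraints. Robustness to alternative volatility estimators used in this portfolio-scaling rule is reported in S4 Appendix. Portfolio performance is reported on an annualized basis. Let $\mu_p$ and $\sigma_p$ denote the annualized mean excess return and volatility of the portfolio. We report the Sharpe ratio $\mathrm{SR} = \mu_p / \sigma_p$, and the certainty equivalent return

$$ \mathrm{CER} = \mu_p - \frac{\gamma}{2}\,\sigma_p^{2}. $$

We also report maximum drawdown

$$ \mathrm{MDD} = \max_{t}\left( 1 - \frac{W_t}{\max_{s \le t} W_s} \right), $$

where $W_t$ is the cumulative wealth. To emphasize downside risk, we compute the Sortino ratio $\mu_p / \sigma_D$ with

$$ \sigma_D = \sqrt{\frac{12}{T}\sum_{t=1}^{T} \min\left(r_{p,t},\, 0\right)^{2}}, $$

where $r_{p,t}$ is the monthly portfolio excess return, and we quantify trading intensity by annualized turnover

$$ \mathrm{TO} = \frac{12}{T}\sum_{t=1}^{T} \left| w_t - w_{t-1} \right|. $$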
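The allocation rule and the performance measures described in this section can be sketched as follows, assuming standard textbook forms: a clipped mean-variance weight, annualization by a factor of 12, downside deviation measured around zero, wealth compounded from portfolio excess returns, and turnover as the mean absolute weight change. These forms, the function names, and the numeric inputs (including the upper bound used in the example) are assumptions for illustration and may differ in detail from the paper’s implementation.

```python
import numpy as np

def mv_weight(forecast, variance, gamma, w_min, w_max):
    """Mean-variance weight on the risky asset, clipped to [w_min, w_max]."""
    return np.clip(forecast / (gamma * variance), w_min, w_max)

def portfolio_metrics(weights, excess_returns, gamma, periods=12):
    """Annualized Sharpe, Sortino, CER, maximum drawdown, and turnover."""
    w = np.asarray(weights, dtype=float)
    port = w * np.asarray(excess_returns, dtype=float)     # monthly portfolio excess return
    mu = periods * port.mean()
    sigma = np.sqrt(periods) * port.std(ddof=1)
    downside = np.sqrt(periods * np.mean(np.minimum(port, 0.0) ** 2))
    wealth = np.cumprod(1.0 + port)                        # cumulative wealth path
    mdd = np.max(1.0 - wealth / np.maximum.accumulate(wealth))
    turnover = periods * np.mean(np.abs(np.diff(w)))
    return {
        "sharpe": mu / sigma,
        "sortino": mu / downside,
        "cer": mu - 0.5 * gamma * sigma ** 2,
        "mdd": mdd,
        "turnover": turnover,
    }

# Toy inputs: a fully invested position through one large loss.
metrics = portfolio_metrics([1.0, 1.0, 1.0, 1.0], [0.10, -0.50, 0.10, 0.10], gamma=3.0)
```

Because maximum drawdown depends on the whole wealth path, two forecasts with similar average accuracy can still produce very different values of this measure.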

Throughout Tables 3–5, “Past” denotes models that use lagged predictors only, whereas “Comb.” denotes the combined feature set that augments these lags with Stage 1 forward-looking signals admitted under the predictor-level reliability threshold. “SHAP-PCA/PLS” indicates that predictors are first pre-screened by SHAP importance and then reduced by PCA or PLS before forecasting; the threshold is the minimum individual Stage 1 out-of-sample R2 required for a predictor’s forward-looking forecast to enter the combined set. We first assess the economic value of the forecasts via portfolio outcomes and then examine statistical predictive accuracy.

Table 3. Portfolio performance by feature representation and estimation method. The table reports Sharpe ratio, Sortino ratio, certainty equivalent return (CER) for a given risk aversion coefficient, maximum drawdown (MDD), and turnover for various forecasting strategies. The reliability threshold refers to the minimum individual predictor out-of-sample R2 from Stage 1 required for its forward-looking forecast to be included in the combined feature set.

https://doi.org/10.1371/journal.pone.0341578.t003

Table 4. Net-of-transaction-cost portfolio performance by feature representation and estimation method. All entries are computed after deducting proportional transaction costs of 25 basis points per unit of turnover. The table reports Sharpe ratio, Sortino ratio, certainty equivalent return (CER) for a given risk aversion coefficient, maximum drawdown (MDD), and turnover for various forecasting strategies. The combined feature set admits forward-looking signals only for predictors whose Stage 1 out-of-sample R2 exceeds the reliability threshold.

https://doi.org/10.1371/journal.pone.0341578.t004

Table 5. Out-of-sample R2 for the equity risk premium forecasts by feature representation, estimation method, and predictor-level threshold. The table reports the in-sample R2 and three out-of-sample R2 measures relative to the historical mean benchmark. The threshold refers to the minimum individual predictor out-of-sample R2 from Stage 1 required for its forward-looking forecast to be included in the combined feature set. The two tail-conditional measures denote out-of-sample R2 in downside (left-tail) and upside (right-tail) months, as defined in Section 3.3.

https://doi.org/10.1371/journal.pone.0341578.t005

Table 3 provides a direct economic lens on the Stage 2 forecasts by comparing the implied mean-variance portfolios to standard benchmarks. A simple buy-and-hold allocation delivers a Sharpe ratio of 0.46 with an MDD of 0.50, while the conditional CAPM and FF3 benchmark strategies exhibit notably lower risk-adjusted performance and comparable or larger drawdowns.

Our primary focus is the incremental benefit of augmenting lagged predictors with the admitted forward-looking signals relative to using lagged predictors alone. The evidence in Table 3 suggests that such gains are possible, but not uniform across representations or threshold choices. In the raw feature block, a moderate positive threshold performs better than both the past-only baseline and the weakest screen: at that threshold, the combined specification raises the Sharpe ratio from 0.2719 to 0.3507 and lowers MDD from 0.5909 to 0.5469 relative to Past, although it still remains below buy-and-hold.

PCA offers a weaker investment profile overall: Combined variants are sometimes mildly better than Past in Sharpe or CER, but drawdowns remain high and the gains are not consistent across thresholds. By contrast, PLS-based representations benefit more from selective admission. In the strongest gross PLS combined case, Sharpe, Sortino, and CER all exceed the past-only PLS specification and MDD falls from 0.6561 to 0.4595.

SHAP screening sharpens this pattern. At a selected threshold, SHAP-PCA combined improves both Sharpe and downside protection relative to its Past counterpart, while SHAP-PLS delivers the strongest gross performance in the table. One SHAP-PLS threshold setting achieves high performance overall, and two further threshold variants also remain stronger than the past-only SHAP-PLS baseline on risk-adjusted return measures. Overall, Table 3 suggests that forward-looking augmentation can yield economically meaningful gains, but mainly in selected representations and thresholds rather than uniformly across all combined designs.

Table 4 complements this picture by evaluating net-of-transaction-cost portfolio performance at the representative interior threshold after deducting proportional transaction costs. The raw combined specification still improves on raw Past in Sharpe, CER, and MDD, whereas PCA remains essentially unchanged. For PLS, the combined design continues to reduce drawdown materially but no longer improves Sharpe or CER relative to PLS Past. SHAP-PCA combined remains competitive, improving on its Past counterpart in Sharpe, Sortino, CER, and MDD, and SHAP-PLS combined remains the strongest net-of-transaction-cost combined specification, with a Sharpe ratio of 0.5653 and a CER of 0.0530, while also reducing turnover relative to past-only SHAP-PLS. S4 Appendix additionally reports the corresponding net-of-transaction-cost results over the full threshold grid, as well as sensitivity checks under 10 and 50 basis point transaction cost assumptions.
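A proportional cost adjustment of this kind can be sketched by charging the cost rate on each month’s absolute weight change. The 25 basis point rate is from the text; charging nothing for establishing the initial position and ignoring weight drift within the month are simplifying assumptions, as is the function name.

```python
import numpy as np

def net_returns(weights, excess_returns, cost_bps=25.0):
    """Portfolio excess returns net of proportional transaction costs."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(excess_returns, dtype=float)
    # Absolute weight change per month; the first month is treated as cost-free (assumption).
    trades = np.abs(np.diff(w, prepend=w[0]))
    return w * r - (cost_bps / 1e4) * trades

# Toy path: enter the market, hold, then halve the position.
toy_net = net_returns([0.0, 1.0, 1.0, 0.5], [0.01, 0.02, -0.01, 0.03])
```

Rerunning the same function with `cost_bps=10.0` or `cost_bps=50.0` gives the kind of cost sensitivity check mentioned above.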

From a practitioner’s perspective, these results suggest that the additional modeling complexity is justified only when it survives a joint screen based on risk-adjusted return, drawdown control, and trading intensity after costs. Under that lens, the most compelling specification in the main text is SHAP-PLS at the representative interior threshold, not because the two-stage framework uniformly dominates simpler strategies, but because this specification remains competitive after costs on the metrics practitioners are most likely to care about. By contrast, other combined designs appear more mixed: some improve drawdown without clearly improving Sharpe, while others show only modest net gains once turnover is taken into account. The practical value of the framework should therefore be viewed as selective rather than universal, with the strongest case arising when forward-looking signals are paired with disciplined admission and supervised screening.

A well-known intuition from [3] is that even small gains in out-of-sample R2 can deliver large economic value for a mean-variance investor. By contrast, [53] emphasize that minimizing a forecasting loss need not maximize a portfolio objective, so higher out-of-sample R2 need not translate into a higher Sharpe ratio. Our evidence reflects both sides of this wedge: some specifications that rank favorably by aggregate out-of-sample R2 do not deliver commensurate gains in realized portfolio performance, while some economically attractive variants show only modest aggregate improvements in statistical fit. With this distinction in mind, Table 5 summarizes the statistical forecasting evidence more directly.

Table 5 summarizes predictive performance in Stage 2 across feature families, dimension-reduction choices, and the predictor-level admission threshold. It reports in-sample R2, aggregate out-of-sample R2 relative to the historical mean benchmark, the left-tail and right-tail conditional out-of-sample R2 measures, and RRMSE. A first takeaway is that forward-looking signals are not free: under the weakest admissible screen, several combined specifications remain weak or negative in aggregate out-of-sample fit. A moderate positive threshold can stabilize performance in some blocks, but not mechanically. In the non-SHAP PCA block, aggregate out-of-sample R2 moves from −0.0015 under Past to 0.0015 and 0.0024 at two thresholds under the combined feature set. In the PLS block, Past yields 0.0021, while the combined set improves to 0.0062, 0.0040, and 0.0039 at three thresholds, the last two being 0.10 and 0.15, before turning negative again at another threshold. Thus the threshold should again be interpreted as a selectivity device that trades off signal quality against information retention, rather than as a monotone tuning parameter.

Second, SHAP screening yields the strongest and most consistent performance. SHAP-PCA Past already produces positive aggregate out-of-sample R2 (0.0041), and all of its combined variants remain positive, with the strongest aggregate fit reaching 0.0157 and RRMSE as low as 0.9921. SHAP-PLS performs even better overall: the past-only version reaches 0.0153, while the combined set peaks at 0.0212, with a second threshold reaching 0.0163 and one of these specifications also delivering the lowest RRMSE in the table (0.9893). These patterns are consistent with the view that target-aligned SHAP screening removes weak or redundant inputs before the representation step.

Finally, the tail-conditional diagnostics show that predictive gains are state dependent. PCA and SHAP-PCA combined specifications tend to lean more toward upside accuracy, as reflected in positive right-tail out-of-sample R2 alongside negative left-tail out-of-sample R2 at several thresholds. By contrast, combined PLS specifications often load more heavily on downside months, with the left-tail measure exceeding the right-tail measure. SHAP-PLS is more mixed across thresholds, with positive out-of-sample R2 in both tails but strongly skewed toward downside accuracy. These state-conditional patterns help explain why aggregate forecast fit and portfolio outcomes need not move one-for-one, especially for path-dependent objects such as maximum drawdown.
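A tail-conditional out-of-sample R2 can be sketched by restricting the usual ratio of squared errors to tail months. Here tail months are defined by the lower and upper q-quantiles of realized excess returns in the test window; this definition and the function name are assumptions, since the exact construction is given in Section 3.3 and is not reproduced here.

```python
import numpy as np

def tail_r2(actual, forecast, benchmark, q=0.10):
    """Return (left-tail, right-tail) out-of-sample R2 at quantile level q."""
    actual, forecast, benchmark = (
        np.asarray(a, dtype=float) for a in (actual, forecast, benchmark)
    )
    lo, hi = np.quantile(actual, [q, 1.0 - q])

    def r2(mask):
        sse_model = np.sum((actual[mask] - forecast[mask]) ** 2)
        sse_bench = np.sum((actual[mask] - benchmark[mask]) ** 2)
        return 1.0 - sse_model / sse_bench

    return r2(actual <= lo), r2(actual >= hi)

# Toy case: a forecast that is perfect in down months and equals the benchmark otherwise.
toy_actual = np.linspace(-0.10, 0.10, 21)
toy_bench = np.zeros_like(toy_actual)
toy_forecast = np.where(toy_actual < 0, toy_actual, toy_bench)
left_r2, right_r2 = tail_r2(toy_actual, toy_forecast, toy_bench)
```

In the toy case all of the forecast’s value sits in the left tail, the kind of asymmetry that an aggregate R2 alone would not reveal.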

5.3 Additional validation and robustness checks

5.3.1 Forecast significance tests.

Table 6 complements the out-of-sample R2-based comparisons with formal forecast significance tests. We use the one-sided Diebold–Mariano (DM) test as our primary loss-based test throughout. For comparisons against the historical mean benchmark, we additionally report the Clark–West (CW) statistic. The reason is that the historical mean is a parsimonious benchmark, whereas our competing specifications are richer conditional forecasting models estimated recursively from larger feature sets. In such settings, the larger model can be penalized in raw MSPE comparisons by estimation noise even when it contains incremental predictive information, which is precisely the case for which the CW adjustment was developed [54].

Table 6. Diebold–Mariano (DM) and Clark–West (CW) test statistics for one-step-ahead equity risk premium forecasts. The left block reports one-sided benchmark-relative test statistics against the historical mean forecast. The Pairwise block reports one-sided DM test statistics comparing the Combined specification with the matched Past specification within the same method and threshold, together with the difference in out-of-sample R2 between the Combined and Past specifications. Superscripts *, **, and *** denote statistical significance at the 10%, 5%, and 1% levels, respectively, based on one-sided tests.

https://doi.org/10.1371/journal.pone.0341578.t006

Accordingly, the left block of Table 6 reports benchmark-relative DM and CW test statistics, where larger positive values indicate stronger evidence that the reported specification improves upon the historical mean benchmark. The right block reports the paired comparison between the Combined and matched Past specifications within each method and threshold; there, the difference in out-of-sample R2 is computed as Combined minus Past, and positive values of both this difference and the DM statistic favor the Combined specification.
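For one-step-ahead forecasts under squared-error loss, both statistics can be sketched as t-statistics on a loss differential. The version below uses a plain sample variance without any autocorrelation correction, which is a simplifying assumption; the function names and toy data are illustrative.

```python
import numpy as np

def _tstat(x):
    """t-statistic for the null hypothesis that the mean of x is zero."""
    x = np.asarray(x, dtype=float)
    return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

def dm_stat(actual, bench_fc, model_fc):
    """Diebold-Mariano statistic; positive values favor the model over the benchmark."""
    actual, bench_fc, model_fc = (np.asarray(a, dtype=float) for a in (actual, bench_fc, model_fc))
    return _tstat((actual - bench_fc) ** 2 - (actual - model_fc) ** 2)

def cw_stat(actual, bench_fc, model_fc):
    """Clark-West statistic: the loss differential plus the squared forecast gap."""
    actual, bench_fc, model_fc = (np.asarray(a, dtype=float) for a in (actual, bench_fc, model_fc))
    adjustment = (bench_fc - model_fc) ** 2            # offsets estimation noise in the larger model
    return _tstat((actual - bench_fc) ** 2 - ((actual - model_fc) ** 2 - adjustment))

# Toy data: the model forecast tracks half of each realized return.
toy_y = np.array([0.02, -0.01, 0.03, 0.00, 0.01, -0.02])
toy_bench_fc = np.full_like(toy_y, 0.005)
toy_model_fc = 0.5 * toy_y
dm_value = dm_stat(toy_y, toy_bench_fc, toy_model_fc)
cw_value = cw_stat(toy_y, toy_bench_fc, toy_model_fc)
```

Because the adjustment term is never negative, the CW loss differential has a mean at least as large as the DM one, which fits the pattern of stronger benchmark-relative support under CW.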

The benchmark-relative results in the left block of Table 6 show that the DM evidence is selective rather than broad-based. The clearest joint support from both DM and CW appears in the SHAP-screened specifications, most notably for a SHAP-PCA Combined specification and for the SHAP-PLS Past specification. More generally, benchmark-relative support is visibly stronger under CW than under DM. This pattern is most apparent for the PLS and SHAP-PLS families, where several Combined specifications receive positive CW support even when the corresponding DM statistic does not reach conventional significance. We interpret this asymmetry cautiously: it suggests that some richer conditional specifications contain incremental information relative to the historical mean benchmark, while the raw loss-differential evidence remains modest.

The right block of Table 6 asks a stricter question: whether augmenting past predictors with admitted forward-looking signals improves upon the matched Past specification within the same representation. Here the evidence is more limited. The clearest result is an unreduced specification in which the Combined design delivers a positive difference in out-of-sample R2 together with a statistically significant DM statistic. PCA at two thresholds also yields positive differences in out-of-sample R2 and positive DM statistics, but these remain below conventional significance thresholds. For PLS, SHAP-PCA, and SHAP-PLS, the incremental gains are either small or unstable across thresholds. Taken together, the pairwise results indicate that the incremental contribution of forward-looking augmentation is selective rather than pervasive.

This pattern is consistent with the broader literature on aggregate equity premium prediction. A long line of work emphasizes that out-of-sample market return predictability is difficult to establish reliably, and that when predictability is present it is often weak and unstable in economic magnitude and statistical significance [13,20,55]. In this context, the limited pairwise DM significance in Table 6 should not be interpreted as evidence against the usefulness of the approach. Rather, it underscores the difficulty of extracting robust real-time signals from a single noisy aggregate return series. Our contribution is therefore not that forward-looking augmentation uniformly dominates past-only information, but that under disciplined admission and representation choices it can isolate weak but economically relevant predictive content, especially in specifications that also perform well in the downside-sensitive portfolio exercises. This interpretation is also in line with the view that poor out-of-sample R2 does not by itself rule out return predictability, and that even modest predictive improvements can matter economically [3,56].

5.3.2 Sigma ablation study.

To make the role of the Stage 1 uncertainty channel explicit, Table 7 reports a direct sigma ablation for the combined feature representation at the representative interior admission threshold. The table compares each specification with and without the uncertainty proxy, and reports the difference in performance between the specification with the proxy and the corresponding specification without it. We focus on this threshold in the main text to keep the ablation compact and directly interpretable, while the sigma ablation over the full threshold grid is reported in S5 Appendix.

Table 7. Sigma ablation in Stage 2 for the combined feature set at the representative interior admission threshold. Throughout the table, the reported difference is the performance of the specification that includes the Stage 1 uncertainty proxy minus that of the corresponding specification that excludes it. Panel A reports out-of-sample prediction metrics, and Panel B reports gross portfolio performance. Positive differences in the R2 measures, Sharpe, Sortino, and CER indicate better performance with the uncertainty proxy, whereas negative differences in RRMSE and MDD indicate improvement because forecast error and maximum drawdown are reduced.

https://doi.org/10.1371/journal.pone.0341578.t007

Panel A shows that the contribution of the uncertainty proxy is not uniform across representations, but it is clearly not redundant. The cleanest comparison is the non-SHAP block, since it isolates the incremental role of the uncertainty proxy without additional screening effects. In the raw combined specification, adding sigma raises two of the reported R2 measures, lowers a third, and modestly improves RRMSE. Thus, even in the simplest non-SHAP specification, the uncertainty proxy improves forecasting performance primarily in adverse market states rather than uniformly across the test sample.

The PCA block, by contrast, is essentially unchanged. All prediction metrics remain close to zero, indicating that once the combined signal is compressed along variance-maximizing directions, the uncertainty channel adds little incremental information. This stands in clear contrast to the PLS block, which delivers the strongest forecasting gains without SHAP screening. There, sigma inclusion increases three of the reported R2 measures but reduces the fourth, while improving RRMSE. The asymmetry between the left-tail and right-tail measures suggests that the uncertainty proxy helps the second-stage learner identify states in which predictor-level forecasts are less precise and the mapping into the equity premium becomes more fragile or more nonlinear. In that sense, the uncertainty proxy functions as a state-dependent reliability signal, which is precisely the role envisioned in Stage 1.

The SHAP-screened specifications reinforce this interpretation, but we treat them as complementary rather than primary evidence. Both SHAP-PCA and SHAP-PLS show positive changes in out-of-sample fit, and in both cases the improvement is again more pronounced on the downside than on the upside. These results are consistent with the non-SHAP evidence, but they should be interpreted with some caution because sigma inclusion may interact with screening and feature selection.

Panel B shows that the economic implications broadly mirror the statistical evidence. In the raw specification, sigma inclusion produces modest but consistently positive improvements in Sharpe, Sortino, and CER, while also reducing maximum drawdown. The PCA block again remains economically negligible. The strongest non-SHAP portfolio gains arise in the PLS block, where adding sigma improves the Sharpe ratio, the Sortino ratio, CER, and maximum drawdown, although turnover increases. Thus, in the representation where the gains are most visible, those gains also translate into a more attractive investment profile.

At the same time, the turnover results do not move in a single direction across methods. In particular, the PLS specification exhibits a higher turnover when sigma is included, whereas the raw and PCA specifications do not. This is useful for interpretation: the portfolio gains associated with the uncertainty proxy are not simply the result of mechanically smoother portfolio weights or lower trading activity. Rather, they reflect an informational improvement in the signal entering the portfolio decision.

The SHAP-based portfolio results are again stronger in magnitude, with sizable gains in Sharpe, Sortino, CER, and drawdown control for both SHAP-PCA and SHAP-PLS. However, as in Panel A, we interpret these as robustness evidence rather than as the cleanest sigma comparison. The main takeaway from Table 7 is therefore not that sigma is universally helpful across all feature-processing choices. Instead, the evidence indicates that the Stage 1 uncertainty proxy has a distinct and economically meaningful role: its contribution is concentrated in downside-state forecasting, and it becomes most valuable when the combined signal is represented in a disciplined supervised low-dimensional form, especially PLS.

This interpretation is broadly in line with [41], who show that international economic policy uncertainty measures contain out-of-sample predictive information for U.S. excess returns, suggesting more generally that uncertainty-related signals can be most useful when they are incorporated through a disciplined summary representation.

5.4 Interpretability: What drives predictability?

In this section, we study which economic channels are most closely associated with equity premium predictability over time by summarizing predictor importance at the group level. Using the six predictor families defined in Section 4.1 (see Table 1 for the complete mapping), we compute mean absolute TreeSHAP attributions on the validation set and aggregate them within each family to obtain a time-varying measure of group-level importance. This aggregation helps mitigate multicollinearity-induced attribution dilution among closely related predictors and yields a more stable, interpretable diagnostic of the channels emphasized by the forecasting model.
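The family-level aggregation can be sketched for a single refit year as follows. Summing feature-level mean absolute attributions within a family and normalizing to shares that sum to one are assumptions made for illustration; repeating the calculation for each annual validation block would give the time-varying measure.

```python
import numpy as np

def family_importance(shap_values, feature_family):
    """Share of total mean absolute attribution carried by each predictor family."""
    mean_abs = np.mean(np.abs(shap_values), axis=0)    # one score per feature column
    totals = {}
    for score, family in zip(mean_abs, feature_family):
        totals[family] = totals.get(family, 0.0) + float(score)
    grand_total = sum(totals.values())
    return {family: value / grand_total for family, value in totals.items()}

# Toy attributions: two correlated valuation features and one technical feature.
toy_attr = np.array([[0.2, -0.2, 0.4],
                     [-0.2, 0.2, -0.4]])
shares = family_importance(toy_attr, ["Valuation", "Valuation", "Technical"])
```

In the toy case each valuation feature looks half as important as the technical feature on its own, yet the valuation family carries the same share once its correlated members are pooled.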

This group-level diagnostic also helps rationalize why SHAP-based screening improves out-of-sample performance. Because screening ranks predictors by their validation-based TreeSHAP contributions, it prioritizes variables that the fitted model relies on most when producing real-time forecasts, while de-emphasizing redundant signals within the same economic family. Consequently, the downstream forecasting stage is effectively guided toward the channels that carry predictive content at the forecast origin, which is consistent with the performance improvements documented in Tables 3–5.
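As a rough illustration of the ranking step, the sketch below keeps the k predictors with the largest validation-based mean absolute attribution. The cutoff `k`, the alphabetical tie-breaking rule, and the predictor names are assumptions made for the example; this section does not specify how many predictors are retained or how ties are handled.

```python
def screen_top_k(feature_importance, k):
    """Return the k predictors with the largest importance.

    Ties are broken alphabetically so that the result is deterministic
    (an assumption for this sketch, not a rule taken from the paper).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(feature_importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical validation-based importance scores.
example_importance = {"dp": 0.30, "ep": 0.10, "tbl": 0.20, "dfy": 0.10}

selected_two = screen_top_k(example_importance, 2)
selected_three = screen_top_k(example_importance, 3)
print(selected_two)
print(selected_three)
```

Only the retained predictors would then be passed to the downstream representation step (for example PCA or PLS) and to the forecasting model.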

Several broad patterns emerge. During the dot-com boom–bust (2000–2003), valuation signals are especially prominent, while issuance and financing variables are more visible than in later years, which is consistent with work linking valuation ratios and issuance waves to subsequent return reversals [57,58]. In the mid-2000s expansion, macroeconomic information gains relative importance, particularly around consumption–wealth conditions, while rates and credit variables remain an important backbone [18,59]. During the global financial crisis (2007–2009), the emphasis shifts more clearly toward rates/credit and market risk/comovement, in line with evidence on credit spreads, intermediary balance sheets, and tighter funding conditions [60,61]. Through much of the 2010s, market risk/comovement remains persistently important, and technical signals become more visible toward the late 2010s, which is consistent with the role of broad financial-cycle conditions and trend-sensitive information under prolonged monetary accommodation [21,22,62]. The 2020 pandemic shock again raises the relative importance of market risk/comovement together with rates/credit [63,64], while the 2022–2024 inflation-tightening period is characterized by renewed macroeconomic and rates/credit importance, with market risk/comovement remaining elevated rather than disappearing [65].

The group-level view also clarifies what SHAP-based screening is meant to do. The screening step keeps the model focused on the predictors that are most informative at the forecast origin, while avoiding an unnecessarily wide set of close substitutes that often carry overlapping information. Read in that way, Fig 1 does not say that one family causes the equity premium in a structural sense. Rather, it shows which kinds of signals the fitted forecasting system relies on most heavily in different periods. That is also why the family-level summary is useful: even when the importance of individual variables moves around within a family, the broader economic channel remains visible in an interpretable way. Additionally, S6 Appendix examines, for a representative SHAP-PLS specification, how the composition of the annually selected SHAP-ranked predictor bundles evolves over time.

Fig 1. Time-varying group-level importance of equity premium predictor families.

Each bubble corresponds to one of the six predictor families defined in Table 1 in a given calendar year. For each year, group-level importance is computed from mean absolute TreeSHAP values on the validation set and then aggregated within family. Bubble area reflects that family’s share of total annual group-level importance across the six families, so larger bubbles indicate that the fitted forecasting model relied more heavily on that family in that year.

https://doi.org/10.1371/journal.pone.0341578.g001

6 Conclusion

This paper studies real-time equity premium forecasting through a disciplined two-stage predict-the-predictors framework. Rather than mapping a high-dimensional and potentially unstable predictor panel directly into future excess returns, we first forecast each macro-financial predictor one step ahead and use its expected movement and forecast uncertainty to construct an augmented feature set. A predictor-level admission rule then determines whether these generated signals are sufficiently reliable to enter the second-stage forecasting model. Within this architecture, SHAP-based screening also serves as a useful layer of parsimony and interpretation, helping organize the augmented information in a more target-aligned way.

Empirically, the value of forward-looking augmentation is selective rather than uniform. Across representations and admission thresholds, reliability-screened forward-looking signals can improve benchmark-relative forecasting performance and generate economically meaningful differences in portfolio outcomes, especially when predictive performance is examined through downside-state diagnostics rather than average fit alone. The formal forecast comparison evidence is more supportive in benchmark-relative comparisons than in stricter pairwise comparisons against matched past-only designs, which points to a nuanced but still meaningful contribution. The contribution of the paper is therefore not that combined inputs mechanically dominate past-only designs in every specification, but that forward-looking feature construction offers a disciplined way to extract incremental predictive and economic value from standard predictors when signal quality is screened, information is represented parsimoniously, and performance is evaluated in state-dependent as well as economic terms. The sigma ablation further suggests that the uncertainty channel is not redundant and can be especially informative in downside-sensitive settings.

These findings also carry practical implications. For investors and asset allocators, they suggest that forward-looking transformations of standard predictors should be admitted selectively and judged not only by average forecast fit but also by their implications for downside protection, turnover, and implementable portfolio performance. For risk managers monitoring macro-financial conditions, the framework offers a structured way to summarize which predictor families become more informative across regimes and when uncertainty-related signals become more relevant. For policymakers and macro-financial surveillance institutions, the framework may also serve as a descriptive tool for tracking when valuation, macroeconomic, credit, or market risk signals become more informative, which may be useful for regime monitoring and scenario-based risk assessment. In this sense, the paper contributes not only a forecasting design, but also a transparent bridge between real-time prediction, interpretable feature construction, and economic decision-making.

Several limitations point to natural next steps within the same empirical architecture. First, uncertainty is currently carried from Stage 1 to Stage 2 in a modular way. Bayesian or state-space formulations could instead propagate predictive distributions more explicitly through the pipeline. Second, although we report net-of-transaction-cost portfolio evidence, trading frictions are not yet embedded directly in model estimation or model selection. Future work could incorporate cost-aware validation rules or turnover-sensitive objectives. It could also examine alternative portfolio mappings, such as risk-parity allocation or utility specifications that place greater weight on downside outcomes, to assess whether the same forecasting signals remain valuable beyond the constrained mean–variance benchmark. Third, while our evaluation highlights downside-state performance, the learning objective itself remains symmetric. Asymmetric, quantile-based, or downside-weighted losses could better align statistical learning with path-dependent investment goals. Fourth, it would be valuable to examine the external validity of the framework beyond the present U.S. monthly setting, including emerging and frontier equity markets and, more ambitiously, higher-frequency settings such as mixed-frequency, intraday, or even tick-level environments in which information, frictions, and uncertainty evolve more rapidly. Finally, future work could extend to simple neural-network learners and richer meta-ensemble architectures, and could also allow dependence-aware first-stage models that exploit cross-predictor structure more explicitly.

Supporting information

S1 Appendix. Supplementary assessment of the stage 1 ARIMAX–GARCH specification.

https://doi.org/10.1371/journal.pone.0341578.s001

(PDF)

S2 Appendix. Predictive robustness across alternative learners and equity universes.

https://doi.org/10.1371/journal.pone.0341578.s002

(PDF)

S3 Appendix. Sensitivity of the tail-conditional R² diagnostics to alternative quantile levels.

https://doi.org/10.1371/journal.pone.0341578.s003

(PDF)

S4 Appendix. Robustness of portfolio performance to transaction costs and alternative volatility estimators.

https://doi.org/10.1371/journal.pone.0341578.s004

(PDF)

S5 Appendix. Full sensitivity of the sigma ablation study.

https://doi.org/10.1371/journal.pone.0341578.s005

(PDF)

S6 Appendix. Time variation in SHAP-ranked predictor bundles.

https://doi.org/10.1371/journal.pone.0341578.s006

(PDF)

References

  1. Welch I, Goyal A. A Comprehensive Look at The Empirical Performance of Equity Premium Prediction. Rev Financ Stud. 2007;21(4):1455–508.
  2. Goyal A, Welch I, Zafirov A. A Comprehensive 2022 Look at the Empirical Performance of Equity Premium Prediction. Rev Financ Stud. 2024;37(11):3490–557.
  3. Campbell JY, Thompson SB. Predicting excess stock returns out of sample: can anything beat the historical average? Rev Financ Stud. 2008;21(4):1509–31.
  4. Cohen RB, Polk C, Vuolteenaho T. The Value Spread. J Finance. 2003;58(2):609–41.
  5. Greenwood R, Hanson SG. Share Issuance and Factor Timing. J Finance. 2012;67(2):761–98.
  6. Moskowitz TJ, Ooi YH, Pedersen LH. Time series momentum. J Financ Econ. 2012;104(2):228–50.
  7. Haddad V, Kozak S, Santosh S. Factor Timing. Rev Financ Stud. 2020;33(5):1980–2018.
  8. Ehsani S, Linnainmaa JT. Factor Momentum and the Momentum Factor. J Finance. 2022;77(3):1877–919.
  9. Claus J, Thomas J. Equity Premia as Low as Three Percent? Evidence from Analysts’ Earnings Forecasts for Domestic and International Stock Markets. J Finance. 2001;56(5):1629–66.
  10. Gebhardt WR, Lee CMC, Swaminathan B. Toward an Implied Cost of Capital. J Account Res. 2001;39(1):135–76.
  11. Duan J-C, Zhang W. Forward-Looking Market Risk Premium. Manag Sci. 2014;60(2):521–38.
  12. Martin I. What is the Expected Return on the Market? Quart J Econ. 2016;132(1):367–433.
  13. Breiman L. Random Forests. Mach Learn. 2001;45(1):5–32.
  14. Keim DB, Stambaugh RF. Predicting returns in the stock and bond markets. J Financ Econ. 1986;17(2):357–90.
  15. Campbell JY. Stock returns and the term structure. J Financ Econ. 1987;18(2):373–99.
  16. Campbell JY, Shiller RJ. The dividend-price ratio and expectations of future dividends and discount factors. Rev Financ Stud. 1988;1(3):195–228.
  17. Fama EF, French KR. Dividend yields and expected stock returns. J Financ Econ. 1988;22(1):3–25.
  18. Lettau M, Ludvigson S. Consumption, Aggregate Wealth, and Expected Stock Returns. J Finance. 2001;56(3):815–49.
  19. Stambaugh RF. Predictive regressions. J Financ Econ. 1999;54(3):375–421.
  20. Rapach DE, Strauss JK, Zhou G. Out-of-Sample Equity Premium Prediction: Combination Forecasts and Links to the Real Economy. Rev Financ Stud. 2009;23(2):821–62.
  21. Neely CJ, Rapach DE, Tu J, Zhou G. Forecasting the Equity Risk Premium: The Role of Technical Indicators. Manag Sci. 2014;60(7):1772–91.
  22. Song J, Jeon J. Deep momentum networks with market trend dynamics. PLoS One. 2025;20(9):e0331391. pmid:40892940
  23. Xu X. The rolling causal structure between the Chinese stock index and futures. Financ Mark Portf Manag. 2017;31(4):491–509.
  24. Xu X. Using Local Information to Improve Short-Run Corn Price Forecasts. J Agric Food Ind Organ. 2018;16(1).
  25. Gu S, Kelly B, Xiu D. Empirical Asset Pricing via Machine Learning. Rev Financ Stud. 2020;33(5):2223–73.
  26. Gu S, Kelly B, Xiu D. Autoencoder asset pricing models. J Econ. 2021;222(1):429–50.
  27. Lo AW, Singh M. Deep-learning models for forecasting financial risk premia and their interpretations. Quant Finance. 2023;23(6):917–29.
  28. Feng G, He J, Polson NG. Deep learning for predicting asset returns. arXiv preprint arXiv:1804.09314. 2018. https://doi.org/10.48550/arXiv.1804.09314
  29. Chen L, Pelger M, Zhu J. Deep Learning in Asset Pricing. Manag Sci. 2024;70(2):714–50.
  30. Xu X, Zhang Y. Corn cash price forecasting with neural networks. Comput Electron Agricult. 2021;184:106120.
  31. Xu X, Zhang Y. Rent index forecasting through neural networks. JES. 2021;49(8):1321–39.
  32. Jin B, Xu X. Chinese energy security index price forecasting through the neural network. Innov Emerg Technol. 2025;12:2550036.
  33. Jin B, Xu X. China commodity price index (CCPI) forecasting via the neural network. J Finan Eng. 2025;12(03):2550003.
  34. Xu X, Zhang Y. Price forecasts of ten steel products using Gaussian process regressions. Eng Appl Artif Intell. 2023;126:106870.
  35. Jin B, Xu X. Machine Learning Coffee Price Predictions. J Uncert Sys. 2024;17(04):2450023.
  36. Xu X, Zhang Y. An integrated vector error correction and directed acyclic graph method for investigating contemporaneous causalities. Decis Anal J. 2023;7:100229.
  37. Jin B, Xu X. A study of contemporaneous residential real estate price causation across major jiangsu province cities: Methodology using vector error-correction models and directed acyclic graphs. Econ Open. 2025;2550008.
  38. Xu X. Corn Cash Price Forecasting. Am J Agri Econ. 2020;102(4):1297–320.
  39. Xu X, Zhang Y. Individual time series and composite forecasting of the Chinese stock index. Mach Learn Appl. 2021;5:100035.
  40. Bollerslev T, Todorov V. Tails, Fears, and Risk Premia. J Finance. 2011;66(6):2165–211.
  41. Huang Y, Ma F, Bouri E, Huang D. A comprehensive investigation on the predictive power of economic policy uncertainty from non-U.S. countries for U.S. stock market returns. Int Rev Financ Analys. 2023;87:102656.
  42. Campbell SD, Diebold FX. Stock Returns and Expected Business Conditions: Half a Century of Direct Evidence. J Bus Econ Stat. 2009;27(2):266–78.
  43. Bacchetta P, Mertens E, van Wincoop E. Predictability in financial markets: What do survey expectations tell us? J Int Money Finance. 2009;28(3):406–26.
  44. Amromin G, Sharpe SA. From the Horse’s Mouth: Economic Conditions and Investor Expectations of Risk and Return. Manag Sci. 2014;60(4):845–66.
  45. Deng Y, Wang Y, Zhou T. Macroeconomic Expectations and Expected Returns. J Financ Quant Anal. 2024;60(4):1760–96.
  46. Marcílio WE, Eler DM. From explanations to feature selection: assessing SHAP values as feature selection mechanism. In: 2020 33rd SIBGRAPI conference on Graphics, Patterns and Images (SIBGRAPI). IEEE; 2020. p. 340–7.
  47. Liu Y, Liu Z, Luo X, Zhao H. Diagnosis of Parkinson’s disease based on SHAP value feature selection. Biocybern Biomed Eng. 2022;42(3):856–69.
  48. Ahmed U, Jiangbin Z, Almogren A, Sadiq M, Rehman AU, Sadiq MT, et al. Hybrid bagging and boosting with SHAP based feature selection for enhanced predictive modeling in intrusion detection systems. Sci Rep. 2024;14(1):30532. pmid:39690165
  49. Zhang X. Stock Return Forecasting Using SHAP-Based Feature Selection and Risk-Controlled Portfolio Construction. In: 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025). Atlantis Press; 2025. p. 770–8.
  50. Muhammad D, Ahmed I, Naveed K, Bendechache M. An explainable deep learning approach for stock market trend prediction. Heliyon. 2024;10(21):e40095. pmid:39568823
  51. Luo T. SHAP-Based Recursive Feature Elimination and Hyperparameter Optimization for Enhanced Financial Stock Forecasting. In: Proceedings of the 2025 International Conference on Big Data, Artificial Intelligence and Digital Economy. 2025. p. 111–20.
  52. Box GEP, Tiao GC. Intervention Analysis with Applications to Economic and Environmental Problems. J Am Stat Assoc. 1975;70(349):70–9.
  53. Cong LW, Tang K, Wang J, Zhang Y. AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI. Available at SSRN 3554486. 2021.
  54. Clark TE, West KD. Approximately normal tests for equal predictive accuracy in nested models. J Econom. 2007;138(1):291–311.
  55. Guo H. Earnings Extrapolation and Predictable Stock Market Returns. Rev Financ Stud. 2025;38(6):1730–82.
  56. Cochrane JH. The Dog That Did Not Bark: A Defense of Return Predictability. Rev Financ Stud. 2007;21(4):1533–75.
  57. Campbell JY, Shiller RJ. Valuation ratios and the long-run stock market outlook. J Portf Manag. 1998;24(2):11–26.
  58. Ofek E, Richardson M. DotCom Mania: The Rise and Fall of Internet Stock Prices. J Finance. 2003;58(3):1113–37.
  59. Fama EF, French KR. Business conditions and expected returns on stocks and bonds. J Financ Econ. 1989;25(1):23–49.
  60. Gilchrist S, Zakrajšek E. Credit Spreads and Business Cycle Fluctuations. Am Econ Rev. 2012;102(4):1692–720.
  61. Adrian T, Etula E, Muir T. Financial Intermediaries and the Cross-Section of Asset Returns. J Finance. 2014;69(6):2557–96.
  62. Miranda-Agrippino S, Rey H. U.S. Monetary Policy and the Global Financial Cycle. Rev Econ Stud. 2020;87(6):2754–76.
  63. Baker SR, Bloom N, Davis SJ, Kost K, Sammon M, Viratyosin T. The unprecedented stock market reaction to COVID-19. Rev Asset Pricing St. 2020;10(4):742–58.
  64. Haddad V, Moreira A, Muir T. When Selling Becomes Viral: Disruptions in Debt Markets in the COVID-19 Crisis and the Fed’s Response. Rev Financ Stud. 2021;34(11):5309–51.
  65. Campbell JY, Pflueger C, Viceira LM. Macroeconomic Drivers of Bond and Equity Risks. J Polit Econ. 2020;128(8):3148–85.