Deep learning has made significant advances across a wide range of applications, such as achieving human-like accuracy in image recognition tasks (Schroff et al., 2015) and machine translation (Sutskever et al., 2014). These applications can be considered stationary problems, where the function of interest does not change with time. In some domains the prediction problem is non-stationary: the underlying relationships between input and output change over time (also known as concept drift in machine learning; Gama et al., 2014). Recently, Aydore et al. (2019) proposed the Dynamic Exponentially Time-Smoothed Stochastic Gradient Descent optimization algorithm (DTS-SGD), which allows a neural network to adapt to a time-varying function. To the best of our knowledge, this is the first attempt at applying deep learning in a time-varying context.
The motivating application of our work is in predicting cross-sectional stock returns at time $t$ using history up to $t-1$. The literature has documented evidence of non-stationarity in the true asset pricing model. Bossaerts and Hillion (1999) studied the model selection problem in the context of predicting international stocks and found that models chosen by common statistical selection criteria fail to retain predictive power out of sample, which is indicative of model non-stationarity. Pesaran and Timmermann (1995) documented similar findings in U.S. stocks. Every month, the authors performed linear regressions with permutations of regressors and compared both statistical and financial measures for model selection. Both predictability and regression coefficients of the selected model changed over time.
Machine learning in financial markets is still in its infancy. Weigand (2019) provided a recent survey of the state of the art for machine learning applied to empirical finance and noted that the literature is dominated by regression-based techniques. More recent works employed “gold standard” techniques, such as dropout (Srivastava et al., 2014) and batch normalization (Ioffe and Szegedy, 2015), as well as momentum and learning rate decay in optimization (Kingma and Ba, 2015). Messmer (2017) performed a random search over hyperparameter space for the best neural network configuration for stock return prediction. The author found that the best network had 7 hidden layers, 78 hidden units, a low learning rate of just 10
, and achieved
5 % of linear regression in U.S. large and mid-cap stocks
. Abe and Nakayama (2018) compared neural networks to Support Vector Machine and random forests in predicting one-month ahead Japanese stock returns. Performance metrics were rank correlation and directional accuracy, arguing that these are more relevant to investors. Batch size was set to the entire stock universe in each month. The best performing model (by rank correlation) had 8 hidden layers and achieved rank correlation of 5.82 %. Gu et al. (2019) compared several machine learning algorithms for predicting monthly returns of U.S. equities. The data set consists of all stocks listed on NYSE, AMEX and NASDAQ, with 94 firm characteristics, 74 sector dummy variables, as well as interaction terms with 8 macroeconomic indicators resulting in 920 features. The authors reported both tree-based algorithms and neural networks resulted in an improved
4 % respectively. Shallow networks outperformed deeper networks, which the authors attributed to the small data set and low signal-to-noise ratio. The best performing network had three hidden layers, with 32–16–8 nodes for each hidden layer respectively. This observation is particularly interesting: if stock returns are a result of complex interactions of factors, then one would expect a deeper and/or wider network to perform well. The majority of the data set was made available to the public.
To the best of our knowledge, this is the first time a comprehensive U.S. equities feature set was released, providing a rich source of relatively untapped data for machine learning research. For this reason, the work of Gu et al. (2019) forms the basis of this paper. Our contributions in this paper are as follows:
• We propose the online early stopping algorithm, which allows the network to track a time-varying function. We achieve mean rank correlation of 4.69 % on the U.S. equities data set, compared to 2.44 % under the expanding window approach studied in Gu et al. (2019). This algorithm can be applied to an existing network and requires significantly less time to train than the setup used in Gu et al. (2019).
• We show that features exhibit time-varying importance and that the true model changed over time. We find that certain features, such as market capitalization (the size effect), faded in importance over time. This highlights the importance of having a non-stationary model.
• We provide an alternative viewpoint to the shallow–deep learning debate. Our analysis suggests that only a small set of features contributed to predictive performance. This may be due to most features lacking predictive power, or to features being correlated, with L1 regularization encouraging the network to use only a subset. This would likely lead to a simpler network.
In the rest of this paper, we denote the algorithm of Gu et al. (2019) as DNN (Deep Neural Network) and our proposed Online Early Stopping as OES. This paper is organized as follows. Section 2 defines our cross-disciplinary problem and surveys existing work on this problem, covering online optimization and deep learning. Section 3 outlines the main contribution of this paper: the OES algorithm, which introduces non-stationarity to a neural network and improves over DTS-SGD. Data and experimental setup are outlined in Section 4 and results are presented in Section 5. Finally, Section 6 discusses the empirical finance problem and some future work.
2.1 Definitions
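We denote vectors with bold lower-case letters and matrices with bold upper-case letters. The return of the $i$-th stock at time $t$ is denoted $r_{i,t}$. To simplify notation, we define the return of stock $i$ as the return over the next period, i.e.,
$$r_{i,t} = \frac{p_{i,t+1} + d_{i,t+1}}{p_{i,t}} - 1,$$
where $p_{i,t}$ is price at time $t$ and $d_{i,t}$ is dividend at $t$ if a dividend is paid, and zero otherwise.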
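Similar to a classical online learning setup, a player iteratively makes portfolio allocation decisions at each time period. We call this iterative process per interval training. There are $n$ stocks in the market, each with $m$ characteristics, forming input matrix $\mathbf{X}_t \in \mathbb{R}^{n \times m}$, where $\mathbf{x}_{i,t}$ is the feature vector of stock $i$. The player predicts stock returns $\hat{\mathbf{r}}_t = F(\mathbf{X}_t; \boldsymbol{\theta}_t)$ by choosing $\boldsymbol{\theta}_t$, which parameterizes prediction function $F$. The market reveals $\mathbf{r}_t$ and, for regression purposes, the investor incurs squared loss,
$$J_t(\boldsymbol{\theta}_t) = \lVert \mathbf{r}_t - F(\mathbf{X}_t; \boldsymbol{\theta}_t) \rVert^2.$$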
We adopt the same customary assumptions in online optimization as Aydore et al. (2019):
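• $J_t$ is bounded: $|J_t(\boldsymbol{\theta})| \le M$.
• $J_t$ is $L$-Lipschitz: $|J_t(\boldsymbol{\theta}_1) - J_t(\boldsymbol{\theta}_2)| \le L \lVert \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2 \rVert$.
• $J_t$ is $\beta$-smooth: $\lVert \nabla J_t(\boldsymbol{\theta}_1) - \nabla J_t(\boldsymbol{\theta}_2) \rVert \le \beta \lVert \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2 \rVert$.
We denote the gradient of $J_t$ at $\boldsymbol{\theta}$ as $\nabla J_t(\boldsymbol{\theta})$ and the stochastic gradient as $\hat{\nabla} J_t(\boldsymbol{\theta})$, with $\mathbb{E}[\hat{\nabla} J_t(\boldsymbol{\theta})] = \nabla J_t(\boldsymbol{\theta})$, or, where the context is obvious, $\nabla J$ and $\hat{\nabla} J$ respectively.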
The true function drifts over time and is approximated by $F$ with time-varying $\boldsymbol{\theta}_t$. The investor's objective is to minimize the loss incurred by choosing the best $\boldsymbol{\theta}_t$ using observed history up to $t-1$. Both the functional form and the time-varying dynamics of $\boldsymbol{\theta}_t$ are not known. Hence a neural network is used to model the cross-sectional relationship at each $t$, and the non-stationarity is formulated as a network weights tracking problem.
In the simplest sense, a fully connected neural network consists of an input layer, one or more hidden layers, and an output layer. The output of each layer acts as input to the next layer and loss is “backpropagated” by taking the partial derivative of loss with respect to weights. Each layer consists of an activation function $\sigma$, weight matrix $\mathbf{W}$ and bias $\mathbf{b}$, and produces output $\sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ from layer input $\mathbf{x}$. The $i$-th layer of the network is denoted as $F^{(i)}$. For brevity, we drop the layer designation, and denote the entire network as $F$ and the weight vector set as $\boldsymbol{\theta} = \{\mathbf{W}^{(i)}, \mathbf{b}^{(i)}\}_{i=1}^{l}$, where $l$ is the number of layers. The network is trained with stochastic gradient descent (or variants) at time $t$ (but dropping the subscript $t$ for simplicity as the context is clear),
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta \hat{\nabla} J(\boldsymbol{\theta}_k),$$
where $\boldsymbol{\theta}_k$ is the weight vector at optimization iteration $k$ and $\eta$ is step size. At time $t$, $\tau_t$ denotes the number of optimization iterations that are used to train the network. Interested readers are referred to the text by Goodfellow et al. (2016) for a comprehensive review of neural networks.
2.2 Early stopping in neural network training
High learning capacity models such as neural networks are prone to overfitting. Optimization can be terminated early based on a stopping criterion determined using a validation data set. This procedure is called early stopping. An effective stopping criterion is to monitor loss on a validation set (Morgan and Bourlard, 1990; Reed, 1993; Prechelt, 1998; Mahsereci et al., 2017), where a portion of data is reserved for this set. Training is stopped when the validation loss decreases by less than a predefined amount. Algorithm 1 contains the schematics of an early stopping algorithm with one modification adapted from Algorithm 7.1 and Algorithm 7.2 in Goodfellow et al. (2016). Validation is performed before the first training step to allow for the case where the optimal number of training steps is 0 (i.e., we start from the optimal weights).
Early stopping can be seen as a regularization technique that limits the optimizer to search in the parameter space near the starting parameters (Sjöberg and Ljung, 1995; Goodfellow et al., 2016). In particular, given $\tau$ optimization steps and step size $\eta$, the product $\eta\tau$ can be interpreted as the effective capacity, which bounds the reachable parameter space from the starting weights $\boldsymbol{\theta}_0$, thus behaving like $L^2$ regularization (Goodfellow et al., 2016).
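The early stopping procedure described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Algorithm 1: the one-parameter linear model, the names `mse`, `grad` and `early_stopping`, and the tolerance `tol` are all assumptions made for the example. The one feature taken from the text is that validation loss is evaluated before the first update, so the procedure can return zero steps.

```python
def mse(theta, data):
    """Mean squared error of the toy model y = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Gradient of mse with respect to theta."""
    return sum(-2.0 * x * (y - theta * x) for x, y in data) / len(data)


def early_stopping(theta, train, valid, lr=0.05, max_steps=500, tol=1e-6):
    """Gradient descent on `train`, stopped using the loss on `valid`.

    Returns the best weights found and the number of steps taken to
    reach them. Validation happens before the first update, so a
    result of 0 steps is possible.
    """
    best_theta, best_loss, best_step = theta, mse(theta, valid), 0
    for step in range(1, max_steps + 1):
        theta -= lr * grad(theta, train)
        loss = mse(theta, valid)
        if best_loss - loss < tol:  # improvement too small: stop
            break
        best_theta, best_loss, best_step = theta, loss, step
    return best_theta, best_step


# Toy data: the training slope is 2.0 but the validation slope is 1.5,
# so descending on the training loss eventually overshoots.
xs = [-1.0, -0.5, 0.5, 1.0]
train_set = [(x, 2.0 * x) for x in xs]
valid_set = [(x, 1.5 * x) for x in xs]

theta_a, steps_a = early_stopping(0.0, train_set, valid_set)
theta_b, steps_b = early_stopping(1.5, train_set, valid_set)
```

Starting from 0.0, training stops near the validation optimum after a positive number of steps; starting from 1.5, which is already optimal for the validation set, it stops at zero steps.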
For time series problems where chronological ordering is important, popular approaches include expanding window (each new time slice is added to the panel data set) and rolling window (the oldest time slice is removed as a new time slice is added) (Rossi and Inoue, 2012). Instead of randomly splitting training and test sets, the end of the series can be withheld for evaluation. This is unsatisfactory for two reasons. First, each time period is drawn from a different data distribution (hereon denoted as $\mathcal{D}_t$, the data set drawn at time $t$). A pooled regression with window size $k$ effectively assumes that data at $t+1$ is drawn from the same distribution as the pooled window $\mathcal{D}_{t-k+1}, \dots, \mathcal{D}_t$. Second, if data is scarce in terms of time periods (for instance, monthly data with a window size of 12 months), estimates of the optimal number of optimization steps $\hat{\tau}$ can have large stochastic error. To the best of our knowledge, there is no procedure for adapting early stopping to be used in an online and time-varying context.
2.3 Online optimization
Optimizing network weights to track a function evolving under unknown dynamics is an online optimization problem. This section discusses the relevant concepts in online optimization; interested readers are encouraged to read the text by Shalev-Shwartz (2012) for an introduction. In the online optimization literature, the iterate is often denoted as $\mathbf{x}$ and the loss function as $f$. We have used $\boldsymbol{\theta}$ as the iterate to be consistent with our parameter of interest and $J$ to avoid conflict with our use of $f$.
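In online convex optimization, a player iteratively chooses iterate $\boldsymbol{\theta}_t \in \Theta$, where $\Theta$ is a set of admissible iterates. Nature reveals a potentially adversarial loss function $J_t \in \mathcal{J}$ and the player incurs loss $J_t(\boldsymbol{\theta}_t)$, where $\mathcal{J}$ is a convex set of loss functions. The most basic performance measure of an online learning algorithm is the static regret (also called average regret), defined as the difference in average loss between the player and the best fixed optimum in hindsight. More formally,
$$\mathrm{Regret}_T = \frac{1}{T}\sum_{t=1}^{T} J_t(\boldsymbol{\theta}_t) - \frac{1}{T}\sum_{t=1}^{T} J_t(\boldsymbol{\theta}^*),$$
where $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta} \in \Theta} \sum_{t=1}^{T} J_t(\boldsymbol{\theta})$ is the best minimizer over all $T$ rounds. The goal is to design algorithms that minimize $\mathrm{Regret}_T$, the deficit accumulated against the best minimizer. One of the simplest online learning algorithms for static regret is the Follow-the-Leader algorithm (Kalai and Vempala, 2005; Shalev-Shwartz, 2012), defined as,
$$\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \Theta} \sum_{s=1}^{t-1} J_s(\boldsymbol{\theta}).$$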
At each round t, the algorithm simply selects the best minimizer in the data seen to date. This algorithm shares some resemblance to the expanding window training scheme used by Gu et al. (2019), where the model is re-trained on the entire data set at every interval and will converge to the best fixed optimum. There are two limitations with average regret: the distribution of loss functions J must be stationary, and J must be convex.
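To make the Follow-the-Leader rule concrete, the sketch below applies it to a toy one-dimensional problem. The squared loss $J_s(\theta) = (\theta - y_s)^2$ and the function name `follow_the_leader` are assumptions chosen for illustration; with this loss, the minimizer of the summed past losses is simply the running mean of the targets seen so far.

```python
def follow_the_leader(targets, theta0=0.0):
    """Follow-the-Leader for squared losses J_s(theta) = (theta - y_s)^2.

    At each round the player picks the minimizer of the sum of all
    previously revealed losses, which here is the mean of past targets.
    `theta0` is used in the first round, when no loss has been seen.
    """
    choices, total = [], 0.0
    for t, y in enumerate(targets):
        theta = total / t if t > 0 else theta0
        choices.append(theta)
        total += y  # the loss for this round is revealed after choosing
    return choices


# With a drifting target the choices lag further and further behind,
# because every past round carries equal weight.
drifting = [1.0, 3.0, 5.0, 7.0]
print(follow_the_leader(drifting))
```

The growing gap between the target and the chosen iterate in this example is the behaviour that motivates moving from a fixed optimum to a tracking formulation.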
Recently, Hazan et al. (2017) extended online convex optimization to the non-convex and stationary case. This was further extended by Aydore et al. (2019) to the non-convex and non-stationary case, proposing to measure performance with dynamic local regret,
$$\mathrm{DLR}_w(T) = \sum_{t=1}^{T} \lVert \nabla S_{t,w,\alpha}(\boldsymbol{\theta}_t) \rVert^2, \qquad S_{t,w,\alpha}(\boldsymbol{\theta}_t) = \frac{1}{W}\sum_{i=0}^{w-1} \alpha^i J_{t-i}(\boldsymbol{\theta}_{t-i}),$$
where $S_{t,w,\alpha}(\boldsymbol{\theta}_t)$ is the exponentially weighted history of loss functions, $W = \sum_{i=0}^{w-1} \alpha^i$ is the normalization factor, and $\alpha \in (0, 1)$. Non-convex optimization is NP-hard. Therefore, existing non-convex optimization algorithms focus on finding local minima (Hazan et al., 2017). For this reason, dynamic local regret is measured as the sum of the (squared) norm of the weighted average training gradient, rather than against a best-in-hindsight optimum (as in static regret). To minimize dynamic local regret, Aydore et al. (2019) proposed the DTS-SGD algorithm, as presented in Algorithm 2. Note that weights are updated with the weighted sum of past evaluated gradients $\hat{\nabla} J_{t-i}(\boldsymbol{\theta}_{t-i})$, rather than $\hat{\nabla} J_{t-i}(\boldsymbol{\theta}_t)$, which is past gradient functions evaluated at the current $\boldsymbol{\theta}_t$ (equivalent to the average gradient over a pooled data set).
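The update described above, a step along an exponentially weighted average of past evaluated gradients, can be sketched as below. This is an illustrative reading of the text and not a reproduction of Algorithm 2: the function name `dts_sgd_step`, the scalar weight, and the toy squared loss in the usage lines are assumptions.

```python
from collections import deque


def dts_sgd_step(theta, past_grads, lr, alpha):
    """One weight update from exponentially weighted past gradients.

    `past_grads[0]` is the most recent gradient. Each stored gradient
    was evaluated at the weights of its own period, not re-evaluated
    at the current weights.
    """
    weights = [alpha ** i for i in range(len(past_grads))]
    norm = sum(weights)  # normalization factor W
    smoothed = sum(w * g for w, g in zip(weights, past_grads)) / norm
    return theta - lr * smoothed


# Usage: track a drifting scalar target y_t with the assumed loss
# J_t(theta) = (theta - y_t)^2 and a window of w = 3 gradients.
theta, window = 0.0, deque(maxlen=3)
for y in [1.0, 1.1, 1.2, 1.3]:
    window.appendleft(2.0 * (theta - y))  # gradient at the current weights
    theta = dts_sgd_step(theta, window, lr=0.1, alpha=0.5)
```

Because only one update is made per period, the size of each move depends entirely on the learning rate and the smoothed gradient, which is the single-update limitation discussed in the analysis that follows.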
In analyzing DTS-SGD, we note two potential weaknesses. Firstly, neural networks are notoriously difficult to train. The geometry of the loss function is plagued by an abundance of local minima and saddle points (see Chapter 8.2 of Goodfellow et al., 2016). Momentum and learning rate decay strategies (for instance, Sutskever et al., 2013; Kingma and Ba, 2015) have been introduced, which require multiple passes over training data, adjusting the learning rate each time to better traverse the loss surface. DTS-SGD makes a single weight update at each time period, which may have difficulty traversing highly non-convex loss surfaces. Secondly, during our simulation tests, we observed that loss can increase after a weight update. One possibility is that a past gradient is taking the weights further away from the current local minimum. On our U.S. equities data set, we observed exploding gradients when using DTS-SGD and could not complete training. Our proposed OES algorithm addresses both issues by allowing multiple passes over the training data set and is compatible with existing optimizers (for example, Kingma and Ba, 2015).
3.1 Tracking a restricted optimum
In this section we present our main theoretical results. Our goal is to track the unobserved minimizer of as closely as possible. In regret analysis, it is desirable to have regret that scales sub-linearly to T, which leads to asymptotic convergence to the optimal solution. Hazan et al. (2017) demonstrated that in the non-convex case, a sequence of adversarially chosen loss functions can force any algorithm to suffer regret that scales with
. Locally smoothed gradients (over a rolling window of w loss functions) were used to improve regret, with a larger w advocated by Hazan et al. (2017) which leads to lower regret. Aydore et al. (2019) extended this to use rolling weighted average of past gradients which gives recent gradients a higher weight to track a dynamic function. Inevitably, smoothing will track a time-varying minimizer with a tracking error that is proportionate to the averaging window size and smoothing parameter
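To address this, we propose a tracking target for our algorithm. At time $t$, the online player selects $\boldsymbol{\theta}_t$ having observed losses up to $J_{t-1}$. As our goal is to closely track the underlying function, we propose to restrict the admissible weight set to the path formed from the starting weights extending along the gradient vector $\nabla J_{t-1}$. The point along this path with the minimum $J_t$ is the restricted optimum. To illustrate, let $\boldsymbol{\theta}$ be the starting point of optimization and $g = \nabla J_{t-1}(\boldsymbol{\theta})$. The possible scenarios during training are (also illustrated in Figure 1):
1. If the angle between $g$ and $\nabla J_t(\boldsymbol{\theta})$ is less than $\pi/2$, then moving along $g$ will also improve $J_t(\boldsymbol{\theta})$. When $g$ is perpendicular to $\nabla J_t(\boldsymbol{\theta})$, $\boldsymbol{\theta}$ has reached a local minimum of $J_t$ along the path.
2. If the angle is greater than $\pi/2$, then following $g$ will not improve $J_t(\boldsymbol{\theta})$ and training should terminate.
Figure 1: As $\boldsymbol{\theta}$ moves along the direction of $g$, if the angle between $g$ and $\nabla J_t(\boldsymbol{\theta})$ is less than $\pi/2$ (left), then training will improve $J_t(\boldsymbol{\theta})$. If the angle is greater than $\pi/2$ (right), then training will not improve $J_t(\boldsymbol{\theta})$.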
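The angle condition can be checked numerically with a dot product. To first order, a small descent step of size $\eta$ on the training gradient $g$ changes the target loss by approximately $-\eta\, g \cdot h$, where $h$ is the gradient of the target loss at the same weights, so the target loss improves only while $g \cdot h$ is positive. The helper names and the symbol `h` below are assumptions introduced for this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def angle_between(g, h):
    """Angle in radians between two gradient vectors."""
    cos = dot(g, h) / math.sqrt(dot(g, g) * dot(h, h))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp rounding error


def training_helps(g, h):
    """True while a descent step on the training gradient g also
    reduces the target loss whose gradient is h (angle below pi/2)."""
    return dot(g, h) > 0.0


# Left panel of the figure: gradients roughly aligned.
print(training_helps([1.0, 0.0], [1.0, 1.0]))
# Right panel: the angle exceeds pi/2, so training should stop.
print(training_helps([1.0, 0.0], [-1.0, 1.0]))
```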
This observation motivates our online early stopping algorithm. In this section, we will use to denote restricted optimal weights at
to denote the online player’s choice of weights. Suppose
evolves under the dynamics of,
where is sampled from an unknown distribution.
can be interpreted as a distance measure which provides the optimal prediction weights on
are restricted to travelling along the direction of
). In this context,
is the minimum gradient suffered by the player. Next, let
optimal number of optimization steps at time
be the estimated number of optimization steps. At iteration t, we solve optimal optimization steps
We start from 2 as solving
which we have not yet observed. This leads to optimal weights (the restricted optimum) trained on
for prediction on
and can be approximated by,
which implies . To make predictions on
and train
Thus, is a measure of expected variations between
. Using our
-smooth assumption (in Section 2.1) and substituting in definitions of
where we start from t = 2 as our algorithm requires at least 2 observations. The elegance of Equation 8 is that it conforms with the conventional notion of regret, with cumulative gradient deficit against an optimal outcome in place of cumulative loss. As is the unbiased estimator of
, Equation 8 indicates that the cumulative deficit is asymptotically bounded by the variance of
. This concept is illustrated in Figure 2. If
is constant, then
will converge to
the optimal weights are achieved. Conversely, if
has high variance, then the player will suffer a larger cumulative gradient deficit.
3.2 Online early stopping algorithm
Our strategy is to modify the early stopping algorithm to recursively estimate and is outlined in Algorithm 3. Algorithm 3 outlines the online early stopping
Figure 2: Illustration of estimating . Suppose
= [
] is a row vector with two elements. Twenty one random
vectors were drawn with each
pair represented as an arrow. The circle has radius
is regularized by limiting how far it can travel from
which is
.
procedure which consists of two steps: (i) recursively estimate optimal training steps 2 by training on
and validating on
(line 3); (ii) train on
on line 3 is outlined in Algorithm 1. At each iteration, two trailing intervals of data are used to train
. On line 3, optimal weights at
2 (or randomly initialized if t = 2) is trained on
and validated on
(line 3) which rolls
forward by one period. The network is then trained on
iterations. At this point,
represents the best estimated weights
and is ready to be used for predictions. At the next iteration, we start from
(which has been validated against
) in order to estimate
1. In our implementation of the algorithm, we have used stochastic gradient ˆ
instead of the full gradient
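The two-step procedure can be sketched on a toy problem as follows. This is a simplified illustration under several assumptions and should not be read as Algorithm 3 itself: the model is a one-parameter linear regression, full gradients are used instead of stochastic gradients, the step estimate is taken to be the running mean of the early stopping step counts, and all function and variable names are invented for the example.

```python
def mse(theta, data):
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    return sum(-2.0 * x * (y - theta * x) for x, y in data) / len(data)


def train(theta, data, steps, lr):
    for _ in range(steps):
        theta -= lr * grad(theta, data)
    return theta


def early_stop(theta, train_set, valid_set, lr, max_steps=200):
    """Train on train_set until the loss on valid_set stops falling."""
    best_theta, best_loss, best_steps = theta, mse(theta, valid_set), 0
    for step in range(1, max_steps + 1):
        theta -= lr * grad(theta, train_set)
        loss = mse(theta, valid_set)
        if loss >= best_loss:
            break
        best_theta, best_loss, best_steps = theta, loss, step
    return best_theta, best_steps


def online_early_stopping(intervals, theta0=0.0, lr=0.05):
    """Return (t, weights used to predict interval t, steps found)."""
    carried = theta0           # early-stopped weights carried forward
    tau_hat, n_seen = 0.0, 0   # running estimate of training steps
    out = []
    for t in range(2, len(intervals)):
        older, newer = intervals[t - 2], intervals[t - 1]
        # (i) train on the older interval, validate on the newer one
        carried, tau_star = early_stop(carried, older, newer, lr)
        n_seen += 1
        tau_hat += (tau_star - tau_hat) / n_seen  # assumed estimator
        # (ii) train on the newer interval for the estimated steps
        theta_pred = train(carried, newer, round(tau_hat), lr)
        out.append((t, theta_pred, tau_star))
    return out


# Toy data: the true slope drifts upward by 0.1 per interval.
xs = [-1.0, -0.5, 0.5, 1.0]
slopes = [1.0 + 0.1 * t for t in range(10)]
intervals = [[(x, s * x) for x in xs] for s in slopes]
results = online_early_stopping(intervals)
```

In this example the prediction weights for interval t end up close to the slope of interval t - 1, so the tracking error is roughly one period of drift. The next iteration restarts from the early-stopped weights rather than from the prediction weights, following the description above.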
In this work, we conduct two empirical studies. The first is based on simulated data and highlights the use of online early stopping; the second predicts U.S. stock returns based on the data set in Gu et al. (2019).
4.1 Simulation
To illustrate the use of online early stopping, we have created the following synthetic data set:
• T = 180 months, each month consists of n = 200 observations.
• Each observation has m = 100 features, forming input matrix of and output vector
• Let be the value of feature j of stock i at time t. Each feature value is randomly set to
• Each feature has latent relationship 1). Latent relationship follows a Wiener process and drifts over time.
• Each output value is Thus,
is non-linear with respect to
and the relationship changes over time.
We have used the same network setup as Section 4.2 (outlined in Table 1) but with a batch size of 50. DNN was re-fitted at every 10th interval. Hyperparameters for OES were chosen using the first 60 intervals as training data and the next 60 intervals as validation data. Out-of-sample performance was calculated on the remaining 60. DTS-SGD follows the same training scheme as OES, with additional hyperparameters:
4.2 Model and U.S. equities data
The U.S. equities data set in Gu et al. (2019) consists of all stocks listed on NYSE, AMEX, and NASDAQ from March 1957 to December 2016. The average number of stocks exceeds 5,200. Excess returns are calculated as forward one-month stock returns over Treasury-bill rates. Covariates include 94 firm-level features, 74 industry dummy variables (based on the first two digits of the SIC code) and interaction terms with 8 macroeconomic indicators. Firm-level characteristics include share-price-based measures, valuation metrics and accounting ratios. The purpose of interacting firm-level characteristics with macroeconomic indicators is to capture any time-varying dynamics that are related to macroeconomic indicators (which are common across all stocks). For instance, suppose valuation metrics have a stronger relationship with stock returns during periods of high inflation. Then this information will be encoded in the interaction term. The aggregated data set therefore contains 94 × (8 + 1) + 74 = 920 features. Each feature has been appropriately lagged to avoid look-ahead bias, and is cross-sectionally ranked and scaled to [−1, 1].
Table A.6 in Gu et al. (2019) contains the full list of firm characteristics.
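As a small illustration of the preprocessing and feature count described above, the sketch below rank-scales one cross-section of a single feature and reproduces the arithmetic for the aggregated feature set. The helper name `rank_scale` and the tie-breaking rule (ties broken by position) are assumptions; the treatment of ties and missing values in the original data is not specified here.

```python
def rank_scale(values):
    """Map one cross-section of a feature to ranks scaled to [-1, 1].

    Ties are broken by position, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scaled = [0.0] * n
    for rank, i in enumerate(order):
        scaled[i] = 2.0 * rank / (n - 1) - 1.0 if n > 1 else 0.0
    return scaled


# Three stocks in one month: smallest maps to -1, largest to +1.
print(rank_scale([10.0, 30.0, 20.0]))

# Feature count: each of the 94 firm characteristics appears once on
# its own and once per macroeconomic indicator, plus 74 dummies.
n_firm, n_macro, n_industry = 94, 8, 74
n_features = n_firm * (n_macro + 1) + n_industry
print(n_features)
```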
A subset of the data is available on Dacheng Xiu’s website, which contains 94 firm-level characteristics and 74 industry classifications. Our main result uses 94 + 74 = 168 firm-level features, but results with the full 920 features are also provided as a comparison. At this point, it is useful to remind readers that our goal is to track a non-stationary function when time-varying dynamics are unknown. In other words, we assume that time-varying dynamics between stock returns and features are not well understood or are unobservable. As such, the subset of data without interaction terms is sufficient for our problem. If macroeconomic indicators do encode time-varying dynamics, our network will track changing macroeconomic conditions automatically.
Data is divided into 18 years of training (from 1957 to 1974), 12 years of validation (1975-1986), and 30 years of out-of-sample tests (1987-2016). Training and validation sets are rolled forward by 12 months at the end of every December and the model is re-fitted. We use monthly total returns of individual stocks from CRSP. Where stock price is unavailable at the end of month, we use the last available price during the month. Table 1 records test configurations as outlined in Gu et al. (2019) and in our replication. A total of six hyperparameter combinations were tested. Batch size of 1,000 for OES was chosen arbitrarily.
Table 1: Disclosed model parameters in Gu et al. (2019) and in our replication. We have filled missing values with the cross-sectional median or zero if median is unavailable. “H” is hidden layer activation. “O” is output layer activation.
To train OES, we have kept the first 18 years (to 1974) as training data and the next 12 years (to 1986) as validation data. For each hyperparameter combination, we have trained an online learner up to 1986. Hyperparameter search is only performed on this period, as opposed to every year in Gu et al. (2019). As the algorithm does not depend on a separate set of data for validation, we simply take the hyperparameter set with the lowest monthly average MSE over 1975-1986 as the best configuration to use for the rest of the data set.
5.1 Performance metrics
As outlined in Section 2.1, our problem is based on an investor making iterative portfolio allocation decisions. Gu et al. (2019) used pooled $R^2_{oos}$ without de-meaning as the main performance metric,
$$R^2_{oos} = 1 - \frac{\sum_{(i,t) \in \mathcal{T}} (r_{i,t} - \hat{r}_{i,t})^2}{\sum_{(i,t) \in \mathcal{T}} r_{i,t}^2},$$
where $\mathcal{T}$ is the pooled out-of-sample data set covering January 1987 to December 2016. This is a viable performance metric when one is interested in measuring prediction accuracy over all periods as a whole, but does not tell us how well an investor would have done on average over time. Secondly, asset returns are known to exhibit non-Gaussian characteristics (Cont, 2001). Our analysis of stock returns (Table 2) largely confirms that the data set is potentially impacted by outliers. Therefore, we provide two additional metrics. First, average monthly Spearman’s rank correlation as a non-parametric measure that does not depend on the variance of the dependent variable,
$$\bar{\rho} = \frac{1}{T}\sum_{t=1}^{T} \rho_S(\mathbf{r}_t, \hat{\mathbf{r}}_t),$$
where $\rho_S$ denotes Spearman’s rank correlation between realized and predicted returns in month $t$. This is also the primary performance metric in Abe and Nakayama (2018). Second, average monthly $R^2$ as a more conventional complement to $R^2_{oos}$.
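The two main metrics can be sketched in a few lines. The pooled $R^2$ below follows the definition in Gu et al. (2019), whose denominator is the sum of squared returns without de-meaning; the rank correlation is computed as the Pearson correlation of ranks. The function names are assumptions, and ties are broken by position rather than averaged, which is a simplification.

```python
def r2_oos(actual, predicted):
    """Pooled R^2 without de-meaning: 1 - SSE / sum(actual^2)."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - sse / sum(a * a for a in actual)


def ranks(values):
    """Ranks 0..n-1, with ties broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def mean_rank_correlation(months):
    """Average monthly rank correlation.

    `months` is a list of (actual, predicted) pairs, one per month.
    """
    return sum(spearman(a, p) for a, p in months) / len(months)
```

A prediction of zero for every stock gives a pooled $R^2$ of exactly zero under this definition, which is why it is a natural benchmark for noisy return data. Rank correlation, by contrast, is unchanged by any monotone rescaling of the predictions.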
5.2 Simulation results
In this section, we demonstrate the use case of our method. Our synthetic data requires the network to adapt to time-varying dynamics. Table 3 records results of the simulation. As expected, DNN struggled to learn the non-stationary relationships, with mean 26 % and mean rank correlation of
significantly outperformed the other two methods in this simple simulation, achieving mean
64 % and mean rank correlation of 69.63 %. There is a preference for higher
regularization and learning rate. In Aydore et al. (2019), the authors reported problems of exploding gradient with the original method in Hazan et al. (2017) and that DTS-SGD provided greater stability. In our simulation test, we observed gradient instability with DTS-SGD as well. During training, loss can increase after a weight update. This could be an issue with this general class of optimizers. Lastly, we find that mean
tends to be slightly lower than
Table 2: Descriptive statistics of monthly excess returns of U.S. equities over April 1957 to December 2016. Monthly excess returns appear to contain some outliers, particularly on the positive end.
Table 3: Simulation results. We observed gradient instability with DTS-SGD which may have contributed to its underperformance.
5.3 Predicting U.S. stock returns
Next, we compare our results against the results in Gu et al. (2019), keeping in mind that our method should be compared against DNN without interaction terms. DTS-SGD did not complete training with a reasonable range of hyperparameters due to exploding gradients and was omitted from this section. As an overarching comment, $R^2$ for both DNN and OES on U.S. stock returns is very low. In our replication (Table 4, without interactions), DNN achieved
OES achieved
51 %. However, OES performed substantially better on mean rank correlation, a non-parametric measure, at 4.68 % while DNN scored 2.44 %. We observed similar performance with or without interaction terms, suggesting that the 8 macroeconomic time series have little interaction effect with the 94 features. In the subsequent results in this section, we only report statistics without interaction terms.
Table 4: Predictive performance on U.S. equities. DNN outperforms OES on metrics which depend on the variance of predicted values but underperforms OES on rank correlation, which is non-parametric. Performance was similar with or without interaction terms. Mean hyperparameters are calculated over 10 ensemble networks and across all periods. “As reported” refers to results in Gu et al. (2019).
So why do the two metrics diverge? The answer lies in Table 5 and Figure 3. Here, we form decile portfolios based on predicted returns over the next month and track their respective realized returns. OES predicted values appear to span a wider range than those of DNN, which may have contributed to a lower $R^2$. DNN is trained on a pooled data set, which will average out time-varying effects. As a result, the average gradient will likely be smoother. This is evident from the lower mean L1 penalty and higher learning rate chosen by validation. By contrast, OES trains on
Table 5: Predicted and realized mean returns by decile, where each row represents a decile. P10-1 is mean returns of P10 less P1 and shows the return spread between the best predicted stocks relative to the worst predicted stocks. “As reported” refers to original results from Table A.9 in Gu et al. (2019).
Figure 3: Cumulative mean returns by decile sorted based on predictions by DNN and OES. PN indicates Decile N. High N portfolios (i.e., P10) should be higher and low N portfolios (i.e., P1) should be lower.
each period individually and the norm of the gradient presented to the network at each period is likely to be larger. This led to lower learning rate and higher mean penalty chosen by validation. Hence, variance of OES predicted values is higher and may require a higher level of regularization.
In Table 5 and Figure 3, we observe that the prediction performance of DNN is concentrated on the extremities, namely P1 and P10, with realized mean returns of 91 % respectively. Stocks between P3 and P7 are not well separated. By contrast, OES was better at ranking stocks across the entire spectrum. Realized mean returns of OES are more evenly spread across the deciles, resulting in mean rank correlation that is almost twice as high as DNN. For the case of an investor iteratively making portfolio allocation decisions, this also reinforces our argument that $R^2_{oos}$ may not be the best performance metric.
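The decile analysis used in this comparison can be sketched as follows: stocks are sorted by predicted return within a month, split into ten equally sized groups, and the realized mean return of each group is recorded. The function name `decile_returns` and the equal-weighting of stocks within each group are assumptions for this example, and the numbers in the usage lines are made up purely to exercise the code.

```python
def decile_returns(predicted, realized, n_bins=10):
    """Mean realized return of each prediction-sorted group.

    Group 0 (P1) holds the lowest predictions and the last group (P10)
    the highest. Requires at least n_bins observations.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    means = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(sum(realized[i] for i in members) / len(members))
    return means


# Made-up cross-section of 20 stocks with perfectly ordered returns.
predicted = [float(i) for i in range(20)]
realized = [0.01 * i for i in range(20)]
means = decile_returns(predicted, realized)
spread = means[-1] - means[0]  # the P10 minus P1 spread
```

A model that only separates the extremes can show a large P10 minus P1 spread while the middle groups remain flat, which is the pattern that rank correlation penalizes.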
5.4 Time-varying feature importance
So far, our tests are predicated on time-varying relationships between features and stock returns. How does feature importance change over time? To illustrate, at every month we train the OES model and make a prediction. Rank correlation for the baseline model is calculated, then each feature is set to zero in turn and rank correlation is calculated again. Feature importance of the j-th feature is measured as the change in rank correlation relative to the baseline. Note that a feature can have negative importance. For instance, the sum of importance for 1-month price momentum is strongly negative, indicating the network was betting on short-term reversal. This exercise is different to Section 3.3 in Gu et al. (2019), where a feature is set to zero before training, meaning the network can learn a different model to the baseline. Instead, our method measures what has been learned by the network.
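The zeroing procedure can be sketched as below for a single month. The trained network is represented by an arbitrary `predict` callable, and the sign convention (baseline rank correlation minus the rank correlation with the feature zeroed) is an assumption made for this example; the function names are likewise illustrative.

```python
def ranks(values):
    """Ranks 0..n-1, with ties broken by position (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def feature_importance(predict, features, returns):
    """Rank correlation lost when each feature is zeroed after training.

    `predict` maps a list of feature rows to a list of predictions and
    stands in for the already trained network, which is not re-fitted.
    """
    baseline = spearman(predict(features), returns)
    deltas = []
    for j in range(len(features[0])):
        zeroed = [row[:j] + [0.0] + row[j + 1:] for row in features]
        deltas.append(baseline - spearman(predict(zeroed), returns))
    return deltas


# Toy model that relies mostly on feature 0 and ignores feature 2.
def toy_predict(rows):
    return [r[0] + 0.01 * r[1] for r in rows]

X = [[1.0, 3.0, 9.0], [2.0, 1.0, 9.0], [3.0, 2.0, 9.0]]
y = [0.01, 0.02, 0.03]
importance = feature_importance(toy_predict, X, y)
```

Because the model is held fixed, a feature the network ignores has an importance of zero, regardless of how informative it might be to a re-trained model.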
First, we track feature importance over January 1987 to December 1991. The top 10 features with the highest absolute average delta were (in order of decreasing importance): dolvol (monthly traded value), mvel1 (log market capitalization), mom12m (12-month minus 1-month price momentum), ill (illiquidity), mom6m (6-month minus 1-month price momentum), idiovol (CAPM residual volatility), std_dolvol (36-month traded value volatility), maxret (30-day max daily return), turn (turnover), and betasq (CAPM beta squared). Rolling 12-month averages were calculated to provide a more discernible trend, as illustrated in Figure 4. Feature importance exhibits strong non-stationarity. Average rank correlation delta can move between positive and negative, indicating potential for periods of poor performance had an investor naively invested with a style, e.g., always investing with momentum. Features such as dolvol trended towards zero over time, indicating loss of explanatory power.
Next, we divide the out-of-sample period into six 5-year blocks and examine
Figure 4: Top 5 features based on rolling 12-month average rank correlation delta to baseline over 1987-1991. ill, mvel1 and dolvol have distinctly time-varying importance and have drifted towards zero over time, suggesting loss of explanatory power.
change in importance for all features. Figure 5 shows that a small set of features contributed most of the efficacy and that the set of best features can change over time. For instance, the size effect (mvel1) has diminished over time, consistent with the literature (e.g., Horowitz et al., 2000). This underscores the importance of having a dynamic model that adapts to changes in the true model.
5.5 Training Deeper Networks
Gu et al. (2019) document that shallow networks outperformed deeper networks. The authors found that performance peaked at three hidden layers for a neural network and that tree-based algorithms tend to select trees with fewer leaves. The authors attributed this tendency towards a compact model to the relatively small amount of data and low signal-to-noise ratio.
We confirm that performance peaks at three hidden layers. Table 6 records performance metrics of both DNN and OES with three, four and five hidden layers, with layer widths set by the pyramid rule. Mean rank correlation peaks at 2.44 % for DNN and falls to 1.96 % at five layers, while OES peaks at 4.69 % and falls to 4.02 %. However, we offer an alternative explanation for the lack of improvement with deeper networks. As observed in Figure 5, only a small subset of features contributes to the network's performance. The feature set contains many related features. For instance, both 6- and 12-month momentum scores are present, as well as return on equity, assets and invested capital. An inspection of cross-correlation confirms our hypothesis. For every month, we construct the cross-correlation matrix of all features, then average the correlation matrices over time. Clustering was applied using $1 - \rho_{i,j}$ as the distance measure, where $\rho_{i,j}$ is the correlation between the features in row i and column j of the matrix. The resultant correlation matrix is shown in Figure 6. The correlation matrix is suggestive of three broad families of features, as well as many features that do not fit in a family. The first family, in the top left corner, consists of size, liquidity and profitability. The second, in the middle, consists of volatility and accounting measures of liquidity (cash and quick ratios). The third, in the bottom right corner, consists of yield metrics. The effective feature set is therefore likely to be smaller than the 94 features listed. Another point to note is the use of $\ell_1$ regularization, which encourages a sparse network and could explain the low feature utilization.
Figure 5: Average rank correlation delta to baseline (in decimal) in 5-year blocks. The OES network appeared to use only a handful of features. Some features appear to have lost their importance over time (i.e., dolvol, maxret, mom12m and mvel1).
Table 6: Predictive performance on U.S. equities using different network topologies. The three-layer network has 32–16–8 nodes, the four-layer network has 64–32–16–8, and the five-layer network has 128–64–32–16–8.
Figure 6: Cross-correlation of all features, calculated each month then averaged over all time periods. Features are then clustered using correlation as a distance measure. The diagonal of the matrix has been set to zero. Groups of related features are visible in the data set.
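The averaging and clustering procedure described in this section can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the text does not specify the linkage method, so single linkage with a distance cut-off is assumed here, and the helper names (`pearson`, `corr_matrix`, `average_matrices`, `cluster`) and the synthetic four-feature data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def corr_matrix(columns):
    """Cross-correlation matrix for one month; `columns` holds one list per feature."""
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


def average_matrices(mats):
    """Element-wise average of the monthly correlation matrices."""
    k = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(k)]
            for i in range(k)]


def cluster(avg_corr, max_distance):
    """Agglomerative clustering on the distance 1 - correlation.

    Assumption: single linkage, merging until the closest pair of
    clusters is further apart than `max_distance`.
    """
    clusters = [{i} for i in range(len(avg_corr))]

    def dist(a, b):
        return min(1.0 - avg_corr[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, p, q = min((dist(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        if d > max_distance:
            break
        clusters[p] |= clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)


# Synthetic data: features 0 and 1 share one driver, features 2 and 3 another.
rng = random.Random(0)
monthly = []
for _ in range(3):  # three hypothetical months
    a = [rng.gauss(0, 1) for _ in range(200)]
    b = [rng.gauss(0, 1) for _ in range(200)]
    cols = [a, [v + 0.1 * rng.gauss(0, 1) for v in a],
            b, [v + 0.1 * rng.gauss(0, 1) for v in b]]
    monthly.append(corr_matrix(cols))

avg = average_matrices(monthly)
clusters = cluster(avg, max_distance=0.5)
```

On this toy input the two planted families are recovered as separate clusters. With real features, a library routine for hierarchical clustering would be the more practical choice.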
5.6 Using dropout for regularization
In Section 5.3, we observed that OES outperformed DNN on mean rank correlation, a non-parametric measure, but underperformed on $R^2$, which depends on the variance of the estimated values. We attributed this to inadequate regularization. In this section, we investigate the use of dropout in place of the $\ell_1$ penalty. Dropout is a popular regularization strategy in deep learning and works by randomly setting nodes to zero during training. It can be interpreted as a computationally efficient way of averaging over multiple sub-networks (Goodfellow et al., 2016).
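The mechanism can be illustrated with a minimal sketch. It assumes the common "inverted dropout" convention, in which surviving activations are rescaled by 1 / (1 - rate) so that their expected value is unchanged. The text does not state which convention was used, and the function name `dropout` and its arguments are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Assumption: inverted dropout, i.e. survivors are divided by the
    keep probability so that no rescaling is needed at prediction time.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000                      # hypothetical hidden-layer output
dropped = dropout(hidden, rate=0.4, rng=rng)

zero_fraction = sum(1 for v in dropped if v == 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
```

Roughly 40 % of the units are zeroed while the mean activation stays close to its original value. A different random subset is dropped at every training step, which is what gives rise to the sub-network averaging interpretation.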
Dropout layers were added to each hidden layer, with the dropout rate as an additional hyperparameter chosen from {0.1, 0.2, 0.3, 0.4, 0.5}. Results are recorded in Table 7, which compares $\ell_1$ and dropout regularization, both without interaction terms. All three performance measures of OES improved with dropout compared to the $\ell_1$ penalty. Whilst $R^2$ remains negative, it is now much closer to that of DNN. Optimal dropout rates tend to be high, at 41 %. Interestingly, DNN did not benefit from dropout. Mean learning rate ticked up, from 0.67 % to 0.79 %. We hypothesize that, as DNN is trained over all history, the magnitude of the gradient at every step is already low (relative to OES). Dropout resulted in an even lower magnitude and is thus over-regularizing. We envisage that future work could further explore the hyperparameter space, in particular whether using dropout can enable the training of deeper networks.
Table 7: Prediction results using $\ell_1$ (left) and dropout regularization (right), both without interaction terms. Dropout improved the predictive accuracy of OES but does not appear to benefit DNN.
Stock return prediction is an arduous task. The true model is noisy, complex and time-varying. Mainstream deep learning research has focused on problems that are stationary and, arguably, non-stationary applications have seen fewer advances. In this work, we propose an online early stopping algorithm that is easy to implement and can be applied to an existing network setup. We show that a network trained with OES can track a non-stationary function and achieves superior performance to DTS-SGD, a recently proposed online non-convex optimization technique. Our method is also significantly faster, as only two periods of training data are required at each iteration, compared to the pooled method used in Gu et al. (2019), which re-trains the network on the entire data set annually. In our tests, the pooled method took 5.5 hours to iterate through the entire data set (an ensemble of ten networks therefore takes 55 hours). By contrast, our method took 44.25 minutes for a single pass over the entire data set (an ensemble of ten networks took 7.4 hours).
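As a quick check of the timing figures quoted above, and assuming the ten ensemble members are trained one after another so that total time scales linearly with ensemble size, the arithmetic works out as:

```latex
\begin{align*}
T_{\text{pooled}} &= 10 \times 5.5\ \text{h} = 55\ \text{h},\\
T_{\text{OES}} &= 10 \times 44.25\ \text{min} = 442.5\ \text{min} \approx 7.4\ \text{h},\\
\frac{T_{\text{pooled}}}{T_{\text{OES}}} &= \frac{330\ \text{min}}{44.25\ \text{min}} \approx 7.5 .
\end{align*}
```

Under that assumption the online method is roughly 7.5 times faster per pass than the pooled method.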
Gu et al. (2019) suggested that a small data set and a low signal-to-noise ratio were reasons for the lack of improvement with a deeper network. We offer a complementary explanation by showing that only a handful of features contribute to prediction performance. This may be due to correlation between features and the use of $\ell_1$ regularization, which encourages sparsity. We also find evidence of time-varying feature importance. In particular, features such as log market capitalization (the size effect) and 12-month minus 1-month momentum appear to have lost their importance towards the end of our test period, a result which has strong implications for practitioners forecasting stock returns using known asset pricing anomalies. Lastly, we believe time-varying neural networks are a relatively unexplored domain of machine learning that has many potential applications and deserves further research.
Abe, M. and Nakayama, H. (2018). Deep learning for forecasting stock returns in the cross-section.
Aydore, S., Zhu, T., and Foster, D. P. (2019). Dynamic local regret for non-convex online forecasting. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 7980–7989. Curran Associates, Inc.
Bergmeir, C., Hyndman, R., and Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83.
Bossaerts, P. and Hillion, P. (1999). Implementing statistical criteria to select return forecasting models: What do we learn? The Review of Financial Studies, 12(2):405–428.
Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1:223–236.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Gu, S., Kelly, B. T., and Xiu, D. (2019). Empirical asset pricing via machine learning. Review of Financial Studies, forthcoming.
Hazan, E., Singh, K., and Zhang, C. (2017). Efficient regret minimization in non-convex games. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1433–1441, International Convention Centre, Sydney, Australia. PMLR.
Horowitz, J. L., Loughran, T., and Savin, N. (2000). The disappearing size effect. Research in Economics, 54(1):83–100.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456. JMLR.org.
Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1):65–91.
Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations, pages 1–13.
Mahsereci, M., Balles, L., Lassner, C., and Hennig, P. (2017). Early stopping without a validation set. CoRR, abs/1703.09580.
Messmer, M. (2017). Deep learning and the cross-section of expected returns. Working Paper.
Morgan, N. and Bourlard, H. A. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 630–637. Morgan-Kaufmann.
Pesaran, M. H. and Timmermann, A. (1995). Predictability of stock returns: Robustness and economic significance. Journal of Finance, 50:1201–1228.
Prechelt, L. (1998). Early stopping - but when? In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pages 55–69, London, UK. Springer-Verlag.
Reed, R. D. (1993). Pruning algorithms-a survey. Transactions on Neural Networks, 4(5):740–747.
Rossi, B. and Inoue, A. (2012). Out-of-sample forecast tests robust to the choice of window size. Journal of Business & Economic Statistics, 30(3):432–453.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6):1391–1407.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Dasgupta, S. and McAllester, D., editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA. PMLR.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In NIPS.
Banz, R. W. (1981). The relationship between return and market value of common stocks. Journal of Financial Economics, 9(1):3–18.
Weigand, A. (2019). Machine learning in empirical asset pricing. Financial Markets and Portfolio Management, 33:93–104.