"The Squawk Bot": Joint Learning of Time Series and Text Data Modalities for Automated Financial Information Filtering

2019·Arxiv

Abstract

Multimodal analysis that uses numerical time series and textual corpora as input data sources is becoming a promising approach, especially in the financial industry. However, the main focus of such analysis has been on achieving high prediction accuracy, while little effort has been spent on the important task of understanding the association between the two data modalities. Performance on the time series hence receives little explanation even though human-understandable textual information is available. In this work, we address the following problem: given a numerical time series and a general corpus of textual stories collected over the same period, discover in a timely manner a succinct set of textual stories associated with that time series. Towards this goal, we propose a novel multi-modal neural model called MSIN that jointly learns both numerical time series and categorical text articles in order to unearth the association between them. Through multiple steps of data interrelation between the two data modalities, MSIN learns to focus on a small subset of text articles that best align with the performance in the time series. This succinct set is discovered in a timely manner and presented as recommended documents, acting as automated information filtering for the given time series. We empirically evaluate the performance of our model on discovering relevant news articles for two stock time series, those of Apple and Google, along with the daily news articles collected from Thomson Reuters over a period of seven consecutive years. The experimental results demonstrate that MSIN achieves up to 84.9% and 87.2% recall of the ground-truth articles for the two examined time series, respectively, far superior to state-of-the-art algorithms that rely on the conventional attention mechanism in deep learning.

Introduction

Current multimodal analysis that combines time series with text data often focuses on extracting features from the text corpus and incorporating them into a forecasting model to enhance prediction. Little attention is paid to using text as a means of explanation for the patterns observed in the time series (Akita and others 2016; Weng, Ahmed, and Megahed 2017). In many emerging applications, given a time series, one may wish to find a small set of documents that reflect or influence the time series. Take “quantamental investing” (Wigglesworth 2018) as an example. When trading a stock, investors do not base their decisions solely on its historical prices. Rather, the decisions are made with careful consideration of the news and events collected from the markets. With the dramatically increasing amount of available news nowadays (Schumaker et al. 2012), a natural question to ask is which news is most relevant to a particular stock series. As illustrated in Figure 1, given recent historical values of the Apple stock time series as one input, and today’s textual news collected from public media as another input, “Steve Jobs threatened patent suit” is output as relevant news. Certainly, for different stock time series, the set of relevant associated news articles would be different. Likewise, in the cloud business, accurately associating a cloud monitoring metric (time series) with textual complaints from clients can help technicians focus on the right issues, greatly reducing the effort and expertise needed to resolve complaints. Though filtering relevant text stories can rely on keywords, we argue that accurately identifying such keywords associated with a specific time series is not an easy task, as it requires domain knowledge. Moreover, in applications like cloud systems, it is unclear what the keywords associated with each of thousands of monitoring time series would be. Furthermore, filtering information based on fixed keywords would potentially lose essential information, as the time series keeps changing and so do the textual contents. Hence an automated system is highly desirable.

Figure 1: Illustration of our model, which learns to discover daily the top relevant text articles associated with the current state (characterized by the last m time steps) of a given time series.

In this paper, we address the above-mentioned important yet challenging problem by developing a novel multi-modal neural network that jointly learns numerical time series and textual documents in order to discover the relation between them. The discovered text articles are returned as recommended documents for the current state of the time series. As shown in Fig.1, our model consists of (1) a textual module that learns representative vectors for input text documents, (2) the MSIN (Multi-Step Interrelation Network), which takes as input both the time series and the sequence of textual representative vectors to learn the association between them and hence discover relevant text documents, and (3) a concatenation and dense layer that aggregates information from the two data modalities to output the next value in the time series.

The novelty of our proposed model stems from the introduction of the Multi-Step Interrelation Network, which allows the incorporation of semantic information learnt from the textual modality into the modeling of every time step of the time series. We argue that multiple data interrelations are important and necessary in order to discover the association between the categorical text and numerical time series. This is because not only are they of different data types, but no prior knowledge is given to guide the machine learning model toward which text to look at in relation to a given time series. Hence, compared to existing multimodal approaches that learn two data modalities either sequentially or in parallel, our proposed MSIN network effectively leverages the mutual information impact between the two data modalities, textual representative vectors (learnt by the Text Module in Fig.1) and time series values, through multiple steps of data interrelation. During this process, it gradually assigns large attention weights to “important” text vectors while ruling out less relevant ones, and finally converges to an attention distribution over the text articles that best aligns with the current state of the time series. MSIN also technically relaxes the strict alignment between the two data modalities (e.g. at the time stamp level), allowing it to concurrently deal with two (or multiple) input data sequences of different lengths, which is currently not supported by most conventional recurrent neural networks (Hochreiter and Schmidhuber 1997; Cho and others 2014; Chung et al. 2014). We perform detailed empirical analysis over large-scale financial news and stock price datasets spanning seven consecutive years. The model was trained on data from the first six years and evaluated on the last year. We show that MSIN achieves strong performance of 84.9% and 87.2% in recalling ground-truth relevant news w.r.t. two specific stock time series, Apple (AAPL) and Google (GOOG), both prominent companies during the examined period.

Learning text articles representation

The first network component in our model is the textual module, which learns to represent text articles as numerical vectors so that they are comparable to the numerical time series. The order among words within each text article is important in learning their semantic dependencies. Hence, our network exploits the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) to learn such dependencies and aggregates them into an article’s representative vector. At each time stamp t, an input data sample to the textual module is a sequence of n text documents $\{doc_1, doc_2, \ldots, doc_n\}$ (e.g., n news articles released at day t by Thomson Reuters), and its output sample is a sequence of n representative vectors denoted $\{s_1, s_2, \ldots, s_n\}$ (we omit notation t in these $doc_j$’s and $s_j$’s to minimize clutter). Each text document $doc_j$ (with $j = 1, \ldots, n$) is in turn a sequence of at most K words denoted by $\{x^{txt}_1, x^{txt}_2, \ldots, x^{txt}_K\}$ (the superscript txt denotes the text modality). Each $x^{txt}_\ell \in \mathbb{R}^{V}$ is the one-hot vector representation of the $\ell$-th word in document $doc_j$, with V the vocabulary size. We use an embedding layer to transform each $x^{txt}_\ell$ into a low-dimensional dense vector $e_\ell \in \mathbb{R}^{d_w}$ via a linear transformation:

$$e_\ell = W_e\, x^{txt}_\ell, \qquad W_e \in \mathbb{R}^{d_w \times V} \tag{1}$$

Often, $d_w$ is much smaller than V, and $W_e$ can be trained from scratch; however, using or initializing it with pre-trained vectors from GloVe (Pennington and others 2014) can produce more stable results. In our examined datasets (see the Experiment section), we found that a small $d_w$ is sufficient given the vocabulary size V = 5000.
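To make the embedding step concrete, the following minimal sketch shows that multiplying an embedding matrix by a one-hot vector simply reads off one column of that matrix. The toy sizes (a vocabulary of 6 words and 3 embedding dimensions), the random weights, and the helper names `one_hot` and `embed` are assumptions for illustration only, not the settings or code used in our experiments.

```python
import random

def one_hot(index, vocab_size):
    """Return a one-hot list of length vocab_size with 1.0 at `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def embed(W_e, x_onehot):
    """Compute e = W_e x for a d_w-by-V matrix W_e stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, x_onehot)) for row in W_e]

# Toy sizes and random weights (assumptions for illustration only).
V, d_w = 6, 3
rng = random.Random(0)
W_e = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(d_w)]

word_index = 4
e = embed(W_e, one_hot(word_index, V))
column = [row[word_index] for row in W_e]  # the same vector, read off directly
```

In practice the product is therefore implemented as a table lookup, which is why a vocabulary of several thousand words adds little computational cost.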

The sequence of embedded words $\{e_1, \ldots, e_K\}$ for a document is then fed into an LSTM that learns to produce their corresponding encoded contextual vectors. The key components of an LSTM unit are the memory cell, which preserves essential information of the input sequence through time, and the non-linear gating units that regulate the information flow into and out of the cell. At each step $\ell$ (corresponding to the $\ell$-th word) in the input sequence, the LSTM takes in the embedding vector $e_\ell$, its previous cell state $c_{\ell-1}$ and the previous output vector $h_{\ell-1}$, to update the memory cell $c_\ell$, and subsequently outputs the hidden representation $h_\ell$ for $e_\ell$. From this view, we can briefly represent the LSTM as a recurrent function f as follows:

$$h_\ell = f(e_\ell, c_{\ell-1}, h_{\ell-1}) \tag{2}$$

in which the memory cell $c_\ell$ is updated internally. Both $c_\ell, h_\ell \in \mathbb{R}^{d_h}$, in which $d_h$ is the number of hidden neurons. Our implementation of the LSTM closely follows the one presented in (Zaremba and others 2014) with two extensions. First, in order to better exploit the semantic dependencies of the $\ell$-th word on both its preceding and following contexts, we build two LSTMs that take the sequence in the forward and backward directions respectively (denoted by the head arrows in Eq.(3)). This results in a bidirectional LSTM (BiLSTM):

$$\overrightarrow{h}_\ell = \overrightarrow{f}(e_\ell, \overrightarrow{c}_{\ell-1}, \overrightarrow{h}_{\ell-1}), \qquad \overleftarrow{h}_\ell = \overleftarrow{f}(e_\ell, \overleftarrow{c}_{\ell+1}, \overleftarrow{h}_{\ell+1}), \qquad h_\ell = [\overrightarrow{h}_\ell; \overleftarrow{h}_\ell] \tag{3}$$

The concatenated vector $h_\ell$ leverages the context surrounding the $\ell$-th word and hence better characterizes its semantics than the embedding vector $e_\ell$, which ignores the local context in the input sequence. Second, we extend the model by exploiting the weighted mean pooling over all vectors $h_\ell$ to form the overall representation of the entire j-th text document:

$$s_j = \sum_{\ell=1}^{K} \beta_\ell\, h_\ell \tag{4}$$

where

$$\beta_\ell = \frac{\exp\left(u^\top \tanh(W h_\ell)\right)}{\sum_{k=1}^{K} \exp\left(u^\top \tanh(W h_k)\right)} \tag{5}$$

in which $u \in \mathbb{R}^{2d_h}$ and $W \in \mathbb{R}^{2d_h \times 2d_h}$ are respectively referred to as the parameterized context vector and matrix, whose values are jointly learnt with the BiLSTM. This pooling technique resembles the successful one presented in (Conneau and others 2017), which learns multiple views over each input sequence. Our implementation simplifies it by adopting only a single view (u vector), with the assumption that each document (e.g., a news story or article) contains only one topic (relevant to the time series). Note that, similar to convolutional neural networks (Kim 2014), max pooling can also be used in place of the mean pooling in defining $s_j$. We, however, attempt to keep the model simple, since the max function is non-smooth and thus generally requires more training data to learn a proper transformation. We apply our text module BiLSTM to every text article collected at time period t, and its output is a sequence of representative vectors $\{s_1, \ldots, s_n\}$, each corresponding to one input text document.

Figure 2: Memory cell design of the multi-step interrelation network (MSIN).
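The attention-weighted pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the word vectors stand in for BiLSTM outputs, the matrix `W` and context vector `u` are hand-picked rather than learnt, bias terms are omitted, and the function name `attention_pool` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention_pool(H, W, u):
    """Pool K word vectors H into one document vector.

    Each word gets the score u^T tanh(W h); the softmax of the scores
    gives the weights of a weighted mean over the word vectors.
    """
    scores = []
    for h in H:
        hidden = [math.tanh(z) for z in matvec(W, h)]
        scores.append(sum(a * b for a, b in zip(u, hidden)))
    beta = softmax(scores)
    dim = len(H[0])
    s = [sum(beta[l] * H[l][d] for l in range(len(H))) for d in range(dim)]
    return s, beta

# Three toy 2-dimensional word vectors; W and u are hand-picked, not learnt.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
u = [1.0, 1.0]
s, beta = attention_pool(H, W, u)
```

Here the third word vector aligns best with `u` and therefore receives the largest weight, while the first two receive equal weights. Because the softmax is smooth, small changes in `u` or `W` shift the weights gradually, which is the property that motivates preferring it over max pooling.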

Multi-step interrelation of data modalities

Our next task is to model the time series, taking into account the information learnt from the textual data modality, in order to discover a small set of relevant text articles that align well with the current state of the time series. An approach using a single alignment between the two data modalities (empirically evaluated in the Experiment section) generally does not work effectively, since the text articles are highly general, contain noise, and, in particular, are not synchronized step by step with the time series. The number of look-back time points m in the time series can differ greatly from the number of text documents n collected at that moment, a value that can itself vary over time. To tackle these challenges, we propose the novel MSIN network, which broadens the scope of the LSTM so that it can concurrently handle two input sequences of different lengths. More profoundly, we develop in MSIN a novel neural mechanism that integrates the information captured in the representative textual vectors learnt by the previous textual module into every reasoning step over the time series sequence. Doing so allows MSIN to leverage the mutual information between the two data modalities through multiple steps of data interrelation. Consequently, as it advances through the time series sequence, it gradually filters out irrelevant text articles while focusing only on those that correlate with the current patterns learnt from the time series. The chosen text articles are hence captured in the form of a probability mass attended on the input sequence of textual representative vectors.

Specifically, the inputs to MSIN at each time stamp t are two sequences: (1) the values of the last m steps in the time series modality, $\{x^{ts}_1, \ldots, x^{ts}_m\}$ (the superscript ts denotes the time series modality); (2) a sequence of n text representative vectors $\{s_1, \ldots, s_n\}$ learnt by the above textual module. Its outputs are the set of hidden state vectors $\{h^{ts}_1, \ldots, h^{ts}_m\}$ (described below) and the probability mass vector $p_m$ output at the last state of the series sequence. The number of entries in $p_m$ equals the number of text vectors at input. Its values are non-negative and encode the relevance of the text documents to the time series sequence. A larger value at the j-th entry of $p_m$ indicates that document $doc_j$ is more relevant to the time series.

A memory cell of our MSIN is illustrated in Fig.2. As observed, the number of gates in MSIN remains fixed, yet we augment the information flow within the cell with the information learnt in the text modality. To be concrete, MSIN starts by initializing the cell state $c^{ts}_0$ and hidden state $h^{ts}_0$ using two separate single-layer neural networks applied to the average state of the text sequence:

$$c^{ts}_0 = \tanh(U_{c_0}\bar{s} + b_{c_0}), \qquad h^{ts}_0 = \tanh(U_{h_0}\bar{s} + b_{h_0}) \tag{6}$$

where $\bar{s} = \frac{1}{n}\sum_{j=1}^{n} s_j$; $U_{c_0}, U_{h_0} \in \mathbb{R}^{d_s \times 2d_h}$ and $b_{c_0}, b_{h_0} \in \mathbb{R}^{d_s}$, with $d_s$ as the number of neural units. These are parameters jointly trained with our entire model.

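MSIN selectively incorporates the information learnt from the text articles into every step of its reasoning on the time series. Specifically, at each timestep $\ell$ in the time series sequence, MSIN searches through the text representative vectors to assign higher probability mass to those that better align with the signal it has discovered so far in the time series sequence, captured in the hidden state vector $h^{ts}_{\ell-1}$. In particular, the attention mass associated with each text representative vector $s_j$ is computed at the $\ell$-th timestep as follows:

$$a_{\ell,j} = \tanh(W_a h^{ts}_{\ell-1} + U_a s_j + b_a) \tag{7}$$

$$p_\ell = \mathrm{softmax}\left(u_a^\top a_{\ell,1}, \ldots, u_a^\top a_{\ell,n}\right) \tag{8}$$

where $W_a$ and $U_a$ are weight matrices and $b_a$ and $u_a$ are parameter vectors. The parametric vector $u_a$ is learnt to transform each alignment vector $a_{\ell,j}$ to a scalar and hence, by passing through the softmax function, $p_\ell$ is the probability mass distribution over the text representative sequence. We would like the information from these vectors, scaled proportionally by their probability mass, to immediately impact the learning process over the time series. This is made possible by generating a context vector $v_\ell$:

$$v_\ell = \frac{1}{2}\left(\sum_{j=1}^{n} p_{\ell,j}\, s_j + v_{\ell-1}\right) \tag{9}$$

in which $v_0$ is initialized as a zero vector. As designed, MSIN constructs the latest context vector as the average of the current representation of relevant text articles (1st term on the right-hand side of Eq.(9)) and the previous context vector $v_{\ell-1}$. By induction, the influence of context vectors from early time steps fades out as MSIN advances in the time series sequence. MSIN uses this aggregated vector to regulate the information flow to all its input, forget and output gates:

$$i_\ell = \sigma(U_i x^{ts}_\ell + W_i h^{ts}_{\ell-1} + V_i v_\ell + b_i), \quad f_\ell = \sigma(U_f x^{ts}_\ell + W_f h^{ts}_{\ell-1} + V_f v_\ell + b_f), \quad o_\ell = \sigma(U_o x^{ts}_\ell + W_o h^{ts}_{\ell-1} + V_o v_\ell + b_o) \tag{10}$$

and the candidate cell state:

$$\tilde{c}_\ell = \tanh(U_c x^{ts}_\ell + W_c h^{ts}_{\ell-1} + V_c v_\ell + b_c) \tag{11}$$

where $U_*$, $W_*$ and $V_*$ are weight matrices and $b_*$ are bias vectors. Let $\odot$ denote the Hadamard product; the current cell and hidden states are then updated in the following order:

$$c^{ts}_\ell = f_\ell \odot c^{ts}_{\ell-1} + i_\ell \odot \tilde{c}_\ell, \qquad h^{ts}_\ell = o_\ell \odot \tanh(c^{ts}_\ell) \tag{12}$$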

By tightly integrating the information learnt in the textual modality into every step of modeling the time series, our network distributes the burden of discovering relevant text articles across the whole series sequence. The selected relevant documents are also immediately exploited to better learn patterns in the time series.
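The per-step computation of MSIN described above can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions rather than our actual implementation: weights are random instead of trained, bias terms are omitted, the initial states use a tanh activation, the context vector is also fed to the candidate cell state, and all names (`msin_step`, `init_params`, `W_a`, `U_a`, `u_a`, `V_i`, and so on) are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vecs):
    """Element-wise sum of several equal-length vectors."""
    return [sum(vals) for vals in zip(*vecs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_params(rng, d_x, d_s, d_txt):
    """Random weights for illustration; in practice these are trained."""
    p = {"W_a": rand_matrix(rng, d_s, d_s), "U_a": rand_matrix(rng, d_s, d_txt),
         "u_a": [rng.uniform(-0.5, 0.5) for _ in range(d_s)],
         "U_h0": rand_matrix(rng, d_s, d_txt), "U_c0": rand_matrix(rng, d_s, d_txt)}
    for g in ("i", "f", "o", "c"):
        p["U_" + g] = rand_matrix(rng, d_s, d_x)
        p["W_" + g] = rand_matrix(rng, d_s, d_s)
        p["V_" + g] = rand_matrix(rng, d_s, d_txt)
    return p

def msin_step(p, x, h_prev, c_prev, v_prev, S):
    """One time step: attend over the text vectors S, then update the cell."""
    # Attention mass over the n text vectors, conditioned on h_prev.
    scores = []
    for s in S:
        a = [math.tanh(z) for z in add(matvec(p["W_a"], h_prev), matvec(p["U_a"], s))]
        scores.append(sum(ui * ai for ui, ai in zip(p["u_a"], a)))
    probs = softmax(scores)
    # Context vector: average of the attended text and the previous context.
    attended = [sum(pj * s[d] for pj, s in zip(probs, S)) for d in range(len(S[0]))]
    v = [0.5 * (a + b) for a, b in zip(attended, v_prev)]

    def pre(g):
        # Pre-activation combining the input, previous hidden state and context.
        return add(matvec(p["U_" + g], x), matvec(p["W_" + g], h_prev),
                   matvec(p["V_" + g], v))

    i = [sigmoid(z) for z in pre("i")]
    f = [sigmoid(z) for z in pre("f")]
    o = [sigmoid(z) for z in pre("o")]
    c_tilde = [math.tanh(z) for z in pre("c")]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c, v, probs

rng = random.Random(0)
d_x, d_s, d_txt, n, m = 1, 4, 3, 5, 5          # toy sizes (assumptions)
params = init_params(rng, d_x, d_s, d_txt)
S = [[rng.uniform(-1, 1) for _ in range(d_txt)] for _ in range(n)]  # n text vectors
series = [[rng.uniform(-1, 1)] for _ in range(m)]                   # m scalar values

# Initial states from the mean text vector (tanh is an assumption).
s_bar = [sum(s[d] for s in S) / n for d in range(d_txt)]
h = [math.tanh(z) for z in matvec(params["U_h0"], s_bar)]
c = [math.tanh(z) for z in matvec(params["U_c0"], s_bar)]
v = [0.0] * d_txt
for x in series:
    h, c, v, probs = msin_step(params, x, h, c, v, S)
```

After the loop, `probs` plays the role of the attention distribution at the last time step. Note that the number of text vectors `n` and the number of time steps `m` are independent, and that halving the previous context at each step makes a context formed k steps earlier contribute with weight proportional to 2^(-k).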

Output Layer: Given the representative vectors learnt from the textual domain and the hidden state vectors of the time series, we use a concatenation layer to aggregate them and pass the result through an output dense layer. The entire model is trained with the next value in the time series as the output.

As ablation studies, we consider two variants of our proposed model. First, in order to see the impact of the multi-step interrelation between data modalities, we exclude that process from our model, use a conventional LSTM to model the time series, and subsequently align its last state with the representative textual vectors to find the information clues. This simplified model has fewer parameters, yet the interaction between the two data modalities is limited to only the last state of the time series sequence. We name it LSTMw/o (LSTM without interaction). Second, excluding the relevant text discovery task, we build a model that treats numerical series and text data independently. Hence, the two networks used for the two data modalities are trained in parallel and their outputs are fused only prior to the last dense layer for making predictions. This model is closely related to the recent work (Akita and others 2016), and we name it LSTMpar. Empirically evaluating these models also highlights our model’s strength in discovering text articles relevant to the time series.

Figure 3: (a) Precision and (b) Recall computed w.r.t. “Apple” headlines annotated by Reuters.

Experiment

Datasets: We analyze a dataset consisting of news headlines (text articles) collected daily from Thomson Reuters over seven consecutive years from 2006 to 2013 (Ding and others 2014), and a daily stock price time series of a specific company collected from Yahoo! Finance for the same time period. For the results reported below, we form each data sample from the two data modalities as follows: stock values in the last m = 5 days (one trading week) as a time series sequence, and all news headlines released in the latest 24 hours, ordered by release time, as a sequence of text documents. With this setting, a trained model aims at discovering a small set of relevant news articles associated with the time series on a daily basis. We evaluate models with each of two company-specific stock time series: (1) Apple (AAPL), and (2) Google (GOOG). Each dataset (text and time series) is split into training, validation and test sets corresponding to the year periods 2006-2011, 2012, and 2013, respectively. We use the validation set to tune the models’ parameters, and the test set for an independent evaluation. As mentioned at the beginning, neither time series identity nor fixed keywords have been used to train our model, since such an approach requires extensive domain knowledge, is prone to information loss, and has limited applications. Instead, we let the models automatically learn such identity and keywords by themselves. We examine such findings by investigating the text they select to associate with the given time series (more discussion later).

Baselines: We name our model MSIN after its novel network component. For baseline models, we implement LSTMw/o and LSTMpar as described in the previous section; GRUtxt, based on (Yang and others 2016), and CNNtxt (Kim 2014), both of which exploit the textual news modality; and GRUts, which analyzes the time series. For a conventional machine learning technique, we implement SVM (Weng, Ahmed, and Megahed 2017), which takes in both the time series (as vectors) and the uni-grams of the textual news. To emphasize the key contributions, we mainly discuss below the results of our model against those of LSTMw/o and GRUtxt in the key task of discovering relevant documents associated with a given time series. This is because they are the only methods that can infer selected documents based on their neural attention mechanism. Comparison results against the other techniques and on the prediction task are reported in the supplementary.
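The sample construction and chronological split described under Datasets can be sketched as below. The exact alignment between the look-back window, the day's headlines and the prediction target is an assumption here, as are the function names `build_samples` and `split_by_year` and the toy prices and headlines, which are not taken from the real corpus.

```python
from datetime import date, timedelta

def build_samples(prices, headlines, m=5):
    """Pair each day with its look-back window and that day's headlines.

    prices:    list of (date, value) tuples sorted by date
    headlines: dict mapping date -> list of headlines in release order
    Each sample holds the m previous values, the day's headlines, and the
    value to predict (assumed alignment, for illustration).
    """
    samples = []
    for idx in range(m, len(prices)):
        day, target = prices[idx]
        window = [value for _, value in prices[idx - m:idx]]
        samples.append({"date": day, "series": window,
                        "texts": headlines.get(day, []), "target": target})
    return samples

def split_by_year(samples):
    """Train on years up to 2011, validate on 2012, test on 2013 and later."""
    parts = {"train": [], "valid": [], "test": []}
    for sample in samples:
        year = sample["date"].year
        if year <= 2011:
            parts["train"].append(sample)
        elif year == 2012:
            parts["valid"].append(sample)
        else:
            parts["test"].append(sample)
    return parts

# Toy data: eight consecutive days straddling the 2011/2012 boundary.
start = date(2011, 12, 25)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(8)]
headlines = {start + timedelta(days=6): ["toy headline a", "toy headline b"]}
samples = build_samples(prices, headlines, m=5)
parts = split_by_year(samples)
```

Splitting by calendar year rather than at random keeps every test sample strictly later than the training data, so no future information leaks into training.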

Relevant text articles discovery

The Thomson Reuters corpus provides meta-data indicating whether a news article is about a specific company, and we use such information as the ground truth relevant news, denoted by GTn. Nonetheless, it is important to emphasize that such information was not used to train our model(s). Neither the identity of the time series nor any pre-selection of company-specific news articles has been used. Rather, we let MSIN learn by itself the association between the textual news and the stock series by jointly analyzing the two data modalities. MSIN is hence completely data-driven and is straightforward to apply to other corpora, such as the Bloomberg news source, or to other applications like the cloud business, where similar ground truths are not available.

Figure 4: (a) Precision and (b) Recall computed w.r.t. “Google” headlines annotated by Reuters.

The GTn headlines allow us to compute the rate of discovering relevant news stories in association with a stock series through the precision and recall metrics. A higher ranking (based on the attention mass vector) of GTn headlines among each day's headlines signifies better performance of an examined model. Fig.3(a-b) and Fig.4(a-b) plot these evaluations for the AAPL and GOOG time series respectively, when we vary the number k of returned daily top relevant headlines between 1 and 5 (shown on the x-axis). For example, at k = 5, MSIN achieves up to 84.9% and 87.2% recall while retaining precision at 46.8% and 59.6% with respect to the GTn sets of AAPL and GOOG, respectively. Other settings of k also show that MSIN’s performance is far better than that of the competing models. Our novel design of fusing two data modalities through multiple steps of data interrelation allows our model to effectively discover the complex hidden correlation between the time-varying patterns in the time series and a small set of corresponding information clues in the general textual data. Its precision and recall significantly outperform those of LSTMw/o, which utilizes only a single step of alignment between the two data modalities, and of GRUtxt, which solely explores the textual domain with the conventional attention mechanism in deep learning (Yang and others 2016).
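As an illustration of how precision and recall at k can be computed from one day's attention mass vector, consider the sketch below. The function name `precision_recall_at_k`, the toy attention weights and the toy ground-truth indices are assumptions, and how per-day values are aggregated over the test set is not covered by this sketch.

```python
def precision_recall_at_k(attention, relevant, k):
    """Precision and recall of the k headlines with the largest attention mass.

    attention: list of attention weights, one per headline of the day
    relevant:  set of indices of ground-truth (GTn) headlines for that day
    """
    ranked = sorted(range(len(attention)), key=lambda j: attention[j], reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for j in top_k if j in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy day with five headlines, two of which are ground truth (illustrative).
attention = [0.05, 0.37, 0.10, 0.21, 0.27]
relevant = {1, 3}
p_at_2, r_at_2 = precision_recall_at_k(attention, relevant, k=2)
p_at_3, r_at_3 = precision_recall_at_k(attention, relevant, k=3)
```

In this toy case only one of the two ground-truth headlines is in the top 2, and both are in the top 3. Since most days contain few GTn headlines, precision at k = 5 is bounded well below 100% even for a perfect ranking, which is why recall is the more informative figure at larger k.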

Explanation on the discovered textual news: As concrete examples for qualitative evaluation, we show in Table 1 all news headlines from three specific examined days, along with those discovered by our model as relevant news (highlighted in green, selected by thresholding their cumulative probability mass of attention) when it was trained with the AAPL series and evaluated on the independent test dataset. As clearly seen on date 2013-01-22, MSIN gives high attention mass to “steve jobs threatened patent suit to enforce no-hire policy” though none of its words mentions Apple. Likewise, on 2013-09-06, “china unicom, telecom to sell latest iphone shortly after u.s. launch” received the 2nd highest probability mass (21%), in addition to the 1st one, “apple hit with u.s. injunction in e-books antitrust case” (37%). Although ground truth news headlines (often containing the company name as a keyword) have been used to quantitatively evaluate MSIN’s performance (Fig.3 and 4), we believe that these uncovered news headlines demonstrate the success and potential of MSIN, as our model is capable of unearthing news content that never explicitly mentions company names. We observe similar performance of MSIN when it is trained with the GOOG stock series, using the same news corpus, as presented in Table 2. The results present the probability mass of attention of MSIN on three days selected from the test dataset. Note that date 2013-08-14 is deliberately shown in both Tables 1 and 2 to demonstrate that the set of relevant news clearly depends on which time series has been used to train our model along with the general text corpus. Their relevancy to each of the two time series is obvious and intuitively interpretable.

Table 1: News headlines from 3 examined days in 2013 (test set). Relevant news headlines discovered by MSIN associated with the AAPL stock series are blue-highlighted, selected by a threshold on their accumulated probability mass. (The full set of discovered relevant news is uploaded to our repository due to space constraints.)

Table 2: News headlines from 3 examined days in 2013 (test set). Relevant news headlines discovered by MSIN associated with the GOOG stock series are blue-highlighted, selected by a threshold on their accumulated probability mass.
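Selecting headlines by accumulated attention mass, as done for the highlighted entries in the tables, can be sketched as follows. The threshold of 0.5 used here is a hypothetical value chosen for illustration, as is the function name `select_by_cumulative_mass`; of the toy attention weights, only the two largest (37% and 21%) come from the 2013-09-06 example, and the rest are made up so that the weights sum to one.

```python
def select_by_cumulative_mass(attention, threshold):
    """Return indices of the top-ranked headlines whose attention mass
    accumulates to at least `threshold` (a value in (0, 1])."""
    ranked = sorted(range(len(attention)), key=lambda j: attention[j], reverse=True)
    selected, total = [], 0.0
    for j in ranked:
        selected.append(j)
        total += attention[j]
        if total >= threshold:
            break
    return selected

# Two largest weights follow the 2013-09-06 example; the others are made up.
attention = [0.37, 0.21, 0.12, 0.10, 0.08, 0.07, 0.05]
chosen = select_by_cumulative_mass(attention, threshold=0.5)
```

With this rule the number of returned headlines adapts to how concentrated the attention is: a day dominated by one story yields a single headline, while a flatter distribution yields several.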

Related work

A large number of single-modality studies analyzing either time series data or unstructured text documents have been proposed in the literature. Some of these studies are based on classical statistical methods (Wu and others 2016; Michell and others 2018) or neural networks (Qiu, Song, and Akagi 2016; Zhong and Enke 2017). Recent studies from the financial domain explore both time series of asset prices and news articles (Schumaker et al. 2012; Weng, Ahmed, and Megahed 2017; Akita and others 2016), which are related to our work. Typically, these studies attempt to transform text from financial news into various numerical forms, including news sentiment, subjective polarity and n-grams, and combine them with stock data. These handcrafted features require extensive pre-processing and are also extracted independently of the time series data. A more recent model (Akita and others 2016) relies on RNNs, which enable it to model stock series in their natural form and later merge them with the vector representation of textual news prior to making the market prediction. The goal of all these studies remains to improve prediction accuracy rather than to explain the time series. They hence lack the capability of providing relevant interpretation for time series based on the textual information.

Our work is also related to multi-modal deep learning studies (Baltrusaitis and others 2019), which can generally be classified into three categories: early, late, and hybrid (in-between), depending on how and at which level data from multiple modalities are fused together. In early fusion (Valada and others 2016; Zadeh et al. 2016), multi-modal data sources are concatenated into a single feature vector prior to being used as inputs for a learning model, while in late fusion, data is aggregated from the outputs of multiple models, each trained on a separate modality and fused later based on aggregation rules such as averaged fusion (Nojavanasghari and others 2016), tensor products (Zadeh et al. 2017), or a meta-model like gated memory (Zadeh et al. 2018). Hybrid (in-between) fusion is the trade-off paradigm, which allows the data to be aggregated at multiple scales, yet often requires synchronization among data modalities, as in synchronized gesture recognition (Neverova and others 2015; Rajagopalan and others 2016). Our model is related to the third category; yet, we deal with the asynchronous modalities of numerical time series and unstructured text, and relax the constraints on time-step synchronization between modalities. More significantly, we perform data fusion through multiple steps and at the level of low-level features, which strengthens our model in learning associated patterns across data modalities.

Conclusion

Jointly learning both numerical time series and unstructured textual data is an important research endeavor to enhance the understanding of time series performance. In this work, we presented a novel neural model that is capable of discovering the top relevant textual information associated with a given time series. To deal with the complexity of the relationship between two data modalities with different sampling rates and lengths, we developed MSIN, which allows the direct incorporation of information learnt in the textual modality into the modeling of every time step of the time series, considerably leveraging their mutual influence through time. Through this multi-step data interrelation, MSIN can learn to focus on a small subset of textual data that best aligns with the time series. We demonstrated the performance of our model in the financial domain using the time series of two stock prices, trained along with a corpus of news headlines collected from Thomson Reuters. Our MSIN model discovers news stories relevant to the stocks that do not even explicitly mention the company name, but include highly relevant events that may influence or reflect their performance in the market.

References

[Akita and others 2016] Akita, R., et al. 2016. Deep learning for stock prediction using numerical and textual information. In IEEE/ACIS.

[Baltrusaitis and others 2019] Baltrusaitis, T., et al. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2).

[Cho and others 2014] Cho, K., et al. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

[Chung et al. 2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.

[Conneau and others 2017] Conneau, A., et al. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

[Ding and others 2014] Ding, X., et al. 2014. Using structured events to predict stock price movement: An empirical investigation. In EMNLP.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation.

[Kim 2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

[Michell and others 2018] Michell, K., et al. 2018. A stock market risk forecasting model through integration of switching regime, ANFIS and GARCH techniques. Applied Soft Computing 67:106–116.

[Neverova and others 2015] Neverova, N., et al. 2015. Moddrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(8).

[Nojavanasghari and others 2016] Nojavanasghari, B., et al. 2016. Deep multimodal fusion for persuasiveness prediction. In ACM International Conference on Multimodal Interaction.

[Pennington and others 2014] Pennington, J., et al. 2014. GloVe: Global vectors for word representation. In EMNLP.

[Qiu, Song, and Akagi 2016] Qiu, M.; Song, Y.; and Akagi, F. 2016. Application of artificial neural network for the prediction of stock market returns. Chaos, Solitons & Fractals 85.

[Rajagopalan and others 2016] Rajagopalan, S. S., et al. 2016. Extending long short-term memory for multi-view structured learning. In ECCV.

[Schumaker et al. 2012] Schumaker, R. P.; Zhang, Y.; Huang, C.-N.; and Chen, H. 2012. Evaluating sentiment in financial news articles. Decision Support Systems 53(3):458–464.

[Valada and others 2016] Valada, A., et al. 2016. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In International Symposium on Experimental Robotics. Springer.

[Weng, Ahmed, and Megahed 2017] Weng, B.; Ahmed, M. A.; and Megahed, F. M. 2017. Stock market one-day ahead movement prediction using disparate data sources. Expert Systems with Applications 79:153–163.

[Wigglesworth 2018] Wigglesworth, R. 2018. The rise of quantamental investing: Where man and machine meet. Financial Times.

[Wu and others 2016] Wu, L., et al. 2016. Grey double exponential smoothing model and its application on pig price forecasting in China. Applied Soft Computing 39.

[Yang and others 2016] Yang, Z., et al. 2016. Hierarchical attention networks for document classification. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[Zadeh et al. 2016] Zadeh, A.; Zellers, R.; Pincus, E.; and Morency, L.-P. 2016. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31(6):82–88.

[Zadeh et al. 2017] Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; and Morency, L.-P. 2017. Tensor fusion network for multi-modal sentiment analysis. arXiv preprint arXiv:1707.07250.

[Zadeh et al. 2018] Zadeh, A.; Liang, P. P.; Mazumder, N.; Poria, S.; Cambria, E.; and Morency, L.-P. 2018. Memory fusion network for multi-view sequential learning. In AAAI.

[Zaremba and others 2014] Zaremba, W., et al. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

[Zhang and Wallace 2015] Zhang, Y., and Wallace, B. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.

[Zhong and Enke 2017] Zhong, X., and Enke, D. 2017. Forecasting daily stock market return using dimensionality reduction. Expert Systems with Applications 67:126–139.

Supplementary

Supp. Thomson Reuters Data

Fig. 5 plots the distribution of the number of news headlines released per day by Reuters. Because the distribution is skewed (its mode is around 12 headlines per day), we limit all models to exploring up to 25 news articles per day. This setting retains more than 95% of the headlines in the original textual corpus. For the small portion of days with exceptionally large numbers of news articles (those at the far right of the distribution in Fig. 5), we keep the last 25 news headlines of each day. We use the NLTK toolkit (available at https://www.nltk.org/) for text preprocessing, keep the vocabulary size at 5000 unique words, and remove those not found among the 400k words of GloVe.
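The filtering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, headlines are assumed to arrive already tokenized and in release order, and the vocabulary is assumed to be the 5000 most frequent words before the GloVe filter is applied (the text does not state the order of these two steps).

```python
from collections import Counter

MAX_HEADLINES_PER_DAY = 25  # daily cap stated in the text
VOCAB_SIZE = 5000           # vocabulary size stated in the text


def cap_daily_headlines(headlines, limit=MAX_HEADLINES_PER_DAY):
    """Keep the last `limit` headlines of one day (input is in release order)."""
    return headlines[-limit:]


def build_vocabulary(tokenized_headlines, embedding_words, size=VOCAB_SIZE):
    """Take the `size` most frequent words, then drop those without an embedding.

    `embedding_words` is a set standing in for the GloVe vocabulary
    (an assumption: any container supporting `in` would work).
    """
    counts = Counter(tok for headline in tokenized_headlines for tok in headline)
    most_frequent = [word for word, _ in counts.most_common(size)]
    return [word for word in most_frequent if word in embedding_words]
```

Days with 25 or fewer headlines pass through `cap_daily_headlines` unchanged, so the cap only affects the far-right tail of the distribution.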

Figure 5: Distribution of the number of news headlines released per day in the Thomson Reuters news corpus.

Supp. Statistics on Ground Truth News Headlines

Table 3: Statistics on “ground truth” news (GTn) headlines in the test dataset associated with company-specific time series. %GTn shows the percentage of GTn over all news headlines. %GTd shows the percentage of days having at least one GTn. GTn/GTd shows the average percentage of GTn on GTd. Max#GTn shows the maximum number of GTn headlines in a single day.

Supp. Parameters Setting

We set the parameter ranges for our model and its counterparts as follows: the number of neural units per layer , the regularization , the dropout rate {0.0, 0.1, 0.2, 0.4}, and the number of timesteps (sequence length) for the stock time series . We choose 50 as the word embedding dimension and use the pre-trained GloVe model (Pennington and others 2014). For GRUtxt, we follow the model presented in (Yang and others 2016), using two Bi-LSTMs along with a self-attention layer applied at the sentence level. For CNNtxt, we implement an architecture with three kernel filters of sizes {2, 3, 4} and max pooling, as recommended in (Kim 2014; Zhang and Wallace 2015). We use two layers of LSTMs as encoder-decoder for GRUts, while SVM is used with the linear kernel and C is tuned from the log range . Our model and the baselines were implemented and trained on a machine with two Tesla K80 GPUs, running TensorFlow 1.3.0 as the backend. We performed a random search on the validation set to tune the hyperparameters.
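As an illustration of tuning by random search on a validation set, the following sketch samples configurations from a discrete search space and keeps the best one. It is a generic sketch, not the authors' tuning script: `random_search`, `evaluate`, and the `hidden_units` values are hypothetical placeholders, and only the dropout candidates come from the text.

```python
import random


def random_search(space, evaluate, n_trials, seed=0):
    """Sample `n_trials` configurations and return the best (config, score).

    space:    dict mapping a hyperparameter name to its candidate values.
    evaluate: callable returning a validation score (higher is better).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


search_space = {
    "dropout": [0.0, 0.1, 0.2, 0.4],  # candidates stated in the text
    "hidden_units": [16, 32, 64],     # PLACEHOLDER values, not from the paper
}
```

In practice `evaluate` would train a model with the sampled configuration and return its accuracy on the validation set; here it is left to the caller.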

The final values we used for the AAPL dataset are , and for GOOG are , both with a dropout of 0.2 and pre-trained GloVe embeddings. For S&P500, the values are 0.005 with a dropout of 0.1 and GloVe initialization of the word embeddings.

Supp. Evaluation Metrics

In the empirical evaluation of our model and its counterparts, we use precision and recall. While their calculation with respect to predicting class labels is the usual one, their computation with respect to the ground truth news headlines (GTn) is slightly adapted to the setting k, the number of top relevant headlines to be returned, as reported in Fig. 3(a-b) and Fig. 4(a-b). This is because some days have as few as one GTn while other days have as many as 9 (in AAPL) or 5 (in GOOG), as shown by Max#GTn in Table 3, whereas the cardinality k is fixed in computing precision and recall. Specifically, we compute Pre@k and Rec@k (reported in Fig. 3(a-b) and Fig. 4(a-b)), averaged over all days having at least one ground truth headline, as follows:

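$$\mathrm{Pre@}k = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Rec@}k = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{GTn_i}$$

in which $N$ is the number of days having at least one ground truth headline, $TP_i$ and $FP_i$ are respectively the numbers of true positive and false positive headlines on day $i$, and $GTn_i$ denotes the number of ground truth news headlines, up to cardinality $k$, on day $i$. The denominator in Rec@k ensures that the number of false negative headlines does not exceed either the cardinality k or the actual number of ground truth headlines on day i.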
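The following Python sketch shows one way to compute these two measures from the definitions in the text. It rests on assumptions: the function name and input layout are hypothetical, and the recall denominator is taken to be the number of ground truth headlines on a day capped at k, which is our reading of "up to k cardinality".

```python
def precision_recall_at_k(ranked_by_day, truth_by_day, k):
    """Average Pre@k and Rec@k over days with at least one ground truth headline.

    ranked_by_day: per-day lists of headline ids, sorted by decreasing relevance.
    truth_by_day:  per-day sets of ground truth headline ids.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    precisions, recalls = [], []
    for ranked, truth in zip(ranked_by_day, truth_by_day):
        if not truth:
            continue  # days without ground truth headlines are excluded
        top_k = ranked[:k]
        tp = sum(1 for headline in top_k if headline in truth)
        fp = len(top_k) - tp
        capped_truth = min(k, len(truth))  # assumed meaning of "up to k cardinality"
        precisions.append(tp / (tp + fp) if top_k else 0.0)
        recalls.append(tp / capped_truth)
    if not precisions:
        raise ValueError("no day has a ground truth headline")
    n_days = len(precisions)
    return sum(precisions) / n_days, sum(recalls) / n_days
```

With the cap in place, a day with 9 ground truth headlines and k = 3 can still reach a recall of 1.0 when all three returned headlines are relevant.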

Supp. Experiment on S&P500 Time Series

Table 4 shows the news headlines from three days selected from the test dataset, along with the attention probability mass (on valid news headlines) of our MSIN model, its variant LSTMw/o, and GRUtxt in the third, second, and first numeric columns, respectively. As observed, MSIN places more probability mass on the headlines whose contents directly report the market performance, while zeroing out the mass on headlines that report company events or local markets. Its attention mass is also more condensed, with a clear focus on a small set of top relevant news, compared to those of LSTMw/o and GRUtxt, which both spread out over multiple headlines. The headlines highlighted in Table 4 are those output by our model, selected so that their cumulative probability mass is at least 50%.
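The selection rule by cumulative attention mass can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function name is hypothetical, and we assume headlines are taken in decreasing order of attention until the running sum first reaches the threshold.

```python
def select_by_cumulative_mass(headlines, attention, threshold=0.5):
    """Return the top-attended headlines whose attention mass sums to at least `threshold`."""
    if len(headlines) != len(attention):
        raise ValueError("headlines and attention must have the same length")
    ranked = sorted(zip(headlines, attention), key=lambda pair: pair[1], reverse=True)
    selected, cumulative = [], 0.0
    for headline, weight in ranked:
        if cumulative >= threshold:
            break  # enough mass has been collected
        selected.append(headline)
        cumulative += weight
    return selected
```

A concentrated attention distribution yields a short list under this rule, while a spread-out distribution needs many headlines to reach the same mass.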

The focus of our work is on discovering relevant text articles in association with the numerical time series. Nevertheless, since this experiment uses the global market time series, we would also like to test the impact of the relevant text articles selected by our model on forecasting the overall market movement on the subsequent day. For this task, we further compare its performance against other baseline models, including GRUtxt (Yang and others 2016) and CNNtxt (Kim 2014), which both exploit the textual news modality; GRUts, which analyzes the time series; and LSTMpar, closely related to (Akita and others 2016), which trains two LSTM networks in parallel, each on one data modality, then fuses them prior to the output layer. Moreover, as a conventional machine learning technique, we implemented SVM (Weng, Ahmed, and Megahed 2017), which takes in both the time series (as vectors) and the unigrams of the textual news.

Table 4: News headlines from three days selected from the test set, along with the attention mass (annotated in the numeric columns) by GRUtxt, LSTMw/o, and MSIN. Colored headlines are relevant news output by our model based on an accumulated probability mass of at least 50%.

Table 5 reports the forecasting accuracy, precision, and recall of all models. The two values in each entry of the precision and recall columns are reported for forecasting an up and a down market on the next day, respectively. As observed, there is not much difference in the prediction accuracy of GRUtxt and CNNtxt, though they analyze the textual news with different network topologies. Without exploiting the temporal order in the time series and the semantic dependencies in the textual news, SVM appears less competitive than its neural network counterparts. Out of all models, our MSIN achieves the highest prediction accuracy. Its solution also converges to a more balanced classification boundary, as reflected in the precision and recall measures.

Table 5: Performance of all models on forecasting market movement. Precision and recall are each reported for the two predictions of up and down market, respectively.

Table 6: Performance of all models on forecasting market movement when varying the latest time of day up to which news headlines are collected.

To further observe the impact of textual news on the models' classification accuracy, we vary the latest time up to which the daily news headlines are collected: 9:00 (before market open), 15:40 (before market close), 19:00, and 23:59 on the same day the market performance is predicted. These evaluations are reported in Table 6. A clear trend is that the prediction accuracy is higher when the news articles are collected closer to the market closing time, which confirms that the textual news carries indicative information for predicting the S&P500 performance. While most neural models tend to learn features better than SVM, only our MSIN can further offer interpretation, thanks to its joint training on both the evolving time series and the news articles. Note that evaluating at the 19:00 (or 23:59) cutoff, i.e., after the market has closed, shifts a model from a predictive system to a purely explanatory one.
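Collecting headlines up to a daily cutoff time can be sketched as below. The sketch is illustrative only: the function name and the `(timestamp, text)` input layout are hypothetical, and timestamps are assumed to be in the market's local time, with cutoffs compared at minute resolution.

```python
from datetime import date, datetime, time

# Cutoff times listed in the text: before open, before close, and two post-close times.
CUTOFFS = [time(9, 0), time(15, 40), time(19, 0), time(23, 59)]


def headlines_up_to(headlines, day, cutoff):
    """Keep the texts of headlines released on `day` no later than `cutoff`.

    headlines: iterable of (datetime, text) pairs.
    Seconds are ignored so that 23:59:30 still counts as within 23:59.
    """
    kept = []
    for timestamp, text in headlines:
        minute = timestamp.time().replace(second=0, microsecond=0)
        if timestamp.date() == day and minute <= cutoff:
            kept.append(text)
    return kept
```

Each later cutoff yields a superset of the headlines available at the earlier ones, which is consistent with reading the 19:00 and 23:59 settings as explanatory rather than predictive.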