LSTM based stock prediction using weighted and categorized financial news

A significant correlation between financial news and stock market trends has been explored extensively. However, very little research has been conducted on stock prediction models that utilize news categories weighted according to their relevance to the target stock. In this paper, we show that prediction accuracy can be enhanced by incorporating weighted news categories simultaneously into the prediction model. We suggest utilizing news categories associated with the structural hierarchy of the stock market: that is, news categories for market-, sector-, and stock-related news. In this context, a Long Short-Term Memory (LSTM) based Weighted and Categorized News stock prediction model (WCN-LSTM) is proposed. The model incorporates news categories with their learned weights simultaneously. To enhance its effectiveness, sophisticated features are integrated into WCN-LSTM: hybrid input, lexicon-based sentiment analysis, and deep learning to impose sequential learning. Experiments have been performed on the Pakistan Stock Exchange (PSX) using different sentiment dictionaries and time steps. Accuracy and F1-score are used to evaluate the prediction model. A thorough analysis of the results shows that WCN-LSTM performs better than the baseline model. Moreover, the sentiment lexicon HIV4, along with time steps 3 and 7, optimized the prediction accuracy. We have conducted statistical analysis to quantitatively assess our findings. A qualitative comparison of WCN-LSTM with existing prediction models is also presented to highlight its novelty and advantages over its counterparts.

prediction. So, there is a need for properly categorized news headlines covering the overall PSX, its sectors, and individual stocks. Furthermore, these news categories should be segregated well enough that their impact can be observed on the whole stock market, on an individual sector, and on an individual stock separately. In the literature, much work has been done on news categorization using training data, but there is no labelled data for PSX. In [1], news impact on the stock market is captured by grouping news into eight categories through a financial expert's manual effort. These categories are rather general (e.g., financial, economic, foreign relations), the news is collected on a weekly rather than daily basis, and the dataset is not public. The contribution of this work is twofold. Firstly, a properly categorized news headlines dataset is created for PSX from a corpus of unlabelled news headlines. News categorization is performed using the category name as a seed keyword. The context surrounding these keywords is then extracted using Part-of-Speech (POS) tagging. On the basis of the final list of keywords, news headlines are filtered accordingly. The categorized dataset is divided into training and test sets, and a supervised classification method is used to verify the segregation of each category. Fig. 1 illustrates the dataset preparation process. Secondly, this categorized dataset will be published for researchers, so that it can be used from different aspects to explore PSX volatility using news headlines. This paper is organized as follows: Section II discusses related work. Section III describes the complete process of data collection and preparation. Section IV validates the finalized dataset. Finally, Section V concludes the paper.

II. RELATED WORK
The task of text categorization is dominated by supervised techniques in which a large number of labelled training examples is used. In [2], labelled data is used by a supervised approach to classify news headlines into three categories, which are then used to analyze the relation between news and stock price.
In [3], company-specific news articles are considered for text categorization. These articles are labelled into four categories through an automatic process that uses a hand-made thesaurus. In [13], [14], sentiment scores are used to categorize stock-related news; this labelled news data is aligned with stock prices to forecast stock movements. In [15], news items are labelled using sentiment scores, and the labelled data is then validated by two economic scientists. Labelled data can also be prepared through manual filtration on the basis of domain-based constraints; in [1], news is categorized into different categories manually by domain experts. To reduce the manual effort of text categorization, many approaches have been proposed in the literature. These approaches initially rely on a manually provided list of category-related keywords, which are then used for text categorization based on a similarity measure between documents and the keywords of each category. Manual effort is further reduced by providing the category name as the initial keyword. In [7], this idea is taken further: the category name is the only input keyword for text categorization, and Latent Semantic Analysis (LSA) based similarity and WordNet-based similarity are multiplied to calculate the final similarity score between a document and a category name. Results showed improved precision on the Reuters-10 corpus. With further improvements and adaptations of the lexical references and context model presented in [7], another categorization scheme that takes the category name as initial input is presented in [8].

III. DATA COLLECTION AND PREPARATION
The News International is the largest English-language newspaper in Pakistan. We selected the publicly available "The News" archives for daily news headlines 1 . The archived news dataset is available from 2006 to date. 1 https://www.thenews.com.pk/todaypaper-archive/

A. Data Collection
In the literature, it has been argued that news headlines are more useful for stock prediction than complete news articles [5]. A scraper was developed to collect news headlines from 2006 to 2018 and store them in CSV format. These news headlines are grouped into categories such as Business, Top Stories, Sports, World, Karachi, Islamabad, Lahore, and Peshawar. Each row of the news headlines dataset is aligned with its publishing date. Table I presents example news headlines from every category of the scraped data. It is observed that Top Story, World, and Business are the categories containing news related to the financial market, so these three categories are selected for data preparation.
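The storage step of the scraping stage can be sketched with the standard library; this is a minimal illustration only (the archive URL, page parsing, and the actual column layout of the CSV are omitted, and the rows below are invented examples, not scraped data):

```python
import csv
import io

def save_headlines(rows, fileobj):
    """Store scraped (date, category, headline) rows in CSV format."""
    writer = csv.writer(fileobj)
    writer.writerow(["date", "category", "headline"])
    writer.writerows(rows)

# Toy rows standing in for scraped data (the real scraper covers 2006-2018)
rows = [
    ("2018-01-02", "Business", "KSE-100 index gains 500 points"),
    ("2018-01-02", "World", "Oil prices rebound"),
]
buf = io.StringIO()  # in practice this would be an open CSV file
save_headlines(rows, buf)
print(buf.getvalue().splitlines()[0])  # → date,category,headline
```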

B. Data Preparation
The selected news corpus contains around 2.5 million news headlines. The corpus word cloud is illustrated in Fig. 2; a word cloud is a technique for showing which words are the most frequent in a given text [9]. The word cloud in Fig. 2 shows that the highest-frequency words are generally not related to the stock market. This indicates that stock-market-related keywords are required to filter out relevant news headlines using string matching. For keyword-based text categorization, keyword lists must be generated manually for each category. Initially, the category name is taken as a seed keyword to filter news for the relevant category [9,10]. This initial keyword is then used to extract further related keywords for the specific category [8].
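A word cloud is driven by simple term-frequency counts. A minimal sketch of the underlying computation, using only the standard library and toy headlines standing in for the 2.5-million-headline corpus (the stopword list is illustrative):

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "on", "to", "a", "for", "and"}

def term_frequencies(headlines):
    """Count word frequencies across headlines, ignoring stopwords."""
    counts = Counter()
    for line in headlines:
        for word in line.lower().split():
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

# Toy headlines standing in for the scraped corpus
headlines = [
    "Oil prices fall",
    "Oil prices rebound",
    "Hike in oil prices criticised",
]
print(term_frequencies(headlines).most_common(2))  # → [('oil', 3), ('prices', 3)]
```

A word cloud tool then simply scales each word's font size by its count.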
• Assigning Category Name as a Seed Keyword
In the first step, the category name is taken as a seed keyword for string matching [9,10]. The names of the categories are taken from the official website of PSX 2 . The name of the stock market, its sector names, and its stock names are used as category names for news headline categorization. Table II lists some category names and their descriptions.
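Seed-keyword filtering reduces to case-insensitive substring matching; a minimal sketch (headlines are invented examples):

```python
def filter_by_seed(headlines, seed):
    """Keep headlines containing the seed keyword (case-insensitive substring match)."""
    seed = seed.lower()
    return [h for h in headlines if seed in h.lower()]

headlines = [
    "KSE-100 index gains 500 points",
    "Oil prices rebound",
    "KSE closes higher on foreign buying",
]
print(filter_by_seed(headlines, "kse"))  # keeps the two KSE headlines
```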

• Extracting Context of Seed Keywords
Categorization using the seed keywords alone may increase the false positive rate, so the seed keywords need to be enriched by further inspection of the dataset under consideration. This is achieved by extracting additional category-related keywords from the context of each seed keyword; the final list of keywords is then used to filter news headlines for the relevant category. Context-related keywords are extracted by creating a POS tree for each sentence. POS tagging is a standard Natural Language Processing (NLP) technique available in all major NLP toolkits; here, spaCy, an open-source NLP library, is chosen. For instance, the top 10 nouns and verbs co-occurring with the seed keyword "kse" are shown in Fig. 3. The list of keywords, that is, a seed keyword plus the verbs and nouns in its context, is then used for further processing. Different combinations of these keywords are used to filter news headlines using Python string matching. Moreover, if news is filtered for a specific sector of PSX, all of that sector's stock symbols are also added to the keyword list for that category, since news related to a sector's stocks is also relevant to the sector. Stock symbols are taken from the official website of PSX. For instance, the keyword list for the oil and gas sector is "oil and gas", "oil prices", "mari", "ogdc", "ppl", "pol", etc. For the category of stock-related news, only the stock symbol is used as a keyword; for example, news for the stock "PSO" is filtered using the keyword {"pso"}. In Fig. 4, some of the news headlines retrieved with the keyword "oil price" for the category "Oil and Gas Sector" are shown.
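The context-extraction step can be sketched as counting nouns and verbs near the seed keyword. The paper uses spaCy's POS tagger on full parse trees; to keep this sketch dependency-free, a tiny hand-written POS lookup stands in for the tagger, and the headlines, window size, and tag set are all illustrative assumptions:

```python
from collections import Counter

# Toy POS lookup standing in for spaCy's tagger (the paper builds full POS trees)
POS = {"kse": "NOUN", "index": "NOUN", "points": "NOUN", "gains": "VERB",
       "closes": "VERB", "rises": "VERB", "market": "NOUN", "higher": "ADV"}

def context_keywords(headlines, seed, window=3):
    """Count nouns/verbs co-occurring with the seed keyword within a word window."""
    counts = Counter()
    for line in headlines:
        words = line.lower().split()
        for i, w in enumerate(words):
            if w == seed:
                for neighbour in words[max(0, i - window): i + window + 1]:
                    if neighbour != seed and POS.get(neighbour) in {"NOUN", "VERB"}:
                        counts[neighbour] += 1
    return counts

headlines = ["KSE index gains points", "KSE closes higher", "Market rises as KSE gains"]
print(context_keywords(headlines, "kse").most_common(3))
```

The most frequent co-occurring nouns and verbs then extend the category's keyword list.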

• News Filtration and Categorization
News headlines filtered using a keyword list for a specific category are combined into one group, while duplicate news headlines are discarded. The publishing date of every news headline is retained. Examples of headlines matched by the keyword "oil price" include 'Oil prices fall', 'Oil prices rise above $73 per barrel', and 'Oil prices rebound'.

• Dataset Finalization and Potential Uses
Initially, the uncategorized dataset contains 2.5 million news headlines. For this work, the PSX, some leading sectors, and individual stocks are considered as categories. After performing all the news categorization steps discussed above, around 11k news headlines for the different PSX categories are labelled and extracted. The remaining news headlines are irrelevant to PSX and are not included in the final dataset.
In Fig. 8, all steps of the keyword-based news headline categorization scheme are shown. The dataset contains properly categorized news headlines for PSX, each with its publishing date. Researchers can use this dataset to capture the impact of news headlines on stock market volatility from different aspects. The news headlines are segregated into three groups: the first contains headlines for the whole PSX, the second combines headlines for a specific sector, and the third holds headlines for a specific stock. News impact can thus be observed at the level of the whole market, an individual sector, and an individual stock separately. Moreover, these impacts can be combined to observe the collective effect of all news groups on market volatility. The dataset will be published in the near future.
Input: Set of category names as seed keywords
Input: Unlabelled news headlines corpus
Output: Categorized news headlines dataset
Step 1: Initialize a keyword list for each category using the category name.
Step 2: Extend each keyword list with the nouns and verbs used in the context of the seed keyword.
Step 3: Perform string matching to filter out all news headlines in which any keyword from the list occurs.
Step 4: Combine the search results for the whole keyword list and discard duplicate news headlines. Assign the category name as a label to each news headline.
Step 5: Extract all labelled news headlines as the final dataset and discard all unlabelled news headlines.
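Steps 3-5 above can be sketched end-to-end in a few lines. The keyword lists and corpus below are invented toy examples, and the lists are shown as if Step 2's context extraction had already extended them:

```python
def categorize(corpus, keyword_lists):
    """Steps 3-5: string-match each category's keyword list against the corpus,
    discard duplicates, label matches, and drop unlabelled headlines."""
    labelled = {}
    for category, keywords in keyword_lists.items():
        matches, seen = [], set()
        for headline in corpus:
            h = headline.lower()
            if any(k in h for k in keywords) and h not in seen:
                seen.add(h)
                matches.append(headline)
        labelled[category] = matches
    return labelled

# Steps 1-2: seed keyword lists, here already extended with context keywords
keyword_lists = {
    "oil_and_gas": ["oil and gas", "oil prices", "pso", "ogdc"],
    "market": ["kse", "psx", "index"],
}
corpus = [
    "Oil prices fall",
    "KSE-100 index gains",
    "Oil prices fall",           # duplicate, discarded (Step 4)
    "Cricket team wins series",  # unlabelled, ignored (Step 5)
]
result = categorize(corpus, keyword_lists)
print(result)
```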

IV. VALIDATION OF NEWS CATEGORIES SEGREGATION
Text classification is one of the fundamental tasks in NLP. With recent advances in neural networks, deep neural networks are considered more promising for text classification than shallow models [12]. To validate the segregation of the news categories, a fully connected artificial neural network based classification model is employed.
The classification model contains two hidden layers, with a softmax activation function in the output layer. For multiclass classification, softmax is used specifically in the output layer: it outputs a vector representing a probability distribution over the classes, from which the class with the maximum probability is selected. The model is implemented using Keras, an open-source Python library that provides a user-friendly environment for fast experimentation with deep neural networks [10]. The model's parameters are tuned by simple trial and error. For the experiment, the dataset is divided into training and test sets, and model validation is performed using Keras's automatic validation feature [10]. To evaluate model performance, a normalized confusion matrix is used; with imbalanced classes in the dataset, a normalized confusion matrix is a good way to summarize the performance of a classification technique. It visually interprets results, where a row represents an actual class and a column represents a predicted class [11]. The experiment is performed for nine news headline categories. In Fig. 9, it is observed that all classes achieve accuracy above 88 percent even though the data is imbalanced.
This shows the degree to which the categories are segregated in the dataset. A properly categorized dataset enhances classification performance, as clearly shown in Fig. 9.
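The two evaluation ingredients above, the softmax output and the row-normalized confusion matrix, can be sketched without any deep learning framework; the logits and confusion counts below are invented toy values:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (as in the output layer)."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_confusion(matrix):
    """Row-normalize a confusion matrix: each row (actual class) sums to 1."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in matrix]

probs = softmax([2.0, 1.0, 0.1])
print(max(range(len(probs)), key=probs.__getitem__))  # predicted class = argmax → 0

cm = [[90, 10], [5, 95]]  # toy counts: rows are actual classes, columns predicted
print(normalized_confusion(cm))  # → [[0.9, 0.1], [0.05, 0.95]]
```

Row normalization is what makes per-class accuracy readable under class imbalance: the diagonal of the normalized matrix gives each class's recall regardless of its size.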

V. CONCLUSION
In this paper, a news headline categorization scheme is presented for the case where no training data exists. The scheme can be applied to any categorization problem in which categories are described by initial sets of seed keywords and the provided dataset is unlabelled. The presented scheme takes the domain-related category name as a seed keyword, requiring negligible manual effort, and utilizes NLP-based techniques to extract the context of each seed keyword, which is used to further refine the categorization results. Finally, the resulting dataset is validated using an ANN-based supervised multiclass classification technique, and the results are demonstrated using a normalized confusion matrix.