When Bitcoin encounters information in an online forum: Using text mining to analyse user opinions and predict value fluctuation

Bitcoin is an online currency that is used worldwide to make online payments. It has consequently become an investment vehicle in itself and is traded in a way similar to other open currencies. The ability to predict the price fluctuation of Bitcoin would therefore facilitate future investment and payment decisions. In order to predict the price fluctuation of Bitcoin, we analyse the comments posted in the Bitcoin online forum. Unlike most research on Bitcoin-related online forums, which is limited to simple sentiment analysis and does not pay sufficient attention to noteworthy user comments, our approach involved extracting keywords from Bitcoin-related user comments posted on the online forum with the aim of analytically predicting the price and extent of transaction fluctuation of the currency. The effectiveness of the proposed method is validated based on Bitcoin online forum data ranging over a period of 2.8 years from December 2013 to September 2016.


Introduction
The advancement of the ubiquitous Internet has resulted in the emergence of unprecedented types of currencies that are distinct from the established currency system. The rise of these so-called cryptocurrencies, of which the total supply is increased by using a unique method known as "mining", has changed the way economic transactions are conducted among Internet users to a great extent. Following the introduction of Bitcoin in 2008 [1], a range of cryptocurrencies comparable to Bitcoin have come into existence since 2010 [2][3][4]. Currently, Bitcoin and other cryptocurrency variants are often used for online payments and transactions [4][5][6] with their circulation gradually increasing over time [3,6].
In parallel with the increasing circulation of Bitcoin, a growing number of Bitcoin users take to social media or online Bitcoin forums to share information [6]. Yet, despite the plethora of information posted by Bitcoin users, the linkage between such postings and Bitcoin transactions has not been well-documented.
The present research builds on previous findings regarding Bitcoin-related online forums, and proposes a method to analytically predict the fluctuations in Bitcoin transaction counts and value using the data collected from user comments posted on the online forum. First, we extracted keywords of interest from user comments on the online forum. We analysed the relationship between the Bitcoin transaction count and price based on the extracted keywords and quantification. Then, we developed a model based on deep learning [7,8] to predict the Bitcoin transaction count and price. The proposed method efficiently processed the readily accessible online data, and identified as well as utilized the elements that online forum users perceived as important.

Related work
Research on cryptocurrencies, particularly on Bitcoin, has been extensively conducted from diverse perspectives, e.g. the analysis of user sentiment as manifested by social media including Twitter [9,10]. The aim is to determine the value of Bitcoin relative to social phenomena and incidents that have taken place since the introduction of the currency. This line of work includes research on the extent to which Bitcoin price fluctuations are related to web search query volumes on Google Trend and Wikipedia, i.e. the extent to which these query volumes predict the Bitcoin price and trade volume [11][12][13][14]. Some recent research has focused on the characteristics of Bitcoin online forums. People who share common interests tend to post comments concerning certain topics on online forums [15][16][17][18][19]. Bitcoin is mostly traded on the web with many users making buying/selling decisions based on information acquired on the Internet [6,20]. Therefore, it is possible to observe how users respond to daily Bitcoin price fluctuations, and to identify or predict future fluctuations in the Bitcoin price and trade volume [6,20]. In addition, forum users are analysed and classified into Bitcoin user groups [6].
Some researchers simply analysed sentiments based on comments posted by forum users or focused on users per se without considering the information derived from cumulative user comment data gathered during a sample period [17,21,22], while others analysed online user comments.
In this regard, topic modelling has been actively explored as an effective technique for analysing user opinions from their online textual postings [23]. Topic modelling [24,25] is a text-mining technique that extracts a set of prevailing topics and relevant keywords out of a large-scale document corpus. This topical information provides users with an instant overview of the corpus, thereby obviating the need to read through comments, which would otherwise be a tedious, time-consuming process.
Recently, collaborative filtering and topic modelling have been integrated for generating scientific article recommendation systems on an online community [26]. A Temporal Latent Dirichlet Allocation (TM-LDA) system was used to conduct an in-depth analysis of the online social community by employing an advanced Latent Dirichlet Allocation (LDA) topic modelling algorithm [27]. Likewise, application of the LDA approach to Chinese social reviews revealed the sentiments underlying some social events and services [28].

System overview
This section provides an overview of the proposed method. First, we gathered the data relevant to Bitcoin for the purpose of the experiment. More specifically, Bitcoin-related posts on the online forum, daily Bitcoin transaction counts, and its price were gathered. We also extracted and rated significant keywords from the data gathered on the online forum. Then, we selected the data of higher score ratings to generate the prediction model based on deep learning and used the model to predict the fluctuation in the Bitcoin price and transaction count (see Fig 1).

Data crawling
Data crawling was the first step in our analysis. The online environment for Bitcoin transactions is well defined and the rise/fall in its price depends on the supply and demand arising from users [2,3,5,6]. We postulated that user comments on the targeted online Bitcoin forum would have an impact on the fluctuation of the Bitcoin price and transaction count. Thus, we crawled and analysed the relevant data.
The large online forum is home to a variety of Bitcoin-related topics, where users actively engage in conversations by posting comments and forming threads [6,29]. The bulletin boards on the Bitcoin online forum are largely comprised of four different sections. Each section consists of three to five sub-sections. For example, the 'Bitcoin' section is sub-divided into 'Development & Technical Discussion', 'Mining', 'Bitcoin Discussion', 'Project Development', and 'Technical Support'. We crawled the 'Bitcoin Discussion' subsection under the 'Bitcoin' section where comments are posted most actively.
The threads of comments and replies posted from 1 December 2013, when Bitcoin started to sweep the globe, until 21 September 2016 were crawled. Each thread, including the topics and all relevant replies, the time when such posts appeared on the forum, the number of replies posted, and view counts were crawled as well. Duplicate sentences were removed from the replies that quoted earlier posts or replies prior to crawling. We collected data in a legitimate manner, in compliance with the terms and conditions. Moreover, the collected data did not involve any personally identifiable information. The .json files of the Bitcoin forums crawled are presented in the Supporting Information.
Furthermore, we used Coindesk to crawl the daily Bitcoin price and the number of transactions for the abovementioned sample period (See Table 1).  In addition, we reinforced the learning model by crawling the widely used Google Trend data and Wikipedia usage data. Google Trend shows the search interest in a certain keyword on a scale of 1 to 100 based on its search volume on Google for a certain sample period. Google Trend data is widely used to analyse data and phenomena in multiple disciplines [30][31][32][33][34]. We gathered Google Trend data related to the keyword "Bitcoin". The Wikipedia usage volume data is based on the page views of a certain keyword on a certain day, and broadly used in many analytical studies on data or Internet phenomena [34][35][36]. Again, we gathered data about the keyword "Bitcoin" on Wikipedia. Table 1 outlines the arrangement of opinion and market data crawled.

Analysis of user comment data
Our intention was to extract significant keywords used in Bitcoin transactions from the aforementioned crawled data. Therefore, we conducted topic modelling on every user comment to extract the keywords, which were in turn subjected to kernel density estimation for score rating.
Concept building. Our main goal was to extract quantitative features related to diverse characteristics from documents (see Fig 2). We considered the feature value as the degree of relevance for a feature. In detail, the feature value represents the extent to which a document has a particular characteristic. For example, sentiment analysis concerns one such quantitative feature, or the extent to which a document is positive or negative. We generalised this idea to various other user-defined characteristics. Examples of such characteristics include the extent to which a document is related to finance, immigration, and family issues. In particular, we built a lexicon, i.e. a set of keywords, relevant to the characteristics and utilised it to assign a feature value to a document by computing the degree to which the document contains those characteristics defined in the lexicon and other potentially relevant keywords. In this study, we considered a characteristic to be a concept describing a particular phenomenon or object, and defined a concept by constructing a set of keywords, whose meanings were relevant.
Concepts can play an important role in document analysis in diverse fields. That is, one can build useful domain-specific concepts in economics, politics, and social sciences and define the characteristics of documents with respect to these concepts. In the case of a spam-filtering task on documents and comments, for example, we can actively employ a 'spam' concept consisting of suspicious terms that usually appear in spam mails to measure the likelihood of the comment being unsolicited mail. Here, the concept building process was composed of two steps: (1) the initial construction of a relevant keyword set, followed by its (2) user-interactive expansion. In order to facilitate the first step, we provided a user with the initial sets of coherent keywords obtained with two different techniques. The first technique we used was topic modelling, which algorithmically computes those representative keywords emerging from a document corpus. The user can then select some of them as an initial word set for their own concepts. As the other method to provide initial keywords, we computed the representative keywords from the centroid vectors obtained by k-means clustering on word embedding vectors [37].
Once a user formed an initial, small-sized lexicon for a particular concept, the second step was to interactively expand it by using a recently proposed visual analytics system named ConceptVector. Based on the initial lexicon given as user inputs, ConceptVector recommended potentially relevant keywords to enable users to easily add a subset of them to the lexicon. As the lexicon expanded, ConceptVector adjusted the recommended keywords that match the semantic meaning of the concept.
The foregoing procedure is discussed further below.
Topic modelling for initial lexicon building. The topic modelling approach we used to extract representative keywords emerging from a document corpus is non-negative matrix factorisation, where the non-negativity allows users to interpret the value from factor matrices as the relevance score of a word or a document to a particular topic as mentioned above.
In particular, we constructed a document-term matrix A from the 17,381 forum articles and 627,122 user comments collected from the Bitcoin forum (See Table 1). Each article contains five attributes, 'content', 'topic', 'comments', 'date', and 'views', whereas each comment contains 'content' and 'date' features. Using the 'date' field, we split the document-term matrix per day for our analysis. We then applied the topic modelling to each so as to extract the different topic sets and their representative keywords across different dates.
The mathematical details of this process are as follows. Given a document-term matrix $A \in \mathbb{R}^{m \times n}$, where $m$ is the number of articles and $n$ is the dictionary size, non-negative matrix factorisation (NMF) approximately factorises it into two non-negative matrices $W \in \mathbb{R}^{m \times d}$ and $H \in \mathbb{R}^{d \times n}$, where $d$ represents the number of topics (50 in our study), i.e. $A \approx WH$ with $W \ge 0$ and $H \ge 0$.
With these dimensions, the rows of the resulting matrix $H$ correspond to different topics, and the keywords corresponding to the dimensions of the $k$ largest values in each row function as the representative keywords of the topic. The columns of $W$ give the relevance of each article to each topic.
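The factorisation and keyword extraction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the solver (Lee-Seung multiplicative updates for the Frobenius objective), the iteration count, the toy vocabulary, and the toy matrix are all assumptions made for the example, and the study used 50 topics rather than 2.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def nmf(A, d, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative A (m x n) by W (m x d) times H (d x n).

    Uses multiplicative updates (an assumed solver), which keep every
    entry of W and H non-negative at each step.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(d)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(d)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(d)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(d)]
             for i in range(m)]
    return W, H

def top_keywords(H, vocab, k):
    """For each topic (one row of the topic-term factor), return the k heaviest terms."""
    return [[vocab[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:k]]
            for row in H]

def frobenius_error(A, W, H):
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

# Hypothetical 4-document, 4-term count matrix with two obvious themes.
vocab = ["mining", "hash", "wallet", "key"]
A = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]
W, H = nmf(A, d=2)
topics = top_keywords(H, vocab, k=2)
```

In this sketch each row of `W` gives the relevance of one document to each topic, which is the property that makes the non-negative factors directly interpretable as scores.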
Expanding the lexicon via word recommendation. We proposed two types of concepts in the system. A unipolar concept represents exactly one concept such as crude oil and immigration. A bipolar concept has two polarities that oppose each other, e.g. positive vs. negative, progressivism vs. conservatism. In the case of building a concept, the system has positive, negative, and irrelevant word sets. When a user provides a word as an input, the system provides 50 recommended words that are potentially relevant to the seed word. We then automatically sorted the recommended words into five clusters, using the k-means clustering, to gather closely related terms into one group.
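The grouping of recommended words can be illustrated with a small k-means routine over word vectors. This is a sketch under stated assumptions: the two-dimensional vectors and the word list are invented for the example, real word embeddings have far more dimensions, and the system described above groups 50 recommended words into five clusters rather than six words into two.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids

# Hypothetical recommended words with made-up 2-D embedding vectors.
words = ["miner", "hashrate", "asic", "exchange", "trade", "broker"]
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(vectors, k=2)
groups = {}
for word, lab in zip(words, labels):
    groups.setdefault(lab, []).append(word)
```

Presenting the recommendations by group, as in `groups`, lets a user accept or reject closely related terms together instead of scanning an unordered list.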
Once the lexicon of a concept is created by user interactions, the document rating process utilises the concept built in the process above. Because of the lack of expression resulting from the limited number of words a person could manage, we applied the kernel density estimation (KDE) in the word rearranging phase.
Computation of document relevance to concept. Prior to the KDE, the concept had a limited number of descriptive terms for a characteristic, which resulted in a lack of expression and description. Therefore, the KDE served for the probabilistic smoothing over every word. This smoothing process is the most important procedure for document analysis since, without it, the score rating process cannot consider synonyms or closely related words that also represent a specific concept. Based on the assumption that the input terms describe the concept sufficiently well, we constructed a kernel that exerts influence on the entire vocabulary. ConceptVector adopts a Gaussian kernel as described below.
For the class $y \in \{\text{positive}, \text{negative}, \text{irrelevant}\}$, the conditional probability for each class can be calculated from the distance function $d$, which represents the distance between a keyword and a word in the word set of each class, and the kernel $k$, which ensures a proper balance between the given word and the others. The conditional probability of a keyword $z$ for a class $c$, computed as in Eq (2), can also be seen as the relevance score to each class. Since our final goal was to obtain scores by taking all classes into consideration, we rated a concept in view of all classes. For instance, 'happy', in the case of a bipolar concept, was rated for the positive, negative, and irrelevant classes. We calculated the bipolar rating as below:
$$\mathrm{biscore}(z) = \mathrm{rel}(z) \cdot \{\, p(y = \text{positive}; z) - p(y = \text{negative}; z) \,\} \qquad (3)$$
The range of the bipolar score is [-1, 1] because the maximum values of $p(y = \text{positive}; z)$ and $p(y = \text{negative}; z)$ are 1.
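A rough sketch of this kernel-based rating is given below. Several parts are assumptions rather than the definitions used by ConceptVector: the class probability is taken as the normalised average Gaussian kernel mass over each class's seed words, `rel(z)` is taken as one minus the probability of the irrelevant class, the distance is Euclidean, and the seed vectors are invented two-dimensional stand-ins for word embeddings.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def gaussian_kernel(dist, bandwidth=1.0):
    """Gaussian kernel weight for a given distance."""
    return math.exp(-(dist ** 2) / (2 * bandwidth ** 2))

def class_probabilities(z, seeds, bandwidth=1.0):
    """Normalised kernel mass of vector z for each class.

    `seeds` maps a class name to the vectors of its seed words.
    The normalisation across classes is an assumed form, not Eq (2).
    """
    mass = {c: sum(gaussian_kernel(euclidean(z, w), bandwidth) for w in vecs) / len(vecs)
            for c, vecs in seeds.items()}
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}

def bipolar_score(z, seeds, bandwidth=1.0):
    """rel(z) * (p(positive; z) - p(negative; z)), with an assumed rel(z)."""
    p = class_probabilities(z, seeds, bandwidth)
    rel = 1.0 - p["irrelevant"]  # hypothetical definition of rel(z)
    return rel * (p["positive"] - p["negative"])

# Hypothetical seed-word vectors for a bipolar concept.
seeds = {
    "positive": [(1.0, 0.0), (0.9, 0.1)],
    "negative": [(-1.0, 0.0), (-0.9, -0.1)],
    "irrelevant": [(0.0, 5.0)],
}
near_positive = bipolar_score((0.95, 0.05), seeds)
near_negative = bipolar_score((-0.95, -0.05), seeds)
off_topic = bipolar_score((0.0, 5.0), seeds)
```

Because the kernel assigns some weight to every vector, a word that was never added to the lexicon still receives a score that reflects how close it lies to the seed words.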

Prediction modelling
Granger causality test. The Granger causality test is based on the assumption that if a variable X causes Y, then changes in X will systematically occur before changes in Y [17,22,38]. As shown in past research, lagged values of X then exhibit a statistically significant correlation with Y [17,22,38]. Nevertheless, correlation does not imply causation. We therefore test only whether the time series of forum opinions contains predictive information about fluctuations in the Bitcoin transaction count and price.
Our time series of the Bitcoin transaction count and price, denoted by $S_t$, reflects the day-to-day change in the Bitcoin transaction count and price. To test whether the time series of forum opinions could predict the fluctuation in the Bitcoin transaction count and price, we compared the variance explained by two linear models, as in (5) and (6). The first model uses only $n$ lagged values of $S_t$ for the prediction, whereas the second model uses the $n$ lagged values of both $S_t$ and the time series of a concept of forum opinions, denoted by $X_{t-1}, \cdots, X_{t-n}$. We performed the Granger causality test according to the models in (5) and (6).
Based on the results of the Granger causality test, we can reject, with a high level of confidence, the null hypothesis that the time series of a concept of forum opinions does not predict fluctuations in the Bitcoin transaction count and price. The Granger causality test was performed on the Bitcoin transaction count and price for time lags of 1 to 12 days.
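The comparison of the two linear models can be sketched as an F-test on their residual sums of squares. The code below is an illustration under assumptions: both models include an intercept, they are fitted by ordinary least squares through the normal equations, and the series are synthetic. It returns only the F statistic; turning it into a p-value requires the F distribution with (n, dof) degrees of freedom, which is omitted here.

```python
import random

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * xi for b, xi in zip(beta, r))) ** 2 for r, yi in zip(X, y))

def granger_f(s, x, n):
    """F statistic for 'lags of x add predictive power for s beyond lags of s'."""
    restricted, unrestricted, target = [], [], []
    for t in range(n, len(s)):
        s_lags = [s[t - i] for i in range(1, n + 1)]
        x_lags = [x[t - i] for i in range(1, n + 1)]
        restricted.append([1.0] + s_lags)             # lags of S only
        unrestricted.append([1.0] + s_lags + x_lags)  # lags of S and X
        target.append(s[t])
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - (2 * n + 1)
    return ((rss_r - rss_u) / n) / (rss_u / dof)

# Synthetic series: s follows x with a one-day delay plus small noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
s = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.1) for t in range(1, 200)]
f_x_to_s = granger_f(s, x, n=2)
f_s_to_x = granger_f(x, s, n=2)
```

In this synthetic setting the statistic for the direction from `x` to `s` should be far larger than for the reverse direction, which mirrors the asymmetry the test is designed to detect.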
Deep learning model. Using the gathered data and the analysed and rated comment data, we built a model for predicting the fluctuation in the Bitcoin price and transaction through deep learning. Deep learning is widely used for addressing diverse challenges [8,39]. Despite the quantitative and qualitative increases in Bitcoin-related formal and informal data following the broadening applicability of Bitcoin, deep learning has rarely been used to explore Bitcoin price trends and to address other Bitcoin-related challenges. We created a setting to apply deep learning to the data spanning a period of 2.8 years.
As the first step, we standardised the data to improve its applicability to the learning model. An example of applicable input data is provided in Table 2.
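As an illustration of this step, the sketch below standardises each input column to zero mean and unit variance. The exact standardisation used in the study is not specified here, so the z-score form is an assumption, and the two-column sample rows are invented values.

```python
from statistics import mean, pstdev

def standardise_columns(rows):
    """Rescale each column to zero mean and unit (population) standard deviation.

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(v - m) / sd if sd > 0 else 0.0 for v, (m, sd) in zip(row, stats)]
            for row in rows]

# Hypothetical daily rows: (price, transaction count).
rows = [[400.0, 60000.0],
        [500.0, 80000.0],
        [600.0, 100000.0]]
z = standardise_columns(rows)
```

Putting all inputs on a comparable scale prevents large-valued columns, such as transaction counts, from dominating the weighted sums in the first network layer.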
Subsequently, to use the input data for prediction, we set up a deep learning model. Multiple hidden layers were accumulated for learning to identify deep data structures. Specifically, 1, 2, 3, and 5 hidden layers were constructed to select the layer structure that returned the best possible prediction result. The number of neurons that were allocated to each hidden layer was 1,024.
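The layer structure can be sketched as a plain feed-forward pass. This is only an illustration in standard Python, not the TensorFlow model used in the study: the ReLU activation, the random untrained weights, and the helper names are assumptions, and the demo uses a small hidden width so that it runs quickly (the study used 1,024 neurons per hidden layer). The input layout, 15 values per day stacked over the most recent days, follows the description in the next paragraph.

```python
import math
import random

def make_window(daily_features, days):
    """Concatenate the last `days` daily feature vectors into one input vector."""
    return [v for day in daily_features[-days:] for v in day]

def init_mlp(n_in, n_hidden_layers, width=1024, n_out=2, seed=0):
    """Random weights and zero biases for a fully connected network."""
    rng = random.Random(seed)
    sizes = [n_in] + [width] * n_hidden_layers + [n_out]
    return [([[rng.gauss(0, 1 / math.sqrt(a)) for _ in range(a)] for _ in range(b)],
             [0.0] * b)
            for a, b in zip(sizes, sizes[1:])]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def forward(layers, x):
    """Hidden layers use ReLU (assumed); the two outputs go through softmax."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return softmax(x)

# Hypothetical standardised data: 12 days with 15 features each.
rng = random.Random(42)
daily = [[rng.gauss(0, 1) for _ in range(15)] for _ in range(12)]
input_sizes = {days: len(make_window(daily, days)) for days in (3, 5, 7, 12)}

x = make_window(daily, 12)
layers = init_mlp(len(x), n_hidden_layers=3, width=64)
probs = forward(layers, x)  # two class probabilities (rise/fall order is a convention)
```

The dictionary `input_sizes` reproduces the arithmetic behind the input layer widths, since 15 features over 3, 5, 7, and 12 days give 45, 75, 105, and 180 inputs.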
As for the input layers, based on the input data provided in Table 2, 15 input data points were represented as serial vectors to allocate neurons based on the cumulative number of days spent on learning, i.e. 45, 75, 105, and 180 neurons were allocated to cumulative 3, 5, 7, and 12 days. As for the output layer, two neurons were allocated while the probability of rise/fall was represented with the softmax function. The prediction model was built using Google Tensorflow [7], and GPU operation (nVIDIA CUDA) was used to accelerate the deep learning process.

Concept building results
Because mining is a means of earning Bitcoin, many users share their opinions about its efficiency. In addition, the fundamental algorithm by which the Bitcoin is operated, namely 'blockchain,' is often discussed. Other than mining, Bitcoin can also be earned by transactions. Therefore, it is possible to conduct transactions with investment character, in which case related concepts include 'transaction' and 'investment'. Moreover, the 'wallet', a kind of repository in which Bitcoin can be stored and used in subsequent transactions via mining, has given rise to many opinions. In addition, it would be possible to more accurately verify users' considerations when they use Bitcoin through 'security' concepts relative to the problems that may occur as a result of mining and transactions.
The 'silkroad', a large marketplace that used Bitcoin as a currency, was exploited for illegal transactions and money laundering. Security therefore not only became a popular issue on the Bitcoin forum but also resulted in social problems, leading to the closure of the site. Although the situation was resolved when the site was closed towards the end of 2013, words regarding related exchange markets and companies attracted considerable attention from users. Therefore, many opinions on illegality related to the use of Bitcoin and consequent problems were verified through the concept of 'illegality'.
Since the emergence of Bitcoin, many types of similar cryptocurrencies have been developed and are in use. Users' discussion on the presence and availability of other cryptocurrencies can be found through the 'altcoin' concept. China dominates the pricing of Bitcoin with large funds, and the related trend manifested in the postings on the forum can be viewed via the concept 'China'.

Results of Granger causality test and correlation test
Based on the results of the Granger causality test, the null hypothesis that the time series of the gathered data do not predict the fluctuation in Bitcoin transaction volume and price was rejected, i.e. $\beta_{\{1,2,\cdots,b\}} \neq 0$, with a high level of confidence. The Granger causality test was performed on the Bitcoin transaction count and price for a time lag of 1 to 12 days. Tables 3 and 4 list the test results.
In addition, the Pearson Correlation Coefficient between the rating of each concept and Bitcoin price and transaction is shown in Table 5.
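For reference, the Pearson correlation coefficient between a concept rating series and a market series can be computed as in the sketch below. The two short series are invented values used only to show the call.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily concept ratings and daily prices.
concept_rating = [0.10, 0.25, 0.15, 0.40, 0.35]
price = [410.0, 455.0, 430.0, 520.0, 500.0]
r = pearson(concept_rating, price)
```

The coefficient measures only linear association, so a value near zero does not rule out a non-linear relation between a concept and the market series.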
The foregoing results are partially indicative of the significance of the extracted keyword data. However, this process was only used for the purpose of verification. The entire data set was used to build the actual deep learning model for prediction.

Prediction results
We built and applied the deep learning model based on the gathered and KDE-based rating data to predict the Bitcoin transaction and price. For the period from 1 December 2013 to 21 September 2016, 90% of the data were used for learning, with the remaining 10% used for validation. The accuracy rate, the Matthews correlation coefficient (MCC), and the F-measure were used to evaluate the performance of the proposed model. Table 6 presents the prediction results. The most accurate prediction model for the Bitcoin price (accuracy rate = 80.39%) is based on the three-layer neural network and the previous twelve-day learning data. The most accurate prediction model for Bitcoin transaction (accuracy rate = 81.37%) is based on the two-layer neural network and the previous twelve-day learning data. Table 4 presents the results relative to the layer and learning data structures. Both three or more hidden layers and cumulative learning data for 12 days or longer resulted in negligible differences. Fewer than two hidden layers and cumulative learning data for fewer than 7 days proved to be insufficient for learning and compromised the prediction accuracy. Conversely, overfitting could possibly occur, with the prediction accuracy failing to significantly improve, if more than five hidden layers and cumulative data for over 12 days were used.
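The evaluation measures named above can be computed from the confusion counts of a binary rise/fall prediction, as sketched below. The chronological form of the 90/10 split is an assumption (the text states only the proportions), and the label vectors are invented for the example.

```python
import math

def chronological_split(samples, train_fraction=0.9):
    """Split an ordered sequence into an earlier learning part and a later validation part."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = rise and 0 = fall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall (F1); 0.0 if there are no true positives."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical validation labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
train, valid = chronological_split(list(range(10)))
scores = (accuracy(y_true, y_pred), f_measure(y_true, y_pred), mcc(y_true, y_pred))
```

Reporting the MCC alongside the accuracy rate is useful because the MCC stays near zero for a model that always predicts the majority class, even when rises and falls are unevenly represented.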

Discussion
We analysed the user comments posted on a Bitcoin online forum to predict the fluctuation in the Bitcoin price and transaction count. Based on the easily accessible online data, the proposed method predicted the Bitcoin price fluctuation with an accuracy rate of over 80%. Moreover, online user postings influenced Bitcoin transactions. The proposed method shed light on some aspects of Bitcoin-related user comments affecting their decisions to buy/sell the cryptocurrency. The causality test result indicated some topics associated with Bitcoin transactions. The Granger causality test result highlighted the concept 'China' as having a high causality toward the Bitcoin price with the p-value being 0.05 or less, which was significant. These findings suggest China exerts a strong influence on the Bitcoin price.
Furthermore, such concepts as 'Blockchain', 'Altcoin', and 'Transaction' had a high causality toward Bitcoin transaction count with the p-value being 0.05 or less, which was significant. This finding suggests that topics related to the circulation and transaction of other types of cryptocurrencies have an impact on the Bitcoin transaction volume.
In addition, the correlation test found significant linear relations in most concepts, excluding 'Silkroad', which showed an insignificant linear relation. Hence, the experimental findings revealed some user comments that had the most significant relationship with and effects on the fluctuation in Bitcoin price and transactions.
That said, the proposed method has a limitation in terms of its broader applicability due to the fact that the concepts were constructed for a long period of time. For instance, the correlation coefficient of the concept 'Silkroad' was 0 or lower even though its construction was based on topics often mentioned by users in relation to some events taking place during a certain period, which hindered the extension of the analysis of the concept to the entire sample period. Thus, appropriate subdivision of the sample period would help to obtain a more accurate understanding of the users for topic modelling and to refine the analysis with additional approaches including sentiment analysis.
Moreover, the present findings warrant further studies on the analysis of user comments relative to the characteristics of Bitcoin forums. To increase the accuracy of prediction, it is necessary to address a few challenges. The present work is focused on analysing online forum user comments and adds some formal or structured data to predict the fluctuation in the Bitcoin price and transactions. However, it may add to the reliability of the findings if the search results and relevant content on search engines were quantitatively analysed or if the social network data were analysed, as was done in some comparable previous studies [21,40]. Furthermore, it may be an efficient preliminary study to analyse and classify online forum users per se [41][42][43][44][45]. In addition, the postings may be worth filtering more meticulously [46][47][48][49][50] to more accurately corroborate the findings.
Information derived from online forum users seems to be well-suited for extensive research on cryptocurrencies as well as Bitcoin. In the same vein, keywords manifested in online forum user comments could be used for further in-depth analysis and understanding of cryptocurrency transactions. Online forum users' propensities could also be a cue to identify the characteristics inherent in each cryptocurrency. Moreover, online forums are great sources of abundant informal and formal information, which serves to appreciate cryptocurrencies from diverse perspectives including money laundering, which is closely associated with cryptocurrencies [51][52][53][54].

Conclusion
With the increasing circulation of Bitcoin, its acceptability has drawn much attention in many ways [2,3,5,14]. The present study is noteworthy in that it analysed the topics often mentioned by Bitcoin users and linked their meanings to Bitcoin transactions. The proposed method for predicting the fluctuation in the Bitcoin price and transactions based on user opinions on online forums is conducive to understanding a range of cryptocurrencies other than Bitcoin and increasing their usability, although it needs to be reinforced. In addition, the present approach to the salience of user comments on online forums is likely to yield more significant results in many other fields.