DiTeX: Disease-related topic extraction system through internet-based sources

This paper describes DiTeX, a web-based automated disease-related topic extraction system that monitors important disease-related topics and provides associated information. National disease surveillance systems require a considerable amount of time to inform people of recent disease outbreaks. To solve this problem, many studies have used Internet-based sources such as news and Social Network Services (SNSs). However, these sources contain many intentional elements that hinder the extraction of important topics. To address this challenge, we employ Natural Language Processing (NLP) and an effective ranking algorithm to develop DiTeX, which provides important disease-related topics. This report describes the web front-end and back-end architecture, the implementation, the performance of the ranking algorithm, and the topics captured by DiTeX. We describe the processes for collecting Internet-based data and extracting disease-related topics based on search keywords. Our system then applies a ranking algorithm to evaluate the importance of the disease-related topics extracted from these data. Finally, we conduct an analysis based on real-world incidents to evaluate the performance and effectiveness of DiTeX. Specifically, we compare how various ranking algorithms rank well-known disease-related incidents. The topic extraction rate of our ranking algorithm is superior to those of the others. We demonstrate the validity of DiTeX by summarizing the disease-related topics of each day extracted by our system. To our knowledge, DiTeX is the world's first automated web-based real-time service system that extracts and presents disease-related topics, trends, and related data through web-based sources. DiTeX is now available on the web at http://epidemic.co.kr/media/topics.


Introduction
Concerns about disease-related issues have increased due to the year-after-year appearance of diseases such as MERS, Zika virus, Avian Influenza, and Ebola virus. Developing a vaccine to prevent these diseases consumes a considerable amount of time and very large sums of money. In addition, Centers for Disease Control and Prevention (CDCs) have existed in Europe and the USA since the end of World War II, and many other countries have established their own CDCs, such as the Korea Centers for Disease Control and Prevention (KCDC) and [...] disease-related topics. In addition, DiTeX provides a user-friendly interface that exploits effective visualization techniques such as disease-related trend graphs and a word cloud.

Related works
There have been many studies and services that automatically collect unstructured and irregular data and extract disease outbreak information for users. Disease surveillance systems collect data from various web-based resources, extract only highly relevant data through text-processing algorithms, and then use these data to display information such as disease routes, occurrence status, and risk. BioCaster [24], developed by Collier, is a project that ran from 2006 to 2012. BioCaster collects data from Internet news articles, public health workers, and users, and applies various filtering methods along with the BioCaster ontology. In addition, BioCaster uses text mining technology to track in real time what is occurring and what is likely to occur. EpiSPIDER [25] collects news, Twitter, and WHO articles and combines these with geographic data to provide users with information about diseases. EpiSPIDER also provides a map interface, a word cloud that shows which topics are most active, and various filtering functions that allow users to find the information they want. Finally, HealthMap [23] collects and visualizes disease-related data from the World Health Organization (WHO), Google News, and validated official alerts. HealthMap extracts only disease-related data, links diseases to regions, measures the risk of disease, and visualizes those risks through color coding. HealthMap collects data on 87 disease categories and 89 countries; the accuracy of the HealthMap disease classifier is 84%.
Many researchers have studied how to extract health-related information from web-based data [26][27][28]. Yingjie Lu et al. extract and categorize the most active health-related topics in online communities [26]. They collect data from health-related social media services and categorize them into five clusters, reporting an average accuracy of 83.5%. Kyle W. Prier et al. identify health topics on Twitter [27]. They categorize topics that are relevant to tobacco and define the words that are relevant to each topic. They analyze the association between topics and words and find the most relevant word set. Jiang Bian et al. analyze Twitter using NLP and machine learning [28]. They argue that health-related topics can be extracted from Twitter even though Twitter is difficult to analyze because of its high level of noise.
Some papers [23][24][25] present service systems that track diseases when disease-related events occur, but users cannot find comprehensive information on disease-related topics in them, whereas DiTeX focuses on the extraction and presentation of disease-related topics. Other papers [26][27][28] study methodologies that extract health-related information from web-based data, whereas DiTeX covers both the service system and the methodology. DiTeX also focuses on disease rather than health in general. We believe that DiTeX may help people to capture newly emerging diseases and a variety of disease-related information. DiTeX can analyze disease-related trends over time and extract real-time information on disease-related topics. To our knowledge, DiTeX is the world's first automated web-based real-time service system that extracts and presents disease-related topics, trends, and related data through web-based sources. DiTeX collects information from news and Twitter, which are public data. DiTeX complies with the terms of service for the NAVER News API (https://developers.naver.com/products/terms/) and the Twitter search API (https://twitter.com/ko/tos).

Ranking algorithm
While many researchers have worked to extract the topics that people are interested in, many unnecessary attributes of Internet-based data, such as repetition of the same meaningless content and text written for marketing purposes, hamper topic extraction. To solve this problem, researchers have proposed various ranking algorithms. Term Frequency (TF) [29], one of the most widely used techniques in Information Retrieval (IR) and text mining, extracts the most frequently mentioned words in a document. However, because TF weights every occurrence of every word equally, it is not an ideal method: it interprets repeated meaningless words as important. Researchers have therefore proposed alternatives such as Term Frequency-Inverse Document Frequency (TF-IDF) [36], the SMART weighting scheme [32], BM25 [33], Robertson-Sparck-Jones weighting [34], INQUERY [35], and the Combined Component Approach (CCA) [37].
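The weakness of plain TF can be illustrated with a minimal sketch. It is not the CCA ranking used by DiTeX; it only contrasts collection-level term counts with a simple TF-IDF variant (count multiplied by log(N/df)), and the function names and toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_scores(docs):
    """Count how often each term occurs across all documents (plain TF)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def tfidf_scores(docs):
    """Weight each term count by log(N / df), a basic TF-IDF variant."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # each document counts once per term
    tf = tf_scores(docs)
    return {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}


# Toy corpus: "sale" is marketing noise repeated in every document.
docs = [
    ["sale", "sale", "virus"],
    ["sale", "egg"],
    ["sale", "virus"],
]
tf = tf_scores(docs)
tfidf = tfidf_scores(docs)
```

In this toy corpus, TF ranks the noise word "sale" first because it is simply the most frequent, while the IDF factor drives its weight to zero because it appears in every document.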

System architecture
The specifications of DiTeX are as follows: the server runs Windows Server 2016 Essentials x64 on an Intel® Core™ i7-4790 CPU with 8 GB of RAM. DiTeX consists of a total of eight modules (see Fig 1). The data crawler collects news and SNS data. The back-end web changes the shape of the data according to the front-end web and performs processing to insert the collected data into the database. The content extractor removes the useless attributes and deletes duplicate data before storing final data in the database. The term manager retrieves the data from the database and creates the term extractor and term objects. Front-end web performs the role of managing view pages and visualizing disease-related data. Finally, the database was PostgreSQL 9.6, a relational database management system (RDBMS) [39]. The database consisted of the contents

Id | Formula | Description
t01 | $tf$ | Number of times a term occurs in a document [29]
t02 | $1 + \log(tf)$ | Natural logarithm of tf [30]
t03 | $0.5 + \frac{0.5 + tf}{\max tf}$ | tf factor normalized by maximum tf [29,31]
t04 | $\frac{1 + \log(tf)}{1 + \log(\mathrm{avg}\ tf)}$ | Part of SMART weighting scheme formula [32]
t05 | [...] | Part of Okapi BM25 ranking formula [33] ($k_1 = 1000$)
t06 | [...] | An alternative to Inverse document frequency (idf) [30]
t07 | $\log\left(\frac{N - df + 0.5}{0.5}\right)$ | A variation of the Robertson Sparck Jones weight [34]
t08 | $\log\left(\frac{N - df}{df}\right)$ | A probabilistic inverse collection frequency [29]
t09 | [...] | Part of INQUERY formula [35]
t10 | [...] | [...] [29]
t11 | $dl$ | Document length (in bytes) normalization [31]

Data crawler
The data crawler collects 100 articles and tweets per hour using the search Application Programming Interfaces (APIs) [40] provided by NAVER [41] and Twitter. Each search API follows the REpresentational State Transfer (REST) style [42]: the crawler sends a Hypertext Transfer Protocol (HTTP) [44] GET request to a specific Uniform Resource Locator (URL) and receives the response as JavaScript Object Notation (JSON) [43]. JSON is a lightweight text data format built from name/value pairs and is used in various programming language environments. The data crawler stores the collected data in a text file whose name includes the collection date and the search keyword. Table 2 shows the search keywords used by the data crawler. These are the words most commonly used on the detail pages that describe the statutory infectious diseases and the infectious diseases designated by the KCDC.
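The request and storage step can be sketched as follows. This is only an illustration under stated assumptions, not the DiTeX crawler: the base URL is a placeholder, the query parameter names `query` and `display` and the `items` response field are assumed for the example, and the file naming pattern is one plausible way to combine the date and the keyword.

```python
import json
from datetime import date
from urllib.parse import urlencode


def build_search_url(base_url, keyword, count=100):
    """Build the GET URL for one keyword search (parameter names assumed)."""
    return f"{base_url}?{urlencode({'query': keyword, 'display': count})}"


def result_filename(day, keyword):
    """Name the output file after the collection date and the search keyword."""
    return f"{day.isoformat()}_{keyword}.txt"


def parse_items(json_text):
    """Extract the list of result records from a JSON response body.

    The 'items' field name is an assumption; a missing field yields [].
    """
    return json.loads(json_text).get("items", [])


# Hypothetical endpoint and a hand-written response, for illustration only.
url = build_search_url("https://example.org/search", "virus")
name = result_filename(date(2018, 2, 7), "virus")
items = parse_items('{"items": [{"title": "Norovirus outbreak"}]}')
```

The actual HTTP call is left out on purpose; any HTTP client could send the GET request to the generated URL and pass the response body to `parse_items`.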

Back-end web
The back-end web is based on Spring 4 [45] and Jetty [46] and applies Model View Controller (MVC) [47], a software design pattern. Its main task is to convert the data requested by the front-end web into JSON. The back-end web also runs the modules (content extractor, term manager, term extractor, and term object) at a specific time set in a job scheduler that operates periodically. The modules perform the following processes. First, the content extractor retrieves the data collected by the data crawler and extracts only the sentences from the data. The content extractor removes unnecessary elements (URLs, HTML tags, retweets, etc.) from the extracted sentences, using regular expressions to replace them with blanks. Because these sentences are likely to be redundant or similar to one another, we check the similarity between sentences using Sift4 [48]. Sift4 is a string distance algorithm inspired by Jaro-Winkler [49] and the longest common subsequence principle [50]. We use Sift4 to remove duplicate and similar sentences so that only unique sentences remain. The content extractor stores these complete sentences in the content table of the database. The content table consists of the collection date, the data type (News, SNS), and the sentences.
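A minimal sketch of the cleaning and de-duplication step follows. Note the substitution: Sift4 is not implemented here; Python's standard `difflib.SequenceMatcher` stands in as the similarity measure. The regular expressions and the 0.9 threshold are illustrative assumptions, not the values used by DiTeX.

```python
import re
from difflib import SequenceMatcher

# Illustrative patterns for the unnecessary elements named in the text.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
RETWEET_RE = re.compile(r"\bRT\s+@\w+:?")


def clean(text):
    """Replace URLs, HTML tags, and retweet markers with blanks."""
    for pattern in (URL_RE, TAG_RE, RETWEET_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace


def unique_sentences(sentences, threshold=0.9):
    """Keep a sentence only if it is not too similar to one already kept.

    SequenceMatcher.ratio() is a stand-in for Sift4; threshold is assumed.
    """
    kept = []
    for sentence in sentences:
        cleaned = clean(sentence)
        if not cleaned:
            continue
        if all(SequenceMatcher(None, cleaned, k).ratio() < threshold for k in kept):
            kept.append(cleaned)
    return kept


raw = [
    "RT @user: Norovirus outbreak reported <b>today</b> http://t.co/abc",
    "Norovirus outbreak reported today",
    "Egg pesticide contamination found",
]
result = unique_sentences(raw)
```

The pairwise comparison is quadratic in the number of kept sentences, which is acceptable for a sketch; a production system would likely need a faster distance function or blocking strategy.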
Second, the term manager retrieves the sentences stored in the database by the content extractor. The term manager creates the term extractor and passes it one sentence at a time. The term extractor then extracts words from the sentence using OpenKoreanTextProcessorJava [51]. OpenKoreanTextProcessorJava is written in Java [52] and Scala [53] and is the most widely used Natural Language Processing (NLP) tool in Korea. It has the largest Korean corpus and is constantly adding new words. The term extractor performs the generalization and tokenization of sentences. Generalization corrects misspelled words in a sentence, so that we obtain sentences composed of correct words. Tokenization then turns the corrected sentence into a list of tokens, each consisting of a word and a tag. From this list we collect only the words that are nouns and are longer than one character. Third, a term object represents an extracted word. The term manager holds a list of term objects and creates a new term object whenever a new word is found. After collecting the words of all sentences, the term manager applies the ranking algorithm to the term object list to calculate the CCA value of each term.
Finally, after the calculation for the entire term object list is completed, the term manager and term object data are stored in the database.
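The token filtering rule described above (keep nouns longer than one character) can be sketched as follows. The tokenizer itself is not reproduced: the input is assumed to be a list of (word, tag) pairs such as a Korean tokenizer might return, and the tag name "Noun" is an assumption made for this example.

```python
from collections import Counter


def select_terms(tokens):
    """Keep only nouns whose length is greater than one character."""
    return [word for word, tag in tokens if tag == "Noun" and len(word) > 1]


def count_terms(tokenized_sentences):
    """Accumulate term counts over all sentences, one entry per distinct word."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(select_terms(tokens))
    return counts


# Hand-written (word, tag) pairs standing in for tokenizer output.
sentences = [
    [("노로", "Noun"), ("바이러스", "Noun"), ("가", "Josa"), ("균", "Noun")],
    [("바이러스", "Noun"), ("확산", "Noun"), ("되다", "Verb")],
]
terms = count_terms(sentences)
```

Here the `Counter` plays the role of the term object list: a new entry appears the first time a word is seen, and later occurrences update the existing entry.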

Front-end web
The front-end web is the part of our system that visualizes disease-related data for users. Here, we use jQuery [54] to request the information that users want from our system. jQuery uses Asynchronous JavaScript and XML (AJAX) [55] to exchange data with the server and to provide information for the dates requested by the user. The data received from the server are in JSON format, and the front-end web uses them to create charts, tables, and word clouds. We use DataTables.js [56], D3.js [57], and Chart.js [58] to provide visualization to users. DataTables.js is a library that we use to create a ranking table of disease-related topics, and D3.js is a library that we use to create word clouds through which users can view multiple words. Chart.js provides the ability to create various charts, and we use its line chart to show the trend of disease-related topics. Fig 2E is the word cloud that helps users to grasp disease-related topics at a glance. Fig 2A provides a comprehensive view of the collected data (from the left: the total number of collected data, the total number of extracted words, the average number of extracted words, the total bytes, the average bytes, and the number of unique words stored in the database). This part shows that news is more redundant than SNS information. The number of data collected for news and SNS was the same; however, the number of news articles remaining after back-end web processing was greatly reduced. Although there was no difference in the number of words registered in the database (despite the difference in the total number of data and in the average number of extracted words), news repeatedly uses the same words, whereas SNSs use a variety of words. Fig 2B shows the top 10 disease-related topics with the highest CCA. With this function, we confirmed that the sources of SNS information were often unclear, whereas the sources of news information were clear. This is because indirect topics consistently ranked highly on SNSs.
For example, on February 7, 2018, news returned "Noro virus" in the top 10. SNSs, however, returned "virus" instead, owing to SNS users mentioning "virus" more often than "Noro virus". Fig 2E is a word cloud created using the top 200 disease-related topics. The higher the CCA, the larger the topic is drawn; thus, users can see what the most important information is. News sources have more words for disease names, whereas SNSs have many words that describe symptoms. This occurs because SNS users generate data based on their personal experiences.

Result
To extract newly emerging disease-related topics quickly, we provide another page called "New Topics", which extracts disease-related topics that newly appear compared to those of the previous day. Its interface is similar to that of the "Topics" page, but the details of Fig 2B, Fig 2D, and Fig 2E are different. Fig 2B lists the topics that are new or that show a sharp rise in the CCA compared with previous data. Fig 2D is a trend graph of new disease-related topics, which shows when a topic became active. Finally, Fig 2E is a word cloud created using these new topics. Through the "New Topics" page, users can quickly check newly emerging disease-related topics.
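The selection rule for the "New Topics" page can be sketched as below. The source does not state how a "sharp rise" is defined, so the ratio threshold of 2.0, the function name, and the toy scores are all assumptions made for illustration.

```python
def new_topics(today, yesterday, rise_ratio=2.0):
    """Return topics that are new today or whose score rose sharply.

    today / yesterday map topic -> ranking score (e.g. a CCA value).
    rise_ratio is a hypothetical threshold for a "sharp rise".
    """
    selected = []
    for topic, score in today.items():
        previous = yesterday.get(topic)
        if score > 0 and (previous is None or score >= rise_ratio * previous):
            selected.append(topic)
    # Show the highest-scoring new topics first.
    return sorted(selected, key=lambda t: today[t], reverse=True)


today = {"norovirus": 8.0, "virus": 5.0, "measles": 3.0}
yesterday = {"norovirus": 2.0, "virus": 4.5}
emerging = new_topics(today, yesterday)
```

With these toy scores, "measles" is selected because it did not appear the previous day, "norovirus" because its score quadrupled, and "virus" is omitted because it barely changed.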

Discussion
We compared the rankings of disease-related topics extracted through the CCA with those of other ranking algorithms. We also show disease-related information on the "Topics" and "New Topics" pages by date to demonstrate the efficacy of our system. Table 3 lists the ranks of disease-related topics for the TF, TF-IDF, TF-IDF (log), SMART, INQUERY, and CCA algorithms. "Egg Pesticide" contamination denotes the incident in which various pesticides in eggs exceeded the regulation standards in Europe and Korea. "Ham Sausage" denotes the incident in which ham and sausage made of German and Dutch pork were the main cause of hepatitis. Finally, "Noro virus" denotes the outbreak that occurred at the Pyeongchang Olympics in Korea. Because all three incidents were reported globally, they are good cases by which to judge the topic extraction performance of CCA. Therefore, we compared the topic extraction rates of five ranking algorithms for both news and SNS sources.
All algorithms except INQUERY are excellent at extracting disease-related information from news sources. However, the ranking algorithms differed significantly in performance on SNSs. CCA ranked all events at around 50th place, whereas the other algorithms often ranked them lower. Among the ranking algorithms apart from CCA, TF-IDF performed best: it captures disease-related topics from SNSs at a higher rank than the others. TF-IDF is an older algorithm; however, it still performs well when extracting disease-related information from SNSs. CCA, on the other hand, performs even better than TF-IDF. For "Egg Pesticide" contamination, TF-IDF ranked the topic 103rd on SNS, whereas CCA ranked it 53rd, an improvement of approximately 50%. Thus, CCA shows excellent extraction rates for disease-related topics from both news and SNS sources.
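One way to read the "approximately 50%" figure, assuming it denotes the relative reduction in rank position between TF-IDF and CCA for "Egg Pesticide" on SNS (the source does not state how the rate is computed), is:

```latex
\[
\frac{103 - 53}{103} = \frac{50}{103} \approx 0.485
\]
```

Under this reading, CCA moves the topic roughly 49% closer to the top of the list than TF-IDF does.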
We use four evaluation metrics to evaluate the performance of DiTeX: the Rand statistic, the Jaccard coefficient, the Fowlkes and Mallows (FM) index [59], and the Odds Ratio [60], which are measurement techniques that can show the accuracy and possibility of capturing the events. In their standard forms, the four metrics are defined as follows:

Rand statistic: $R = \frac{TP + TN}{TP + TN + FP + FN}$
Jaccard coefficient: $J = \frac{TP}{TP + FP + FN}$
FM index: $FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$
Odds Ratio: $OR = \frac{TP \cdot TN}{FP \cdot FN}$

where TP is the number of news and SNS data related to the event that occurred, TN is the number of news and SNS data that are not related to events that have not occurred, FP is the number of news and SNS data related to events that have not occurred, and FN is the number of news and SNS data that are not related to the event that occurred. TP and TN are "good topics", and FP and FN are "bad topics". The Rand statistic measures the similarity of two datasets, such as "good topics" and "bad topics"; it can therefore determine the accuracy with which DiTeX captures topics. The Jaccard coefficient compares the similarity and diversity of datasets; in other words, we can use it to analyze the accuracy with which DiTeX captures "good topics" and "bad topics". The FM index is also a measurement technique that determines the similarity between two datasets. Finally, the Odds Ratio is a measure of association between "good topics" and "bad topics"; that is, it indicates how likely DiTeX is to capture a "good topic" versus a "bad topic". [...] Table 3, in all analyses. Table 4 lists disease-related topics captured by our system; specifically, our system can extract at least two disease-related topics on a monthly basis, capturing both domestic and international disease-related events. The increase in disease-related topics since March 2018 is shown to demonstrate the usefulness of our system through recent disease-related events.
Through the "Topics" and "New Topics" pages, we can increase the rate of information dissemination for various disease-related topics and find other related information. In addition, many of these events are overshadowed in the media by high-profile events such as celebrity and political scandals; however, our system can capture them.
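The four metrics can be computed from the TP, TN, FP, and FN counts as in the sketch below. It assumes the standard textbook definitions: Rand = (TP + TN) / (TP + TN + FP + FN), Jaccard = TP / (TP + FP + FN), FM = sqrt(TP/(TP + FP) × TP/(TP + FN)), and Odds Ratio = (TP × TN) / (FP × FN). The counts in the example are invented for illustration and are not results from DiTeX.

```python
import math


def rand_statistic(tp, tn, fp, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def jaccard_coefficient(tp, fp, fn):
    """True positives over everything except true negatives."""
    return tp / (tp + fp + fn)


def fm_index(tp, fp, fn):
    """Geometric mean of precision and recall (Fowlkes and Mallows)."""
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))


def odds_ratio(tp, tn, fp, fn):
    """Odds of a correct capture relative to an incorrect one.

    Undefined when fp or fn is zero; callers should guard that case.
    """
    return (tp * tn) / (fp * fn)


# Invented counts, for illustration only.
tp, tn, fp, fn = 40, 30, 10, 20
rand = rand_statistic(tp, tn, fp, fn)
jaccard = jaccard_coefficient(tp, fp, fn)
fm = fm_index(tp, fp, fn)
odds = odds_ratio(tp, tn, fp, fn)
```

The first three metrics lie between 0 and 1, with higher values indicating better agreement, while the Odds Ratio is unbounded and exceeds 1 when correct captures dominate.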

Conclusion
DiTeX is the world's first system that extracts important disease-related topics from web-based data. Web-based data are important resources for disease surveillance systems, but they still pose unresolved issues. Online news sources hinder the extraction of disease-related data by repeatedly using major keywords. SNS data are generated by many unspecified users, which makes it very difficult to extract accurate information owing to noise such as typos, repetition of meaningless words, and spam. To solve these problems, we extracted data with minimal duplication using string similarity checks and converted incomplete sentences into complete sentences through open-source NLP to facilitate disease-related topic extraction.
Our system extracts the most important disease-related topics using CCA, an advanced ranking algorithm. Finally, we developed our system for a web environment that is accessible from a variety of platforms. It allows the general public to access and search for important disease-related topics at any time. Moreover, it will be a valuable research resource for public disease specialists because it provides not only current but also long-term information.

Limitations and research agenda
We are addressing three limitations in ongoing work. The first is word ambiguity. A representative example is "virus", which can be used in a biological sense as in "Zika virus", in a computing sense as in "computer virus", and figuratively as in "happy virus" on Twitter [17]. A "happy virus" means a person who makes you smile no matter what. Because the data crawler collects web-based data through search keywords, it is difficult to resolve the ambiguity of "virus". We are addressing this ambiguity through machine learning: we are combining DiTeX with an artificial neural network based on Word2Vec [61], which allows the computer to understand human languages, so that only disease-related data are extracted. The second is multilingual support. DiTeX currently supports only the Korean language, which limits the collection of multi-language data. Therefore, we are expanding the collection range of the data crawler and developing multilingual NLP. Finally, DiTeX extracts disease-related topics by synthesizing the data collected the day before, so it cannot respond immediately when an infectious disease occurs. To solve this problem, we are developing a version of DiTeX that extracts disease-related topics every hour, so that DiTeX will be able to extract disease-related topics in real time. We are working to expand the utility of DiTeX by overcoming these three limitations. We believe that DiTeX can become a system that is used globally and that will be of great help to public health in the near future.