Text mining of 15 million full-text scientific articles

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

[…] which increased from 0.70 to 0.73. Likewise, the Core Full-texts always performed better than Core Abstracts, signifying that some associations are only reported in the main body of the text. Consequently, traditional text mining of abstracts will never be able to find this information.

Inspecting the gain in AUC, we found that it is lower compared to having a document weight (Supplementary Table 2). […] (Fig 3). We found that full-texts have the highest TPR@10%FPR for disease–gene associations (Supplementary Table 3). When considering protein–protein associations and protein–compartment associations, full-texts perform equivalently to Core Abstracts and Core Full-texts. The result was similar when we evaluated the AUC across the full range: removing the document weight has the biggest impact on the full-texts (Supplementary Fig 5).

[…] found that these were often conference proceedings, collections of articles, etc., which were not easily separable without manual curation. We also managed to identify the list of references in the majority of the articles, thereby reducing some repetition of knowledge that could otherwise lead to an increase in the false positive rate.

We have encountered and described a number of problems when working with full-text articles converted from PDF to TXT from a large corpus.
However, the majority of the problems did stem from the PDF-to-TXT conversion and could potentially be solved using a layout-aware conversion tool. […] This may in turn further increase the benchmark results for full-text articles. Nevertheless, the reality is that many articles are not served that way. Consequently, the performance gain we report here should be viewed as a lower limit, as we have sacrificed quality in favor of a larger volume of articles. The solutions we have outlined here will serve as a guideline and baseline for future studies.

The increasing article length may have different underlying causes, but one of the main contributors is most likely increased funding to science worldwide [28,29]. Experiments and protocols are consequently getting increasingly complex and interdisciplinary, aspects that also contribute to driving meaningful publication lengths upward. The increased complexity has also been found to affect the language of the articles, as it is becoming more specialized [30]. It was outside the scope of this paper to go further into socio-economic impact; we have limited this to presenting the trends that could be computed from the meta-data.

Previous papers make, in terms of benchmarking, only qualitative statements about the value of full-text articles as compared to text in abstracts. In one paper, a single statement is made on the potential for extracting information, but no quantitative evidence is presented [31]. In a paper targeting pharmacogenomics, it is similarly stated that there are associations that are only found in the full text, but no quantitative estimates are presented [20]. In a paper analyzing around 20,000 full-text papers, a search for physical protein interactions was made, concluding that these contain considerably higher levels […]

Articles were grouped into four bins, determined from the 25%, 50%, and 75% quantiles, respectively. These were found to be 1–4 pages (0–25%), 5–7 pages (25–50%), 8–10 pages (50–75%), and 11+ pages (75–100%) (Supplementary Fig 2). The assignment of categories used in this study was taken from the existing index for the journal made by the librarians at the DTU Library. For the temporal statistics, the years 1823–1900 were condensed into one.
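The quantile binning of article lengths can be sketched as follows, using the cut-offs of 4, 7, and 10 pages that follow from the reported bins; the function and label names are our own illustration, not part of the study's code:

```python
import bisect

# Upper edges of the first three bins, derived from the 25%, 50%, and
# 75% quantiles reported above.
CUTOFFS = [4, 7, 10]
LABELS = ["1-4 pages", "5-7 pages", "8-10 pages", "11+ pages"]

def page_bin(pages: int) -> str:
    """Map a page count to its quantile-derived bin label."""
    return LABELS[bisect.bisect_left(CUTOFFS, pages)]
```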

Pre-processing of PDF-to-text converted documents
Following the PDF-to-text conversion of the Springer and Elsevier articles, we ran a language detection algorithm implemented in the Python package langdetect v1.0.7 (https://pypi.python.org/pypi/langdetect). We discarded 902,415 articles that were not identified as English. We pre-processed the remaining raw text from the articles as follows:

1. Non-printable characters were removed using the POSIX filter [[:^print:]].
2. A line of text was removed if digits made up more than 10% of the text, symbols made up more than 10% of the text, or lowercase text made up less than 50%. Symbols are anything not matching [0-9A-Za-z].
3. Acknowledgements and reference or bibliography lists were removed using a rule-based system explained below.
4. Text was split into sentences and paragraphs using a rule-based system described below.
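The line filter in step 2 can be sketched as follows. The function name is ours, and we assume whitespace is not counted toward the symbol threshold (otherwise ordinary prose, with roughly 15% spaces, would fail the 10% rule):

```python
import re

def keep_line(line: str) -> bool:
    """Apply the digit/symbol/lowercase thresholds from step 2 above."""
    n = len(line)
    if n == 0:
        return False
    digits = sum(c.isdigit() for c in line)
    # "Symbols" = anything not matching [0-9A-Za-z]; we additionally
    # ignore whitespace here, which is an assumption on our part.
    symbols = len(re.findall(r"[^0-9A-Za-z\s]", line))
    lowercase = sum(c.islower() for c in line)
    if digits / n > 0.10:       # more than 10% digits
        return False
    if symbols / n > 0.10:      # more than 10% symbols
        return False
    if lowercase / n < 0.50:    # less than 50% lowercase
        return False
    return True
```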

We assumed that acknowledgements and reference lists are always at the end of the article. Upon encountering any of the terms "acknowledgement", "bibliography", "literature cited", "literature", or "references", or any of the following misspellings thereof: "refirences", "literatur", "références", or "referesces", the remainder of the article was removed. In some cases the articles had no heading indicating the start of a bibliography. We tried to take these cases into account by constructing a RegEx that matches the typical way of listing references (e.g. […]). Keywords were identified based on several rounds of manual inspection. In each round, 100 articles in which the reference list had not been found were randomly selected and inspected. We were unable to find references in 286,287 Springer and 2,896,144 Elsevier articles, respectively. Manual inspection of 100 randomly selected articles revealed that these articles indeed did not have a reference list, or that the pattern was not easily describable with simple metrics such as keywords and RegEx. Articles without references were not discarded.
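The heading-based part of this rule can be sketched as below. The heading list is taken from the text; the function name and the exact-match pattern (a heading alone on its line) are our own assumptions:

```python
import re

# Headings (and reported misspellings) that mark the start of a
# reference list or acknowledgements section.
HEADINGS = re.compile(
    r"^\s*(acknowledgements?|bibliography|literature cited|literature|"
    r"references|refirences|literatur|références|referesces)\s*$",
    re.IGNORECASE,
)

def strip_references(lines):
    """Drop everything from the first matching heading to the end,
    assuming reference lists always sit at the end of the article."""
    for i, line in enumerate(lines):
        if HEADINGS.match(line):
            return lines[:i]
    return lines  # no heading found; keep the article intact
```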

The PDF-to-text conversion often breaks up paragraphs and sentences, due to a new page, new column, etc. Paragraph and sentence splitting was therefore performed using a rule-based system. If the previous line of text does not end with ".", "!", or "?", and the current line does not start with a lower-case letter, it is assumed that the line is part of the previous sentence. Otherwise, the line of text is assumed to be a new sentence.
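A minimal sketch of this joining rule, implemented exactly as stated (the function name and blank-line handling are our own additions):

```python
def rejoin_lines(lines):
    """Re-join lines broken by the PDF-to-text conversion.

    A line is appended to the previous sentence when the previous line
    does not end in ".", "!", or "?" and the current line does not start
    with a lower-case letter; otherwise it starts a new sentence.
    """
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines left by the conversion
        if out and not out[-1].endswith((".", "!", "?")) and not line[:1].islower():
            out[-1] += " " + line  # continuation of the previous sentence
        else:
            out.append(line)       # start of a new sentence
    return out
```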