Neural networks for open and closed Literature-based Discovery.

Literature-based Discovery (LBD) aims to discover new knowledge automatically from large collections of literature. Scientific literature is growing at an exponential rate, making it difficult for researchers to stay current in their discipline and easy to miss knowledge necessary to advance their research. LBD can facilitate hypothesis testing and generation and thus accelerate scientific progress. Neural networks have demonstrated improved performance on LBD-related tasks but have yet to be applied to LBD itself. We propose four graph-based, neural network methods to perform open and closed LBD. We compared our methods with those used by the state-of-the-art LION LBD system on the same evaluations to replicate recently published findings in cancer biology. We also applied them to a time-sliced dataset of human-curated, peer-reviewed biological interactions. These evaluations and the metrics they employ represent performance on real-world knowledge advances and are thus robust indicators of approach efficacy. In the first set of experiments, our best methods performed 2-4 times better than the baselines in closed discovery and 2-3 times better in open discovery. In the second, our best methods performed almost 2 times better than the baselines in open discovery. These results strongly indicate that neural LBD is potentially a very effective approach for generating new scientific discoveries from existing literature. The code for our models and other information can be found at: https://github.com/cambridgeltl/nn_for_LBD.


Node combination methods
A neural network approach to LBD with node embeddings requires the model input to be a single vector, so the embeddings of the two nodes involved in a link need to be combined. This can be done in several ways. Concatenating the embeddings is simple and preserves all information, but it doubles the size of the input. Grover and Leskovec (2016) used four methods that preserve the input size, and we experimented with all five methods, detailed in Table 1.
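As a concrete illustration, the five combination operators can be implemented directly over embedding vectors. The sketch below is ours, not the released code; the function name and dimensionality are illustrative, and the four size-preserving operators follow the standard definitions of Grover and Leskovec (2016).

```python
import numpy as np

def combine(a: np.ndarray, b: np.ndarray, method: str) -> np.ndarray:
    """Combine two node embeddings into a single model input vector.

    Concatenation doubles the input size; the remaining four operators
    preserve it.
    """
    if method == "concatenate":
        return np.concatenate([a, b])   # [a; b], 2d dimensions
    if method == "average":
        return (a + b) / 2.0            # element-wise mean
    if method == "hadamard":
        return a * b                    # element-wise product
    if method == "weighted_l1":
        return np.abs(a - b)            # element-wise |a - b|
    if method == "weighted_l2":
        return (a - b) ** 2             # element-wise (a - b)^2
    raise ValueError(f"unknown combination method: {method}")

# Example: combine the embeddings of the two nodes in a candidate link.
a, b = np.random.rand(128), np.random.rand(128)
x = combine(a, b, "hadamard")  # 128-dimensional model input
```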

The LION Test Cases and Evaluation
These cases are described in detail by Pyysalo et al. (2018); a condensed version is presented here for completeness.
To identify discoveries, the cancer researchers involved in the project first surveyed articles published between 2006 and 2016 in journals that publish work on the molecular biology of cancer, such as Science, Nature, The Lancet, British Journal of Cancer, and Cell. In this initial pass, they sought to identify specific cancer-related discoveries that can be characterized as a causal chain of three concepts, i.e. that fit the constraints of the traditional ABC paradigm of LBD. This initial literature survey yielded 50 candidate discoveries. The second stage filtered the candidates to identify discoveries that could have potentially been found by LBD: the two connections A-B and B-C should be found in the literature at some point in time before the connection between A and C is published. They identified cases where, in some year in the past, A-B and B-C each co-occurred in at least 100 publications but no or very few publications had A and C co-occur. To avoid possible bias towards particular NLP methods or LBD tools, the filtering was performed manually using PubMed searches. In this manner the 50 candidates were narrowed to 16, which were then assessed by all project participants. This yielded a final set of 5 triples representing specific recent discoveries on the molecular biology of cancer that could potentially have been suggested by an LBD system prior to their publication. The ontology and database identifiers in the relevant resources were manually identified for each concept in the dataset. In addition to these 5 cancer cases, and to maintain continuity with prior work, 5 cases from Swanson were also evaluated by the system. Details of these can be found in Table 2, which is adapted from Pyysalo et al. (2018).
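For clarity, the co-occurrence criterion used in the second filtering stage can be expressed programmatically. The sketch below is purely illustrative; the actual filtering was performed manually via PubMed searches, and the data structure of per-year co-occurrence counts assumed here is ours.

```python
def could_have_been_discovered(cooc, a, b, c, min_support=100, max_ac=0):
    """Return the earliest year satisfying the ABC filtering criterion, or None.

    `cooc[year][(x, y)]` is assumed to hold the number of publications up to
    `year` in which concepts x and y co-occur (pairs stored unordered).
    """
    pair = lambda x, y: tuple(sorted((x, y)))
    for year in sorted(cooc):
        counts = cooc[year]
        ab = counts.get(pair(a, b), 0)
        bc = counts.get(pair(b, c), 0)
        ac = counts.get(pair(a, c), 0)
        # A-B and B-C must both be well established while A-C is (nearly) absent.
        if ab >= min_support and bc >= min_support and ac <= max_ac:
            return year
    return None
```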

Results
For the neural approaches, the reported results are means of the mean ranks calculated over 5 runs, and the reported standard deviations are those of the mean ranks. For the baselines, the reported results are the mean rank of each method across all relevant cases, and the standard deviations are computed over those ranks. The best rank is in boldface type. We sought to determine which methods gave the lowest mean ranks and the lowest variance (measured by standard deviation). Where necessary, we use results from Pyysalo et al. (2018).
Where a model does not use aggregators or accumulators, its results are simply placed in the first column; this is merely for convenience, as the column headers do not apply to such models. The best result for a particular approach is underlined, while the best across all approaches is in bold.
Some experiments produced so many ties with the gold answer that they would be useless for real-world use. We set this threshold at 10: methods that produced more than 10 ties with the gold are reported with a '*' instead of their performance.
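The aggregation described above can be summarised as in the sketch below. It is not the evaluation script itself: the input layout (per-run gold ranks and per-run tie counts) is an assumption made for illustration.

```python
import statistics

def aggregate_neural(run_ranks, tie_counts, max_ties=10):
    """Aggregate neural-method results over multiple runs.

    `run_ranks[r]` is assumed to hold the gold ranks for each evaluation case
    in run r; `tie_counts[r]` is the number of ties with the gold in run r.
    """
    # Methods that tie with the gold answer too often are reported as '*'.
    if any(t > max_ties for t in tie_counts):
        return "*"
    # Mean rank per run, then mean and standard deviation over those means.
    per_run_means = [statistics.mean(ranks) for ranks in run_ranks]
    return statistics.mean(per_run_means), statistics.stdev(per_run_means)

# Example: 5 runs over 5 cases.
ranks = [[12, 3, 40, 7, 22], [10, 5, 38, 9, 25],
         [11, 4, 41, 6, 20], [13, 3, 39, 8, 23], [12, 5, 42, 7, 21]]
print(aggregate_neural(ranks, tie_counts=[0, 1, 0, 0, 2]))
```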

Cancer Discoveries and Swanson Cases
Results for Closed Discovery on the 5 cancer discovery cases on which LION was originally evaluated are in Tables 3 and 4. Results for Open Discovery on the same 5 cancer discovery cases, as reported in the LION paper, follow: means are in Table 5 and medians in Table 6.
Results for Open Discovery on the 5 Swanson cases on which LION was evaluated are given with means in Table 7 and medians in Table 8.
Results for Open Discovery on the combined 5 cancer and 5 Swanson cases on which LION was evaluated are given with means in Table 9 and medians in Table 10. Due to rounding, some scores appear equal in the tables but are not; where this occurs and involves a best performer, the unrounded number was used to break the tie.

Additional Analyses
The existing approaches achieved much better mean ranks for open discovery than for closed discovery, so there was more room for improvement in the latter. This lower baseline explains to some degree why the performance improvements were more pronounced for closed discovery (Table 3). The best performers by mean and by median were different. A conclusion to be drawn from all the results tables is that, although the best neural network-based approaches performed best overall, simply using neural networks is not sufficient to produce the best results: there are several instances where the best existing approaches outperformed some neural approaches.