Using semantics to scale up evidence-based chemical risk-assessments

Background The manual processes used for risk assessments are not scaling to the amount of data available. Although automated approaches appear promising, they must be transparent in a public policy setting. Objective Our goal is to create an automated approach that moves beyond retrieval to the extraction step of the information synthesis process, where evidence is characterized as supporting, refuting, or neutral with respect to a given outcome. Methods We combine knowledge resources and natural language processing to resolve coordinated ellipses and thus avoid surface level differences between concepts in an ontology and outcomes in an abstract. As with a systematic review, the search criterion and the inclusion and exclusion criteria are explicit. Results The system scales to 482K abstracts on 27 chemicals. Results for three endpoints that are critical for cancer risk assessments show that refuting evidence (where the outcome decreased) was higher for cell proliferation (45.9%) and general cell changes (37.7%) than for cell death (25.0%). Moreover, cell death was the only endpoint where supporting claims were the majority (61.3%). If the number of abstracts that measure an outcome was used as a proxy for association, there would be a stronger association with cell proliferation than cell death (20/27 chemicals). However, if the amount of supporting evidence was used (where the outcome increased), the conclusion would change for 21/27 chemicals (20 from proliferation to death and 1 from death to proliferation). Conclusions We provide decision makers with a visual representation of supporting, neutral, and refuting evidence whilst maintaining the reproducibility and transparency needed for public policy. Our findings show that results from the retrieval step, where the number of abstracts that measure an outcome is reported, can be misleading if not accompanied by results from the extraction step, where the directionality of the outcome is established.


Introduction
The current methods used to conduct chemical risk assessments do not scale to recent regulatory changes such as the European Union's REACH initiative, which dramatically increases the number of chemicals to be assessed, and the US EPA's trend towards cumulative risk assessments that consider multiple chemicals or combinations of chemical and non-chemical stressors. Manual processes used to synthesize evidence include four steps: retrieval, extraction, verification, and analysis [1]. Systematic reviews must include the search criterion, the databases searched and the search terms used [2,3], and the inclusion and exclusion criteria to make the review scope clear to a reader, enable others to replicate or extend the work, and instill trust by making it difficult to cherry-pick results. Automated systems for risk assessment have been developed to accelerate the retrieval step of the information synthesis process [4,5]; however, systems that employ black-box machine learning are not ideal in a public policy setting because it can be unclear why an abstract has been retrieved and because most system reports do not provide explicit inclusion and exclusion criteria.
Automated systems that identify relevant studies should not be confused with systems that extract, verify, and analyze the results from those studies. The latter steps in the review process differentiate studies that find an effect, from studies that find no-effect, and studies that refute the hypothesis that there is an effect. For example, an abstract that states "Furthermore, evidence is presented that AHTN is not genotoxic, does not induce peroxisome proliferation" [6] was labeled as indirect genotoxic peroxisome proliferation [4,5], but this study refutes the hypothesis that the chemical was genotoxic and induces peroxisome proliferation. The authors of the previous work are clear that the system "does not exclude abstracts with no-effect results" [7], but this lack of differentiation between studies that find an association from those that do not is a major gap between automated approaches and manual efforts that attempt to include all evidence and then quantify the amount of supporting and refuting evidence. Thus stating that "a significant difference can be seen with higher numbers of abstracts for melanoma, reflecting the existing knowledge about the metastatic potential of melanoma" [8] conflates the number of abstracts that measure an outcome with those that find an association with melanoma because abstracts that do not find an association are not removed from the total number of abstracts retrieved. Similarly, consider the stated objective of a text mining approach to create a blood exposome database: "We aimed to generate a comprehensive blood exposome database of endogenous and exogenous chemicals associated with the mammalian circulating system through text mining and database fusion." [9]. This could be easily misinterpreted as the number of chemicals that are positively associated, but the system does not analyze the direction of the association, but rather just retrieves studies that mention the mammalian circulating system.
Our goal is to bridge the gap between manual and automated approaches to systematically review literature on chemicals for risk assessments so that assessments can consider a greater number of chemicals and exposure sources. Such tools would reduce the time needed to conduct or reassess an individual or cumulative risk assessment as new information becomes available. Further, such tools can be used by scientific researchers to help focus their research questions based on existing literature.
In keeping with the definition of a systematic review, the search criterion is explicit and definitions of each target outcome are provided. The proposed knowledge-based approach is consistent with manual processes that include explicit inclusion and exclusion criteria. Synonyms from the Unified Medical Language System (UMLS) are collected and abstracts are preprocessed using natural language processing (NLP) to manage coordinated ellipses [10] and so avoid mismatches between the terms used and the text descriptions. For example, the NLP preprocessing identifies cell proliferation from sentence 1 and necrotic cell death from sentence 2, which would otherwise have been missed because the words in each of these phrases do not appear consecutively in the text. Once each target outcome is identified, explicit and observational claims from the Claim Framework [11] are used to characterize the evidence as supporting, where there is an increase (e.g. sentence 1), or neutral, where a change is reported but the directionality of the change is not provided (e.g. sentence 2). Refuting evidence, where a decrease is reported, and negation are also identified for all claims, such as in sentence 3, which contains negated refuting evidence. Lastly, the spectrum of evidence from refuting to neutral to supporting is shown in a waffle plot that provides decision makers with both the amount of literature available and the directionality of that evidence. We demonstrate the benefit of moving from the retrieval to the extraction and analysis stages of the systematic review process using a set of 27 xenobiotic chemicals that are relevant to human exposure paradigms, known carcinogens, known endocrine disrupting chemicals, and/or known toxicants [12][13][14][15].
These chemicals are typical of those used by toxicologists and other scientists when conducting research on chemical exposures and cancer outcomes to guide research questions and experimental designs as well as to develop risk/benefit assessments. The target outcomes of cell death and cell proliferation were selected because together they indicate one of the hallmarks of cancer [16], because many of the selected chemicals are known endocrine disruptors and/or toxicants, and because studies indicate that endocrine disrupting chemicals often exert their toxic effects by interfering with proliferation and/or cell death.

Related work
The target entities in this paper (cell death, cell proliferation and cells) appear in 21, 33 and 97 biomedical vocabularies respectively (https://bioportal.bioontology.org/). For example, the Gene Ontology (GO) [17] captures cell death and cell proliferation as biological processes and cells as a cellular component. The entities also appear in labeled text collections that were created to drive the development of automated information extraction tools, such as the shared tasks on Gene Regulation Ontology (GRO) [18] and the Cancer Genetics task [19] that included cell death and cell proliferation. The GENIA collection comprises 2,000 abstracts on transcription factors for human blood cells and includes manual annotations for cell, cell types, cell components, and cell lines [20], and the CRAFT corpus comprises 97 full text articles on mouse genes [21] and includes manual annotations for multiple ontologies that capture our target entities. In contrast to the biomedical search terms for existing text corpora, we provide the exact search string used to select the abstracts for each of the 27 chemicals that form our collection of 482,314 abstracts.
Many systems detect biomedical entities automatically (see [22] for a review). Such systems employ a knowledge-based approach such as MetaMap [23], where the system searches for expressions from an existing vocabulary (e.g. the exact phrase 'cell death'), or a machine learning approach where a model is induced from training data. Automated approaches can be further characterized into those that employ traditional machine learning algorithms such as Naïve Bayes classifiers or Support Vector Machines [24] and those that use neural networks, such as deep learning. In this work we extend the knowledge-based approach by using natural language processing to overcome surface level differences between the concept representations used by authors and how concepts are captured in a knowledge resource. Deep learning is also used to classify result or conclusion sentences to avoid including an author's motivation or stated hypothesis with the outcomes of a study.
This work also relates to argumentation in biomedicine such as the Claim Framework [11] that captures how scientists who conduct empirical research report their findings. The Framework was developed by analyzing full text articles and comprises five types of claims: explicit, implicit, comparison, associations and correlations. Explicit claims are the most prevalent and require that a sentence include two entities (an entity that has been changed and an entity that is responsible for the change), and how the first entity changes the second, such as in the sentence 'The [CaN inhibitor cyclosporine A (CsA)]_entity1 [reduced]_change [cell growth]_entity2.'. Observations are also included in this analysis where authors report a changed entity but do not include the entity that was responsible for the change.
Explicit claims are equivalent to the causal claim in [25]. The post error analysis found that the text needed to address a query appeared at the end of the abstract, which suggests that sentences were likely from the result and conclusion sections. In contrast to argumentation systems that strive to identify major claims [25] or to differentiate between major and minor claims, which has been shown to have low inter-rater reliability [26], we make no judgments regarding the veracity of a claim. Instead, directionality and negation of each claim are shown to the decision maker as six discrete steps from refuting to negated refute, neutral, negated neutral, negated support, and finally to supporting evidence.
Other work that has contrasted supporting and refuting evidence has framed the task as identifying contradictions [25,27]; however, neutral changes such as 'Results showed a change in cell death' do not fit into this framework. Moreover, 'There was no significant increase in cell proliferation' could mean that there was no change or that there was an increase that did not reach statistical significance (or even that there was a decrease, although that is arguably less likely). Several of the examples from the contradiction papers may be better represented as a comparison claim [28][29][30][31] that uses a ternary relationship (rather than the binary relationship in an explicit claim) that captures at least two entities that are being compared, and the measure that was used in the comparison. For example, Cd-induced apoptosis was used to compare the cells in the gradable comparison sentence '[Cd-induced apoptosis]_outcome was [higher]_gradable in [GSK-3beta-knockdown cells]_entity1 than in [normal cells]_entity2'.
The claims reported here are also similar to manually constructed networks that capture statistically significant relationships between genes and proteins and cell proliferation [32] or cell death [33]. In contrast to that work, we do not constrain the relationships to only those that are statistically significant. The results for the 27 chemicals analyzed here show a high level of disagreement reported in the literature that can be seen clearly in the waffle plots but would be very difficult to discern from a dense 'hair ball' network graph.
Both entity and argumentation efforts have discussed elliptical coordinated compound noun phrases (CCNPs), where an author saves space by not repeating words; for example, an author will use the backward ellipsis T or B cells rather than T cells or B cells. Without dealing with CCNPs, the system would fail to capture T cells. Annotators who constructed the GENIA corpus could mark the non-consecutive text, but most argumentation systems do not deal with this phenomenon. For example, in [26], liver and cardiac toxicities was marked as 1 entity, which resulted in low inter-annotator agreement because some entities in lists such as arthralgia/myalgia were separated, but CCNPs were not. In [25], the coordination issue was avoided by asking annotators to identify the entire sentence that supported or contradicted a query. For example, the sentence "Among older adults, consumption of tuna or other broiled or baked fish, but not fried fish is associated with lower incidence of CHF" was identified as causal, but in the Claim Framework this would be characterized as an association (not an explicit, i.e. causal, claim) and broken into 4 separate associations: between tuna and CHF, broiled fish and CHF, baked fish and CHF, and the negated association between fried fish and CHF. We process CCNPs using the approach in [10], where syntax from the Stanford Dependency grammar [34] is used to identify candidate forward (e.g. cell death and proliferation), backward (e.g. T and B cells) and complex (e.g. normal human and animal cells) ellipses. A semantic strategy is then employed that uses rules (e.g. if a word appears multiple times when expanded, the candidate phrase is not included) and heuristics (the number of times that a modifier is used with a head noun) to establish which noun phrases should be expanded.
Experiments with 21,280 full text articles showed that more than 1 million noun phrases were impacted by coordinated ellipses and that 10.79% of all noun phrases would be missed if coordination were not addressed. The approach achieved 73% precision, 75% recall, 74% F-score and 95% accuracy for new noun phrases. Precision was higher for backward (82.62 vs. 78.63) and forward (64.82 vs. 60.17) expansions of coordinated noun phrases and lower for complex expansions (63.41 vs. 72.59).
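To make the forward and backward cases concrete, the following is a minimal sketch and not the method in [10]: it substitutes fixed regular expressions for the Stanford dependency syntax, handles only three-token coordinations, omits complex ellipses and the modifier-frequency heuristic, and leaves the choice between forward and backward expansion to the caller. The function names are hypothetical. It does apply the stated rule that a candidate with a repeated word is discarded.

```python
import re

WORD = r"[\w-]+"  # a token, allowing hyphens as in "p53-effective"


def drop_repeats(candidates):
    """Rule from the text: discard a candidate in which a word repeats."""
    return [c for c in candidates if len(set(c.split())) == len(c.split())]


def expand_backward(phrase):
    """Shared head noun at the end: 'T and B cells' -> 'T cells', 'B cells'."""
    m = re.fullmatch(rf"({WORD}) (?:and|or) ({WORD}) ({WORD})", phrase)
    if m is None:
        return [phrase]  # pattern does not apply; leave the phrase unchanged
    first, second, head = m.groups()
    return drop_repeats([f"{first} {head}", f"{second} {head}"])


def expand_forward(phrase):
    """Shared leading word: 'cell death and proliferation' -> two phrases."""
    m = re.fullmatch(rf"({WORD}) ({WORD}) (?:and|or) ({WORD})", phrase)
    if m is None:
        return [phrase]
    modifier, first, second = m.groups()
    return drop_repeats([f"{modifier} {first}", f"{modifier} {second}"])


print(expand_backward("T and B cells"))
print(expand_forward("cell death and proliferation"))
```

A dependency parse is needed in practice because surface patterns such as these cannot tell which words are shared across the conjunction in longer noun phrases.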

Search strategy
The paper captures evidence about cell changes, death and proliferation associated with 27 chemicals with known genotoxic or non-genotoxic modes of action [5] (see Table 4). Each chemical name along with the synonyms produced by PubChem were reviewed by an expert (JS) who searched for synonyms and reviewed the published literature in PubMed, references listed in identified manuscripts, and textbooks to ensure that the terms in PubChem were relevant. The PubMed search was conducted in July 2019 (see Fig 1 for additional constraints and S1 Appendix for the actual search strings used).

Text preprocessing
The XML PubMed Baseline Repository (updated December 2018) was processed on an AWS server along with daily update files to July 14th, 2019 (the last file was pubmed19n1318). Markup tags in the XML were used to identify the Background, Objectives, Methods, Results and Conclusions sections from structured abstracts, and any remaining markup tags were removed using JSoup [35]. Abstracts available in English were processed using the LingPipe biomedical text class [36] with additional abbreviations that occur frequently in the biomedical literature. After processing, non-ASCII characters in the extended ASCII set were replaced with ASCII approximations, for example by removing tildes and carons. MEDLINE abstracts can include itemized lists such as "Four categories represented a positive correlation: (1) increasing abnormal CEA with progressing disease, (2) decreasing abnormal CEA with disease regression, (3) unchanged abnormal CEA with stable disease, (4) change from normal to abnormal CEA with progressive disease." (PMID 982100) that can interfere with the dependency parse. Thus, sentences were further processed to convert lists depicted with (a), (b) and (1), (2) into their constituent parts in order to improve the quality of subsequent parsing. In the example above, the preamble "Four categories represented a positive correlation" would be the first sentence and each of the constituent list items would become a separate sentence. Lastly, dependencies were generated using the Stanford parser version 1.9.2 [34].
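The list conversion can be illustrated with a short sketch. It is not the system's implementation: the marker pattern, the two-marker threshold, and the function name are assumptions chosen for illustration, and a production version would need to guard against parenthesized single letters that are not list markers.

```python
import re

# Assumed marker shapes: "(1)", "(2)", ... or "(a)", "(b)", ...
ITEM_MARKER = re.compile(r"\(\s*(?:\d+|[a-z])\s*\)\s*")


def split_itemized_sentence(sentence):
    """Return the preamble and each list item as separate sentences."""
    parts = ITEM_MARKER.split(sentence)
    if len(parts) < 3:  # fewer than two markers: treat as an ordinary sentence
        return [sentence]
    preamble = parts[0].strip().rstrip(":").strip()
    items = [p.strip().rstrip(",;.").strip() for p in parts[1:]]
    return [s for s in [preamble] + items if s]


example = (
    "Four categories represented a positive correlation: "
    "(1) increasing abnormal CEA with progressing disease, "
    "(2) decreasing abnormal CEA with disease regression, "
    "(3) unchanged abnormal CEA with stable disease, "
    "(4) change from normal to abnormal CEA with progressive disease."
)
for part in split_itemized_sentence(example):
    print(part)
```

Each returned string can then be parsed on its own, so a long enumeration no longer has to be handled as one dependency tree.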
During the preprocessing the system identifies and resolves elliptical coordinated noun phrases using the process described in [10] (see related work). For example, sentence 4 mentions two cell types, p53-effective cells and p53-defective cells; however, the word "cells" appears only with the second of these noun phrases. Without attending to elliptical coordinated noun phrases, the system would detect that p53-defective cells had been induced but would neglect to capture that p53-effective cells were also induced.
4. Example sentence with coordinated ellipses: The p53 transactivation target Gadd45alpha was induced in both p53-effective and p53-defective cells after 4 h cadmium treatment, and this was associated with an acute inhibition of mitosis. (PMID 17174997)
a. p53-effective cells
b. p53-defective cells
Of the 482,101 abstracts retrieved using the search strategy, only 76,587 (15.89%) provide section headings, and of the structured abstracts most (69,901, 91.27%) include a result or conclusion section. BioBERT, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) [37] model that was trained using biomedical text [38], was used to provide embeddings. The model was trained on the structured abstracts to predict result or conclusion sentences in the unstructured abstracts. The model performed well on structured abstracts (accuracy 0.9363, F1 0.9396, precision 0.9464, recall 0.9329) and on a set of 560 manually annotated unstructured abstracts comprising 4,793 sentences that were assessed by 3 annotators (accuracy 0.9404, F1 0.9561, precision 0.9525, and recall 0.9597).
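As a consistency check on the reported scores, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the precision and recall values given above; the dictionary keys are illustrative labels only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the text above
reported = {
    "structured": (0.9464, 0.9329, 0.9396),
    "unstructured": (0.9525, 0.9597, 0.9561),
}

for name, (precision, recall, reported_f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 {computed:.4f}, reported F1 {reported_f1:.4f}")
```

Both recomputed values agree with the reported F1 to four decimal places.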

Target outcomes
We introduce a semantics approach that combines knowledge resources and natural language processing to overcome surface level differences between the way that knowledge is represented in a formal ontology and how authors discuss those concepts in abstracts.

PLOS ONE
In this paper, two primary outcomes, cell proliferation and cell death, capture critical points in the cell life cycle and the underlying mechanisms associated with cancer. A broad definition of proliferation would include explicit mentions of cell proliferation along with any genes, processes, biomarkers, and assays [39] that are involved in the process. Although genes are common markers of cell proliferation [40], the National Cancer Institute's definition of cell proliferation is used in this paper, which is "An increase in the number of cells as a result of cell growth and cell division" (see Fig 2). Thus, the mitosis step of the cell life cycle is within scope, but changes within the cell such as DNA replication are out of scope (DNA replication is also considered separately in [5]), as are changes in enzymes (most notably changes in peroxisome proliferation) and tumor changes that do not refer to cells. Abstracts that include proliferation indexes and mitotic markers are also detected and included as a cell proliferation target outcome. The second target outcome of cell death includes direct mentions of cell death along with necrotic and apoptotic expressions. The secondary outcome in this study captures any mention of cell changes that are not cell death or cell proliferation. This is essentially a less specific reference to the target primary outcomes.
The semantics approach we propose combines knowledge resources with a natural language processing method that attends to ellipses. First, expressions for the target concepts (in this paper cell, cell death, and cell proliferation) are drawn from the Unified Medical Language System (UMLS) and online thesauri. The UMLS organizes knowledge as concepts (each identified using a Concept Unique Identifier (CUI)) that unify expressions from hundreds of different medical ontologies to improve the system recall (i.e. so that entities of interest are not missed). For example, the CUI for Cell Death is C0007587 and includes apoptosis, in which cells that are no longer needed die, and necrosis, where the cells die due to injury. The MeSH taxonomy (one of the resources within the UMLS) includes the more general concept of Regulated Cell Death and the narrower concept of anoikis, a form of programmed cell death (see Fig 3). Thus, a specific ontology (in this case MeSH) can enable a user to crisply define the scope of their target outcome measures. Fig 3 shows only MeSH, but the UMLS online browser (https://uts.nlm.nih.gov/uts/umls/home), which was used to identify the expressions in this paper, includes multiple ontology and thesauri resources.
Each CUI in the UMLS is assigned 1 or more of the 134 semantic types that capture categories of concepts that can identify additional concepts that are broader or narrower than the initial target outcome. Cell is both a concept name and a semantic type (see Fig 4, which shows Cell as a semantic type). As with concepts, the UMLS enables a user to add additional expressions by exploring more general semantic types such as Fully Formed Anatomical structures to sharply define the target outcome and ensure good coverage. In a knowledge-based approach, terms from the knowledge resource are compared directly with text in the abstract. However, the scientific literature often includes modifiers that are not mentioned explicitly in an ontology, and surface level differences such as coordinated ellipses mean that an exact match strategy would miss target outcome expressions. Two strategies were used to overcome this issue. First, natural language processing is used to attend to coordinated ellipses (see text preprocessing). Second, the primary target outcomes were characterized as either single or multi-word expressions. Any phrases containing a single word expression such as angioproliferate and apoptosis were included in the set of target outcomes. Multi-word expressions were deconstructed into <cell> <proliferation> and <cell> <death>, and then combinations of words capturing synonyms of cell, proliferation, and death were combined to form the target outcomes. The UMLS was searched using the online UMLS browser, and online dictionaries and thesauri were consulted. All terms were verified by the domain expert (JF) who augmented her expertise with searches in PubMed, reference reviews and textbooks to produce a dictionary of cell terms. Very few additional terms were added during the manual step of this process. Similarly, a set of synonyms for proliferation and death were identified.
Lastly, phrases that included at least 1 cell term and either 1 proliferation or 1 death term were included as target outcomes. Phrases that included 'pathway' or 'pathways', or a word that started with 'factor%' or 'peroxisom%', were removed to satisfy the exclusion criteria and avoid including tumor necrosis factor (TNF) and phrases involving peroxisomes.
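The inclusion and exclusion rules for candidate phrases can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: the term sets are small hypothetical stand-ins for the UMLS-derived dictionaries, matching is on whitespace-separated lower-cased words, and a phrase containing both a proliferation and a death term is simply assigned to proliferation.

```python
# Hypothetical, abbreviated dictionaries; the real ones come from the UMLS
# and were verified by a domain expert.
CELL_TERMS = {"cell", "cells", "lymphocyte", "lymphocytes"}
PROLIFERATION_TERMS = {"proliferation", "growth", "division"}
DEATH_TERMS = {"death", "apoptosis", "necrosis"}
SINGLE_WORD_OUTCOMES = {
    "apoptosis": "cell death",
    "angioproliferate": "cell proliferation",
}

# Exclusion criteria stated in the text.
EXCLUDED_WORDS = {"pathway", "pathways"}
EXCLUDED_PREFIXES = ("factor", "peroxisom")


def classify_phrase(phrase):
    """Return 'cell proliferation', 'cell death', or None for a noun phrase."""
    words = phrase.lower().split()
    if any(w in EXCLUDED_WORDS or w.startswith(EXCLUDED_PREFIXES) for w in words):
        return None
    for w in words:  # single word expressions qualify on their own
        if w in SINGLE_WORD_OUTCOMES:
            return SINGLE_WORD_OUTCOMES[w]
    if any(w in CELL_TERMS for w in words):
        if any(w in PROLIFERATION_TERMS for w in words):
            return "cell proliferation"
        if any(w in DEATH_TERMS for w in words):
            return "cell death"
    return None


for phrase in ["necrotic cell death", "human lymphocyte proliferation",
               "tumor necrosis factor", "peroxisome proliferation"]:
    print(phrase, "->", classify_phrase(phrase))
```

Because the words need not be adjacent, a phrase with intervening modifiers still matches, which is the property that lets the approach tolerate modifiers absent from the knowledge base.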

In addition to noun phrases, the primary target outcome entities can be expressed as a prepositional phrase, such as proliferation of cultured gastric cancer cells, which is illustrated in sentence 5. For these cases, the Claim Framework was used to detect changes where an increase in cells was captured as cell proliferation and a decrease in cells was captured as cell death.
5. Example sentence where the outcome is a prepositional phrase: Enzastaurin suppressed the proliferation of cultured gastric cancer cells and the growth of gastric carcinoma xenografts. (PMID 18339873)

Extracting claims
The Claim Framework captures how scientists communicate results and comprises five types of claims: explicit, implicit, comparisons, associations, and observations [11]. An explicit claim is the most frequent claim type used in full text articles and requires two entities (an entity that has been changed and an entity that is responsible for the change), along with a change term that captures how the first entity changes the second. The analysis reported here considers only the entity that has been changed, where the entity is constrained to cell death, cell proliferation, and cell changes that are not death or proliferation. An observation claim reports how an entity has changed but does not provide information about what was responsible for the change in the same sentence, such as in 'Results show a statistically significant increase in cell proliferation'. Both semantics and syntax are used to identify claims automatically. The semantics in the initial system to detect explicit claims [11] uses a set of anchor terms comprising 174 directionality verbs (55 indicating an increase and 74 indicating a decrease) and 208 change verbs from the TREC collection [42]. An evaluation using abstracts from [5] resulted in updates to the initial system, which now uses 215 directionality terms (58 increasing, 86 decreasing, and 71 general change verbs) and 235 causality verbs. As with the initial version, the base form of each verb is expanded to capture all tenses and nominalized forms before being compared with the abstract text. Observational claims, where an author does not specify the entity responsible for the change, had not been previously implemented. The system now detects observations using the same syntax and semantics as for the explicit claims, but where the entity responsible for the change is not identified.
The preprocessing step that reconciles coordinated ellipses has also been added to the system so that, for example, the neutral change to human lymphocyte proliferation is captured from "Inorganic arsenic effects on human lymphocyte stimulation and proliferation" and the negated neutral change on proliferation and apoptosis is captured from "It had no effect on proliferation, apoptosis, or differentiation".
With respect to syntax, a set of rules was constructed that connects each anchor term through dependency paths from the Stanford parser to the target outcomes. In addition to a direct connection between an anchor and a target outcome (e.g. increases cell death), the system captures connections through a prepositional phrase (e.g. induction of apoptosis), and through measurement terms (e.g. the amount of cell death).
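The three connection types can be sketched over hand-written dependency triples. Everything below is an assumption made for illustration: the triples are written by hand in the style of collapsed Stanford dependencies (relation, governor, dependent) rather than produced by a parser, the term sets are tiny hypothetical subsets, words are matched by string rather than by token position, and the third example phrase is constructed here to show an anchor reaching an outcome through a measurement term.

```python
# Hypothetical, abbreviated term sets for illustration only.
ANCHOR_TERMS = {"increases", "increased", "induction"}
MEASUREMENT_TERMS = {"amount", "level", "rate"}
OUTCOME_HEADS = {"death", "apoptosis", "proliferation"}


def linked_outcomes(dependencies):
    """Return (anchor, outcome head, rule) for each anchor tied to an outcome."""
    children = {}
    for relation, governor, dependent in dependencies:
        children.setdefault(governor, []).append((relation, dependent))
    matches = []
    for anchor, links in children.items():
        if anchor not in ANCHOR_TERMS:
            continue
        for relation, dependent in links:
            if dependent in OUTCOME_HEADS:
                rule = "prepositional" if relation.startswith("prep_") else "direct"
                matches.append((anchor, dependent, rule))
            elif dependent in MEASUREMENT_TERMS:
                # follow one more step: anchor -> measurement term -> outcome
                for inner_relation, inner in children.get(dependent, []):
                    if inner_relation.startswith("prep_") and inner in OUTCOME_HEADS:
                        matches.append((anchor, inner, "measurement"))
    return matches


# "increases cell death"
direct = [("dobj", "increases", "death"), ("nn", "death", "cell")]
# "induction of apoptosis"
prepositional = [("prep_of", "induction", "apoptosis")]
# "increased the amount of cell death" (constructed example)
measurement = [("dobj", "increased", "amount"), ("det", "amount", "the"),
               ("prep_of", "amount", "death"), ("nn", "death", "cell")]

for parse in (direct, prepositional, measurement):
    print(linked_outcomes(parse))
```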
The error analysis revealed that one last change was needed to the original system because the target outcomes in this paper implicitly indicate a change, such as in sentence 6, where cell proliferation is captured as a prepositional phrase. To resolve this issue, explicit claims and observations were first applied to the text. The syntactic rules were then reapplied to any increase in cells for cell proliferation and any decrease in cells for cell death. Thus, the system would report a refuting cell proliferation for sentence 6, where inhibit is the change term and proliferation of U251 malignant glioma cells captures the target outcome cell proliferation. In a manual systematic review, an extraction worksheet helps reviewers identify the results of a study [1] and our system is strongly influenced by this human practice. To avoid capturing claims that reflect an author's description of previous work or their proposed hypothesis that has yet to be verified, the system only includes sentences from the result or conclusion section, where the section is labeled as result or conclusion in structured abstracts and where the label is predicted from a deep learning model for unstructured abstracts (see text preprocessing for details).

Characterizing evidence
Supporting evidence is either an explicit or observational claim where the target outcome has increased. Refuting evidence reports a decrease in the target outcome and neutral evidence shows that there is a change, but the language used in the abstract lacks the specificity to determine if the target outcome has increased or decreased. Negation is also captured and can occur within the noun phrase or within the relation, thus there are 12 possible claim directions. Table 1 shows examples for the target outcome cell proliferation and includes both noun phrase and non-noun representations.
An abstract can report multiple directions of evidence for the same target outcome and sometimes within the same sentence. Consider sentence 7 for the target outcome necrotic death, where the system captures two directions of evidence: one from the word triggered, which is supportive, and one from attenuates, which is refuting.
7. Sentence with multiple lines of evidence for the same outcome: We show here that Nec-1 also effectively attenuates necrotic death triggered by Cd. (PMID 19135076)
a. Supporting: triggered necrotic death
b. Refuting: attenuates necrotic death
The system first identifies outcomes (see section on target outcomes) and then identifies claims that include those outcomes; thus, negation can be applied to the claim, the outcome, both the claim and the outcome, or neither the claim nor the outcome. Table 1 summarizes how negation at the claim level (which also carries the polarity: support, neutral or refuting) and negation at the entity level are reconciled to arrive at the direction of evidence that is reported and shown in the visual summaries. The direction of evidence is ordered from left to right with respect to the extent to which the outcome supports a change in evidence, i.e. (Refute -> Negated Refute -> Neutral -> Negated Neutral -> Negated Support -> Support).
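The six-step ordering lends itself to an ordered enumeration. The sketch below is only an illustration of that ordering: the class name, the one-letter symbols, and the text row that stands in for a waffle plot are assumptions, and the reconciliation of claim-level and entity-level negation from Table 1 is deliberately not reproduced here.

```python
from enum import IntEnum


class Direction(IntEnum):
    """Direction of evidence, ordered from refuting to supporting."""
    REFUTE = 0
    NEGATED_REFUTE = 1
    NEUTRAL = 2
    NEGATED_NEUTRAL = 3
    NEGATED_SUPPORT = 4
    SUPPORT = 5


def waffle_row(directions, symbols="RrNnsS"):
    """Render claims as one symbol each, sorted from refuting to supporting."""
    return "".join(symbols[d] for d in sorted(directions))


# Sentence 7 yields one supporting and one refuting claim for necrotic death.
sentence_7 = [Direction.SUPPORT, Direction.REFUTE]
print(waffle_row(sentence_7))
```

Keeping the directions as ordered values means that a visual summary can place every claim on the same left-to-right scale regardless of the order in which claims were extracted.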
It is not clear if an abstract that reports the same claim multiple times should be considered more compelling than an abstract that makes a claim only once. Consider 3 example thiobenzamide abstracts that report changes in cell proliferation. As shown in Table 2, all three abstracts included 2 supporting claims, and the second abstract also has refuting and neutral claims. If the number of claims is considered, then cell proliferation would have 1 refuting claim, 1 neutral claim, and 6 supporting claims (n = 8). However, if the number of abstracts were considered, then there would be 1 refuting abstract, 1 neutral abstract and 3 supporting abstracts (n = 3).
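The two ways of counting can be reproduced with a few lines. The abstract identifiers below are placeholders, and the claim list simply encodes the thiobenzamide example described above (two supporting claims per abstract, plus one refuting and one neutral claim in the second abstract).

```python
from collections import Counter

# (abstract identifier, direction) for each extracted claim; identifiers are
# placeholders for the three example abstracts.
claims = [
    ("abstract_1", "support"), ("abstract_1", "support"),
    ("abstract_2", "support"), ("abstract_2", "support"),
    ("abstract_2", "refute"), ("abstract_2", "neutral"),
    ("abstract_3", "support"), ("abstract_3", "support"),
]

# Claim-level counts: every claim contributes once.
claim_counts = Counter(direction for _, direction in claims)

# Abstract-level counts: each abstract contributes at most once per direction.
abstract_counts = Counter(direction for _, direction in set(claims))

n_claims = len(claims)
n_abstracts = len({abstract for abstract, _ in claims})

print(dict(claim_counts), n_claims)
print(dict(abstract_counts), n_abstracts)
```

Note that the abstract-level counts sum to 5 even though there are only 3 abstracts, because the second abstract is counted under three directions.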


Target outcome detection
There are no gold standards that capture the directionality of cellular outcomes, cell death, and cell proliferation. An earlier study used machine learning to detect cell death and cell proliferation abstracts during the retrieval stage of a risk assessment. The accompanying manual annotations had substantial agreement (Kappa statistic 0.68) for inter-rater reliability [4]. We require that authors explicitly mention cell proliferation or cell death (or a synonym), and many of the 340 abstracts that were annotated as cell proliferation (out of 3,078 total abstracts from 15 journals), or cell death (380 abstracts), do not mention the target outcomes. The annotations from the earlier work suggest that annotators were inferring cell proliferation from internal cell processes such as peroxisome proliferation, which does not always lead to cell proliferation. We include mitogens, proteins that induce a cell to proliferate, but it appears that the previous work did not identify those abstracts. It is not clear if the annotators in the prior work were asked to identify all abstracts that measure cell proliferation or death, or if they were only asked to identify abstracts in which these outcomes increased. These nuances underscore the need to provide a clear definition of each target outcome as part of the system reporting and clear instructions that identify any abstract where the target outcome was measured, regardless of the result. Differences in the scope limit the utility of measuring precision and recall of our system with respect to the earlier manual annotations, so we focus instead on how the different scoping choices might change the subsequent decision making. Specifically, is there a difference between claims in the entire collection (i.e. reported anywhere in the 3,078 abstracts), compared with claims made in the annotated abstracts and un-annotated abstracts?
For cell death, the abstracts that were not in the manually identified set of abstracts had a greater proportion of supportive evidence than those in the manually annotated abstracts (see Fig 5). In contrast, for cell proliferation, abstracts that were not manually annotated but did report cell proliferation had a greater proportion of refuting evidence. This is consistent with the distribution of evidence found in our larger collection of 482,314 abstracts.

Outcome mentions
Human language often follows a power law distribution where a small number of expressions account for a large proportion of all mentions; thus, the target outcomes were evaluated by manually inspecting the 100 most frequent expressions for each of the primary target outcomes. There were 7 errors for cell proliferation: a reference to an assay, an increase in cell size rather than the number of cells, mitotic spindle, 2 expressions for cell migration, an antiproliferation agent, and geo-accumulation index. There were 14 errors for cell death: 8 expressions referred to proteins, 3 referred to genes, and the remaining 3 were an apoptosis assay, apoptotic potential, and apoptotic mechanism. The top 20 terms (see Table 3) show that authors are more likely to use negation with cell proliferation, where the 4th and 5th most frequent expressions capture an antiproliferative effect or activity, but there is only 1 negated cell death term in the top 20. There were 24,435 cell proliferation expressions, 16,591 cell death expressions, and 195,903 cell mentions that were neither proliferation nor death. This suggests that the approach is robust with respect to additional modifiers that were not in the original knowledge base.
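The frequency-based sampling used in this evaluation can be illustrated with a minimal sketch. It ranks expressions by frequency and reports the share of all mentions covered by the top k. The function name `top_k_coverage` and the toy mentions are hypothetical and are not taken from the system or its data.

```python
from collections import Counter


def top_k_coverage(mentions, k):
    """Return the k most frequent expressions and the share of mentions they cover."""
    counts = Counter(mentions)
    top = counts.most_common(k)
    total = sum(counts.values())
    covered = sum(n for _, n in top)
    share = covered / total if total else 0.0
    return top, share


# Hypothetical toy mentions, for illustration only.
mentions = (
    ["cell proliferation"] * 6
    + ["antiproliferative effect"] * 3
    + ["cell growth"] * 2
    + ["mitotic spindle"]
)

top, share = top_k_coverage(mentions, k=2)
print(top, round(share, 2))
```

When the distribution is heavily skewed, inspecting a short ranked list of this kind covers most mentions, which is the rationale for reviewing only the most frequent expressions.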
With respect to ellipses, 5,402 cell proliferation expressions from 4,283 abstracts would have been missed if ellipses were not resolved (the most frequent expressions were cell proliferation, antiproliferative effect, cell growth and antiproliferative activity). With respect to cell death, 5,880 expressions from 4,566 abstracts would have been missed if the system did not resolve coordination (the most frequent expressions were cell apoptosis, cell cycle apoptosis, oxidative apoptosis, and growth apoptosis). For general cell terms, 46,823 terms from 29,113 abstracts were added (frequent expressions were cell differentiation, cell migration, cell invasion, and normal cell). Table 4 shows that the search criterion identified 482,314 abstracts relevant to the 27 chemicals, where the number of abstracts ranged from 118 for Thiobenzamide (chemical 27) to 186,580 for Pyridine (chemical 23). With respect to the target outcomes, cell proliferation was reported more often than cell death (average 7.5%, min 3.2% and max 32.6% versus average 5.6%, min 1.1% and max 26.5%) and general cell terms were reported in 36.1% of the abstracts (min 15.8% and max 86.7%).
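To make the idea of resolving coordinated ellipses concrete, the following sketch expands a phrase with a shared head noun into one expression per conjunct. It is a deliberately simplified, pattern-based stand-in and not the method used by the system, which combines knowledge resources with natural language processing. The function name `expand_shared_head` and the default head `cell` are assumptions made for illustration.

```python
import re


def expand_shared_head(phrase, head="cell"):
    """Expand 'cell proliferation, migration and invasion' into one expression per conjunct.

    Simplifying assumption: the shared head is the first word of the phrase and
    the conjuncts are separated by commas, 'and', or 'or'.
    """
    text = phrase.strip()
    if not text.lower().startswith(head + " "):
        return [text]  # no shared head detected, leave the phrase unchanged
    rest = text[len(head) + 1:]
    conjuncts = [c.strip() for c in re.split(r",|\band\b|\bor\b", rest) if c.strip()]
    return [f"{head} {c}" for c in conjuncts]


print(expand_shared_head("cell proliferation and apoptosis"))
print(expand_shared_head("cell proliferation, migration and invasion"))
```

Without an expansion step of this kind, a matcher that looks for the surface string "cell apoptosis" would not find it in "cell proliferation and apoptosis", which is the kind of miss that the counts above quantify.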
When conducting a risk assessment, the manual processes should only consider the results from the current study being reviewed, and not use an author's interpretation of previous work. The system therefore should identify only the outcomes in the result or conclusion sections of an abstract. The total number of abstracts that include a result or conclusion target outcome was 27,810 for cell proliferation, 22,020 for cell death and 137,550 for a general cell term (see Table 4). Table 4 also shows the difference between the total number of outcomes mentioned anywhere in an abstract that would be identified during the retrieval step, and how many outcomes appear in the result or conclusion sentences. Table 4 also provides an approximate upper bound on the number of claims that can be identified within the collection (approximate because a single outcome can have multiple change terms). Note that some abstracts include more than one chemical.

Claim extraction
To evaluate the precision of the claims extracted, a random sample of 50 sentences from each outcome was manually inspected. The accuracy was 81.3% (82% for cell proliferation, 88% for cell death, and 75% for general cells). Of the 150 sentences, 58 were refuting, 28 were neutral, and 64 were supporting, and the accuracy was 84%, 67%, and 75% respectively, which suggests that the system is more accurate for refuting claims than for neutral or supporting claims.
With respect to recall, a random sample of 200 sentences (100 each for cell proliferation and death) that did not capture a claim but included a primary outcome and at least 1 anchor term was manually reviewed. Passive voice can be an issue for claim extraction, so for each outcome 50 sentences had an anchor term before the outcome (these are more likely to use the active voice) and the other 50 had an anchor term after the outcome. There were 8 sentences that included a claim that was not detected (3 sentences before the outcome and 5 after) for cell proliferation and 13 sentences (7 before the outcome and 5 after) that missed a valid claim about cell death, producing a recall of 92% and 87% for proliferation and death respectively. It does not appear that passive voice impacts the recall of the claims.
Table 4 shows the number of abstracts that report a target outcome and can thus be used during the retrieval step of a risk assessment, whereas Figs 6-8 show the distribution of refuting, neutral, and supporting evidence extracted. If all the abstracts (or claims) were supportive then the waffle plot would be entirely green. With respect to cell proliferation, 53.6% of abstracts included a refuting claim, 22% included a neutral claim, and 38.7% included a supporting claim (6.4% of abstracts included negated evidence). When considering the number of claims, the rates are 45.9%, 16.5%, and 33% for refuting, neutral, and supporting claims respectively. None of the chemicals have entirely supporting evidence and more than half of the chemicals (15/27) have more refuting evidence than supporting evidence, such as Sulindac (chemical 25) where 67.2% of the claims refute the hypothesis that cell proliferation increases (see Fig 6). However, 12 chemicals do have more supporting evidence than refuting evidence with respect to cell proliferation.
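The difference between abstract-level and claim-level rates can be illustrated with a small sketch. An abstract is counted once for every label it contains, so abstract-level percentages can sum to more than 100%, whereas each claim carries exactly one label. The function name `evidence_rates` and the toy data are hypothetical, and the sketch uses only three labels (it does not model negated claims).

```python
from collections import Counter

LABELS = ("refuting", "neutral", "supporting")


def evidence_rates(claims_by_abstract):
    """Return (abstract_rates, claim_rates) as percentages per label.

    claims_by_abstract maps an abstract id to the list of claim labels
    extracted from that abstract (assumed non-empty for this sketch).
    """
    n_abstracts = len(claims_by_abstract)
    all_claims = [label for labels in claims_by_abstract.values() for label in labels]
    claim_counts = Counter(all_claims)
    # An abstract counts once per distinct label it contains.
    abstract_counts = Counter(
        label for labels in claims_by_abstract.values() for label in set(labels)
    )
    abstract_rates = {l: 100 * abstract_counts[l] / n_abstracts for l in LABELS}
    claim_rates = {l: 100 * claim_counts[l] / len(all_claims) for l in LABELS}
    return abstract_rates, claim_rates


# Hypothetical toy data, for illustration only.
toy = {
    "a1": ["refuting", "supporting"],
    "a2": ["refuting"],
    "a3": ["neutral", "supporting", "supporting"],
    "a4": ["supporting"],
}
abstract_rates, claim_rates = evidence_rates(toy)
print(abstract_rates)
print(claim_rates)
```

In the toy data the abstract-level rates sum to 150% because two of the four abstracts contain more than one kind of claim, while the claim-level rates sum to 100%.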
In contrast to proliferation, 38.9% of abstracts refute the hypothesis that cell death increases, 19.7% provide neutral evidence, and 76.7% provide supporting evidence (6.4% of the abstracts include negated claims). When considering the number of claims, the rates are 25% refuting, 10.6% neutral, and 61.3% supporting. None of the chemicals have more refuting evidence and 26 of the 27 chemicals have more supporting evidence than refuting evidence (see Fig 7).
Refuting, neutral, and supporting evidence for general cell changes was more evenly distributed than for cell proliferation or death: 46.6%, 34.5%, and 41.4% of abstracts respectively (note that the total is greater than 100% as an abstract often reports more than 1 claim). When considering the number of claims, the distribution was 37.7%, 25.0%, and 32.6% for claims that were refuting, neutral, or supporting (see Fig 8).

Impact on decision-making
Cell proliferation and death capture diametrically opposed biological processes within the cell cycle, so it makes sense to ask if a chemical is more strongly associated with cell proliferation or death, and how that decision might change if using data from only the retrieval step versus data from the extraction step that detects supporting, neutral, or refuting claims. Genistein (chemical 17) illustrates the difference: more abstracts report cell proliferation than cell death (a finding that is consistent with [5]). The information retrieval step identifies abstracts that measure an outcome, but measuring an outcome should not be confused with being associated with an outcome. Unfortunately, this distinction can be easily misinterpreted when presented with figures that capture the number of abstracts retrieved, as shown in Fig 9A (see Table 5, chemical 17 for the underlying data used in Fig 9). However, if the directionality of the claims is considered, then there are more abstracts that refute an increase in cell proliferation and more abstracts that support an increase in cell death (see Fig 9B). The result is the same if the number of claims (rather than the number of abstracts) is considered, or if the percentage of supporting evidence rather than the raw numbers is considered (see Table 5). Thus, a decision maker would conclude that genistein is more closely associated with cell proliferation if considering only the abstracts retrieved, and cell death if considering the supporting evidence.
Fig 9. The number of genistein abstracts that report cell proliferation is greater than the number that report cell death (A); however, more claims refute that cell proliferation increases, whereas more claims support an increase in cell death. https://doi.org/10.1371/journal.pone.0260712.g009
Table 5 extends the analysis conducted for chemical 17 to all the chemicals. If the number of abstracts that report cell proliferation versus cell death is used as a proxy for association, a decision maker would conclude that 20/27 chemicals are more associated with proliferation than death. However, if the number of abstracts that show an increase in cell proliferation or death (i.e. that had supporting evidence) was used, the decision would change from proliferation to death for 13 of the 27 chemicals. If instead the percentage of abstracts that had supporting evidence was considered, the decision would change for 21 chemicals (20 from proliferation to death and 1 from death to proliferation). If instead the number of claims rather than the number of abstracts was used, the decision would change for 17 chemicals (16 from proliferation to death and 1 from death to proliferation), and if the percentage of claims was used the decision would change for 19 chemicals (18 from proliferation to death and 1 from death to proliferation). The overall choice would also change from proliferation to death regardless of which claim measure was used. These results suggest that authors of automated systems should specify which step of the information synthesis process is being automated, and potentially caution readers that simply reporting an outcome should not be interpreted as an association (either positive or negative).
Although the proposed approach moves us closer to the manual risk assessment process, there are other tasks in a systematic review process that are not part of this system. For example, decision makers still need to search the grey literature (studies conducted but not published) and follow references to minimize bias (the latter is a candidate for automation).
Similarly, no attempt is made to assess the quality of the study, which is required in human systematic reviews [2]; however, it would seem that further work in this regard is needed for systematic reviews involving animal studies, where 71% of preclinical systematic reviews did not assess the methodological quality [43]. Understanding how a stressor impacts the cell cycle is just one of the many outcomes that a decision maker would consider when establishing public policy around potential carcinogens. Cellular level outcomes are one of many streams of evidence, which include genetic markers amongst other endpoints, and evidence on humans and animals is weighted differently when determining if there is a sufficient amount of evidence to change policy. We also do not attempt to differentiate between major and minor claims; however, human inter-rater reliability to establish this distinction has been reported as low [26].
In addition to providing insight about outcomes for risk assessments, this approach may also contribute to discussions around publication bias and the way in which authors choose to describe their findings. The search criterion used in this study considered only the chemical name, but the chemicals considered were selected because of their potential role in cancer. Against that backdrop cell proliferation might be considered a negative outcome (i.e. that cancer is progressing) whereas cell death might be a positive outcome (i.e. that the cancer progression has been halted). This might influence an author's preference to frame the negative outcome (proliferation) using refuting evidence and the positive outcome (death) using supporting evidence. It is notable that authors use more neutral claims when reporting cells in general that are neither favorable nor unfavorable. Further work is needed to unpack the relationship between framing and the directionality that authors use when reporting outcomes.
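The comparison between retrieval counts and supporting evidence discussed in this section can be sketched as follows. For each chemical, the sketch picks the outcome with the larger value under two different measures and reports the chemicals where the choice differs. The function names, the chemical names, and every number are hypothetical placeholders and are not values from Table 5.

```python
def preferred_outcome(proliferation, death):
    """Return the outcome with the larger value, or 'tie' if they are equal."""
    if proliferation > death:
        return "proliferation"
    if death > proliferation:
        return "death"
    return "tie"


def decision_changes(chemicals, retrieval_key, evidence_key):
    """List chemicals whose preferred outcome differs between two measures.

    Each chemical maps a measure name to a (proliferation, death) pair.
    """
    changed = []
    for name, measures in chemicals.items():
        before = preferred_outcome(*measures[retrieval_key])
        after = preferred_outcome(*measures[evidence_key])
        if before != after:
            changed.append((name, before, after))
    return changed


# Hypothetical counts, for illustration only.
chemicals = {
    "chem_A": {"abstracts": (120, 80), "supporting": (30, 60)},
    "chem_B": {"abstracts": (50, 20), "supporting": (25, 10)},
    "chem_C": {"abstracts": (10, 40), "supporting": (9, 5)},
}

changes = decision_changes(chemicals, "abstracts", "supporting")
print(changes)
```

The same function can be applied to other measures (for example, percentage of supporting abstracts or number of supporting claims) by adding further keys to each chemical, which mirrors the alternative measures compared above.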

Conclusions
Public policy regarding chemicals takes place against a complex backdrop of legal regulations such as Section 6(b) of the Toxic Substances Control Act (TSCA) in the US, and Regulation No 1907/2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) in the EU. Regardless of the statutory requirements, the human processes used to synthesize evidence can slow down efforts to update public policy and do not scale to cumulative risk assessments where multiple stressors are considered. Automated systems that augment human efforts are urgently needed, but such techniques will only be adopted if they are accurate and consistent with the level of transparency needed in this setting.
The approach introduced in this paper combines domain expertise to clearly articulate target outcomes, knowledge resources to capture target outcomes, and natural language processing methods to overcome surface level differences between how a target outcome is represented in a formal ontology and how those same concepts are reported in the scientific literature. To be consistent with the manual efforts used to conduct a chemical risk assessment, the search strategy and the inclusion and exclusion criteria must also be explicit. In contrast to work that automates the retrieval step of the information synthesis process, the approach presented here automates the extraction step and provides decision makers with a visualization using waffle plots that reflect the distribution of supporting, neutral, and refuting evidence for a given outcome. This is consistent with a fundamental tenet of a systematic review where all evidence is provided to a reader, not just the evidence that supports an author's position.
Experiments using 482K abstracts for 27 chemicals show that refuting evidence (where the target outcome has decreased) was higher for cell proliferation (45.9%) and general cell changes (37.7%) than for cell death (25.0%), and that cell death was the only endpoint where supporting claims were the majority (61.3%). If the number of abstracts that measure an outcome was used as a proxy for association, there would be a stronger association with cell proliferation than cell death (20/27 chemicals). However, if the amount of supporting evidence was used (where the outcome increased), the conclusion would change for 21 of the 27 chemicals (20 from proliferation to death and 1 from death to proliferation). This suggests that results from the retrieval step (i.e. the number of abstracts that measure an outcome) can be misleading if not accompanied with results from the extraction step where the directionality of the outcome is established.